Tuesday, January 8, 2019

OpenMP Accelerator Support for GPUs on POWER architecture

The combination of IBM® POWER® processors and NVIDIA GPUs provides a platform for heterogeneous high-performance computing that can run a range of technical computing workloads efficiently. The computational capability is built on the massively parallel, multithreaded cores of the NVIDIA GPUs and the IBM POWER processors. You can offload parallel operations within applications, such as data analysis or high-performance computing workloads, to GPUs or other devices.

The OpenMP API specification provides a set of directives that instruct the compiler and runtime to offload a block of code to a device, which can be a GPU, an FPGA, or another accelerator. In the new generations of the POWER architecture, the POWER processor can be attached to an NVIDIA GPU via the high-speed NVLink interconnect for fast data transfer between the CPU and the GPU. This hardware configuration is an essential part of the CORAL project with the U.S. national laboratories and brings us closer to exascale computing. The IBM XL compilers have a long history of supporting the OpenMP API, starting from the first version of the specification. The XL compilers continue to support the OpenMP specification and to exploit the POWER hardware architecture together with the GPU. The XL compiler team works closely with the IBM Research team to develop the compiler infrastructure for the offloading mechanism, and it also collaborates with the open source community on the runtime interface for the GPU device runtime library.
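As a quick illustration of what such a directive looks like, here is a minimal sketch that offloads one block to the default device and uses omp_is_initial_device() to check whether it actually ran there:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int on_device = 0;

    /* The block below is offloaded to the default device (the GPU). */
    #pragma omp target map(from: on_device)
    {
        /* omp_is_initial_device() returns 0 when executing on the device. */
        on_device = !omp_is_initial_device();
    }

    printf("ran on the device: %d\n", on_device);
    return 0;
}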



The OpenMP program (C, C++, or Fortran) with device constructs is fed into the High-Level Optimizer and partitioned into CPU and GPU parts. The intermediate code is optimized by the High-Level Optimizer; this optimization benefits both the CPU and the GPU code. The CPU part is sent to the POWER Low-Level Optimizer for further optimization and code generation. The GPU part is translated to LLVM IR and then fed into the LLVM optimizer in the CUDA Toolkit for NVIDIA-specific optimization and PTX code generation. Finally, the linker is invoked to link the objects into an executable. From this outline of the compilation flow, one can see that the compiler draws on expertise from both worlds so that applications are optimized appropriately. For the CPU part, the POWER Low-Level Optimizer, which embodies many years of optimization knowledge for the POWER architecture, generates optimized code. For the GPU part, the GPU expertise in the CUDA Toolkit is used to generate optimized code for the NVIDIA device. As a result, the entire application is optimized in a balanced way.

The XL C/C++ V13.1.5 and XL Fortran V15.1.5 compilers are among the first compilers to provide support for NVIDIA GPU offloading using the OpenMP 4.5 programming model. This release supports the basic device constructs (the target, target update, and target data directives) so that users can experiment with the offloading mechanism and start porting code to the GPU. The other important aspect of offloading computation to devices is data mapping; the map clause is also supported in this release.
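As a rough sketch of how these constructs fit together, the small program below keeps an array resident on the device across two kernels with a target data region and refreshes the host copy in between with target update (the array size here is only illustrative):

#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N];
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Keep 'a' mapped on the device for the whole region. */
    #pragma omp target data map(tofrom: a[0:N])
    {
        /* First kernel runs on the device. */
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
            a[i] *= 2.0;

        /* Refresh the host copy so it can be inspected between kernels. */
        #pragma omp target update from(a[0:N])
        printf("after first kernel: a[0] = %f, a[1] = %f\n", a[0], a[1]);

        /* Second kernel reuses the data already resident on the device. */
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
            a[i] += 1.0;
    }

    printf("final: a[N-1] = %f\n", a[N - 1]);
    return 0;
}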

 Example:
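A minimal candidate for sample_offload.c, offloading a single loop to the GPU with a target construct and explicit map clauses, could look like this sketch:

#include <stdio.h>

#define N 100000

int main(void)
{
    static float x[N], y[N];

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* Offload the loop to the default device and map the arrays explicitly. */
    #pragma omp target map(to: x[0:N]) map(tofrom: y[0:N])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < N; i++)
        y[i] = 3.0f * x[i] + y[i];

    printf("y[0] = %f, y[N-1] = %f\n", y[0], y[N - 1]);
    return 0;
}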
Compile the above code with the IBM XL compiler using the following flags:

 xlc_r -O3 -qsmp=omp -qoffload sample_offload.c

Observe the offloaded task running on the GPU.
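One way to confirm that the kernel actually ran on the device rather than falling back to the host is to profile the run with a CUDA Toolkit profiler such as nvprof, which lists the kernels launched on the GPU:

 nvprof ./a.out

Alternatively, running nvidia-smi in another terminal while the program executes shows the GPU utilization and the process that is using the device.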



  OpenMP offloading compilers for NVIDIA GPUs have improved dramatically over the past year and are ready for real use.



References:
1) https://www.openmp.org/updates/openmp-accelerator-support-gpus/
2) https://computation.llnl.gov/projects/co-design/lulesh
3) https://www.ibm.com/support/knowledgecenter/en/SSXVZZ_13.1.5/com.ibm.xlcpp1315.lelinux.doc/proguide/offloading.html
4) https://openpowerfoundation.org/presentations/openmp-accelerator-support-for-gpu/