OpenACC and OpenMP4.x Accelerator Directives
The OpenACC API describes a collection of compiler directives that mark loops and regions of code in standard C, C++ and Fortran for offloading from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators. The OpenMP 4.x API offers similar functionality and performance through its "omp target" directives. There is nearly a 1:1 mapping between equivalent OpenACC "acc" and OpenMP 4.x "omp target" directives.
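To make the near 1:1 mapping concrete, here is a minimal sketch of the same loop marked up both ways. The function names saxpy_acc and saxpy_omp are hypothetical and do not come from the NCSA samples, and the data clauses shown are one reasonable choice rather than the only one. A compiler that does not recognize the directives ignores them and runs the loops on the host.

```c
#include <assert.h>

/* Hypothetical example: y = a*x + y, offloaded with OpenACC.
 * copyin: x is only read on the accelerator.
 * copy:   y is read and written, so it moves both ways. */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* The same loop with OpenMP 4.x target directives.
 * map(to:)     plays the role of copyin,
 * map(tofrom:) plays the role of copy. */
void saxpy_omp(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

In this sketch the OpenACC "parallel loop" construct corresponds to the OpenMP "target teams distribute parallel for" construct, and the data clauses map one to one.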
How to use OpenACC or OpenMP4.x
Use the options from the table below to match your project requirements. NCSA recommends using only one programming environment for your code development and testing. The Cray compiler tends to adhere very strictly to the standard, while the PGI compiler allows more flexibility in mixing directives that are not explicitly stated to work together. Note that combining OpenACC directives within host OpenMP regions will result in runtime errors with the Cray programming environment, so it is best to disable host OpenMP when getting started with the Cray compilers and OpenACC.
OpenMP 4.x "omp target" directives are only supported by the Cray compiler to date.
These are samples for a Laplace2d solver demonstrating OpenACC in Fortran and C. The Fortran code also contains OpenMP directives for comparison purposes; in practice you would compile either the OpenMP version or the OpenACC version. The samples were built with PrgEnv-pgi: laplace2d.c, timer.h, laplace2d.f90. A further sample, built with the PrgEnv-cray and craype-accel-nvidia35 modules, shows the OpenMP4.x markup of the Laplace code: laplace2d_omp4.c.
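For readers without the sample files at hand, the following is a minimal sketch of what one Jacobi sweep of a 2D Laplace solver can look like with OpenACC markup. It is not the contents of laplace2d.c: the grid size NX by NY, the function name jacobi_sweep and the choice of clauses are all assumptions made for illustration.

```c
#include <assert.h>

#define NX 64   /* hypothetical grid width  */
#define NY 64   /* hypothetical grid height */

/* One Jacobi sweep: each interior point of Anew becomes the average of
 * its four neighbours in A. Returns the largest absolute change, which
 * a caller can compare against a convergence tolerance. */
double jacobi_sweep(double A[NY][NX], double Anew[NY][NX])
{
    double error = 0.0;

    assert(NX > 2 && NY > 2);

    /* collapse(2) exposes both loops as parallel work;
     * reduction(max:error) combines the per-thread maxima. */
    #pragma acc parallel loop collapse(2) reduction(max:error) \
            copyin(A[0:NY][0:NX]) copy(Anew[0:NY][0:NX])
    for (int j = 1; j < NY - 1; ++j) {
        for (int i = 1; i < NX - 1; ++i) {
            Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1]
                               + A[j - 1][i] + A[j + 1][i]);
            double diff = Anew[j][i] - A[j][i];
            if (diff < 0.0)
                diff = -diff;
            if (diff > error)
                error = diff;
        }
    }
    return error;
}
```

A full solver would call such a sweep repeatedly, swapping A and Anew, and would normally wrap the iteration in an "acc data" region so the arrays stay on the accelerator between sweeps instead of being copied on every call.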
Both programming environments provide good feedback when using OpenACC or OpenMP directives. Sample output from -h msgs (Cray) and -Minfo=accel (PGI) is shown below. The Cray loopmark .lst output will also show accelerated regions of code in the marked-up source.
Cray compiler sample output ( -h msgs ):
CC-6430 craycc: ACCEL File = laplace2d_omp4.c, Line = 65
A loop was partitioned across the 128 threads within a threadblock.
PGI compiler sample output ( -Minfo=accel ):
145, Accelerator kernel generated
145, CC 1.3 : 18 registers; 112 shared, 32 constant, 0 local memory bytes
CC 2.0 : 26 registers; 0 shared, 132 constant, 0 local memory bytes
148, ...
169, Sum reduction generated for sum1
145, Generating present_or_copy(part(:4,:nop))
Generating present_or_copyin(fxy(:,:,:))
...
Cray libsci_acc provides BLAS, LAPACK, and ScaLAPACK routines that improve performance by generating and running automatically tuned accelerator kernels on the XK nodes when appropriate. Use it with PrgEnv-cray or PrgEnv-gnu by adding the module:
aprun -cc none -n <numranks> ... # Cray recommends allowing threads to migrate within a node when using libsci_acc
See also: "man intro_libsci_acc".
Additional Information / References