OpenACC and OpenMP 4.x Accelerator Directives

The OpenACC API describes a collection of compiler directives that mark loops and regions of standard C, C++, and Fortran code to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs, and accelerators.  The OpenMP API is similar in functionality and performance, and there is nearly a 1:1 mapping between equivalent OpenACC "acc" directives and OpenMP 4.x "omp target" directives.

How to use OpenACC or OpenMP 4.x

Use the options from the table below to match your project requirements.  NCSA recommends using only one programming environment for your code development and testing.  The Cray compiler adheres strictly to the standard, while the PGI compiler is more permissive about mixing directives that are not explicitly specified to work together.  Note that combining OpenACC directives within host OpenMP regions will cause runtime errors under the Cray programming environment, so it is best to disable host OpenMP when getting started with the Cray compiler and OpenACC.

To date, OpenMP 4.x "omp target" directives are supported only by the Cray compiler.

OpenACC Programming Environment

Cray
    Required modules:   PrgEnv-cray, craype-accel-nvidia35, cudatoolkit
                        (libsci_acc is automatically included)
    Fortran flags for large static memory:
                        -h acc,noomp   # OpenACC
                        -h noacc,omp   # OpenMP 4.x
                        -fpic -dynamic -lcudart -G2
    C flags for large static memory:
                        -h pragma=acc -h nopragma=omp   # OpenACC
                        -fpic -dynamic -lcudart -Gp
    Helpful flags:      -rm, -h msgs

PGI
    Required modules:   PrgEnv-pgi, cudatoolkit
    Fortran flags for large static memory:
                        -acc -ta=nvidia -lcudart -mcmodel=medium
    C flags for large static memory:
                        -acc -ta=nvidia -lcudart -mcmodel=medium
    Helpful flags:      -Minfo=accel

GNU
    not supported

Intel
    not supported
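As a sketch of how the table entries combine on the command line (the module invocations assume the Cray XE/XK module environment; source and executable names are placeholders):

```shell
# Cray environment: OpenACC in C, with compiler messages enabled
module load PrgEnv-cray craype-accel-nvidia35 cudatoolkit
cc -h pragma=acc -h nopragma=omp -h msgs -o laplace2d laplace2d.c

# PGI environment: OpenACC in Fortran, with accelerator feedback
module swap PrgEnv-cray PrgEnv-pgi
module load cudatoolkit
ftn -acc -ta=nvidia -Minfo=accel -o laplace2d laplace2d.f90
```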


 

When submitting jobs, be sure to specify "xk" as the node type.  Access to the accelerator is limited to one process (MPI rank) per node unless CRAY_CUDA_MPS=1 is set.  To use the remaining Interlagos cores effectively, some form of work division and load balancing is required in the application code.  The simple (and inefficient) case employs one process per node to drive the accelerator:

#PBS -l nodes=512:ppn=1:xk
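A fuller batch-script sketch might look like the following (the walltime and executable name are placeholders; the commented line shows where the CRAY_CUDA_MPS setting mentioned above would go):

```shell
#!/bin/bash
#PBS -l nodes=512:ppn=1:xk
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# export CRAY_CUDA_MPS=1   # uncomment to let multiple ranks per node share the GPU
aprun -n 512 -N 1 ./laplace2d
```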

Examples

These are samples of a Laplace2d solver demonstrating OpenACC in Fortran and C (note that the Fortran code contains OpenMP directives for comparison purposes; in practice you would compile either the OpenMP version or the OpenACC version).  The samples were built with PrgEnv-pgi: laplace2d.c, timer.h, laplace2d.f90.  This sample was built with the PrgEnv-cray and craype-accel-nvidia35 modules and shows the OpenMP 4.x markup of the Laplace code: laplace2d_omp4.c.

Compiler feedback

Both programming environments provide good feedback when using OpenACC or OpenMP directives.  Sample output from -h msgs (Cray) and -Minfo=accel (PGI) is shown below.  The Cray loopmark listing (.lst) output will also show accelerated regions of code in the marked-up source.

Cray compiler sample output ( -h msgs ):

CC-6430 craycc: ACCEL File = laplace2d_omp4.c, Line = 65

  A loop was partitioned across the 128 threads within a threadblock.

 

PGI compiler sample output ( -Minfo=accel ):

145, Accelerator kernel generated
     145, CC 1.3 : 18 registers; 112 shared, 32 constant, 0 local memory bytes
          CC 2.0 : 26 registers; 0 shared, 132 constant, 0 local memory bytes
     148, ...
169, Sum reduction generated for sum1
145, Generating present_or_copy(part(:4,:nop))
     Generating present_or_copyin(fxy(:,:,:))
     ...

libsci_acc

Cray libsci_acc provides BLAS, LAPACK, and ScaLAPACK routines that improve performance by generating and running automatically tuned accelerator kernels on the XK nodes when appropriate.  To use it with PrgEnv-cray, just add the module:

module load craype-accel-nvidia35 # <-- automatically includes libsci_acc

aprun -cc none -n <numranks> ...   # Cray recommends allowing threads to migrate within a node when using libsci_acc

 

See also: "man intro_libsci_acc".

Additional Information / References