OpenACC and OpenMP 4.x Accelerator Directives

The OpenACC API describes a collection of compiler directives that mark loops and regions of standard C, C++, and Fortran code to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs, and accelerators.  The OpenMP API is similar in functionality and performance, and there is nearly a 1:1 mapping between equivalent OpenACC "acc" directives and OpenMP 4.x "omp target" directives.

How to use OpenACC or OpenMP 4.x

Use the options from the table below to match your project requirements.  NCSA recommends using only one programming environment for your code development and testing.  The Cray compiler adheres strictly to the standard, while the PGI compiler is more permissive about mixing directives that are not explicitly specified to work together.  Note that combining OpenACC directives within host OpenMP regions will cause runtime errors under the Cray programming environment, so it is best to disable host OpenMP when getting started with the Cray compiler and OpenACC.

To date, OpenMP 4.x "omp target" directives are supported only by the Cray compiler.

OpenACC Programming Environment

Cray
    Required modules:   PrgEnv-cray, craype-accel-nvidia35, cudatoolkit
                        (libsci_acc is automatically included)
    Fortran flags for large static memory:
                        -h acc,noomp   # OpenACC
                        -h noacc,omp   # OpenMP 4.x
                        -fpic -dynamic -lcudart -G2
    C flags for large static memory:
                        -h pragma=acc -h nopragma=omp   # OpenACC
                        -fpic -dynamic -lcudart -Gp
    Helpful flags:      -rm, -h msgs

PGI
    Required modules:   PrgEnv-pgi, cudatoolkit
    Fortran flags for large static memory:
                        -acc -ta=nvidia -lcudart -mcmodel=medium
    C flags for large static memory:
                        -acc -ta=nvidia -lcudart -mcmodel=medium
    Helpful flags:      -Minfo=accel

GNU
    not supported

Intel
    not supported
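As a sketch of how the table entries combine on the command line (the module invocations assume the Cray XE/XK module environment; source and executable names are placeholders):

```shell
# Cray environment: OpenACC in C, with compiler messages enabled
module load PrgEnv-cray craype-accel-nvidia35 cudatoolkit
cc -h pragma=acc -h nopragma=omp -h msgs -o laplace2d laplace2d.c

# PGI environment: OpenACC in Fortran, with accelerator feedback
module swap PrgEnv-cray PrgEnv-pgi
module load cudatoolkit
ftn -acc -ta=nvidia -Minfo=accel -o laplace2d laplace2d.f90
```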


 

When submitting jobs, be sure to specify "xk" as the node type.  Access to the accelerator is limited to one process (MPI rank) per node unless CRAY_CUDA_MPS=1 is set.  To use the remaining Interlagos cores effectively, some form of work division and load balancing is required in the application code.  The simple (and inefficient) case employs one process per node to drive the accelerator:

#PBS -l nodes=512:ppn=1:xk
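A fuller batch-script sketch might look like the following (the walltime and executable name are placeholders; the commented line shows where the CRAY_CUDA_MPS setting mentioned above would go):

```shell
#!/bin/bash
#PBS -l nodes=512:ppn=1:xk
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# export CRAY_CUDA_MPS=1   # uncomment to let multiple ranks per node share the GPU
aprun -n 512 -N 1 ./laplace2d
```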

Examples

These are samples of a Laplace2d solver demonstrating OpenACC in Fortran and C (note that the Fortran code contains OpenMP directives for comparison purposes; in practice you would compile either the OpenMP version or the OpenACC version).  The samples were built with PrgEnv-pgi: laplace2d.c, timer.h, laplace2d.f90.  This sample was built with the PrgEnv-cray and craype-accel-nvidia35 modules and shows the OpenMP 4.x markup of the Laplace code: laplace2d_omp4.c.

Compiler feedback

Both programming environments provide good feedback when using OpenACC or OpenMP directives.  Sample output from -h msgs (Cray) and -Minfo=accel (PGI) is shown below.  The Cray loopmark listing (.lst) output will also show accelerated regions of code in the marked-up source.

Cray compiler sample output ( -h msgs ):

CC-6430 craycc: ACCEL File = laplace2d_omp4.c, Line = 65

  A loop was partitioned across the 128 threads within a threadblock.

 

PGI compiler sample output ( -Minfo=accel ):

145, Accelerator kernel generated
     145, CC 1.3 : 18 registers; 112 shared, 32 constant, 0 local memory bytes
          CC 2.0 : 26 registers; 0 shared, 132 constant, 0 local memory bytes
     148, ...
169, Sum reduction generated for sum1
145, Generating present_or_copy(part(:4,:nop))
     Generating present_or_copyin(fxy(:,:,:))
     ...

libsci_acc

Cray libsci_acc provides BLAS, LAPACK, and ScaLAPACK routines that improve performance by generating and running automatically tuned accelerator kernels on the XK nodes when appropriate.  To use it with PrgEnv-cray, just add the module:

module load craype-accel-nvidia35 # <-- automatically includes libsci_acc

aprun -cc none -n <numranks> ...   # Cray recommends allowing threads to migrate within a node when using libsci_acc

 

See also: "man intro_libsci_acc".

Additional Information / References