OpenACC and OpenMP4.x Accelerator Directives

The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators. The OpenMP API is similar in functionality and performance. There is nearly a 1:1 mapping between equivalent OpenACC "acc" and OpenMP 4.x "omp target" directives.

How to use OpenACC or OpenMP4.x

Use the options from the table below to match your project requirements. NCSA recommends using only one programming environment for your code development and testing. The Cray compiler tends to adhere very strictly to the standard, while the PGI compiler allows for more flexibility in mixing directives that are not explicitly stated to work together. Note that combining OpenACC directives within host OpenMP regions will result in runtime errors with the Cray programming environment, so it's best to disable host OpenMP when getting started with the Cray compilers and OpenACC. OpenMP 4.x "omp target" directives are only supported by the Cray compiler to date.
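The near 1:1 mapping between "acc" and "omp target" directives can be illustrated with a minimal sketch. This is not the NCSA Laplace2d sample code: the function names jacobi_sweep_acc and jacobi_sweep_omp and the row-major array layout are assumptions made for illustration only. Both functions perform the same 4-point stencil update; only the directive line differs.

```c
/* Hypothetical illustration, not taken from laplace2d.c or laplace2d_omp4.c.
 * Grids are n rows by m columns, stored row-major in flat arrays. */

/* One relaxation sweep marked up with OpenACC. */
void jacobi_sweep_acc(int n, int m, const double *a, double *anew)
{
    /* copyin: a is read only; copy: anew keeps its boundary values. */
    #pragma acc parallel loop collapse(2) copyin(a[0:n*m]) copy(anew[0:n*m])
    for (int j = 1; j < n - 1; ++j)
        for (int i = 1; i < m - 1; ++i)
            anew[j * m + i] = 0.25 * (a[j * m + i + 1] + a[j * m + i - 1]
                                    + a[(j - 1) * m + i] + a[(j + 1) * m + i]);
}

/* The same sweep marked up with OpenMP 4.x target directives. */
void jacobi_sweep_omp(int n, int m, const double *a, double *anew)
{
    /* map(to:) plays the role of copyin, map(tofrom:) the role of copy. */
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: a[0:n*m]) map(tofrom: anew[0:n*m])
    for (int j = 1; j < n - 1; ++j)
        for (int i = 1; i < m - 1; ++i)
            anew[j * m + i] = 0.25 * (a[j * m + i + 1] + a[j * m + i - 1]
                                    + a[(j - 1) * m + i] + a[(j + 1) * m + i]);
}
```

When accelerator support is not enabled at compile time, the directives are typically ignored and both loops run unchanged on the host, which is a convenient way to check results before offloading.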
Examples

These are samples for a Laplace2d solver demonstrating OpenACC in Fortran and C. (Note that the Fortran code contains OpenMP directives for comparison purposes; in practice you would compile either the OpenMP version or the OpenACC version.) The samples were built with PrgEnv-pgi: laplace2d.c, timer.h, laplace2d.f90. This sample was built with the PrgEnv-cray and craype-accel-nvidia35 modules and represents the OpenMP4.x markup of the Laplace code: laplace2d_omp4.c.

Compiler feedback

Both programming environments provide good feedback when using OpenACC or OpenMP directives. Sample output from -h msgs (Cray) and -Minfo=accel (PGI) is shown. The Cray loopmark .lst output will also show accelerated regions of code in the marked-up source.

Cray compiler sample output ( -h msgs ):

    CC-6430 craycc: ACCEL File = laplace2d_omp4.c, Line = 65
      A loop was partitioned across the 128 threads within a threadblock.

PGI compiler sample output ( -Minfo=accel ):

    145, Accelerator kernel generated
    145, CC 1.3 : 18 registers; 112 shared, 32 constant, 0 local memory bytes
         CC 2.0 : 26 registers; 0 shared, 132 constant, 0 local memory bytes
    148, ...
    169, Sum reduction generated for sum1
    145, Generating present_or_copy(part(:4,:nop))
         Generating present_or_copyin(fxy(:,:,:))
    ...

libsci_acc

Through Cray libsci_acc, BLAS, LAPACK, and ScaLAPACK routines are provided to improve performance by generating and running automatically-tuned accelerator kernels on the XK nodes when appropriate. Use it with PrgEnv-cray or PrgEnv-gnu, by adding the module:
aprun -cc none -n <numranks> ... # Cray recommends allowing threads to migrate within a node when using libsci_acc
See also: "man intro_libsci_acc".

Additional Information / References