Profiling and performance measurement with OpenACC and CUDA code

Profiling and performance counters are available with both PrgEnv-cray and PrgEnv-pgi for OpenACC codes. For CUDA codes, the NVIDIA profiling tools may be used.

OpenACC performance measurement and profiling

PrgEnv-cray

Cray perftools can gather profile and counter information from OpenACC programs. Compile the executable, then instrument it with pat_build. The following is one example; other compiler and pat_build flags may also be appropriate:

ftn -h func_trace -o mycode.exe mycode.f90
pat_build -w mycode.exe

Perftools can capture any one of the defined accelerator counters, selected with the PAT_RT_ACCPC environment variable. See "man accpc_k20" for the full list of available metrics. Only one metric from the set may be measured per aprun invocation. The requirements for a batch job are:

module load PrgEnv-cray
module load craype-accel-nvidia35
module load perftools
module unload darshan
export PAT_RT_ACCPC=threads_launched
export CRAY_ACC_DEBUG=1 # <-- set this to trace all the kernel calls to the device
                        # to stderr, see man intro_openacc for more info.  Levels 1,2,3
                        # will increase the level of detail.  Level 3 shows information
                        # about how kernels are launched.
# an alternative and simpler perftools approach:
# module load perftools-base perftools-lite-gpu
# module unload darshan
# rebuild the code and run the resulting a.out;
# *.rpt and *.ap2 files will be generated automatically
aprun -n N mycode.exe+pat
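
Because only one accelerator counter can be measured per aprun invocation, collecting several counters means repeating the run once per counter. The following is a minimal sketch of a helper that prints the corresponding job-script lines. The function name accpc_run_lines and the counter name "another_counter" are placeholders for illustration, not part of perftools; take real counter names from "man accpc_k20".

```python
def accpc_run_lines(metrics, nranks, exe="mycode.exe+pat"):
    """Return job-script lines: one export/aprun pair per accelerator counter."""
    lines = []
    for metric in metrics:
        # Each run measures exactly one counter, selected via PAT_RT_ACCPC.
        lines.append(f"export PAT_RT_ACCPC={metric}")
        lines.append(f"aprun -n {nranks} {exe}")
    return lines


if __name__ == "__main__":
    # "threads_launched" comes from the example above; "another_counter" is a
    # placeholder to be replaced with a name from "man accpc_k20".
    for line in accpc_run_lines(["threads_launched", "another_counter"], nranks=4):
        print(line)
```

The printed lines can be pasted into the batch script after the module commands. Each run writes its own experiment data, which is then processed separately.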

After the batch job has run, a .xf file (or, for MPI codes, a directory whose name ends in "t") will be created. Process the .xf file or directory with pat_report to create a .ap2 file or directory that you can view with apprentice2 (app2).

pat_report mycode.exe+pat+78082-81t.xf
... # text output results from pat_report, can redirect to file with "> file.rpt"

Table 2:  Time and Bytes Transferred for Accelerator Regions

  Host  |   Host  |   Acc  | Acc Copy  | Acc Copy  | Events  |Calltree
 Time%  |   Time  |  Time  |       In  |      Out  |         | PE=HIDE
        |         |        | (MBytes)  | (MBytes)  |         |  Thread=HIDE

 100.0% | 117.505 | 49.067 |     29781 |     0.063 |    9574 |Total
|----------------------------------------------------------------------------------
| 100.0% | 117.505 | 49.067 |     29781 |    0.063 |    9574 |cc_triples_restart_
|       |        |        |          |          |         | cc_triples_
...

# launch the X-window apprentice2 GUI:
app2 mycode.exe+pat+78082-81t.ap2

PrgEnv-pgi

Profiling and tracing information for OpenACC kernels is available via environment variables in the PGI environment. The batch job needs:

module load cudatoolkit
module load PrgEnv-pgi
module unload darshan
export PGI_ACC_TIME=1  # profile; and/or set PGI_ACC_NOTIFY=1 or 3 for tracing
aprun -n N mycode.exe

stdout will contain information about OpenACC regions:

main()
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269

Accelerator Kernel Timing data
./laplace2d.c
  laplace
    74: region entered 1000 times
        time(us): total=1,742,193 init=192 region=1,742,001
                  kernels=1,624,325
        w/o init: total=1,742,001 max=72,263 min=1,666 avg=1,742
        77: kernel launched 1000 times
            grid: [64x1024]  block: [64x4]
            time(us): total=1,624,325 max=1,683 min=1,620 avg=1,624
./laplace2d.c
  laplace
    63: region entered 1000 times
        time(us): total=3,973,944 init=151 region=3,973,793
                  kernels=3,572,204
        w/o init: total=3,973,793 max=66,883 min=3,899 avg=3,973
        66: kernel launched 1000 times
            grid: [64x1024]  block: [64x4]
            time(us): total=3,435,500 max=4,745 min=3,429 avg=3,435
        70: kernel launched 1000 times
            grid: [1]  block: [256]
            time(us): total=136,704 max=1,384 min=134 avg=136
./laplace2d.c
  laplace
    58: region entered 1 time
        time(us): total=6,259,767 init=469,009 region=5,790,758
                  data=71,063
        w/o init: total=5,790,758 max=5,790,758 min=5,790,758 avg=5,790,758
 total: 6.259794 s
Application 140307 exit codes: 19
Application 140307 resources: utime ~4s, stime ~3s
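
For longer runs it can be convenient to pull the per-kernel totals out of this report with a script. The following is a rough sketch that assumes the report layout shown above (a "kernel launched" line followed by a "time(us): total=..." line); the function name kernel_totals is illustrative only, and the layout may differ between PGI versions.

```python
import re

# "<source line>: kernel launched <count> times"
KERNEL_RE = re.compile(r"^\s*(\d+): kernel launched (\d+) times?")
# "time(us): total=<n>" with thousands separators
TOTAL_RE = re.compile(r"time\(us\): total=([\d,]+)")

# Excerpt copied from the sample output above.
SAMPLE = """\
./laplace2d.c
  laplace
    63: region entered 1000 times
        time(us): total=3,973,944 init=151 region=3,973,793
                  kernels=3,572,204
        w/o init: total=3,973,793 max=66,883 min=3,899 avg=3,973
        66: kernel launched 1000 times
            grid: [64x1024]  block: [64x4]
            time(us): total=3,435,500 max=4,745 min=3,429 avg=3,435
        70: kernel launched 1000 times
            grid: [1]  block: [256]
            time(us): total=136,704 max=1,384 min=134 avg=136
"""


def kernel_totals(text):
    """Map each kernel's source line to (launch count, total time in us)."""
    totals = {}
    current = None  # kernel header still waiting for its time(us) line
    for row in text.splitlines():
        m = KERNEL_RE.match(row)
        if m:
            current = (int(m.group(1)), int(m.group(2)))
            continue
        m = TOTAL_RE.search(row)
        if m and current is not None:
            line, launches = current
            totals[line] = (launches, int(m.group(1).replace(",", "")))
            current = None
    return totals


if __name__ == "__main__":
    for line, (launches, us) in sorted(kernel_totals(SAMPLE).items()):
        print(f"line {line}: {launches} launches, {us / 1e6:.3f} s")
```

Region-level time(us) lines are skipped because no kernel header precedes them. The sketch keys results by source line only, so kernels from different files that share a line number would overwrite each other.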

CUDA performance measurement and profiling

Only one of the two methods below (nvprof or the command-line profiler) may be used per program invocation.

Nvprof

The NVIDIA profiling and tracing tool nvprof is available and can be used with CUDA code. The requirements for using nvprof from a batch job are:

module load cudatoolkit
module unload darshan
export LD_LIBRARY_PATH=$CRAY_CUDATOOLKIT_DIR/lib64:$LD_LIBRARY_PATH
export COMPUTE_PROFILE=0  # or unset

# sample MPI wrapper script for profiling MPI applications with nvprof
# ( aprun -n <ranks> wrap.sh )
$ cat wrap.sh
#!/bin/bash
export LD_LIBRARY_PATH=$CRAY_CUDATOOLKIT_DIR/lib64:$LD_LIBRARY_PATH
# start nvprof in the background; %h = hostname, %p = process ID,
# %q{ALPS_APP_PE} = value of the ALPS_APP_PE environment variable (the rank)
nvprof -o output.%h.%p.%q{ALPS_APP_PE} --profile-all-processes  &
sleep 1
`pwd`/laplace2d_f90_mpi_acc

This is a sample run.

laplace2d-data> export \
  LD_LIBRARY_PATH=/opt/nvidia/cudatoolkit/default/lib64:$LD_LIBRARY_PATH
laplace2d-data> cd $PBS_O_WORKDIR
laplace2D-data> aprun -b -n 1 nvprof laplace2d_accpgi
======== NVPROF is profiling laplace2d_accpgi...
======== Command: laplace2d_accpgi
main()
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 6.712810 s
======== Warning: Application returned non-zero code 19
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   65.24     3.48s    1000    3.48ms    3.47ms    3.49ms  laplace_66_gpu
   31.11     1.66s    1000    1.66ms    1.66ms    1.66ms  laplace_77_gpu
    2.41  128.73ms    1000  128.73us  127.68us  130.33us  laplace_70_gpu_red
    0.72   38.63ms    1001   38.59us    2.53us   36.03ms  [CUDA memcpy DtoH]
    0.51   27.25ms    1128   24.16us    3.74us  182.66us  [CUDA memcpy HtoD]
Application 83077 resources: utime ~5s, stime ~3s
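
The summary table mixes units (s, ms, us), which makes rows awkward to compare or sum by eye. The following is a small sketch that converts the rows to seconds, assuming the column layout shown above; the names to_seconds and parse_nvprof_summary are illustrative and not part of nvprof.

```python
# Unit suffixes used in the nvprof summary, as multipliers to seconds.
UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

# Rows copied from the sample output above.
SAMPLE = """\
 Time(%)      Time   Calls       Avg       Min       Max  Name
   65.24     3.48s    1000    3.48ms    3.47ms    3.49ms  laplace_66_gpu
   31.11     1.66s    1000    1.66ms    1.66ms    1.66ms  laplace_77_gpu
    2.41  128.73ms    1000  128.73us  127.68us  130.33us  laplace_70_gpu_red
    0.72   38.63ms    1001   38.59us    2.53us   36.03ms  [CUDA memcpy DtoH]
    0.51   27.25ms    1128   24.16us    3.74us  182.66us  [CUDA memcpy HtoD]
"""


def to_seconds(field):
    """Convert a value such as '128.73ms' to seconds."""
    # Check two-letter suffixes first so 'ms' is not mistaken for 's'.
    for suffix in ("ns", "us", "ms", "s"):
        if field.endswith(suffix):
            return float(field[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"no known time unit in {field!r}")


def parse_nvprof_summary(text):
    """Return a list of (name, total seconds, calls) for each table row."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 6)  # the name may itself contain spaces
        if len(fields) != 7:
            continue
        try:
            float(fields[0])  # header and banner lines fail here
            rows.append((fields[6].strip(), to_seconds(fields[1]), int(fields[2])))
        except ValueError:
            continue
    return rows


if __name__ == "__main__":
    for name, seconds, calls in parse_nvprof_summary(SAMPLE):
        print(f"{name:24s} {seconds:10.5f} s  {calls:6d} calls")
```

Lines that do not look like table rows (the "========" banners and the header) are ignored, so the full captured output of a run can be passed in unchanged.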

Command-line profiler via environment variables (MPI or serial profiling, for CUDA versions < 9.x)

In addition to nvprof, the CUDA environment provides a built-in profiler through the CUDA libraries linked into your code. PGI OpenACC code can also be profiled with this method. MPI codes profiled this way may be analyzed with the NVVP tool by following the steps at http://docs.nvidia.com/cuda/profiler-users-guide/index.html#import-multi-nvprof-session (section 2.2.2.3). Enable the built-in profiler by setting COMPUTE_PROFILE to a non-zero value:

-data> module unload darshan
-data> export COMPUTE_PROFILE=1
nid00031-[IN_JOB]arnoldg@nid00010:-data> aprun -b -n 1 ./laplace2d_acc
main()
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 3.945941 s
Application 140290 resources: utime ~4s, stime ~1s
nid00031-[IN_JOB]arnoldg@nid00010:-data> ls -lt|head -2
total 3876
-rw------- 1 arnoldg bw_staff  236416 Mar 12 13:39 cuda_profile_0.log
-data> more cuda_profile_0.log
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla K20X
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff69047ada518
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 53270.656 ] cputime=[ 53558.000 ]
method=[ memcpyHtoD ] gputime=[ 1.600 ] cputime=[ 37.000 ]
method=[ laplace$ck_L64_3 ] gputime=[1899.712 ] cputime=[ 26.0 ] occupancy=[ 0.75 ]
method=[ memcpyDtoH ] gputime=[ 3.104 ] cputime=[ 49.000 ]
method=[ laplace$ck_L75_5 ] gputime=[1757.760 ] cputime=[ 10.0 ] occupancy=[ 1.00 ]
method=[ laplace$ck_L64_3 ] gputime=[1905.536 ] cputime=[ 8.0 ] occupancy=[ 0.75 ]
...
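
The log holds one record per kernel launch or memory copy, so a per-method summary has to be computed afterwards. The following is a minimal sketch that totals gputime per method, assuming the key=[ value ] record layout shown above; the function name gputime_by_method is illustrative only.

```python
import re
from collections import defaultdict

# Matches "method=[ name ] gputime=[ value ]" at the start of a record.
RECORD_RE = re.compile(r"method=\[\s*(.+?)\s*\]\s+gputime=\[\s*([\d.]+)\s*\]")

# Records copied from the sample log above.
SAMPLE = """\
# CUDA_PROFILE_LOG_VERSION 2.0
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 53270.656 ] cputime=[ 53558.000 ]
method=[ memcpyHtoD ] gputime=[ 1.600 ] cputime=[ 37.000 ]
method=[ laplace$ck_L64_3 ] gputime=[1899.712 ] cputime=[ 26.0 ] occupancy=[ 0.75 ]
method=[ memcpyDtoH ] gputime=[ 3.104 ] cputime=[ 49.000 ]
method=[ laplace$ck_L75_5 ] gputime=[1757.760 ] cputime=[ 10.0 ] occupancy=[ 1.00 ]
method=[ laplace$ck_L64_3 ] gputime=[1905.536 ] cputime=[ 8.0 ] occupancy=[ 0.75 ]
"""


def gputime_by_method(text):
    """Return {method: (record count, summed gputime)} for a profile log."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for line in text.splitlines():
        m = RECORD_RE.search(line)
        if m is None:
            continue  # comment lines and the column header
        counts[m.group(1)] += 1
        totals[m.group(1)] += float(m.group(2))
    return {name: (counts[name], totals[name]) for name in counts}


if __name__ == "__main__":
    summary = gputime_by_method(SAMPLE)
    # Print the most expensive methods first.
    for name, (count, total) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
        print(f"{name:20s} {count:6d} records  gputime total {total:12.3f}")
```

To process a real log, read the file contents (for example cuda_profile_0.log) and pass the text to gputime_by_method in place of SAMPLE.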

For MPI codes, a wrapper script can be used to give each node its own log file, named after the hostname. Use aprun with the wrapper script:

>cat simpleMPI.sh
#!/bin/bash -login
module load cudatoolkit
THIS_NODE=`hostname`
export COMPUTE_PROFILE_LOG=$THIS_NODE.log
export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CSV=1
export COMPUTE_PROFILE_CONFIG=mynvvp.cfg
./simpleMPI

>cat mynvvp.cfg
streamid
gpustarttimestamp

>grep aprun myjobscript.pbs

        aprun -b -n 16 ./simpleMPI.sh

> nvvp *.log  # after job completes