Perftools with Cray's timeline view: OpenACC

 

The latest perftools releases support OpenACC accelerated regions (and they should support CUDA in a similar fashion). A compute_pi C code (cpi.c) was marked up with OpenACC pragmas for the main loop and run with perftools to produce views with app2 and reveal.
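For orientation, here is a minimal sketch of what an OpenACC-annotated pi loop can look like. It is not the cpi.c used for the run on this page: the function names `compute_pi`, `pi_error`, and `print_pi` are made up for illustration, the MPI calls implied by the 2-rank aprun line are left out, and the exact pragma in the original source is not shown here, so `parallel loop` with a `reduction` clause is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Reference value of pi, used only to report the error. */
static const double PI_REF = 3.141592653589793238462643;

/* Midpoint-rule integration of 4/(1+x^2) over [0,1] with n intervals. */
double compute_pi(int n)
{
    assert(n > 0);
    const double h = 1.0 / (double)n;
    double sum = 0.0;

    /* Offload the main loop to the accelerator. A compiler without
       OpenACC support ignores the pragma and runs the loop on the host. */
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Absolute difference between an estimate and the reference value. */
double pi_error(double pi)
{
    double d = pi - PI_REF;
    return d < 0.0 ? -d : d;
}

/* Print the estimate in the same style as the job output on this page. */
void print_pi(double pi)
{
    printf("pi is approximately %.16f, Error is %.16f\n", pi, pi_error(pi));
}
```

Calling `print_pi(compute_pi(1000000))` from `main` would exercise the loop. In a sketch like this, the scalar inputs (`h`, `n`) are what get copied to the accelerator and the reduced `sum` is what comes back to the host, which is the kind of traffic the `ACC: Transfer` lines report.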

Here are the steps.

(for building):

module load craype-accel-nvidia35

module unload darshan

module load perftools/6.1.3 # or later

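cc -h pl=myprogram_library.pl cpi.c  # writes the program library that reveal reads

cc -h acc,msgs cpi.c  # builds a.out with OpenACC enabled and compiler messages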

pat_build -u -gmpi a.out

(for the compute node/job):

module unload darshan # conflicts with perftools 

module load perftools 

export CRAY_CUDA_MPS=0  # leaving MPS enabled crashes perftools; also make sure CRAY_CUDA_PROXY is unset or 0

export PAT_RT_SUMMARY=0  # enables the timeline view for app2

# see Cray Performance Measurement and Analysis Tools, section 5.4.11, page 76.

export CRAY_ACC_DEBUG=1  # prints the ACC: lines shown below; ok simultaneously with perftools

aprun -n 2 -N 1 ./a.out+pat

 CrayPat/X:  Version 6.1.3 Revision 12145  11/18/13 21:56:10
     ACC: Transfer 1 items (to acc 8 bytes, to host 0 bytes) from cpi.c:18
     ACC: Transfer 2 items (to acc 4 bytes, to host 0 bytes) from cpi.c:18
     ACC: Execute kernel main$ck_L18_2 async(auto) from cpi.c:18
     ACC: Transfer 2 items (to acc 0 bytes, to host 0 bytes) from cpi.c:18
     ACC: Transfer 1 items (to acc 8 bytes, to host 0 bytes) from cpi.c:18
     ACC: Transfer 2 items (to acc 4 bytes, to host 0 bytes) from cpi.c:18
     ACC: Execute kernel main$ck_L18_2 async(auto) from cpi.c:18
     ACC: Transfer 2 items (to acc 0 bytes, to host 0 bytes) from cpi.c:18
     ACC: Wait async(auto) from cpi.c:21
     ACC: Transfer 1 items (to acc 0 bytes, to host 8 bytes) from cpi.c:21
     pi is approximately 3.1415926535897931, Error is 0.0000000000000000
     ACC: Wait async(auto) from cpi.c:21
     ACC: Transfer 1 items (to acc 0 bytes, to host 8 bytes) from cpi.c:21
     Experiment data file written:
     /mnt/abc/u/staff/arnoldg/c/cpi/a.out+pat+208328-81t.xf
     Application 208328 resources: utime ~0s, stime ~2s, Rss ~120416, inblocks ~3389, outblocks ~4231

(for analysis after the job has run):

pat_report a.out+pat+*.xf  # writes the text report and the .ap2 file

app2  a.out+pat+*.xf

reveal myprogram_library.pl a.out+pat+*.ap2