CPMAT (Cray CrayPat/Perftools)

The Cray Performance Measurement and Analysis Tools (CPMAT) is the primary performance analysis toolset for Cray systems. The toolset provides a productive environment that assists the user with application performance analysis and optimization. It consists of components that prepare a program for performance analysis experiments, capture performance data during program execution, process and analyze the data, and present performance results to the user both in a text report and through an interactive graphical user interface.

Perftools-lite

Cray provides a simple method to get started with profiling your application via the perftools-lite module. The steps for using it are:

1. Load the modules: module unload darshan; module load perftools-base, plus one of the following:
      perftools-lite          # time sampling experiment
      perftools-lite-gpu      # for GPU kernels
      perftools-lite-events   # tracing top functions, may include rank reorder suggestions
      perftools-lite-loops    # for use with Cray Reveal and loop analysis
2. Rebuild your application (compile and link), then run it as usual.
3. View the *.rpt text report and/or use the Apprentice2 GUI (app2) to inspect the *.ap2 file(s) produced from the run.
4. Gather more detailed information with pat_report or Cray Apprentice2 after the run.
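
The steps above can be sketched as a short shell script. This is only a minimal sketch, not a site-provided script: the application name myapp, the Makefile-based build, the aprun launcher, and the rank count are all assumptions made for illustration. With DRY_RUN=1 (the default) the script only prints the commands it would run.

```shell
#!/bin/bash
# Sketch of the perftools-lite workflow described above.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

perftools_lite_workflow() {
    local app="$1" ranks="$2" flavor="${3:-perftools-lite}"
    run module unload darshan
    run module load perftools-base
    run module load "$flavor"           # e.g. perftools-lite-events
    run make clean                      # assumption: Makefile-based build
    run make "$app"                     # rebuild with the module loaded
    run aprun -n "$ranks" "./$app"      # assumption: aprun is the launcher
}

perftools_lite_workflow myapp 32
```

On a Cray system, setting DRY_RUN=0 would execute the same sequence; the *.rpt and *.ap2 files mentioned above should then appear after the run.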

CrayPat

CrayPat is the data capture tool, which is used to prepare user programs for performance analysis experiments, to specify the kind of data to be captured during program execution, and to prepare the captured data for text reports or for use with other programs. The user can optionally control the behavior of the instrumented program during execution through a set of runtime environment variables that affect what and how the performance data is collected. Examples of this include the enabling of predefined hardware counter groups that track chosen sets of hardware events, the ability to choose the mechanism to use to sample the application, and the ability to modify the number of data files that are written in parallel by the processes. By default, a runtime summarization of the data is provided, which involves aggregation of the data.

HOW TO USE CRAYPAT

Follow the steps below:

1. module unload darshan; module load perftools-base perftools
2. Build your application and keep all the .o files (these two steps are important: the program must be built with the perftools module already loaded for this to work).
3. Instrument the application using pat_build. This generates a new executable whose name has "+pat" appended.
4. Run the instrumented executable generated in step 3, probably within a job. This generates a performance data file with the suffix ".xf".
5. Run pat_report on the performance data file (generated in step 4) to view the results.

As an example, to build and instrument an executable, the command sequence can be:

% module unload darshan; module load perftools-base perftools

% ftn -c mytest.f90

% ftn -o mytest mytest.o

% pat_build -S mytest

To view the CrayPat result after step 4, do:

% pat_report mytest+pat+123-456s.xf

This will generate the file mytest+pat+123-456s.ap2.

The information in the .xf file (or directory) is dependent on the exact executable that was run to generate it. You should run pat_report soon after you get your results and before you build your next version of the executable. If you over-write the executable before running pat_report, data in the .xf file becomes useless. pat_report generates a .ap2 file, which contains the same information as the .xf file but with all the meta-information from the executable. The user can use pat_report to extract useful information from the .ap2 file at a later time, without the original executable, on another machine, etc.
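
Because the .xf data is tied to the executable that produced it, it can help to run pat_report as the last step of the job itself. The sketch below picks the most recently modified .xf file (or directory) and processes it. The helper name latest_xf and the .txt report name are hypothetical choices made for this example, and the pat_report step is skipped when the tool is not in the PATH.

```shell
#!/bin/bash
# Print the most recently modified *.xf entry in a directory (default: .).
# Prints nothing if there is no match.
latest_xf() {
    local dir="${1:-.}" newest="" f
    for f in "$dir"/*.xf; do
        [ -e "$f" ] || continue               # the glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    if [ -n "$newest" ]; then
        echo "$newest"
    fi
}

xf=$(latest_xf .)
if [ -n "$xf" ] && command -v pat_report >/dev/null 2>&1; then
    # Process the data now, before the executable is rebuilt.
    pat_report "$xf" > "${xf%.xf}.txt"
fi
```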

SAMPLING EXAMPLES

CrayPat can be used to sample counters at the entrance and exit to certain functions. This is a good way to measure the call frequency of those functions and measure their intensity of certain operations (floating point instructions, for instance).

The "-T" option tells CrayPat to instrument the calls (both entrance and exit) of the function name (or names) listed after that option. For instance, "pat_build -T my_main_function -w my_program" will create an instrumented executable version of my_program where the entrance and exit of the function my_main_function are instrumented for counter recording. Instrumenting one main function (which is likely called once and exits once) is useful for, say, measuring the floating-point instruction density of the entire program. Fortran 90 and C++ users need to use the mangled form of the function name from the linking stage with the -T option. To look up the mangled name for a function "myabcroutine", use: nm a.out | grep myabcroutine.
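
The nm lookup above can be narrowed so that only defined code symbols are listed, which are the names that can be passed to -T. The following is a minimal sketch: the helper name find_symbol is hypothetical, and it assumes the usual three-column nm output (address, type, name).

```shell
#!/bin/bash
# Read nm output on stdin and print the names of text (code) symbols
# that contain the given routine name, ignoring case.
find_symbol() {
    awk -v name="$1" '
        ($2 == "T" || $2 == "t") && index(tolower($3), tolower(name)) {
            print $3
        }'
}

# Example use; a.out and myabcroutine are the placeholders from the text.
if [ -f a.out ]; then
    nm a.out | find_symbol myabcroutine
fi
```

Undefined references (type U) and data symbols are dropped, so the output is a short list of candidate names for pat_build -T.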

The counters that are sampled by a CrayPat executable are controlled by environment variables at *run time*, when the instrumented +pat executable is run. Put those definitions in the run script of the job so that the run script environment passes the variables on and they are visible to the running program. The environment variable PAT_RT_PERFCTR (named PAT_RT_HWPC in older versions of CrayPat) selects the counters that are recorded. You can record up to two performance counters simultaneously; their names must be separated by commas in the environment variable. To run an instrumented program that measures the number of floating-point operations and the total number of instructions executed within the instrumented functions, put this statement in your run script:

export PAT_RT_PERFCTR="PAPI_FP_OPS,PAPI_TOT_INS"

PAPI_FP_OPS counts floating-point operations, and PAPI_TOT_INS counts the total instructions executed.
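
A run script can guard against a malformed counter list before the job starts. The sketch below enforces the two-counter limit stated above; the helper name set_perfctr is hypothetical, and the check does not verify that the counter names exist on the system.

```shell
#!/bin/bash
# Export PAT_RT_PERFCTR only if the comma-separated list holds
# one or two counter names (the limit described in the text above).
set_perfctr() {
    local list="$1"
    local IFS=','
    set -- $list                      # split the list on commas
    if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
        echo "expected 1 or 2 counter names, got $#" >&2
        return 1
    fi
    export PAT_RT_PERFCTR="$list"
}

set_perfctr "PAPI_FP_OPS,PAPI_TOT_INS"
echo "PAT_RT_PERFCTR=$PAT_RT_PERFCTR"
# Launch the instrumented +pat executable after this point, as usual.
```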

Typically, if you're running an executable for the first time, or are making modifications, you'll first instrument the application with pat_build -S and run a sampling experiment to get a general idea of which functions the code spends the most time executing. The next step might be to instrument just those few functions using the -T option.
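
The second stage of that workflow can be sketched as a small helper that assembles the pat_build command line from a list of hot functions. The helper name pat_build_trace_cmd and the function names compute_flux_ and update_halo_ are hypothetical; repeating -T once per function is an assumption made here to keep each name unambiguous. The helper only prints the command, so it can be reviewed before being run.

```shell
#!/bin/bash
# Print a pat_build command that traces each named function.
# Usage: pat_build_trace_cmd PROGRAM FUNCTION [FUNCTION...]
pat_build_trace_cmd() {
    local prog="$1"
    shift
    local cmd="pat_build" fn
    for fn in "$@"; do
        cmd="$cmd -T $fn"             # one -T per function to trace
    done
    echo "$cmd -w $prog"
}

# Stage 1 (sampling) is: pat_build -S mytest
# Stage 2, after reading the sampling report:
pat_build_trace_cmd mytest compute_flux_ update_halo_
```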

Cray Apprentice

Cray Apprentice displays data that was captured by CrayPat. This visualization tool displays a variety of different data panels, depending on the type of performance experiment that was conducted. Its purpose is to help identify conditions such as load imbalance, excessive serialization, excessive communication, and network contention.

Cray Apprentice provides call-graph-based profile information with source code mapping and timeline-based trace visualization, also with source code mappings. It is capable of running either on the Cray system service nodes, or on a remote Linux server or workstation. Examples of Cray Apprentice displays are depicted below.

Other Features of CPMAT

AUTOMATIC PROFILING ANALYSIS

Automatic Profiling Analysis, or APA, is a feature of the Cray Performance Tools designed to simplify the process of program instrumentation and data collection for purposes of performance analysis.

APA works by first profiling the application for time usage. It uses the profile information gathered to determine an appropriate data collection strategy specifically for that application. APA specifies the strategy in the form of a template file that can be used as input into subsequent performance analysis runs. The steps for using this feature are similar to the five steps discussed for the use of CrayPat. The difference is the "-O apa" option for pat_build.

Therefore, the above example becomes:

% module unload darshan; module load perftools-base perftools

% ftn -c mytest.f90

% ftn -o mytest mytest.o

% pat_build -O apa mytest

This will produce the executable mytest+pat. After running mytest+pat, do:

% pat_report mytest+pat+123-456s.xf

This will generate the mytest+pat+123-456s.ap2 and mytest+pat+123-456s.apa files. The ".apa" file contains suggested pat_build options for building an executable for tracing experiments. One can edit this file, if desired, and use it to re-instrument the program for future tracing.
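
Re-instrumenting from the template can be sketched as follows. The .apa file is passed to pat_build with the -O option; the file names reuse the example above, and the step is only executed when both the file and pat_build are present, otherwise the command is just printed.

```shell
#!/bin/bash
# Derive the .apa template name from the .xf data file name and
# re-instrument the program with it.
xf="mytest+pat+123-456s.xf"
apa="${xf%.xf}.apa"

if [ -f "$apa" ] && command -v pat_build >/dev/null 2>&1; then
    pat_build -O "$apa"
else
    echo "would run: pat_build -O $apa"
fi
```

Review the .apa file first; it is a plain text file, and the suggested options can be trimmed or extended before this step.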

CLIENT SERVER MODEL

To further improve tool response time, a new distributed Cray Apprentice2 client for Linux has been introduced so that the graphical presentation is handled locally and not passed through the ssh connection between the user's laptop and the Cray service node. Prior to the client/server model, all of the performance data collected from an experiment needed to be loaded into memory before any results were displayed. This created size limitations as well as long load times. The combination of a new scalable data format and this new client/server model minimizes the amount of data loaded into memory at any given time and thus creates a smaller footprint on the Cray service node. A client version for Linux is currently available.

AUTOMATIC GRID DETECTION

This is example output from a pat_report:

================  Observations and suggestions  ========================

MPI Grid Detection:

    There appears to be point-to-point MPI communication in a 96 X 8
    grid pattern. The 51.4% of the total execution time spent in MPI
    functions might be reduced with a rank order that maximizes
    communication between ranks on the same node. The effect of several
    rank orders is estimated below.

    A file named MPICH_RANK_ORDER.Grid was generated along with this
    report and contains usage instructions and the Custom rank order
    from the following table.

         Rank    On-Node     On-Node   MPICH_RANK_REORDER_METHOD
        Order   Bytes/PE   Bytes/PE%
                            of Total
                            Bytes/PE

        Custom  1.851e+12      96.56%  3
           SMP  1.459e+12      76.08%  1
          Fold  1.056e+09       0.06%  2
    RoundRobin  0.000e+00       0.00%  0
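
Applying the suggested Custom order in a later run can be sketched as below. This assumes the Cray MPICH convention that method 3 (the Custom row in the table) reads a file named MPICH_RANK_ORDER from the directory the job runs in; the helper name use_rank_order is hypothetical. The usage instructions inside the generated MPICH_RANK_ORDER.Grid file take precedence over this sketch.

```shell
#!/bin/bash
# Copy a generated rank order file into place and select the
# custom reorder method for the next run.
# Usage: use_rank_order RANK_ORDER_FILE [RUN_DIRECTORY]
use_rank_order() {
    local src="$1" rundir="${2:-.}"
    if [ ! -f "$src" ]; then
        echo "rank order file not found: $src" >&2
        return 1
    fi
    cp -- "$src" "$rundir/MPICH_RANK_ORDER"
    export MPICH_RANK_REORDER_METHOD=3    # 3 = Custom in the table above
}

if [ -f MPICH_RANK_ORDER.Grid ]; then
    use_rank_order MPICH_RANK_ORDER.Grid
fi
# Launch the application after this point, as usual.
```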

APPRENTICE2 Timeline view

Use the microscope icon in app2 to look at the MPI rank (PE) and thread (TH) timelines.

[Figure: ijk triples timeline view in Apprentice2]

Cray Reveal

Reveal is Cray's next-generation integrated performance analysis and code optimization tool. Reveal extends Cray's existing performance measurement, analysis, and visualization technology by combining run-time performance statistics and program source code visualization with Cray Compiling Environment (CCE) compile-time optimization feedback. See the man page "man reveal" with perftools loaded for more information.

[Figure: Cray Reveal example view]

Additional Material

Slides on Cray Performance Tools [pdf1] [pdf2] from Programming Environment Training
2016 Blue Waters Symposium presentation on CPU performance tuning
More information can be found at CRAYDOC. Always refer to docs.cray.com for the version available on your system or the latest version of the document.
Using Cray Performance Analysis Tools - version 5.3.0 [PDF]
Man Page - pat_build
Man Page - intro_craypat