
Low Level Benchmark Instructions and Result Tables

Introduction

Benchmarks play a critical role in the performance evaluation of systems.  The Low Level-2017 (LL-2017) benchmarks serve three purposes:

  1. The benchmarks are carefully chosen to represent characteristics of the current Blue Waters workload and possible future workloads, which consist of solving complex scientific problems using diverse computational techniques at high degrees of parallelism.
  2. The benchmarks provide concrete data on the performance and scalability of the systems under evaluation.
  3. The benchmarks can be used as part of the system acceptance and/or regression testing and as a measurement of performance throughout the operational lifetime of the system.

Observed benchmark performance should be obtained from a system under consideration or a system configured as closely as possible to the target system.  For systems targeted at supporting highly parallel computation, it is critical that the evaluators provide observed benchmark performance using the largest (most parallel) test inputs as well as other sizes.  The largest scale jobs in the benchmark suite should not be interpreted as the limit on job concurrency for the target system.  Performance projections are permissible if they are derived from a similar system of an earlier generation and/or smaller scale.  Projections should be rigorously derived using best practices for application and system performance modeling, and should be thoroughly documented and easily understood.  In the tables below, the "Projected" column refers to the value projected for the full system target where direct measurements are not possible.

Submission Guidelines

The benchmark results (or projections including original results) for the target system should be recorded in the tables provided at the end of this document.  Additionally, the evaluator should submit the completed tables, benchmark codes, and output files, as well as documentation on any code optimizations or configuration changes.  The submitted source should be in a form that can be readily compiled on the target system.  Please do not include object and executable files, core dump files, or large binary data files.  An audit trail should be supplied for any changes made to the benchmark codes.  The audit trail should be sufficient to demonstrate that the changes made conform to the spirit of the benchmark and do not violate the specific restrictions on the various benchmark codes.  The compile and run logs should also be submitted.

If performance projections are used, this should be clearly indicated.  The output files on which the projections are based, and a description of the projection method, should be included.  In addition, each system used for benchmark projections should be described in Table 2 below.  Each projection in the benchmark results tables should indicate on which system the benchmark was originally run.  Enter the corresponding letter from the "System" column of Table 2 into the "System" column of the benchmark result table.

Run Rules

The run rules, which are included in the source distribution, supply specific requirements and instructions for compiling, executing, verifying numerical correctness and reporting results for each benchmark.  The benchmark performance should only be accepted from runs that exhibit correct execution.  Only software tools and libraries that will be included for general use in the target system as supported product offerings are permissible to build and execute the benchmarks.

Message passing programs should be built using an implementation that supports 64-bit virtual memory pointers and a thread-safe communication library that implements the MPI standard.  All tests are to be run in 64-bit floating-point mode unless otherwise allowed.

Benchmark Descriptions

Lower Level Tests

The Lower Level Tests, listed in Table 1, are simple focused tests that are easily compiled and executed.  The results allow a uniform comparison of features and provide an estimation of system balance.  Descriptions and requirements for each test are included in the source distribution.  The results for the target system should be recorded in Table 3 under the column "Target".  In the event that benchmark results are being projected, columns "Benchmarked" and "Target" should be filled out.

Modifications to the Lower Level 2017 Benchmark are only permissible to enable correct execution on the target platform.  No changes related to optimization are permissible except in the case of the NAS FT benchmark where the values for fftblock_default and fftblockpad_default may be changed to suit the target architecture.
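
For illustration, the permitted tuning could be applied with a simple edit before building; this is only a sketch and assumes the defaults are declared in FT/global.h as in the reference NPB source tree (the file name and the new values shown are assumptions, not requirements):

# Hypothetical tuning of the only parameters that may be changed for NAS FT;
# pick values appropriate for the target architecture.
sed -i -e 's/fftblock_default=[0-9]*/fftblock_default=32/' \
       -e 's/fftblockpad_default=[0-9]*/fftblockpad_default=34/' FT/global.h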

Table 1.  Lower Level Tests

Benchmark                              Purpose
NAS Parallel 2.4 Class D, 256 tasks    Parallel Performance/Interconnect
NAS Parallel UPC Class D               PGAS Performance/One-sided Messages
STREAM                                 Memory Bandwidth
PSNAP                                  OS Jitter
OMB MPI microbenchmarks                MPI Latency/Bandwidth

All tests are to be run in fully packed mode unless otherwise described below.  In architectures with multiple cores[1] per node, "fully packed" means that the number of instances, threads or MPI tasks per node should at least equal the total number of physical cores available on the node.

The NPB UPC FT Class D benchmark should execute with 256 UPC threads.  The evaluator may choose the number of physical nodes on which the code will run.

The PSNAP benchmark should execute on all available processors on the benchmark system.  The operating system used for the PSNAP run(s) should be configured as the system would be delivered for regular, production purposes.

Special rules regarding packing apply to the STREAM benchmarks.

Base Case

The base case limits the scope of optimization and the allowable concurrency to prescribed values.  Certain minimal exceptions are allowed for hardware multithreading and if there is insufficient memory per node to execute the application.  The base case also limits the parallel programming model to MPI only.  Each of these points is covered in more detail below.  In the Base Case for all Full Application runs, modifications are permissible only to enable porting and correct execution on the target platform.  No changes related to optimization are permissible.  Library routines may be used as long as they currently exist in an evaluator's supported set of general or scientific libraries, and should be in such a set when the system is delivered.  As well, they should not specialize or limit the applicability of the benchmark nor violate the measurement goals of the particular benchmark.  Source preprocessors, execution profile feedback optimizers, etc. are allowed as long as they are, or will be, available and supported as part of the compilation system for the full-scale systems.  Only publicly available and documented compiler switches should be used.  Compiler optimizations will be allowed only if they do not increase the runtime or artificially increase the delivered FLOP/s rate by performing non-useful work.

If a benchmark will not run on its target number of processors due to memory limitations, the evaluator may use the smallest number of additional processors necessary.  Even when the MPI concurrency is higher than the original target, the evaluator should still solve the same global problem, using the same input files as for the target concurrency.  For codes where the number of processors is included in the input files, the input files may be modified accordingly if a larger number of processors than the target is required; no other changes to the input files are allowed.

For all Base Case runs the benchmarks should be executed in a fully packed manner on the computational nodes.  In architectures with multiple cores per node, the number of MPI tasks times the number of threads per MPI task should equal the total number of physical cores available on the node.
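
As an illustration only (the node size and executable name are placeholders, not part of these rules), a fully packed launch on a hypothetical 32-core node with a Cray-style aprun could look like either of the following:

# 32 MPI tasks per node x 1 thread each = 32 physical cores (fully packed)
aprun -n 1024 -N 32 -d 1 ./app.exe
# 8 MPI tasks per node x 4 OpenMP threads each = 32 physical cores (also fully packed)
export OMP_NUM_THREADS=4
aprun -n 256 -N 8 -d 4 ./app.exe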

It is permissible for applications to run with more than one MPI task per core if the target system has the hardware capability to run multiple tasks, and the capability can be activated with a simple environment setting that would be available to Track-1 users.  To use hardware multithreading, the evaluator should first start with the LL-2017 target concurrency given in the tables and then expand the MPI concurrency to occupy the hardware threads.  For example, with 2-way hardware multithreading, the evaluator should start at the target concurrency and then double the MPI concurrency (e.g., to 1024) in order to engage both hardware threads.  The increase in MPI concurrency should be the minimum needed to exploit the hardware multithreading features.
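
A hedged sketch of how this might look with a Cray-style launcher, where -j selects the number of hardware threads used per core (task counts and node size are placeholders):

# Base case: one task per physical core on a hypothetical 32-core node
aprun -n 512 -N 32 -j 1 ./app.exe
# Minimum increase needed to engage 2-way hardware multithreading: double the MPI concurrency
aprun -n 1024 -N 64 -j 2 ./app.exe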

Optional Optimized Case

An optional optimized case has been added to allow the evaluator to highlight the features and benefits of the target system by submitting benchmarking results obtained through a variety of optimizations. The evaluator may choose to optimize the source code for data layout and alignment or to enable specific hardware or software features that may include (but are not limited to):

·      Using Hybrid OpenMP+MPI for concurrency;

·      Using vendor-specific hardware features to accelerate code;

·      Running the benchmarks at a higher or lower concurrency than the targets;

·      Running at the same concurrency as the targets but in an "unpacked" mode;

·      Any combination of the above.  

When running in an unpacked mode, if the scheduling unit is a node, all the cores in all the nodes assigned to the job should be counted as being used.  The evaluator should determine whether the benchmark performance increases or decreases when running in an unpacked mode before submitting results.
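
As a minimal sketch (again assuming a hypothetical 32-core node and a Cray-style launcher), an unpacked run places fewer tasks per node than there are cores, while the job is still charged for every core on those nodes:

# Unpacked: 16 MPI tasks on each 32-core node; the idle cores still count as used
aprun -n 512 -N 16 ./app.exe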

Wholesale changes to the parallel algorithms are also permitted as long as the full capabilities of the code are maintained; the code can still pass validation tests; and the underlying purpose of the benchmark is not compromised.  As many changes to the code may be made as wanted so long as the following conditions are met:

·      Simulation parameters such as grid size, number of particles, etc., should not be changed. 

·      The optimized code execution should still result in correct numerical results.

·      Any code optimizations should be available to the general Track-1 user community, either through a system library or a well-documented explanation of code improvements.

·      Any library routines used should currently exist in an evaluator's supported set of general or scientific libraries, or should be in such a set when the system is delivered, and should not specialize or limit the applicability of the benchmark nor violate the measurement goals of the particular benchmark.

·      Source preprocessors, execution profile feedback optimizers, etc. are allowed as long as they are, or will be, available and supported as part of the compilation system for the full-scale systems. 

·      Only publicly available and documented compiler switches should be used.

·      Finally, the same code optimizations should be made for all runs of a benchmark.  For example, one set of code optimizations may not be made for the smaller concurrency while a different set of optimizations is made for the larger concurrency. 

Any specific code changes and the runtime configuration used should be clearly documented with a complete audit trail and all supporting documentation.

Result Tables

Table 2. System Description

Enter the system details in this table for each system used in benchmarking.  Use the System label to refer to the system corresponding to each test in the following tables.

 

System    Processor    Clock/MHz    Interconnect    Total Core Count
A
B
C

For each application run, enter the run time variation in the column marked COV.
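
COV here is typically the coefficient of variation (standard deviation divided by mean) of the measured run times.  One illustrative way to compute it from a file with one run time per line (the file name is a placeholder):

# Compute mean, population standard deviation, and their ratio (COV)
awk '{s+=$1; ss+=$1*$1; n++} END {m=s/n; v=ss/n-m*m; if (v<0) v=0; printf "COV = %.4f\n", sqrt(v)/m}' runtimes.txt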

Table 3. Lower Level Test Results

NPB 2.4 Class D, 256 tasks

Benchmark    System    Benchmarked Rate    Projected Rate    Units
BT                                                           MOP/s/process
CG                                                           MOP/s/process
FT                                                           MOP/s/process
LU                                                           MOP/s/process
MG                                                           MOP/s/process
SP                                                           MOP/s/process

NPB UPC Class D, 256 tasks

Benchmark    System    Benchmarked Rate    Projected Rate    Units
FT                                                           MOP/s/process

PSNAP

System    Average Deviation    Number of Processors    Units
                                                       percent

STREAM Triad

Test                System    Benchmarked Rate    Projected Rate    Units
Single proc. 30%                                                    MB/s
Full node                                                           MB/s



[1]
 For the purpose of this evaluation, core = CPU = processor element.

Please submit issues regarding the benchmarks to help+bw@ncsa.illinois.edu.

NAS Parallel Benchmarks (NPB) 3.3.1 Class D, 256 tasks


The NAS Parallel Benchmarks (NPB) are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications in the original "pencil-and-paper" specification (NPB 1). The benchmark suite has been extended to include new benchmarks for unstructured adaptive mesh, parallel I/O, multi-zone applications, and computational grids.  NPB problem sizes are predefined and indicated by different classes. Reference implementations of NPB are available in commonly-used programming models like MPI and OpenMP. NPB LU, CG, MG, FT, SP, and BT are used to obtain performance data for class D using 256 MPI ranks.

NPB 3.3.1 can be obtained from https://www.nas.nasa.gov/publications/npb.html, using its table "Summary of Source Code Releases with Download Links".

The code is built using the Cray compiler with default optimization.  The performance data were collected on Blue Waters using 8 XE nodes with 32 MPI ranks per node.
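
A minimal sketch of how such a run can be reproduced, assuming the standard NPB 3.3.1 MPI build system with a config/make.def pointing at the Cray compiler wrappers (directory and binary names follow the NPB conventions; the aprun options match the 8-node, 32-ranks-per-node layout described above):

cd NPB3.3.1/NPB3.3-MPI
# Build the six benchmarks for Class D at 256 ranks (config/make.def must exist)
for b in lu cg mg ft bt sp; do make $b CLASS=D NPROCS=256; done
# Run one of them fully packed: 256 ranks, 32 per node, on 8 XE nodes
aprun -n 256 -N 32 ./bin/ft.D.256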

 
                 LU          CG         MG          FT         BT          SP
Mop/s total      252722.99   27883.32   121352.45   93977.67   242401.03   96020.41
Time in sec      157.87      130.65     25.66       95.38      240.66      307.60

 

NAS Parallel UPC FFT Class DD16

This is the UPC implementation of the NAS Parallel Benchmark FT, in which the transpose communication is implemented using both blocking functions (upc_memget) and nonblocking functions (upc_memput_nb).  The default is nonblocking functions defined in UPC description 3.1.

The class DD16 (16 times larger than NAS FT Class D) version of the UPC FT benchmark is obtained from NERSC at http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/npb-upc-ft/.

The code is built using the Cray compiler with UPC support and the default optimization.  The performance data were collected on Blue Waters using 256 XE nodes with 32 threads per node.
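
A rough sketch of the build and launch (hedged: the Makefile in the NERSC tarball, the -h upc flag for Cray CCE's UPC support, and the executable name are assumptions; consult the README shipped with the benchmark):

# Build with the Cray compiler's UPC support under PrgEnv-cray (flag assumed)
make CC="cc -h upc"
# 256 XE nodes x 32 UPC threads per node = 8192 threads; executable name is a placeholder
aprun -n 8192 -N 32 ./ft.DD16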

CLASS DD16      Mflop    MFlop/s        MFlop/s/Thread
FT (64x128)     128      8.74299e-06    1.06726e-09

PSNAP

Description

PSNAP is a microbenchmark used to measure the effects of CPU interrupts (OS noise) on applications in a computer system.  It measures the amount of real clock time that a fixed-length loop (by default 1 ms) takes to execute.  This page briefly outlines how the PSNAP microbenchmark was used as an acceptance test for the Blue Waters Petascale Computing System in 2012.

Download

PSNAP was originally developed by LANL and has been used for various DOE lab procurements.  The code is available from NERSC at http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/psnap/.  That page also has some explanation of the code and how it is used.

Instructions

All nodes and all cores on a node are to be used for the test.

The example below models an application that performs a blocking collective operation every 200 microseconds on 25,000 ranks; the PSNAP run itself uses 1,000 XE nodes.

% aprun -n 16000 -N 16 -d 2 ./psnap -g 200 > my_psnap_data_00.dat

(The number of nodes here doesn't matter; more nodes give better statistics for the estimates.)

% ./histogram_psnap.pl < my_psnap_data_00.dat > my_psnap_histo_00.hist

% ./compute_weighted_time_from_histo.pl 25000 < my_psnap_histo_00.hist

This script outputs two numbers:

all bins will be above NNN

weighted time is MMM.MMM.

The first, NNN, is the time at which integration reached probability 1, that is, the collective time will NEVER be below this.  MMM.MMM is the weighted average; the collective time will mostly fall at this value. 

As an example, if you ran PSNAP with a 200 microsecond granularity and the reported "weighted time" is 220 microseconds, then an application run with the same rank-to-CPU configuration and the rank count given to the analysis script would see a slowdown of about 10%.
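
The slowdown estimate is simply (weighted time - granularity) / granularity.  A trivial check of the example above:

# 200 us loop granularity and a 220 us weighted time give a ~10% projected slowdown
echo "scale=1; (220 - 200) / 200 * 100" | bc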

STREAM

STREAM is a synthetic benchmark designed to measure sustained memory bandwidth (in MB/s) and the corresponding computation rate for four simple, stride-one vector kernels: Copy, Scale, Add, and Triad.

Download

The STREAM benchmark can be obtained from the STREAM home page or from other sources.

Instructions

STREAM provides instructions on how to build and size the benchmark appropriately.
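
A minimal build-and-run sketch that matches the sizing shown in the sample output below, assuming STREAM 5.10's stream.c and an OpenMP-capable compiler (the compiler command and flags are generic examples, not a requirement):

# Array sized to match the sample output (80,000,000 elements per array)
cc -O3 -h omp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream   # -h omp is the CCE flag; adjust for other compilers
export OMP_NUM_THREADS=16
aprun -n 1 -d 16 ./stream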

 

Sample output

Single XE node of Blue Waters

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 16
Number of Threads counted = 16
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 20311 microseconds.
 (= 20311 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 71878.0 0.017854 0.017808 0.017896
Scale: 67984.2 0.018846 0.018828 0.018865
Add: 62424.9 0.030839 0.030757 0.030931
Triad: 62053.6 0.031027 0.030941 0.031116
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Ohio Micro Benchmark (OMB)

The Ohio MicroBenchmark suite is a collection of independent MPI message passing performance microbenchmarks developed and written at The Ohio State University.  It includes traditional benchmarks and performance measures such as latency, bandwidth and host overhead and can be used for both traditional and GPU-enhanced nodes.

Instructions

Download

https://www.nersc.gov/assets/Trinity--NERSC-8-RFP/Benchmarks/July12/osu-micro-benchmarks-3.8-July12.tar

Build

tar xf osu-micro-benchmarks-3.8-July12.tar
cd osu-micro-benchmarks-3.8-July12
# 2017-08-15 include patch data
# patch -p1 <../msgsize.patch
patch -p1 <<"EOF"
diff -ur osu-micro-benchmarks-3.8-July12.orig/mpi/pt2pt/osu_latency.c osu-micro-benchmarks-3.8-July12/mpi/pt2pt/osu_latency.c
--- osu-micro-benchmarks-3.8-July12.orig/mpi/pt2pt/osu_latency.c        2013-06-28 13:08:27.000000000 -0500
+++ osu-micro-benchmarks-3.8-July12/mpi/pt2pt/osu_latency.c     2017-02-23 15:05:08.000000000 -0600
@@ -19,7 +19,7 @@
 #include <stdio.h>
 #define MESSAGE_ALIGNMENT 64
-#define MAX_MSG_SIZE (1<<10)
+#define MAX_MSG_SIZE (1<<16)
 #define MYBUFSIZE (MAX_MSG_SIZE + MESSAGE_ALIGNMENT)
 #define SKIP_LARGE  10
 #define LOOP_LARGE  100
EOF
module unload darshan
# disable upc since the Cray compiler does not support it
ac_cv_func_upc_memput=no CC=cc ./configure --prefix=$PWD/../omb-CPU
make -j6 install

Run

# Quote the heredoc delimiter so the $PBS_* variables and $(...) substitutions
# below are expanded when the job runs, not while the file is being written.
cat >omb.job <<"EOF"
#!/bin/bash
#PBS -l nodes=4:ppn=32:xe
#PBS -l walltime=0:15:00
set -e
set -x
NCORE=32
NODES=$(uniq $PBS_NODEFILE | wc --lines)
TWONODES=$(uniq $PBS_NODEFILE | awk 'NR==1||NR==2' | paste -d, -s)
EXEDIR=omb-CPU/libexec/osu-micro-benchmarks
cd $PBS_O_WORKDIR
# 1
aprun -L $TWONODES -n2 -N1  $EXEDIR/mpi/pt2pt/osu_latency
# 2
aprun -L $TWONODES -n2 -N1  $EXEDIR/mpi/pt2pt/osu_multi_lat
aprun -L $TWONODES -n4 -N2  $EXEDIR/mpi/pt2pt/osu_multi_lat
aprun -L $TWONODES -n$NCORE -N$(($NCORE/2))  $EXEDIR/mpi/pt2pt/osu_multi_lat
aprun -L $TWONODES -n$((2*$NCORE)) -N$NCORE  $EXEDIR/mpi/pt2pt/osu_multi_lat
# 3
aprun -L $TWONODES -n2 -N1  $EXEDIR/mpi/pt2pt/osu_bw
# 4
aprun -L $TWONODES -n2 -N1  $EXEDIR/mpi/pt2pt/osu_bibw
# 5 skipped
# 6
aprun -n$(($NODES*$NCORE)) -N$NCORE  $EXEDIR/mpi/collective/osu_allreduce -f
EOF
qsub -l nodes=2000:ppn=32:xe omb.job

 

Sample results

# OSU MPI Latency Test v3.8
# Size          Latency (us)
0                       1.67
1                       1.67
2                       1.67
4                       1.68
8                       1.69
16                      1.71
32                      1.78
64                      1.77
128                     1.81
256                     1.86
512                     1.96
1024                    2.09
2048                    2.41
4096                    3.26
8192                    7.55
16384                   9.04
32768                  12.24
65536                  18.15


# OSU MPI Multi Latency Test v3.8
# Size          Latency (us)
# [ pairs: 1 ]
0                       1.57
1                       1.65
2                       1.64
4                       1.67
8                       1.67
16                      1.73
32                      1.73
64                      1.75
128                     1.78


# OSU MPI Multi Latency Test v3.8
# Size          Latency (us)
# [ pairs: 2 ]
0                       1.93
1                       1.96
2                       1.98
4                       2.00
8                       2.02
16                      2.03
32                      2.08
64                      2.07
128                     2.09


# OSU MPI Multi Latency Test v3.8
# Size          Latency (us)
# [ pairs: 16 ]
0                       2.30
1                       2.33
2                       2.35
4                       2.38
8                       2.39
16                      2.41
32                      2.47
64                      2.46
128                     2.49


# OSU MPI Multi Latency Test v3.8
# Size          Latency (us)
# [ pairs: 32 ]
0                       2.59
1                       2.60
2                       2.58
4                       2.62
8                       2.63
16                      2.66
32                      2.85
64                      2.91
128                     3.17

# OSU MPI Bandwidth Test v3.8
# Size      Bandwidth (MB/s)
1                       1.21
2                       2.42
4                       4.84
8                       9.73
16                     19.39
32                     38.76
64                     78.23
128                   154.42
256                   301.35
512                   569.60
1024                 1042.07
2048                 1777.48
4096                 2705.56
8192                 2299.13
16384                4410.40
32768                4588.47
65536                4863.01
131072               5294.19
262144               5300.69
524288               5472.33
1048576              6175.13
2097152              6615.82
4194304              6818.93

# OSU MPI Bi-Directional Bandwidth Test v3.8
# Size    Bi-Bandwidth (MB/s)
1                       1.50
2                       3.00
4                       6.00
8                      11.97
16                     23.96
32                     47.95
64                     96.51
128                   189.70
256                   365.51
512                   682.16
1024                 1218.62
2048                 1985.04
4096                 2838.91
8192                 2946.67
16384                5555.37
32768                6897.41
65536                7261.32
131072               7483.45
262144               7546.08
524288               8849.55
1048576              9412.38
2097152              9988.43
4194304             10438.49

# OSU Allreduce Latency Test v3.8
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                     282.07            234.66            326.39        1000
8                     212.76            193.34            228.81        1000
16                    192.80            174.76            200.94        1000
32                    240.75            228.26            250.22        1000
64                    231.87            214.21            241.37        1000
128                   166.14            144.30            175.35        1000
256                   189.50            166.50            198.97        1000
512                   252.27            228.19            263.96        1000
1024                  272.64            245.14            286.02        1000
2048                  253.12            216.73            267.76        1000
4096                  528.84            499.83            548.47        1000
8192                  736.69            709.78            755.63        1000
16384                 614.93            581.46            635.31        1000
32768                 678.12            639.36            697.73        1000
65536                1116.67           1049.75           1222.80         100
131072               1705.46           1612.91           1831.69         100
262144               2688.42           2530.99           2870.38         100
524288               5698.59           5482.73           5870.42         100
1048576             11659.68          11462.81          11784.65         100

Changelog

Date          Nature of change
2017-08-15    Provide data for patch for message size.