Low Level Benchmark Instructions and
Result Tables
Introduction
Benchmarks play a critical role in the performance evaluation of systems. The Low Level 2017 (LL-2017) benchmarks serve three purposes:
- The benchmarks are carefully chosen to represent characteristics of the current Blue Waters workload and possible future workloads, which consist of solving complex scientific problems using diverse computational techniques at high degrees of parallelism.
- The benchmarks give the opportunity to provide the concrete data associated with the performance and scalability of the systems.
- The benchmarks can be used as part of system acceptance and/or regression testing and as a measurement of performance throughout the operational lifetime of the system.
Observed benchmark performance should be obtained from a system under consideration or a system configured as closely as possible to the target system. For systems targeted at supporting highly parallel computation, it is critical that the evaluators provide observed benchmark performance using the largest (most parallel) test inputs as well as other sizes. The largest scale jobs in the benchmark suite should not be interpreted as the limit for the job concurrency for the target system. Performance projections are permissible if they are derived from a similar system that is considered an earlier generation and/or smaller system. Projections should be rigorously derived, using best practices for application and system performance modeling and be thoroughly documented and easily understood. In the tables below, the "Projected" column refers to the value projected for the full system target where direct measurements are not possible.
Submission Guidelines
The benchmark results (or projections including original results) for the target system should be recorded in the tables provided at the end of this document. Additionally, the evaluator should submit the completed tables, benchmark codes, and output files, as well as documentation on any code optimizations or configuration changes. The submitted source should be in a form that can be readily compiled on the target system. Please do not include object and executable files, core dump files, or large binary data files. An audit trail should be supplied for any changes made to the benchmark codes. The audit trail should be sufficient to demonstrate that the changes made conform to the spirit of the benchmark and do not violate the specific restrictions on the various benchmark codes. The compile and run logs should also be submitted.
If performance projections are used, this should be clearly indicated. The output files on which the projections are based and a description of the projection method should be included. In addition, each system used for benchmark projections should be described in Table 2 below. Each projection in the benchmark results tables should indicate on which system the benchmark was originally run. Enter the corresponding letter from the "System" column of Table 2 into the "System" column of the benchmark result table.
Run Rules
The run rules, which are included in the source distribution, supply specific requirements and instructions for compiling, executing, verifying numerical correctness and reporting results for each benchmark. The benchmark performance should only be accepted from runs that exhibit correct execution. Only software tools and libraries that will be included for general use in the target system as supported product offerings are permissible to build and execute the benchmarks.
Message passing programs should be built using an implementation that supports 64-bit virtual memory pointers and a thread-safe communication library that implements the MPI standard. All tests are to be in 64 bit floating point mode unless otherwise allowed.
Benchmark Descriptions
Lower Level Tests
The Lower Level Tests, listed in Table 1, are simple focused tests that are easily compiled and executed. The results allow a uniform comparison of features and provide an estimation of system balance. Descriptions and requirements for each test are included in the source distribution. The results for the target system should be recorded in Table 3 under the column "Target". In the event that benchmark results are being projected, columns "Benchmarked" and "Target" should be filled out.
Modifications to the Lower Level 2017 Benchmark are only permissible to enable correct execution on the target platform. No changes related to optimization are permissible except in the case of the NAS FT benchmark where the values for fftblock_default and fftblockpad_default may be changed to suit the target architecture.
Table 1. Lower Level Tests
Benchmark | Purpose
---|---
NPB | Parallel Performance/Interconnect
NPB UPC FT | PGAS Performance/One-sided Messages
STREAM | Memory Bandwidth
PSNAP | OS Jitter
OMB | MPI microbenchmarks
All tests are to be run in fully packed mode unless otherwise described below. In architectures with multiple cores[1] per node, "fully packed" means that the number of instances, threads or MPI tasks per node should at least equal the total number of physical cores available on the node.
The NPB UPC FT Class D benchmark should execute with 256 UPC threads. The evaluator may choose how many physical nodes the code will run on.
The PSNAP benchmark should execute on all available processors on the benchmark system. The operating system used for the PSNAP run(s) should be configured as the system would be delivered for regular, production purposes.
Special rules regarding packing apply to the STREAM benchmarks.
Base Case
The base case limits the scope of optimization and the allowable concurrency to prescribed values. Certain minimal exceptions are allowed for hardware multithreading and if there is insufficient memory per node to execute the application. The base case also limits the parallel programming model to MPI only. Each of these points is covered in more detail below. In the Base Case for all Full Application runs, modifications are permissible only to enable porting and correct execution on the target platform. No changes related to optimization are permissible. Library routines may be used as long as they currently exist in an evaluator's supported set of general or scientific libraries, and should be in such a set when the system is delivered. As well, they should not specialize or limit the applicability of the benchmark nor violate the measurement goals of the particular benchmark. Source preprocessors, execution profile feedback optimizers, etc. are allowed as long as they are, or will be, available and supported as part of the compilation system for the full-scale systems. Only publicly available and documented compiler switches should be used. Compiler optimizations will be allowed only if they do not increase the runtime or artificially increase the delivered FLOP/s rate by performing non-useful work.
If a benchmark will not run at its target number of processors due to memory limitations, the evaluator may use the smallest number of additional processors necessary. The evaluator should still solve the same global problem, using the same input files as for the target concurrency, even when the MPI concurrency is higher than the original target. For codes where the number of processors to be used is included in the input files, the input files may be modified accordingly if a larger number of processors than the target is required. No other changes to the input files are allowed.
For all Base Case runs the benchmarks should be executed in a fully-packed manner on the computational nodes. In architectures with multiple cores per node, the number of MPI tasks times the number of threads per MPI tasks should equal the total number of physical cores available on the node.
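As a sketch of the packing rule, assuming a hypothetical 32-core node (the aprun flags are Cray-specific; the node size and executable name are illustrative assumptions): both configurations below are "fully packed" because MPI tasks times threads per task equals the physical core count.

```shell
# Hypothetical 32-core node; both launch configurations are "fully packed":
#   pure MPI:    aprun -n 32 -N 32 -d 1 ./app
#   MPI+OpenMP:  aprun -n 16 -N 16 -d 2 ./app    (with OMP_NUM_THREADS=2)
TASKS=16
THREADS=2
PACKED=$(( TASKS * THREADS ))
echo "$PACKED cores covered"   # 32 cores covered
```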
It is permissible for applications to run with more than one MPI task per core if the target system has the hardware capability to run multiple tasks and that capability can be activated with a simple environment setting available to Track-1 users. To use hardware multithreading, the evaluator should first start with the LL-2017 target concurrency given in the tables and then expand the MPI concurrency to occupy the hardware threads. For example, for a target concurrency of 512 on a system with 2-way hardware multithreading, the evaluator should first run at 512 tasks and then expand to 1024 tasks to engage the 2-way hardware threading. The increase in MPI concurrency should be the minimum needed to exploit the hardware multithreading features.
Optional Optimized Case
An optional optimized case has been added to allow the evaluator to highlight the features and benefits of the target system by submitting benchmarking results obtained through a variety of optimizations. The evaluator may choose to optimize the source code for data layout and alignment or to enable specific hardware or software features that may include (but are not limited to):
· Using Hybrid OpenMP+MPI for concurrency;
· Using vendor-specific hardware features to accelerate code;
· Running the benchmarks at a higher or lower concurrency than the targets;
· Running at the same concurrency as the targets but in an "unpacked" mode;
· Any combination of the above.
When running in unpacked mode, all scheduled resources count as used; for example, if the scheduling unit is a node, all the cores in all the nodes assigned to the job should be counted as being used. The evaluator should determine whether the benchmark performance increases or decreases when running in unpacked mode before submitting results.
Wholesale changes to the parallel algorithms are also permitted as long as the full capabilities of the code are maintained, the code can still pass validation tests, and the underlying purpose of the benchmark is not compromised. Any number of changes to the code may be made, so long as the following conditions are met:
· All simulation parameters such as grid size, number of particles, etc., should not be changed.
· The optimized code execution should still result in correct numerical results.
· Any code optimizations should be available to the general Track-1 user community, either through a system library or a well-documented explanation of code improvements.
· Any library routines used should currently exist in an evaluator's supported set of general or scientific libraries, or should be in such a set when the system is delivered, and should not specialize or limit the applicability of the benchmark nor violate the measurement goals of the particular benchmark.
· Source preprocessors, execution profile feedback optimizers, etc. are allowed as long as they are, or will be, available and supported as part of the compilation system for the full-scale systems.
· Only publicly available and documented compiler switches should be used.
· Finally, the same code optimizations should be made for all runs of a benchmark; for example, one set of code optimizations may not be used at the smaller concurrency while a different set is used at the larger concurrency.
Any specific code changes and the runtime configuration used should be clearly documented with a complete audit trail and all supporting documentation.
Result Tables
Table 2. System Description
Enter the system details in this table for each system used in benchmarking. Use the System label to refer to the corresponding system in the following tables.
System | Processor | Clock/MHz | Interconnect | Total Core Count
---|---|---|---|---
A | | | |
B | | | |
C | | | |
For each application run, enter the run time variation in the column marked COV.
Table 3. Lower Level Test Results

NPB 2.4 Class D, 256 tasks

 | System | Benchmarked Rate | Projected Rate | Units
---|---|---|---|---
BT | | | | MOP/s/process
CG | | | | MOP/s/process
FT | | | | MOP/s/process
LU | | | | MOP/s/process
MG | | | | MOP/s/process
SP | | | | MOP/s/process

NPB UPC Class D, 256 tasks

 | System | Benchmarked Rate | Projected Rate | Units
---|---|---|---|---
FT | | | | MOP/s/process

PSNAP

System | Average Deviation | Number of Processors | Units
---|---|---|---
 | | | percent

STREAM Triad

 | System | Benchmarked Rate | Projected Rate | Units
---|---|---|---|---
Single proc. 30% | | | | MB/s
Full node | | | | MB/s
[1] For the purpose of this evaluation, core = CPU = processor element.
NAS Parallel Benchmarks (NPB) 3.3.1 Class D, 256 tasks
The NAS Parallel Benchmarks (NPB) are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications in the original "pencil-and-paper" specification (NPB 1). The benchmark suite has been extended to include new benchmarks for unstructured adaptive mesh, parallel I/O, multi-zone applications, and computational grids. NPB problem sizes are predefined and indicated by different classes. Reference implementations of NPB are available in commonly-used programming models like MPI and OpenMP. NPB LU, CG, MG, FT, SP, and BT are used to obtain performance data for class D using 256 MPI ranks.
NPB 3.3.1 can be obtained from https://www.nas.nasa.gov/publications/npb.html, using its table "Summary of Source Code Releases with Download Links".
The code is built using the Cray compiler with default optimization. The performance data were collected on Blue Waters using 8 XE nodes with 32 MPI ranks per node.
 | LU | CG | MG | FT | BT | SP
---|---|---|---|---|---|---
Mop/s total | 252722.99 | 27883.32 | 121352.45 | 93977.67 | 242401.03 | 96020.41
Time in sec | 157.87 | 130.65 | 25.66 | 95.38 | 240.66 | 307.60
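A minimal sketch of the build and launch described above, assuming the NPB3.3-MPI source tree and the Cray compiler wrappers; the make and aprun lines are shown as comments because they require the NPB tree and a Cray system:

```shell
# Build each benchmark for Class D with 256 ranks (requires the NPB tree):
#   cd NPB3.3.1/NPB3.3-MPI
#   cp config/make.def.template config/make.def   # set MPICC=cc, MPIF77=ftn
#   for b in bt cg ft lu mg sp; do make $b CLASS=D NPROCS=256; done
# Fully packed launch: 256 ranks at 32 ranks per node needs 8 XE nodes.
RANKS=256
PER_NODE=32
NODES=$(( RANKS / PER_NODE ))
echo "aprun -n $RANKS -N $PER_NODE ./bin/bt.D.$RANKS   # on $NODES nodes"
```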
NAS Parallel UPC FFT Class DD16
This is the UPC implementation of the NAS Parallel Benchmark FT, in which the transpose communication is implemented using both blocking functions (upc_memget) and nonblocking functions (upc_memput_nb); the nonblocking functions, defined in UPC description 3.1, are the default.
The Class DD16 (16 times bigger than NAS FT Class D) version of the UPC FT benchmark is obtained from NERSC at http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/npb-upc-ft/.
The code is built using the Cray compiler with UPC enabled and default optimization. The performance data were collected on Blue Waters using 256 XE nodes with 32 threads per node.
CLASS DD16 | Mflop | MFlop/s | MFlop/s/Thread
---|---|---|---
FT (64x128) | 128 | 8.74299e-06 | 1.06726e-09
PSNAP
Description
PSNAP is a microbenchmark used to measure the effects of CPU interrupts on applications in a computer system. It measures the amount of real (wall-clock) time that a fixed-length loop (by default 1 ms) takes to execute. This section briefly outlines how the PSNAP microbenchmark was used as an acceptance test for the Blue Waters petascale computing system in 2012.
Download
PSNAP was originally developed by LANL and has been used for various DOE lab procurements. The code is available from NERSC at http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/psnap/. That page also has some explanation of the code and how it is used.
Instructions
All nodes and all cores on a node are to be used for the test.
The example below models an application that performs a blocking collective operation every 200 microseconds on 25,000 ranks. The PSNAP run itself uses 1,000 XE nodes.
% aprun -n 16000 -N 16 -d 2 ./psnap -g 200 > my_psnap_data_00.dat
(The number of nodes here doesn't matter; more nodes give better statistics for the estimates.)
% ./histogram_psnap.pl < my_psnap_data_00.dat > my_psnap_histo_00.hist
% ./compute_weighted_time_from_histo.pl 25000 < my_psnap_histo_00.hist
This script outputs two numbers:
all bins will be above NNN
weighted time is MMM.MMM.
The first, NNN, is the time at which the integration reaches probability 1; that is, the collective time will never be below this value. MMM.MMM is the weighted average; the collective time will typically fall near this value.
For example, if PSNAP is run with a 200-microsecond granularity and the reported "weighted time" is 220 microseconds, then an application run under the same rank-to-CPU configuration, at the rank count given to the analysis script, will see a slowdown of about 10%.
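The slowdown arithmetic above can be sketched as follows, using the hypothetical 200/220-microsecond numbers from the text:

```shell
# PSNAP slowdown estimate: (weighted time - granularity) / granularity.
GRAN=200        # loop granularity in microseconds
WEIGHTED=220    # "weighted time" reported by the analysis script
SLOWDOWN=$(awk -v g="$GRAN" -v w="$WEIGHTED" 'BEGIN { printf "%.1f", (w - g) * 100 / g }')
echo "estimated slowdown: ${SLOWDOWN}%"   # estimated slowdown: 10.0%
```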
STREAM
STREAM is a synthetic benchmark designed to measure sustained memory bandwidth (in MB/s) and, correspondingly, a computation rate for four uniform, stride-one vector kernels: copy, scale, add, and triad.
Download
The STREAM benchmark can be obtained from the STREAM home page or from other sources.
Instructions
STREAM provides instructions on how to build and size the benchmark appropriately.
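As a sketch of STREAM's sizing guideline (each array should be at least four times the total last-level cache), assuming a hypothetical node with two sockets of 16 MiB L3 each; the cache sizes and compiler flags are assumptions, not measurements:

```shell
# Hypothetical node: 2 sockets x 16 MiB L3 cache, 8-byte double elements.
LLC_BYTES=$(( 2 * 16 * 1024 * 1024 ))
MIN_ELEMS=$(( 4 * LLC_BYTES / 8 ))
echo "STREAM_ARRAY_SIZE should be >= $MIN_ELEMS elements"
# Example build and fully packed single-node run (flags are assumptions):
#   cc -O2 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
#   OMP_NUM_THREADS=16 aprun -n 1 -d 16 ./stream
```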
Sample output
Single XE node of Blue Waters
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 16
Number of Threads counted = 16
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 20311 microseconds.
(= 20311 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 71878.0 0.017854 0.017808 0.017896
Scale: 67984.2 0.018846 0.018828 0.018865
Add: 62424.9 0.030839 0.030757 0.030931
Triad: 62053.6 0.031027 0.030941 0.031116
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Ohio Micro Benchmark (OMB)
The Ohio MicroBenchmark suite is a collection of independent MPI message passing performance microbenchmarks developed and written at The Ohio State University. It includes traditional benchmarks and performance measures such as latency, bandwidth and host overhead and can be used for both traditional and GPU-enhanced nodes.
Instructions
Download
Build
Run
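The Download/Build/Run steps above might look like the following sketch; the tarball path, configure flags, and launch lines are assumptions for a Cray system of this document's era and are shown as comments since they need network access and the Cray toolchain:

```shell
# Hypothetical download/build/run sequence for OMB v3.8:
#   wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-3.8.tar.gz
#   tar xzf osu-micro-benchmarks-3.8.tar.gz && cd osu-micro-benchmarks-3.8
#   ./configure CC=cc && make
#   aprun -n 2 -N 1 ./osu_latency > osu_latency.out   # inter-node latency
#   aprun -n 2 -N 1 ./osu_bw      > osu_bw.out        # uni-directional bandwidth
# The largest message in the bandwidth results below is 4194304 bytes:
LARGEST_MIB=$(( 4194304 / 1024 / 1024 ))
echo "largest message: ${LARGEST_MIB} MiB"   # largest message: 4 MiB
```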
Sample results
# OSU MPI Latency Test v3.8
# Size Latency (us)
0 1.67
1 1.67
2 1.67
4 1.68
8 1.69
16 1.71
32 1.78
64 1.77
128 1.81
256 1.86
512 1.96
1024 2.09
2048 2.41
4096 3.26
8192 7.55
16384 9.04
32768 12.24
65536 18.15
# OSU MPI Multi Latency Test v3.8
# Size Latency (us)
# [ pairs: 1 ]
0 1.57
1 1.65
2 1.64
4 1.67
8 1.67
16 1.73
32 1.73
64 1.75
128 1.78
# OSU MPI Multi Latency Test v3.8
# Size Latency (us)
# [ pairs: 2 ]
0 1.93
1 1.96
2 1.98
4 2.00
8 2.02
16 2.03
32 2.08
64 2.07
128 2.09
# OSU MPI Multi Latency Test v3.8
# Size Latency (us)
# [ pairs: 16 ]
0 2.30
1 2.33
2 2.35
4 2.38
8 2.39
16 2.41
32 2.47
64 2.46
128 2.49
# OSU MPI Multi Latency Test v3.8
# Size Latency (us)
# [ pairs: 32 ]
0 2.59
1 2.60
2 2.58
4 2.62
8 2.63
16 2.66
32 2.85
64 2.91
128 3.17
# OSU MPI Bandwidth Test v3.8
# Size Bandwidth (MB/s)
1 1.21
2 2.42
4 4.84
8 9.73
16 19.39
32 38.76
64 78.23
128 154.42
256 301.35
512 569.60
1024 1042.07
2048 1777.48
4096 2705.56
8192 2299.13
16384 4410.40
32768 4588.47
65536 4863.01
131072 5294.19
262144 5300.69
524288 5472.33
1048576 6175.13
2097152 6615.82
4194304 6818.93
# OSU MPI Bi-Directional Bandwidth Test v3.8
# Size Bi-Bandwidth (MB/s)
1 1.50
2 3.00
4 6.00
8 11.97
16 23.96
32 47.95
64 96.51
128 189.70
256 365.51
512 682.16
1024 1218.62
2048 1985.04
4096 2838.91
8192 2946.67
16384 5555.37
32768 6897.41
65536 7261.32
131072 7483.45
262144 7546.08
524288 8849.55
1048576 9412.38
2097152 9988.43
4194304 10438.49
# OSU Allreduce Latency Test v3.8
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
4 282.07 234.66 326.39 1000
8 212.76 193.34 228.81 1000
16 192.80 174.76 200.94 1000
32 240.75 228.26 250.22 1000
64 231.87 214.21 241.37 1000
128 166.14 144.30 175.35 1000
256 189.50 166.50 198.97 1000
512 252.27 228.19 263.96 1000
1024 272.64 245.14 286.02 1000
2048 253.12 216.73 267.76 1000
4096 528.84 499.83 548.47 1000
8192 736.69 709.78 755.63 1000
16384 614.93 581.46 635.31 1000
32768 678.12 639.36 697.73 1000
65536 1116.67 1049.75 1222.80 100
131072 1705.46 1612.91 1831.69 100
262144 2688.42 2530.99 2870.38 100
524288 5698.59 5482.73 5870.42 100
1048576 11659.68 11462.81 11784.65 100
Changelog
Date | Nature of change
---|---
2017-08-15 | Provide data for patch for message size.