
SPP-2017 Instructions and Reporting

Introduction

Benchmarks play a critical role in the performance evaluation of large-scale systems.  The SPP-2017 benchmarks have the potential to serve three purposes:

  1. The benchmarks have been carefully chosen to represent characteristics of current and possible future workloads, which consist of solving complex scientific problems using diverse computational techniques at high degrees of parallelism.
  2. The benchmarks provide an opportunity to supply concrete data on the application performance and scalability of systems.
  3. The benchmarks can be used as a part of system acceptance and/or regression testing and as a measurement of performance throughout the operational lifetime of the system.

Results from the SPP-2017 Full Application Benchmarks are used to derive the Sustained Petascale Performance (SPP) metric, which is used to assess the potential of a system to execute the scientific application workload.  Evaluators should pay particular attention to the SPP calculation, as it is one of the key indicators for system evaluation.

Observed benchmark performance should be obtained from the system under consideration, or from a system configured as closely as possible to the target system.  For systems targeted at supporting highly parallel computation, it is critical that the evaluators provide observed application performance using the largest test inputs as well as the medium inputs for the Full Application Benchmarks.  The largest-scale jobs in the benchmark suite should not be interpreted as the limit on job concurrency for the target system.  The target system will support jobs that range in scale from the supplied SPP benchmark scale all the way up to jobs that span the entire system.  Performance projections are permissible if they are derived from a similar, earlier-generation system.  Projections shall be rigorously derived using best practices for application and system performance modeling, and shall be thoroughly documented and easily understood.  In the tables below, the “Proposed” column refers to the value projected for the full proposed system where direct measurements are not possible.

Submission Guidelines

The benchmark results (or projections including original results) for the proposed system shall be recorded in the tables provided at the end of this document.  Additionally, the evaluators should submit benchmark codes and output files, as well as documentation on any code optimizations or configuration changes.  The submitted source shall be in a form that can be readily compiled on the target system.  Please do not include object and executable files, core dump files or large binary data files.  An audit trail must be supplied for any changes made to the benchmark codes.  The audit trail must be sufficient to demonstrate that the changes made conform to the spirit of the benchmark and do not violate the specific restrictions on the various benchmark codes.

If performance projections are used, this should be clearly indicated.  The output files on which the projections are based and a description of the projection method shall be included.  In addition, each system used for benchmark projections must be described in Table 2 below.  Each projection in the benchmark results tables must indicate the system on which the benchmark was originally run.  Enter the corresponding letter from the “System” column of Table 2 into the “System” column of the benchmark results table.

Consistency

Performance results should show consistent and reproducible execution times in multi-user production mode for the targeted system.  The evaluator should document the amount of run-time variation expected for the system in production mode by including the expected coefficient of variation in the elapsed time in the tables.  The coefficient of variation is defined as the standard deviation of the run times divided by the mean of the same run times for a minimum of five consecutive runs.  In addition, the evaluator should document in Table 3 the coefficient of variation for the benchmarks comprising the SPP when running in production mode.  Production mode refers to the multi-user, general-use environment.
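
A minimal Python sketch of this calculation is shown below; the run times are hypothetical, and the use of the sample standard deviation is an assumption.

    import statistics

    # Hypothetical elapsed times (seconds) from five consecutive production-mode runs.
    run_times = [1012.4, 1025.9, 1008.7, 1031.2, 1019.5]

    mean_time = statistics.mean(run_times)
    std_dev = statistics.stdev(run_times)  # sample standard deviation (assumed convention)
    cov = std_dev / mean_time              # coefficient of variation

    print(f"mean = {mean_time:.1f} s, std dev = {std_dev:.1f} s, COV = {cov:.4f}")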

Run Rules

The run rules, which are included in the source distribution, supply specific requirements and instructions for compiling, executing, verifying numerical correctness, and reporting results for each benchmark.  Benchmark performance shall only be accepted from runs that exhibit correct execution.  Only software tools and libraries that will be included for general use on the target system as supported product offerings may be used to build and execute the benchmarks.

Message-passing programs must be built using an implementation that supports 64-bit virtual memory pointers and a thread-safe communication library that implements the MPI standard.  All tests are to be run in 64-bit floating-point mode unless otherwise allowed.

Benchmark Descriptions

Full Application Benchmarks

The Full Application Benchmarks are a representation of a leadership-class system workload and span a variety of algorithmic and computational characteristics.  The list of application benchmarks is shown in Table 1.  Documentation for each application is included with the source distribution.  For most applications, two problem sizes are provided: a medium size and a large size, as shown in Table 3.  Medium-size problems are intended for testing, porting, and possible initial projections for the targeted systems.  The purpose of the large size is to make the SPP metric more representative of the actual leadership-class system workload.

Table 1: Full Application Benchmarks

Application   Discipline
AWP-ODC       Seismic
CACTUS        Relativistic Astrophysics
MILC          Particle Physics Lattice QCD
NAMD          Molecular Dynamics
NWChem        Chemistry
PPM           Astrophysics
PSDNS         Fluid Dynamics
QMCPACK       Quantum Chemistry
RMG           Electronic Structures
VPIC          Plasma Physics
WRF           Weather

Two cases for running the eleven SPP-2017 Full Application benchmarks are described below, a ‘Base Case’, which can be the basis for comparison amongst alternative systems, and an optional ‘Optimized Performance’ case that allows the evaluator broader latitude to optimize code to demonstrate the best-case performance potential of the system.

Base Case

The base case limits the scope of optimization and the allowable concurrency to prescribed values.  Certain minimal exceptions are allowed for hardware multithreading and for cases where there is insufficient memory per node to execute the application.  Each of these points is covered in more detail below.  In the Base Case, for all Full Application runs, modifications are permissible only to enable porting and correct execution on the target platform.  No changes related to optimization are permissible.  Library routines may be used as long as they currently exist in a supported set of general or scientific libraries.  In addition, they must not specialize or limit the applicability of the benchmark nor violate the measurement goals of the particular benchmark.  Source preprocessors, execution profile feedback optimizers, etc. are allowed as long as they are, or will be, available and supported as part of the compilation system for the full-scale systems.  Only publicly available and documented compiler switches shall be used.  Compiler optimizations will be allowed only if they do not increase the runtime or artificially increase the delivered FLOP/s rate by performing non-useful work.

For each benchmark code, a target concurrency is given for each of the two problem sizes, and the evaluator should provide results at the target concurrencies if it is possible to fit the benchmark on its target number of processors.

If a benchmark will not run on its target number of processors due to memory limitations, the evaluator may use the smallest number of additional processors necessary.  The evaluator must still solve the same global problem, using the same input files as for the target concurrency, when the MPI concurrency is higher than the original target.  For codes where the number of processors to be used is included in the input files, the input files may be modified accordingly if a larger number of processors than the target is required.  Other than that, no changes to the input files are allowed.

For all Base Case runs the benchmarks must be executed in a fully-packed manner on the computational nodes.  In this mode all the Full Applications will run at the concurrencies listed.  In architectures with multiple cores per node, the number of MPI tasks per node times the number of threads per MPI task shall equal the total number of physical cores available on the node. 

It is permissible for applications to run with more than one MPI task per core if the proposed system has the hardware capability to run multiple tasks, and the capability can be activated with a simple environment setting that would be available to users.  To use hardware multithreading, the evaluator should first start with the SPP-2017 target concurrency given in the tables and then expand the MPI concurrency to occupy the hardware threads.  For example, with 2-way hardware multithreading, the evaluator should first start at the target concurrency and then double the MPI concurrency (e.g., from 512 to 1,024) to engage both hardware threads.  The increase in MPI concurrency should be the minimum needed to exploit the hardware multithreading features.
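
A minimal Python sketch of these two rules follows; the node parameters are illustrative assumptions, not requirements of any particular system.

    # Illustrative node parameters (assumptions for this sketch only).
    PHYSICAL_CORES_PER_NODE = 32
    HW_THREADS_PER_CORE = 2  # e.g., 2-way hardware multithreading

    def is_fully_packed(mpi_tasks_per_node: int, threads_per_task: int) -> bool:
        """Base Case packing rule: MPI tasks per node times threads per task
        must equal the physical cores on the node."""
        return mpi_tasks_per_node * threads_per_task == PHYSICAL_CORES_PER_NODE

    def smt_concurrency(target_concurrency: int) -> int:
        """Minimum MPI concurrency needed to occupy all hardware threads."""
        return target_concurrency * HW_THREADS_PER_CORE

    print(is_fully_packed(mpi_tasks_per_node=32, threads_per_task=1))  # True
    print(is_fully_packed(mpi_tasks_per_node=8, threads_per_task=4))   # True
    print(smt_concurrency(512))  # 1024 for 2-way hardware multithreading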

Optional Optimized Case

An optional optimized case has been added to allow the evaluator to highlight the features and benefits of the proposed system by submitting benchmarking results obtained through a variety of optimizations.  This case applies only to the eleven Full Application Benchmarks and it applies to all sizes (subject to the constraints below).  The evaluator may choose to optimize the source code for data layout and alignment or to enable specific hardware or software features that may include (but are not limited to):

·      Using Hybrid OpenMP+MPI for concurrency;

·      Using vendor-specific hardware features to accelerate code;

·      Running the benchmarks at a higher or lower concurrency than the targets;

·      Running at the same concurrency as the targets but in an “unpacked” mode;

·      Any combination of the above.  

Note: When running in an unpacked mode, the number of tasks used in the SPP calculation for that application should be calculated using the total number of processors blocked from other use.  For example, if the scheduling unit is a node, all the cores in all the nodes assigned to the job must be counted as being used.  The evaluator should determine if the SPP increases or decreases when running in an unpacked mode before submitting results.

Wholesale changes to the parallel algorithms are also permitted as long as the full capabilities of the code are maintained, the code can still pass validation tests, and the underlying purpose of the benchmark is not compromised.  As many changes to the code may be made as desired, so long as the following conditions are met:

·      All simulation parameters such as grid size, number of particles, etc., must not be changed. 

·      The optimized code execution must still result in correct numerical results.

·      Any code optimizations must be available to the general user community, either through a system library or a well-documented explanation of code improvements.

·      Any library routines used must currently exist in an evaluator’s supported set of general or scientific libraries, or must be in such a set when the system is delivered, and must not specialize or limit the applicability of the benchmark nor violate the measurement goals of the particular benchmark.

·      Source preprocessors, execution profile feedback optimizers, etc. are allowed as long as they are, or will be, available and supported as part of the compilation system for the full-scale systems. 

·      Only publicly available and documented compiler switches shall be used.

·      Finally, the same code optimizations must be made for all runs of a benchmark.  For example, one set of code optimizations may not be made for the smaller concurrency while a different set of optimizations is made for the larger concurrency.

Any specific code changes and the runtime configuration used should be clearly documented with a complete audit trail and all supporting documentation included.

Full Configuration Test

This test examines the capability and performance of a single application executed over all computational cores and the entire interconnect infrastructure.  At least two of the application SPP tests will be run/projected for the entire system with the largest input problem set for that code (strong scaling).  The evaluator may choose which two applications are run in full configuration mode.

SPP

The SPP is a derived measure of computational capability relevant to achievable scientific work.  It is used to validate the system and monitor delivered performance.  The SPP is derived from an application performance figure, Pi, expressed in floating-point operations per second per core[1].  Given a system configured with N computational cores, the SPP is the geometric mean of Yi = Pi × N, taken over the component applications.  The floating-point operation count used in calculating Pi for each component application has been pre-determined by using a hardware performance counter on a single reference system (Blue Waters), and these values may not be altered.  The floating-point operation counts are not measured on the target system; only the wall clock time to complete the entire benchmark run is needed.

As calculated above, the SPP represents an “instantaneous” measure of computational capability as of the date the application run times were measured.  To represent the cumulative computational capability of a system over a specific period of time the SPP as calculated above is integrated over that time period by multiplying the instantaneous value by the length of time.  The result is expressed in units of TFlops (with a factor of 1,000 to convert from GFlops/s to TFlops/s).

In all cases, the number of cores used to calculate the SPP for a given application is the number of hardware cores blocked from other use while the application is executing, rather than the number of MPI tasks.  This is particularly important to note in the base case when hardware multithreading is used, or where a code must be run on additional cores due to memory limitations.  It also applies in the optional optimized case when an application is run unpacked on a node or OpenMP is used.
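
A minimal Python sketch of the SPP calculation is given below.  The FLOP counts are the pre-determined values from Table 5; the elapsed times, core counts, and system size N used in the usage example are hypothetical placeholders, not measured or required values.

    import math

    # Pre-determined FLOP counts from Table 5 (measured once on the reference system, Blue Waters).
    FLOP_COUNTS = {
        "AWP-ODC": 6.153e15, "CACTUS": 2.0e16,   "MILC": 5.17e16,
        "NAMD": 2.5e16,      "NWCHEM": 2.71e18,  "PPM": 4.351e18,
        "PSDNS": 4.374e16,   "QMCPACK": 1.88e17, "RMG": 1.533e18,
        "VPIC": 1.06e18,     "WRF": 2.92e16,
    }

    def spp_tflops(elapsed_time_s, cores_used, system_cores):
        """elapsed_time_s: app -> measured wall clock time (seconds)
        cores_used: app -> hardware cores blocked from other use (not MPI tasks)
        system_cores: N, total computational cores in the proposed system"""
        # Pi = FLOP count / (elapsed time * cores used): FLOP/s per core.
        p = {app: FLOP_COUNTS[app] / (elapsed_time_s[app] * cores_used[app])
             for app in elapsed_time_s}
        # Yi = Pi * N; the SPP is the geometric mean of Yi over the applications.
        y = [pi * system_cores for pi in p.values()]
        geo_mean = math.exp(sum(math.log(v) for v in y) / len(y))
        return geo_mean / 1e12  # convert FLOP/s to TFLOP/s

    # Hypothetical usage with made-up inputs for two applications only
    # (a real SPP submission uses all eleven Full Application Benchmarks):
    times = {"AWP-ODC": 1800.0, "MILC": 2400.0}   # seconds, illustrative only
    cores = {"AWP-ODC": 65536, "MILC": 41472}     # large-size target concurrencies
    print(spp_tflops(times, cores, system_cores=750_000))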

Result Tables

Table 2. System Description

Enter the target system details in this table for each system used in benchmarking.  Use the System label to refer to the system corresponding to each test in the following tables.

 

System   Processor   Clock/MHz   Interconnect   Total Core Count
A
B
C

For each application run, enter the run time variation in the column marked COV.  Copy the proposed elapsed times into Table 5, in preparation for the SPP calculation as described below.

Table 3A.  Application Benchmark Results – Base Case (no optimization - target concurrency or necessary higher concurrency)

 

MEDIUM SIZE (for porting and testing, possibly performance projection)

Application   System   Target Concurrency   Concurrency Used   Elapsed Time Benchmarked   Elapsed Time Proposed   COV
AWP-ODC                65,536
CACTUS                 12,800
MILC                   2,592
NAMD                   3,200
NWCHEM                 3,200
PPM                    2,112
PSDNS                  4,096
QMCPACK                3,200
RMG                    640
VPIC                   4,608
WRF                    14,592

LARGE SIZE (for SPP computation)

Application   System   Target Concurrency   Concurrency Used   Elapsed Time Benchmarked   Elapsed Time Proposed   COV
AWP-ODC                65,536
CACTUS                 131,072
MILC                   41,472
NAMD                   144,000
NWCHEM                 160,000
PPM                    270,336
PSDNS                  262,144
QMCPACK                160,000
RMG                    110,592
VPIC                   147,456
WRF                    145,920

Table 3B.  Application Benchmarks – Optional Optimized Case

MEDIUM SIZE (for porting and testing, possibly performance projection)

Application   System   Target Concurrency   Concurrency Used   Elapsed Time Benchmarked   Elapsed Time Proposed   COV
AWP-ODC                65,536
CACTUS                 12,800
MILC                   2,592
NAMD                   3,200
NWCHEM                 3,200
PPM                    2,112
PSDNS                  4,096
QMCPACK                3,200
RMG                    640
VPIC                   4,608
WRF                    14,592


LARGE SIZE (for SPP computation)

Application   System   Target Concurrency   Concurrency Used   Elapsed Time Benchmarked   Elapsed Time Proposed   COV
AWP-ODC                65,536
CACTUS                 131,072
MILC                   41,472
NAMD                   144,000
NWCHEM                 160,000
PPM                    270,336
PSDNS                  262,144
QMCPACK                160,000
RMG                    110,592
VPIC                   147,456
WRF                    145,920

Table 4.  Other Tests

Full Configuration       System   Benchmarked   Proposed
Memory per Task/MB
Dimension
# of MPI tasks
# of Nodes
Elapsed Time (seconds)


Table 5A.  Application Performance Table For SPP – Base Case

Application   Elapsed Time   Target Concurrency   Concurrency Used   FLOP Count      Pi
AWP-ODC                      65,536                                  6.153 x 10^15
CACTUS                       131,072                                 2.0 x 10^16
MILC                         41,472                                  5.17 x 10^16
NAMD                         144,000                                 2.5 x 10^16
NWCHEM                       160,000                                 2.71 x 10^18
PPM                          270,336                                 4.351 x 10^18
PSDNS                        262,144                                 4.374 x 10^16
QMCPACK                      160,000                                 1.88 x 10^17
RMG                          110,592                                 1.533 x 10^18
VPIC                         147,456                                 1.06 x 10^18
WRF                          145,920                                 2.92 x 10^16

Table 5B.  Application Performance Table For SPP – Optional Optimized Case

Application   Elapsed Time   Target Concurrency   Alternate Concurrency   FLOP Count      Pi   Yi = Pi x N
AWP-ODC                      65,536                                       6.153 x 10^15
CACTUS                       131,072                                      2.0 x 10^16
MILC                         41,472                                       5.17 x 10^16
NAMD                         144,000                                      2.5 x 10^16
NWCHEM                       160,000                                      2.71 x 10^18
PPM                          270,336                                      4.351 x 10^18
PSDNS                        262,144                                      4.374 x 10^16
QMCPACK                      160,000                                      1.88 x 10^17
RMG                          110,592                                      1.533 x 10^18
VPIC                         147,456                                      1.06 x 10^18
WRF                          145,920                                      2.92 x 10^16

Application Performance: Pi = FLOP Count / (Elapsed Time * Concurrency)

Table 6.  SPP Calculation

 

System Size (N): ________

Application      Base Case Pi   Optional Optimized Case Pi
AWP-ODC
CACTUS
MILC
NAMD
NWCHEM
PPM
PSDNS
QMCPACK
RMG
VPIC
WRF
Geometric Mean
SPP in TFLOPS

Enter the results from Table 5.  The evaluator should provide an SPP calculation for the Base Case (the target concurrency or necessary higher concurrency).  Additionally, the evaluator may provide an alternative SPP if different concurrencies provide better performance.



[1] Recall that, for the purpose of this procurement, core = CPU = processor.

 

Please submit issues regarding the benchmarks to help+bw@ncsa.illinois.edu.