
SPP-2017 Benchmark Codes and Inputs

Please see the SPP FAQ for any frequently asked questions.

Please submit issues regarding the benchmarks to help+bw@ncsa.illinois.edu.

Table of Application Characteristics

Code NSF FoS Structured Grid Unstructured Grid Dense Matrix Sparse Matrix N-Body Monte Carlo FFT I/O Languages Programming Models Production GPU Support
AWP-ODC 524 X     X       X Fortran,C++ MPI/Cuda Yes
Cactus 135 X     X         C++,C,Fortran MPI/OpenMP No
MILC 131 X   X           C/C++ MPI/OpenMP some (Quda in devel)
NAMD 411         X   X   C++ Charm++ (pthreads) Most features
NWCHEM 140     X   X       C,C++,Fortran Global Arrays over MPI Some
PPM 123 X   X         X Fortran MPI/OpenMP In development via PAID program
PSDNS 614 X           X   Fortran MPI/OpenMP/CAF No
QMCPACK 150       X   X X   C/C++ MPI/OpenMP Most real-valued wave function features (no complex)
RMG 154 X   X   X   X X C,C++,Fortran MPI/pthreads Yes
VPIC 517 X       X   X X C++ MPI/OpenMP No
WRF 510 X     X         Fortran MPI/OpenMP No

AWP-ODC

The Anelastic Wave Propagation, AWP-ODC, independently simulates the dynamic rupture and wave propagation that occurs during an earthquake. Dynamic rupture produces friction, traction, slip, and slip rate information on the fault. The moment function is constructed from this fault data and used to initialize wave propagation.

A staggered-grid finite difference scheme is used to approximate the 3D velocity-stress elastodynamic equations. The user has the choice of modeling dynamic rupture with either the Stress Glut (SG) or the Staggered Grid Split Node (SGSN) method. The user also has the choice of two external boundary conditions that minimize artificial reflections back into the computational domain: the absorbing boundary conditions (ABC) of Cerjan and the Perfectly Matched Layers (PML) of Berenger.

AWP-ODC has been written in Fortran 77 and Fortran 90. The Message Passing Interface enables parallel computation (MPI-2) and parallel I/O (MPI-IO).

Input Description

Input files are included in the tarball.

Instructions

Download

Download awp-odc_cpu.tgz from the awp-odc directory in the shared benchmarks directory using Globus Online.

Make/Build

$ tar xvfz awp-odc_cpu.tgz
$ cd AWP-ODC-v1.1.2/src-v1.1.2/
$ module swap $( module list | grep -o PrgEnv-.*$ ) PrgEnv-pgi
$ make -f makefile.bluewaters

Run

$ cd AWP-ODC-v1.1.2/examples/wave_propagation/128x128x128

$ qsub run_small.pbs
$ qsub run_large.pbs

Check Results

The run will print a final message such as:

Final---------
    inialization time =     0.662 sec
       get station time =     0.001 sec
       read source time =     0.049 sec
       read media time  =     0.508 sec
    computing time per time step =     0.058 sec
    mpiio time per time step =     0.002 sec
    total elapsed time is =     6.675 sec

Timing and Reference FLOP count

The FLOP count of the SPP size run is 61.53 PFLOP. It takes 1051s to run on 9% of the system (2048 nodes × 32 cores = 65536 cores).
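As a quick cross-check of these figures (not part of the benchmark procedure), the implied aggregate and per-core rates can be recomputed with an awk sketch using only the numbers quoted above:

# 61.53 PFLOP in 1051 s on 65536 cores (values quoted above)
awk 'BEGIN {
  flop = 61.53e15; t = 1051; cores = 65536
  printf "aggregate: %.1f TFLOP/s\n", flop / t / 1e12         # ~58.5 TFLOP/s
  printf "per core:  %.2f GFLOP/s\n", flop / t / cores / 1e9  # ~0.89 GFLOP/s
}'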

Changelog

date - Nature of change

Cactus

We use version ET_2016_05 of the Einstein Toolkit and versions 45bb0d7 and c50f88b of Zelmani, checked out on Dec 26 2016. We modified the code to reduce startup time in grid creation and data reading; those changes were partially incorporated in the current development branch of the Einstein Toolkit.

Cactus is a publicly available, community-driven, open-source problem solving environment designed for scientists and engineers. Its modular structure easily enables parallel computation across different architectures and collaborative code development between different groups. Zelmani is a core collapse supernova code built on top of Cactus. It employs the adaptive mesh refinement driver Carpet, the metric solver McLachlan, and the hydrodynamics code GRHydro of the Einstein Toolkit, along with the curvilinear grid code Llama. Zelmani provides the neutrino radiation transport and complex equations of state required to simulate core collapse supernovae.

Input Description

This SPP benchmark requires a parameter file s15WH07_R0000.par and three data files SFHo.h5, SFHo_NuLib_rho82_temp100_ye100_ng12_ns3_Itemp100_Ieta120_version1.0_20160528.h5 and s15WH07_SFHo_gr1dott_riso_format2_at_time_00.24171.dat, which contain the tabulated equation of state, the neutrino interaction table, and the initial stellar profile, respectively. The initial data represent a proto-neutron star that has formed after a core collapse event and is in the process of cooling down via neutrino emission. The input files set up a simulation of a core collapse supernova after a proto-neutron star has formed and while material accretes onto it. It employs general relativistic gravity, hydrodynamics, and neutrino radiation transport on a block structured adaptive mesh refinement grid. It sets up a simulation analogous to the system studied in arXiv:1604.07848.

Instructions

Download


Download Zelmani.tar.gz from the cactus-zelmani directory in the shared benchmarks directory using Globus Online.

Make/Build

Untar the tarball

tar xf Zelmani.tar.gz
cd Zelmani

Modify the file options.cfg to match your needs for compilers, libraries, etc. It is currently set up to compile Zelmani on Blue Waters using the GNU compiler suite. Setting the library directories to BUILD, e.g., HDF5_DIR=BUILD, will compile the library from scratch instead of using the cluster's version. You can find option files for many clusters in simfactory/mdb/optionlists that you can use as starting points. Then compile using

module unload PrgEnv-cray PrgEnv-gnu PrgEnv-intel PrgEnv-pathscale PrgEnv-pgi
module load PrgEnv-gnu
module load acml/5.3.1
module load cray-hdf5/1.8.14
# 2017-05-27 removed since module no longer exists
# module load cudatoolkit/7.0.28-1.0502.10742.5.1 # only to force dynamic linking
# 2017-05-27 use dynamic linking
export CRAYPE_LINK_TYPE=dynamic
export CRAY_ADD_RPATH=yes
module load gsl/1.15.1
module load pmi
make -j4 sim-config options=options.cfg THORNLIST=thornlists/Zelmani.th PROMPT=no

Run

Edit Zelmani.qsub to match your cluster; it is currently set up for Blue Waters. Be sure to set CACTUS_NUM_THREADS and CACTUS_NUM_PROCS to the number of threads per MPI rank and the number of MPI ranks to use. Then for the compact problem submit using

qsub Zelmani400.qsub

and

qsub Zelmani4096.qsub

for the large problem.

Check Results

Timing and Reference FLOP count

Timing should be obtained as the difference between the "Starting:" and "Stopping" lines in the Zelmani.out file and is output to a file Timing.txt.

Case Fraction of system Number of cores used FLOPs Wall Time
Compact problem 1.78% 12800 3.5x10^14 670 s
SPP problem 18.1% 131072 2.0x10^16 4800 s
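The per-core rates implied by this table can be recovered with a short awk sketch (a cross-check only, using the table values above):

awk 'BEGIN {
  printf "Compact: %.1f MFLOP/s per core\n", 3.5e14 / 670  / 12800  / 1e6  # ~40.8
  printf "SPP:     %.1f MFLOP/s per core\n", 2.0e16 / 4800 / 131072 / 1e6  # ~31.8
}'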

Changelog

date Nature of change
2017-05-27 Removed cudatoolkit/7.0.28-1.0502.10742.5.1 since the module no longer exists. Use env variable settings to enforce dynamic linking instead.
2017-09-06 Added a sample stdout file for the compact problem in the sample-output/Zelmani.out file.

 

MILC

Description

The following description of the MILC code generally and this benchmark specifically is from the NERSC MILC page.  

The benchmark code MILC represents part of a set of codes written by the MIMD Lattice Computation (MILC) collaboration used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four dimensional SU(3) lattice gauge theory on MIMD parallel machines. "Strong interactions" are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. QCD discretizes space and evaluates field variables on sites and links of a regular hypercube lattice in four-dimensional space time. Each link between nearest neighbors in this lattice is associated with a 3-dimensional SU(3) complex matrix for a given field.

The MILC collaboration has produced application codes to study several different QCD research areas, only one of which is used here. This code generates lattices using rational function approximations for the fermion determinants using the Rational Hybrid Monte Carlo (RHMC) algorithm and implementing the HISQ action. 

This benchmark is based on MILC 7.7.13, with modifications by NERSC for the downloaded version (linked below) and again by NCSA in the run scripts.

Problem set:

This problem set is a modified version of the MILC benchmark found here at the NERSC web site. To perform *this* benchmark, download the source package below. For reference, the original NERSC source/build file is here: http://portal.nersc.gov/project/m888/apex/MILC_160413.tgz .

Problem Size:

The "medium" problem set included in this package runs MILC on 81 Cray XE6 (or equivalent) at 32 ranks per node using the  36x36x36x72 problem size from the MILC_lattices directory.  The "large" problem set runs a 72x72x72x144 lattice on 1296 Cray XE nodes.  The lattice files used for these benchmarks are rather large and so are not included.  They can be downloaded from the original directory at NERSC. Go to this URL: http://portal.nersc.gov/project/m888/apex/MILC_lattices/ and download the lattice appropriate for your problem size (command files are included in the tar package).

configuration list

Config name lattice geometry XE6 nodes FP Op Count walltime (s) cores/node FP Ops per sec per core

test (or "medium") 36x36x36x72 81 3.22131e15 3365 32 369.3 M
SPP (or "large") 72x72x72x144 1296 5.17825e16 7916 32 157.7 M

(Run configuration for the benchmarks in this table is 32 ranks per node on Blue Waters, pure MPI, without optimization of rank order. Better performance is available using specialized software that may not be available on all systems.)
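The "FP Ops per sec per core" column follows directly from the other columns; a short awk sketch (for cross-checking only, using the table values) reproduces it:

awk 'BEGIN {
  printf "medium: %.1f M FP ops/s per core\n", 3.22131e15 / 3365 / (81   * 32) / 1e6  # ~369.3
  printf "large:  %.1f M FP ops/s per core\n", 5.17825e16 / 7916 / (1296 * 32) / 1e6  # ~157.7
}'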

Instructions: 

Download:

To run these benchmarks, download the modified source package MILC_2017_BW_versE.tgz from the milc-nersc directory in the shared benchmarks directory using Globus Online.  Unpack the tar file and cd into the resulting directory, MILC-apex_BW.  

Build: 

To build the code on a Cray system, in the MILC-apex_BW directory, run the build_noperftools.sh script; this builds the executable without instrumentation by the Cray PerfTools package.  This script includes loading the proper modules for the build on a Cray system.  The script build.sh_NERSC is the original build script; it may work under different circumstances.  

Get Lattices:

From the MILC-apex_BW directory, cd to benchmark/lattices. You will need to download the lattice start file for the benchmark you wish to run. On the command line, "source wget_command_BW_36_72" to download the lattice for the medium benchmark, or "source wget_command_BW_72_144" to get the large benchmark file. (These files are about 1 GB and 15 GB respectively.) Leave these files in this directory; the run directories contain links to them.

Test or Medium Benchmark:

To run the medium benchmark: 

cd from MILC-apex_BW to benchmarks/medium

On a Moab system, submit a job to run the benchmark using this command: qsub run_medium_BW_TOP_nopat.sh

When the job is finished running, there will be an output file in that directory that matches the pattern milc_*.<JOBID> which contains timing information.  There will also be a results_* directory; that directory contains the MILC output including the milc_* specific output file.  To determine the performance result of that benchmark run, run this perl analysis script on the timing file and the MILC output file.

BW_MILC_flops_from_output_medium.pl <job_output_file> <milc result file.job_id>  

(JobID will be different depending on your scheduler configuration.  An example for the name of an actual result file from Blue Waters is "milc_72x72x72x144_on_1296_32_1.5975725".) This script will output the aggregate Floating-point-operations-per-second for the whole job and the floating-point-operations-per-second-per-node (assuming it knows the number of nodes correctly). 

Example:

The name of the output file depends on the job scheduler configuration. Here is an actual example of invoking the analysis script:

BW_MILC_flops_from_output_medium.pl milc-medium.o7108560 results_7108560.bw/milc_36x36x36x72_on_81_32_1.7108560

SPP or Large Benchmark:

This is the same procedure as the medium benchmark, in a different directory with slightly different scripts. Starting from MILC-apex_BW:

cd benchmarks/large

qsub run_large_BW_nopat.sh

BW_MILC_flops_from_output_large.pl <job_output_file> <milc result file.job_id>  

Non-Cray Systems

These scripts were set up for a current (late 2016) Cray XE/XK system with Intel compilers.  As such, they use the Cray module infrastructure, Intel compilers, Cray wrapper compiler infrastructure, and Moab job control system.  Running these benchmarks on other systems will require modifying these scripts, which will include the build scripts, Makefiles, and job submission scripts.  

Changelog

2017 JUL 13:  Minor (< 1%) tweak to op count for large problem.  Major tweak (more than 2x) to op count for medium problem (there was a mismatch that we finally tracked down to a transcription error).  Removed Blue-Waters specific node information from run scripts, cleaned up some script mismatches.  Updated documentation on portal. 

NAMD

Description:

NAMD is an application for performing classical molecular dynamics simulations of biomolecules that is able to scale to 100 million atoms on hundreds of thousands of processors (Mei et al., 2011). Interactions between atoms in the simulation are parameterized based on the species of each atom and its chemical role. The forces on all atoms are integrated by the explicit, reversible, and symplectic Verlet algorithm to simulate the dynamic evolution of the system with a timestep of 2 fs.

 

The force field includes bond forces among groups of 2-4 atoms, Lennard-Jones forces, and short-range and long-range electrostatics. All of these forces except the long-range electrostatics are localized and scale well on large distributed computers. The long-range electrostatics are computed using the FFT-based particle-mesh Ewald (PME) algorithm, which requires computing two 3-D FFTs every time step. Since NAMD is designed to overlap various force computations, with increasing numbers of compute nodes, the local computation time decreases to the point where the all-to-all communication for the FFT stage dominates execution time.

The science problem is a 100M-atom chromatophore simulation, running a constant-pressure equilibration at 298 K. The benchmark input file was assembled by equilibrating the chromatophore of the purple bacterium Rhodobacter sphaeroides. This provides a science input set appropriate for scaling NAMD to petascale systems.

Download:

  1. Download the namd-spp-1.1.tar.gz file from the namd directory in the shared benchmarks directory using Globus Online.

  2. Extract the NAMD source code and benchmark problem: tar zxvf namd-spp-1.1.tar.gz

Build:

Change into the namd-spp-1.0 directory and run the build script: ./build.sh

Run:

Change into the namd-spp-1.0/run directory and submit the benchmark run for the desired node count (e.g., qsub namd.100.cpu.pbs):

% of system XE Nodes Tasks x threads per node Job script Steps Step range measured Time (s) Time per step FP ops measured GFLOPS GFLOPS per node
Test problem 0.4% 100 4 x 7 namd.100.cpu.pbs 2000 800-1800 684 0.684 2.5x10^15 3650 36.5
SPP problem 20% 4500 4 x 7 namd.4500.cpu.pbs 80000 69800-79800 242.2 0.0242 2.5x10^16 103000 22.9

Performance:

Performance is quantified by measuring the steady-state time-per-step of a series of timesteps near the end of the benchmark run. NAMD uses a measurement-based load balancer based on the behavior of the steps at the start of the run, so measuring performance at the start of the run is un-load-balanced and not representative of the actual performance in a long-running production simulation. The measurement window for the tests is 1000 and 10,000 steps respectively, which is sufficient to include computation and I/O in the same proportion as in a full simulation run.

Processor counter data at small scale indicates a floating-point operation count of 2.5x10^12 operations per time step. The FLOP count for a simulation over nsteps time steps is therefore 2.5x10^12 * nsteps operations; dividing by the wall time for those steps gives the FLOP rate. The operation count was measured by noting the total number of operations for two runs of different lengths, each running on 50 nodes, which was approximately the smallest node count with sufficient memory to run the problem. By subtracting the operation count of the shorter run from that of the longer run, operations associated with startup cancel out, and only the repeated operations remain.
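The table entries above follow from this per-step operation count, the measured step window, and the corresponding wall time; a short awk sketch (for illustration only, using the figures already given) approximately reproduces the aggregate GFLOPS values:

awk 'BEGIN {
  fpstep = 2.5e12                                                                  # measured FLOP per time step
  printf "Test problem: %.0f GFLOPS\n", fpstep * (1800  - 800)   / 684   / 1e9     # ~3655
  printf "SPP problem:  %.0f GFLOPS\n", fpstep * (79800 - 69800) / 242.2 / 1e9     # ~103220
}'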

Changelog

date - Nature of change

NWCHEM

Application Paragraph

NWChem provides scalable computational chemistry tools that are able to treat large scientific computational chemistry problems and efficiently use available parallel computing resources, from conventional workstation clusters to high-performance parallel supercomputers. The NWChem development community is a consortium of developers led by the EMSL team located at the Pacific Northwest National Laboratory (PNNL) in Washington State. The NWChem development strategy focuses on delivering essential scientific capabilities to its users in the areas of kinetics and dynamics of chemical transformations, which occur in the gas phase, at interfaces, and in the condensed phase.

Input Description

The benchmark test involves an electronic structure calculation of a guanine dimer at the coupled cluster singles and doubles level with perturbative triple excitations, known as CCSD(T). The computation employs the correlation-consistent aug-cc-pVTZ basis set. The outcome of the computation is the total energy of the molecular system at a fixed geometry representing a stationary solution of the Schroedinger equation.

The test includes two input files, flops.nw.ccsd and flops.nw.t, which constitute a composite job. Each of the input files utilizes 5000 XE6 nodes on Blue Waters. These jobs use different numbers of cores per node, therefore they have to be executed one after another within a single PBS script. The total run takes about 7 hours on Blue Waters. The first input file, flops.nw.ccsd, performs the iterative computation of single and double excitations and saves the fully converged amplitudes in a binary restart file. The second input file, flops.nw.t, reads the previously saved amplitudes from the disk file and performs the perturbative calculation of triple excitations.

Instructions

Download

The benchmark test requires NWChem ver 6.6, which is publicly available from http://www.nwchem-sw.org/index.php/Download

The input files are available in the nwchem.tar file in the nwchem directory in the shared benchmarks directory using Globus Online.

Make/Build

Compilation of NWChem on Blue Waters involves executing the following steps:

module swap PrgEnv-cray PrgEnv-gnu

export NWCHEM_TOP=`pwd`
export NWCHEM_MODULES="nwpw driver stepper mp2_grad rimp2 ccsd property hessian vib"
export FC=ftn
export CC=cc
export MSG_COMMS=MPI
export TARGET=LINUX64
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=MPI-PR
export HAS_BLAS=yes
export BLAS_OPT=''
export LIBMPI=''
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export USE_64TO32=y
export MA_USE_ARMCI_MEM=Y
export BLAS_SIZE=4
export LAPACK_SIZE=4
export SCALAPACK_SIZE=4
export SCALAPACK=-lsci_gnu
export BLASOPT=-lsci_gnu
export GA_DIR=ga-5-4
export USE_NOFSCHECK=TRUE
export USE_NOIO=TRUE

cd $NWCHEM_TOP/src
# make 64_to_32 converts BLAS routines to 32-bit indices
# apply it only once to the fresh code; repeated application breaks the code
make 64_to_32   # Cray specific - not needed on full 64-bit platform
make realclean
make nwchem_config
make FC=ftn GA_DIR=ga-5-4

Run Small Test

The small test uses the files in the small/ directory of the tar file.

Running NWChem benchmark on Blue Waters requires executing the following PBS script:

#!/bin/bash
#PBS -j oe
#PBS -l nodes=100:ppn=32
#PBS -l walltime=04:00:00
#PBS -N zsmall

cd $PBS_O_WORKDIR

aprun -n 3200 -N 32 ./nwchem flops.nw > job.out

Check Results for Small Test

At the end of the computation, the output file job.out should report "Total CCSD energy: -1082.513675531692343" in units of Hartree. Agreement between the computed energy and its reference value within 7 digits after the decimal point confirms the correctness of the computation.

Timing and Reference FLOP count for Small Test

The following script reports the performance of the job:

grep "Total times  cpu:" job.out | awk '{print $6}' | sed 's/s//' | awk '{print 1227.827/$1,"TF/s"}'

 

The FLOP count of the job is 1227827 GFLOPS.
The total time to solution of the job is 8011.4 seconds.
The aggregate performance of the job is 153.260 GF/s.
Number of schedulable processing units is 3200.
Performance per core is 0.048 GF / (second * core).
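These figures are mutually consistent; a short awk sketch (cross-check only, using the numbers above) recomputes the aggregate and per-core rates:

awk 'BEGIN {
  gflop = 1227827; t = 8011.4; cores = 3200
  printf "aggregate: %.3f GF/s\n", gflop / t            # ~153.26 GF/s
  printf "per core:  %.3f GF/s\n", gflop / t / cores    # ~0.048 GF/s
}'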

Run SPP (or Large) Test

The large test uses the files in the large/ directory of the tar file.

Running NWChem benchmark on Blue Waters requires executing the following PBS script:

#!/bin/bash
#PBS -j oe
#PBS -l nodes=5000:ppn=32
#PBS -l walltime=10:00:00
#PBS -N zlarge

cd $PBS_O_WORKDIR

aprun -n 15000 -N 3 -cc 0,8,16 ./nwchem flops.nw.ccsd > job.ccsd.out
aprun -n 160000 -N 32 ./nwchem flops.nw.t > job.t.out

Check Results for SPP (or Large) Test

At the end of the computation, the output file job.t.out should report "Total CCSD(T) energy: -1083.426702937265190" in units of Hartree. Agreement between the computed energy and its reference value within 7 digits after the decimal point confirms the correctness of the computation.

Timing and Reference FLOP count for SPP (or Large) Test

The following script reports the overall performance of the composite job:

grep "Total times  cpu:" job.ccsd.out job.t.out | awk '{print $7}' | sed 's/s//' | awk 'BEGIN{sum=0} {sum += $1} END{print 2710500.771/sum,"TF/s"}'

 

The total FLOP count of the composite job is 2710500771 GFLOPS. The CCSD part performed 121720604 GFLOPS, and the T part performed 2588780167 GFLOPS.
The total time to solution of the composite job is 24159.6 seconds. The CCSD portion of the run took 16725.4 seconds, and the T portion took 7434.2 seconds.
The aggregate performance of the composite job is 112191 GF/s.
Number of schedulable processing units is 160000.
Performance per core is 0.701 GFLOP / (second * core).
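The composite totals are the sums of the CCSD and (T) parts; a short awk sketch (cross-check only, using the numbers above) recomputes the aggregate and per-core rates:

awk 'BEGIN {
  ccsd_gflop = 121720604; t_gflop = 2588780167   # GFLOP for the CCSD and (T) parts
  ccsd_sec   = 16725.4;   t_sec   = 7434.2       # wall time for the CCSD and (T) parts
  total_gflop = ccsd_gflop + t_gflop; total_sec = ccsd_sec + t_sec
  printf "total: %.0f GFLOP in %.1f s\n", total_gflop, total_sec     # 2710500771 GFLOP, 24159.6 s
  printf "aggregate: %.0f GF/s\n", total_gflop / total_sec           # ~112191 GF/s
  printf "per core:  %.3f GF/s\n", total_gflop / total_sec / 160000  # ~0.701 GF/s
}'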

Changelog

06/26/2017 - Included timing of CCSD and T parts of the composite jobs.

09/01/2017 - Added output files into the tar file.

10/30/2017 - Added the number of GFLOPS consumed by CCSD and T parts individually.

PPM

Application Paragraph

The piecewise parabolic method (PPM) developed initially for gas dynamical simulations including shocks and discontinuities in stellar processes has been applied to the problem of inertial confinement fusion (ICF). The ICF problem, like the star problems, involves turbulent mixing due to instabilities at multi-fluid interfaces, and in both problems the details of this mixing affect the combustion that results.

The inertial confinement fusion (ICF) test problem is being investigated under DoE funding from the Los Alamos National Lab, which also funds the team's work on giant stars, an important phenomenon to understand for the nation's effort to find ways to generate energy by means other than burning fossil fuels. In the latest elaboration of the Science Team's computational strategy, the requirement for memory bandwidth to the CPU chip is reduced by nearly a factor of 3. The ICF problem exercises all the features of the PPM codes, including strong shocks and very elaborate treatments of unstable multifluid interfaces, while at the same time addressing an important scientific problem that has a highly transient character.

The inertial confinement fusion (ICF) process is initiated by shining 192 powerful laser beams on a container in which a small capsule is suspended. The spherical capsule confining the hydrogen fuel is driven radially inward by the very high pressure of the surface material heated by the laser light. This inward motion exceeds the speed of sound in the unshocked capsule and fuel materials, and therefore it takes place very rapidly. In just a few sound crossing times of the system, the compression of capsule and fuel is complete. In this problem, the time step is set by the speed of sound in the hot gases. Thus, only a relatively small number of time steps are needed. The test case is a modification of the ICF test problem devised by Youngs in 2008. These modifications increase the radial compression that it produces from about a factor of 4 to about a factor of 10. This is still much less than is needed for inertial confinement fusion, but it is closer, and the test problem is still feasible to specify and set up. The emphasis is on the mixing of the capsule and fuel materials at the inner surface of the capsule.

All the fluids are treated as ideal monatomic gases. Shock Mach numbers are then very high, which of course would ionize the material. This detail is not especially relevant to the unstable behavior of the capsule-fuel boundary. Instead, the focus is the interface instability in the context of a strongly converging flow field. The ICF problem and code are further described in Paul Woodward's "ESS Success Story" report presented to the NSF Blue Waters Review panel in 2012. The code for the SPP benchmark was last updated on May 1, 2012 by Paul Woodward.

Input Description

The test case concerns inertial confinement fusion, in which a laser pulse implodes a fuel pellet, triggering a complex flow field and nuclear fusion reactions. Since the simulation generates the initial state based on parameters in the source, no input data is necessary. An additional set of I/O server MPI tasks is interleaved to stream the volume data output to disk asynchronously.

  • Compact problem: Using 66 XE nodes, compute on a 1,280^3 zone mesh. This uses 2,112 MPI tasks. Of this total, 64 MPI tasks are I/O servers. 2,000 time steps were taken, generating 135 GB of output (excluding restart files) in 4 dumps.
  • SPP problem: Using 8,448 XE nodes, compute on a 5,120^3 zone mesh. This uses 270,336 MPI tasks. Of this total, 8,192 MPI tasks are I/O servers. 8,016 time steps were taken, generating 8.4 TB of output (excluding restart files) in 4 dumps.

Instructions

Download

PPM2F, a 2-fluid, explicit gas dynamics program, was developed by and is the intellectual property of Prof. Paul R. Woodward. It is Copyright (C) 2014, by the Regents of the University of Minnesota.

This program is provided for use only in benchmarking with the defined problem sets associated with this download. A license to use this software, any part of this software, and/or the methods encapsulated in this software is not granted for any other purpose. This software and associated files shall not be redistributed to others. Any requests to use this software in other ways should be communicated to and approved by the author of the software, Paul R. Woodward, at the address given below.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. The Developer does not commit to provide any support and assistance for using this Program for benchmarking purposes.

Paul R. Woodward can be contacted at
paul@lcse.umn.edu
399 Walter Library, 117 Pleasant St. S. E., University of Minnesota
Minneapolis, Minnesota, USA 55455

By downloading this program and associated files, you are acknowledging and agreeing to the above conditions.

 

Download PPM2F_5120.tar.gz from the ppm directory in the shared benchmarks directory using Globus Online.

Make/Build

The following is an example for the SPP problem. 

module unload darshan
module load craype-hugepages2M
./compile-JK-noPapi 5120M2

compile-JK-noPapi has the following line:

ln -sf iq$1.h iq.h
ftn -h nocaf -static -O2 -h fp3 -r m -h keepfiles PPM2F-tp3-5-1-12-ICF-short-loops.F cio.o -o PPM2F-$1

Run

qsub run-PPM2F-5120M2

The following is a job script example for the SPP problem:

	#PBS -N PPM2F-5120M2
#PBS -lnodes=8448:ppn=32:xe
#PBS -l walltime=02:30:00
#PBS -j oe

. /opt/modules/default/init/bash
module unload darshan
module swap PrgEnv-gnu PrgEnv-cray
module load craype-hugepages2M
module list

cd $PBS_O_WORKDIR

here="/projects/bluewaters/benchmarking/SPP/ppm/PPM2F_5120"

data=DATA5120M2.$PBS_JOBID

mkdir -p /scratch/staff/jkwack2/PPM_test/$data
ln -sf /scratch/staff/jkwack2/PPM_test/$data $data
lfs setstripe --count=2 /scratch/staff/jkwack2/PPM_test/$data
cd /scratch/staff/jkwack2/PPM_test/$data

export MPICH_RANK_REORDER_METHOD=3
$here/generate-MPICH_RANK_ORDER_V2 270336 8192

touch $here/PPM4Fperf.ppm.$PBS_JOBID
ln -sf $here/PPM4Fperf.ppm.$PBS_JOBID  PPM4Fperf.ppm

ulimit -s unlimited
export OMP_STACKSIZE=200M
export OMP_NUM_THREADS=1

date
/usr/bin/time aprun -n 270336 -ss $here/PPM2F-5120M2 >& $here/stderrout.5120M2.$PBS_JOBID
date

 

Check Results

Statistical summary data for the increasing resolution cases from Paul Woodward is used to verify results.

 

Timing and Reference FLOP count

Case Fraction of XE system MPI tasks FLOPs Wall Time FLOPs/sec FLOPs/sec/node
Compact problem 1/340 (66 XE nodes) 2,112 1.619x10^16 3673 sec 4.408x10^12 6.679x10^10
SPP problem 1/3 (8448 XE nodes) 270,336 4.351x10^18 7790 sec 5.585x10^14 6.611x10^10


FLOP calculation

  1. Update "flops" with data from "PPM4Fperf.ppm.$PBS_JOBID"
  2. ./flops 
  3. It returns TFlops/sec and Gflops/sec/node

The following is an example of "flops" for the SPP problem (e.g., flops-PPM2F-5120M2):

		#!/usr/bin/bc
scale= 5

nx=5120
nnode=8448
nmin=129
nsec=50.00

# average flops per zone
f= ( 3.96690 +  4.06932  +  4.06932  +  4.06932 ) / 4 * 1000

# total timesteps
k = 8016

# elapsed wall time, seconds
t=nmin*60 + nsec

# flop rate average
n = f * k * nx^3
a = n/t
print "Wall time = ",t,"sec\n"
print "PFlops =", n/(10^15),"\n"
print "TFlops/sec aggregate= ", a / ( 10^12 ), "\n"
print "Gflops/sec/node = ", a / nnode / ( 10^9 ), "\n"

quit

Changelog

January 24, 2017 - The compact problem for PPM2F was added.

January 26, 2017 - The SPP problem for PPM2F was added. 

June 5, 2017 - A tar ball for PPM2F was updated to include the performance result files. 

PSDNS

Description:

PSDNS is a highly parallelized application code used for performing direct numerical simulations (DNS) of three-dimensional unsteady turbulent fluid flows, under the assumption of statistical homogeneity in space. A system of partial differential equations expressing the fundamental laws of conservation of mass and momentum is solved using Fourier pseudo-spectral methods in space and explicit Runge-Kutta integration in time. The pseudo-spectral approach requires multiple FFTs to be taken in three directions per time step, resulting in communication-intensive operations due to transposes involving collective communication among processors. However, remote-memory addressing techniques such as Co-Array Fortran on Blue Waters have been found to be helpful.

Simulation outputs can include flow field information stored at a large number of grid points, as well as the trajectories of infinitesimal fluid elements which are tracked over sustained periods of time for the study of turbulent dispersion. The current code has enabled a production simulation of turbulent flow using 4 trillion grid points on 262,144 CPU cores.

For the SPP benchmark, PSDNS is initialized with a simple sinusoidal velocity field and solves for the velocity field variables using the 4th-order Runge-Kutta method. Its I/O and checkpointing are performed at the frequency specified in the input file.

Download:

Download spp_PSDNS.tar.gz from the psdns directory in the shared benchmarks directory using Globus Online.

Unpacking spp_PSDNS.tar.gz will generate the directory spp_PSDNS. It contains the subdirectories PSDNS, dir_runs1, dir_runs128, and dir_runs8192, as well as the file README.spp.

Build:

  1. Create the directory for PSDNS;
  2. tar zxvf spp_PSDNS.tar.gz;
  3. Read build and run instructions in README.spp;
  4. Modify makefile in PSDNS if needed;
  5. cd PSDNS, then do the following to generate the executable DNS2d_mpi_p8._CAF.x:
    • module load fftw
    • module load cray-hdf5-parallel 
    • module load craype-hugepages2M 
    • make cleanest
    • make srcmake
    • make

Run:

Three run directories, dir_runs1, dir_runs128, and dir_runs8192, can be found after unpacking spp_PSDNS.tar.gz, in addition to the PSDNS directory. dir_runs1 is set up for a test run of a 128^3 PSDNS problem on 1 node; dir_runs128 is a directory for a 2048^3 PSDNS problem on 128 nodes; and dir_runs8192 is a directory for running an 8192^3 PSDNS problem on 8192 nodes.

For a run, cd to one of the three directories. The runs using 128 and 8192 nodes are set up for a 20-30 minute run with two checkpoints and a few I/O steps, with the 128-node run taking 90 steps and the 8192-node run taking 28 steps. After modifying the PBS script file batch to set the account the job will be charged to, one should be able to successfully run the code with ``qsub batch'' from a sample directory.

To repeat the runs defined in dir_runs1, in dir_runs128, or in dir_runs8192, simply

  • create a new directory, then copy the files input, dims, batch, and prep_dirs from one of the provided directories to the new directory;
  • cd to the new directory;
  • mkdir iostep_timings
  • csh prep_dirs M

with M = 4, 64, or 512 depending on whether the run is for 1 node, 128 nodes, or 8192 nodes.

To run the PSDNS code for other problem sizes or for a different number of steps, etc., one needs to create a new directory, then copy and modify the files input, dims, and batch accordingly. In addition, one needs to create the directory iostep_timings, as well as the directory outpen by using the shell script prep_dirs. Discussions on how to modify the files input and dims, and how to use prep_dirs to generate outpen, are in README.spp.

 

Results

After a successful run, many files are generated by PSDNS that profile the program and provide information on the time spent on different tasks. To check for correctness, use the file eulstat. For timing information, reference the file log. More details are provided in README.spp.

                             

Timing and Reference FLOP Count

Problem XE Fraction Nodes Parallelism Problem Size Num Steps Total Flop Total Time (sec)
Small 0.0565% 128 4096 2048^3 90 1773.78x10^12 1741
SPP 36.18% 8192 262144 8192^3 28 43740.83x10^12 1538

 

Note 1: Total Flop is based on performance data; Num Steps are chosen for runs lasting 20-30 min on Blue Waters.

 

Note 2: A formula estimating the floating point operation count per step is:

Flop/step = 220 * N^3 * log2(N)
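Evaluating this estimate for the two problem sizes gives totals in line with the measured values in the table (a sketch for illustration only; log2 N is computed as log(N)/log(2)):

awk 'BEGIN {
  # Flop/step = 220 * N^3 * log2(N), multiplied by the number of steps taken
  printf "N=2048, 90 steps: %.2e flop\n", 220 * 2048^3 * log(2048)/log(2) * 90  # ~1.87e15
  printf "N=8192, 28 steps: %.2e flop\n", 220 * 8192^3 * log(8192)/log(2) * 28  # ~4.40e16
}'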

 

Changelog

date - Nature of change

QMCPACK

The code for these tests is based on revision 6854, which can be downloaded from the QMCPACK website.

Description

Capable of running on both CPUs and GPUs, QMCPACK is a Quantum Monte Carlo (QMC) code used for many-body ab initio simulation of atoms, molecules, and solid materials.  Open-source and written in C++, QMCPACK solves the many-body Schrödinger equation by stochastically sampling the configuration domain.  A Variational Monte Carlo (VMC) algorithm can be used either to obtain the results directly or to quickly find a “ballpark” estimate of the solution, which is then refined by a Diffusion Monte Carlo (DMC) algorithm.  In both configurations, a number of VMC walkers (each a complete state representation) randomly move through the energy domain each time step.  If the VMC-DMC setup is used, the VMC walkers are sampled to create walkers for the DMC phase.  The output is the lowest energy state within a statistical uncertainty, which can be reduced by taking more samples (i.e., using more walkers).

Input Problem

The 4x4x1 graphite problem consists of 4x4x1 blocks of carbon atom supercells, each containing four carbon atoms (64 total per 4x4x1 block), which are repeated in three dimensions (i.e., the boundary conditions are periodic).  With a total of 256 valence electrons per 4x4x1 block (four per atom), this problem represents two graphene layers stacked in the third dimension.  The goal is to find the lowest energy state that describes the system.  Both VMC and DMC algorithms are used to solve this problem.

The given tarball contains configurations for both CPU and GPU builds, but only two CPU tests are given: a small 3,200-XE-task case (100 nodes) and the 160,000-XE-task SPP case (5,000 nodes).  Both tests use the same input problem.  The only difference is in how many samples are computed.

The SPP problem does not write any checkpoints.

Instructions

Download

Download qmcpack_spp_20170127.tar.gz from the qmcpack directory in the shared benchmarks directory using Globus Online.

The directories in the base directory of the tarball include:

external_libs - location of amdlibm library
input - graphite input problem files
qmcpack - QMCPACK source, configuration, and build files
runs - run directories for different job sizes

The source code is based on the publicly-available revision 6854, which can be downloaded from the QMCPACK website here: http://qmcpack.org/downloads/releases

From that code, three files were modified for benchmarking and Blue Waters compatibility purposes, and three scripts were added to the qmcpack/build directory for building each available version of QMCPACK (CPU with real or complex wave functions and GPU with real wave functions).

The following is a list of changes made to the code:

1. qmcpack/src/QMCApp/qmcapp.cpp

This file contains main().  A timer was added to record the amount of time spent in the main() function, which is displayed at the end of the output file in the format “Total Time Spent in Main: xxx seconds”.

2. qmcpack/config/attic/BWGNU.cmake

This is the cmake configuration file for the CPU (XE) gnu build on Blue Waters.  A number of paths were changed.

3. qmcpack/config/attic/BWCUDA.cmake

This is the cmake configuration file for the GPU (XK) gnu-cuda build on Blue Waters.  Again, a number of paths were changed.

4. The build_script_cpu_real, build_script_cpu_complex, and build_script_gpu_real scripts were added to qmcpack/build to make building a particular version much easier.

Also included in that tarball is a copy of AMD’s libm (see here for documentation: http://developer.amd.com/tools-and-sdks/archive/compute/libm).  While not necessary for building QMCPACK, using AMD libm on Blue Waters has shown a performance improvement of a few percent versus using gnu’s libm.

Build

Before building a CPU version of QMCPACK, open qmcpack/config/attic/BWGNU.cmake in an editor, and make sure that the paths for AMD's libm are set correctly.

vim qmcpack/config/attic/BWGNU.cmake

and modify

# for AMD's libm
include_directories(/u/staff/rmokos/scratch/spp/qmcpack_new_spp/external_libs/amdlibm/include)
link_libraries(/u/staff/rmokos/scratch/spp/qmcpack_new_spp/external_libs/amdlibm/lib/static/libamdlibm.a)

To compile QMCPACK, simply cd into qmcpack/build and run the script for the required version: CPU with real wave functions, CPU with complex wave functions, or GPU with real wave functions (GPU with complex wave functions is not supported yet).  The CPU real build is the only one necessary for obtaining the results shown below.

Verify from the CMAKE output that it was able to find all of the various packages needed for an optimal build (note: the paths and versions may be different, and not all of these lines will be sequential).  For CPU builds:

-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.7")
-- Found LibXml2: /usr/lib64/libxml2.a (found version "2.7.6")
-- Found HDF5: /opt/cray/hdf5/default/gnu/49/lib/libhdf5.a
-- Boost version: 1.53.0
-- Found FFTW

And for GPU builds:

-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.7")
-- Found LibXml2: /usr/lib64/libxml2.a (found version "2.7.6")
-- Found HDF5: /opt/cray/hdf5/default/gnu/49/lib/libhdf5.a  
-- Boost version: 1.53.0
-- Found FFTW
-- Found CUDA: /opt/nvidia/cudatoolkit7.5/7.5.18-1.0502.10743.2.1 (found version "7.5")

Once built, the executable is moved to the qmcpack/build/exe directory and appended with the compile date and time to avoid overwriting/deleting executables when others are built.

cd qmcpack/build
./build_script_cpu_real

...

ls -l exe

total 110316
-rwxr-xr-x 1 rmokos bw_staff 82475908 Jan 12 16:07 qmcpack_cpu_real_170112_160711

Run

Run scripts can be found in the runs directory for the 3,200-task (100 XE nodes) and the 160,000-task (5,000 XE nodes) CPU jobs.  The broken symbolic links need to be set to point to the executables that were placed in the qmcpack/build/exe directory by the build procedure.  The name of the link itself (e.g., qmcpack_cpu_real) needs to be maintained.  For example:

cd runs/cpu_3200
rm qmcpack_cpu_real
ln -s ../../qmcpack/build/exe/qmcpack_cpu_real_170112_160711 qmcpack_cpu_real

After the appropriate executables are linked in, the bw.pbs batch script needs to be submitted via the qsub command.

qsub bw.pbs

Results

To validate the output, the system energy is compared to an expected value.  To determine the accuracy of the output, the statistical uncertainty is calculated by a Perl script, energy.pl, which is provided in the QMCPACK repository.  The target accuracy depends on the problem of interest.  For example, 0.001 Ha per atom is needed to resolve energy scales involving graphite systems. However, smaller errors are necessary for many classes of problems.

The specific input parameters for the given 4x4x1 graphite benchmark problem will generate results with uncertainties that are less than 0.001 Ha per atom.

To obtain the system energy from the output files, run the energy.pl script on the DMC phase output file, which is of the form gr4x4x1.p*x*.s001.scalar.dat.  Specify “0” as the second parameter to the script to use the output from every DMC iteration (“0” means it will start reading at line 0).  For example:

$ ./energy.pl gr4x4x1.p160000x1.s001.scalar.dat 0

LocalEnergy           =          -364.917 +/-            0.040
Variance              =                13 +/-               29
LocalPotential        =           -640.69 +/-             0.14
Kinetic               =            275.78 +/-             0.10
ElecElec              =           -11.886 +/-            0.027
IonIon                =      -270.8254378 +/-        0.0000023
LocalECP              =           -399.19 +/-             0.14
NonLocalECP           =            41.202 +/-            0.017
BlockWeight           =          26828294 +/-           287824
BlockCPU              =             51.02 +/-             0.42
AcceptRatio           =         0.9911569 +/-        0.0000011
Efficiency            =        525796.170 +/-            0.000

LocalEnergy is the system energy in Hartrees (Ha).  As previously discussed, to be scientifically useful, the error for this graphite problem needs to be less than 0.001 Ha per atom.  In the above example, 0.040 Ha / 64 atoms = 0.000625 Ha/atom < 0.001 Ha/atom, so the result is useful.  The expected system energy is in the neighborhood of -364.9 Ha, so the result is also valid.

Timing and Reference FLOP count

The total flops were calculated from OVIS data collected by the system during the runs.

The run time was taken from the real part of the output of the time command used in conjunction with aprun (i.e., time aprun <parameters>), which can be found at the end of the qmcpack.err file.  For example:

real 1764.83
user 8.70
sys 3.86

Problem Parallelism XE Fraction Total DMC Samples DMC Blocks Total Flops Walltime (s) BW Task Flop Rate
Test 3,200 0.442% 51,200 30 3.63 x 10^15 1,143 0.992 GFlops/s
SPP 160,000 22.1% 2,560,000 30 1.88 x 10^17 1,765 0.666 GFlops/s
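The "BW Task Flop Rate" column follows from the total flops, walltime, and task count; a short awk sketch (cross-check only, using the table values) recomputes it:

awk 'BEGIN {
  printf "Test: %.3f GFlops/s per task\n", 3.63e15 / 1143 / 3200   / 1e9  # ~0.992
  printf "SPP:  %.3f GFlops/s per task\n", 1.88e17 / 1765 / 160000 / 1e9  # ~0.666
}'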

 

Changelog

5/12/2017 - Due to a programming environment update that broke the qmcpack_spp_20170127.tar.gz code, a new version was added, which should now be used for all tests: qmcpack_dev_20170512-1310.tgz.  Use of this new version resulted in a small performance hit.

Problem Parallelism XE Fraction Total DMC Samples DMC Blocks Total Flops Walltime (s) BW Task Flop Rate
Test 3,200 0.442% 51,200 30 3.63 x 10^15 1,157 0.980 GFlops/s
SPP 160,000 22.1% 2,560,000 30 1.87 x 10^17 1,832 0.638 GFlops/s

 

RMG

Application Paragraph

RMG is a DFT-based electronic structure code developed at North Carolina State University. The original version was written in 1993-1994 and it has been updated and expanded continuously since that time. It uses real-space meshes to represent the wavefunctions, the charge density, and the ionic pseudopotentials. The real-space formulation is advantageous for parallelization, because each processor can be assigned a region of space, and for convergence acceleration, since multiple length scales can be dealt with separately. Both norm-conserving and ultrasoft pseudopotentials (UPPs) are allowed. For both efficiency and accuracy, the implementation of UPPs uses three grids, for the wave functions, the charge density and the short-ranged projectors associated with UPPs. To further reduce the grid density, high-order Mehrstellen discretizations were developed, as an alternative to central difference discretization of the kinetic energy operator. The Mehrstellen discretization employs a weighted sum of the wavefunction and potential values to improve the accuracy of the discretization of the entire differential equation, not just the kinetic energy operator. For a given discretization order, it is also significantly shorter range than central difference formulas, decreasing communication needs in a parallel implementation.

Input Description

The benchmark is a 4,096-atom calculation of a nitrogen-vacancy quantum spin system embedded in a diamond lattice. The calculations are carried out at the spin-polarized density functional theory level and include 16,382 occupied orbitals and 2,050 unoccupied orbitals. The result is the total ground state energy of the system for the given ionic configuration.

Computation produces two images, one for spin up and the other for spin down. Each image uses half of the processors. The full domain of the job maps onto a 3D processor grid (288,288,288), which works out to a (24,24,24) decomposition on each node. The configuration of 4 MPI tasks per node on 3456 nodes uses a (24,24,12) processor grid. This corresponds to 13824 MPI tasks. Other acceptable processor counts may be obtained by multiplying or dividing the present count by a factor of two.

The test includes the input file as well as two data files, C.pbe-mt_fhi.UPF and N.pbe-mt_fhi.UPF, which need to be present in the directory that the executable is run from.

Instructions

Download

The build script below includes the following download operations:

wget http://downloads.sourceforge.net/project/openbabel/openbabel/2.3.2/openbabel-2.3.2.tar.gz
git clone git://git.code.sf.net/p/plplot/plplot plplot.git
wget http://downloads.sourceforge.net/project/rmgdft/Releases/2.1/Sources/rmg-release_2.2.tar.gz

Use rmg-code.tgz that comes with the archive until RMG release 2.2 becomes available.

Build

Unpack the source code archive.

 

# specify installation directory
export INSTDIR=$HOME/benchmarking/SPP/rmg/cpu
mkdir $INSTDIR

module swap PrgEnv-cray PrgEnv-gnu
module swap gcc gcc/4.9.3
module load boost
module load cmake/3.1.3
module load fftw
module unload darshan

export CRAYPE_LINK_TYPE=dynamic
export CRAY_ADD_RPATH=yes

# install openbabel library
cd $INSTDIR
wget http://downloads.sourceforge.net/project/openbabel/openbabel/2.3.2/openbabel-2.3.2.tar.gz
tar zxvf openbabel-2.3.2.tar.gz
cd openbabel-2.3.2/
mkdir build
cd build/
cmake -DCMAKE_INSTALL_PREFIX=$INSTDIR/openbabel-2.3.2/install ..
make -j8
make install

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$INSTDIR/openbabel-2.3.2/install/lib

export CC=cc
export CXX=CC
export FC=ftn

# install plplot library
cd $INSTDIR
git clone git://git.code.sf.net/p/plplot/plplot plplot.git
cd plplot.git/
git tag
git checkout v5_9_0
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$INSTDIR/plplot.git/install ..
make
make install

# install CPU-version of RMG
cd $INSTDIR
tar zxvf rmg-code.tgz
cd src
mkdir build-cpu; cd build-cpu
cmake -DOPENBABEL_INCLUDES=$INSTDIR/openbabel-2.3.2/install/include/openbabel-2.0/openbabel -DOPENBABEL_LIBRARIES=$INSTDIR/openbabel-2.3.2/install/lib/libopenbabel.so.4.0.2 -DPLplot_INCLUDE_DIR=$INSTDIR/plplot.git/install/include -DPLplot_cxx_LIBRARY=$INSTDIR/plplot.git/install/lib/libplplotcxx.so -DPLplot_LIBRARY=$INSTDIR/plplot.git/install/lib/libplplot.so ..
make -j12 rmg-cpu

Run Small Test

The small test represents a system of 302 water molecules in a unit cell under periodic boundary condition. The computation performs 2 molecular dynamics steps and runs on 20 XE nodes on Blue Waters.

 

To run the test, change directory to small/

qsub run.pbs

 

#!/bin/bash
#PBS -N z302water_cpu100
#PBS -j oe
#PBS -l walltime=02:00:00
#PBS -l nodes=20:ppn=32:xe
#PBS -q normal

source /opt/modules/default/init/bash
module list

cd $PBS_O_WORKDIR

export MPICH_MAX_THREAD_SAFETY=serialized
export OMP_WAIT_POLICY=passive
export MPICH_ENV_DISPLAY=1
ulimit -a
export MPICH_GNI_MBOX_PLACEMENT=nic
export MPICH_GNI_RDMA_THRESHOLD=1024
export MPICH_GNI_NUM_BUFS=512
export MPICH_GNI_MAX_EAGER_MSG_SIZE=65536

aprun -n 640 -N 32 ./rmg-cpu 302waters.rmg_cpu

 

Check Results for Small Test

The following script extracts the final value of the total energy:

grep "TOTAL ENERGY" *.log | tail -n1 | awk '{printf("TOTAL ENERGY = %12.6f Hartree\n",$5)}'

 

The reference value:

TOTAL ENERGY = -5236.425 Hartree

 

Agreement to three digits after the decimal point in the total energy confirms the correctness of the computation.

Timing and Reference FLOP count for Small Test

Total FLOP count      = 20127711 GFLOPs

20-node test:
Time to solution      = 6711.12 sec
Aggregate performance = 2999 GF/s
Number of schedulable processing units = 640
Performance per core  = 4.686 GF/s
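These figures are mutually consistent; a short awk sketch (cross-check only, using the numbers above) recomputes the aggregate and per-core rates for the small test:

awk 'BEGIN {
  gflop = 20127711; t = 6711.12; cores = 640
  printf "aggregate: %.0f GF/s\n", gflop / t           # ~2999 GF/s
  printf "per core:  %.3f GF/s\n", gflop / t / cores   # ~4.686 GF/s
}'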

Run Large Test

Change directory to large/

qsub run.pbs

 

#!/bin/bash
#PBS -N dnv4096_3456xe_large
#PBS -j oe
#PBS -l walltime=03:00:00
#PBS -l nodes=3456:ppn=32:xe
#PBS -q normal

source /opt/modules/default/init/bash
module swap PrgEnv-cray PrgEnv-gnu
module load boost
module unload darshan
module list

cd $PBS_O_WORKDIR

export MPICH_MAX_THREAD_SAFETY=serialized
export OMP_NUM_THREADS=8
export OMP_WAIT_POLICY=passive
export MPICH_ENV_DISPLAY=1

module load craype-hugepages128K
aprun -n 13824 -N 4 -d 8 ./rmg-cpu in.dnv4096_3456xe_large

 

Check Results for Large Test

The following script extracts the final value of the total energy:

grep "TOTAL ENERGY" *.log | tail -n1 | awk '{printf("TOTAL ENERGY = %12.6f Hartree\n",$5)}'

 

TOTAL ENERGY = -23331.804 Hartree

 

Agreement to three digits after the decimal point in the total energy confirms the correctness of the computation.

Timing and Reference FLOP count for Large Test

Total FLOP count      = 1533034872 GFLOPs

3456 nodes (110592 cores):
Time to solution      = 7310.45 sec
Aggregate performance = 209705 GF/s
Number of schedulable processing units = 110592
Performance per core  = 1.896 GF/(second * core)
Hard disk usage space = 3.3 Tbytes
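The same cross-check applies to the large test (a sketch using only the figures above):

awk 'BEGIN {
  gflop = 1533034872; t = 7310.45; cores = 110592
  printf "aggregate: %.0f GF/s\n", gflop / t           # ~209705 GF/s
  printf "per core:  %.3f GF/s\n", gflop / t / cores   # ~1.896 GF/s
}'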

Changelog

 

06/06/2017 - Noticed that RMG version 2.1, which is downloadable from sourceforge.net, is outdated and does not reproduce the total energies supplied with the SPP tests to check correctness of the computations. Compilation of RMG should use the version of the code that comes in rmg-code.tgz, included in the tar ball of the SPP distribution. Until the RMG developers make RMG version 2.2 downloadable from sourceforge.net, the portion of the Build instructions that pertains to compilation of the RMG code should use the following steps:

# install CPU-version of RMG
cd $INSTDIR
tar zxvf rmg-code.tgz
cd src
mkdir build-cpu; cd build-cpu
cmake -DOPENBABEL_INCLUDES=$INSTDIR/openbabel-2.3.2/install/include/openbabel-2.0/openbabel -DOPENBABEL_LIBRARIES=$INSTDIR/openbabel-2.3.2/install/lib/libopenbabel.so.4.0.2 -DPLplot_INCLUDE_DIR=$INSTDIR/plplot.git/install/include -DPLplot_cxx_LIBRARY=$INSTDIR/plplot.git/install/lib/libplplotcxx.so -DPLplot_LIBRARY=$INSTDIR/plplot.git/install/lib/libplplot.so ..
make -j12 rmg-cpu

 

06/26/2017 - Included the disk usage of the large job.
07/10/2017 - The new release of the RMG code, soon to be released by the developers, produces a different total energy and FLOP count than those presented above. Therefore, one should use the rmg-code.tgz included in the SPP tar ball for SPP benchmarking purposes.
07/28/2017 - Updated build instructions with information on how to download plplot library version 5.9.0 to avoid compilation errors.
09/01/2017 - Included output files in the tar file.

VPIC

The code for these tests is based on version 407, which can be downloaded from GitHub.

Description

To simulate plasma, the Vector Particle-In-Cell (VPIC) code follows the movement of charged particles in simulated electric and magnetic fields that interact with the particles.  VPIC integrates the relativistic Maxwell-Boltzmann system in a linear background medium for multiple particle species, in time with an explicit-implicit mixture of velocity Verlet, leapfrog, Boris rotation and exponential differencing based on a reversible phase-space volume conserving 2nd order Trotter factorization.  VPIC can be used in studies of magnetic reconnection of high temperature plasmas (H+ and e-).

Magnetic reconnection is an energy conversion process that occurs within high temperature plasmas, and often produces an explosive release of energy as magnetic fields are reconfigured and destroyed. In space and astrophysical plasmas, the onset of magnetic reconnection occurs within intense current layers, where the magnetic field rapidly rotates.  Most simulations of this process focus on these thin layers, but there are a variety of possibilities to choose from for the initial conditions. While most studies have focused on the Harris-type equilibrium that is relevant to the Earth’s magnetosphere, there has been growing interest in the past few years on force-free current sheets that are thought to be more relevant to the solar atmosphere, as well as many astrophysical problems. 

The objective of kinetic simulations is to understand the three-dimensional evolution of force-free current layers.  One important first step is to understand the 3D evolution of tearing modes – a type of plasma instability that spontaneously produces magnetic reconnection while giving rise to topological changes in the magnetic field.

Input Problem

The number of mesh cells and the number of particles are the key parameters used to determine problem size.  The full-size 147,456-XE-task SPP input problem (4,608 nodes) specifies a cell grid of 1536 x 1536 x 1536 with 1.15964 trillion particles, run over 4683 time steps.  A smaller 4,608-XE-task test problem (144 nodes) specifies a cell grid of 1200 x 1200 x 200 with 86.4 billion particles, run over 419 time steps.

For a typical use case, VPIC writes a significant amount of output data, and the SPP problem is no exception. Data dumps for t=0 and t=1 are written to the fields directory, a data dump for t=0 is written to hydro, and grid files are written to rundata. For the given problem, approximately 100 GB of data is written for each field dump, 331 GB is written for the hydro dump, and the grid files consume another 361 GB, for a total of around 891 GB (and ~737k files).

Due to the fairly short run time of the SPP problem (a little over an hour), no checkpoints are written.

Instructions

Download

Download vpic_spp_20170127.tar.gz from the vpic directory in the shared benchmarks directory using Globus Online.

The files and directory in the base directory of the tarball include:

build_vpic_407 - script for building VPIC version 407
bw_gnu  - machine configuration file for Blue Waters that specifies build parameters/flags for the gnu environment
ideck_bld_template - input deck build script template file
runs - directory containing 4608-task test and 147,456-task SPP problem
README - instructions for building VPIC

Note that the parameters used in the given build are fairly generic.  They do not include much in the way of architecture-specific optimizations.

Build

Using VPIC begins with compiling the VPIC code, which creates the library libvpic.bw_gnu.a.  An input deck (a standalone C++ file) is then compiled and linked against this library, producing an executable that is specific to that simulation.  Whenever any input parameter changes, the input deck must be recompiled to create a new executable.

To build VPIC, simply run the build_vpic_407 script, which does the following:

1. Downloads vpic-407 from github
2. Copies the bw_gnu machine configuration file into the machine directory
3. Compiles VPIC, which creates the libvpic.bw_gnu.a library and build.bw_gnu script that compiles input decks, linking them with libvpic.bw_gnu.a
4. Creates the build_input_deck script from ideck_bld_template by inserting the location of build.bw_gnu

To compile an input deck, simply run the build_input_deck script with the input deck C++ file name as the sole parameter.  The examples in the runs directory already have soft links to the build_input_deck script for convenience.

This is the entire procedure for compiling VPIC and an input deck:

./build_vpic_407
cd runs/<input_problem_directory>
./build_input_deck <input_deck>.cxx

Run

Each directory in runs contains a batch script, which must be submitted with qsub.

[from runs/<input_problem_directory>]
qsub bw.pbs

Results

The simulation history files in runs/<input_problem_directory>/rundata contain integrated quantities, such as the various energy components, that can be used to compare runs.  One of the teams that has run VPIC on Blue Waters validated the results of the original SPP run.  The results below, taken from the new SPP run's runs/147456_tasks/rundata/energies file, are almost identical to the energy values of the original SPP run.  (A small sketch for parsing and comparing these energy histories follows the listings below.)

% Layout
% step ex ey ez bx by bz "ion" "electron"
% timestep = 8.541188e-02
0    0.000000e+00 0.000000e+00 0.000000e+00 2.801200e+06 4.820000e+07 6.749999e+02 8.097715e+05 7.939355e+05
500  2.040838e+02 2.090150e+02 2.005151e+02 2.800486e+06 4.820008e+07 7.323829e+02 8.097900e+05 7.939277e+05
1000 3.217724e+02 3.218211e+02 2.576881e+02 2.800315e+06 4.820015e+07 7.690864e+02 8.098353e+05 7.937209e+05
1500 4.257822e+02 4.168414e+02 3.086536e+02 2.800873e+06 4.820023e+07 7.972705e+02 8.098812e+05 7.928314e+05
2000 5.194471e+02 5.129934e+02 3.544362e+02 2.800346e+06 4.820029e+07 8.168990e+02 8.099351e+05 7.930669e+05
2500 6.034148e+02 5.974236e+02 3.957946e+02 2.800243e+06 4.820036e+07 8.347893e+02 8.099939e+05 7.928957e+05
3000 6.799359e+02 6.662084e+02 4.328894e+02 2.800551e+06 4.820045e+07 8.525022e+02 8.100612e+05 7.923314e+05
3500 7.501141e+02 7.377953e+02 4.667192e+02 2.800000e+06 4.820050e+07 8.676510e+02 8.101408e+05 7.926588e+05
4000 8.149924e+02 8.069576e+02 4.969732e+02 2.799996e+06 4.820055e+07 8.848753e+02 8.102376e+05 7.924360e+05
4500 8.754572e+02 8.600517e+02 5.265210e+02 2.800107e+06 4.820058e+07 9.048031e+02 8.103561e+05 7.921173e+05

The following are the energy values for the 4608-task run.

% Layout
% step ex ey ez bx by bz "ion" "electron"
% timestep = 1.428942e-01
0   0.000000e+00 0.000000e+00 0.000000e+00 2.400000e+05 2.500000e+05 0.000000e+00 6.438001e+04 6.437946e+04
100 9.340233e+00 1.028762e+01 9.081234e+00 2.396366e+05 2.500056e+05 2.730003e+00 6.454416e+04 6.454361e+04
200 1.310648e+01 1.420859e+01 1.302465e+01 2.396818e+05 2.500065e+05 4.632905e+00 6.451517e+04 6.451465e+04
300 1.692604e+01 1.784077e+01 1.670971e+01 2.396832e+05 2.500085e+05 6.435436e+00 6.450799e+04 6.450746e+04
400 2.032134e+01 2.180119e+01 2.024031e+01 2.396832e+05 2.500101e+05 8.152853e+00 6.450195e+04 6.450141e+04
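
The energies files above are whitespace-separated: '%'-prefixed comment lines followed by one row per logged step (step, ex, ey, ez, bx, by, bz, ion, electron).  The following is a minimal parsing-and-comparison sketch under that assumption; the file paths are placeholders, not files shipped with the benchmark.

# Sketch: compare the integrated-energy histories of two runs.
# Assumes the layout shown above: '%'-prefixed comments, then numeric columns.
def read_energies(path):
    rows = {}
    with open(path) as f:
        for line in f:
            if line.startswith("%") or not line.strip():
                continue
            cols = line.split()
            rows[int(cols[0])] = [float(v) for v in cols[1:]]
    return rows

ref = read_energies("reference/energies")   # placeholder path to a validated run
new = read_energies("rundata/energies")     # placeholder path to the new run

for step in sorted(set(ref) & set(new)):
    rel = max(abs(a - b) / max(abs(a), 1e-30) for a, b in zip(ref[step], new[step]))
    print(f"step {step:6d}: max relative difference = {rel:.3e}")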

Timing and Reference FLOP count

The total FLOP counts were calculated from OVIS data collected by the system during the runs.

The run time was taken from the "real" line of the output of the time command used in conjunction with aprun (i.e., time aprun <parameters>), which appears at the end of the vpic.err file.  For example:

real 4217.83
user 1.88
sys 1.02
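
For convenience, here is a small sketch that pulls that wall-clock value out of vpic.err, assuming the real/user/sys lines appear at the end of the file as shown above.

# Sketch: extract the wall-clock ("real") time from a vpic.err file.
def wallclock_seconds(path="vpic.err"):
    seconds = None
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[0] == "real":
                seconds = float(parts[1])   # keep the last occurrence in the file
    return seconds

print(wallclock_seconds())   # e.g., 4217.83 for the SPP run above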
Problem Parallelism XE Fraction Cell Grid Total Particles Time Steps Total Flops Walltime (s) BW Task Flop Rate
Small 4,608 0.636% 1200 x 1200 x 200 8.64 x 10^10 419 7.77 x 10^15 1,032 1.63 GFlops/s
SPP 147,456 20.4% 1536 x 1536 x 1536 1.15964 x 10^12 4,683 1.06 x 10^18 4,218 1.70 GFlops/s
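
The per-task rates in the last column follow directly from the other columns; the short sketch below reproduces them from the table values (nothing here is newly measured).

# Sketch: reproduce the "BW Task Flop Rate" column from the table above.
runs = {
    "Small": dict(flops=7.77e15, walltime=1032, tasks=4608),
    "SPP":   dict(flops=1.06e18, walltime=4218, tasks=147456),
}
for name, r in runs.items():
    rate = r["flops"] / r["walltime"] / r["tasks"] / 1e9   # GFlops/s per task
    print(f"{name}: {rate:.2f} GFlops/s per task")   # 1.63 and 1.70, matching the table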

Changelog

7/7/2017 - Added energy value results for the small (4608-task) problem.

8/2/2017 - Corrected the cell grid listed for the small problem in the "Timing and Reference FLOP count" section table (it's 1200 x 1200 x 200, not 1200 x 1200 x 1200).

8/15/2017 - Removed a few unnecessary lines that were commented out in build_vpic_407.

9/6/2017 - Added a tarball to the Globus vpic shared benchmarks directory named vpic_spp_large_output_files.tar.gz, which contains the info, global.vpc, and vpic.err (stderr) files from the 147,456-task run.

WRF

Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is suitable for a broad spectrum of applications across scales ranging from meters to thousands of kilometers. The WRF-ARW core is based on an Eulerian solver that uses a third-order Runge-Kutta time-integration scheme coupled with a split-explicit second-order time-integration scheme.

Input Description

The primary input and boundary-condition data files are "wrfinput_d01" and "wrfbdy_d01", respectively.  For historical reasons, the input file may have a single letter appended to its name (e.g., "wrfinput_d01a").  They represent conditions appropriate for a simulation of Hurricane Sandy.  A "namelist.input" file is also required.

Instructions

Download

Download the source code, job scripts and data from the wrf directory in the shared benchmarks directory using Globus Online.  This is a slightly modified version of WRF version 3.3.1.

Make/Build

First, make sure that the "build_it", "compile", and "clean" files are executable:

cd WRFV3

chmod +x ./build_it ./clean ./compile

Run the "build_it" script in the "WRFV3" directory to produce executable "WRFV3/main/wrf.exe".

Make sure that the files "grid_order", "wrfetime", and "waitforfile.sh" are executable (this should already be the case):

chmod +x ./grid_order ./wrfetime ./waitforfile.sh

Run Test Case (2.01% of XE Nodes)

The "namelist.input" files for the small case are in a sub-directory called "456node_namelist_tac".  Copy these files to the main working directory (i.e., the directory containing the input and boundary condition files, "wrfinput_d01a" and "wrfbdy_d01", respectively):

cd 456node_namelist_tac

/bin/cp -f namelist* ..

cd ..

 

Modify the path to the WRF executable in "runit.xe.omp.scale.ideal.tac", if necessary, then use the following command to run the small Hurricane Sandy benchmark (assumes bash as the default shell):

./qsub.xe.omp.scale.ideal.tac 64 114 16 2 4 a

This command runs the Hurricane Sandy benchmark for 900 time steps on 2.01% of the XE nodes, using 16 cores per node and 2 OpenMP threads per MPI rank.

Run SPP Problem (20.1% of XE Nodes)

The "namelist.input" files for the large case are in a sub-directory called "4560node_namelist_tac".  Copy these files to the main working directory (i.e., the directory containing the input and boundary condition files, "wrfinput_d01a" and "wrfbdy_d01", respectively):

cd 4560node_namelist_tac

/bin/cp -f namelist* ..

cd ..

 

Modify the path to the WRF executable in "runit.xe.omp.scale.ideal.tac", if necessary, then use the following command to run the large Hurricane Sandy benchmark (assumes bash as the default shell):

./qsub.xe.omp.scale.ideal.tac 320 228 16 2 4 a

This command runs the Hurricane Sandy benchmark for 9000 time steps on 20.1% of the XE nodes, using 16 cores per node and 2 OpenMP threads per MPI rank.
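
As a cross-check of the quoted node fractions, the sketch below assumes Blue Waters' 22,640 XE nodes and treats the first two script arguments (64 x 114 and 320 x 228) as the MPI task decomposition; the latter interpretation is an assumption, not something stated in the run scripts.

# Sketch: check the quoted XE-node fractions and the apparent MPI decomposition.
# Assumptions: Blue Waters has 22,640 XE nodes, and the first two script
# arguments form the MPI task grid (64 x 114 and 320 x 228).
XE_NODES = 22640

for label, nodes, decomp in [("Test Case", 456, (64, 114)),
                             ("SPP Problem", 4560, (320, 228))]:
    tasks = decomp[0] * decomp[1]
    print(f"{label}: {100 * nodes / XE_NODES:.2f}% of XE nodes, "
          f"{tasks} MPI tasks = {tasks // nodes} per node")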

Check Results

Job results are summarized in a file whose name begins with "NSTRIPE_WRF" and is customized with the problem name, problem size, and job ID.  The bottom of this file should contain a message indicating "WRF Success".

The actual WRF simulation data are written to a directory whose name begins with "APRIL_23" and is customized with the problem size and job ID.  Each MPI rank also writes "rsl.out.XXXX" and "rsl.error.XXXX" files into this directory; check the corresponding "0000" files for additional status information.
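
The check described above can be scripted; the following is a minimal sketch that only assumes the NSTRIPE_WRF* naming and the "WRF Success" message mentioned in this section.

# Sketch: look for the "WRF Success" message near the end of each job summary file.
# Assumes the NSTRIPE_WRF* naming described above; the exact suffix varies by job.
import glob

for summary in glob.glob("NSTRIPE_WRF*"):
    with open(summary, errors="replace") as f:
        tail = f.readlines()[-20:]
    ok = any("WRF Success" in line for line in tail)
    print(f"{summary}: {'WRF Success found' if ok else 'success message not found'}")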

Timing and Reference FLOP count

Problem Fraction of XE Nodes Time Steps Total GFLOPs Walltime (s)
Test Case 2.01% 900 29,209,000 3,220
SPP Problem 20.1% 9000 292,090,000 10,260

Changelog

September 2017:   A sample output directory for the 4560-node WRF run, "APRIL_23.320x228_N16.7565719", is provided. This run used tuned Lustre striping to reduce the I/O time and has a smaller time to solution than the baseline run reported in the table above.

October 2017:  Another sample output directory for the 4560-node WRF run, "APRIL_23.320x228_N16.7607983", is also provided.  This directory corresponds to a non-optimized baseline run, as reported in the table above.