
I/O Benchmark Instructions and Results

Introduction

Benchmarks play a critical role in the performance evaluation of systems.  The I/O-2017 benchmarks serve three purposes:

  1. The benchmarks are carefully chosen to represent characteristics of the current Blue Waters and possible future workloads, which consist of solving complex scientific problems using diverse computational techniques at high degrees of parallelism.
  2. The benchmarks provide concrete data on the performance and scalability of the systems under consideration.
  3. The benchmarks can be used as part of system acceptance and/or regression testing and as a measurement of performance throughout the operational lifetime of the system.

Observed benchmark performance should be obtained from a system under consideration or a system configured as closely as possible to the target system.  For systems targeted at supporting highly parallel computation, it is critical that the evaluators provide observed benchmark performance using the largest (most parallel) test inputs as well as other sizes.  The largest scale jobs in the benchmark suite should not be interpreted as the limit for the job concurrency for the target system. Performance projections are permissible if they are derived from a similar system that is considered an earlier generation and/or smaller system.  Projections should be rigorously derived, using best practices for application and system performance modeling and be thoroughly documented and easily understood.  In the tables below, the “Projected” column refers to the value projected for the full system target where direct measurements are not possible. 

Submission Guidelines

The benchmark results (or projections including original results) for the target system should be recorded in the tables provided at the end of this document.  Additionally, the evaluator should submit the completed tables, benchmark codes and output files, as well as documentation on any code optimizations or configuration changes.  The submitted source should be in a form that can be readily compiled on the target system.  Please do not include object and executable files, core dump files or large binary data files.  An audit trail should be supplied for any changes made to the benchmark codes.  The audit trail should be sufficient to demonstrate that the changes made conform to the spirit of the benchmark and do not violate the specific restrictions on the various benchmark codes.  The compile and run logs should also be submitted.

If performance projections are used, this should be clearly indicated.  The output files on which the projections are based and a description of the projection method should be included.  In addition, each system used for benchmark projections should be described in Table 2 below.  Each projection in the benchmark results tables should indicate on which system the benchmark was originally run: enter the corresponding letter from the “System” column of Table 2 into the “System” column of the benchmark results table.

Run Rules

The run rules, which are included in the source distribution, supply specific requirements and instructions for compiling, executing, verifying numerical correctness and reporting results for each benchmark.  Benchmark performance should only be accepted from runs that exhibit correct execution.  Only software tools and libraries that will be included for general use on the target system as supported product offerings may be used to build and execute the benchmarks.

Message-passing programs should be built using an implementation that supports 64-bit virtual memory pointers and a thread-safe communication library that implements the MPI standard.  All tests are to be run in 64-bit floating-point mode unless otherwise allowed.

Benchmark Descriptions

The I/O Tests, listed in Table 1, are simple focused tests that are easily compiled and executed.  The results allow a uniform comparison of features and provide an estimation of system balance.  Descriptions and requirements for each test are included in the source distribution.  The results for the target system should be recorded in Table 3 under the column “Benchmarked Rate”.  In the event that benchmark results are being projected, both the “Benchmarked Rate” and “Projected Rate” columns should be filled out.

 

Modifications to the Lower Level 2017 Benchmark are only permissible to enable correct execution on the target platform.  No changes related to optimization are permissible except in the case of the NAS FT benchmark where the values for fftblock_default and fftblockpad_default may be changed to suit the target architecture.

 

Table 1.  I/O Tests

Benchmark    Purpose
IOR          Characteristic bulk I/O operations
mdtest       Metadata performance tests
FLASH I/O    Application-specific I/O performance

 

All tests are to be run in fully packed mode unless otherwise described below.  In architectures with multiple cores per node, “fully packed” means that the number of instances, threads or MPI tasks per node should at least equal the total number of physical cores available on the node.
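
For example, a minimal launch sketch for a fully packed MPI run, assuming a Blue Waters style XE node with 32 physical cores and the Cray aprun launcher (both are assumptions; substitute the core count and launcher appropriate to the target system):

# Fully packed: one MPI task per physical core on every node (node and core counts are illustrative).
NODES=16
CORES_PER_NODE=32
aprun -n $((NODES * CORES_PER_NODE)) -N ${CORES_PER_NODE} ./ior -a POSIX -w -t 4m -b 4g -F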

 

Please submit issues regarding the benchmarks to help+bw@ncsa.illinois.edu.

IOR

IOR can be downloaded from the LLNL IOR github site https://github.com/LLNL/ior
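
A typical way to fetch and build IOR from that repository is sketched below; the autotools bootstrap step and the use of the Cray cc wrapper as the MPI compiler are assumptions and may differ by IOR version and site:

git clone https://github.com/LLNL/ior.git
cd ior
./bootstrap                # generate the configure script (requires autoconf/automake)
./configure MPICC=cc       # point configure at the site MPI compiler wrapper
make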

1. Verify that the aggregate IOR write I/O performance across all file systems is at least 1 TB/s when using an optimal number of client nodes.  The number of clients is site dependent.  Each IOR test will run until it reaches a steady state of operations per second, but no longer than five minutes, as controlled by the stonewall option.  The performance will be the average of the IOR write performance across 10 repetitions.  Command for standard IOR: IOR -a POSIX -w -o [directory_path]/POSIX -D 180 -b 4096g -E -C -g -k -s 1 -t 4m -F -e
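
One way to drive the ten repetitions and average the reported write bandwidth is sketched below; the aprun launcher, the NUM_CLIENTS and DIR variables, and the position of the bandwidth field in the IOR summary output are all assumptions that depend on the site and IOR version:

for i in $(seq 1 10); do
  aprun -n ${NUM_CLIENTS} ./IOR -a POSIX -w -o ${DIR}/POSIX -D 180 -b 4096g -E -C -g -k -s 1 -t 4m -F -e \
    | tee ior_write_rep${i}.out
done
# Average the summary write bandwidth over the 10 runs (field position depends on IOR version).
grep -h "Max Write" ior_write_rep*.out | awk '{sum += $3; n++} END {print sum/n, "MiB/sec"}'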

2. Verify that a single file system server achieves a specific read and write performance using IOR separately, and that as pieces of the storage environment are added, performance continues to scale.  Each IOR test will run until it reaches a steady state of operations per second, but no longer than five minutes, as controlled by the stonewall option.  To test the scalability of the environment, after establishing single-server performance repeat the process after adding 1/3 of the server systems, then 2/3, and finally 3/3.  At each interval, the I/O performance needs to scale within 10% of linear.  The performance will be the average of the IOR write/read performance across three repetitions for each of the 1/3, 2/3 and 3/3 configurations.  There will be at least 3 separate iterations for each "file system" served by the number of servers.  This shows that as servers and disks are added, the environment continues to scale in performance with each addition.

WRITE IOR: IOR -a POSIX -w -o [directory_path]/POSIX -D 180 -b 4096g -E -C -g -k -s 1 -t 4m -F -e   

READ IOR: IOR -a POSIX -r -o [directory_path]/POSIX -D 120 -b ${SMALLEST} -E -C -g -k -s 1 -t 4m -F -e

$SMALLEST is the smallest file from the IOR write job.
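
A possible way to determine ${SMALLEST} before the read pass, assuming the write pass left its file-per-process files under a directory referred to here as ${DIR} (the variable names and launcher are assumptions):

# Size in bytes of the smallest file produced by the write job.
SMALLEST=$(ls -l ${DIR}/POSIX* | awk '{print $5}' | sort -n | head -1)
aprun -n ${NUM_CLIENTS} ./IOR -a POSIX -r -o ${DIR}/POSIX -D 120 -b ${SMALLEST} -E -C -g -k -s 1 -t 4m -F -e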

 

Sample run on 4320 XE nodes of Blue Waters

 

 

Test type   Nodes   Bandwidth            IOR parameters
Max Write   4320    732912.20 MiB/sec    -a POSIX -w -D 180 -b 4096g -E -C -g -k -s 1 -t 4m -F -e
Max Read    4320    432682.33 MiB/sec    -a POSIX -r -D 120 -b 14763950080 -E -C -g -k -s 1 -t 4m -F -e

3. All client nodes will be concurrently writing individual 1 GB files using IOR, and performance will be at least 50% of optimal performance at every step.

Process: A single job executes IOR simultaneously on each client node in the system to write individual 1 GB files concurrently to the /scratch file system.  Each rank will operate on its own file (file-per-process).

The IOR command will be run with the following parameters (a launch sketch follows the list):

  • 1 rank per node
  • 22600 nodes
  • Buffered I/O
  • POSIX file-per-process mode
  • Random file placement using "lfs setstripe -c 1"
  • Transfer size of 16MB
  • "-w -a POSIX -b 1g -E -C -e -g -k -s 1 -t 16m -F"

4. All client nodes will be concurrently reading individual 1 GB files using IOR, and performance will be at least 50% of optimal performance at every step.

Process: A single job executes IOR simultaneously on each client node in the system to read individual 1 GB files concurrently from the /scratch file system.  Each rank will operate on its own file (file-per-process).

The IOR command will be run with the following parameters (a launch sketch follows the list):

  • 1 rank per node
  • 22600 nodes
  • Buffered I/O
  • POSIX file-per-process mode
  • Random file placement using "lfs setstripe -c 1"
  • Transfer size of 16MB
  • "-w -a POSIX -b 1g -E -C -e -g -k -s 1 -t 16m -F"

MDTEST

The metadata test can be downloaded from the LANL mdtest GitHub site.

1.  Concurrent creation of files at an aggregate rate of > 25,000 mean creates/second using mdtest on up to all client nodes.  This is run as two tests: one with all files in a single directory (this directory must already contain 1 million files) and one with every file in a separate directory.

mdtest -b 3 -v -z 3 -u -I 900 -k -i 4 -d [test_dir]

2.  Concurrent deletion of those files (from the step above) at an aggregate rate of > 30,000 mean deletes/second using mdtest on up to all client nodes.  This test is to be run twice: once with all files in a single directory and a second time with every file in a separate directory.

 mdtest -b 3 -v -z 3 -u -I 900 -k -i 4 -d [test_dir]

3. Concurrent stat() of those files from each node at an aggregate rate of > 40,000 mean stat() calls/second using mdtest on up to all XE6 client nodes.  This test is to be run twice: once with all files in a single directory and a second time with every file in a separate directory.

 mdtest -b 3 -v -z 3 -u -I 900 -k -i 4 -d [test_dir]
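
A possible way to launch the mdtest command above across client nodes is sketched below; the aprun launcher, the node count, 32 tasks per node and the scratch path are assumptions (the sample run below used 48 XE nodes).  Note that a single mdtest invocation performs creation, stat and removal phases and reports a rate for each, which is why the same command line appears for all three tests above.

# 48 fully packed XE nodes at 32 tasks per node = 1536 mdtest tasks (illustrative values).
aprun -n $((48 * 32)) -N 32 ./mdtest -b 3 -v -z 3 -u -I 900 -k -i 4 -d /scratch/mdtest_run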

Sample run on 48 XE nodes of Blue Waters

mdtest options:  -i 20 -v -u -n 1041

Test type            Max (ops/sec)   Min (ops/sec)   Avg (ops/sec)   Std. dev.
Directory creation   49624.992       24935.055       47542.806       5211.479
Directory stat       162900.785      154878.512      158480.51       2169.141
Directory removal    73602.499       21432.838       67091.509       13058.863
File creation        46562.997       24553.01        43254.472       5306.844
File stat            79607.828       74170.735       77641.366       1185.805
File removal         73452.214       34063.653       43862.148       14615.957
Tree creation        505.277         342.001         456.44          37.577
Tree removal         256.674         2.624           226.869         52.541

 

 


FLASH application I/O kernel benchmark

The FLASH I/O kernel is based on typical I/O operations generated by users of the FLASH application.  In some ways it is more representative of application I/O patterns than synthetic benchmarks such as IOR.

An example of running the code is provided below.

Instructions

Download code

git clone https://github.com/live-clones/pnetcdf.git
 
# Drill down to  FlashIO benchmark
cd pnetcdf/benchmarks/FLASH-IO
 
# Load Required modules on Blue Waters.
module load cray-parallel-netcdf
module load autoconf/2.69
 
# configure and build
autoreconf
./configure --with-pnetcdf=${CRAY_PARALLEL_NETCDF_DIR}
make
 
# create a testdir on scratch and set the striping for a Lustre file system
mkdir -p flashIO/test1
lfs setstripe -c 16 flashIO/test1
 

Sample run from Blue Waters

% aprun -n 64 ./flash_benchmark_io /scratch/path-to-directory/flashIO/test1/iotest.out
 number of guards      :             4
 number of blocks      :            80
 number of variables   :            24
 checkpoint time       :             3.51  sec
        max header     :             0.24  sec
        max unknown    :             3.27  sec
        max close      :             0.00  sec
        I/O amount     :          3888.06  MiB
 plot no corner        :             0.58  sec
        max header     :             0.01  sec
        max unknown    :             0.56  sec
        max close      :             0.00  sec
        I/O amount     :           324.51  MiB
 plot    corner        :             0.38  sec
        max header     :             0.01  sec
        max unknown    :             0.37  sec
        max close      :             0.00  sec
        I/O amount     :           389.13  MiB
 -------------------------------------------------------
 File base name        : /scratch/path-to-directory/flashIO/test1/iotest.out
   file striping count :            16
   file striping size  :       1048576     bytes
 Total I/O amount      :          4601.70  MiB
 -------------------------------------------------------
 nproc    array size      exec (sec)   bandwidth (MiB/s)
   64    16 x  16 x  16      4.47     1030.32
 
Nodes   Tasks   Array size     Time (s)   Bandwidth (MiB/s)
2       64      16 x 16 x 16   4.47       1030.32

Result Tables

Table 2. System Description

Enter the system details in this table for each system used in benchmarking.  Use the system label to refer to the system corresponding to each test in the following tables.

 

System   Processor   Clock/MHz   Interconnect   Total Core Count
A
B
C

For each application run, enter the run time variation in the column marked COV.

Table 3. I/O Test Results

IOR

System   Benchmarked Rate   Projected Rate   Units


mdtest

System   Benchmarked Rate   Projected Rate   Units


FLASH

System   Benchmarked Rate   Projected Rate   Units