
I/O Benchmark Instructions and Results

Introduction

Benchmarks play a critical role in the performance evaluation of systems.  The I/O-2017 benchmarks serve three purposes:

  1. The benchmarks are carefully chosen to represent characteristics of the current Blue Waters and possible future workloads, which consist of solving complex scientific problems using diverse computational techniques at high degrees of parallelism.
  2. The benchmarks provide concrete data on the performance and scalability of the systems under consideration.
  3. The benchmarks can be used as part of system acceptance and/or regression testing and as a measurement of performance throughout the operational lifetime of the system.

Observed benchmark performance should be obtained from a system under consideration or a system configured as closely as possible to the target system.  For systems targeted at supporting highly parallel computation, it is critical that the evaluators provide observed benchmark performance using the largest (most parallel) test inputs as well as other sizes.  The largest scale jobs in the benchmark suite should not be interpreted as the limit for the job concurrency for the target system. Performance projections are permissible if they are derived from a similar system that is considered an earlier generation and/or smaller system.  Projections should be rigorously derived, using best practices for application and system performance modeling and be thoroughly documented and easily understood.  In the tables below, the “Projected” column refers to the value projected for the full system target where direct measurements are not possible. 

Submission Guidelines

The benchmark results (or projections including original results) for the target system should be recorded in the tables provided at the end of this document.  Additionally, the evaluator should submit the completed tables, benchmark codes and output files, as well as documentation on any code optimizations or configuration changes.  The submitted source should be in a form that can be readily compiled on the target system.  Please do not include object and executable files, core dump files or large binary data files.  An audit trail should be supplied for any changes made to the benchmark codes.  The audit trail should be sufficient to demonstrate that the changes made conform to the spirit of the benchmark and do not violate the specific restrictions on the various benchmark codes.  The compile and run logs should also be submitted.

If performance projections are used, this should be clearly indicated.  The output files on which the projections are based and a description of the projection method should be included.  In addition, each system used for benchmark projections should be described in Table 2 below.  Each projection in the benchmark results tables should indicate on which system the benchmark was originally run: enter the corresponding letter from the “System” column of Table 2 into the “System” column of the benchmark results table.

Run Rules

The run rules, which are included in the source distribution, supply specific requirements and instructions for compiling, executing, verifying numerical correctness and reporting results for each benchmark.  Benchmark performance should only be accepted from runs that exhibit correct execution.  Only software tools and libraries that will be included for general use on the target system as supported product offerings may be used to build and execute the benchmarks.

Message-passing programs should be built using an implementation that supports 64-bit virtual memory pointers and a thread-safe communication library that implements the MPI standard.  All tests are to be run in 64-bit floating-point mode unless otherwise allowed.

Benchmark Descriptions

The I/O Tests, listed in Table 1, are simple focused tests that are easily compiled and executed.  The results allow a uniform comparison of features and provide an estimation of system balance.  Descriptions and requirements for each test are included in the source distribution.  The results for the target system should be recorded in Table 3 under the column “Benchmarked Rate”.  In the event that benchmark results are being projected, both the “Benchmarked Rate” and “Projected Rate” columns should be filled out.

 

Modifications to the Lower Level 2017 Benchmark are only permissible to enable correct execution on the target platform.  No changes related to optimization are permissible except in the case of the NAS FT benchmark where the values for fftblock_default and fftblockpad_default may be changed to suit the target architecture.

 

Table 1.  I/O Tests

Benchmark    Purpose
IOR          Characteristic bulk I/O operations
mdtest       Metadata performance tests
FLASH I/O    Application-specific I/O performance

 

All tests are to be run in fully packed mode unless otherwise described below.  In architectures with multiple cores per node, “fully packed” means that the number of instances, threads or MPI tasks per node should at least equal the total number of physical cores available on the node.
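
For example, a minimal launch sketch for a fully packed MPI run, assuming a Blue Waters style XE node with 32 physical cores and the Cray aprun launcher (both are assumptions; substitute the core count and launcher appropriate to the target system):

# Fully packed: one MPI task per physical core on every node (node and core counts are illustrative).
NODES=16
CORES_PER_NODE=32
aprun -n $((NODES * CORES_PER_NODE)) -N ${CORES_PER_NODE} ./ior -a POSIX -w -t 4m -b 4g -F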

 

Please submit issues regarding the benchmarks to help+bw@ncsa.illinois.edu.

IOR

IOR can be downloaded from the LLNL IOR github site https://github.com/LLNL/ior
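
A typical way to fetch and build IOR from that repository is sketched below; the autotools bootstrap step and the use of the Cray cc wrapper as the MPI compiler are assumptions and may differ by IOR version and site:

git clone https://github.com/LLNL/ior.git
cd ior
./bootstrap                # generate the configure script (requires autoconf/automake)
./configure MPICC=cc       # point configure at the site MPI compiler wrapper
make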

1. Verify that the aggregate IOR write I/O performance across all file systems is at least 1 TB/s when using an optimal number of client nodes.  The number of clients is site dependent.  Each IOR test will run until it reaches a steady state of operations per second, but no longer than five minutes, as controlled by the stonewall option.  The performance will be the average of the IOR write performance across 10 repetitions.  Command for standard IOR: IOR -a POSIX -w -o [directory_path]/POSIX -D 180 -b 4096g -E -C -g -k -s 1 -t 4m -F -e
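
One way to drive the ten repetitions and average the reported write bandwidth is sketched below; the aprun launcher, the NUM_CLIENTS and DIR variables, and the position of the bandwidth field in the IOR summary output are all assumptions that depend on the site and IOR version:

for i in $(seq 1 10); do
  aprun -n ${NUM_CLIENTS} ./IOR -a POSIX -w -o ${DIR}/POSIX -D 180 -b 4096g -E -C -g -k -s 1 -t 4m -F -e \
    | tee ior_write_rep${i}.out
done
# Average the summary write bandwidth over the 10 runs (field position depends on IOR version).
grep -h "Max Write" ior_write_rep*.out | awk '{sum += $3; n++} END {print sum/n, "MiB/sec"}'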

2. Verify that a single file system server achieves a specific read and write performance using IOR separately, and that as pieces of the storage environment are added, performance continues to scale.  Each IOR test will run until it reaches a steady state of operations per second, but no longer than five minutes, as controlled by the stonewall option.  To test the scalability of the environment, after establishing single-server performance repeat the process after adding 1/3 of the server systems, then 2/3, and finally 3/3.  At each interval, the I/O performance needs to scale within 10% of linear.  The performance will be the average of the IOR write/read performance across three repetitions for each of the 1/3, 2/3 and 3/3 configurations.  There will be at least 3 separate iterations for each "file system" served by the number of servers.  This shows that as servers and disks are added, the environment continues to scale in performance with each addition.

WRITE IOR: IOR -a POSIX -w -o [directory_path]/POSIX -D 180 -b 4096g -E -C -g -k -s 1 -t 4m -F -e   

READ IOR: IOR -a POSIX -r -o [directory_path]/POSIX -D 120 -b ${SMALLEST} -E -C -g -k -s 1 -t 4m -F -e

$SMALLEST is the smallest file from the IOR write job.
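
A possible way to determine ${SMALLEST} before the read pass, assuming the write pass left its file-per-process files under a directory referred to here as ${DIR} (the variable names and launcher are assumptions):

# Size in bytes of the smallest file produced by the write job.
SMALLEST=$(ls -l ${DIR}/POSIX* | awk '{print $5}' | sort -n | head -1)
aprun -n ${NUM_CLIENTS} ./IOR -a POSIX -r -o ${DIR}/POSIX -D 120 -b ${SMALLEST} -E -C -g -k -s 1 -t 4m -F -e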

 

Sample run on 4320 XE nodes of Blue Waters

 

 

Test type   Nodes   Bandwidth            IOR parameters
Max Write   4320    732912.20 MiB/sec    -a POSIX -w -D 180 -b 4096g -E -C -g -k -s 1 -t 4m -F -e
Max Read    4320    432682.33 MiB/sec    -a POSIX -r -D 120 -b 14763950080 -E -C -g -k -s 1 -t 4m -F -e

3. All client nodes will be concurrently writing individual 1 GB files using IOR, and performance will be at least 50% of optimal performance at every step.

Process: A single job executes IOR simultaneously on each client node in the system to write individual 1 GB files concurrently to the /scratch file system.  Each rank will operate on its own file (file-per-process).

The IOR command will be run with the following parameters (a launch sketch follows the list):

  • 1 rank per node
  • 22600 nodes
  • Buffered I/O
  • POSIX file-per-process mode
  • Random file placement using "lfs setstripe -c 1"
  • Transfer size of 16MB
  • "-w -a POSIX -b 1g -E -C -e -g -k -s 1 -t 16m -F"

4. All client nodes will be concurrently reading individual 1 GB files using IOR, and performance will be at least 50% of optimal performance at every step.

Process: A single job executes IOR simultaneously on each client node in the system to read individual 1 GB files concurrently from the /scratch file system.  Each rank will operate on its own file (file-per-process).

The IOR command will be run with the following parameters (a launch sketch follows the list):

  • 1 rank per node
  • 22600 nodes
  • Buffered I/O
  • POSIX file-per-process mode
  • Random file placement using "lfs setstripe -c 1"
  • Transfer size of 16MB
  • "-w -a POSIX -b 1g -E -C -e -g -k -s 1 -t 16m -F"

MDTEST

The metadata test can be downloaded from the LANL mdtest GitHub site.

1.  Concurrent creation of files at an aggregate rate of > 25,000 mean creates/second using mdtest on up to all client nodes.  This is run as two tests: one with all files in a single directory (this directory must already contain 1 million files) and one with every file in a separate directory.

mdtest -b 3 -v -z 3 -u -I 900 -k -i 4 -d [test_dir]

2.  Concurrent deletion of those files (from the step above) at an aggregate rate of > 30,000 mean deletes/second using mdtest on up to all client nodes.  This test is to be run twice: once with all files in a single directory and a second time with every file in a separate directory.

 mdtest -b 3 -v -z 3 -u -I 900 -k -i 4 -d [test_dir]

3. Concurrent stat() of those files from each node at an aggregate rate of > 40,000 mean stat() calls/second using mdtest on up to all XE6 client nodes.  This test is to be run twice: once with all files in a single directory and a second time with every file in a separate directory.

 mdtest -b 3 -v -z 3 -u -I 900 -k -i 4 -d [test_dir]
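
A possible way to launch the mdtest command above across client nodes is sketched below; the aprun launcher, the node count, 32 tasks per node and the scratch path are assumptions (the sample run below used 48 XE nodes).  Note that a single mdtest invocation performs creation, stat and removal phases and reports a rate for each, which is why the same command line appears for all three tests above.

# 48 fully packed XE nodes at 32 tasks per node = 1536 mdtest tasks (illustrative values).
aprun -n $((48 * 32)) -N 32 ./mdtest -b 3 -v -z 3 -u -I 900 -k -i 4 -d /scratch/mdtest_run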

Sample run on 48 XE nodes of Blue Waters

mdtest options:  -i 20 -v -u -n 1041

Test type            Max (ops/sec)   Min (ops/sec)   Avg (ops/sec)   Std. dev.
Directory creation   49624.992       24935.055       47542.806       5211.479
Directory stat       162900.785      154878.512      158480.51       2169.141
Directory removal    73602.499       21432.838       67091.509       13058.863
File creation        46562.997       24553.01        43254.472       5306.844
File stat            79607.828       74170.735       77641.366       1185.805
File removal         73452.214       34063.653       43862.148       14615.957
Tree creation        505.277         342.001         456.44          37.577
Tree removal         256.674         2.624           226.869         52.541

 

 


FLASH application I/O kernel benchmark

The FLASH I/O kernel is based on typical I/O operations generated by users of the FLASH application.  In some ways it is more representative of application I/O patterns than synthetic benchmarks such as IOR.

An example of running the code is provided below.

Instructions

Download code

git clone https://github.com/live-clones/pnetcdf.git
 
# Drill down to  FlashIO benchmark
cd pnetcdf/benchmarks/FLASH-IO
 
# Load Required modules on Blue Waters.
module load cray-parallel-netcdf
module load autoconf/2.69
 
# configure and build
autoreconf
./configure --with-pnetcdf=${CRAY_PARALLEL_NETCDF_DIR}
make
 
# create a testdir on scratch and set the striping for a Lustre file system
mkdir -p flashIO/test1
lfs setstripe -c 16 flashIO/test1
 

Sample run from Blue Waters

% aprun -n 64 ./flash_benchmark_io /scratch/path-to-directory/flashIO/test1/iotest.out
 number of guards      :             4
 number of blocks      :            80
 number of variables   :            24
 checkpoint time       :             3.51  sec
        max header     :             0.24  sec
        max unknown    :             3.27  sec
        max close      :             0.00  sec
        I/O amount     :          3888.06  MiB
 plot no corner        :             0.58  sec
        max header     :             0.01  sec
        max unknown    :             0.56  sec
        max close      :             0.00  sec
        I/O amount     :           324.51  MiB
 plot    corner        :             0.38  sec
        max header     :             0.01  sec
        max unknown    :             0.37  sec
        max close      :             0.00  sec
        I/O amount     :           389.13  MiB
 -------------------------------------------------------
 File base name        : /scratch/path-to-directory/flashIO/test1/iotest.out
   file striping count :            16
   file striping size  :       1048576     bytes
 Total I/O amount      :          4601.70  MiB
 -------------------------------------------------------
 nproc    array size      exec (sec)   bandwidth (MiB/s)
   64    16 x  16 x  16      4.47     1030.32
 
Nodes   Tasks   Array size     Time (s)   Bandwidth (MiB/s)
2       64      16 x 16 x 16   4.47       1030.32

Result Tables

Table 2. System Description

Enter the system details in this table for each system used in benchmarking.  Use the system label to refer to the system corresponding to each test in the following tables.

 

System   Processor   Clock/MHz   Interconnect   Total Core Count
A
B
C

For each application run, enter the run time variation in the column marked COV.

Table 3. I/O Test Results

IOR

System   Benchmarked Rate   Projected Rate   Units


mdtest

System   Benchmarked Rate   Projected Rate   Units


FLASH

System   Benchmarked Rate   Projected Rate   Units