Using aprun

The aprun command tells ALPS (the Application Level Placement Scheduler) which resources and placement parameters your application needs at launch. At a high level, aprun is similar to mpiexec or mpirun.

The following are the most commonly used options for aprun.

-n: Number of processing elements (PEs) required for the application
-N: Number of PEs to place per node
-S: Number of PEs to place per NUMA node
-d: Number of CPU cores required for each PE and its threads
-j: Number of CPUs to use per compute unit (Bulldozer core-module)
-cc: Bind PEs to CPU cores
-r: Number of CPU cores to be used for core specialization
-ss: Enable strict memory containment per NUMA node
-q: Suppress all non-fatal messages from aprun
-R: Number of fewer PEs tolerated when aprun relaunches the application after a failure

At a minimum, the user should provide the "-n" option to specify the number of PEs required to run the job.

Task placement

By default, MPI processes are assigned to cores in a packed manner. If, for example, you run a single-node pure-MPI job on 8 cores (i.e., aprun -n 8 ...), the MPI ranks will be placed on cores 0-7. If you run a hybrid job with OpenMP, the -d parameter should be set to the number of OpenMP threads per MPI rank. In this case, the MPI ranks will be spaced by the value of -d. So, if you run a single-node job with 8 MPI ranks and 2 OpenMP threads per rank (aprun -n 8 -d 2 ...), the MPI ranks will be placed on cores 0, 2, 4, 6, 8, 10, 12, and 14, and each rank's two OpenMP threads will run on that core and the next one, so the job occupies cores 0 through 15.
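
The arithmetic behind this packed layout can be sketched in a few lines of Python. This is only an illustration of the placement described above, not ALPS code; the function name packed_cores is made up for this example, and it models a single node only.

```python
def packed_cores(n_ranks, depth=1):
    """Model the packed single-node layout: rank r owns `depth` consecutive cores.

    Hypothetical helper for illustration; it mirrors the description of
    `aprun -n <n_ranks> -d <depth>` in the text rather than querying ALPS.
    """
    return {rank: list(range(rank * depth, (rank + 1) * depth))
            for rank in range(n_ranks)}


if __name__ == "__main__":
    # Models: aprun -n 8 -d 2 ...
    for rank, cores in packed_cores(8, depth=2).items():
        print(f"rank {rank}: first core {cores[0]}, threads on cores {cores}")
```

With depth=1 the same function reproduces the pure-MPI case, where ranks 0-7 sit on cores 0-7.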

For the vast majority of codes, it's best to distribute MPI processes among the NUMA nodes to avoid bottlenecks in access to cache, the PCI bus, main memory, etc. Note the core layout of an XE node as shown by /usr/bin/numactl:

> numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16383 MB
node 0 free: 15686 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node 1 free: 15179 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16384 MB
node 2 free: 15869 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16384 MB
node 3 free: 15633 MB
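
To see why packing matters, the numactl output above can be turned into a small lookup that counts how many ranks land on each NUMA node. This is a sketch that hard-codes the XE layout shown above (4 NUMA nodes of 8 cores each); the helper names are hypothetical.

```python
from collections import Counter

NUMA_NODES = 4            # "available: 4 nodes (0-3)" in the numactl output
CORES_PER_NUMA_NODE = 8   # each NUMA node lists 8 cpus


def numa_node_of(core):
    """Return the NUMA node that owns a core on an XE node."""
    last_core = NUMA_NODES * CORES_PER_NUMA_NODE - 1
    if not 0 <= core <= last_core:
        raise ValueError(f"core {core} is outside 0-{last_core}")
    return core // CORES_PER_NUMA_NODE


def ranks_per_numa_node(first_cores):
    """Count how many ranks start on each NUMA node, given each rank's core."""
    counts = Counter(numa_node_of(core) for core in first_cores)
    return [counts[node] for node in range(NUMA_NODES)]


if __name__ == "__main__":
    print(ranks_per_numa_node(range(8)))         # 8 ranks packed on cores 0-7
    print(ranks_per_numa_node(range(0, 32, 4)))  # 8 ranks spaced 4 cores apart
```

The packed case puts all 8 ranks on NUMA node 0, while spacing the ranks 4 cores apart gives 2 ranks per NUMA node.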

There are a couple of ways to modify ALPS' task placement behavior.

Simple task placement: -n, -N, -d, and -j

The simple way of evenly distributing the tasks on a node is with a combination of the -n, -N, and -d aprun parameters. For a pure-MPI job, specify the total number of MPI processes with -n, the number of MPI processes per node with -N, and then use -d to space them.

Example 1 (single-node job):

aprun -n 4 -N 4 -d 2 ./myexe

Result: MPI processes will be assigned to cores 0, 2, 4, and 6.

Example 2 (2-node job):

aprun -n 8 -N 4 -d 4 ./myexe

Result: MPI processes will be assigned to cores 0, 4, 8, and 12 on both nodes.
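
Examples 1 and 2 can be reproduced with a short model of how -n, -N, and -d interact. The sketch below assumes that ranks fill one node completely before moving to the next; that ordering is an assumption of this example, not something stated above, and the function name place is hypothetical.

```python
def place(n, per_node, depth):
    """Return (rank, node_index, first_core) for each rank.

    Models `aprun -n <n> -N <per_node> -d <depth>` as described in the text.
    Assumes ranks are assigned to nodes in consecutive blocks of `per_node`.
    """
    placement = []
    for rank in range(n):
        node = rank // per_node                  # which node the rank lands on
        first_core = (rank % per_node) * depth   # ranks are spaced by -d
        placement.append((rank, node, first_core))
    return placement


if __name__ == "__main__":
    # Models Example 2: aprun -n 8 -N 4 -d 4 ./myexe
    for rank, node, core in place(8, 4, 4):
        print(f"rank {rank}: node {node}, core {core}")
```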

If you want to place one MPI rank on each of the 16 Bulldozer core-modules in a node, simply use -N 16 -d 2. Alternatively, -j may be used instead of -d. The -j parameter specifies the number of CPUs to be allocated per compute unit, which is a Bulldozer core-module on Blue Waters. As there are two integer cores per Bulldozer core-module, the valid values for -j are 0 (use the system default), 1 (use one integer core per core-module), and 2 (use both integer cores in each core-module; this is the system default). So, using -N 16 -j 1 is another way to place one MPI rank on each Bulldozer core-module. Note that using -d in combination with -j has a multiplicative effect on the spacing between ranks:

Example 3 (single-node jobs):

aprun -n 4 -N 4 -d 1 -j 1 ./myexe
Result: MPI processes will be assigned to cores 0, 2, 4, and 6.

aprun -n 4 -N 4 -d 2 -j 1 ./myexe
Result: MPI processes will be assigned to cores 0, 4, 8, and 12.

aprun -n 4 -N 4 -d 4 -j 1 ./myexe
Result: MPI processes will be assigned to cores 0, 8, 16, and 24.
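
The multiplicative effect in Example 3 can be written as a stride: the spacing between ranks is -d multiplied by the number of integer cores skipped per core-module. The following is a minimal sketch that reproduces the three results above; it is a model of the documented behavior rather than ALPS logic, and the function name first_cores is hypothetical.

```python
INTEGER_CORES_PER_MODULE = 2  # two integer cores per Bulldozer core-module


def first_cores(n, d=1, j=0):
    """Return the first core of each of n ranks on a single node.

    Models `aprun -n <n> -d <d> -j <j>`: j=0 means the system default,
    which the text gives as 2 (use both integer cores per core-module).
    """
    if j not in (0, 1, 2):
        raise ValueError("-j must be 0, 1, or 2")
    cpus_per_module = INTEGER_CORES_PER_MODULE if j == 0 else j
    # With -j 1 every rank slot consumes a whole core-module (2 cores).
    stride = d * (INTEGER_CORES_PER_MODULE // cpus_per_module)
    return [rank * stride for rank in range(n)]


if __name__ == "__main__":
    for depth in (1, 2, 4):
        print(f"-d {depth} -j 1:", first_cores(4, d=depth, j=1))
```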

For hybrid jobs with OpenMP threads, set -d to the number of threads per MPI rank.

Example 4 (single-node job with OpenMP, using a bash PBS script):

Set the number of threads: export OMP_NUM_THREADS=8

aprun -n 4 -N 4 -d 8 ./myexe

Resulting assignments:

core 0:      MPI rank 0 (and main/master OpenMP thread for MPI rank 0)
cores 1-7:   OpenMP threads for MPI rank 0
core 8:      MPI rank 1 (and main/master OpenMP thread for MPI rank 1)
cores 9-15:  OpenMP threads for MPI rank 1
core 16:     MPI rank 2 (and main/master OpenMP thread for MPI rank 2)
cores 17-23: OpenMP threads for MPI rank 2
core 24:     MPI rank 3 (and main/master OpenMP thread for MPI rank 3)
cores 25-31: OpenMP threads for MPI rank 3

Advanced task placement: -cc

Use the -cc parameter if you need more control over where the MPI processes are placed. The list following -cc specifies the cores to which MPI processes are bound. This list may be comma delimited (e.g., 2,4,6,8), contain ranges (e.g., 2-4,5-7), or both.

A typical -cc layout for using the 16 FPUs in an XE compute node (one per Bulldozer core-module) would look like:

	aprun -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 -n <MYNUMRANKS> ./a.out
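
Long -cc lists are easy to mistype, so it can help to generate and expand them programmatically. The sketch below handles only the numeric forms described above (comma-delimited values and ranges); the helper names expand_cc and one_per_module_list are hypothetical and not part of aprun.

```python
def expand_cc(spec):
    """Expand a numeric -cc list such as "2-4,5-7" into a list of cores."""
    cores = []
    for item in spec.split(","):
        if "-" in item:
            low, high = item.split("-")
            cores.extend(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            cores.append(int(item))
    return cores


def one_per_module_list(n_modules=16):
    """Build the -cc list that uses the even-numbered core of each core-module."""
    return ",".join(str(2 * module) for module in range(n_modules))


if __name__ == "__main__":
    print(expand_cc("2-4,5-7"))
    print("aprun -cc", one_per_module_list(), "-n <MYNUMRANKS> ./a.out")
```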

Examples:

> aprun -n 4 ./hello  # default layout with integer cores and sharing FPUs
rank 0 of 4 on nid00028 core 0
rank 3 of 4 on nid00028 core 3
rank 1 of 4 on nid00028 core 1
rank 2 of 4 on nid00028 core 2

> aprun -cc 0,2,4,6 -n 4 ./hello  # skipping odd integer cores, no FPU sharing
rank 1 of 4 on nid00028 core 2
rank 0 of 4 on nid00028 core 0
rank 2 of 4 on nid00028 core 4
rank 3 of 4 on nid00028 core 6

An example of using -cc for an MPI+OpenMP hybrid application with 4 MPI ranks per node and 4 OpenMP threads per rank on an XE compute node is:

	aprun -N 4 -cc 0,2,4,6:8,10,12,14:16,18,20,22:24,26,28,30 -n <MYNUMRANKS> ./a.out
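
The colon-separated list in the command above gives one group of cores per MPI rank. A small generator such as the following can build that string for other rank and thread counts. It is a sketch under the assumption that each thread gets the even-numbered core of its own core-module (stride 2); the function name hybrid_cc is hypothetical.

```python
def hybrid_cc(ranks_per_node, threads_per_rank, stride=2):
    """Build a -cc string with one colon-separated core group per MPI rank.

    Each group lists `threads_per_rank` cores spaced `stride` apart, and
    consecutive ranks continue where the previous group left off.
    """
    groups = []
    core = 0
    for _ in range(ranks_per_node):
        group = [core + thread * stride for thread in range(threads_per_rank)]
        groups.append(",".join(str(c) for c in group))
        core += threads_per_rank * stride
    return ":".join(groups)


if __name__ == "__main__":
    print("aprun -N 4 -cc", hybrid_cc(4, 4), "-n <MYNUMRANKS> ./a.out")
```

Because -cc values apply per node, the generated cores should stay within the range of cores available on a node.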

XK compute nodes contain only NUMA nodes 0 and 1, with the same memory sizes and the same enumeration of cores 0-15. NUMA nodes 2 and 3 (the second processor socket) are absent to provide space for the NVIDIA Kepler GPU.

The -cc parameter may be used to bind multiple MPI tasks to the same core (e.g., aprun -cc 0,0,1,1 ...). However, this is typically very undesirable as doing so will result in an extreme load imbalance for any application that tries to keep the amount of work done by each MPI rank the same.

Important note: the bindings specified by using -cc apply to each node. This means that the only valid values for Blue Waters are 0-31. Values above 31 will be ignored (no error is given).
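
Because out-of-range values are ignored without an error, it may be worth checking a core list before passing it to -cc. The check below is a minimal sketch, not an aprun feature: it flags values outside 0-31 and cores that are listed more than once. The function name check_cc is hypothetical.

```python
from collections import Counter


def check_cc(cores, max_core=31):
    """Return (out_of_range, shared) for a list of cores intended for -cc.

    out_of_range: values that fall outside 0..max_core
    shared: cores that appear more than once (several ranks on one core)
    """
    out_of_range = sorted({c for c in cores if not 0 <= c <= max_core})
    counts = Counter(cores)
    shared = sorted(c for c, k in counts.items() if k > 1)
    return out_of_range, shared


if __name__ == "__main__":
    bad, shared = check_cc([0, 0, 1, 1, 32])
    print("out of range:", bad)
    print("bound more than once:", shared)
```

For XK nodes, which enumerate cores 0-15, the same check can be run with max_core=15.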

See the aprun man page for more information.

This example code can be compiled and run on one or more nodes to show how the core placement changes with various arguments to aprun:

hello_world.c

Restart for aprun resiliency

aprun can automatically retry the execution of an application after a node failure, based on how many fewer processing elements (PEs) the application can tolerate. This feature provides a level of resiliency: should the application experience certain system failures, ALPS will attempt to relaunch it and complete the run in a degraded manner. The option is:

-R pe_dec

where pe_dec is the processing element (PE) decrement tolerance. If pe_dec is non-zero, aprun attempts to relaunch with a maximum of pe_dec fewer PEs. If pe_dec is 0, aprun will attempt to relaunch with the same number of PEs specified in the original launch. Relaunch is supported per aprun instance. A decrement count value greater than zero will fail for MPMD launches with more than one element. Options -C and -R are mutually exclusive.
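
In other words, -R defines a window of acceptable job widths for a relaunch. The short sketch below makes that window explicit. It only restates the rule above as arithmetic; the function name relaunch_pe_range is hypothetical, and an application still has to be able to run correctly with any PE count in the window.

```python
def relaunch_pe_range(n_pes, pe_dec):
    """Return (min_pes, max_pes) that a relaunch may use under -R pe_dec."""
    if pe_dec < 0:
        raise ValueError("pe_dec must be zero or positive")
    return n_pes - pe_dec, n_pes


if __name__ == "__main__":
    # Models: aprun -n 1024 -R 32 ...
    low, high = relaunch_pe_range(1024, 32)
    print(f"relaunch may use between {low} and {high} PEs")
```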

Aprun output

When an application exits, aprun writes the following to stdout: utime, stime, maxrss, inblocks, and outblocks. These are the user time, system time, maximum resident set size (labeled Rss in the output), block input operations, and block output operations, respectively. The values are approximate because they are rounded aggregates scaled by the number of resources used. For more information on these values, see the getrusage(2) man page.

An example of the output is:

Application 2243970 resources: utime ~4385s, stime ~114s, Rss ~1109576, inblocks ~59037162, outblocks ~882
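
If you want to collect these values from job logs, the line can be parsed with a regular expression. The sketch below is written against the single example line above; the exact format may differ between ALPS versions, so treat the pattern as an assumption. The function name parse_resources is hypothetical.

```python
import re


def parse_resources(line):
    """Parse an aprun resources line into (application_id, {field: value}).

    Values are returned as integers; the "~" marker and the trailing "s"
    on the time fields are dropped.
    """
    match = re.match(r"Application (\d+) resources: (.*)", line)
    if match is None:
        raise ValueError("not an aprun resources line")
    fields = {name: int(value)
              for name, value in re.findall(r"(\w+) ~(\d+)s?", match.group(2))}
    return int(match.group(1)), fields


if __name__ == "__main__":
    example = ("Application 2243970 resources: utime ~4385s, stime ~114s, "
               "Rss ~1109576, inblocks ~59037162, outblocks ~882")
    apid, usage = parse_resources(example)
    print(apid, usage)
```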

References

http://docs.cray.com/books/S-2496-4101/html-S-2496-4101/cnlexamples.html # contains many sample codes demonstrating the common HPC programming paradigms along with various aprun invocations and program output