The aprun command tells ALPS which resources and placement parameters your application needs at launch. At a high level, aprun is similar to mpiexec or mpirun.
The following are the most commonly used options for aprun.
At a minimum, provide the "-n" option to specify the number of processing elements (PEs) required to run the job.
By default, MPI processes are assigned to cores in a packed manner. If, for example, you run a single-node pure-MPI job on 8 cores (i.e., aprun -n 8 ...), the MPI ranks will be placed on cores 0-7. If you run a hybrid job with OpenMP, the -d parameter should be set to the number of OpenMP threads per MPI rank. In this case, the MPI ranks will be spaced by the value of -d. So, if you run a single-node job with 8 MPI ranks and 2 OpenMP threads per rank (aprun -n 8 -d 2 ...), the MPI ranks will be placed on cores 0, 2, 4, 6, 8, 10, 12, and 14, and the OpenMP threads will be placed on cores 0 through 15.
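The packed layout described above can be modeled with a little arithmetic. The following is only a sketch of that arithmetic, not of how ALPS works internally; the function name packed_placement is hypothetical, and it assumes that each rank's threads occupy the consecutive cores starting at rank * depth.

```python
def packed_placement(n_ranks, depth=1):
    """Model the default packed placement on a single node.

    Assumption: rank r starts at core r * depth, and its `depth` threads
    occupy the consecutive cores from there.
    """
    return {rank: list(range(rank * depth, (rank + 1) * depth))
            for rank in range(n_ranks)}


# Models "aprun -n 8 -d 2 ...": print the first core of every rank.
layout = packed_placement(8, depth=2)
print([cores[0] for cores in layout.values()])
```

With depth=1 the same function models the pure-MPI case, where rank r simply lands on core r.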
For the vast majority of codes, it's best to distribute MPI processes among the NUMA nodes to avoid bottlenecks to cache, PCI bus, main memory, etc. Note the core layout of an XE node as shown by /usr/bin/numactl:
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16383 MB
node 0 free: 15686 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node 1 free: 15179 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16384 MB
node 2 free: 15869 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16384 MB
node 3 free: 15633 MB
There are a couple of ways to modify ALPS' task placement behavior.
The simple way of evenly distributing the tasks on a node is with a combination of the -n, -N, and -d aprun parameters. For a pure-MPI job, specify the total number of MPI processes with -n, the number of MPI processes per node with -N, and then use -d to space them.
aprun -n 4 -N 4 -d 2 ./myexe
Result: MPI processes will be assigned to cores 0, 2, 4, and 6.
aprun -n 8 -N 4 -d 4 ./myexe
Result: MPI processes will be assigned to cores 0, 4, 8, and 12 on both nodes.
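The two results above follow from a simple rule that can be sketched in code. This sketch assumes that ranks fill one node completely before moving to the next and that each rank starts at a multiple of -d within its node; the function name node_placement is hypothetical.

```python
def node_placement(n_ranks, per_node, depth=1):
    """Return a (node_index, first_core) pair for each rank under -n, -N, -d."""
    placement = []
    for rank in range(n_ranks):
        node = rank // per_node   # assumption: ranks fill one node before the next
        slot = rank % per_node    # position of the rank within its node
        placement.append((node, slot * depth))
    return placement


# Models "aprun -n 8 -N 4 -d 4 ./myexe"
for rank, (node, core) in enumerate(node_placement(8, per_node=4, depth=4)):
    print(f"rank {rank}: node {node}, core {core}")
```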
For hybrid jobs with OpenMP threads, set -d to the number of threads per MPI rank.
Set the number of threads: export OMP_NUM_THREADS=8
aprun -n 4 -N 4 -d 8 ./myexe
core 0: MPI rank 0 (and main/master OpenMP thread for MPI rank 0)
Use the -cc parameter if you need more control over where the MPI processes are placed. The list following -cc specifies the cores to which MPI processes are bound. This list may be comma delimited (e.g., 2,4,6,8), contain ranges (e.g., 2-4,5-7), or both.
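To see which cores a given list covers, the list syntax can be expanded with a short helper. This is an illustrative sketch that handles only the comma and range forms described above; parse_cpu_list is a hypothetical name, not part of aprun or ALPS.

```python
def parse_cpu_list(spec):
    """Expand a list such as "2,4,6,8" or "2-4,5-7" into core numbers."""
    cores = []
    for item in spec.split(","):
        if "-" in item:
            low, high = item.split("-")
            cores.extend(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            cores.append(int(item))
    return cores


print(parse_cpu_list("2,4,6,8"))
print(parse_cpu_list("2-4,5-7"))
```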
A typical -cc layout for using the 16 floating-point units (FPUs) in an XE compute node would look like:
aprun -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 -n <MYNUMRANKS> ./a.out
> aprun -n 4 ./hello # default layout with integer cores and sharing FPUs
rank 0 of 4 on nid00028 core 0
rank 3 of 4 on nid00028 core 3
rank 1 of 4 on nid00028 core 1
rank 2 of 4 on nid00028 core 2
> aprun -cc 0,2,4,6 -n 4 ./hello # skipping odd integer cores, no FPU sharing
rank 1 of 4 on nid00028 core 2
rank 0 of 4 on nid00028 core 0
rank 2 of 4 on nid00028 core 4
rank 3 of 4 on nid00028 core 6
An example of using -cc for a hybrid MPI+OpenMP application with 4 MPI ranks per node and 4 OpenMP threads per rank on an XE compute node is:
aprun -N 4 -cc 0,2,4,6:8,10,12,14:16,18,20,22:24,26,28,30 -n <MYNUMRANKS> ./a.out
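Long -cc lists like the ones above are easy to mistype, so it can help to generate them. The sketch below rebuilds both example lists; the helper names are hypothetical, and the stride of 2 reflects the every-other-core pattern used in the examples to avoid FPU sharing.

```python
def even_core_list(n_cores=32):
    """Comma-delimited list of every other core, e.g. for 16 ranks on an XE node."""
    return ",".join(str(core) for core in range(0, n_cores, 2))


def grouped_core_list(ranks_per_node, threads_per_rank, stride=2):
    """Colon-separated groups of cores, one group per MPI rank."""
    groups = []
    for rank in range(ranks_per_node):
        first = rank * threads_per_rank * stride
        cores = range(first, first + threads_per_rank * stride, stride)
        groups.append(",".join(str(core) for core in cores))
    return ":".join(groups)


print("aprun -cc " + even_core_list() + " -n <MYNUMRANKS> ./a.out")
print("aprun -N 4 -cc " + grouped_core_list(4, 4) + " -n <MYNUMRANKS> ./a.out")
```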
XK compute nodes contain only NUMA nodes 0 and 1, with the same memory and the same enumeration of cores 0-15. NUMA nodes 2 and 3 (the second processor socket) are vacant to provide space for the NVIDIA Kepler GPU.
The -cc parameter may be used to bind multiple MPI tasks to the same core (e.g., aprun -cc 0,0,1,1 ...). However, this is typically very undesirable as doing so will result in an extreme load imbalance for any application that tries to keep the amount of work done by each MPI rank the same.
Important note: the bindings specified by using -cc apply to each node. This means that the only valid values for Blue Waters are 0-31. Values above 31 will be ignored (no error is given).
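Because out-of-range values are dropped without an error, a quick check of a core list before launch may save some debugging. The following is a minimal sketch under the limits stated above (0-31 on XE nodes, 0-15 on XK nodes); check_cc_cores is a hypothetical helper, not an aprun feature.

```python
from collections import Counter


def check_cc_cores(cores, max_core=31):
    """Report cores that aprun would ignore and cores bound to more than one rank."""
    counts = Counter(cores)
    return {
        "out_of_range": sorted(c for c in counts if c < 0 or c > max_core),
        "shared": sorted(c for c, k in counts.items() if k > 1),
    }


print(check_cc_cores([0, 0, 1, 1]))           # two ranks on each of cores 0 and 1
print(check_cc_cores([0, 16], max_core=15))   # XK node: core 16 does not exist
```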
See the aprun man page for more information.
This example code can be compiled and run on one or more nodes to show how the core placement changes with various arguments to aprun:
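As a rough stand-in that needs no compilation, a short Python script can report the cores each launched process is allowed to run on. This is a sketch, not the site's example code: it assumes a Python 3 interpreter is available on the compute nodes, relies on os.sched_getaffinity (Linux only), and uses the hypothetical file name binding.py.

```python
import os
import socket


def report_binding():
    """Describe the host and the set of cores this process may run on."""
    get_affinity = getattr(os, "sched_getaffinity", None)  # Linux-only call
    cores = sorted(get_affinity(0)) if get_affinity else []
    return f"{socket.gethostname()} pid {os.getpid()} cores {cores}"


if __name__ == "__main__":
    # Hypothetical launch line: aprun -n 4 -d 2 python3 ./binding.py
    print(report_binding())
```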
Restart for aprun resiliency
One feature of aprun is the ability to automatically retry execution of the application, based on how many fewer processing elements (PEs) can be tolerated in the event of node failure. This feature provides a level of resiliency by enabling application relaunch, so that should the application experience certain system failures, ALPS will attempt to relaunch it and complete in a degraded manner. The option is:
-R pe_dec
where pe_dec is the processing element (PE) decrement tolerance. If pe_dec is non-zero, aprun attempts to relaunch with a maximum of pe_dec fewer PEs. If pe_dec is 0, aprun attempts to relaunch with the same number of PEs specified in the original launch. Relaunch is supported per aprun instance. A decrement count greater than zero will fail for MPMD launches with more than one element. Options -C and -R are mutually exclusive.
When an application exits, aprun writes the following to stdout: utime, stime, maxrss, inblocks, and outblocks. These are the user time, system time, maximum resident set size, block input operations, and block output operations, respectively. The values are approximate because they are rounded aggregates scaled by the number of resources used. For more information on these values, see the getrusage(2) man page.
An example of the output is:
Application 2243970 resources: utime ~4385s, stime ~114s, Rss ~1109576, inblocks ~59037162, outblocks ~882
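If these numbers are collected across many jobs, the summary line can be parsed from the job's stdout. The sketch below assumes the line always has the shape shown in the example above; parse_resources is a hypothetical helper.

```python
import re


def parse_resources(line):
    """Split an aprun resource summary line into an application ID and its fields."""
    app = re.search(r"Application (\d+) resources:", line)
    fields = {name: int(value)
              for name, value in re.findall(r"(\w+) ~(\d+)s?", line)}
    return (int(app.group(1)) if app else None), fields


line = ("Application 2243970 resources: utime ~4385s, stime ~114s, "
        "Rss ~1109576, inblocks ~59037162, outblocks ~882")
app_id, fields = parse_resources(line)
print(app_id, fields)
```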
For many sample codes demonstrating the common HPC programming paradigms, along with various aprun invocations and program output, see:
http://docs.cray.com/books/S-2496-4101/html-S-2496-4101/cnlexamples.html