Hybrid MPI and OpenMP
Combining OpenMP with MPI is an efficient way to exploit the multicore processors on Blue Waters. Each OpenMP thread typically runs on one compute core, so a node supports at most 32 threads.
How to use MPI + OpenMP
Running with OpenMP:
The following aprun options are relevant when running MPI with OpenMP:
-n: specifies the total number of MPI tasks for the job.
-N: specifies the number of MPI tasks per compute node.
-d: specifies the number of cores reserved for each MPI task and its OpenMP threads. It does not replace the environment variable OMP_NUM_THREADS; set both, and set them to the same value. Typically, the value of -d multiplied by the value of -N should not exceed 32 on Blue Waters.
-S: specifies the number of MPI tasks to allocate per NUMA node. Each compute node has 4 NUMA regions, and each NUMA region has 8 integer cores. Keeping a task's OpenMP threads in the same NUMA region can improve performance through memory affinity.
-ss: specifies strict memory containment per NUMA node. When -ss is specified, an MPI task and its OpenMP threads can allocate only memory local to their assigned NUMA node. This can improve memory affinity and may help performance.
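As an illustration of how these options fit together, the following sketch builds (but does not run) an aprun command line from a node count, a tasks-per-node count, and a thread count. The function name build_aprun_cmd and the executable ./my_hybrid_app are hypothetical, and the choice to add -S and -ss only when the tasks placed on a NUMA region fit within its 8 cores is an assumption of this sketch, not a site rule.

```shell
#!/bin/sh
# Sketch: print an aprun command line for a hybrid MPI + OpenMP job.
# Assumes 32 integer cores and 4 NUMA regions per node, as described above.
# ./my_hybrid_app is a hypothetical executable name.

CORES_PER_NODE=32
NUMA_PER_NODE=4
CORES_PER_NUMA=$((CORES_PER_NODE / NUMA_PER_NODE))

build_aprun_cmd() {
    nodes=$1            # number of compute nodes
    tasks_per_node=$2   # value for -N
    threads=$3          # value for -d and OMP_NUM_THREADS

    # -N multiplied by -d should not exceed the cores on a node.
    if [ $((tasks_per_node * threads)) -gt "$CORES_PER_NODE" ]; then
        echo "error: $tasks_per_node tasks x $threads threads exceeds $CORES_PER_NODE cores" >&2
        return 1
    fi

    total_tasks=$((nodes * tasks_per_node))   # value for -n

    # Spread the tasks evenly over the NUMA regions (rounded up).
    tasks_per_numa=$(( (tasks_per_node + NUMA_PER_NODE - 1) / NUMA_PER_NODE ))

    # Request NUMA placement only if the tasks on a region fit in its cores.
    numa_opts=""
    if [ $((tasks_per_numa * threads)) -le "$CORES_PER_NUMA" ]; then
        numa_opts=" -S $tasks_per_numa -ss"
    fi

    echo "OMP_NUM_THREADS=$threads aprun -n $total_tasks -N $tasks_per_node -d $threads$numa_opts ./my_hybrid_app"
}

# Two nodes, 4 MPI tasks per node, 8 OpenMP threads per task.
build_aprun_cmd 2 4 8
```

With 4 tasks of 8 threads per node, each task occupies exactly one NUMA region, so the sketch adds -S 1 and -ss.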
About thread levels:
MPI defines four levels of thread support: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE. The default thread support level on Blue Waters is MPI_THREAD_SINGLE, where only one thread of execution exists. The thread support level actually available is returned by the MPI_Init_thread() call in the "provided" argument.
To raise the available thread support level, set the environment variable MPICH_MAX_THREAD_SAFETY before launching the application.
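For instance, a job script might raise the limit as sketched below. The value "funneled" is chosen only for illustration, on the assumption that the variable accepts the lowercase level names (single, funneled, serialized, multiple); check the MPI man page on the system for the values your MPT version accepts.

```shell
# Sketch: allow MPI_Init_thread() to grant up to the "funneled" level.
# The value name is an assumption; confirm it against the system's MPI man page.
export MPICH_MAX_THREAD_SAFETY=funneled
```

The application still has to request the level it needs through MPI_Init_thread(); the variable only raises the ceiling on what can be granted.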
Some applications may benefit from MPI using helper threads to progress the MPI state engine while OpenMP threads are computing. Dedicated asynchronous progress threads require the highest thread safety level, "multiple" (MPI_THREAD_MULTIPLE). They work best in conjunction with core specialization (see the aprun -r option), which reserves a core for the progress thread.
To enable MPI's asynchronous progress threads, one needs to set the following environment variables:
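A minimal sketch of such settings follows. MPICH_MAX_THREAD_SAFETY comes from the text above; the variable name MPICH_NEMESIS_ASYNC_PROGRESS and its value are assumptions based on Cray MPICH and should be confirmed against the MPI man page for the installed MPT version.

```shell
# Sketch: environment for MPI asynchronous progress threads.
# MPICH_NEMESIS_ASYNC_PROGRESS and its value are assumptions; verify locally.
export MPICH_NEMESIS_ASYNC_PROGRESS=1
# Progress threads need the highest thread safety level.
export MPICH_MAX_THREAD_SAFETY=multiple
```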
For example, run the application requesting core specialization and reserving a core:
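One possible form of such a launch is sketched here. It only assembles and prints the command line; the node count and the executable name ./my_app are hypothetical. With one of the 32 cores on each node reserved through -r 1, at most 31 single-threaded MPI tasks fit on a node.

```shell
# Sketch: aprun command line with one core per node reserved (-r 1).
# NODES and ./my_app are hypothetical values chosen for illustration.
CORES_PER_NODE=32
RESERVED=1
NODES=4

TASKS_PER_NODE=$((CORES_PER_NODE - RESERVED))   # cores left for MPI tasks
TOTAL_TASKS=$((NODES * TASKS_PER_NODE))

APRUN_CMD="aprun -r $RESERVED -n $TOTAL_TASKS -N $TASKS_PER_NODE ./my_app"
echo "$APRUN_CMD"
```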
Because core specialization takes cores away from the application, the width requested in the batch reservation must be scaled up for a given number of cores per node reserved. Use "apcount" to help calculate the new width of the batch reservation. See the "apcount" man page.
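The arithmetic behind this scaling can be sketched as follows. This is not a substitute for apcount, whose options are not shown here; the function name nodes_needed is hypothetical, and the sketch assumes 32 cores per node.

```shell
# Sketch: number of nodes needed once some cores per node are reserved.
# nodes_needed is a hypothetical helper; 32 cores per node is assumed.
nodes_needed() {
    total_tasks=$1   # total MPI tasks (aprun -n)
    threads=$2       # OpenMP threads per task (aprun -d)
    reserved=$3      # cores reserved per node (aprun -r)

    usable=$((32 - reserved))
    tasks_per_node=$((usable / threads))
    if [ "$tasks_per_node" -lt 1 ]; then
        echo "error: $threads threads do not fit in $usable usable cores" >&2
        return 1
    fi

    # Round up so that every task has a place.
    echo $(( (total_tasks + tasks_per_node - 1) / tasks_per_node ))
}

nodes_needed 64 1 0   # no cores reserved
nodes_needed 64 1 1   # one core per node reserved
```

For 64 single-threaded tasks, the first call yields 2 nodes and the second yields 3, because only 31 tasks fit on each node once a core is reserved.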
Additional Information / References