Hybrid MPI and OpenMP

Combining OpenMP with MPI is an efficient way to exploit multicore processors on Blue Waters. Each OpenMP thread typically runs on one compute core, and a node has 32 integer cores, so the maximum number of threads per node on Blue Waters is 32.

How to use MPI + OpenMP

Running with OpenMP:

The following aprun options are relevant when running MPI with OpenMP:

-n:  specifies the total number of MPI tasks for the job.

-N:  specifies the number of MPI tasks per compute node.

-d:  sets the number of OpenMP threads per MPI task. This is in addition to setting the environment variable OMP_NUM_THREADS; the two should have the same value. The value for -d multiplied by the value for -N should typically not exceed 32 on Blue Waters.

-S:  specifies the number of MPI tasks to allocate per NUMA node. There are 4 NUMA regions per node, and each NUMA region has 8 integer cores. Keeping the OpenMP threads of a task in the same NUMA region can help performance by improving memory affinity.

-ss: specifies strict memory containment per NUMA node. When -ss is specified, an MPI task and its OpenMP threads can allocate only memory local to their assigned NUMA node. This can improve memory affinity and may help performance.
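The options above can be combined as in the following minimal sketch. The layout (2 nodes, 4 MPI tasks per node, 8 threads per task, 1 task per NUMA node) and the executable name ./a.out are illustrative assumptions, not a recommended configuration. The script only prints the aprun command so that it can be checked before use.

```shell
#!/bin/bash
# Sketch: build an aprun command line for a hybrid MPI + OpenMP job.
# All values below are assumed for illustration.
NODES=2              # compute nodes in the batch reservation
TASKS_PER_NODE=4     # -N: MPI tasks per compute node
THREADS_PER_TASK=8   # -d: OpenMP threads per MPI task
TASKS_PER_NUMA=1     # -S: MPI tasks per NUMA node
CORES_PER_NODE=32    # integer cores per node

# The -N value times the -d value should not exceed the cores on a node.
if [ $((TASKS_PER_NODE * THREADS_PER_TASK)) -gt "$CORES_PER_NODE" ]; then
    echo "error: tasks x threads exceeds $CORES_PER_NODE cores per node" >&2
    exit 1
fi

TOTAL_TASKS=$((NODES * TASKS_PER_NODE))   # -n: total MPI tasks

# Keep OMP_NUM_THREADS equal to the -d value.
export OMP_NUM_THREADS=$THREADS_PER_TASK

CMD="aprun -n $TOTAL_TASKS -N $TASKS_PER_NODE -d $THREADS_PER_TASK -S $TASKS_PER_NUMA -ss ./a.out"
echo "$CMD"   # inside a batch job, run the command instead of printing it
```

With these values each MPI task and its 8 threads fill one NUMA region, and -ss keeps the task's memory allocations local to that region.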

About thread levels:

MPI defines four "levels" of thread safety.  The default thread support level on Blue Waters is MPI_THREAD_SINGLE, where only one thread of execution exists.  The maximum thread support level is returned by the MPI_Init_thread() call in the "provided" argument.

You can set the environment variable MPICH_MAX_THREAD_SAFETY to one of the values below to raise the maximum supported thread level.

 

MPICH_MAX_THREAD_SAFETY value    Supported Thread Level
not set                          MPI_THREAD_SINGLE
single                           MPI_THREAD_SINGLE
funneled                         MPI_THREAD_FUNNELED
serialized                       MPI_THREAD_SERIALIZED
multiple                         MPI_THREAD_MULTIPLE
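As a minimal sketch, the variable can be exported in the job script before the aprun line. The choice of "funneled" here is an assumption for a code in which only the main thread makes MPI calls; the case statement simply mirrors the table above so the resulting level can be printed. The application itself must still request the level through MPI_Init_thread() and check the "provided" argument.

```shell
#!/bin/bash
# Sketch: choose a thread support level before launching the job.
# "funneled" is an assumed choice; pick the value your application needs.
export MPICH_MAX_THREAD_SAFETY=funneled

# Map the setting to the supported thread level (unset behaves as "single").
case "${MPICH_MAX_THREAD_SAFETY:-single}" in
    single)     LEVEL=MPI_THREAD_SINGLE ;;
    funneled)   LEVEL=MPI_THREAD_FUNNELED ;;
    serialized) LEVEL=MPI_THREAD_SERIALIZED ;;
    multiple)   LEVEL=MPI_THREAD_MULTIPLE ;;
    *)          LEVEL=unknown
                echo "unrecognized value: $MPICH_MAX_THREAD_SAFETY" >&2 ;;
esac

echo "Maximum thread level MPI can provide: $LEVEL"
```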

    

MPI+OpenMP Performance:

  •  MPI Asynchronous progress:

Some applications may benefit from MPI using helper threads to progress the MPI state engine while OpenMP threads are computing. Using dedicated asynchronous progress threads requires the highest thread safety level, i.e., "multiple". It is also best used in conjunction with core specialization (see the aprun -r argument), which reserves a core for the progress thread.

To enable MPI's asynchronous progress threads, one needs to set the following environment variables:
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple

For example, to run the application with core specialization and one reserved core:

aprun -n X -r 1 ./a.out

Because the reserved cores are no longer available to the application, one needs to scale up the process count (width) requested for the job to account for the number of cores per node reserved for core specialization. One can use apcount to help calculate the new width of the batch reservation. See the apcount man page.
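The effect of reserving cores can be illustrated with simple arithmetic. The sketch below is not a substitute for apcount; it only shows, under assumed values (64 single-threaded MPI tasks, one reserved core per node, 32 integer cores per node), why the reservation grows.

```shell
#!/bin/bash
# Sketch: estimate the nodes needed once cores are reserved with aprun -r.
# All values below are assumed for illustration.
CORES_PER_NODE=32     # integer cores per node
RESERVED_CORES=1      # value passed to aprun -r
THREADS_PER_TASK=1    # value passed to aprun -d
TOTAL_TASKS=64        # value passed to aprun -n

USABLE_CORES=$((CORES_PER_NODE - RESERVED_CORES))
TASKS_PER_NODE=$((USABLE_CORES / THREADS_PER_TASK))   # integer division rounds down

# Ceiling division: nodes needed to hold all MPI tasks.
NODES=$(( (TOTAL_TASKS + TASKS_PER_NODE - 1) / TASKS_PER_NODE ))

echo "usable cores per node: $USABLE_CORES"
echo "MPI tasks per node:    $TASKS_PER_NODE"
echo "nodes to request:      $NODES"
```

With these values only 31 tasks fit on a node, so the 64 tasks that would fill 2 nodes without core specialization now need 3.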

      

 

Additional Information / References

https://bluewaters.ncsa.illinois.edu/compiling#OpenMP