Usage tips for the Nvidia Kepler K20x GPUs

Sharing the GPU in an XK node

CRAY_CUDA_MPS (also known as Nvidia Hyper-Q; formerly CRAY_CUDA_PROXY in earlier versions of aprun/ALPS)

After selecting the XK GPU compute nodes via the xk resource specifier (#PBS -l), you may further set the runtime mode with the CRAY_CUDA_MPS environment variable.
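As a hedged sketch, the relevant batch-script directives might look like the following (node counts and walltime are placeholders; check your site's documentation for the exact resource syntax):

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=16:xk     # request 2 XK (GPU) compute nodes -- placeholder counts
#PBS -l walltime=00:30:00     # placeholder walltime
```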

The Nvidia GPUs default to dedicated mode, in which each GPU is mapped to exactly one Linux process per compute node (typically 1 MPI rank).  This default behavior may be overridden with the CRAY_CUDA_MPS environment variable.  When set to 1:

  export CRAY_CUDA_MPS=1

...the Nvidia driver will multiplex CUDA kernels from different processes (multiple cooperating MPI ranks) onto the Kepler GPU.  The driver presents a virtual GPU to each requesting process (it reports itself as Device 0, so there is no need to modify your GPU code).  In some cases this allows more efficient loading and utilization of the GPU.  Keep in mind that the basic limitations of the hardware are still in effect (6 GB global memory) and that the processes share GPU resources.  When running in MPS (proxy) mode, you are more likely to see CUDA_ERROR_OUT_OF_MEMORY errors if care is not taken to size and schedule the kernels so that they fit onto the GPU together.  For debugging, set CRAY_CUDA_MPS=0.
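As a sketch, a job script might enable MPS only when several ranks per node will share the GPU (the rank count and aprun line below are placeholders for your own job setup):

```shell
# Sketch: enable GPU sharing only when multiple ranks per node will use the K20x.
# ranks_per_node and the aprun line are placeholders for your own job setup.
ranks_per_node=16

if [ "$ranks_per_node" -gt 1 ]; then
    export CRAY_CUDA_MPS=1    # multiplex kernels from all ranks onto one GPU
else
    export CRAY_CUDA_MPS=0    # dedicated mode; also the right choice for debugging
fi

echo "CRAY_CUDA_MPS=$CRAY_CUDA_MPS"
# aprun -n 32 -N "$ranks_per_node" ./my_gpu_app   # uncomment on the Cray system
```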

The environment variable should be used with APRUN_XFER_LIMITS disabled (not set).  If APRUN_XFER_LIMITS is set, you may see spurious CUDA_ERROR_OUT_OF_MEMORY errors.

A known issue: CRAY_CUDA_MPS=1 is incompatible with OpenCL; do not use them together.

Performance related runtime variables

MPICH_RDMA_ENABLED_CUDA

Requires cray-mpich2/5.6.4 or later (module load cray-mpich2).

From the man pages [man mpi]:
MPICH_RDMA_ENABLED_CUDA
If set, allows the MPI application to pass GPU pointers directly to point-to-point and collective communication functions. Currently, if the send or receive buffer for a point-to-point or collective communication is on the GPU,
the network transfer and the transfer between the host CPU and the GPU are pipelined to improve performance. Future implementations may use an RDMA-based approach to write/read data directly to/from the GPU, bypassing the host CPU.  Note: the Cray MPI runtime will already have created your CUDA context, so your CUDA code should check for an existing context on device 0 and proceed accordingly.
Default: not set
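A minimal launch-script sketch (the application name is a placeholder) that enables GPU-pointer support before starting the run:

```shell
# Sketch: enable GPU-resident MPI buffers (cray-mpich2 >= 5.6.4 assumed loaded).
export MPICH_RDMA_ENABLED_CUDA=1

echo "MPICH_RDMA_ENABLED_CUDA=$MPICH_RDMA_ENABLED_CUDA"
# aprun -n 32 ./my_cuda_mpi_app   # uncomment on the Cray system;
#                                 # GPU pointers may now be passed to MPI calls
```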


MPICH_G2G_PIPELINE

Requires cray-mpich2/5.6.4 or later (module load cray-mpich2).

If nonzero, the device-host and network transfers will be overlapped to pipeline GPU-to-GPU
transfers. Setting MPICH_G2G_PIPELINE to N allows up to N GPU-to-GPU messages to be efficiently in flight
at any one time. If MPICH_G2G_PIPELINE is nonzero but MPICH_RDMA_ENABLED_CUDA is
disabled, MPICH_G2G_PIPELINE will be turned off. If MPICH_RDMA_ENABLED_CUDA is enabled but MPICH_G2G_PIPELINE is 0, the default value is set to 16. Pipelining is never used
on Aries networks for messages with sizes >= 8 KB and < 128 KB.
Default: not set

NCSA recommends testing MPICH_G2G_PIPELINE=4, 8, and 16 to see which yields the best performance with your application.
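One way to carry out that test is a simple sweep over pipeline depths; a sketch with a placeholder application name:

```shell
# Sketch: sweep MPICH_G2G_PIPELINE depths to find the best one for an application.
export MPICH_RDMA_ENABLED_CUDA=1        # required, or the pipeline setting is ignored

for depth in 4 8 16; do
    export MPICH_G2G_PIPELINE=$depth
    echo "timing run with MPICH_G2G_PIPELINE=$depth"
    # aprun -n 64 -N 16 ./my_gpu_app   # uncomment and time on the Cray system
done
```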

References: man aprun, man mpi

See also: GPU Direct (MPICH-enabled CUDA) at ORNL/Titan