Shifter is a software solution that enables execution of applications on HPC systems within isolated Linux environments. Shifter containers, called User-Defined Images, are analogous to Linux containers and provide a convenient way of bundling applications with all of their prerequisites and necessary components of the underlying operating system. Shifter UDIs are designed for quick distribution across compute nodes of supercomputers and can be generated from various types of Linux containers, including Docker images. This guide demonstrates how to work with Shifter UDIs based on Docker images from Docker Hub on Blue Waters.
General Shifter Workflow
The general workflow for using Docker images with Shifter on Blue Waters is as follows:
In what follows, we will guide you through steps 2 – 4 as the process of building Docker images is application specific and can be skipped if you plan on using existing images from public repositories on Docker Hub.
Shifter on Blue Waters
On Blue Waters, Shifter is available in the form of a module
The entire Blue Waters system has only 64 MOM nodes. These nodes are shared by all Blue Waters users whose interactive or batch jobs have started executing. Therefore, it is important not to use them for compute- or data-intensive tasks.
On a personal computer, one can work with Docker images
via the full-featured set of commands provided by
Docker. On Blue Waters, where Docker is
not available, one has to use
As we have already mentioned, one can interact with Shifter on Blue Waters MOM nodes only. Here is how you can start an interactive job on a single node (must be executed on one of the Blue Waters' login nodes):
In this guide, lines in code blocks that start with
Once the interactive job starts, we can load shifter module
and try out the
Let's now go over the three sub-commands provided by
Generic resource request must be specified as a PBS directive either on the command line:
$ qsub -l gres=shifter16 ...
or in a job script:
#PBS -l gres=shifter16
This request ensures that Torque, the Blue Waters resource manager, executes the proper prologue script.
There are also two ways to specify which
UDI we wish to use in our job.
Let's have a closer look at both of them.
Specifying UDI with a PBS directive
If you would like to use only one image in your job, you can specify it using
an environment variable called
UDI. Simply set it to the full
name of the image
imagename:tag when you submit your job. You can do so either
on the command line:
$ qsub -l gres=shifter16 -v UDI=centos:latest -l nodes=2:ppn=32:xe -l walltime=00:30:00
or as a directive in a job script:
#PBS -v UDI=centos:latest
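The directives shown so far can be combined in a single job script. The sketch below is only an illustration: make_udi_job_script is a hypothetical helper (it is not part of Shifter or Torque) that prints a minimal script header with the same resource values as the qsub example above.

```shell
#!/bin/bash
# Hypothetical helper: print a minimal job script header that requests the
# Shifter generic resource and names the UDI through the UDI variable.
make_udi_job_script() {
    local image=$1 nodes=$2 walltime=$3
    cat <<EOF
#!/bin/bash
#PBS -l gres=shifter16
#PBS -l nodes=${nodes}:ppn=32:xe
#PBS -l walltime=${walltime}
#PBS -v UDI=${image}
EOF
}

# Same values as in the command-line example above.
make_udi_job_script centos:latest 2 00:30:00
```

The printed header can be saved to a file and extended with the commands of the actual job.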
When a job specifies a User-Defined Image using the UDI environment variable, a special Blue Waters prologue script mounts the named image on the compute nodes and performs some other maintenance tasks. If successful, you should see a message similar to this one:
In Torque Shifter prologue batchID: 224301
Starting munge service on compute nodes
Successfully started munge service on compute nodes
Initializing udiRoot, please wait.
Retrieving Docker Image
udiRoot Start successful
This message indicates that Torque prologue has set up the requested
UDI to be used in a job. If the requested image has not been
previously downloaded and packaged, the above command will initiate both of
these steps and, as a result, may time out (currently, the time limit for
pulling an image is 1 hour). Therefore, if you would like to specify
UDI on the command line, it is strongly recommended that you
download that image in advance.
Specifying UDI in a shifter command call
Another way to specify which
UDI to use in a job is by calling the
shifter command provided by the module with its
--image option. This approach enables multi-step workflows
in which every step is performed in its own UDI:
$ qsub -l gres=shifter16 ...
$ # On a MOM node
$ module load shifter
$ aprun -b ... -- shifter --image=docker:centos:latest -- <app> <args>
A few notes on the above command:
- The CRAY_ROOTFS environment variable has to be unset when we call the shifter command on the compute nodes using aprun.
- Specifying the image type (the docker: prefix in --image=docker:centos:latest) is optional.
- Specifying UDI as a PBS directive (as in #PBS -v UDI=image:tag) and using the shifter command at the same time will produce an error. When we specify UDI as a PBS directive, the Torque prologue sets up the container environment on the compute nodes. When we then call aprun and specify an image with the --image flag, it attempts to set up the Shifter environment on the compute nodes again and fails because it cannot overwrite the one that was set up by Torque.
Running applications in Shifter environment
When Blue Waters starts a compute job, it first places it on a MOM
node. From there, we have to send our applications to the compute nodes for execution in a
Shifter environment. The set of commands for using the Shifter environment
depends on whether we specified
UDI as a PBS directive or not.
If UDI was specified as a PBS directive, we have to set the
CRAY_ROOTFS environment variable to
SHIFTER. In Bash,
the default shell on Blue Waters, you can do so by executing:
$ export CRAY_ROOTFS=SHIFTER
If all of the code that we plan to run is part of the UDI, we can set
CRAY_ROOTFS environment variable the same way we set
UDI, that is:
$ qsub ... -v CRAY_ROOTFS=SHIFTER,UDI=...
Keep in mind that if you choose to set
CRAY_ROOTFS on the
command line and need to run some code on the compute nodes that is not
contained in the
UDI, you have to unset CRAY_ROOTFS first:
$ export -n CRAY_ROOTFS
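The export -n form matters because aprun is started as a child process of the shell on the MOM node. The following Bash snippet can be run on any machine; it uses bash -c as a stand-in for aprun (an assumption made purely for illustration) to show that export -n only stops the variable from being passed to child processes:

```shell
#!/bin/bash
# bash -c stands in for aprun here: both are child processes of this shell.
export CRAY_ROOTFS=SHIFTER
child_before=$(bash -c 'echo "${CRAY_ROOTFS:-not set}"')

# Remove the export attribute; the variable keeps its value in this shell.
export -n CRAY_ROOTFS
child_after=$(bash -c 'echo "${CRAY_ROOTFS:-not set}"')

echo "child before export -n: $child_before"
echo "child after export -n:  $child_after"
echo "current shell:          $CRAY_ROOTFS"
```

Running export CRAY_ROOTFS again makes the variable visible to child processes once more, whereas unset CRAY_ROOTFS would also remove it from the current shell.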
Now we are ready to execute the application packaged in the UDI.
For example, to print the contents of the
/etc/centos-release file that is part of the centos:latest image, all we have to do is:
$ aprun -b -n 1 -- cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
Application 63788813 resources: utime ~0s, stime ~1s, Rss ~18096, inblocks ~4, outblocks ~0
Note that this file exists in the
UDI only; if we unset the
CRAY_ROOTFS variable, we
will not be able to access it:
$ export -n CRAY_ROOTFS
$ aprun -b -n 1 -- cat /etc/centos-release
cat: /etc/centos-release: No such file or directory
Application 63788941 exit codes: 1
Application 63788941 resources: utime ~0s, stime ~1s, Rss ~18096, inblocks ~3, outblocks ~0
If we don't specify
UDI as a PBS directive, we have to use
shifter command provided by the Shifter module.
Therefore, the above example translates to:
$ module load shifter
$ aprun -b -n 1 -- shifter --image=centos:latest -- cat /etc/centos-release
A clear advantage that
shifter command provides is that once the
above command completes, we can use a different
UDI to execute
another application in a new environment.
Please note the
-b option that we added to the
aprun call above. This is an important flag to remember when using
Shifter. It instructs aprun
not to transfer the executable file (
shifter) from the
MOM node to the compute nodes. If we forget this flag, aprun will copy the
shifter executable to the compute nodes and unset its
setuid bit there. This, in turn, would cause the entire command to fail.
Be extra careful with special symbols (such as
*) on the
command line when submitting Shifter jobs. Bash performs
pathname expansion before passing arguments to the
command. So, if you are not careful, you might see a No
such file or directory error. Therefore, we recommend that you
use scripts for all your Shifter-related
work on Blue Waters.
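The effect of pathname expansion can be reproduced on any machine. In the sketch below, count_args is a hypothetical stand-in for aprun that only reports how many arguments it received; the temporary directory and file names are likewise made up for the demonstration.

```shell
#!/bin/bash
# Hypothetical stand-in for aprun: report how many arguments were received.
count_args() { echo "$#"; }

# Create a scratch directory with three files that match the pattern.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt" "$demo_dir/c.txt"
cd "$demo_dir"

# Unquoted: Bash replaces *.txt with the matching local file names
# before the command ever sees it.
unquoted=$(count_args ls *.txt)

# Quoted: the pattern is passed through literally as a single argument.
quoted=$(count_args ls '*.txt')

echo "unquoted: $unquoted arguments, quoted: $quoted arguments"
```

In a Shifter job the unquoted pattern would be expanded against the files visible on the MOM node, which are not necessarily the files inside the UDI.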
Mapping directories in Shifter
A distinct feature of Shifter images is that they are read-only, and the only way to update them is by pulling (downloading) newer versions of the corresponding Docker images. This, in turn, means that although input files can be part of
UDIs, results of simulations and analysis produced in Shifter jobs have to be stored on the Blue Waters filesystems. For that purpose, Shifter adds special hooks into
UDIs to make sure that the Blue Waters filesystems (for example, the home directory
/mnt/a/u/sciteam/<username>) are available when Shifter jobs run.
In addition to these automatic hooks, Shifter allows us to manually map existing directories of the Blue Waters filesystems to existing directories within
UDIs. For example, we can map our home directory on Blue Waters system to
/home directory within the image. There are two ways to specify such mappings:
a. when the Shifter job is submitted to the queue
b. as an argument to the shifter command
Let's have a look at both of these methods.
Mapping directories when submitting a Shifter job
When we submit Shifter jobs to the queue, we have an option to specify mapping between existing directories of the Blue Waters filesystems and those in the user-defined images by amending the UDI assignment in the following way:
$ qsub -l gres=shifter16 -v UDI="centos:latest -v /mnt/a/u/sciteam/<username>:/home" ...
Once the above job starts, the specified Blue Waters directory will be mapped onto the
/home directory within the
centos:latest UDI. This mapping makes all files and folders within the directory on Blue Waters accessible from the
/home directory in the
UDI. It also ensures that any changes made to the
/home directory in the job are reflected in the actual directory on Blue Waters.
When mapping directories, the contents of the directory on Blue Waters replace the contents of the directory in the
UDI for that Shifter job only; that is, no changes to the actual
UDI are made. Make sure not to map onto directories within the UDI that contain any information required for the job to run.
Mapping directories from within a Shifter job
The other way to specify volume mapping between Blue Waters filesystems and the UDI is by using the
shifter command and its
--volume or -V flag directly. For example, to achieve the same mapping as above, we would use the following sequence of commands:
$ qsub -l gres=shifter16 ...
$ module load shifter
$ aprun -b ... -- shifter --image=centos:latest --volume=/mnt/a/u/sciteam/<username>:/home ...
$ # or
$ aprun -b ... -- shifter --image=centos:latest -V /mnt/a/u/sciteam/<username>:/home ...
Note that one cannot:
- Overwrite volume mappings specified by Shifter itself
- Map a directory to any of the following directories and their subdirectories within
- Use symbolic links when specifying the directory to be mapped, that is
/u/sciteam/user:/path/in/image will fail and the correct syntax is
If we try to map one of the restricted folders, we will receive one of the following error messages:
$ aprun -b -- shifter --image=centos:latest --volume=/mnt/a/u/sciteam/<username>:/etc -- ...
Invalid Volume Map: /mnt/a/u/sciteam/<username>:/etc, aborting! 1
Failed to parse volume map options
...
$ aprun -b -- shifter --image=centos:latest --volume=/mnt/a/u/sciteam/<username>:/dev -- ...
mount: warning: ufs seems to be mounted read-only.
Mount request path /var/udiMount/dev not on an approved device for volume mounts.
FAILED to setup user-requested mounts.
FAILED to setup image.
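One way to sidestep the symbolic-link restriction noted above is to resolve the host directory before building the volume mapping. The function below is a minimal sketch under that assumption: volume_spec is a hypothetical helper (not a Shifter feature), and it relies on GNU readlink -f to follow symbolic links.

```shell
#!/bin/bash
# Hypothetical helper: print a host_dir:image_dir volume specification in
# which the host directory has all symbolic links resolved.
volume_spec() {
    local host_dir=$1 image_dir=$2 resolved
    # readlink -f fails if an intermediate path component does not exist.
    resolved=$(readlink -f "$host_dir") || return 1
    if [ ! -d "$resolved" ]; then
        echo "not a directory: $host_dir" >&2
        return 1
    fi
    echo "$resolved:$image_dir"
}

# Possible use on a MOM node (illustrative only, not executed here):
#   aprun -b ... -- shifter --image=centos:latest \
#       --volume="$(volume_spec "$HOME" /home)" ...
```

The helper does not check the image side of the mapping, so the restricted directories inside the UDI still have to be avoided by hand.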
Accessing compute nodes running Shifter jobs via SSH
Just like with any other application, you might need to interact with the
application running in a Shifter environment for debugging,
monitoring, or other purposes. To enable such interactions,
Shifter allows users to log in to compute nodes that are part
of its pool via the standard ssh command-
line tool. There are several requirements, however, in order to make use of this feature:
- 1. Specify UDI as a PBS directive.
- To allow users to log in to its compute nodes, Shifter can start up SSH daemons. The daemons on the compute nodes can be launched only by the prologue script, which is executed when the job starts. Therefore, in order to be able to log in to compute nodes that are running a Shifter job, it is necessary to specify UDI as a PBS directive.
- 2. Prepare a special SSH key pair.
- On startup, the SSH daemons enabled by Shifter look for a private SSH key in $HOME/.shifter and wait for a connection on port 1204 authenticated with this key. To prepare such a key pair, execute:
$ mkdir -p ~/.shifter
$ ssh-keygen -t rsa -f ~/.shifter/id_rsa -N ''
Once the above two steps are completed, we can log into the compute nodes using:
$ ssh -p 1204 -i ~/.shifter/id_rsa -o StrictHostKeyChecking=no \
      -o UserKnownHostsFile=/dev/null -o LogLevel=error nodename
It is advisable to save all the above options into a configuration file. To do that, execute:
$ cat <<EOF > ~/.shifter/config
Host *
    Port 1204
    IdentityFile ~/.shifter/id_rsa
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    LogLevel error
EOF
Now, we can log in to the compute nodes with a simple:
$ ssh -F ~/.shifter/config nodename
To log in to a remote machine using the
ssh command, we have to specify the remote machine's
network name. To find the names of the compute nodes assigned to the Shifter job, execute the
following command on a MOM node before setting the CRAY_ROOTFS environment variable:
$ aprun -n $PBS_NUM_NODES -N 1 -b -- hostname
You should see a list of names of the form nidXXXXX, where XXXXX
is a five-digit number. Use these to connect to the compute nodes:
$ ssh -F ~/.shifter/config nidXXXXX
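For jobs that span many nodes, it can help to filter the output of the hostname command before using it. The function below is a small sketch: compute_node_names is a hypothetical helper that keeps only unique names of the nidXXXXX form described above and drops anything else, such as the resource summary lines that aprun prints.

```shell
#!/bin/bash
# Hypothetical helper: read lines on standard input and print the unique
# names that consist of "nid" followed by exactly five digits.
compute_node_names() {
    grep -E '^nid[0-9]{5}$' | sort -u
}

# Possible use on a MOM node (illustrative only, not executed here):
#   aprun -n $PBS_NUM_NODES -N 1 -b -- hostname | compute_node_names
```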
Make sure, however, that you do not accidentally copy the network name of the
MOM node where you execute all of your commands.
ssh will fail with a Permission
denied error if your login shell does not exist in the container or
is not listed in the container's
GPUs in Shifter jobs
If your application benefits from or relies upon CUDA-capable accelerators,
make sure to use the NVIDIA Kepler K20X GPUs that are installed on the Blue Waters
XK nodes. Currently, this is only supported when using the shifter
command provided by the module. To control which GPU devices should be
accessible from within the container, Shifter uses an environment variable called
CUDA_VISIBLE_DEVICES. The value of this
variable is a 0-based, comma-separated list of CUDA-capable device IDs on the host
system (Blue Waters). Because XK nodes have only one NVIDIA GPU each, the only
value we can set this variable to is
0. Note that on systems with
more than one NVIDIA GPU, device IDs within the container would start with 0
regardless of their IDs in the host system. This enables transparent use of
containers on systems with different numbers of GPUs per node.
As an example, here is how we can start a 2-node Shifter job that uses GPUs:
$ qsub -l gres=shifter16,nodes=2:ppn=16:xk ...
$ # On a MOM node
$ module load shifter craype-accel-nvidia35
$ export CUDA_VISIBLE_DEVICES=0
$ aprun -b -- shifter --image=centos:latest -- nvidia-smi
mount: warning: ufs seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libcuda.so.1 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libcuda.so seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-compiler.so.352.68 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-compiler.so seems to be mounted read-only.
[ GPU SUPPORT ] =WARNING= Could not find library: nvidia-ptxjitcompiler
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-encode.so.1 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-encode.so seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-ml.so.1 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-ml.so seems to be mounted read-only.
[ GPU SUPPORT ] =WARNING= Could not find library: nvidia-fatbinaryloader
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-opencl.so.1 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-opencl.so seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/bin/nvidia-cuda-mps-control seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/bin/nvidia-cuda-mps-server seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/bin/nvidia-debugdump seems to be mounted read-only.
which: no nvidia-persistenced in (/opt/cray/nvidia/default/bin:/usr/local/bin:/usr/bin:/bin:/sbin)
[ GPU SUPPORT ] =WARNING= Could not find binary: nvidia-persistenced
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/bin/nvidia-smi seems to be mounted read-only.
Thu Jan 4 19:25:45 2018
+------------------------------------------------------+
| NVIDIA-SMI 352.68     Driver Version: 352.68         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20X          On   | 0000:02:00.0     Off |                    0 |
| N/A   27C    P8    17W / 225W |     31MiB /  5759MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Application 64245641 resources: utime ~0s, stime ~0s, Rss ~7660, inblocks ~57564, outblocks ~43459
Note that the PATH and
LD_LIBRARY_PATH environment variables are updated with paths from the host
operating system that contain CUDA executables and shared libraries,
respectively. Therefore, it is important to keep these changes to the
environment variables if you plan to run GPU-enabled applications in Shifter.
MPI in Shifter jobs
Applications in Shifter UDIs can use Blue Waters' high-speed interconnect to send messages between nodes. There are, however, a few requirements that must be met in order for an application in a container to use Blue Waters' fast interconnect for MPI.
- 1. Preparing Docker image: use Docker images with a modern glibc.
- Using MPI with Shifter requires the GNU C library
(glibc) version 2.17 or above. This means that you can use containers based on CentOS / Scientific Linux / RedHat 7, Ubuntu 14.04, or newer.
- 2. Preparing Docker image: Use compatible MPI implementation.
- Applications in User-Defined Images have to be compiled against an MPI
implementation that is part of the MPICH
ABI Compatibility Initiative, an effort to maintain ABI (Application
Binary Interface) compatibility between MPICH-derived MPI implementations.
Currently, the list of compatible MPI implementations includes:
- MPICH v3.1
- Intel® MPI Library v5.0
- Cray MPT v7.0.0
- MVAPICH2 2.0
- Parastation MPI 5.1.7-1
- IBM MPI v2.1
1. Solution: build MPI implementation from source. Luckily, it is not that difficult! For example, here is how you can build MPICH 3.2 (feel free to copy and paste that command into your Dockerfile):
# In Dockerfile
$ cd /usr/local/src/ && \
$ wget http://www.mpich.org/static/downloads/3.2/mpich-3.2.tar.gz && \
$ tar xf mpich-3.2.tar.gz && \
$ rm mpich-3.2.tar.gz && \
$ cd mpich-3.2 && \
$ ./configure && \
$ make && make install && \
$ cd /usr/local/src && \
$ rm -rf mpich-3.2*
- 3. In a Shifter job: set up Programming Environment.
- Use Intel or GNU Programming Environments.
- Unload Cray Compiling Environment (cce) module.
- Use Cray MPICH ABI Compatibility Module (cray-mpich-abi).
We prepared a script that does that for you: shifter_mpi_modules.sh.
# On a MOM node or in a job script
$ module unload PrgEnv-cray PrgEnv-gnu PrgEnv-intel PrgEnv-pgi
$ module unload cce
$ module unload cray-mpich
$ module load PrgEnv-gnu # or PrgEnv-intel
$ module load cray-mpich-abi
$ module load shifter
- 4. In a Shifter job: set up LD_LIBRARY_PATH.
- The ability of an application in a Shifter
UDI to use Blue Waters' interconnect relies on its ability to find the proper libraries. After modifying the programming environment as described above, all of the required libraries can be found in the directories listed in the
CRAY_LD_LIBRARY_PATH environment variable and in
/opt/cray/wlm_detect/default/lib64. Below is the code snippet that adds directories with the required libraries to your LD_LIBRARY_PATH.
We prepared a script that does that for you: shifter_mpi_library_path.sh.
# On a MOM node or in a job script
$ for dir in $(echo $CRAY_LD_LIBRARY_PATH:/opt/cray/wlm_detect/default/lib64 | tr ':' ' ')
$ do
$   realpath=$(readlink -f "$dir")
$   if [[ -z $LD_LIBRARY_PATH ]]
$   then
$     eval 'export LD_LIBRARY_PATH=/dsl'$realpath
$   else
$     eval 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/dsl'$realpath
$   fi
$ done
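The same path construction can also be written as a function without eval. This is only a sketch and not the contents of shifter_mpi_library_path.sh: dsl_library_path is a hypothetical name, and unlike the loop above it skips entries that readlink -f cannot resolve instead of adding a bare /dsl entry.

```shell
#!/bin/bash
# Hypothetical helper: take a colon-separated list of directories and print
# a colon-separated list in which every entry is resolved with readlink -f
# and prefixed with /dsl.
dsl_library_path() {
    local input=$1 result="" dir resolved
    local IFS=':'
    for dir in $input; do
        [ -n "$dir" ] || continue
        # Skip entries that cannot be resolved on this host.
        resolved=$(readlink -f "$dir") || continue
        result="${result:+$result:}/dsl$resolved"
    done
    echo "$result"
}

# Possible use on a MOM node (illustrative only, not executed here):
#   new=$(dsl_library_path "$CRAY_LD_LIBRARY_PATH:/opt/cray/wlm_detect/default/lib64")
#   export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$new"
```

Keeping the logic in a function makes it possible to inspect the resulting list before exporting it.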
- 5. In a Shifter job: make sure LD_LIBRARY_PATH is passed on to the compute nodes.
If you use the
shifter command to execute your application in the container on the compute nodes, be aware that it erases the
LD_LIBRARY_PATH variable. To preserve this (and any other) environment variable on the compute nodes, create a simple wrapper script:
$ cat > wrapper_script.sh <<EOS
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
./your/application --with --flags
EOS
$ chmod u+x wrapper_script.sh
and use it in your aprun call:
$ aprun -b ... -- shifter --image=... -- ./wrapper_script.sh
Note that you don't need to create a wrapper script if you're not using the shifter command.
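The wrapper script works because the here-document delimiter (EOS) is not quoted, so the variable is expanded at the moment the file is written and its current value is stored in the script. The snippet below illustrates the difference on any machine; DEMO_PATH and its value are placeholders standing in for LD_LIBRARY_PATH, and the file names are made up for the demonstration.

```shell
#!/bin/bash
workdir=$(mktemp -d)
DEMO_PATH=/dsl/opt/example/lib   # placeholder standing in for LD_LIBRARY_PATH

# Unquoted delimiter: $DEMO_PATH is expanded now, while the file is written.
cat > "$workdir/wrapper_expanded.sh" <<EOS
export DEMO_PATH=$DEMO_PATH
EOS

# Quoted delimiter: the text is written literally and would be expanded
# only when the wrapper runs, where the variable may already be erased.
cat > "$workdir/wrapper_literal.sh" <<'EOS'
export DEMO_PATH=$DEMO_PATH
EOS

cat "$workdir/wrapper_expanded.sh" "$workdir/wrapper_literal.sh"
```

Only the first form carries the value from the MOM node into the container, which is the behavior the wrapper script needs.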
MPI + GPU in Shifter jobs
Now that we know how to execute both MPI and GPU-enabled applications with Shifter on Blue Waters, we can combine these requirements for GPU-enabled MPI applications:
- When preparing a Docker image
- Use Docker images with
- Use compatible MPI implementation
- Use Docker images with
- When submitting a job
- Request the shifter16 generic resource (-l gres=shifter16)
- Request Blue Waters' XK nodes
- Do not specify UDI as a PBS directive
- On a MOM node
- Use either GNU or Intel Programming Environments
- Unload Cray Compiling Environment (cce) module
- Set CUDA_VISIBLE_DEVICES to 0 (a comma-separated, zero-based list of GPU devices on compute nodes to be used in the job)
- Prepare libraries to be used in the container as described above.
- Use a wrapper script to pass LD_LIBRARY_PATH to the compute nodes.
Below is an example sequence of commands that one can use to run a GPU-enabled MPI application in Shifter once the image has been created and uploaded to Docker Hub.
$ qsub -l nodes=2:ppn=16:xk,walltime=00:30:00,gres=shifter16 -q normal -N shifter-job
# On a MOM node or in a job script
$ source shifter_mpi_modules.sh
$ source shifter_mpi_library_path.sh
$ export CUDA_VISIBLE_DEVICES=0
$ cat > wrapper_script.sh <<EOS
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
./your/application --with --flags
EOS
$ chmod u+x wrapper_script.sh
$ aprun -b ... -- shifter --image=... -- ./wrapper_script.sh
Further reading: Run Tensorflow 1.12 on Blue Waters Using Shifter