Shifter


Shifter is a software solution that enables execution of *nix applications on HPC systems within isolated Linux environments. Shifter containers, called User-Defined Images, are analogous to Linux containers and provide a convenient way of bundling applications with all of their prerequisites and necessary components of the underlying operating system. Shifter UDIs are designed for quick distribution across compute nodes of supercomputers and can be generated from various types of Linux containers, including Docker images. This guide demonstrates how to work with Shifter UDIs based on Docker images from Docker Hub on Blue Waters.

General Shifter Workflow

The general workflow for using Docker images with Shifter on Blue Waters is as follows:

1. Create (build) a Docker image on your computer
2. Upload (push) the image to a registry (Docker Hub)
3. Download (pull) the image from the registry to an HPC system (Blue Waters)
4. Launch (use) the container in a job

In what follows, we will guide you through steps 2 – 4, as the process of building Docker images is application-specific and can be skipped if you plan on using existing images from public repositories on Docker Hub.
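
For reference, steps 1 and 2 on a personal computer might look like the following minimal sketch (the image name myuser/myapp:1.0 and the Dockerfile are hypothetical placeholders):

$ # Step 1: build the image from a Dockerfile in the current directory
$ docker build -t myuser/myapp:1.0 .
$ # Step 2: log in to Docker Hub and push the image
$ docker login
$ docker push myuser/myapp:1.0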

Shifter on Blue Waters

On Blue Waters, Shifter is available in the form of a module called shifter. Unlike other modules, however, it can be loaded only on Blue Waters' MOM nodes. These are the nodes where you land in an interactive job and where your batch job scripts are executed.

Important


The entire Blue Waters system has only 64 MOM nodes. These nodes are shared by all Blue Waters users whose interactive or batch jobs are currently executing. Therefore, it is important not to use them for compute- or data-intensive tasks.

On a personal computer, one can work with Docker images via the full-featured set of commands provided by Docker. On Blue Waters, where Docker is not available, one has to use the shifterimg tool from the shifter module. This tool provides a small set of essential utility functions that enable the use of Docker images on Blue Waters. This documentation describes the workflow for the latest version of the shifter module available on the system (called shifter). In previous versions of the module, the shifterimg tool was called getDockerImage.

Getting Started

Because interaction with Shifter can take place only on MOM nodes, let's start an interactive job on a single node. From a Blue Waters login node, execute the following command:

$ qsub -I -l nodes=1:ppn=1 -l walltime=00:10:00

Important


In this guide, lines in code blocks that start with '$' indicate commands typed manually in a terminal. Feel free to copy and paste them into your terminal but make sure not to include the '$'.

Once the interactive job starts, we can load the shifter module and try out the shifterimg command:

$ module load shifter
$ shifterimg
Must specify mode (images, lookup, pull)!
Usage:
 shifterimg [options] <mode> <type:tag>

    Mode: images, lookup, or pull

Options:
--user/-u <list>    List of users allowed to access a private image
--group/-g <list>   List of groups allowed to access a private image
--verbose/-v              Verbose output
--help/-h                 Display this help

Note: the --user and --group options are only relevant  for image pulls

Let's now go over the three sub-commands provided by shifterimg.

shifterimg images

The shifterimg images command lists the User-Defined Images that are already available on Blue Waters:

$ shifterimg images
bluewaters docker READY a8b4df3be8 2017-08-24T17:12:55 centos:6.7
bluewaters docker READY 2fc0dfcb36 2017-08-28T12:05:39 mbelkin/centos7-mpich-ext:3.2
bluewaters docker READY cbbf067bf4 2017-08-29T10:16:58 opensuse:13.2
bluewaters docker READY 58597429ab 2017-08-22T20:32:54 ubuntu:latest

Let's have a closer look at the first line of the output above. It tells us:

  • Short name of the system where the image is stored (on Blue Waters it is bluewaters)
  • Image type (docker)
  • Image status (READY)
  • Image identifier (a8b4df3be8)
  • Date and time the image was created on Blue Waters (2017-08-24T17:12:55)
  • Image name (centos:6.7)

Shifter UDIs follow the naming convention of Docker images: repo/image:tag, where tag identifies a specific version of an image called image in the Docker repository repo. For example, in mbelkin/centos7-mpich-ext:3.2 from the listing above, the repository is mbelkin, the image is centos7-mpich-ext, and the tag is 3.2. As with Docker, repo can be empty. Unlike Docker, which permits omitting the tag (in which case it defaults to latest), Shifter requires that image names always include tags.

shifterimg lookup

The shifterimg lookup subcommand returns the identifier of an image if it is currently available on the machine. If the image is not available, shifterimg does not print any message but returns an exit status of 1:

$ shifterimg lookup centos:6.9
bf9ec5e347df5637a0efad42bbd4fefdad887a4334c905d00ea3a5892592d8cc
$ echo $?
0
$ shifterimg lookup ubuntu:zesty
$ echo $?
1

Note that the lookup subcommand ignores command-line arguments after the first one:

$ shifterimg lookup centos:6.9 ubuntu:zesty
bf9ec5e347df5637a0efad42bbd4fefdad887a4334c905d00ea3a5892592d8cc
$ echo $?
0

shifterimg pull

The shifterimg pull command downloads new Docker images to Blue Waters and updates existing ones. All downloaded or updated images are automatically converted to the UDI format. The syntax of the command is as follows:

$ shifterimg pull repo/image:tag

For example, to download the latest version of an image called polyglot that is located in the ncsa repository on Docker Hub, we can use:

$ shifterimg pull ncsa/polyglot:latest

When we execute the shifterimg pull command, Shifter checks whether the requested image already exists on Blue Waters and is up-to-date. If it does not exist or is out of date, Shifter downloads the missing Docker image layers and converts the entire image stack to the UDI format. As the new layers are downloaded and processed, Shifter displays the following status messages one after another: INIT, PULLING, EXAMINATION, CONVERSION, TRANSFER, and READY:

2017-09-11T10:41:20 Pulling Image: docker:ncsa/polyglot:latest, status: INIT
2017-09-11T10:41:29 Pulling Image: docker:ncsa/polyglot:latest, status: PULLING
2017-09-11T10:42:03 Pulling Image: docker:ncsa/polyglot:latest, status: EXAMINATION
2017-09-11T10:42:19 Pulling Image: docker:ncsa/polyglot:latest, status: CONVERSION
2017-09-11T10:42:36 Pulling Image: docker:ncsa/polyglot:latest, status: TRANSFER
2017-09-11T10:42:45 Pulling Image: docker:ncsa/polyglot:latest, status: READY

The output lines above replace one another as the pull progresses; in the end, you will see only the last line.

Important


Shifter on Blue Waters can download images from Docker Hub (hub.docker.com) only. Please contact us if you would like to download an image from another registry.

In the rest of this guide we will work with the centos:latest Docker image:

$ shifterimg lookup centos:latest
f3b88ddaed1649ab85c7627d3ffdf2b3235d6dcbab5a5f6d9134088b18dfb598

We can verify that the image is up-to-date using the shifterimg pull command:

$ shifterimg pull centos:latest
2017-09-11T12:57:12 Pulling Image: docker:centos:latest, status: READY

Once the Docker image is downloaded and packaged into a UDI, we can use it in batch and interactive jobs.
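
Because shifterimg lookup returns a non-zero exit status when an image is absent, the two subcommands can be combined in scripts to pull only when necessary. A minimal sketch:

$ # Pull the image only if it is not already available on Blue Waters
$ shifterimg lookup centos:latest > /dev/null || shifterimg pull centos:latest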

Pulling images from private repositories

Shifter provides a way to pull images from private repositories. First, let's provide Shifter with our Docker Hub credentials (username/password pair):

$ shifterimg login
default username:<Docker Hub username>
default password:<Docker Hub password>

Now we can pull images from our private repositories on Docker Hub. To keep images private on Blue Waters, we have to specify the users who can access them when we pull the images. To do that, we can use either the --user or the --group option of the shifterimg command. These options restrict visibility of, and access to, the pulled private image to selected users or groups. For example:

$ shifterimg --user bw_user1,bw_user2,bw_user3 pull myrepo/myprivateimage1:latest

will pull the myprivateimage1:latest image from the myrepo repository on Docker Hub and set it up so that it is accessible only by Blue Waters users bw_user1, bw_user2, and bw_user3. No other users will be able to see or work with this image.

Likewise, the --group option limits access to the pulled image to the specified groups:

$ shifterimg --group bw_group1,bw_group2 pull myrepo/myprivateimage2:latest

Here, only members of groups bw_group1 and bw_group2 will be able to access the myprivateimage2:latest image. Note that to list all the groups you are a member of on Blue Waters, you can use either the id -Gn or the groups command. Also note that even if you use the --user or --group options when pulling public images, those images will still be visible to and accessible by all Blue Waters users.

Preparing to Run Shifter Jobs

In order to use Shifter in a job, we must do two things:

a. request the shifter16 generic resource
b. specify the UDI to be sent to compute nodes

There are two ways to make a generic resource request. We can do so either on the command line:

$ qsub -l gres=shifter16 ...

or in a job script as an additional PBS directive:

#PBS -l gres=shifter16

This request ensures that Torque, Blue Waters' resource manager, executes a proper prologue script.

There are also two ways to specify which UDI you wish to use:

a. with a PBS directive (UDI environment variable)
b. by using shifter command within the job

Let's have a closer look at both methods.

Specifying UDI with a PBS directive

We can specify an image to use in a job with a PBS directive by setting an environment variable called UDI. We have to set it to the full name of the image, imagename:tag, and we can do so either on the command line:

$ qsub -l gres=shifter16 -v UDI=centos:latest -l nodes=2:ppn=32:xe -l walltime=00:30:00

or as a directive in a job script:

#PBS -v UDI=centos:latest

When a job that specifies a User-Defined Image using either of the above approaches starts up, Blue Waters' special prologue script mounts the named image on the compute nodes and performs some other maintenance tasks. If successful, you should see a message similar to this one:

In Torque Shifter prologue batchID: 224301
Starting munge service on compute nodes
Successfully started munge service on compute nodes
Initializing udiRoot, please wait.
Retrieving Docker Image
udiRoot Start successful

This message indicates that the Torque prologue has set up the requested UDI for use in the job. If the requested image has not been previously downloaded and packaged, the prologue will initiate both of these steps and, as a result, may time out (currently, the time limit for pulling an image is 1 hour). Therefore, if you would like to specify a UDI this way, it is strongly recommended that you pull the image in advance.
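
Putting these pieces together, a minimal batch script using this approach might look like the following sketch (node count, walltime, and the application are illustrative; CRAY_ROOTFS is explained in the sections below):

#!/bin/bash
#PBS -l nodes=1:ppn=32:xe
#PBS -l walltime=00:30:00
#PBS -l gres=shifter16
#PBS -v UDI=centos:latest

# Run the following aprun command inside the UDI
export CRAY_ROOTFS=SHIFTER
aprun -n 1 -- cat /etc/centos-release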

Specifying UDI in a shifter command call

Another way to specify which UDI to use in a job is by calling the shifter command provided by the module and using its --image option. This approach enables multi-step workflows in which every step is performed in its own UDI (see the sketch after the notes below):

$ qsub -l gres=shifter16 ...
...
$ # On a MOM node
$ module load shifter
$ aprun -b ... -- shifter --image=docker:centos:latest -- <app> <args>

A few notes on the above command:

  1. The CRAY_ROOTFS environment variable has to be unset when we call the shifter command on the compute nodes using aprun.
  2. Specifying the image type (docker, as in --image=docker:centos:latest) is optional.
  3. Specifying the UDI as a PBS directive (as in -v UDI=image:tag or #PBS -v UDI=image:tag) and using the shifter command at the same time will produce an error. When we specify the UDI as a PBS directive, the Torque prologue sets up the container environment on the compute nodes. When we then call shifter with aprun and specify an image with the --image flag, it attempts to set up the Shifter environment on the compute nodes again and fails because it cannot overwrite the one that was set up by Torque.
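
For example, a two-step workflow in which each step runs in its own UDI might look like the following minimal sketch (the applications and their arguments are placeholders):

$ module load shifter
$ # Step 1 in one UDI ...
$ aprun -b -n 1 -- shifter --image=centos:latest -- <preprocess> <args>
$ # ... step 2 in another
$ aprun -b -n 1 -- shifter --image=ubuntu:latest -- <analysis> <args>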

Running applications in a Shifter environment

When Blue Waters starts a compute job, it places it on a MOM node. From there, we can send our applications for execution in either the standard or the Shifter environment. The set of commands to do that depends on whether we specified the UDI as a PBS directive or plan to specify it as an argument to the shifter command.

In order to execute an application in the Shifter environment when the UDI was specified as a PBS directive, we have to set the CRAY_ROOTFS environment variable to SHIFTER. In Bash, the default shell on Blue Waters, you can do so by executing:

$ export CRAY_ROOTFS=SHIFTER

If all of the code that we plan to run is part of the UDI, we can set the CRAY_ROOTFS environment variable the same way we set UDI, that is:

$ qsub ... -v CRAY_ROOTFS=SHIFTER ...

Keep in mind that if you choose to set CRAY_ROOTFS on the command line and you need to run some code on the compute nodes that is not contained in the UDI, you have to unset CRAY_ROOTFS:

$ export -n CRAY_ROOTFS

Now we are ready to execute our application packaged in the UDI! For example, to print the contents of the /etc/centos-release file that is part of the UDI, all we have to do is execute:

$ aprun -n 1 -- cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
Application 63788813 resources: utime ~0s, stime ~1s, Rss ~18096, inblocks ~4, outblocks ~0

Note that this file exists in the centos:latest Shifter UDI only; if we unset the CRAY_ROOTFS variable, we will not be able to access it:

$ export -n CRAY_ROOTFS
$ aprun -n 1 -- cat /etc/centos-release
cat: /etc/centos-release: No such file or directory
Application 63788941 exit codes: 1
Application 63788941 resources: utime ~0s, stime ~1s, Rss ~18096, inblocks ~3, outblocks ~0
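
To summarize, a single job can alternate between the UDI and the standard environment by toggling CRAY_ROOTFS, as in this minimal sketch:

$ export CRAY_ROOTFS=SHIFTER
$ aprun -n 1 -- cat /etc/centos-release   # runs inside the UDI
$ export -n CRAY_ROOTFS
$ aprun -n 1 -- hostname                  # runs in the standard environment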

If we don't specify the UDI as a PBS directive, we have to use the shifter command provided by the shifter module. The above example then translates to:

$ module load shifter
$ aprun -n 1 -b -- shifter --image=centos:latest -- cat /etc/centos-release

A clear advantage of the shifter command is that once the above command completes, we can use a different UDI to execute another application in a new environment.

Please note the -b option that we added to the aprun call above. This is an important flag to remember when working with Shifter. It instructs aprun not to transfer the executable file (shifter) from the MOM node to the compute nodes. If we forgot this flag, aprun would transfer the shifter executable to the compute nodes and unset its setuid bit there. This, in turn, would cause the entire command to fail.

Important


Be extra careful with special symbols (such as *) on the command line when submitting Shifter jobs. Bash performs pathname expansion before passing arguments to the aprun command, so if you are not careful you might see a No such file or directory error. Therefore, we recommend that you use scripts for all your Shifter-related work on Blue Waters.

Mapping directories in Shifter UDIs

A distinct feature of Shifter images is that they are read-only: the only way to update them is by pulling (downloading) newer versions of the corresponding Docker images. This, in turn, means that although input files can be part of UDIs, the results of simulations and analyses produced in Shifter jobs have to be stored on the Blue Waters filesystem. For that purpose, Shifter adds special hooks into UDIs to make sure that the scratch (/mnt/c/scratch), projects (/mnt/b/projects), and home (/mnt/a/u/sciteam/<username>) filesystems are available when Shifter jobs run.

In addition to these automatic hooks, Shifter allows us to manually map existing directories of the Blue Waters filesystems to existing directories within UDIs. For example, we can map our home directory on Blue Waters to the /home directory within the image. There are two ways to specify such mappings:

a. when the Shifter job is submitted to the queue
b. as an argument to the shifter command (provided by the shifter module)

Let's have a look at both of these methods.

Mapping directories when submitting a Shifter job

When we submit Shifter jobs to the queue, we have the option to specify a mapping between existing directories of the Blue Waters filesystems and those in the user-defined images by amending the UDI assignment in the following way:

$ qsub -l gres=shifter16 -v UDI="centos:latest -v /mnt/a/u/sciteam/<username>:/home" ...

Once the above job starts, the specified Blue Waters directory will be mapped onto the /home directory within the centos:latest UDI. This mapping makes all files and folders within the directory on Blue Waters accessible from the /home directory in the UDI. It also ensures that any changes made to the /home directory in the job are reflected in the actual directory on Blue Waters.

Important


When mapping directories, the contents of the directory on Blue Waters replace the contents of the directory in the UDI for that Shifter job only; no changes to the actual UDI are made. Make sure not to map over directories within the UDI that contain information required for the job to run.

Mapping directories from within a Shifter job

The other way to specify a volume mapping between the Blue Waters filesystems and a UDI is by using the shifter command and its --volume (or -V) flag directly. For example, to achieve the same mapping as above, we would use the following sequence of commands:

$ qsub -l gres=shifter16 ...
$ module load shifter
$ aprun -b ... -- shifter --image=centos:latest --volume=/mnt/a/u/sciteam/<username>:/home ...
$ # or
$ aprun -b ... -- shifter --image=centos:latest -V /mnt/a/u/sciteam/<username>:/home ...

Note that one cannot:

  1. Overwrite volume mappings specified by Shifter itself
  2. Map a directory to any of the following directories and their subdirectories within UDI: /dev, /etc, /opt/udiImage, /proc, /var.
  3. Use symbolic links when specifying the directory to be mapped: /u/sciteam/user:/path/in/image will fail, and the correct syntax is /mnt/a/u/sciteam/user:/path/in/image.

If we try to map one of the restricted folders, we will receive one of the following error messages:

$ aprun -b -- shifter --image=centos:latest --volume=/mnt/a/u/sciteam/<username>:/etc -- ...
Invalid Volume Map: /mnt/a/u/sciteam/<username>:/etc, aborting! 1
Failed to parse volume map options
...
$ aprun -b -- shifter --image=centos:latest --volume=/mnt/a/u/sciteam/<username>:/dev -- ...
mount: warning: ufs seems to be mounted read-only.
Mount request path /var/udiMount/dev not on an approved device for volume mounts.
FAILED to setup user-requested mounts.
FAILED to setup image.

Accessing compute nodes running Shifter jobs via SSH

Just like with any other application, you might need to interact with an application running in a Shifter environment for debugging, monitoring, or other purposes. To enable such interactions, Shifter allows users to log in to the compute nodes of their jobs via the standard ssh command-line tool. There are several requirements, however, in order to make use of this feature:

1. Specify the UDI as a PBS directive.
To allow users to log in to its compute nodes, Shifter can start up SSH daemons. The daemons on the compute nodes can be launched only by the prologue script, which is executed when the job starts. Therefore, in order to be able to log in to compute nodes with a Shifter job running on them, it is necessary to specify the UDI as a PBS directive.
2. Prepare a special SSH key pair.
On startup, the SSH daemons enabled by Shifter look for a private SSH key in $HOME/.shifter and wait for a connection on port 1204 authenticated with this key. To prepare such a key pair, execute:
$ mkdir -p ~/.shifter
$ ssh-keygen -t rsa -f ~/.shifter/id_rsa -N ''

Once the above two steps are completed, we can log in to the compute nodes using:

$  ssh -p 1204 -i ~/.shifter/id_rsa -o StrictHostKeyChecking=no \
-o UserKnownHostsFile=/dev/null -o LogLevel=error nodename

It is advisable to save all the above options into a configuration file. To do that, execute:

$  cat <<EOF > ~/.shifter/config
Host *
    Port 1204
    IdentityFile ~/.shifter/id_rsa
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    LogLevel error
EOF

Now, we can log in to the compute nodes with a simple:

$  ssh -F ~/.shifter/config nodename

To log in to a remote machine using the ssh command, we have to specify the remote machine's network name. To find the names of the compute nodes assigned to the Shifter job, execute the following command on a MOM node before setting the CRAY_ROOTFS environment variable:

$ aprun -n $PBS_NUM_NODES -N 1 -b -- hostname

You should see a list of names of the form: nidXXXXX, where XXXXX is a five-digit number. Use these to connect to the compute nodes:

$ ssh -F ~/.shifter/config nidXXXXX

Make sure, however, that you do not accidentally copy the network name of the MOM node where you execute all the aprun commands.
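
As an illustration, the following minimal sketch runs a command on every compute node in the job over SSH (run it before setting CRAY_ROOTFS; the grep filter keeps only the nidXXXXX names and drops the aprun summary line):

$ for node in $(aprun -n $PBS_NUM_NODES -N 1 -b -- hostname | grep '^nid'); do
      ssh -F ~/.shifter/config "$node" uptime
  done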

Important


ssh will fail with a Permission denied error if your login shell does not exist in the container or is not listed in the container's /etc/shells file.

GPUs in Shifter jobs

If your application benefits from or relies upon CUDA-capable accelerators, make sure to use the NVIDIA Kepler K20X GPUs that are installed on Blue Waters' XK nodes. Currently, this is supported only when using the shifter command provided by the module. To control which GPU devices should be accessible from within the container, Shifter uses the environment variable CUDA_VISIBLE_DEVICES. The value of this variable is a 0-based, comma-separated list of CUDA-capable device IDs on the host system (Blue Waters). Because XK nodes have only one NVIDIA GPU each, the only value we can set this variable to is 0. Note that on systems with more than one NVIDIA GPU, device IDs within the container would start at 0 regardless of their IDs on the host system. This enables transparent use of containers on systems with different numbers of GPUs per node.

As an example, here is how we can start a 2-node Shifter job that uses GPUs:

$ qsub -l gres=shifter16,nodes=2:ppn=16:xk ...
$ # On a MOM node
$ module load shifter
$ export CUDA_VISIBLE_DEVICES=0
$ aprun -b -- shifter --image=centos:latest -- nvidia-smi
mount: warning: ufs seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libcuda.so.1 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libcuda.so seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-compiler.so.352.68 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-compiler.so seems to be mounted read-only.
[ GPU SUPPORT ] =WARNING= Could not find library: nvidia-ptxjitcompiler
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-encode.so.1 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-encode.so seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-ml.so.1 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-ml.so seems to be mounted read-only.
[ GPU SUPPORT ] =WARNING= Could not find library: nvidia-fatbinaryloader
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-opencl.so.1 seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/lib64/libnvidia-opencl.so seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/bin/nvidia-cuda-mps-control seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/bin/nvidia-cuda-mps-server seems to be mounted read-only.
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/bin/nvidia-debugdump seems to be mounted read-only.
which: no nvidia-persistenced in (/opt/cray/nvidia/default/bin:/usr/local/bin:/usr/bin:/bin:/sbin)
[ GPU SUPPORT ] =WARNING= Could not find binary: nvidia-persistenced
mount: warning: /var/udiMount/opt/shifter/site-resources/gpu/bin/nvidia-smi seems to be mounted read-only.
Thu Jan  4 19:25:45 2018
+------------------------------------------------------+
| NVIDIA-SMI 352.68     Driver Version: 352.68         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20X          On   | 0000:02:00.0     Off |                    0 |
| N/A   27C    P8    17W / 225W |     31MiB /  5759MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Application 64245641 resources: utime ~0s, stime ~0s, Rss ~7660, inblocks ~57564, outblocks ~43459

Shifter populates the PATH and LD_LIBRARY_PATH environment variables with paths from the host operating system that contain CUDA executables and shared libraries, respectively. Therefore, it is important to keep these changes to the environment variables if you plan to run GPU-enabled applications in a Shifter environment.
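
To see what Shifter injected, you can inspect the variables from inside the container. A minimal sketch (based on the site-resources/gpu paths in the mount messages above, you should see an entry like /opt/shifter/site-resources/gpu/lib64):

$ export CUDA_VISIBLE_DEVICES=0
$ aprun -b -- shifter --image=centos:latest -- /bin/bash -c 'echo "$LD_LIBRARY_PATH" | tr ":" "\n" | grep gpu'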

MPI in Shifter UDI

Shifter allows applications to send messages between nodes using the underlying high-speed interconnect. There are a few requirements, however, that an application in a Shifter UDI must meet in order to use this feature.

1. Use a compatible MPI implementation.
Applications in User-Defined Images have to be compiled against an MPI implementation that is part of the MPICH ABI Compatibility Initiative, an effort to maintain ABI (Application Binary Interface) compatibility between MPICH-derived MPI implementations. Currently, the list of compatible MPI implementations includes:
  • MPICH v3.1
  • Intel® MPI Library v5.0
  • Cray MPT v7.0.0
  • MVAPICH2 2.0
  • Parastation MPI 5.1.7-1
  • IBM MPI v2.1
or a later release of any of these. ABI compatibility allows Shifter to replace MPI libraries in the container with those from Cray at run time.
2. Don't use a package manager to install MPI libraries.
Currently, Shifter requires that the MPI implementation you link your application against reside in “user space”. If you link your application against MPI libraries provided by the package manager, the application will not be able to use the interconnect of the underlying system, and every MPI rank will report a communicator (MPI_COMM_WORLD) size of 1. Solution: build the MPI implementation from source (see the sketch after this list); luckily, it is not at all difficult.
3. Use modern glibc (2.17+).
Shifter requires the GNU C library (glibc) version 2.17 or above. This means that you can use containers based on CentOS / Scientific Linux / RedHat 7, Ubuntu 14.04, or newer. However, if you absolutely must use a container based on an older operating system, you can try updating its glibc. If you experience any difficulties, feel free to contact Blue Waters support at help+bw@ncsa.illinois.edu.
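
As an illustration, building MPICH from source into user space (for example, while creating your Docker image) might look like the following minimal sketch; the MPICH version and install prefix are illustrative:

$ # Download, build, and install MPICH into user space
$ wget https://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1.tar.gz
$ tar xf mpich-3.2.1.tar.gz && cd mpich-3.2.1
$ ./configure --prefix=/usr/local/mpich
$ make -j4 && make install
$ # Compile and link your application against this MPICH
$ export PATH=/usr/local/mpich/bin:$PATH
$ mpicc -o myapp myapp.c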