PyTorch

PyTorch is an open source machine learning framework that can use MPI and GPUs to speed up training and inference.

Using precompiled images

There are two options for precompiled images:

  1. the pytorch module installed in bwpy, currently limited to pytorch-0.4.1, with both GPU and MPI support (with module load bwpy-mpi); see the example below
  2. shifter images provided by Bryan Lunt via his GitHub repository, which provide pytorch 1.5.0 with GPU and MPI support and pytorch 1.8.0 with CPU support only, due to CUDA version issues
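
For the first option, a minimal sketch of using the precompiled module looks like this (exact behaviour depends on the bwpy version installed; on some versions python must be run inside an interactive job or the bwpy-environ wrapper):

module load bwpy
module load bwpy-mpi
# check that the bundled pytorch (0.4.1) can be imported and print its version
python -c 'import torch; print(torch.__version__)'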

Compiling your own code

Compiling pytorch is quite involved and slow. You should consider using one of the precompiled images listed above. If, however, you require special code in pytorch, these instructions, adapted from Bryan Lunt's compilation instructions, will let you build pytorch 1.4 and possibly newer versions on Blue Waters.

Compiling PyTorch takes long enough that the compiler driver is killed on the login nodes for exceeding the run time limit. Instead we will use an interactive session on a single XK node (assuming free ones are available):

qsub -I -l nodes=1:xk:ppn=16 -l walltime=12:00:00 -l gres=ccm
module load ccm
ccmlogin

Notice that after ccmlogin the prompt changes to reflect that you are now on a compute node and no longer on the shared MOM node. You must not attempt to build pytorch on the MOM node since this would be very disruptive to other, concurrent users of that shared node.

Now, you must start from a setup with just the default modules loaded, i.e. module list shows:

module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.5
  2) nodestat/2.2-1.0502.60539.1.31.gem
  3) sdb/1.1-1.0502.63652.4.27.gem
  4) alps/5.2.4-2.0502.9774.31.12.gem
  5) lustre-cray_gem_s/2.5_3.0.101_0.46.1_1.0502.8871.45.1-1.0502.21728.75.4
  6) udreg/2.3.2-1.0502.10518.2.17.gem
  7) ugni/6.0-1.0502.10863.8.28.gem
  8) gni-headers/4.0-1.0502.10859.9.27.gem
  9) dmapp/7.0.1-1.0502.11080.8.74.gem
 10) xpmem/0.1-2.0502.64982.7.27.gem
 11) hss-llm/7.2.0
 12) Base-opts/1.0.2-1.0502.60680.2.4.gem
 13) cce/8.7.7
 14) craype-network-gemini
 15) craype/2.5.16
 16) cray-libsci/18.12.1
 17) pmi/5.0.10-1.0000.11050.179.3.gem
 18) rca/1.0.0-2.0502.60530.1.63.gem
 19) atp/2.0.4
 20) PrgEnv-cray/5.2.82
 21) cray-mpich/7.7.4
 22) craype-interlagos
 23) torque/6.1.2
 24) moab/9.1.2.h6-sles11
 25) xalt/0.7.6.local
 26) scripts
 27) OpenSSL/1.0.2m
 28) cURL/7.59.0
 29) git/2.17.0
 30) wget/1.19.4
 31) user-paths
 32) gnuplot/5.0.5
 33) darshan/3.1.3

then proceed by adapting the instructions on the Python portal page for compiling code for bwpy.

First, load the modules for Python, gcc and CUDA:

module load bwpy/2.0.4
module swap PrgEnv-cray PrgEnv-gnu/5.2.82-gcc.4.9.3
module swap gcc/4.9.3 gcc/5.3.0
module load cmake/3.9.4 cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1

and set up a virtualenv to hold the resulting code:

mkdir $HOME/pytorch-1.4.0
cd $HOME/pytorch-1.4.0
virtualenv --system-site-packages $PWD
source bin/activate
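
At this point python and pip should resolve to the virtualenv; a quick, optional check:

# both should point into $HOME/pytorch-1.4.0/bin
which python
which pip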

Since we want to use bwpy but not its included pytorch module, we need to fudge things a bit to prevent bwpy from interfering:

# fudge things so that pytorch in bwpy does not interfere
mkdir -p bwpy/lib bwpy/include
ln -s /mnt/bwpy/single/usr/include/cudnn.h /mnt/bwpy/single/usr/include/nccl.h bwpy/include
for i in libcudnn.so libcudnn.so.7 libcudnn.so.7.0.5 libcudnn_static.a libnccl.so \
  libnccl.so.1 libnccl.so.1.2.3 libnccl.so.2 libnccl.so.2.1.15 libnccl_static.a ; do
  ln -s /mnt/bwpy/single/usr/lib/$i bwpy/lib/
done

# make the OpenSSL module visible to pkg-config and clear include/library search
# paths that could pull in bwpy's own libraries
export PKG_CONFIG_PATH=$EBROOTOPENSSL/lib/pkgconfig:$PKG_CONFIG_PATH
unset CPATH
unset LIBRARY_PATH
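
Before starting the long build it is worth confirming that the symlinks and the OpenSSL pkg-config entry are in place (a simple sanity check, not part of the original recipe):

# the cuDNN and NCCL headers and libraries should now appear under bwpy/
ls -l bwpy/include bwpy/lib
# pkg-config should report the version of the loaded OpenSSL module (1.0.2m)
pkg-config --modversion openssl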

Then install prerequisites (in the virtualenv):

pip install pybind11==2.6.2

and configure the build system:

# build pytorch
export CC=gcc
export CXX=g++
export USE_MKLDNN=1
export MKLDNN_THREADING=OMP
export USE_CUDA=1
export TORCH_CUDA_ARCH_LIST="3.5"
export BUILD_CAFFE2_OPS=0
export BUILD_TEST=0

export USE_SYSTEM_NCCL=1
export NCCL_ROOT=/mnt/bwpy/single/usr
export NCCL_LIB_DIR=$PWD/bwpy/lib
export NCCL_INCLUDE_DIR=$PWD/bwpy/include
export CUDNN_LIB_DIR=$PWD/bwpy/lib
export CUDNN_INCLUDE_DIR=$PWD/bwpy/include
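
Before cloning you may also want to confirm that the build will pick up the intended compilers and CUDA toolkit (another optional check; the exact version strings depend on the modules loaded above):

# expect gcc 5.3.0 and CUDA 9.1 here
gcc --version | head -n1
nvcc --version | tail -n1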

After that, clone pytorch, check out the required version, and patch out some code that requires a newer glibc than Blue Waters provides:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.4.0
git submodule sync
git submodule update --init --recursive
REPO_URL=https://raw.githubusercontent.com/bryan-lunt-supercomputing/blue-waters-pytorch
HASH=b4ab4c1b6ceb0b2a508ab426b63c030822adee5b
wget $REPO_URL/$HASH/pt-bw.patch
patch -p1 <pt-bw.patch

Once that is done you can build and install pytorch in the virtualenv; this will, however, take several hours:

python setup.py build --verbose 2>&1 | tee setup.log
python setup.py install --verbose 2>&1 | tee --append setup.log

A quick test to show the pytorch version in use is:

# move out of pytorch build directory
cd ../
python -c 'import torch;print(torch.__version__)'
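
If the import succeeds you can, still inside the interactive XK session, also check that the GPU is visible to the freshly built pytorch (a hedged extra check; the XK nodes should report a Tesla K20X):

# report the version, whether CUDA is usable, and the name of GPU 0 if it is
python -c 'import torch; ok = torch.cuda.is_available(); print(torch.__version__, "CUDA available:", ok); print(torch.cuda.get_device_name(0) if ok else "no GPU visible")'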

Acknowledgement

These instructions build heavily on the work of Bryan Lunt.