PyTorch

PyTorch is an open source machine learning framework that can use MPI and GPUs to speed up training and inference.

Using precompiled images

There are two options for precompiled images:

  1. the pytorch module installed in bwpy, currently limited to pytorch-0.4.1, with both GPU and MPI support (with module load bwpy-mpi)
  2. shifter images provided by Bryan Lunt via his GitHub repository, which offer pytorch 1.5.0 with GPU and MPI support, and pytorch 1.8.0 with CPU support only because of CUDA version issues

Compiling your own code

Compiling pytorch is quite involved and slow, so you should consider using one of the precompiled images listed above. If, however, you require special code in pytorch, these instructions, adapted from Bryan Lunt's compilation instructions, will let you build pytorch 1.4 and possibly newer versions on Blue Waters.

You must start from a setup with only the default modules loaded, i.e. module list shows:

> module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.4                      17) PrgEnv-cray/5.2.82
  2) eswrap/1.3.3-1.020200.1280.0          18) cray-mpich/7.7.4
  3) cce/8.7.7                             19) craype-interlagos
  4) craype-network-gemini                 20) torque/6.0.4
  5) craype/2.5.16                         21) moab/9.1.2-sles11
  6) cray-libsci/18.12.1                   22) openssh/7.5p1
  7) udreg/2.3.2-1.0502.10518.2.17.gem     23) xalt/0.7.6.local
  8) ugni/6.0-1.0502.10863.8.28.gem        24) scripts
  9) pmi/5.0.14                            25) OpenSSL/1.0.2m
 10) dmapp/7.0.1-1.0502.11080.8.74.gem     26) cURL/7.59.0
 11) gni-headers/4.0-1.0502.10859.7.8.gem  27) git/2.17.0
 12) xpmem/0.1-2.0502.64982.5.3.gem        28) wget/1.19.4
 13) dvs/2.5_0.9.0-1.0502.2188.1.113.gem   29) user-paths
 14) alps/5.2.4-2.0502.9774.31.12.gem      30) gnuplot/5.0.5
 15) rca/1.0.0-2.0502.60530.1.63.gem       31) darshan/3.1.3
 16) atp/2.0.4

Then proceed by adapting the instructions on the Python portal page to compile code for bwpy.
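
Before continuing, it can help to confirm that the session really starts from the default modules. The sketch below is one possible check, not part of the original instructions. has_module is a hypothetical helper name, and the commented usage assumes that module list writes its listing to standard error, as the Environment Modules 3.2 series does.

```shell
# Hypothetical helper: succeed if the module-list text in $1 contains
# an entry whose name starts with $2 (for example "PrgEnv-cray").
has_module() {
  # entries look like "17) PrgEnv-cray/5.2.82", so match ") name" literally
  printf '%s\n' "$1" | grep -q -F -- ") $2"
}

# Assumed usage on a login node (commented out, it needs the module command):
#   loaded=$(module list 2>&1)
#   has_module "$loaded" PrgEnv-cray || echo "PrgEnv-cray is not loaded" >&2
#   if has_module "$loaded" bwpy; then echo "bwpy is already loaded" >&2; fi
```

If either message appears, starting a fresh login session is probably the simplest way to get back to the default set.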

First load the modules for Python, gcc and CUDA:

module load bwpy/2.0.4
module swap PrgEnv-cray PrgEnv-gnu/5.2.82-gcc.4.9.3
module swap gcc/4.9.3 gcc/5.3.0
module load cmake/3.9.4 cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
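
To confirm that the swaps took effect, the versions reported by the tools can be compared with the ones requested. This is a minimal sketch: version_matches is a hypothetical helper name, and the commented usage assumes that gcc -dumpversion prints the full version string (5.3.0 here).

```shell
# Hypothetical helper: succeed if version string $2 equals $1 or starts
# with "$1." (so "5.3" matches "5.3.0" but not "5.30.0" or "4.9.3").
version_matches() {
  case "$2" in
    "$1" | "$1".*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed usage after loading the modules (commented out here):
#   version_matches 5.3 "$(gcc -dumpversion)" || echo "unexpected gcc version" >&2
#   nvcc --version    # should report a 9.1 release
#   cmake --version   # should report 3.9.4
```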

and set up a virtualenv to hold the resulting code:

virtualenv --system-site-packages $PWD
source bin/activate
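
All later pip and python commands are meant to run inside this virtualenv, so a quick check that it is active can save a failed install. The sketch below assumes that the activate script sets VIRTUAL_ENV to the directory passed to virtualenv; venv_active_in is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if a virtualenv rooted at directory $1 is active.
venv_active_in() {
  [ -n "${VIRTUAL_ENV:-}" ] && [ "$VIRTUAL_ENV" = "$1" ]
}

# Assumed usage, run from the directory in which the virtualenv was created:
#   venv_active_in "$PWD" || echo "activate the virtualenv first" >&2
```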

Since we want to use bwpy but not its included pytorch module, we need to fudge things a bit to prevent bwpy from interfering:

# fudge things so that pytorch in bwpy does not interfere
mkdir -p bwpy/lib bwpy/include
ln -s /mnt/bwpy/single/usr/include/cudnn.h /mnt/bwpy/single/usr/include/nccl.h bwpy/include
for i in libcudnn.so libcudnn.so.7 libcudnn.so.7.0.5 libcudnn_static.a libnccl.so \
  libnccl.so.1 libnccl.so.1.2.3 libnccl.so.2 libnccl.so.2.1.15 libnccl_static.a ; do
  ln -s /mnt/bwpy/single/usr/lib/$i bwpy/lib/
done

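# let pkg-config find the OpenSSL module; export so that child processes see it
export PKG_CONFIG_PATH=$EBROOTOPENSSL/lib/pkgconfig:$PKG_CONFIG_PATH
# clear compiler search paths inherited from the loaded modules
unset CPATH
unset LIBRARY_PATH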
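
ln -s creates a link even when its target does not exist, so a mistyped or missing library name only shows up much later in the build. As a sketch of how to catch this early, the hypothetical helper broken_links below lists dangling links in a directory.

```shell
# Hypothetical helper: print every dangling symbolic link found directly
# inside directory $1.
broken_links() {
  for f in "$1"/*; do
    # -L: the entry is a symlink; ! -e: its target does not exist
    if [ -L "$f" ] && [ ! -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Assumed usage from the virtualenv directory (commented out here):
#   broken_links bwpy/lib
#   broken_links bwpy/include
```

No output means that every link resolves to an existing file.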

Then install prerequisites (in the virtualenv):

pip install pybind11==2.6.2

and configure the build system:

# build pytorch
export CC=gcc
export CXX=g++
export USE_MKLDNN=1
export MKLDNN_THREADING=OMP
export USE_CUDA=1
export TORCH_CUDA_ARCH_LIST="3.5"
export BUILD_CAFFE2_OPS=0
export BUILD_TEST=0

export USE_SYSTEM_NCCL=1
export NCCL_ROOT=/mnt/bwpy/single/usr
export NCCL_LIB_DIR=$PWD/bwpy/lib
export NCCL_INCLUDE_DIR=$PWD/bwpy/include
export CUDNN_LIB_DIR=$PWD/bwpy/lib
export CUDNN_INCLUDE_DIR=$PWD/bwpy/include
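
These settings are plain environment variables, so a misspelled name does not produce an error and is easy to miss. One way to review them before starting a build that takes hours is sketched below; show_build_env is a hypothetical helper name and the pattern only covers the variables used on this page.

```shell
# Hypothetical helper: list the exported build-related variables, sorted.
show_build_env() {
  env | grep -E '^(CC|CXX|USE_[A-Z0-9_]+|BUILD_[A-Z0-9_]+|MKLDNN_THREADING|TORCH_CUDA_ARCH_LIST|NCCL_[A-Z_]+|CUDNN_[A-Z_]+)=' | sort
}

# Assumed usage (commented out here): keep a copy next to the build log.
#   show_build_env | tee build-env.txt
```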

After that, clone pytorch, check out the required version and patch out some code that requires a newer glibc than Blue Waters can provide:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.4.0
git submodule sync
git submodule update --init --recursive
REPO_URL=https://raw.githubusercontent.com/bryan-lunt-supercomputing/blue-waters-pytorch
HASH=b4ab4c1b6ceb0b2a508ab426b63c030822adee5b
wget $REPO_URL/$HASH/pt-bw.patch
patch -p1 <pt-bw.patch
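
When trying a version other than v1.4.0, the patch may no longer apply. As a sketch, the check below uses the --dry-run option of GNU patch to test a patch without changing any file; it is meant to be run before the patch command above. patch_applies is a hypothetical helper name, and the options -d, -N and --dry-run are assumed to be available, as they are in GNU patch.

```shell
# Hypothetical helper: succeed if patch file $1 would apply cleanly (at -p1)
# to the source tree in directory $2. No file is modified.
patch_applies() {
  # -N makes an already applied patch count as a failure instead of prompting
  patch -d "$2" -p1 -N --dry-run <"$1" >/dev/null 2>&1
}

# Assumed usage from inside the pytorch checkout (commented out here):
#   patch_applies pt-bw.patch . || echo "pt-bw.patch does not apply cleanly" >&2
```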

Once done, you can build and install pytorch in the virtualenv. This will take several hours:

python setup.py build --verbose 2>&1 | tee setup.log
python setup.py install --verbose 2>&1 | tee --append setup.log
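
Because the output is piped through tee, the shell sees the exit status of tee and not that of setup.py, so a failed build is easy to overlook. The sketch below shows one portable way to keep both the log and the exit status. run_logged is a hypothetical helper name, and unlike the commands above it always appends to the log file.

```shell
# Hypothetical helper: run a command, append its combined output to the log
# file $1 while still showing it, and return the command's own exit status.
run_logged() {
  log=$1
  shift
  status_file=$(mktemp)
  # tee would mask the exit status, so pass it along in a temporary file
  {
    rc=0
    "$@" 2>&1 || rc=$?
    echo "$rc" >"$status_file"
  } | tee -a "$log"
  rc=$(cat "$status_file")
  rm -f "$status_file"
  return "$rc"
}

# Assumed usage (commented out here): install only if the build succeeded.
#   run_logged setup.log python setup.py build --verbose &&
#     run_logged setup.log python setup.py install --verbose
```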

A quick test to show the pytorch version in use is:

# move out of pytorch build directory
cd ../
python -c 'import torch;print(torch.__version__)'
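
A slightly longer check can also report which CUDA version the build was compiled against and whether a GPU is usable. This is a sketch, not part of the original instructions: torch_report is a hypothetical helper name, and it relies on torch.version.cuda, torch.cuda.is_available() and torch.distributed.is_available() behaving in the v1.4 build as they are documented for PyTorch in general.

```shell
# Hypothetical helper: print a short report about the installed pytorch.
torch_report() {
  python -c '
import sys
try:
    import torch
except ImportError as exc:
    print("torch not importable: %s" % exc)
    sys.exit(1)
print("version: %s" % torch.__version__)
print("built against CUDA: %s" % torch.version.cuda)
print("CUDA usable here: %s" % torch.cuda.is_available())
print("distributed support: %s" % torch.distributed.is_available())
'
}

# Assumed usage, outside the pytorch build directory (commented out here):
#   torch_report
```

GPUs are only present on the XK compute nodes, so on a login node the "CUDA usable here" line is expected to report False even for a working GPU build.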

Acknowledgement

These instructions build heavily on the work of Bryan Lunt.