PyTorch
PyTorch is an open source machine learning framework that can use MPI and GPUs to speed up training and inference.
Using precompiled images
There are two options for precompiled images:
- the pytorch module installed in bwpy, currently limited to pytorch-0.4.1, with both GPU and MPI support (loaded via
module load bwpy-mpi
)
- Shifter images provided by Bryan Lunt via his GitHub repository, which provide pytorch 1.5.0 with GPU and MPI support and pytorch 1.8.0 with CPU support only, due to CUDA version issues
Compiling your own code
Compiling pytorch is quite involved and slow, so you should consider using one of the precompiled images listed above. If, however, you require special code in pytorch, the following instructions, adapted from Bryan Lunt's compilation instructions, will let you build pytorch 1.4 (and possibly newer versions) on Blue Waters.
Compiling PyTorch takes long enough that the compiler driver is killed on the login nodes for exceeding their run-time limits. Instead we will use an interactive session on a single XK node (assuming free ones are available):
qsub -I -l nodes=1:xk:ppn=16 -l walltime=12:00:00 -l gres=ccm
module load ccm
ccmlogin
Notice that after ccmlogin the prompt changes to reflect that you are now on a compute node and no longer on the shared MOM node. You must not attempt to build pytorch on the MOM node, since this would be very disruptive to other, concurrent users of that shared node.
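Besides the prompt change, a quick sanity check (not part of the original instructions) is to print the hostname before and after ccmlogin; after ccmlogin it should name a compute node rather than the MOM node:

```shell
# print the current hostname; after ccmlogin this should be a compute node name
hostname
```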
Now, you must start from a setup with only the default modules loaded, i.e. module list shows:
Currently Loaded Modulefiles:
  1) modules/3.2.10.5
  2) nodestat/2.2-1.0502.60539.1.31.gem
  3) sdb/1.1-1.0502.63652.4.27.gem
  4) alps/5.2.4-2.0502.9774.31.12.gem
  5) lustre-cray_gem_s/2.5_3.0.101_0.46.1_1.0502.8871.45.1-1.0502.21728.75.4
  6) udreg/2.3.2-1.0502.10518.2.17.gem
  7) ugni/6.0-1.0502.10863.8.28.gem
  8) gni-headers/4.0-1.0502.10859.9.27.gem
  9) dmapp/7.0.1-1.0502.11080.8.74.gem
 10) xpmem/0.1-2.0502.64982.7.27.gem
 11) hss-llm/7.2.0
 12) Base-opts/1.0.2-1.0502.60680.2.4.gem
 13) cce/8.7.7
 14) craype-network-gemini
 15) craype/2.5.16
 16) cray-libsci/18.12.1
 17) pmi/5.0.10-1.0000.11050.179.3.gem
 18) rca/1.0.0-2.0502.60530.1.63.gem
 19) atp/2.0.4
 20) PrgEnv-cray/5.2.82
 21) cray-mpich/7.7.4
 22) craype-interlagos
 23) torque/6.1.2
 24) moab/9.1.2.h6-sles11
 25) xalt/0.7.6.local
 26) scripts
 27) OpenSSL/1.0.2m
 28) cURL/7.59.0
 29) git/2.17.0
 30) wget/1.19.4
 31) user-paths
 32) gnuplot/5.0.5
 33) darshan/3.1.3
then proceed by adapting the instructions on the Python portal page for compiling code against bwpy.
First, load modules for Python, gcc, and CUDA:
module load bwpy/2.0.4
module swap PrgEnv-cray PrgEnv-gnu/5.2.82-gcc.4.9.3
module swap gcc/4.9.3 gcc/5.3.0
module load cmake/3.9.4 cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
and set up a virtualenv to hold the resulting code:
mkdir $HOME/pytorch-1.4.0
cd $HOME/pytorch-1.4.0
virtualenv --system-site-packages $PWD
source bin/activate
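As a quick sanity check (assuming the virtualenv location used above), you can confirm that the virtualenv's python is now first on your PATH:

```shell
# once the virtualenv is active, this should print $HOME/pytorch-1.4.0/bin/python
which python
```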
Since we want to use bwpy but not its included pytorch module, we need to fudge things a bit to prevent bwpy from interfering:
# fudge things so that pytorch in bwpy does not interfere
mkdir -p bwpy/lib bwpy/include
ln -s /mnt/bwpy/single/usr/include/cudnn.h /mnt/bwpy/single/usr/include/nccl.h bwpy/include
for i in libcudnn.so libcudnn.so.7 libcudnn.so.7.0.5 libcudnn_static.a libnccl.so \
         libnccl.so.1 libnccl.so.1.2.3 libnccl.so.2 libnccl.so.2.1.15 libnccl_static.a ; do
  ln -s /mnt/bwpy/single/usr/lib/$i bwpy/lib/
done
PKG_CONFIG_PATH=$EBROOTOPENSSL/lib/pkgconfig:$PKG_CONFIG_PATH
unset CPATH
unset LIBRARY_PATH
Then install prerequisites (in the virtualenv):
pip install pybind11==2.6.2
and configure the build system:
# build pytorch
export CC=gcc
export CXX=g++
export USE_MKLDNN=1
export MKLDNN_THREADING=OMP
export USE_CUDA=1
# compute capability 3.5 matches the Tesla K20X GPUs in the XK nodes
export TORCH_CUDA_ARCH_LIST="3.5"
export BUILD_CAFFE2_OPS=0
export BUILD_TEST=0
export USE_SYSTEM_NCCL=1
export NCCL_ROOT=/mnt/bwpy/single/usr
export NCCL_LIB_DIR=$PWD/bwpy/lib
export NCCL_INCLUDE_DIR=$PWD/bwpy/include
export CUDNN_LIB_DIR=$PWD/bwpy/lib
export CUDNN_INCLUDE_DIR=$PWD/bwpy/include
After that, clone pytorch, check out the required version, and patch out some code that requires a newer glibc than Blue Waters provides:
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.4.0
git submodule sync
git submodule update --init --recursive
REPO_URL=https://raw.githubusercontent.com/bryan-lunt-supercomputing/blue-waters-pytorch
HASH=b4ab4c1b6ceb0b2a508ab426b63c030822adee5b
wget $REPO_URL/$HASH/pt-bw.patch
patch -p1 <pt-bw.patch
Once done, you can build and install pytorch in the virtualenv; note that this will take several hours:
python setup.py build --verbose 2>&1 | tee setup.log
python setup.py install --verbose 2>&1 | tee --append setup.log
A quick test to show the pytorch version in use is:
# move out of pytorch build directory
cd ../
python -c 'import torch;print(torch.__version__)'
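Beyond printing the version string, a slightly more thorough smoke test is sketched below (it assumes the build above succeeded; the CUDA check simply reports False when no GPU is visible, e.g. on a service node):

```python
import torch

# basic tensor arithmetic on the CPU
x = torch.rand(2, 3)
y = x + x
assert torch.allclose(y, 2 * x)

# report the version and whether the CUDA runtime sees a GPU
print(torch.__version__)
print(torch.cuda.is_available())
```

If this runs without errors, the core tensor library works; run it inside a GPU job to confirm that torch.cuda.is_available() prints True there.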
Acknowledgement
These instructions build heavily on the work of Bryan Lunt.