Use of Machine Learning and Deep Learning frameworks on GPUs

Singularity Containers

Machine learning and deep learning frameworks are available on the ANITI computing cluster in the form of containers.

In order to allow several versions of CUDA/cuDNN/Python to be used and to avoid dependency issues and conflicts between machine learning libraries, each framework runs in a dedicated container. Although Docker is the most widespread containerization solution, on ISIS we use Singularity, a containerization solution better suited to computing clusters.

https://www.sylabs.io/docs/

The framework Singularity images (.sif) are built on top of Ubuntu and CUDA 11 base images (see the table below for the exact versions).
The images (.sif) are available under /software/containerCollections/

Container name               OS            CUDA   cuDNN   TensorFlow   PyTorch   Contents
tf2-NGC-21-03-py3.sif        Ubuntu 20.04  11.2   8.1.1   2.4.0        -         Release Notes
pytorch-NGC-21-03-py3.sif    Ubuntu 20.04  11.2   8.1.1   -            1.9.0     Release Notes
pytorch-NGC-22-03-py3.sif    Ubuntu 20.04  11.6   8.3.3   -            1.12.0    Release Notes
julia-1.5.2-NGC.sif          Ubuntu 20.04  10.2   -       -            -         Release Notes

Images labeled NGC are built from the NVIDIA Docker containers available at https://ngc.nvidia.com
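For example, you can list the images available for a given CUDA version directly from a login node (the CUDA11 subdirectory shown here is the one used in the examples below):

ls /logiciels/containerCollections/CUDA11/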

 

Running a Singularity container

A Singularity container is run with the 'singularity exec' command, followed by the container image and the command to run inside the container.

Example:

module load singularity/3.8.3
cd /logiciels/containerCollections/CUDA11/
singularity exec ./tf2-NGC-21-03-py3.sif $HOME/moncode.sh

Note that your user environment variables are available inside the containers, as are the /users, /projects and /software directories.
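As a quick check, you can verify from a container that your home directory is visible and that the framework shipped with the image is importable (a minimal sketch using the TensorFlow image; the version printed depends on the image):

module load singularity/3.8.3
cd /logiciels/containerCollections/CUDA11/
singularity exec ./tf2-NGC-21-03-py3.sif ls $HOME
singularity exec ./tf2-NGC-21-03-py3.sif python -c "import tensorflow as tf; print(tf.__version__)"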

 

Using Frameworks in a Slurm Batch

The available frameworks can run on CPUs or GPUs, so you can run the containers with Slurm on either partition of the ANITI cluster: 'CPU-Nodes' or 'GPU-Nodes'.

In the examples below we want to take advantage of the GPUs, so the processing is run on the 'GPU-Nodes' partition. To tell Slurm that we want to use GPUs, two parameters must be specified:

#SBATCH --gres=gpu:1 (the number of GPU cards you want to use, 3 max per server)
#SBATCH --gres-flags=enforce-binding

Also, to tell Slurm how many CPUs we want to reserve, we use the following parameter:

#SBATCH --cpus-per-gpu=x (where x is the number of CPUs associated with each reserved card)

It is not mandatory to specify the memory needed per CPU in the batch script. By default, each job automatically gets 10240 MB (i.e. 10 GB) per requested CPU on the compute nodes of the GPU-Nodes partition.
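For example, a job requesting 2 GPU cards with 4 CPUs per card would get 8 CPUs and, with the default of 10 GB per CPU, 80 GB of memory (illustrative values, to be adapted to your needs):

#SBATCH --partition=GPU-Nodes
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding
#SBATCH --cpus-per-gpu=4   # 2 GPUs x 4 CPUs = 8 CPUs, i.e. 80 GB of memory by default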

 

Examples of using the TensorFlow framework

Content of slurm_job_tf.sh for running TensorFlow:

#!/bin/sh

#SBATCH --job-name=GPU-Tensorflow-Singularity-Test
#SBATCH --output=ML-%j-Tensorflow.out
#SBATCH --error=ML-%j-Tensorflow.err

#SBATCH --ntasks=1
#SBATCH --cpus-per-gpu=1
#SBATCH --partition=GPU-Nodes
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding

module purge
module load singularity/3.8.3

srun singularity exec /logiciels/containerCollections/CUDA11/tf2-NGC-21-03-py3.sif python "$HOME/tf-script.py"

Command:

[prenom.nom@cr-login-1 ~]$  sbatch slurm_job_tf.sh
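You can then follow the job and check its output with the usual Slurm commands (the output file name comes from the --output pattern in the batch script):

[prenom.nom@cr-login-1 ~]$ squeue -u $USER
[prenom.nom@cr-login-1 ~]$ cat ML-<jobid>-Tensorflow.out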

 

Installing additional packages

You may want to use libraries not available by default in the containers made available to you.

To install additional packages, you can use virtualenv, pip or conda.

Below is the procedure to follow:

First, you need to create a virtual environment from your $HOME directory, by opening a shell in the container:

$ singularity shell /logiciels/containerCollections/CUDA11/tf2-NGC-21-03-py3.sif

To create a virtual environment in a directory called 'ENVNAME':

(tf2-NGC-21-03-py3.sif) → $ mkdir $HOME/ENVNAME

then,

(tf2-NGC-21-03-py3.sif) → $ virtualenv --system-site-packages $HOME/ENVNAME

The --system-site-packages option of the virtualenv command allows the virtual environment to use all the packages already installed with the Python interpreter used to create it (TensorFlow, PyTorch or Keras, for example).
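As a quick sanity check (assuming the TensorFlow image used above), you can activate the environment and confirm that the container's framework is visible from it:

(tf2-NGC-21-03-py3.sif) → $ source $HOME/ENVNAME/bin/activate
(tf2-NGC-21-03-py3.sif) → $ python -c "import tensorflow as tf; print(tf.__version__)"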

Then, once the virtual environment is created, you can install the desired packages locally using pip (the example below uses the Shogun machine learning package).

To avoid an error such as "EnvironmentError: [Errno 28] No space left on device", you must point the TMPDIR Linux environment variable to a directory in your home directory instead of the default location, which may not have enough space.

(tf2-NGC-21-03-py3.sif) → $ mkdir localTMP
(tf2-NGC-21-03-py3.sif) → $ TMPDIR=$HOME/localTMP
(tf2-NGC-21-03-py3.sif) → $ TMP=$TMPDIR
(tf2-NGC-21-03-py3.sif) → $ TEMP=$TMPDIR
(tf2-NGC-21-03-py3.sif) → $ export TMPDIR TMP TEMP

After exporting the TMPDIR, TMP and TEMP variables, your environment is ready for installing additional packages using pip:

$ singularity shell /logiciels/containerCollections/CUDA11/tf2-NGC-21-03-py3.sif

(tf2-NGC-21-03-py3.sif) → $ source ENVNAME/bin/activate
(tf2-NGC-21-03-py3.sif) → $ pip3 install shogun-ml --user

Note: the --user flag installs the packages locally under your home directory, not inside the container image (which is read-only).
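To see where a package actually ended up, you can ask pip from the activated environment (the location depends on the flags used at install time):

(tf2-NGC-21-03-py3.sif) → $ pip3 show shogun-ml | grep Location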

Finally, you can use the new packages in your processing from a Slurm job:

Execution:

[prenom.nom@cr-login-1 ~]$  sbatch slurm_job_shogun.sh

Contents of slurm_job_shogun.sh:

#!/bin/sh

#SBATCH --job-name=GPU-shogun-Singularity-Test
#SBATCH --output=ML-%j-shogun.out
#SBATCH --error=ML-%j-shogun.err

#SBATCH --ntasks=1
#SBATCH --cpus-per-gpu=1
#SBATCH --partition=GPU-Nodes
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding

module purge
module load singularity/3.8.3

srun singularity exec /logiciels/containerCollections/CUDA11/tf2-NGC-21-03-py3.sif $HOME/ENVNAME/bin/python "$HOME/shogun-script.py"

 

IMPORTANT:

If you create a Python virtual environment with a specific Python version (e.g. conda2 create -n ENVNAME python=2.7), you will have to reinstall in this environment all the packages you need, including those already included in the container.
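A minimal sketch of this case, assuming conda is available in the container or through a module (package names and versions are only illustrative):

(tf2-NGC-21-03-py3.sif) → $ conda create -n ENVNAME python=3.9
(tf2-NGC-21-03-py3.sif) → $ source activate ENVNAME
(tf2-NGC-21-03-py3.sif) → $ pip install tensorflow    # reinstall the framework and every other package you need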

 

SPECIAL CASE: Installing Python packages from GitHub

After activating your virtual environment with conda or virtualenv, you can download the package to a specific directory:

(tf2-NGC-21-03-py3.sif) → $ source activate ENVNAME
(tf2-NGC-21-03-py3.sif) → $ mkdir mesPackagesGithub
(tf2-NGC-21-03-py3.sif) → $ cd mesPackagesGithub
(tf2-NGC-21-03-py3.sif) → $ wget https://github.com/tensorflow/compression/archive/v1.1.zip
(tf2-NGC-21-03-py3.sif ~/mesPackagesGithub) → $ unzip v1.1.zip
(tf2-NGC-21-03-py3.sif ~/mesPackagesGithub) → $ ls

drwxr-xr-x 2 prenom.nom celdev 820 compression-1.1

Once the package is manually installed, in order to use it you have to modify the PYTHONPATH environment variable so that it includes the path of the directory where you installed the package. You can do this in two ways:

Either with an export in the Slurm script:

#!/bin/sh
#SBATCH --job-name=Multi-CPU-Test
#SBATCH --output=ML-%j-Tensorflow.out
#SBATCH --error=ML-%j-Tensorflow.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-gpu=1
#SBATCH --partition=GPU-Nodes
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding

export PYTHONPATH="$HOME/mesPackagesGithub/compression-1.1:$PYTHONPATH"

srun [...]

Or in the header of the Python script you want to run:

import sys

# add the directory containing the manually installed package to the import path
sys.path.append('/users/prenom.nom/mesPackagesGithub/compression-1.1')
import tensorflow_compression as tfc

[...]

 

Specifics of using the Julia Singularity image

The Julia container tries to precompile packages and to save logs (history logs) in the /data directory included in the container. Unlike Docker, which allows root rights, Singularity will generate a permission error. The workaround is to create a new directory in your home directory and bind it to the container's /data directory.

mkdir data
singularity exec -B $(pwd)/data:/data /logiciels/containerCollections/CUDA11/julia-1.5.2-NGC.sif /your-path/your-script.jl

where -B is the option that binds your local directory onto the container's /data directory.
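If you prefer an interactive session, the same bind mount works with 'singularity shell':

singularity shell -B $(pwd)/data:/data /logiciels/containerCollections/CUDA11/julia-1.5.2-NGC.sif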