Introduction

The following figure presents the architecture of the SLURM (Simple Linux Utility for Resource Management) cluster on the ANITI platform:

The cluster consists of:

An interactive node (isis.aniti.fr)

This is the node to which you connect in order to access the compute cluster. This node (running CentOS 7) can be used to validate programs before launching them on the compute nodes. Because this node is shared among all users, it should not be used to run long jobs.
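
For example, from your workstation (a minimal sketch; <login> is a placeholder for your ANITI account name):

    ssh <login>@isis.aniti.fr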


Compute nodes

These nodes (also running CentOS 7) are servers dedicated to computation. The SLURM job manager handles, on the compute nodes, the distribution and execution of the processes that you launch from the interactive node. A process running on a compute node reads data hosted on the storage array, performs its processing, and saves the results back to the array.
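
As a minimal illustration, a command launched with the standard SLURM tool srun from the interactive node is executed on a compute node:

    # Executed from the interactive node; SLURM runs the command
    # on a compute node and prints that node's hostname.
    srun hostname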

Compute nodes are divided into two categories:

  • 4 compute nodes, each with two 18-core Intel Xeon Gold 6240 processors at 2.6 GHz and 384 GB of RAM.
    Multithreading (hyper-threading) is enabled on these servers, so each node exposes 72 hardware threads.
    These nodes are grouped into a Slurm partition that we have named "CPU-Nodes". This is the default partition.
    This partition is dedicated to processing that does not use GPUs.
    Each process is limited to 72 threads and/or 368 GB of RAM.
    On the other hand, the number of processes created by a Job, a Job Step or a Task is limited only by the total size of the partition (and the availability of resources): for example, a single Job can execute, in parallel, 2 Steps of 2 Tasks each, with each Task creating 72 threads. Such a Job uses 288 threads (144 physical cores, with two hardware threads per core) and is distributed across the 4 nodes (see the first batch script after this list).
  • 3 compute nodes, each with 3 Nvidia Quadro RTX 8000 graphics cards (48 GB of RAM each).
    Multithreading is not enabled on these nodes.
    These nodes are grouped into a Slurm partition that we have named "GPU-Nodes". This partition is intended for processing that takes advantage of the computing power provided by the GPUs. Deep learning frameworks (TensorFlow, PyTorch, ...) are the typical example (see the GPU sketch after this list).
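
As a sketch of the multi-step Job described above, the batch script below requests 4 Tasks of 72 threads each and launches 2 parallel Steps of 2 Tasks; "my_threaded_app" and the job name are placeholders, and depending on the Slurm version the concurrent steps may also need srun's --exact option:

    #!/bin/bash
    #SBATCH --job-name=multi-step-demo   # placeholder name
    #SBATCH --partition=CPU-Nodes        # default partition, shown for clarity
    #SBATCH --nodes=4
    #SBATCH --ntasks=4                   # 2 Steps x 2 Tasks
    #SBATCH --cpus-per-task=72           # 72 hardware threads per Task

    # Launch 2 Job Steps in parallel, each running 2 Tasks.
    srun --ntasks=2 --cpus-per-task=72 ./my_threaded_app &
    srun --ntasks=2 --cpus-per-task=72 ./my_threaded_app &
    wait    # wait for both Steps to finish

The script is submitted from the interactive node with "sbatch job_cpu.sh".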
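
Similarly, here is a minimal sketch of a GPU job; "train.py" is a placeholder for your own TensorFlow or PyTorch script, and the script assumes the GPUs are declared under the generic GRES name "gpu" (the exact name depends on the site configuration):

    #!/bin/bash
    #SBATCH --job-name=gpu-demo       # placeholder name
    #SBATCH --partition=GPU-Nodes     # the GPU partition described above
    #SBATCH --gres=gpu:1              # request one Quadro RTX 8000
    #SBATCH --cpus-per-task=4         # a few CPU cores for data loading (assumption)

    srun python train.py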


A storage array

With a capacity of approximately 200 TB, this storage is provided by the ATLAS storage system at the CALMIP mesocentre. Data is accessible from the interactive node and the compute nodes via the GPFS file system.