Batch Example: Job Steps and Tasks

Job Steps allow you to split a Job into several logical sections. They are created by prefixing a command (script/program) with the Slurm "srun" command, and they can run sequentially and/or in parallel. For example, a batch can be made up of 3 successive Steps, each divided into 2 parts executed in parallel.

For this, Slurm uses an "allocation unit": the Task. A Task is a process allocated "cpus-per-task" threads.

A Step ("srun") uses one or more Task(s) ("-n" option), executed on one or more Node(s) ("-N" option). If omitted these options use, by default, the entire allocation of the Job.

The resources of a Job are then expressed in "cpus-per-task" and "ntasks" (number of Tasks), for a total Job CPU allocation of: cpus-per-task * ntasks.
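For example, with "--ntasks=2" and "--cpus-per-task=4" (the values used in the Job below), the total allocation is 2 * 4 = 8 CPUs.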

Job Description:

In this example, the Job encodes a video in successive stages: a first Step of preparation (for example, copying files, cutting the video, 1st encoding pass...), then two encodings (in H264 and VP9) executed in parallel, and a final Step of finalization. The resources to be reserved for this Job are therefore 2 Tasks of 4 threads each.

Batch content:

#!/bin/bash
# SBATCH options:

#SBATCH --job-name=Encode-Steps # Job Name
#SBATCH --cpus-per-task=4       # Allocation of 4 threads per Task
#SBATCH --ntasks=2              # Number of Tasks: 2
#SBATCH --mail-type=END                         # Email notification of the
#SBATCH --mail-user=firstname.lastname@aniti.fr # end of job execution.

# Processing

module purge                # delete all loaded module environments
module load ffmpeg/0.6.5    # load ffmpeg module version 0.6.5

# 1st step: Step of 2 Tasks (global Job resources)

srun prep.sh

# 2nd step: 2 Steps in parallel (one Task per Step)

srun -n1 -N1 ffmpeg -i video.avi -threads $SLURM_CPUS_PER_TASK -c:v libx264 [...] -f mp4 video-h264.mp4 &

srun -n1 -N1 ffmpeg -i video.mp4 -threads $SLURM_CPUS_PER_TASK -c:v libvpx-vp9 [...] -f webm video-vp9.webm &

# Wait for the end of the "child" Steps (executed in the background)

wait

# 3rd step: finalization (2 Tasks)

srun finish.sh
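To submit and follow this Job, a possible sketch (the file name "encode-steps.sbatch" is only an example):

sbatch encode-steps.sbatch    # submit the batch; prints the JobID
squeue -u "$USER"             # monitor the pending/running Job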

Remarks:

  • Threads not used by a Task are "lost": they cannot be used by any other Task or Step. Conversely, if a Task creates more processes than its allocated threads, those processes will share the available threads. (see Process...)

  • One of the advantages of using Steps for iterative tasks (not executed in parallel) is their support by the Job management tools (sstat, sacct), which allow both Step-by-Step monitoring of the Job's progress during execution (Steps completed, in progress, their duration...) and detailed resource-usage statistics (CPU, RAM, disk, network...) for each Step after execution (see the sketch after these remarks).

  • When Steps are executed in parallel, it is imperative, in the parent script (the Job), to wait for the end of the child processes with a "wait" command; otherwise the child processes are automatically interrupted (killed) once the end of the Batch is reached.

  • The parallelization of the Steps is carried out by the shell ('&' at the end of the line), which executes the "srun" command in a sub-process (sub-shell) of the Job.

  • A Task cannot be executed/distributed across multiple nodes; the number of Tasks must therefore always be greater than or equal to the number of nodes (in the Batch as well as in a Step).
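As a sketch of this Step-by-Step monitoring (the JobID 12345 is hypothetical; the "--format" fields are one possible selection):

# Per-Step accounting, during or after execution:
sacct -j 12345 --format=JobID,JobName,Elapsed,State

# Live resource usage of a running Step (here Step 1 of Job 12345):
sstat -j 12345.1 --format=JobID,AveCPU,MaxRSS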

Shell (Bash) constructs for creating Steps depending on the source data:

# Loop on the elements of an array (here files):
files=('file1' 'file2' 'file3' ...)

for f in "${files[@]}"; do
    # Adapt "-n1" and "-N1" according to your needs
    srun -n1 -N1 <command> [...] "$f" &
done

# Loop on the files of a directory:
while read -r f; do
    # Adapt "-n1" and "-N1" according to your needs
    srun -n1 -N1 <command> [...] "$f" &
done < <(ls "/path/to/files/")

# Use "ls -R" or "find" for recursive traversal of folders

# Read a file line by line:
while read -r line; do
    # Adapt "-n1" and "-N1" according to your needs
    srun -n1 -N1 <command> [...] "$line" &
done <"/path/to/file"