SLURM

General information

SLURM commands

Command   Information                                                  Usage
sbatch    Submit a job script to SLURM                                 sbatch [options] job_script.sh
srun      Run a parallel job without a job script                      srun [options] executable
salloc    Allocate nodes, e.g. to SSH into them                        salloc -w mb-1,mb-2,…,mb-N -p mb
sinfo     View information about SLURM nodes and partitions            sinfo -p mb, sinfo -N -n mb-N
squeue    View information about jobs in the SLURM scheduling queue    squeue [-p mb]
scancel   Cancel jobs                                                  scancel -u $user --state=PD, scancel $JobID
sacct     Display accounting data of jobs                              sacct -u $user -o field1,field2,…,fieldN

More information about these commands can be obtained via man.
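
For example, a typical workflow combining these commands might look like the following sketch (the job script name and the selected sacct fields are illustrative):

sbatch job_script.sh                               # submit the job script to SLURM
squeue -p mb                                       # check the scheduling queue of the mb partition
sacct -u $USER -o JobID,JobName,State,Elapsed      # display accounting data of your jobs
scancel $JobID                                     # cancel a job, if needed (replace $JobID with the actual job ID)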

SLURM Mont-Blanc power monitor Plugin


IMPORTANT NOTE

The SLURM power monitor plugin only computes the energy-to-solution of commands executed via srun. This means that, even for serial applications, if you need energy measurements you must execute the application with the srun command, whether from the command line or inside a job script submitted via sbatch.


SLURM at the Mont-Blanc prototype includes a plugin which gathers the power data from the nodes involved in a job and, after the job finishes, computes the energy-to-solution. This value is stored in the SLURM database in the Consumed Energy field.

In which cases is the energy-to-solution computed?

A SLURM job is composed of one or more job steps, and each job step is a command executed via srun. Therefore, the SLURM plugin computes the energy-to-solution only for job steps, reporting one value per job step.

The following table contains all the known cases:

Command   Executed where?              Energy-to-solution available?
srun      At the command line          YES
srun      Inside a job script          YES
sbatch    Submitted as a job script    YES (but only for commands executed with srun in the job script)
salloc    At the command line          YES

In summary, SLURM can only provide the energy-to-solution of binaries executed with srun. If you need the energy-to-solution of the rest of the job script, you have to compute it manually.
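
For example, assuming job ID 512, the energy stored by the plugin can be queried per job step from the accounting database with sacct (the ConsumedEnergy field):

sacct -j 512 -o JobID,JobName,ConsumedEnergy       # one line per job step; only srun steps report energy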

SLURM job scripts

Preliminary information

The number of allocated nodes is computed by SLURM using one of the following formulas, depending on the options set:

  • N = ntasks * cpus-per-task
  • N = ntasks * ntasks-per-core
  • N = ntasks * ntasks-per-socket
  • N = ntasks * ntasks-per-node

Also, one can specify how many nodes should be allocated with the -N or --nodes option; a short sketch follows the excerpt below.

-N, --nodes=<minnodes[-maxnodes]>
      Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with
      maxnodes. If only one number is specified, this is used as both the minimum and maximum node count. The partition's
      node limits supersede those of the job. If a job's node limits are outside of the range permitted for its associated
      partition, the job will be left in a PENDING state. This permits possible execution at a later time, when the
      partition limit is changed. If a job node limit exceeds the number of nodes configured in the partition, the job
      will be rejected. Note that the environment variable SLURM_NNODES will be set to the count of nodes actually
      allocated to the job. See the ENVIRONMENT VARIABLES section for more information. If -N is not specified, the
      default behavior is to allocate enough nodes to satisfy the requirements of the -n and -c options. The job will be
      allocated as many nodes as possible within the range specified and without delaying the initiation of the job. The
      node count specification may include a numeric value followed by a suffix of "k" (multiplies numeric value by
      1,024) or "m" (multiplies numeric value by 1,048,576).
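
As a minimal sketch (the node range and task count below are illustrative, not prescribed values), a node range given with --nodes can be combined with --ntasks, and the SLURM_NNODES variable mentioned above reports how many nodes were actually allocated:

#!/bin/bash

#SBATCH --partition=mb
#SBATCH --nodes=2-4          # minimum 2 nodes, maximum 4 nodes
#SBATCH --ntasks=8
#SBATCH --time=10:00

echo "Nodes actually allocated: $SLURM_NNODES"
srun hostname                # runs the 8 requested tasks, printing the name of the node each one landed on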

NOTE: It is mandatory to set the wall-clock limit in the job script by using the --time option.

       -t, --time=<time>
              Set a limit on the total run time of the job allocation. If the requested time limit exceeds the
              partition's time limit, the job will be left in a PENDING state (possibly indefinitely). The default time
              limit is the partition's default time limit. When the time limit is reached, each task in each job step is
              sent SIGTERM followed by SIGKILL. The interval between signals is specified by the SLURM configuration
              parameter KillWait. A time limit of zero requests that no time limit be imposed. Acceptable time formats
              include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and
              "days-hours:minutes:seconds".

OpenMP

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --out=omp-%j.out
#SBATCH --err=omp-%j.err
#SBATCH --time=10:00
 
export OMP_NUM_THREADS=2
 
srun ./omp_binary

This script executes the OpenMP application, allocating one node on the mb partition and using 2 cores of the allocated node's CPU. stdout and stderr are redirected to the specified files; %j stands for the job ID.

OmpSs

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --out=ompss-%j.out
#SBATCH --err=ompss-%j.err
#SBATCH --time=10:00
 
export NX_ARGS="$NX_ARGS --threads=2"
 
srun ./ompss_binary

This script executes the OmpSs application, allocating one node on the mb partition and using 2 cores of the allocated node's CPU. stdout and stderr are redirected to the specified files.

MPI

NOTE: At the Mont-Blanc prototype, the use of the srun command is encouraged over mpirun. The main reason is that srun allows the user to set the CPU frequency used during the execution of the job. Another point to consider is that the mpirun command is not available when using MPICH (due to the configuration of the package in order to support SLURM).

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --time=10:00
 
# At Mont-Blanc prototype
srun ./mpi_binary
 
# At Mont-Blanc prototype testing
mpirun ./mpi_binary

This script executes the MPI application, allocating 32 nodes on the mb partition. A total of 64 MPI processes will be spawned, each using 1 core. Since each Mont-Blanc prototype node has one CPU with two cores, a total of 32 nodes will be allocated and 2 MPI processes will run on each node.

OpenCL

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu
#SBATCH --out=opencl-%j.out
#SBATCH --err=opencl-%j.err
#SBATCH --time=10:00
 
srun ./opencl_binary

This script executes the OpenCL application, allocating one node on the mb partition and using 1 core of the allocated node's CPU. stdout and stderr are redirected to the specified files.

MPI+OmpSs

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=2
#SBATCH --out=mpi_ompss-%j.out
#SBATCH --err=mpi_ompss-%j.err
#SBATCH --time=10:00
 
export NX_ARGS="$NX_ARGS --threads=2"
srun ./mpi_ompss_binary

This script executes the MPI+OmpSs application, allocating 64 nodes on the mb partition. A total of 64 MPI processes will be spawned, each using 2 cores. Since each Mont-Blanc prototype node has one CPU with two cores, a total of 64 nodes will be allocated and 1 MPI process will run on each node, using both cores of its CPU.

Dependencies

Start Time

The --begin option submits the batch script to the SLURM controller immediately, as usual, but tells the controller to defer the allocation of the job until the specified time.

--begin=<time>     Submit the batch script to the SLURM controller immediately, like normal, but tell the controller
     to defer the allocation of the job until the specified time.
 
     Time may be of the form HH:MM:SS to run a job at a specific time of day (seconds are optional). (If that time is
     already past, the next day is assumed.) You may also specify midnight, noon, or teatime (4pm) and you can have a
     time-of-day suffixed with AM or PM for running in the morning or the evening. You can also say what day the job
     will be run, by specifying a date of the form MMDDYY or MM/DD/YY YYYY-MM-DD. Combine date and time using the
     following format YYYY-MM-DD[THH:MM[:SS]]. You can also give times like now + count time-units, where the
     time-units can be seconds (default), minutes, hours, days, or weeks and you can tell SLURM to run the job today
     with the keyword today and to run the job tomorrow with the keyword tomorrow. The value may be changed after job
     submission using the scontrol command. For example:
        --begin=16:00
        --begin=now+1hour
        --begin=now+60           (seconds by default)
        --begin=2010-01-20T12:34:00
 
     Notes on date/time specifications:
 
      - Although the 'seconds' field of the HH:MM:SS time specification is allowed by the code, note that the poll
        time of the SLURM scheduler is not precise enough to guarantee dispatch of the job on the exact second. The
        job will be eligible to start on the next poll following the specified time. The exact poll interval depends
        on the SLURM scheduler (e.g., 60 seconds with the default sched/builtin).
      - If no time (HH:MM:SS) is specified, the default is (00:00:00).
      - If a date is specified without a year (e.g., MM/DD) then the current year is assumed, unless the combination
        of MM/DD and HH:MM:SS has already passed for that year, in which case the next year is used.

So, if you want your job to start its execution at 21:00:

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --begin=21:00
#SBATCH --time=10:00
 
srun ./mpi_binary

Between jobs

After job begin

This job can begin execution after the specified jobs have begun execution.

after:job_id[:jobid…]

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --dependency=after:512
#SBATCH --time=10:00
 
srun ./mpi_binary

After job finish

This job can begin execution after the specified jobs have terminated.

afterany:job_id[:jobid…]

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --dependency=afterany:512
#SBATCH --time=10:00
 
srun ./mpi_binary

After job fail

This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).

afternotok:job_id[:jobid…]

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --dependency=afternotok:512
#SBATCH --time=10:00
 
srun ./mpi_binary

After job finish successfully

This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).

afterok:job_id[:jobid…]

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --dependency=afterok:512
#SBATCH --time=10:00
 
srun ./mpi_binary

Singleton

This job can begin execution after any previously launched jobs sharing the same job name and user have terminated.

singleton

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --dependency=singleton
#SBATCH --time=10:00
 
srun ./mpi_binary

Mail notifications

Mail notification when certain job conditions are reached is supported only at the Mont-Blanc prototype. To enable this feature, you must set some options in your job script. Let's see an example:

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --mail-type=END
#SBATCH --mail-user=daniel.ruiz@bsc.es
#SBATCH --time=10:00
 
srun ./mpi_binary


There are two options regarding this feature:

  • --mail-type=<type>
    • Notify the user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, ALL (equivalent to BEGIN, END, FAIL and REQUEUE), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of time limit). Multiple type values may be specified in a comma-separated list. The user to be notified is indicated with --mail-user.
  • --mail-user=<mailto>
    • Mail address that receives email notifications of state changes as defined by --mail-type.

Select CPU Frequency

SLURM provides an option for the srun command to choose the CPU frequency your application will run at. This only takes effect if you execute your application using the srun command. For example:

#!/bin/bash
 
#SBATCH --partition=mb
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --out=mpi-%j.out
#SBATCH --err=mpi-%j.err
#SBATCH --time=10:00
 
srun --cpu-freq=800000 ./mpi_binary

The syntax of this option is the following; a short example using the keywords is given after the list:

  • --cpu-freq=<requested frequency in kilohertz>
  • Available frequencies are:
    • Low: The lowest available frequency
    • High: The highest available frequency
    • HighM1: Will select the next highest available frequency (high minus one)
    • Medium: Attempts to set a frequency in the middle of the available range
    • OnDemand: Attempts to use the OnDemand CPU governor (the default value)
    • Performance: Attempts to use the Performance CPU governor
    • <FREQ>: Specify a frequency in kilohertz from the following:
      • 1.7 GHz
      • 1.6 GHz
      • 1.5 GHz
      • 1.4 GHz
      • 1.3 GHz
      • 1.2 GHz
      • 1.1 GHz
      • 1 GHz
      • 900 MHz
      • 800 MHz
      • 700 MHz
      • 600 MHz
      • 500 MHz
      • 400 MHz
      • 300 MHz
      • 200 MHz

