Running parallel jobs with SLURM
On all HPC systems at LRZ, the SLURM scheduler is used to execute parallel jobs. This document describes usage, policies and resources available for submission and management of such jobs.
Table of contents
- Introduction and Prerequisites
- Interactive SLURM shell
- Batch Jobs
- A GUI for job management
- Documentation and Support
Note: The information contained in this document and its subdocuments applies to all parallel job classes. For serial job processing, please consult the Serial Processing documentation.
Introduction and Prerequisites
All parallel programs in the parallel segments of the cluster must be started up using either
- an interactive SLURM shell
- a SLURM batch script
|Cluster segment||submission (and development) node|
|CooLMUC1 (MPP) cluster||lxlogin1.lrz.de, lxlogin2.lrz.de|
|CooLMUC2 cluster||lxlogin5.lrz.de, lxlogin6.lrz.de, lxlogin7.lrz.de|
|sgi ICE, Ultraviolet||ice1-login (not directly, but via above login nodes)|
Jobs can be submitted to any SLURM cell from any submission node, but please note that an MPI program built in the ICE or UV default environment will not run on the MPP or Myrinet 10 GE clusters.
This document provides information on how to configure, submit and execute SLURM jobs, as well as information about batch processing policies. Please be aware that misuse of the resources described here can result in the invalidation of the violating account. In particular, all parallel runs should always use either a salloc shell (for testing) or a scripted SLURM job.
Interactive SLURM shell
For performing program testing and short runs the following sequence of commands can be used: First, salloc is invoked to reserve the needed resources. Then, mpiexec can be used to start up a program on these resources.
Start an MPP mode Intel MPI program using two 16-way nodes on the CooLMUC1 (MPP) cluster
Please invoke this command from lxlogin1,2.
Start an MPP mode Intel MPI program using two 28-way nodes on the CooLMUC2 cluster
Please invoke this command from lxlogin5,6.
Start a hybrid mode Intel MPI program on the MPP cluster using 6 MPI tasks, with 8 OpenMP threads per task (3 nodes will be needed).
salloc --ntasks=6 --cpus-per-task=8
mpiexec -n 6 ./myprog.exe
Please invoke this command from lxlogin1,2. Of course analogous commands will also work on CooLMUC2; note that currently there are 28 cores per node available on CooLMUC2, but no hyperthreading.
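The resource arithmetic behind these examples can be sketched in bash. The helper name nodes_needed is hypothetical, and the task counts printed for the two pure MPI cases (32 and 56) are an assumption derived from the node sizes stated above (two 16-way and two 28-way nodes), not commands quoted from this page.

```shell
#!/bin/bash
# Hypothetical helper: number of nodes required for a job, given
# MPI tasks, cores (threads) per task, and cores per node.
nodes_needed() {
    local tasks=$1 cpus_per_task=$2 cores_per_node=$3
    local cores=$(( tasks * cpus_per_task ))
    # Integer ceiling division.
    echo $(( (cores + cores_per_node - 1) / cores_per_node ))
}

# Hybrid example from the text: 6 tasks x 8 threads on 16-way nodes.
echo "hybrid job needs $(nodes_needed 6 8 16) nodes"

# Pure MPI on two full nodes (assumed task counts, see lead-in).
echo "CooLMUC1: salloc --ntasks=$(( 2 * 16 ))"
echo "CooLMUC2: salloc --ntasks=$(( 2 * 28 ))"
```

The commands are only printed here, so the sketch can be tried on any machine without reserving resources.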
By default, a SLURM shell generated via salloc will run for 15 minutes. This interval can be extended to the partition maximum by specifying a suitable --time=hh:mm:ss argument. Also, the --partition option can be used to explicitly specify a desired partition, but we advise against doing so because the different login nodes have different, mutually incompatible environments.
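A longer allocation could be requested roughly as follows. The helper to_hms is hypothetical, and the values used (32 tasks, 90 minutes) are purely illustrative; the maximum that is actually permitted depends on the partition.

```shell
#!/bin/bash
# Hypothetical helper: convert minutes into the hh:mm:ss form expected by --time.
to_hms() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

limit=$(to_hms 90)
# Print the command instead of running it, so nothing is reserved by accident.
echo "salloc --ntasks=32 --time=${limit}"
```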
Notes and Warnings:
- Only applications or commands that are started with mpiexec are executed on the allocated nodes. All other commands are still executed on the login node, which might block the login node for other users. A workaround is to start memory- or time-consuming commands with "mpiexec -n 1" as well, even if they are serial, optionally packing them into a script and starting that script with mpiexec.
Try "mpiexec -n 2 hostname" and just "hostname" to see the difference.
- SLURM also has its own srun command; however, this is not usable for MPI programs built with sgi MPT (UV); for these, please use the srun_ps command.
- Once the allocation expires, the program will be signalled and killed; further programs cannot be started. Please issue the exit command and start a new allocation.
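The workaround from the first note, packing serial commands into a script and starting it with "mpiexec -n 1", might look like the sketch below. The script name serial_steps.sh and its content are hypothetical, and the check relies on the assumption that SLURM sets the SLURM_JOB_ID variable inside the salloc shell.

```shell
#!/bin/bash
# Pack the serial, resource-hungry commands into one script (hypothetical content).
cat > serial_steps.sh <<'EOF'
#!/bin/bash
echo "running on $(hostname)"
# ... memory or time consuming serial commands go here ...
EOF
chmod +x serial_steps.sh

# Inside an allocation, run the script on a compute node via mpiexec;
# otherwise only report what would be done.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v mpiexec >/dev/null 2>&1; then
    mpiexec -n 1 ./serial_steps.sh
else
    echo "no allocation found; would run: mpiexec -n 1 ./serial_steps.sh"
fi
```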
Batch Jobs
Batch execution should be used for all production runs. A step-by-step recipe for the simplest type of parallel job is given below, illustrating the use of the SLURM commands for users of the bash shell. See the documentation section at the end for pointers to more complex setups.
Step 1: Edit a job script
The following script is assumed to be stored in the file myjob.cmd.
#SBATCH -o /home/hpc/<group>/<user>/myjob.%j.%N.out
|(Placeholder) standard output and error go there. Note that the directory where the output file is placed must exist before the job starts, and the full path name must be specified (no environment variable!). The %j encodes the job ID into the output file. The %N encodes the master node of the job and should be added since job IDs from different SLURM clusters may be the same.|
#SBATCH -D /home/hpc/<group>/<user>/mydir
|directory used by script as starting point (working directory)|
#SBATCH -J <job_name>
|(Placeholder) name of job (not more than 10 characters please)|
|Instead of mpp1 (for the CooLMUC1 MPP cluster) you can also use mpp2 here (for the CooLMUC2 cluster), or uv2 or uv3 (for the two UltraViolet partitions)|
|#SBATCH --get-user-env||Set user environment properly|
|Number of MPI tasks assigned to job. By default, SLURM will start as many MPI tasks per node as there are physical cores in the node.|
|Send an e-mail at job completion|
|(Placeholder) e-mail address (don't forget!)|
|Do not export the environment of the submitting shell into the job; while SLURM also allows the value ALL here, using it is strongly discouraged, because the submission environment is very likely to be inconsistent with the environment required for execution of the job.|
|maximum run time is 8 hours 0 minutes 0 seconds; this may be increased up to the queue limit|
|initialize module system|
module load gsl # ... etc
|load any required environment modules (may be needed if program is linked against shared libraries). "gsl" of course is only a placeholder.|
|Start the MPI executable. The MPI variant used depends on the loaded module set; non-MPI programs may fail to start up - please consult the example jobs or the software-specific documentation for other startup mechanisms.|
This script essentially looks like a bash script. However, it contains specially marked comment lines ("control sequences"), which have a special meaning in the SLURM context; these are explained in the right-hand column of the table above. The entries marked "Placeholder" must be suitably modified to have valid user-specific values.
For this script, the environment of the submitting shell will not be exported to the job's environment; the job will start 64 MPI tasks on as many cores.
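Putting the pieces together, a complete job script could look like the sketch below. Several directive lines did not survive in the table above, so the ones written here (--clusters, --ntasks, --mail-type, --mail-user, --export, --time) are standard sbatch options chosen to match the descriptions; they are an assumption, not a quotation. The path of the module initialization script and the program name myprog.exe are likewise assumed. Please check the result against the LRZ example jobs before using it.

```shell
#!/bin/bash
# Write a sketch of the job script to myjob.cmd.
# Placeholders in <...> must be replaced by user-specific values.
cat > myjob.cmd <<'EOF'
#!/bin/bash
#SBATCH -o /home/hpc/<group>/<user>/myjob.%j.%N.out
#SBATCH -D /home/hpc/<group>/<user>/mydir
#SBATCH -J <job_name>
#SBATCH --clusters=mpp1
#SBATCH --get-user-env
#SBATCH --ntasks=64
#SBATCH --mail-type=END
#SBATCH --mail-user=<email_address>
#SBATCH --export=NONE
#SBATCH --time=08:00:00
# Assumed location of the module initialization script:
source /etc/profile.d/modules.sh
module load gsl   # placeholder module
# Assumed program name; mpiexec is provided by the loaded MPI module:
mpiexec ./myprog.exe
EOF

# Check the shell syntax without executing anything.
bash -n myjob.cmd && echo "myjob.cmd: syntax OK"
```

Generating the file through a here-document keeps the sketch harmless to run anywhere; on a submission node the content between the EOF markers is what would be stored in myjob.cmd.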
Step 2: Submission procedure
The job script is submitted to the queue via the command
sbatch myjob.cmd
At submission time the control sequences are evaluated and stored in the queuing database, and the script is copied into an internal directory for later execution. If the command was executed successfully, the Job ID will be returned as follows:
Submitted batch job 65648.
It is a good idea to note down your job IDs, for example to provide them to LRZ HPC support if anything goes wrong. The submission command can also contain control sequences. For example,
sbatch --time=12:00:00 myjob.cmd
would override the setting inside the script, forcing it to run 12 instead of 8 hours.
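For scripted workflows, the job ID can be extracted from the confirmation message. The sketch below assumes the message format shown above ("Submitted batch job <id>"); the helper name parse_jobid is hypothetical, and the sample message stands in for a real sbatch call.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's confirmation message.
parse_jobid() {
    # The ID is the fourth word; strip anything that is not a digit
    # (for example a trailing period).
    echo "$1" | awk '{ gsub(/[^0-9]/, "", $4); print $4 }'
}

# On a submission node the message would be captured with:
#   msg=$(sbatch myjob.cmd)
msg="Submitted batch job 65648."
jobid=$(parse_jobid "$msg")
echo "job ID: ${jobid}"
```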
Step 3: Checking the status of a job
Once submitted, the job will be queued for some time, depending on how many jobs are presently submitted. Eventually, more or less after previously submitted jobs have completed, the job will be started on one or more of the systems determined by its resource requirements. The status of the job can be queried with the squeue --clusters=[all | cluster_name] command, which will give an output like
CLUSTER: mpp1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
65646 mpp1_batch job1 xyz1 R 24:19 2 lxa[7-8]
65647 mpp1_batch myj xza2 R 0:09 1 lxa14
65648 mpp1_batch calc yaz7 PD 0:00 6 (Resources)
(assuming mpp1 is specified as the clusters argument) indicating that the job is queued. Once the job is running, the output would indicate the state to be "R" (=running), and would also list the host(s) it was running on. For jobs that have not yet started, the --start option, applied to squeue, will provide an estimate (!) for the starting time. The sinfo --clusters=[all | cluster_name] command prints out an overview of the status of all clusters or of a particular cluster in the SLURM configuration.
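Output in this format can be filtered with standard tools. The sketch below uses the sample listing from the text as input; the helper name jobs_in_state is hypothetical, and it assumes that the state is always the fifth whitespace-separated column.

```shell
#!/bin/bash
# Hypothetical helper: print the IDs of all jobs in a given state (column 5, "ST")
# from squeue-style output read on standard input.
jobs_in_state() {
    # Only lines starting with a numeric job ID are considered,
    # which skips the "CLUSTER:" line and the column header.
    awk -v st="$1" '$1 ~ /^[0-9]+$/ && $5 == st { print $1 }'
}

# Sample listing copied from the text; on a login node use instead:
#   squeue --clusters=mpp1 | jobs_in_state PD
sample='CLUSTER: mpp1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
65646 mpp1_batch job1 xyz1 R 24:19 2 lxa[7-8]
65647 mpp1_batch myj xza2 R 0:09 1 lxa14
65648 mpp1_batch calc yaz7 PD 0:00 6 (Resources)'

echo "pending jobs:"
echo "$sample" | jobs_in_state PD
echo "running jobs:"
echo "$sample" | jobs_in_state R
```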
Inspection and modification of jobs
Queued jobs can be inspected for their characteristics via the command
scontrol --clusters=<cluster_name> show jobid=<job ID>
which will print out a list of "Keyword=Value" pairs which characterize the job. As long as a job is waiting in the queue, it is possible to modify at least some of these; for example, the command
scontrol --clusters=<cluster_name> update jobid=65648 TimeLimit=04:00:00
would change the run time limit of the above-mentioned example job from 8 hours to 4 hours.
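The "Keyword=Value" list lends itself to scripted lookups. In the sketch below, the sample text is a hypothetical, heavily abbreviated stand-in for real scontrol output, and the helper name get_field is likewise hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: look up one keyword in "Keyword=Value" output read from stdin.
get_field() {
    # Put each whitespace-separated pair on its own line, then match the keyword exactly.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Hypothetical, abbreviated sample; on a login node use instead:
#   scontrol --clusters=<cluster_name> show jobid=<job ID> | get_field TimeLimit
sample='JobId=65648 JobName=calc
   JobState=PENDING Reason=Resources
   TimeLimit=08:00:00 NumNodes=6'

echo "time limit: $(echo "$sample" | get_field TimeLimit)"
```

Checking the current value this way before issuing an update command helps to avoid changing the wrong job.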
Deleting jobs from the queue
To forcibly remove a job from SLURM, the command
scancel --clusters=<cluster_name> <JOB_ID>
can be used. Please do not forget to specify the cluster! The scancel(1) man page provides further information on the use of this command.
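Because the cluster argument is easy to forget, a small wrapper can insist on it. This is only a sketch: the function name cancel_job is hypothetical, and it prints the scancel command instead of executing it.

```shell
#!/bin/bash
# Hypothetical wrapper: refuse to build a scancel call unless both
# the cluster name and the job ID are given.
cancel_job() {
    if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: cancel_job <cluster_name> <JOB_ID>" >&2
        return 1
    fi
    # Dry run: print the command; remove "echo" to actually cancel the job.
    echo scancel --clusters="$1" "$2"
}

cancel_job mpp1 65648
cancel_job 65648 || echo "refused: cluster name missing"
```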
A GUI for job management
The command sview is available to inspect and modify jobs via a graphical user interface:
- To identify your jobs among the many in the list, select either the "specific user's jobs" or the "job ID" item from the menu "Actions → Search".
- By right-clicking on one of your jobs and selecting "Edit job" in the context menu, you can open a window which allows you to modify the job settings. Please be careful about committing your changes.
Documentation and Support
LRZ specific information
The subdocuments linked to in the following table provide further information about usage of SLURM on LRZ's HPC systems:
|Examples||provides example job scripts which cover the most common usage patterns for the|
|Policies||provides information about the policies, such as memory limits, run time limits etc.|
|Specifications||lists SLURM parameter settings and explains them, making appropriate recommendations where necessary|
- The home site of SLURM at LLNL
- The manual pages slurm(1), sinfo(1), squeue(1), scontrol(1), scancel(1), sview(1)
- In case of difficulties with the LRZ installations of SLURM, please contact us via the Service Desk.