Running parallel jobs with SLURM
The new MPP cluster uses the SLURM scheduler to execute parallel jobs. This document describes usage, policies and resources available for SLURM execution. It is envisaged that the other parallel cluster segments will also be migrated to SLURM in the near future.
Table of contents
- Introduction and Prerequisites
- Interactive SLURM shell
- Script-driven SLURM jobs
- Step 1: Edit a job script
- Step 2: Submission procedure
- Step 3: Checking the status of a job
- Inspection and modification of jobs
- Deleting jobs from the queue
- A GUI for job management
- Documentation and Support
- LRZ specific information
- General information
Note: The information contained in this document and its subdocuments applies to all parallel job classes. For serial job processing, please consult the SGE (Sun Grid Engine) documentation.
Introduction and Prerequisites
All parallel programs in the parallel segments of the cluster must be started using either
- an interactive SLURM shell, or
- a SLURM batch script.
In order to access the SLURM infrastructure described here, please first log in to a front end node of the cluster, and from there to a submission node of the cluster segment you wish to use; each cluster segment corresponds to one or more SLURM cells.
| Cluster segment | Submission (and development) node |
| Myrinet 10 GE cluster, MPP cluster | lxia4-1, lxia4-2 |
| sgi ICE, Ultraviolet | ice1-login |
Jobs can be submitted to any SLURM cell from any submission node, but please note that an MPI program built on the MPP or Myrinet 10 GE cluster will typically not run on the ICE or UV systems and vice versa.
This document provides information on how to configure, submit and execute SLURM jobs, as well as information about batch processing policies. In particular, please be aware that misuse of the resources described here can result in the invalidation of the violating account.
Interactive SLURM shell
For performing program testing and short runs the following sequence of commands can be used: First, salloc is invoked to reserve the needed resources. Then, the srun_ps command is used to start up a program on these resources.
|
salloc --ntasks=32 --partition=mpp1_inter
srun_ps ./myprog.exe
exit |
Start an MPP mode Parastation MPI program using two 16-way nodes on the MPP cluster. |
|
salloc --ntasks=16 --partition=ice1_inter
srun_ps ./myprog.exe
exit |
Start an MPP mode sgi MPT program using two nodes with 16 physical cores on the ICE. (Additionally specify --ntasks-per-core=2 if you want to use hyperthreaded cores; in this example only one node would then be assigned.) |
|
salloc --ntasks=6 --cpus-per-task=8 --partition=mpp1_inter
srun_ps -t 8 ./myprog.exe
exit |
Start a hybrid mode Parastation MPI program on the MPP cluster using 6 MPI tasks, with 8 OpenMP threads per task (3 nodes will be needed). |
|
salloc --ntasks=4 --cpus-per-task=4 --partition=ice1_inter
srun_ps -t 4 ./myprog.exe
exit |
Start a hybrid mode sgi MPT program on the ICE with 4 MPI tasks and 4 OpenMP threads per task (2 nodes will be needed). |
By default, a SLURM shell generated via salloc will run for 15 minutes. This interval can be extended up to the partition maximum by specifying a suitable --time=hh:mm:ss argument.
Notes:
- SLURM also has its own srun command; however this is not usable with MPI programs built with Parastation MPI or sgi MPT.
- The srun_ps command can be used to start up programs built with Parastation MPI (MPP cluster), sgi MPT (ICE or UV), or Intel MPI (all systems running under SLURM).
- Once the allocation expires, the program will be signalled and killed; further programs cannot be started. Please issue the exit command and start a new allocation.
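The node counts quoted in the examples above can be checked with a small calculation. The following is a minimal sketch, not an LRZ tool: the function name nodes_needed is hypothetical, and the cores-per-node values (16 on the MPP cluster, 8 physical cores on the ICE) are inferred from the examples rather than taken from a hardware specification. It assumes that all threads of one MPI task must reside on the same node.

```shell
#!/bin/bash
# nodes_needed NTASKS CPUS_PER_TASK CORES_PER_NODE
# Hypothetical helper: prints how many nodes an allocation occupies,
# assuming the threads of one task are never split across nodes.
nodes_needed() {
    local ntasks=$1 cpus_per_task=$2 cores_per_node=$3
    # Number of whole tasks that fit on a single node.
    local tasks_per_node=$(( cores_per_node / cpus_per_task ))
    if (( tasks_per_node < 1 )); then
        echo "error: a task with ${cpus_per_task} threads does not fit on a ${cores_per_node}-core node" >&2
        return 1
    fi
    # Ceiling division: round up to whole nodes.
    echo $(( (ntasks + tasks_per_node - 1) / tasks_per_node ))
}

nodes_needed 32 1 16   # MPP cluster, pure MPI:  prints 2
nodes_needed 6 8 16    # MPP cluster, hybrid:    prints 3
nodes_needed 16 1 8    # ICE, pure MPI:          prints 2
nodes_needed 4 4 8     # ICE, hybrid:            prints 2
```

The results agree with the node counts given in the table for the four salloc examples.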
Script-driven SLURM jobs
This type of execution method should be used for all production runs. A step-by-step recipe for the simplest type of parallel job is given, illustrating the use of the SLURM commands for users of the bash shell. See the documentation section at the end for pointers to more complex setups.
Step 1: Edit a job script
The following script is assumed to be stored in the file myjob.cmd.
| #!/bin/bash | |
| #SBATCH -o /home/hpc/<group>/<user>/myjob.%j.%N.out | (Placeholder) Standard output and error go there. Note that the directory where the output file is placed must exist before the job starts, and the full path name must be specified (no environment variable!). The %j encodes the job ID into the output file name. The %N encodes the master node of the job and should be added since job IDs from different SLURM clusters may be the same. |
| #SBATCH -D /home/hpc/<group>/<user>/mydir | (Placeholder) Directory used by the script as starting point (working directory). |
| #SBATCH -J <job_name> | (Placeholder) Name of the job. |
| #SBATCH --clusters=mpp1 | Instead of mpp1 (for the MPP cluster), you can also use ice1 (for the ICE), or uv2 or uv3 (for the two UltraViolet partitions) here. |
| #SBATCH --get-user-env | Set the user environment properly. |
| #SBATCH --ntasks=64 | Number of MPI tasks assigned to the job. By default, SLURM will start as many MPI tasks per node as there are physical cores in the node. |
| #SBATCH --mail-type=end | Send an e-mail at job completion. |
| #SBATCH --mail-user=<email_address>@<domain> | (Placeholder) E-mail address (don't forget!). |
| #SBATCH --export=NONE | Do not export the environment of the submitting shell into the job. While SLURM also allows ALL here, this is strongly discouraged, because the submission environment is very likely to be inconsistent with the environment required for execution of the job. |
| #SBATCH --time=08:00:00 | Maximum run time is 8 hours 0 minutes 0 seconds; this may be increased up to the queue limit. |
| source /etc/profile.d/modules.sh | Initialize the module system. |
| module load gsl # ... etc | Load any required environment modules (may be needed if the program is linked against shared libraries). "gsl" of course is only a placeholder. |
| srun_ps ./my_mpi_prog.exe | Start the MPI executable. The MPI variant used depends on the loaded module set; non-MPI programs may fail to start up. Please consult the example jobs or the software-specific documentation for other startup mechanisms. |
This script essentially looks like a bash script. However, it contains specially marked comment lines ("control sequences") which have a special meaning in the SLURM context; these are explained in the right-hand column of the table above. The entries marked "Placeholder" must be suitably modified to have valid user-specific values.
For this script, the environment of the submitting shell will not be exported to the job's environment; the job will start 64 MPI tasks on as many cores.
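For convenience, the script lines from the table can be collected into a single file. The sketch below writes them to myjob.cmd in the current directory using a quoted here-document (so that nothing is expanded by the shell) and then lists the lines that still contain placeholders. The placeholders in angle brackets are deliberately left as they are and must be edited by hand; the final grep is only an illustrative check and not part of the job script.

```shell
#!/bin/bash
# Write the example job script from the table to myjob.cmd.
# The quoted 'EOF' keeps %j, %N and the <...> placeholders literal.
cat > myjob.cmd <<'EOF'
#!/bin/bash
#SBATCH -o /home/hpc/<group>/<user>/myjob.%j.%N.out
#SBATCH -D /home/hpc/<group>/<user>/mydir
#SBATCH -J <job_name>
#SBATCH --clusters=mpp1
#SBATCH --get-user-env
#SBATCH --ntasks=64
#SBATCH --mail-type=end
#SBATCH --mail-user=<email_address>@<domain>
#SBATCH --export=NONE
#SBATCH --time=08:00:00
source /etc/profile.d/modules.sh
module load gsl   # placeholder module, replace as needed
srun_ps ./my_mpi_prog.exe
EOF

# Show the lines that still contain a <placeholder> and need editing.
grep -n '<[a-z_]*>' myjob.cmd
```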
Step 2: Submission procedure
The job script is submitted to the queue via the command
sbatch myjob.cmd
At submission time the control sequences are evaluated and stored in the queuing database, and the script is copied into an internal directory for later execution. If the command was executed successfully, the Job ID will be returned as follows:
Submitted batch job 65648.
It is a good idea to note down your job IDs, for example to provide them to LRZ HPC support if anything goes wrong. The submission command can also contain control sequences. For example,
sbatch --time=12:00:00 myjob.cmd
would override the setting inside the script, raising the run time limit from 8 to 12 hours.
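To keep track of job IDs automatically, the confirmation line printed by sbatch can be parsed. The following is a minimal sketch under the assumption that the message has the form shown above ("Submitted batch job" followed by the number, possibly with further text after it); the function name job_id_from_sbatch is hypothetical. The example feeds in the sample message instead of calling sbatch, so it can be tried on any machine.

```shell
#!/bin/bash
# job_id_from_sbatch: reads sbatch's confirmation message on stdin
# and prints only the numeric job ID (nothing if the line does not match).
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}

# On a submission node one would write:
#   jobid=$(sbatch myjob.cmd | job_id_from_sbatch)
# Here the sample message from the text is used instead.
jobid=$(echo "Submitted batch job 65648" | job_id_from_sbatch)
echo "$jobid"   # prints 65648
```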
Step 3: Checking the status of a job
Once submitted, the job will be queued for some time, depending on how many jobs are presently submitted. Once enough previously submitted jobs have completed, the job will be started on one or more of the systems determined by its resource requirements. The status of the job can be queried with the squeue --clusters=[all | cluster_name] command, which will give an output like
CLUSTER: mpp1
JOBID  PARTITION   NAME  USER  ST  TIME   NODES  NODELIST(REASON)
65646  mpp1_batch  job1  xyz1  R   24:19  2      lxa[7-8]
65647  mpp1_batch  myj   xza2  R   0:09   1      lxa14
65648  mpp1_batch  calc  yaz7  PD  0:00   6      (Resources)
(assuming mpp1 is specified as the clusters argument), indicating that job 65648 is still queued (state "PD", pending). Once the job is running, the output will show the state "R" (running) and list the host(s) it is running on. For jobs that have not yet started, the --start option of squeue provides an estimate (!) of the starting time. The sinfo --clusters=[all | cluster_name] command prints an overview of the status of all clusters, or of a particular cluster, in the SLURM configuration.
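In scripts, the state of a single job can be picked out of the squeue listing. The sketch below assumes the column layout of the sample output above (job ID in the first column, state in the fifth) and job names without blanks; the function name job_state is hypothetical. It works on the sample text, so no SLURM installation is needed to try it.

```shell
#!/bin/bash
# job_state JOBID: reads squeue output on stdin and prints the
# ST column of the line whose first column equals JOBID.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample listing copied from the text; on a submission node one would
# pipe "squeue --clusters=mpp1" into job_state instead.
sample='CLUSTER: mpp1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
65646 mpp1_batch job1 xyz1 R 24:19 2 lxa[7-8]
65647 mpp1_batch myj xza2 R 0:09 1 lxa14
65648 mpp1_batch calc yaz7 PD 0:00 6 (Resources)'

echo "$sample" | job_state 65648   # prints PD
echo "$sample" | job_state 65646   # prints R
```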
Inspection and modification of jobs
Queued jobs can be inspected for their characteristics via the command
scontrol --clusters=<cluster_name> show jobid=<job ID>
which will print out a list of "Keyword=Value" pairs which characterize the job. As long as a job is waiting in the queue, it is possible to modify at least some of these; for example, the command
scontrol --clusters=<cluster_name> update jobid=65648 TimeLimit=04:00:00
would change the run time limit of the above-mentioned example job from 8 hours to 4 hours.
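A single value can be extracted from the "Keyword=Value" list with standard text tools. The following is a minimal sketch; the function name job_field is hypothetical, and the sample text is an abbreviated, made-up excerpt rather than real output from the LRZ systems. It assumes that pairs are separated by blanks or line breaks and that values themselves contain no blanks.

```shell
#!/bin/bash
# job_field KEY: reads "Keyword=Value" pairs on stdin (as printed by
# "scontrol show jobid=...") and prints the value belonging to KEY.
job_field() {
    # Put every pair on its own line, then strip "KEY=" from the match.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Illustrative excerpt only; real output contains many more pairs.
sample='JobId=65648 JobName=calc
   JobState=PENDING Reason=Resources
   TimeLimit=08:00:00 NumNodes=6'

echo "$sample" | job_field TimeLimit   # prints 08:00:00
echo "$sample" | job_field JobState    # prints PENDING
```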
Deleting jobs from the queue
To forcibly remove a job from SLURM, the command
scancel --clusters=<cluster_name> <JOB_ID>
can be used. Please do not forget to specify the cluster! The scancel(1) man page provides further information on the use of this command.
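Since a missing cluster name is an easy mistake, a small wrapper can refuse incomplete input before anything is sent to SLURM. The sketch below is a dry run that only prints the command line; the function name cancel_cmd is hypothetical and not part of SLURM.

```shell
#!/bin/bash
# cancel_cmd CLUSTER JOB_ID: prints the scancel command line, or fails
# if the cluster name is missing or the job ID is not a number.
cancel_cmd() {
    local cluster=$1 jobid=$2
    if [[ -z $cluster || ! $jobid =~ ^[0-9]+$ ]]; then
        echo "usage: cancel_cmd <cluster_name> <JOB_ID>" >&2
        return 1
    fi
    echo "scancel --clusters=$cluster $jobid"
}

cancel_cmd mpp1 65648   # prints: scancel --clusters=mpp1 65648
```

Printing instead of executing makes it possible to review the command first; on a submission node the printed line can then be run as is.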
A GUI for job management
The command sview is available to inspect and modify jobs via a graphical user interface:
- To identify your jobs among the many in the list, select either the "specific user's jobs" or the "job ID" item from the menu "Actions → Search".
- By right-clicking on one of your jobs and selecting "Edit job" in the context menu, you obtain a window which allows you to modify the job settings. Please be careful about committing your changes.
Documentation and Support
LRZ specific information
The subdocuments linked to in the following table provide further information about usage of SLURM on LRZ's HPC systems:
| Examples | provides example job scripts which cover the most common usage patterns for the |
| Policies | provides information about the policies, such as memory limits, run time limits etc. |
| Specifications | lists SLURM parameter settings and explains them, making appropriate recommendations where necessary |
General information
- The home site of SLURM at LLNL
- The manual pages slurm(1), sinfo(1), squeue(1), scontrol(1), scancel(1), sview(1)
- In case of difficulties with the LRZ installations of SLURM, please contact us via the Service Desk.