Interactive and Batch Jobs with LoadLeveler
Table of contents
- Node Allocation Policy
- Interactive parallel Jobs
- Batch-Jobs with LoadLeveler
- Job Command File
- Job Command File Keywords
- Job Name
- Job Type
- Job Class
- User Hold
- Parallel Jobs and Node Allocation
- Network Protocol
- Files and Directories
- Executable and Arguments
- Job Steps and Dependencies
- Logfiles of Submitted Jobs
- Working with energy aware jobs on SuperMUC
- Querying the Status of a Job
- Why isn't my job running?
- Special Cases
- Further Information
Node Allocation Policy
- Only complete nodes are provided to a given job for dedicated use.
- Accounting is performed using: AllocatedNodes * Walltime * (number of cores per node)
- Try not to waste cores of a node, i.e. try to use all cores. However, in some special cases (e.g. large memory requirements) you may not be able to use all cores.
- TASKS_PER_NODE must be less than or equal to 40 for fat nodes
- TASKS_PER_NODE must be less than or equal to 16 for thin nodes
- TASKS_PER_NODE must be less than or equal to 28 for Haswell nodes
(if using SMT/Hyper-Threading, it must be less than or equal to 80, 32 or 56, respectively)
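As a rough illustration of the accounting formula above, the following sketch multiplies allocated nodes, wall-clock time and cores per node. The function name core_hours and the use of whole hours are assumptions made for this example only; the unit in which LRZ actually bills is not stated here.

```shell
# Hypothetical helper: core-hours charged for a job.
# Usage: core_hours <allocated_nodes> <walltime_hours> <cores_per_node>
core_hours() {
  echo $(( $1 * $2 * $3 ))
}

# 8 thin nodes (16 cores each) for 48 hours are charged in full,
# even if fewer than 16 tasks per node were actually started.
core_hours 8 48 16
```

Because complete nodes are charged, leaving cores idle does not reduce the accounted amount.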
Job size: total tasks, number of nodes and tasks per node
Typically there are two ways of specifying the job size:
- Total tasks and number of nodes
- This is the more general case
- If your number of tasks is not evenly divisible by the number of nodes
- You must compute the number of nodes yourself (i.e., ceil(number of tasks/40))
- LoadLeveler decides how to distribute the tasks to nodes
- marked red in the following examples
- Tasks per node and number of nodes
- If your number of tasks is 40 (or a bit less)
- If you can run with various numbers of nodes by specifying min and max
- marked blue in the following examples
- Do not mix both ways!
- Always specify number of nodes
- Always specify the number of islands on SuperMUC for jobclass "large"
- Do not waste cores of a node (remember: 16 or 40 cores per node)!
See details in the section: Keywords for Node and Core allocation
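The node count for the "total tasks and number of nodes" variant can be computed with integer arithmetic. This is a minimal sketch; the function name nodes_needed is hypothetical, and the cores-per-node value has to match the node type you are targeting.

```shell
# Hypothetical helper: smallest number of nodes that can hold all tasks,
# i.e. ceil(total_tasks / cores_per_node) using integer arithmetic.
# Usage: nodes_needed <total_tasks> <cores_per_node>
nodes_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 78 40    # fat nodes, 40 cores each
nodes_needed 128 16   # thin nodes, 16 cores each
```

The result is the value to use for #@ node together with #@ total_tasks.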
Interactive parallel Jobs
Parallel executables which are built with IBM's MPI can be run interactively by invoking them with one of the following methods
- poe ./executable -procs <nnn> -nodes <NNN>
- mpiexec -n <nnn> ./executable -nodes <NNN>
- MP_PROCS=nnn MP_NODES=NNN poe ./executable
- MP_PROCS=nnn MP_NODES=NNN mpiexec ./executable
- MP_PROCS=nnn MP_NODES=NNN ./executable
The executable is not executed on the front-end nodes but on the compute nodes. The resource manager of LoadLeveler is used to allocate free nodes. Therefore it is equivalent to submitting a Job Command File to LoadLeveler via llsubmit.
The LoadLeveler keyword part can also be used to allocate nodes:
cat > LL_FILE <<EOD
#@ job_type = parallel
#@ node = 2
#@ total_tasks = 78
## other example
##@ tasks_per_node = 39
#@ class = test
#@ queue
EOD
poe ./executable -rmfile LL_FILE
The above will work on the thin login nodes. On the fat login nodes, please specify #@ class = fattest.
You can invoke executables compiled with Intel MPI via poe. However, the runtime libraries of IBM MPI are then used, and problems may occur, particularly with hybrid (OpenMP-MPI) execution. We recommend writing a LoadLeveler Job Command File and submitting it to the class "test" or "fattest". For convenience you can also use llrun from the lrztools.
Limited resources for job class "test/fattest"
If you are logged in to sb.supermuc.lrz.de or hw.supermuc.lrz.de, POE jobs take their resources from the nodes of the job class "test". If you are logged in to wm.supermuc.lrz.de, POE jobs take their resources from the nodes of the job class "fattest". Programs compiled for the Sandy Bridge architecture (using AVX) may not be able to run on a fat node, and programs compiled with CORE-AVX2 can only run on the Haswell architecture of Phase 2.
If you get the following message:
ERROR: 0031-165 Job has been cancelled, killed, or schedd or resource is unavailable. Not enough resources to start now. Global MAX_TOP_DOGS limit of 1 reached.
then all nodes are busy. You have to wait and try again, or you have to submit your program to the test or general queue as a batch job e.g., using the llsubmit command. You can use the examples given below by just replacing class=general by class=test.
If you use more than one node (i.e., more than 16 cores on a thin island, or more than 40 cores on a fat island), you might need to explicitly set MP_NODES to a value larger than 1. Alternatively, use llrun, which in most cases will make the correct decision for node and core allocation.
Batch-Jobs with LoadLeveler
The login node "supermuc-login" is intended only for editing and compiling your parallel programs. Interactive usage of "poe/mpirun" is not allowed. To run test or production jobs, submit them to the LoadLeveler batch system, which will find and allocate the resources required for your job (i.e., the compute nodes to run your job on).
The most important LoadLeveler commands are:
| llsubmit | Submit a job script for execution. |
| llq | Check the status of your job(s). |
| llhold | Place a hold on a job. |
| llcancel | Cancel a job. |
| llclass | Query information about job classes. |
The -H flag provides extended help information.
Build your job command file "job.cmd" by using a text editor to create a script file.
# This job command file is called job.cmd
#@ input = job.in
#@ output = job.out
#@ error = job.err
#@ job_type = parallel
#@ class = general
#@ node = 8
#@ total_tasks = 128
#@ network.MPI = sn_all,not_shared,us
#@ ... other LoadLeveler keywords (see below)
#@ queue
echo "JOB is run"
To submit the job command file that you created, use the llsubmit command:
llsubmit job.cmd
LoadLeveler responds by issuing a message similar to:
submit: The job "supermuc.22" has been submitted.
Where supermuc is the name of the machine to which the job was submitted and 22 is the job identifier (ID).
To display the status of the job you just submitted, use the llq command. This command returns information about all jobs in the LoadLeveler queue:
llq
To place a temporary hold on a job in a queue, use the llhold command. This command only takes effect if jobs are in the Idle or NotQueued state. Jobs held for more than 3 months will be deleted by the system administrators.
To release the hold, use the llhold command with the -r option:
llhold -r supermuc.22
To cancel a job, use the llcancel command:
llcancel supermuc.22
Job Command File
A Job Command File describes the job to be submitted to the LoadLeveler Job Manager using the llsubmit command. It can contain multiple steps, each designated by the #@queue statement. Lines starting with '#@' are statements that are interpreted by LoadLeveler's parser. The job command file itself can be the script to be executed by each step of the job.
Note that the llsubmit command itself does not build the job from command line arguments; all job attributes must be specified in the job command file.
Examples for job command files
Submission of jobs to the Phase 1 system requires logging in to a Phase 1 login node;
Submission of jobs to the Phase 2 system requires logging in to a Phase 2 login node.
We recommend not using csh/tcsh as the environment for LoadLeveler (whether by indicating the executing shell with #! in the first line or with the #@ shell keyword). We see issues with the execution of the login profiles, paths and/or module environment. If you want to use csh/tcsh, pack your commands and settings into an ordinary csh script and execute this script within the LoadLeveler job.
Job Command File Keywords
#@ job_name = myjob
- Specifies the name of the job. This keyword must be specified in the first job step. If it is specified in other job steps in the job command file, it is ignored.
The job_name only appears in the long reports of the llq, llstatus, and llsummary commands, and in mail related to the job.
#@ job_type = serial | parallel | MPICH
parallel
- Means a POE job
- Must be used when running jobs with IBM MPI.
MPICH
- Similar to a parallel job, but handled differently by LoadLeveler internally
- Must be used when running jobs with Intel MPI.
serial (presently not available at SuperMUC)
- Single-task job
- For single-task, multi-threaded jobs, please see the OpenMP job example above
- Many parallel keywords cannot be specified with a serial job_type
#@ class = class_name
Valid classes are:
Job classes on SuperMUC Fat Nodes (Phase 1)
| Class | Purpose | Remarks | Max. Island Count | Nodes | Wall Clock Limit | Memory per Node (GB) |
|---|---|---|---|---|---|---|
| fattest | Test and interactive use | | 1 | 1 - 4 | | |
| fatter | Long-running jobs, Time Extended Runs | a) | 1 | 1 - 13 | 192 h (8 days) | |
| fat | Medium-sized production runs fitting into the fat node island, 256 GB/node | | 1 | 1 - 52 | 48 h | 256 |
Job classes on SuperMUC Thin Nodes (Phase 1)
| Class | Purpose | Remarks | Max. Island Count | Nodes | Wall Clock Limit | Memory per Node (GB) |
|---|---|---|---|---|---|---|
| test | Test and interactive use | | 1 | 1 - 32 | 30 min | 32 |
| general | Medium-sized production runs fitting into a single island | c), d) | 1 or | 33 - 512 | 48 h | 32 |
| large | Large jobs spanning more than one island. You must specify the LL keyword #@ island_count | a) b) c) d) e) | min: ≥ 2, max: ≤ 2*min | 513 - 4096 | 48 h | 32 |
Job classes on SuperMUC Haswell Nodes (Phase 2)
| Class | Purpose | Remarks | Max. Island Count | Nodes | Wall Clock Limit | Memory per Node (GB) |
|---|---|---|---|---|---|---|
| test | Test and interactive use | | 1 | 1 - 20 | 30 min | 64 |
| micro | Small jobs, pre- and postprocessing runs (internally restricted to run only on some specific islands) | | 1 | 1 - 20 | 48 h | 64 |
| general | Medium-sized production runs fitting into a single island | c) d) | 1 or | 21 - 512 | 48 h | 64 |
| big | Big memory per node jobs (pre- and postprocessing runs, on some Haswell nodes equipped with 256 GB/node) | | 1 | 1 - 8 | 6 h | 256 |
Special Job Classes
Restricted use (on request by particular users)
|1, 6 or 18||
|For temporary workarounds and for specific priority requirements.
Restricted use, not generally available.
1 - 9216
a) Jobs may be forcibly killed for maintenance
b) For large core counts it is not recommended to use SMT threads/Hyperthreading
c) These job classes require a minimum number of nodes
d) See description of #@ island_count below for more details and important recommendations.
Hint: use "llclass -l" to see all limits and definitions
#@ wall_clock_limit = hardlimit[,softlimit]
Specifies the hard limit, soft limit, or both for the elapsed time for which a job can run.
Limits are specified with the format hh:mm:ss (hours:minutes:seconds)
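When limits are computed in a script, a value in seconds has to be converted to the hh:mm:ss format. The sketch below shows one way to do this; the helper name to_hms and the example limits are assumptions for illustration.

```shell
# Hypothetical helper: convert a number of seconds to hh:mm:ss.
to_hms() {
  printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Build a directive with a 48 h hard limit and a 47 h soft limit.
echo "#@ wall_clock_limit = $(to_hms 172800),$(to_hms 169200)"
```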
#@ hold = user
Specifies that the job will be put into user hold state when it is submitted. The job can subsequently be released by issuing the command llhold -r <job_id>.
Parallel Jobs and Node Allocation
#@ node_topology = island
This keyword is needed for SuperMUC, and is therefore automatically set by the LRZ submit filter.
#@ island_count = <min[,max]>
#@ island_count = number
If a job requires more than 512 nodes, this keyword must be specified. Each island on SuperMUC normally contains 512 nodes; however, nodes with problems may be switched off. The scheduler attempts to start the job on min islands, but will start the job on up to max islands if the job currently does not fit on min islands (e.g., because there are not enough free nodes available). Since SuperMUC has a pruned (1:4) network topology, there is a trade-off between performance and a better chance of getting the job allocated.
LRZ recommends using an island count which is slightly higher (+1 or +2) than the minimum requirement to give LoadLeveler more flexibility in finding free nodes. Example: if you want to run on 2000 nodes, you must use at least 4 islands. In this case specify
#@ island_count = 4,5
instead of requiring exactly 4 islands.
For jobs with fewer than 512 nodes you can also use island_count. However, if you want fewer than approx. 400 nodes, please specify a minimum island_count of 1 to avoid fragmentation of the machine by many small jobs.
It is always necessary to set island_count=1 for Intel MPI jobs on Phase 2 (Haswell nodes).
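The recommendation for large jobs can be turned into a small calculation. This sketch assumes 512 nodes per island and adds one island of slack; the function name island_count_hint is hypothetical, and the result is only a starting point (it does not apply to Intel MPI jobs on Phase 2, which need island_count=1).

```shell
# Hypothetical helper: suggest "min,max" for #@ island_count.
# min = ceil(nodes / 512), max = min + 1 (one island of slack).
island_count_hint() {
  min=$(( ($1 + 511) / 512 ))
  echo "${min},$(( min + 1 ))"
}

echo "#@ island_count = $(island_count_hint 2000)"
```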
#@ node = <min[,max]>
#@ node = <number>
The scheduler attempts to get max nodes to run the job step, but will start the job step on min nodes if necessary. If max is not specified, the scheduler uses exactly the given number of nodes to run the job step.
#@ tasks_per_node = <number>
Used in conjunction with #@ node; each node is assigned number tasks.
tasks_per_node must be less than or equal to 40.
#@ total_tasks = <number>
Rather than specifying the number of tasks to start on each node, the total number of tasks in the job step across all nodes can be specified
Valid combinations of keywords are given in the next table (keywords marked with X can be combined):
|node = <min, max>||
|node = <number>||
see also: Task Assignment Considerations
Because of using the Island Topology, the keywords blocking and task_geometry are not supported on SuperMUC!
#@ first_node_tasks = number
Specifies a different task count for the first node assigned to a job step. This may be useful for Master-Slave applications. All remaining nodes will run tasks_per_node tasks. The first_node_tasks keyword can only be specified in conjunction with node and tasks_per_node; it cannot be specified with the total_tasks keyword. When first_node_tasks is used, a maximum node specification is not permitted. For example, node=6 is acceptable but node=6,8 is not. Example:
#@ node = 6
#@ first_node_tasks = 1
#@ tasks_per_node = 16
A total of 6 machines are selected for the job step. The first machine will run only 1 task, task ID 0; the remaining 5 machines will run 16 tasks each. The total number of tasks in this job step is 81.
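The arithmetic behind this total can be checked with a short sketch. The function name total_tasks_with_first_node is hypothetical; it only restates the rule that the first node runs first_node_tasks tasks and every other node runs tasks_per_node tasks.

```shell
# Hypothetical helper: total tasks of a job step that uses first_node_tasks.
# Usage: total_tasks_with_first_node <node> <first_node_tasks> <tasks_per_node>
total_tasks_with_first_node() {
  echo $(( $2 + ($1 - 1) * $3 ))
}

total_tasks_with_first_node 6 1 16
```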
#@ network.protocol = type[, usage[, mode[,comm_level]]]
protocol: Specifies the communication protocol that is used with an adapter:
- MPI: Specifies the message passing interface (MPI).
- LAPI: Specifies the low-level application programming interface.
- MPI_LAPI: Specifies both MPI and LAPI.
usage: Specifies whether the adapter can be shared with tasks of other job steps. Possible values are
- shared, which is the default,
- not_shared.
type: This field is required and specifies one of the following:
- sn_all: Specifies that striped communication should be used over all available switch networks.
mode: Specifies the communication subsystem mode used by the communication protocol that you specify
- US (User Space): always use US!
- IP: do not use, although it is the default!
#@ restart = yes | no
Specifies whether LoadLeveler considers a job to be restartable.
If restart=yes (default), and the job is vacated (e.g. in case of system errors) from its executing machine before completing, the central manager requeues the job. It can start running again when a machine on which it can run becomes available.
If restart=no, a vacated job is canceled rather than requeued.
Files and Directories
#@ input = filename
#@ output = filename
#@ error = filename
#@ initialdir = pathname: Specifies the path name of the directory to use as the initial working directory during execution of the job step.
Filename can be an absolute path or a path relative to the current working directory.
If a keyword is not specified, LoadLeveler uses /dev/null.
Do not specify these keywords with empty values; this might lead to errors.
Executable and Arguments
The Job Manager copies the executable from the submitting node to the spool directory
#@ executable = program_name
- The name of the executable that will be started by LoadLeveler
- If blank, the job command file itself will be used as the executable
- A different executable can be specified for each step in a job
#@ arguments = arg1 arg2 ...
- Specifies the list of arguments passed to the program when the job step runs
- Different arguments can be specified for each step, even if all steps are using the job command file as the executable
#@ environment = env1; env2; ...
Specifies the environment variables that will be set for the job step by LoadLeveler when the job step starts.
- DO NOT USE COPY_ALL!
- $var: Copies var into the job step's environment
- !var: Omits var from the job step's environment
- var=value: Specifies that the environment variable var be set to the value "value" and copied to the job step's environment when the job step is started
LoadLeveler sets the environment before the login shell is executed, so anything set by the shell (such as settings in .profile) will override the variables set by LoadLeveler
#@ env_copy = all | master
Specifies whether environment variables are copied to all nodes of the job step or only to the master node. By default, the environment is copied to all nodes
#@ shell = name
When LoadLeveler starts the job, the specified shell will be used instead of the default shell for the user from /etc/passwd
#@ requirements = <boolean expression>
This keyword can be used to specify conditions that the scheduled resources must fulfill. In the context of SuperMUC, the following boolean expressions are useful:
- (Island != island01) # run on thin nodes if in job class special
- (Island == island01) # run on fat nodes if in job class special
Job Steps and Dependencies
A single job can contain more than one step. Every job contains, by default, at least one step. A step can have dependencies on the exit codes of other steps in the same job. Each "#@ queue" statement marks the end of one step and the beginning of the next. Most keyword values are inherited from the previous step. A group of steps within a job can be co-scheduled: they are treated as one entity that will all be started at the same time.
#@ step_name = step_name
You can name the job step using any combination of letters, numbers, underscores (_) and periods (.). You cannot, however, name it T or F, or use a number in the first position of the step name. The step name you use must be unique and can be used only once.
#@ dependency = step_name operator value
value is usually a number that specifies the job return code to which the step_name is set.
Operators include ==, !=, <=, >=, <, >, &&, ||
A step can have dependencies on more than one step, like
#@ dependency = (step1 == 0) && (step2 >= 0)
It can also be one of the following LoadLeveler defined job step return codes:
Special Return Codes
- CC_NOTRUN: The return code set by LoadLeveler for a job step which is not run because the dependency is not met. The value of CC_NOTRUN is 1002.
- CC_REMOVED: The return code set by LoadLeveler for a job step which is removed from the system (because, for example, llcancel was issued against the job step). The value of CC_REMOVED is 1001.
See also: example job with dependencies
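A two-step job with a dependency could look roughly like the following sketch, which writes a job command file in the same way as the LL_FILE example earlier. The job name, step names, file names, class and sizes are hypothetical placeholders; the second step runs only if the first one exits with return code 0, and both steps use the job command file itself as the executable.

```shell
# Write a hypothetical two-step job command file (all names are illustrative).
jobfile=$(mktemp)
cat > "$jobfile" <<'EOD'
#!/bin/bash
#@ job_name = two_step_demo
#@ job_type = parallel
#@ class = test
#@ node = 2
#@ total_tasks = 32
#@ wall_clock_limit = 00:20:00
#@ step_name = prepare
#@ output = prepare.$(jobid).out
#@ error = prepare.$(jobid).err
#@ queue
#@ step_name = compute
#@ dependency = (prepare == 0)
#@ output = compute.$(jobid).out
#@ error = compute.$(jobid).err
#@ queue
# Both steps execute this script; branch on the step name set by LoadLeveler.
case "$LOADL_STEP_NAME" in
  prepare) echo "preparing input" ;;
  compute) echo "running main computation" ;;
esac
EOD
echo "wrote $jobfile"
```

The file would then be submitted with llsubmit. Keywords such as class and node are given once and inherited by the second step.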
#@ notification = always|error|start|never|complete
Specifies when mail is sent to the address given in the notify_user keyword:
- always : Notify the user when the job begins, ends, or if it incurs error conditions.
- error: Notify the user only if the job fails.
- start: Notify the user only when the job begins.
- never: Never notify the user.
- complete: Notify the user only when the job ends. (Default)
Note: the number of execution nodes reported in the notification mail is in general incorrect. This is a known bug in LoadLeveler that will not be fixed any more.
#@ notify_user = email_address
Specifies the address to which mail is sent based on the notification keyword.
Variables only accessible for LoadLeveler command file (at submit time)
Several variables are available for use in job command files.
$(domain): The domain of the host from which the job was submitted.
$(home): The home directory for the user on the cluster selected to run the job.
$(user): The user name that will be used to run the job. This might be a different user name.
$(host): The hostname of the machine from which the job was submitted.
$(jobid): The sequential number assigned to this job by the schedd daemon.
$(stepid): The sequential number assigned to this job step when multiple queue statements are used with the job command file.
Some variables are set from other keywords defined in the job command file
$(base_executable): Automatically set from the executable keyword; consists of the executable file name without the directory component (basename).
#@ output = $(home)/$(job_name)/$(step_name).$(schedd_host).$(jobid).$(stepid).out
Environment Variables (accessible with job)
LoadLeveler sets several environment variables in the application's environment. A complete list is available in Using and Administering, but here are a few examples and explanations:
| LOADLBATCH=yes | Set when it is a batch job. |
| LOADL_HOSTFILE=/var/loadl/... | Contains the list of hosts where the job is run. |
| LOADL_STEP_ID=srv23ib.12345.0 | The three-part job identifier. |
| LOADL_JOB_NAME=mytest | The name of the job. |
| LOADL_STEP_NAME=run1 | The step name. |
| LOADL_STEP_CLASS=general | The job class. |
| LOADL_PID=98765 | The process ID of the starter process. |
| LOADL_STEP_COMMAND=/home/prxxxx/luyyyy/JOB | The name of the executable (or the name of the job command file if the job command file is the executable). |
| LOADL_STEP_INITDIR=/home/prxxxx/luyyyy/ | The initial working directory. |
| LOADL_STEP_ARGS=input1 | Any arguments passed by the job step. |
| LOADL_STEP_ERR=err.51458 | The file used for standard error messages (stderr). |
| LOADL_STEP_OUT=out.51458 | The file used for standard output (stdout). |
| LOADL_STEP_IN=/dev/null | The file used for standard input (stdin). |
| LOADL_TOTAL_TASKS=800 | Specifies the total number of tasks of the MPICH job step. This variable is available only when the job_type is set to MPICH. |
Logfiles of Submitted Jobs
For each job which starts running, a logfile is written into: $SCRATCH/.lrz_user_logs/<jobstepID> containing the following information
- userID and groupID
- budget and quota information
- Loaded modules at the beginning of the job
- Job script
- environment at the beginning of the job
- job status at the beginning of the job, including node list
Do not delete these files yourself, since they are needed for analysis in case of problems; the files will be automatically deleted by LRZ after a few weeks.
Working with energy aware jobs on SuperMUC
Energy consumption, power requirements, CO2 emissions and the associated costs are becoming major issues for high performance computing. Therefore LRZ procured energy saving functionalities for LoadLeveler. Using the energy functions, a job can run faster than default if the benefit in run time exceeds certain thresholds (given below), typically at the expense of using more energy.
The energy_policy_tag allows LoadLeveler to identify the energy data associated with a job. With those energy data, LoadLeveler can decide which frequency should be used to run the job with minimal performance degradation. The energy data include:
- Power consumption and the elapsed time when run with default frequency
- The estimated power and energy
- The elapsed time at other frequencies
- The percentage of performance degradation (wrt. runtime)
If you set the energy policy tag in the job command file, the energy data will be generated and stored in the database when the job runs for the first time. If the job is submitted again with the same energy policy tag, the same policy will be used. When submitting a job with a new energy tag for the first time, make sure the tag name is unique among the tags you have previously generated.
To use energy keywords in the job command file follow these steps:
- Generation of energy tags:
Provide a unique identifier for the energy_policy_tag when submitting a job for the first time. For example:
#@ energy_policy_tag = my_energy_tag
#@ minimize_time_to_solution = yes
The identifier may contain lower and upper case Latin letters, Arabic digits, and the underscore character, however, no other characters are allowed. Especially, ä, ö, ü, @ and . characters must not be used.
LoadLeveler generates the energy data associated with this energy tag for the job when the job runs for the first time. After the job has ended, the energy data using the energy tag can be queried the following way:
llrqetag -e my_energy_tag [-u user] [-j jobid]
Generated by: srv03-ib.219213.0
Last used Time: Thu Jun 27 11:45:40 2013
User: luxyz123
Nominal Frequency: 2.70 GHZ
Default Frequency: 2.30 GHZ
Node's Energy Use: 0.001168 KWH
Execution Time: 71 Seconds
Frequency(GHZ) EstEngCons(KWH) EngVar(%) EstTime(Sec) TimeVar(%) Power(W)
2.70 0.001380 18.10 61 -14.08 81.43
2.60 0.001317 12.75 63 -11.27 75.27
2.50 0.001266 8.39 65 -8.45 70.14
2.40 0.001241 6.26 68 -4.23 65.72
2.30 0.001168 0.00 71 0.00 59.24
2.20 0.001183 1.26 73 2.82 58.34
2.10 0.001159 -0.82 77 8.45 54.17
....
1.30 0.001330 13.88 124 74.65 38.63
1.20 0.001455 24.53 134 88.73 39.09
EstEngCons: Estimated Energy Consumption (Energy to solution)
EngVar: EnergyVariation(Freq) = (EstEngCons(Freq)-EstEngCons(2.3Ghz))/EstEngCons(2.3Ghz)
negative values: energy saving; positive values: needs more energy to solution
EstTime: Estimated time for job
TimeVar: TimeVariation, negative: faster, positive: slower
Power: Power requirement of one socket
The first data line (2.70 GHz) shows that running the application 14.08% faster requires 18.1% more energy.
Generate the energy tag only for production runs which are running at least 10 minutes. Short running jobs or jobs which do not reflect the production profile will not provide sensible and reliable results.
Using the energy tags to run at higher frequency than default:
Not all jobs will run faster if the CPU clock speed is increased. It depends on the characteristics of the application: a CPU-bound job will run faster, whereas a memory-bound job might not benefit from a higher CPU frequency. Based on the data associated with the energy tag, LoadLeveler decides which frequency is best for the job, trading off energy consumption against run time.
There are two required steps if you want to run at a higher frequency:
- You need to generate an energy tag in a first characteristic job.
- You need to use this tag and policy in the follow-on jobs by specifying the same keywords again:
#@ energy_policy_tag = my_energy_tag
#@ minimize_time_to_solution = yes
This requests a frequency increase from LoadLeveler. Without these keywords, jobs will run at the standard frequency of 2.3 GHz.
If run time with 2.4 GHz is expected to decrease by more than 2.5% compared to 2.3 GHz, the frequency is set to 2.4 GHz
If run time with 2.5 GHz is expected to decrease by more than 5% compared to 2.3 GHz, the frequency is set to 2.5 GHz
If run time with 2.6 GHz is expected to decrease by more than 8.5% compared to 2.3 GHz, the frequency is set to 2.6 GHz
If run time with 2.7 GHz is expected to decrease by more than 12% compared to 2.3 GHz, the frequency is set to 2.7 GHz
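The four thresholds can be applied to the TimeVar column of the energy data as in the following sketch. It is not LoadLeveler's actual implementation: the function name pick_frequency and the two-column input format (frequency in GHz, TimeVar in percent) are assumptions, and so is the rule that the highest frequency whose threshold is met wins, since the text does not say how overlapping thresholds are resolved.

```shell
# Hypothetical helper: read "frequency timevar" pairs from stdin and print
# the chosen frequency. A negative TimeVar means the job is expected to run faster.
pick_frequency() {
  awk '
    BEGIN {
      # Required run-time decrease (%) relative to 2.3 GHz, from the text.
      need["2.4"] = 2.5; need["2.5"] = 5; need["2.6"] = 8.5; need["2.7"] = 12
      best = 2.3   # standard frequency if no threshold is met
    }
    {
      freq = sprintf("%.1f", $1)
      gain = -$2
      # Assumption: the highest frequency meeting its threshold is selected.
      if ((freq in need) && gain > need[freq] && $1 + 0 > best) best = $1 + 0
    }
    END { printf "%.1f\n", best }
  '
}

# TimeVar values taken from the llrqetag example output.
printf '2.70 -14.08\n2.60 -11.27\n2.50 -8.45\n2.40 -4.23\n' | pick_frequency
```

With the example data, the expected decrease at 2.7 GHz (14.08%) exceeds the 12% threshold, so the sketch selects 2.7 GHz.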
We encourage all users of the system to make use of this new feature to accelerate processing of jobs and to run jobs in an energy efficient way.
- Removal of energy tags:
You can remove energy tags by:
llrrmetag -e energy_tag [-u user] [-j job] [-t MM/DD/[YY]YY]
Here the -t MM/DD/[YY]YY option means that the energy tag will be removed if it has not been used since the specified date.
- Circumvent energy tags:
There might be problems when using the energy function and libraries for performance measurements at the same time. In those cases it is necessary to switch the energy function off. You can do that via
#@ energy_policy_tag = NONE
or by setting the environment variable LL_BYPASS_ETAG in the submitting shell before the llsubmit command.
Querying the Status of a Job
The llq command lists all job steps in the queue, one job step per line.
Do not query llq repeatedly with scripts in short time intervals (like watch -n N, with N < 200). Handling too many llq requests prevents LoadLeveler from doing its actual work, i.e. efficient job scheduling.
llq -u userlist filters out only those job steps belonging to the specified users.
llq -j joblist will display only the specified jobs.
The format of a job ID is host.jobid.
The format of a step ID is host.jobid.stepid.
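If a script needs the individual parts of a step ID, they can be split off from the right with shell parameter expansion. This is a minimal sketch; the function name parse_step_id is hypothetical, and splitting from the right is a deliberate choice so that a host name containing dots would still be handled.

```shell
# Hypothetical helper: split host.jobid.stepid into its three parts.
# The body runs in a subshell so the variables do not leak into the caller.
parse_step_id() (
  id=$1
  stepid=${id##*.}     # text after the last dot
  rest=${id%.*}        # everything before the last dot
  jobid=${rest##*.}    # text after the remaining last dot
  host=${rest%.*}      # whatever is left is the host
  echo "$host $jobid $stepid"
)

parse_step_id srv23ib.12345.0
```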
Fields in llq's listing
- Class: Job class.
- Id: The format of a full LoadLeveler step identifier is host.jobid.stepid.
- Owner: User ID that the job will be run under.
- PRI: User priority of the job step
- Running On: If running, the name of the machine the job step is running on. This is blank when the job is not running. For a parallel job step, only the first machine is shown.
- ST: Current state of the job step.
- Idle (I): The job step is waiting to be scheduled.
- NotQueued (NQ): The job step is not being considered for scheduling, but it has been submitted to LoadLeveler and the Job Manager and Scheduler do know about it, e.g:
- Job steps submitted above installation-defined limits on queued or idle jobs. This helps to speed up job scheduling. These jobs will go into state Idle after the number of queued or idle jobs falls below the threshold.
- Job steps whose dependencies cannot yet be determined.
- Job steps that have requested to run in a non-existent reservation.
- No user intervention can move the job step to Idle state
- User Hold (H): The job step is not being considered for scheduling. It can be released from hold using the llhold -r command by the user who submitted the job.
- System Hold (S): The job step is not being considered for scheduling. It can be released only by a LoadLeveler administrator.
- User & System Hold (HS): It must be released from hold by a LoadLeveler administrator and by the user before it can be scheduled.
- Deferred (D): The job step was submitted with a startdate, and that date and time have not yet arrived.
- Running (R): The job step is currently running.
- Pending (P): The scheduler has assigned resources to the job step and is in the process of sending the start request to the resource manager.
- Starting (ST): The resource manager has received the start request from the scheduler and is in the process of dispatching the job step to the nodes where it will run. The next state will normally be Running.
- Completed (C): The job step has completed.
- Canceled (CA): The job step was canceled by a user or an administrator.
- Preempted (E): The job step has been preempted by the suspend method, either by the scheduler or by an administrator using the llpreempt command.
- Preempt Pending (EP): LoadLeveler is in the process of preempting the job step.
- Resume Pending (MP): The job step is being resumed from preemption. The next state will normally be Running
- Submitted: Date and time of job submission
Predefined helpful commands:
- llx: extended display of all jobs
- llu: displays jobs of user
- llg: displays jobs of group
module load lrztools; llx -h; llu -h; llg -h
Why isn't my job running?
- llq -s job_ids
This provides information on why the selected jobs remain in the Hold, NotQueued, Idle or Deferred state.
Is the job step's class available?
Are machines configured to run your job class available?
- Run llclass to see if the class is defined and has available initiators.
- llstatus -l will show Configured Classes and Available Classes.
- llstatus will show if machines are Idle or Busy.
Does the job step have requirements which cannot be met by any available machines?
Are higher priority job steps getting scheduled ahead of your job step or are they reserving resources that your job step cannot backfill?
- llq -l will show q_sysprio which is what is used to order the job steps in the queue.
- The llprio command can be used to adjust a job step's priority relative to that user's other submitted job steps.
Is the job step bound to a reservation that has not yet become active?
If the job status immediately goes to "Hold" check that the files for "output" and "error" really can be written (directory exists and has write permissions etc.). Also check that the initial directory and executable path exist and have the right permissions.
Special Cases
- Running very large jobs ("special class")
- Using llrun to start interactive or batch Jobs
- Multiple serial jobs and commands
- Multiple parallel jobs (IBM MPI)
- Multiple parallel jobs (Intel MPI)