Interactive and Batch Jobs with Loadleveler

Node Allocation Policy

  • Only complete nodes are provided to a given job for dedicated use.
  • Accounting is based on: allocated_nodes * walltime * cores_per_node (see the worked example after this list).
  • Try not to waste cores of a node, i.e. try to use all cores. In some special cases (e.g. large memory requirements) you may not be able to use all cores.
  • TASKS_PER_NODE must be less than or equal to 40 for fat nodes
  • TASKS_PER_NODE must be less than or equal to 16 for thin nodes
  • TASKS_PER_NODE must be less than or equal to 28 for Haswell nodes
    (if SMT/HyperThreading is used, it must be less than or equal to 80, 32 or 56, respectively)
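
A purely illustrative accounting calculation (the numbers are made up): a job that occupies 4 thin nodes (16 cores each) for 2 hours of wall time is charged

    4 nodes * 2 h * 16 cores = 128 core hours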

Job size: total tasks, number of nodes and tasks per node

Typically there are two ways of specifying the job size (a minimal sketch follows the list below):

  • Total tasks and number of nodes
    • This is the more general case
    • Use this if your number of tasks is not evenly divisible by the number of nodes
    • You must compute the number of nodes yourself (e.g., ceil(number of tasks/40) on fat nodes)
    • LoadLeveler decides how to distribute the tasks to the nodes
    • marked red in the following examples
  • Tasks per node and number of nodes
    • if your number of tasks per node is 40 (or a bit less)
    • if you can run with a varying number of nodes, by specifying min and max
    • marked blue in the following examples
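
A minimal sketch of the two specification styles (class, node and task counts are placeholders):

## Style 1: total tasks plus a fixed number of nodes
#@ node = 4
#@ total_tasks = 60

## Style 2: tasks per node plus a node range (min,max)
#@ node = 4,6
#@ tasks_per_node = 16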

Hints:

  • Do not mix both ways!
  • Always specify number of nodes
  • Always specify the number of islands on SuperMUC for jobclass "large"
  • Do not waste cores of a node (remember: 16, 28 or 40 cores per node)!

See details in the section: Keywords for Node and Core allocation



Interactive parallel Jobs

IBM MPI

Parallel executables which are built with IBM's MPI can be run interactively by invoking them with one of the following methods:

  • poe ./executable -procs <nnn> -nodes <NNN>
  • mpiexec -n <nnn> ./executable -nodes <NNN>
  • MP_PROCS=nnn MP_NODES=NNN poe ./executable
  • MP_PROCS=nnn MP_NODES=NNN mpiexec ./executable
  • MP_PROCS=nnn MP_NODES=NNN ./executable

The executable is not executed on the front-end nodes but on the compute nodes. The resource manager of LoadLeveler is used to allocate free nodes. Therefore it is equivalent to submitting a Job Command File to LoadLeveler via llsubmit.

LoadLeveler keywords can also be used to allocate the nodes, by passing a keyword file to poe:

cat > LL_FILE <<EOD
#@ job_type = parallel
#@ node = 2
#@ total_tasks = 78
## other example
##@ tasks_per_node= 39
#@ class = test
#@ queue
EOD

poe ./executable -rmfile LL_FILE

The above will work on the thin login nodes. On the fat login nodes, please specify #@ class = fattest.

Intel MPI

You can invoke executables compiled with Intel MPI via poe. However, the runtime libraries of IBM MPI are then used, and problems may occur, particularly with hybrid (OpenMP-MPI) execution. We recommend writing a LoadLeveler Job Command File and submitting it to the class "test" or "fattest". For convenience you can also use llrun from the lrztools.
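
A minimal sketch of such a job command file for an Intel MPI executable (class, node and task counts, file names and the executable name are placeholders; depending on your environment, additional keywords or module settings for Intel MPI may be required, and the launch line is only a sketch):

#@ job_type = MPICH
#@ class = test
#@ node = 1
#@ total_tasks = 16
#@ output = job.out
#@ error = job.err
#@ wall_clock_limit = 00:30:00
#@ queue
mpiexec -n $LOADL_TOTAL_TASKS ./executable

Note that job_type = MPICH must be used for Intel MPI (see the job_type keyword below), and that LOADL_TOTAL_TASKS is only set for this job type.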

Limited resources for job class "test/fattest"

If you are logged in to sb.supermuc.lrz.de or hw.supermuc.lrz.de, POE jobs take their resources from the nodes of the job class "test". If you are logged in to wm.supermuc.lrz.de, POE jobs take their resources from the nodes of the job class "fattest". Programs compiled for the Sandy Bridge architecture (using AVX) may not be able to run on a fat node, and programs compiled with CORE-AVX2 can only run on the Haswell architecture of Phase 2.

If you get the following message:

ERROR: 0031-165 Job has been cancelled, killed, or schedd or resource is unavailable. 
Not enough resources to start now. Global MAX_TOP_DOGS limit of 1 reached.

then all nodes are busy. You have to wait and try again, or you have to submit your program to the test or general queue as a batch job e.g., using the llsubmit command. You can use the examples given below by just replacing class=general by class=test.

If you use more than one node (i.e., more than 16 cores on a thin island, or more than 40 cores on a fat island), you might need to explicitly set  MP_NODES to a value larger than 1. Alternatively, use  llrun, which in most cases will make the correct decision for node and core allocation.
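
For example (node and task counts are placeholders), 32 tasks spread over two thin nodes could be started interactively with:

MP_NODES=2 MP_PROCS=32 poe ./executable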



Batch-Jobs with LoadLeveler

The login node "supermuc-login" is intended only for editing and compiling your parallel programs. Interactive usage of "poe/mpirun" is not allowed. To run test or production jobs, submit them to the LoadLeveler batch system, which will find and allocate the resources required for your job (i.e., the compute nodes to run your job on).

The most important Loadleveler commands are:

llsubmit   Submit a job script for execution.
llq        Check the status of your job(s).
llhold     Place a hold on a job.
llcancel   Cancel a job.
llclass    Query information about job classes.

The -H flag provides extended help information.

Build your job command file "job.cmd" by using a text editor to create a script file.

# This job command file is called job.cmd
#@ input = job.in
#@ output = job.out
#@ error = job.err
#@ job_type = parallel
#@ class = general
#@ node = 8
#@ total_tasks=128
#@ network.MPI = sn_all,not_shared,us
#@ ... other LoadLeveler keywords (see below)
#@ queue
echo "JOB is run"

To submit the job command file that you created, use the llsubmit command:

llsubmit job.cmd

LoadLeveler responds by issuing a message similar to:

submit: The job "supermuc.22" has been submitted.

Where supermuc is the name of the machine to which the job was submitted and 22 is the job identifier (ID).

To display the status of the job you just submitted, use the llq command. This command returns information about all jobs in the LoadLeveler queue:

llq supermuc.22

To place a temporary hold on a job in a queue, use the llhold command. This command only takes effect if jobs are in the Idle or NotQueued state. Jobs that have been in hold for more than 3 months will be deleted by the system administrators.

llhold supermuc.22

To release the hold, use the llhold command with the -r option:

llhold -r supermuc.22

To cancel a job, use the llcancel command:

llcancel supermuc.22

Job Command File

A Job Command File describes the job to be submitted to the LoadLeveler Job Manager using the llsubmit command. It can contain multiple steps, each designated by the #@queue statement. Lines starting with '#@' are statements that are interpreted by LoadLeveler's parser. The job command file itself can be the script to be executed by each step of the job.

Note that the llsubmit command itself does not accept command line arguments that describe the job; the job is defined entirely by the job command file.

Examples for job command files

Submission of jobs to the Phase 1 system requires logging in to a Phase 1 login node;
Submission of jobs to the Phase 2 system requires logging in to a Phase 2 login node.

We recommend not to use csh/tcsh as the environment for LoadLeveler jobs (whether by indicating the executing shell with #! in the first line or with the #@ shell keyword). We have seen issues with the execution of the login profiles, PATH settings and/or the module environment. If you want to use csh/tcsh, pack your commands and settings into an ordinary csh script and execute this script within the LoadLeveler job.



Job Command File Keywords

Job Name

#@ job_name = myjob

Specifies the name of the job. This keyword must be specified in the first job step. If it is specified in other job steps in the job command file, it is ignored.

The job_name only appears in the long reports of the llq, llstatus, and llsummary commands, and in mail related to the job.

Job Type

#@ job_type=serial | parallel | MPICH

Parallel

  • means a POE job
  • Must be used when running jobs with IBM MPI.

MPICH

  • Similar to a parallel job, but handled differently by LoadLeveler internally
  • Must be used when running jobs with Intel MPI.

Serial (Presently not available at SuperMUC)

  • single task job
  • For single task, multiple threads jobs, please see the OpenMP job example above
  • Many parallel keywords cannot be specified with a serial job_type

Job Class

#@ class = class_name

Valid classes are:

Job classes on SuperMUC Fat Nodes (Phase 1)

Class   | Purpose                                                                     | Remark | Islands (min,max) | Nodes             | Wall clock limit | Memory per node (usable) | Run limit per user
fattest | Test and interactive use                                                    |        | 1                 | 1 - 4 (160 cores) | 2 h              | 256 GB (250 GB)          | 1
fatter  | Long-running jobs, time-extended runs                                       | a)     | 1                 | 1 - 13            | 192 h (8 days)   | 256 GB (250 GB)          | 2
fat     | Medium-sized production runs fitting into the fat node island, 256 GB/node |        | 1                 | 1 - 52            | 48 h             | 256 GB (250 GB)          | 8

Job classes on SuperMUC Thin Nodes (Phase 1)

Class   | Purpose                                                                                         | Remark         | Islands (min,max)      | Nodes      | Wall clock limit | Memory per node (usable) | Run limit per user
test    | Test and interactive use                                                                        |                | 1                      | 1 - 32     | 30 min           | 32 GB (~26 GB)           | 1
general | Medium-sized production runs fitting into a single island                                       | c) d)          | 1 or 1,2               | 33 - 512   | 48 h             | 32 GB (~26 GB)           | 8
large   | Large jobs, spanning more than one island. You must specify the LL keyword #@ island_count=...  | a) b) c) d) e) | min: ≥ 2, max: ≤ 2*min | 513 - 4096 | 48 h             | 32 GB (~26 GB)           | 8

Job classes on SuperMUC Haswell Nodes (Phase 2)

Class   | Purpose                                                                                                | Remark | Islands (min,max) | Nodes    | Wall clock limit | Memory per node (usable) | Run limit per user
test    | Test and interactive use                                                                               |        | 1                 | 1 - 20   | 30 min           | 64 GB (~58 GB)           | 1
micro   | Small jobs, pre- and postprocessing runs (internally restricted to run only on some specific islands) |        | 1                 | 1 - 20   | 48 h             | 64 GB (~58 GB)           | 8
general | Medium-sized production runs fitting into a single island                                              | c) d)  | 1 or 1,2          | 21 - 512 | 48 h             | 64 GB (~58 GB)           | 8
big     | Big-memory jobs (pre- and postprocessing runs, on some Haswell nodes equipped with 256 GB/node)        |        | 1                 | 1 - 8    | 6 h              | 256 GB (~250 GB)         | 1

Special Job Classes

Class      | Purpose                                                                                                     | Remark | Islands (min,max) | Nodes                                                             | Wall clock limit | Memory per node | Run limit per user
special    | Restricted use (on request by particular users)                                                             |        | 1, 6 or 18        | Fat nodes: 1 - 205; Thin nodes: 1 - 9216; Haswell nodes: 1 - 3072 | 48 h             | 32 or 64 GB     | 1
tmp1, tmp2 | For temporary workarounds and for specific priority requirements. Restricted use, not generally available.  |        | tbd               | Fat nodes: 1 - 205; Thin nodes: 1 - 9216; Haswell nodes: 1 - 3072 | 48 h             | 32 or 64 GB     | ---

Remarks:

a) Jobs may be forcibly killed for maintenances
b) For large core counts it is not recommended to use SMT threads/Hyperthreading
c) These job classes require a minimum number of nodes
d) See description of #@ island_count below for more details and important recommendations.

Hint: use "llclass -l" to see all limits and definitions

Limits

#@ wall_clock_limit = hardlimit[,softlimit]

Specifies the hard limit, soft limit, or both for the elapsed time for which a job can run.
Limits are specified in the format hh:mm:ss (hours:minutes:seconds).
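
For example (the values are placeholders), a hard limit of 24 hours with a soft limit five minutes earlier:

#@ wall_clock_limit = 24:00:00,23:55:00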

User Hold

#@ hold = user

Specifies that the job will be put into user hold state when it is submitted. The job can subsequently be released by issuing the command llhold -r <job_id>.

Parallel Jobs and Node Allocation

# @ node_topology = island

This keyword is needed for SuperMUC, and is therefore automatically set by the LRZ submit filter.

#@ island_count = <min[,max]>
#@ island_count =  number

If a job requires more than 512 nodes, this keyword must be specified. Each island on SuperMUC normally contains 512 nodes; however, if nodes have problems, they may be switched off. The scheduler attempts to start the job on min islands, but will start the job on up to max islands if the job does not currently fit on min islands (e.g., because not enough free nodes are available). Since SuperMUC has a pruned (1:4) network topology, there is a trade-off between performance and a better chance of getting the job allocated.

LRZ recommends to use an island count which is slightly higher (+1 or +2) than the minimum requirement to give the LoadLeveler more flexibility for finding free nodes. Example: if you want to run on 2000 nodes, you must use at least 4 islands. In this case specify
#@ island_count = 4,5
instead of just requiring just 4 islands.

For jobs with fewer than 512 nodes you can also use island_count. However, if you need fewer than approx. 400 nodes, please specify a minimum island_count of 1 to avoid fragmentation of the machine by many small jobs.

#@ node = <min[,max]>
#@ node = <number>

The scheduler attempts to get max nodes to run the job step, but will start the job step on min nodes if necessary. If max is not specified, the scheduler uses exactly the specified number of nodes to run the job step.

#@ tasks_per_node = <number>

Used in conjunction with #@ node; each node is assigned <number> tasks.

tasks_per_node must be less than or equal to the number of cores per node (16 for thin nodes, 28 for Haswell nodes, 40 for fat nodes).

#@ total_tasks = <number>

Rather than specifying the number of tasks to start on each node, you can specify the total number of tasks in the job step across all nodes.

Valid combinations of these keywords are given in the next table (keywords marked with X in the same column can be combined):

Keyword           | 1 | 2 | 3 | 4
total_tasks       | X | X |   |
tasks_per_node    |   |   | X | X
node = <min, max> |   | X |   | X
node = <number>   | X |   | X |
task_geometry     |   |   |   |
blocking          |   |   |   |

see also: Task Assignment Considerations

Because of using the Island Topology, the keywords blocking and task_geometry are not supported on SuperMUC!

#@ first_node_tasks = <number>

Specifies a different task count for the first node assigned to a job step. This may be useful for master-slave applications. All remaining nodes will run tasks_per_node tasks. The first_node_tasks keyword can only be specified in conjunction with node and tasks_per_node; it cannot be specified with the total_tasks keyword. When first_node_tasks is used, a maximum node specification is not permitted. For example, node=6 is acceptable but node=6,8 is not. Example:

#@ node = 6
#@ first_node_tasks = 1
#@ tasks_per_node = 16

A total of 6 machines are selected for the job step. The first machine will run only 1 task, task ID 0; the remaining 5 machines will run 16 tasks each. The total number of tasks in this job step is 81.

Network Protocol

#@ network.protocol = type[, usage[, mode[,comm_level]]]

protocol: specifies the communication protocol that is used with the adapter:

  • MPI: Specifies the Message Passing Interface (MPI).
  • LAPI: Specifies the Low-level Application Programming Interface (LAPI).
  • MPI_LAPI: both MPI and LAPI.

type: This field is required and specifies one of the following:

  • sn_all: Specifies that striped communication should be used over all available switch networks.

usage: Specifies whether the adapter can be shared with tasks of other job steps. Possible values are:

  • shared, which is the default,
  • not_shared.

mode: Specifies the communication subsystem mode used by the communication protocol that you specify:

  • US (User Space): always use US!
  • IP (Internet Protocol): do not use it, although it is the default!
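
As in the job command file example above, a typical combination of these fields for an MPI job is:

#@ network.MPI = sn_all,not_shared,us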

Restart

#@ restart = yes | no

Specifies whether LoadLeveler considers a job to be restartable.

If restart=yes (default), and the job is vacated (e.g. in case of system errors) from its executing machine before completing, the central manager requeues the job. It can start running again when a machine on which it can run becomes available.
If restart=no, a vacated job is canceled rather than requeued.

Files and Directories

#@ input = filename
#@ output = filename
#@ error = filename
#@ initialdir = pathname
Specifies the path name of the directory to use as the initial working directory during execution of the job step.

The filename can be an absolute path or relative to the current working directory.
If no keyword is specified, LoadLeveler uses /dev/null.
Do not specify these keywords with empty values; this might lead to errors.
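
A small sketch combining these keywords with the LoadLeveler variables described in the Variables section below (directory and file names are placeholders):

#@ initialdir = $(home)/myproject
#@ input = job.in
#@ output = job.$(jobid).out
#@ error = job.$(jobid).err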

Executable and Arguments

The Job Manager copies the executable from the submitting node to the spool directory

#@ executable = program_name

  • The name of the executable that will be started by LoadLeveler
  • If blank, the job command file itself will be used as the executable
  • A different executable can be specified for each step in a job

#@ arguments = arg1 arg2 ...

  • Specifies the list of arguments passed to the program when the job step runs
  • Different arguments can be specified for each step, even if all steps are using the job command file as the executable

#@ environment = env1; env2; ...

Specifies the environment variables that will be set for the job step by LoadLeveler when the job step starts.

  • DO NOT USE COPY_ALL!
  • $var: Copies var into the job step's environment
  • !var: Omits var from the job step's environment
  • var=value: Specifies that the environment variable var be set to "value" and copied to the job step's environment when the job step is started

LoadLeveler sets the environment before the login shell is executed, so anything set by the shell (such as settings in .profile) will override the variables set by LoadLeveler
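
A short illustration of the three forms (the variable names are arbitrary examples):

#@ environment = $MY_SETTING; OMP_NUM_THREADS=4; !UNWANTED_VAR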

#@ env_copy = all | master

Specifies whether environment variables are copied to all nodes of the job step or only to the master node. By default, the environment is copied to all nodes

#@ shell = name

When LoadLeveler starts the job, the specified shell will be used instead of the default shell for the user from /etc/passwd

Requirements

#@ requirements = <boolean expression>

This keyword can be used to specify conditions that the scheduled resources must fulfill. In the context of SuperMUC, the following boolean expressions are useful:

  • (Island != island01)  # run on thin nodes if in job class special
  • (Island == island01) # run on fat nodes if in job class special

Job Steps and Dependencies

A single job can contain more than one step. Every job contains, by default, at least one step. A step can have dependencies on the exit codes of other steps in the same job. Each "#@ queue" statement marks the end of one step and the beginning of the next. Most keyword values are inherited from the previous step. A group of steps within a job can be co-scheduled: they are treated as one entity that will all be started at the same time.

#@ step_name = step_name

You can name the job step using any combination of letters, numbers, underscores (_) and periods (.). You cannot, however, name it T or F, or use a number in the first position of the step name. Each step name must be unique within the job.

#@ dependency = step_name operator value

value is usually a number that is compared against the return code of step_name.

It can also be one of the LoadLeveler-defined job step return codes listed under Special Return Codes below.

Operators include ==, !=, <=, >=, <, >, &&, ||

A step can have dependencies on more than one step, like

         #@ dependency = (step1 == 0) && (step2 >= 0)

Special Return Codes

  • CC_NOTRUN: The return code set by LoadLeveler for a job step that is not run because its dependency is not met. The value of CC_NOTRUN is 1002.
  • CC_REMOVED: The return code set by LoadLeveler for a job step that is removed from the system (because, for example, llcancel was issued against the job step). The value of CC_REMOVED is 1001.

See also: example job with dependencies
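
In addition, here is a minimal sketch of a two-step job with a dependency (step names, resource counts and the programs are placeholders; the job command file itself serves as the executable of both steps, and $LOADL_STEP_NAME from the Variables section distinguishes them):

#@ job_type = parallel
#@ class = test
#@ node = 1
#@ total_tasks = 16
#@ step_name = step1
#@ queue
#@ step_name = step2
#@ dependency = (step1 == 0)
#@ queue
case $LOADL_STEP_NAME in
  step1) poe ./prepare ;;
  step2) poe ./solve ;;
esac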

Notification

#@ notification = always|error|start|never|complete

Specifies when mail is sent to the address given in the notify_user keyword:

  • always : Notify the user when the job begins, ends, or if it incurs error conditions.
  • error: Notify the user only if the job fails.
  • start: Notify the user only when the job begins.
  • never:  Never notify the user.
  • complete: Notify the user only when the job ends. (Default)

Note: the number of execution nodes reported in the notification mail is in general incorrect. This is a known bug in LoadLeveler that will not be fixed any more.

#@ notify_user = email_address

Specifies the address to which mail is sent based on the notification keyword.
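
For example (the address is a placeholder):

#@ notification = error
#@ notify_user = firstname.lastname@example.org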



Variables

Variables accessible only in the LoadLeveler command file (at submit time)

Several variables are available for use in job command files.

$(domain): The domain of the host from which the job was submitted.
$(home): The home directory for the user on the cluster selected to run the job.
$(user): The user name that will be used to run the job. This might differ from the name of the submitting user.
$(host): The hostname of the machine from which the job was submitted.
$(jobid): The sequential number assigned to this job by the schedd daemon.
$(stepid): The sequential number assigned to this job step when multiple queue statements are used with the job command file.

Some variables are set from other keywords defined in the job command file:

$(executable)
$(class)
$(comment)
$(job_name)
$(step_name)
$(base_executable): Automatically set from the executable keyword; consists of the executable file name without the directory component (basename).

Example: 
#@ output = $(home)/$(job_name)/$(step_name).$(schedd_host).$(jobid).$(stepid).out

Environment Variables (accessible with job)

LoadLeveler sets several environment variables in the application's environment. A complete list is available in Using and Administering, but here are a few examples and explanations: 

LOADLBATCH=yes                               Set when running as a batch job.
LOADL_HOSTFILE=/var/loadl/...                Contains the list of hosts where the job is run.
LOADL_STEP_ID=srv23ib.12345.0                The three-part job identifier.
LOADL_JOB_NAME=mytest                        The name of the job.
LOADL_STEP_NAME=run1                         The step name.
LOADL_STEP_CLASS=general                     The job class.
LOADL_PID=98765                              The process ID of the starter process.
LOADL_STEP_COMMAND=/home/prxxxx/luyyyy/JOB   The name of the executable (or the name of the job command file if the job command file is the executable).
LOADL_STEP_INITDIR=/home/prxxxx/luyyyy/      The initial working directory.
LOADL_STEP_ARGS=input1                       Any arguments passed by the job step.
LOADL_STEP_ERR=err.51458                     The file used for standard error messages (stderr).
LOADL_STEP_OUT=out.51458                     The file used for standard output (stdout).
LOADL_STEP_IN=/dev/null                      The file used for standard input (stdin).
LOADL_TOTAL_TASKS=800                        The total number of tasks of the MPICH job step. This variable is available only when the job_type is set to MPICH.
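
A hedged sketch of how a job script can use some of these variables (purely illustrative):

echo "Running step $LOADL_STEP_NAME of job $LOADL_JOB_NAME in class $LOADL_STEP_CLASS"
echo "Allocated hosts:"
cat $LOADL_HOSTFILE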



Logfiles of Submitted Jobs

For each job which starts running, a logfile is written to $SCRATCH/.lrz_user_logs/<jobstepID> containing the following information:

  • Start date
  • User ID and group ID
  • Budget and quota information
  • Loaded modules at the beginning of the job
  • Job script
  • Environment at the beginning of the job
  • Job status at the beginning of the job, including the node list

Do not delete these files yourself, since they are needed for analysis in case of problems; the files will be automatically deleted by LRZ after a few weeks.



Working with energy aware jobs on SuperMUC

Energy consumption, power requirements, CO2 emission and the associated costs are becoming major issues for high performance computing. Therefore, LRZ procured energy saving functionality for LoadLeveler. Using the energy functions, a job can run faster than the default if the benefit in run time exceeds certain thresholds (given below), typically at the expense of using more energy.

The energy_policy_tag allows LoadLeveler to identify the energy data associated with a job. With those energy data, LoadLeveler can decide which frequency should be used to run the job with minimal performance degradation. The energy data include:

  • Power consumption and the elapsed time when run with default frequency
  • The estimated power and energy
  • The elapsed time at other frequencies
  • The percentage of performance degradation (wrt. runtime)

When the energy policy tag is set in the job command file, the energy data are generated and stored in the database the first time the job runs. If the job is submitted again with the same energy policy tag, the same policy is used. When submitting a job with a new energy tag for the first time, take care to keep the tag name unique among the tags you have previously generated.

To use energy keywords in the job command file follow these steps:

  • Generation of energy tags:

    Provide a unique identifier for the energy_policy_tag when submitting a job for the first time. For example:
    #@ energy_policy_tag = my_energy_tag 
    #@ minimize_time_to_solution = yes
    

    The identifier may contain lower and upper case Latin letters, Arabic digits, and the underscore character; no other characters are allowed. In particular, the characters ä, ö, ü, @ and . must not be used.
    LoadLeveler generates the energy data associated with this energy tag for the job when the job runs for the first time. After the job has ended, the energy data using the energy tag can be queried the following way:

    llrqetag -e my_energy_tag [-u user] [-j jobid]

    or just

    llrqetag

  • Example:
    Generated by: srv03-ib.219213.0
    Last used Time: Thu Jun 27 11:45:40 2013
    User: luxyz123
    Nominal Frequency: 2.70 GHZ
    Default Frequency: 2.30 GHZ
    Node's Energy Use: 0.001168 KWH
    Execution Time: 71 Seconds
    Frequency(GHZ) EstEngCons(KWH) EngVar(%) EstTime(Sec) TimeVar(%) Power(W)                        
              2.70        0.001380     18.10           61     -14.08    81.43
              2.60        0.001317     12.75           63     -11.27    75.27
              2.50        0.001266      8.39           65      -8.45    70.14
              2.40        0.001241      6.26           68      -4.23    65.72
              2.30        0.001168      0.00           71       0.00    59.24
              2.20        0.001183      1.26           73       2.82    58.34
              2.10        0.001159     -0.82           77       8.45    54.17
         ....
              1.30        0.001330     13.88          124      74.65    38.63
              1.20        0.001455     24.53          134      88.73    39.09
    EstEngCons: Estimated Energy Consumption (Energy to solution) 
    EngVar:     EnergyVariation(Freq) = (EstEngCons(Freq)-EstEngCons(2.3Ghz))/EstEngCons(2.3Ghz) 
                negative values: energy saving; positive values: needs more energy to solution 
    EstTime:    Estimated time for job 
    TimeVar:    TimeVariation, negative: faster, positive: slower 
    Power:      Power requirement of one socket
    

    Here you can see in the first line that running the application 14.08% faster requires 18.1% more energy.

    Generate the energy tag only for production runs which are running at least 10 minutes. Short running jobs or jobs which do not reflect the production profile will not provide sensible and reliable results.

  • Using  the energy tags to run at higher frequency than default:

    Not all jobs will run faster if the CPU clock speed is increased. This depends on the characteristics of the application: a CPU-bound job will run faster, whereas a memory-bound job might not benefit from a higher CPU frequency. Based on the data associated with the energy tag, LoadLeveler decides which frequency is best for running the job, trading off energy consumption against runtime.

    Two things are required if you want to use the possibility of running at a higher frequency:

    1. You need to generate an energy tag in a first characteristic job
    2. You need to use this tag and policy in the follow-on jobs  by using the same keywords again:

      #@ energy_policy_tag = my_energy_tag 
      #@ minimize_time_to_solution = yes
      

      to request a frequency increase from LoadLeveler. Without this item, jobs will run at the standard frequency of 2.3 GHz.

    If both the above are fulfilled, LoadLeveler will set the CPU frequency for all nodes of the job according to following criteria, applied in order of appearance:

    If run time with 2.4 GHz is expected to decrease by more than 2.5% compared to 2.3 GHz, the frequency is set to 2.4 GHz
    If run time with 2.5 GHz is expected to decrease by more than 5% compared to 2.3 GHz, the frequency is set to 2.5 GHz
    If run time with 2.6 GHz is expected to decrease by more than 8.5% compared to 2.3 GHz, the frequency is set to 2.6 GHz
    If run time with 2.7 GHz is expected to decrease by more than 12% compared to 2.3 GHz, the frequency is set to 2.7 GHz

    We encourage all users of the system to make use of this new feature to accelerate processing of jobs and to run jobs in an energy efficient way.
  • Removal of energy tags:

    You can remove energy tags by:

    llrrmetag -e energy_tag [-u user] [-j job] [-t MM/DD/[YY]YY]

    Here the -t MM/DD/[YY]YY option means: The energy tag will be removed if it has not been used since the date specified.
     
  • Circumvent energy tags:

    There might be problems when using the energy function and libraries for performance measurements at the same time. In those cases it is necessary to switch the energy function off. You can do that via

    #@ energy_policy_tag = NONE

    or by setting the environment variable LL_BYPASS_ETAG in the submitting shell before issuing the llsubmit command.



Querying the Status of a Job

The llq command lists all job steps in the queue, one job step per line.

Do not query llq repeatedly from scripts at short time intervals (like watch -n N, with N < 200). Handling too many llq requests prevents LoadLeveler from doing its actual work, i.e. efficient job scheduling.

llq -u userlist filters out only those job steps belonging to the specified users.

llq -j joblist will display only the specified jobs.

The format of a job ID is host.jobid.
The format of a step ID is host.jobid.stepid.

Fields in llq's listing

  • Class: Job class.
  • Id: The format of a full LoadLeveler step identifier is host.jobid.stepid.
  • Owner: User ID that the job will be run under.
  • PRI: User priority of the job step
  • Running On: If running, the name of the machine the job step is running on. This is blank when the job is not running. For a parallel job step, only the first machine is shown.
  • ST: Current state of the job step.
  • Idle (I): The job step is waiting to be scheduled.
  • NotQueued (NQ): The job step is not being considered for scheduling, but it has been submitted to LoadLeveler and the Job Manager and Scheduler do know about it, e.g:
    • Job steps submitted above installation-defined limits on queued or idle jobs. This helps to speed up job scheduling. These jobs will go into state Idle once the number of queued or idle jobs falls below the threshold.
    • Job steps whose dependencies cannot yet be determined.
    • Job steps that have requested to run in a non-existent reservation.
    • No user intervention can move the job step to Idle state
  • User Hold (H): The job step is not being considered for scheduling. It can be released from hold using the llhold -r command by the user who submitted the job.
  • System Hold (S): The job step is not being considered for scheduling. It can be released only by a LoadLeveler administrator.
  • User & System Hold (HS): It must be released from hold by a LoadLeveler administrator and by the user before it can be scheduled.
  • Deferred (D): The job step was submitted with a startdate, and that date and time have not yet arrived.
  • Running (R): The job step is currently running.
  • Pending (P): The scheduler has assigned resources to the job step and is in the process of sending the start request to the resource manager.
  • Starting (ST): The resource manager has received the start request from the scheduler and is in the process of dispatching the job step to the nodes where it will run. The next state will normally be Running.
  • Completed (C): The job step has completed.
  • Canceled (CA): The job step was canceled by a user or an administrator.
  • Preempted (E): The job step has been preempted by the suspend method, either by the scheduler or by an administrator using the llpreempt command.
  • Preempt Pending (EP): LoadLeveler is in the process of preempting the job step.
  • Resume Pending (MP): The job step is being resumed from preemption. The next state will normally be Running
  • Submitted: Date and time of job submission

Predefined helpful commands:



Why isn't my job running?

First, run

  •  llq -s job_ids

it will provide information on why the selected jobs remain in the Hold, NotQueued, Idle or Deferred state.

Is the job step's class available? Are machines that are configured to run your job class available?

  • Run llclass to see if the class is defined and has available initiators.
  • llstatus -l will show Configured Classes and Available Classes.
  • llstatus will show if machines are Idle or Busy.

Does the job step have requirements which cannot be met by any available machines?

Are higher priority job steps getting scheduled ahead of your job step or are they reserving resources that your job step cannot backfill?

  • llq -l will show q_sysprio, which is used to order the job steps in the queue.
  • The llprio command can be used to adjust a job step's priority relative to that user's other submitted job steps.

Is the job step bound to a reservation that has not yet become active?

If the job status immediately goes to "Hold", check that the files for "output" and "error" can really be written (the directory exists and has write permissions, etc.). Also check that the initial directory and the executable path exist and have the right permissions.



Special Cases



Further Information