<html>
<head>
<title>SLURM Quick Start Guide</title>
</head>
<body>
<h1>SLURM Quick Start Guide</h1>
<h2>Overview</h2>
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters large and small.
SLURM requires no kernel modifications for its operation and is
relatively self-contained.
As a cluster resource manager, SLURM has three key functions. First,
it allocates exclusive and/or non-exclusive access to resources
(compute nodes) to users for
some duration of time so they can perform work. Second, it provides
a framework for starting, executing, and monitoring work (normally a
parallel job) on the set of allocated nodes. Finally, it arbitrates
conflicting requests for resources by managing a queue of pending work.
<h2>Architecture</h2>
As depicted in Figure 1, SLURM consists of a <b>slurmd</b> daemon
running on each compute node, a central <b>slurmctld</b> daemon running on
a management node (with optional fail-over twin), and five command line
utilities: <b>srun</b>, <b>scancel</b>, <b>sinfo</b>, <b>squeue</b>, and
<b>scontrol</b>, which can run anywhere in the cluster.
<p align=center>
<img src="arch.png">
<p align=center>
Figure 1: SLURM components
</p>
<p>
The entities managed by these SLURM daemons are shown in Figure 2
and include
<b>nodes</b>, the compute resource in SLURM,
<b>partitions</b>, which group nodes into logical disjoint sets,
<b>jobs</b>, or allocations of resources assigned to a user for a
specified amount of time, and
<b>job steps</b>, which are sets of (possibly parallel) tasks within a job.
Priority-ordered jobs are allocated nodes within a partition until the
resources (nodes) within that partition are exhausted.
Once a job is assigned a set of nodes, the user is able to initiate
parallel work in the form of job steps in any configuration within the
allocation. For instance, a single job step may be started which utilizes
all nodes allocated to the job, or several job steps may independently
use a portion of the allocation.
<p align=center>
<img src="entities.png">
<p align=center>
Figure 2: SLURM entities
</p>
<h2>Commands</h2>
Man pages exist for all SLURM daemons, commands, and API functions.
The command option "--help" also provides a brief summary of options.
Note that the command options are all case insensitive.
<p>
<b>srun</b> is used to submit a job for execution, allocate resources,
attach to an existing allocation, or initiate job steps.
Jobs can be submitted for immediate or later execution (e.g. batch).
srun has a wide variety of options to specify resource requirements,
including: minimum and maximum node count, processor count, specific
nodes to use or not use, and specific node characteristics (so much
memory, disk space, certain required features, etc.).
Besides securing a resource allocation, srun is used to initiate
job steps.
These job steps can execute sequentially or in parallel on independent
or shared nodes within the job's node allocation.
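<p>
For example (an illustrative sketch only; the node and task counts are
arbitrary, and the minnodes-maxnodes form of the <i>-N</i> option is
assumed), a range of acceptable node counts can be given along with a
task count:
<pre>
adev0: srun -N2-4 -n8 -l /bin/hostname
</pre>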
<p>
<b>scancel</b> is used to cancel a pending or running job or job step.
It can also be used to send an arbitrary signal to all processes
associated with a running job or job step.
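<p>
As a brief sketch (the job id and signal are arbitrary, assuming
scancel's <i>--signal</i> option), a signal can be delivered to all
processes of a running job:
<pre>
adev0: scancel --signal=USR1 473
</pre>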
<p>
<b>scontrol</b> is the administrative tool used to view and/or modify
SLURM state.
Note that many scontrol commands can only be executed as user root.
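<p>
For instance (output not shown), partition limits and membership can be
displayed with:
<pre>
adev0: scontrol show partition
</pre>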
<p>
<b>sinfo</b> reports the state of partitions and nodes managed by SLURM.
It has a wide variety of filtering, sorting, and formatting options.
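<p>
As an illustrative sketch (the partition name is taken from the sample
output later in this document), a single partition can be selected with
the <i>--partition</i> option:
<pre>
adev0: sinfo --partition=debug
</pre>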
<p>
<b>squeue</b> reports the state of jobs or job steps.
It has a wide variety of filtering, sorting, and formatting options.
By default, it reports the running jobs in priority order and then the
pending jobs in priority order.
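<p>
For instance (a sketch; the user name is taken from the examples below),
the jobs of a single user can be listed with the <i>--user</i> option:
<pre>
adev0: squeue --user=jette
</pre>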
<h2>Daemons</h2>
<b>slurmctld</b> is sometimes called the <i>controller</i> daemon.
It orchestrates SLURM activities, including queuing of jobs,
monitoring node state, and allocating resources (nodes) to jobs.
There is an optional backup controller that automatically assumes
control in the event the primary controller fails.
The primary controller resumes control whenever
it is restored to service. The controller saves its state to disk
whenever there is a change. This state can be recovered by the controller
at startup time. <b>slurmctld</b> would typically execute as a
special, non-root user (see the <i>SlurmUser</i> configuration parameter).
State changes are saved so that jobs and other state can be
preserved when slurmctld moves or is restarted.
<p>
The <b>slurmd</b> daemon executes on every compute node.
It resembles a remote shell daemon, exporting control of the node to SLURM.
Since slurmd initiates and manages user jobs, it must execute as
the user root.
<p>
slurmctld and/or slurmd should be initiated at node startup time
per the SLURM configuration.
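<p>
As a hypothetical sketch (in practice a site-specific init script would
start the daemons at boot; the host names follow the examples and
configuration output shown later), the daemons could be started manually
as follows:
<pre>
adevi: slurmctld     # on the management node
adev9: slurmd        # on each compute node
</pre>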
<h2>Examples</h2>
Execute <i>/bin/hostname</i> on four nodes (<i>-N4</i>).
Include task numbers on the output (<i>-l</i>).
The default partition will be used.
One task per node will be used by default.
<pre>
adev0: srun -N4 -l /bin/hostname
0: adev9
1: adev10
2: adev11
3: adev12
</pre>
<p>
Execute <i>/bin/hostname</i> in four tasks (<i>-n4</i>).
Include task numbers on the output (<i>-l</i>).
The default partition will be used.
One processor per task will be used by default
(note that we don't specify a node count).
<pre>
adev0: srun -n4 -l /bin/hostname
0: adev9
1: adev9
2: adev10
3: adev10
</pre>
<p>
Submit the script <i>my.script</i> for later execution (<i>-b</i>).
Explicitly use the nodes adev9 and adev10 (<i>-w "adev[9-10]"</i>,
note the use of a node range expression).
One processor per task will be used by default.
The output will appear in the file <i>my.stdout</i> (<i>-o my.stdout</i>).
By default, one task will be initiated per processor on the nodes.
Note that <i>my.script</i> contains the command <i>/bin/hostname</i>,
which is executed on the first node in the allocation (where the script
runs), plus two job steps
initiated using the <i>srun</i> command and executed sequentially.
<pre>
adev0: cat my.script
#!/bin/sh
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd
adev0: srun -w "adev[9-10]" -o my.stdout -b my.script
srun: jobid 469 submitted
adev0: cat my.stdout
adev9
0: adev9
1: adev9
2: adev10
3: adev10
0: /home/jette
1: /home/jette
2: /home/jette
3: /home/jette
</pre>
<p>
Submit a job, get its status and cancel it.
<pre>
adev0: srun -b my.sleeper
srun: jobid 473 submitted
adev0: squeue
JobId Partition Name User St TimeLim Prio Nodes
473 batch my.sleep jette R UNLIMIT 0.99 adev9
adev0: scancel 473
adev0: squeue
JobId Partition Name User St TimeLim Prio Nodes
</pre>
<p>
Report the state of the nodes and partitions managed by SLURM.
<pre>
adev0: sinfo
PARTITION NODES STATE CPUS MEMORY TMP_DISK NODES
--------------------------------------------------------------------------------
debug 8 IDLE 2 3448 82306 adev[0-7]
batch 1 DOWN 2 3448 82306 adev8
7 IDLE 2 3448-3458 82306 adev[9-15]
</pre>
<h2>SLURM Administration</h2>
The remainder of this document provides basic SLURM administration
information. Individuals only interested in making use of SLURM need not
read further.
<h3>Infrastructure</h3>
All communications between SLURM components are authenticated.
The authentication infrastructure used is specified in the SLURM
configuration file and options include:
<a href="http://www.theether.org/authd/">authd</a>, munged and none.
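<p>
A minimal sketch of the corresponding configuration file entry (exactly
one of these lines would be used, matching the authentication plugin
installed at the site):
<pre>
AuthType=auth/authd
AuthType=auth/munge
AuthType=auth/none
</pre>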
<p>
SLURM uses the syslog function to record events. It uses a
range of importance levels for these messages. Be certain
that your system's syslog functionality is operational.
<p>
There is no necessity for synchronized clocks on the nodes;
events occur in real time based upon message traffic.
However, synchronized clocks will permit easier analysis of
SLURM logs from multiple nodes.
<h3>Configuration</h3>
The SLURM configuration file includes a wide variety of parameters.
A full description of the parameters is included in the <i>slurm.conf</i>
man page.
Rather than duplicate that information, a sample configuration file
is shown below.
Any text following a "#" is considered a comment.
The keywords in the file are not case sensitive,
although the argument typically is (e.g. "SlurmUser=slurm"
might be specified as "slurmuser=slurm").
The control machine, like all other machine specifications, can
include both the host name and the name used for communications.
In this case, the host's name is "mcri" and the name "emcri" is
used for communications. The "e" prefix identifies this as an
ethernet address at this site.
Port numbers to be used for communications are specified as
well as various timer values.
<p>
A description of the nodes and their grouping into non-overlapping
partitions is required.
Partition and node specifications use node range expressions to
identify nodes in a concise fashion.
This configuration file defines a 1154 node cluster for SLURM, but
might be used for a much larger cluster by just changing a
few node range expressions.
<pre>
#
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=mcri ControlAddr=emcri
#
AuthType=auth/authd
Epilog=/usr/local/slurm/etc/epilog
HeartbeatInterval=30
PluginDir=/usr/local/slurm/lib/slurm
Prolog=/usr/local/slurm/etc/prolog
SlurmUser=slurm
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
#
# Node Configurations
#
NodeName=DEFAULT Procs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=mcr[0-1151] NodeAddr=emcr[0-1151]
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-1151]
</pre>
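<p>
For instance (a hypothetical sketch; the node count is arbitrary), the
same file could describe a cluster twice as large simply by widening the
node range expressions:
<pre>
NodeName=mcr[0-2303] NodeAddr=emcr[0-2303]
PartitionName=pbatch Nodes=mcr[192-2303]
</pre>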
<h3>Administration Examples</h3>
scontrol can be used to print all system information and
modify most of it. Only a few examples are shown below.
Please see the scontrol man page for full details.
The commands and options are all case insensitive.
<p>
Print detailed state of all jobs in the system.
<pre>
adev0: scontrol
scontrol: show job
Priority=4294901286 Partition=batch BatchFlag=0
AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
StartTime=03/19-12:53:41 EndTime=03/19-12:53:59
NodeList=adev8 NodeListIndecies=-1
ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
ReqNodeList=(null) ReqNodeListIndecies=-1
Priority=4294901285 Partition=batch BatchFlag=0
AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
StartTime=03/19-12:54:01 EndTime=NONE
NodeList=adev8 NodeListIndecies=8,8,-1
ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
ReqNodeList=(null) ReqNodeListIndecies=-1
</pre>
<p>
Print the detailed state of job 477 and change its priority to zero.
A priority of zero prevents a job from being initiated (it is held
in <i>pending</i> state).
<pre>
adev0: scontrol
scontrol: show job 477
Priority=4294901286 Partition=batch BatchFlag=0
<i>more data removed....</i>
scontrol: update JobId=477 Priority=0
</pre>
<p>
Print the state of node <i>adev13</i> and drain it.
To drain a node specify a new state of "DRAIN", "DRAINED", or "DRAINING".
SLURM will automatically set it to the appropriate value of either "DRAINING"
or "DRAINED" depending on whether the node is allocated or not.
Return it to service later.
<pre>
adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=ALLOCATED CPUs=2 RealMemory=3448 TmpDisk=32000
Weight=16 Partition=debug Features=(null)
scontrol: update NodeName=adev13 State=DRAINING
scontrol: show node adev13
NodeName=adev13 State=DRAINING CPUs=2 RealMemory=3448 TmpDisk=32000
Weight=16 Partition=debug Features=(null)
scontrol: quit
<i>Later</i>
adev0: scontrol update NodeName=adev13 State=IDLE
</pre>
<p>
Reconfigure all slurm daemons on all nodes.
This should be done after changing the SLURM configuration file.
<pre>
adev0: scontrol reconfig
</pre>
<p>
Print the current slurm configuration.
<pre>
adev0: scontrol show config
Configuration data as of 03/19-13:04:12
AuthType = auth/munge
BackupAddr = eadevj
BackupController = adevj
ControlAddr = eadevi
ControlMachine = adevi
Epilog = (null)
FastSchedule = 0
FirstJobId = 0
NodeHashBase = 10
HeartbeatInterval = 60
InactiveLimit = 0
JobCredPrivateKey = /etc/slurm/slurm.key
JobCredPublicKey = /etc/slurm/slurm.cert
KillWait = 30
PluginDir = /usr/lib/slurm
Prioritize = (null)
Prolog = (null)
ReturnToService = 1
SlurmUser = slurm(97)
SlurmctldDebug = 4
SlurmctldLogFile = /tmp/slurmctld.log
SlurmctldPidFile = (null)
SlurmctldPort = 0
SlurmctldTimeout = 300
SlurmdDebug = 65534
SlurmdLogFile = /tmp/slurmd.log
SlurmdPidFile = (null)
SlurmdPort = 0
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300
SLURM_CONFIG_FILE = /etc/slurm/slurm.conf
StateSaveLocation = /usr/local/tmp/slurm/adev
TmpFS = /tmp
</pre>
<p>
Shutdown all SLURM daemons on all nodes.
<pre>
adev0: scontrol shutdown
</pre>
<hr>
URL = http://www-lc.llnl.gov/dctg-lc/slurm/quick.start.guide.html
</body>
</html>