Commit 6b97dc6a
authored 17 years ago by Moe Jette

add gang scheduling web page from Chris Holmes with some editing by Moe Jette

parent ce7655da

Showing 1 changed file: doc/html/gang_scheduling.shtml (+462, −11)
<!--#include virtual="header.txt"-->

<H1>Gang Scheduling</H1>

<P>
SLURM version 1.2 and earlier supported only the dedication of resources
to jobs.
Beginning in SLURM version 1.3, gang scheduling is supported.
With gang scheduling, two or more jobs are allocated to the same resources
and are alternately suspended so that each job in turn has full access to
the shared resources for a period of time.
</P>
<P>
A resource manager that supports timeslicing can improve its responsiveness
and utilization by allowing more jobs to begin running sooner. Shorter-running
jobs no longer have to wait in a queue behind longer-running jobs. Instead they
can be run "in parallel" with the longer-running jobs, which will allow them
to finish quicker. Throughput is also improved because overcommitting the
resources provides opportunities for "local backfilling" to occur (see example
below).
</P>
<P>
In SLURM 1.3.0, the <I>sched/gang</I> plugin provides timeslicing. When enabled,
it monitors each of the partitions in SLURM. If a new job has been allocated to
resources in a partition that have already been allocated to an existing job,
then the plugin will suspend the new job until the configured
<I>SchedulerTimeSlice</I> interval has elapsed. Then it will suspend the
running job and let the new job make use of the resources for a
<I>SchedulerTimeSlice</I> interval. This will continue until one of the
jobs terminates.
</P>
<H2>Configuration</H2>
<P>
There are several important configuration parameters relating to
gang scheduling (a combined <I>slurm.conf</I> excerpt follows this list):
</P>
<UL>
<LI>
<B>SelectType</B>: The SLURM <I>sched/gang</I> plugin supports nodes
allocated by the <I>select/linear</I> plugin and socket/core/CPU resources
allocated by the <I>select/cons_res</I> plugin.
</LI>
<LI>
<B>SelectTypeParameter</B>: Since resources will be overcommitted with
multiple jobs, the resource selection plugin should be configured to track the
amount of memory used by each job to ensure that memory page swapping does
not occur. When <I>select/linear</I> is chosen, we recommend setting
<I>SelectTypeParameter=CR_Memory</I>. When <I>select/cons_res</I> is
chosen, we recommend including Memory as a resource (ex.
<I>SelectTypeParameter=CR_Core_Memory</I>).
</LI>
<LI>
<B>DefMemPerTask</B>: Since job requests may not explicitly specify
a memory requirement, we also recommend configuring <I>DefMemPerTask</I>
(default memory per task). It may also be desirable to configure
<I>MaxMemPerTask</I> (maximum memory per task) in <I>slurm.conf</I>.
</LI>
<LI>
<B>JobAcctGatherType and JobAcctGatherFrequency</B>:
If you wish to enforce memory limits, accounting must be enabled
using the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I>
parameters. If accounting is enabled and a job exceeds its configured
memory limits, it will be canceled in order to prevent it from
adversely affecting other jobs sharing the same resources.
</LI>
<LI>
<B>SchedulerType</B>: Configure the <I>sched/gang</I> plugin by setting
<I>SchedulerType=sched/gang</I> in <I>slurm.conf</I>.
</LI>
<LI>
<B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds.
To change this duration, set <I>SchedulerTimeSlice</I> to the desired interval
(in seconds) in <I>slurm.conf</I>. For example, to set the timeslice interval
to one minute, set <I>SchedulerTimeSlice=60</I>. Short values can increase
the overhead of gang scheduling.
</LI>
<LI>
<B>Shared</B>: Configure the <I>Shared</I> setting to
<I>FORCE</I> for all partitions in which timeslicing is to take place.
The <I>FORCE</I> option now supports an additional parameter that controls
how many jobs can share a resource (FORCE[:max_share]). By default the
max_share value is 4. To allow up to 6 jobs from this partition to be
allocated to a common resource, set <I>Shared=FORCE:6</I>.
</LI>
</UL>
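<P>
Putting these parameters together, a gang-scheduling configuration in
<I>slurm.conf</I> might contain an excerpt like the following. This is only a
sketch: the node and partition names match the examples below, and the memory
and timeslice values are illustrative rather than recommended.
</P>
<PRE>
# slurm.conf excerpt (illustrative values)
SchedulerType=sched/gang
SchedulerTimeSlice=30
SelectType=select/linear
SelectTypeParameters=CR_Memory
DefMemPerTask=512
PartitionName=active Nodes=n[12-16] Default=YES Shared=FORCE:4 State=UP
</PRE>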
<P>
In order to enable gang scheduling after making the configuration changes
described above, restart SLURM if it is already running. Any change to the
plugin settings in SLURM requires a full restart of the daemons. If you
just change the partition <I>Shared</I> setting, this can be updated with
<I>scontrol reconfig</I>.
</P>
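<P>
For example (assuming the init script distributed with SLURM is installed as
<I>/etc/init.d/slurm</I>; adjust the path for your site):
</P>
<PRE>
# After changing plugin settings (e.g. SchedulerType or SelectType),
# restart the daemons on the controller and on every compute node:
/etc/init.d/slurm restart

# After changing only a partition's Shared setting:
scontrol reconfig
</PRE>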
<P>
For an advanced topic discussion on the potential use of swap space,
see "Making use of swap space" in the "Future Work" section below.
</P>
<H2>Timeslicer Design and Operation</H2>
<P>
When enabled, the <I>sched/gang</I> plugin keeps track of the resources
allocated to all jobs. For each partition an "active bitmap" is maintained that
tracks all concurrently running jobs in the SLURM cluster. Each time a new
job is allocated to resources in a partition, the <I>sched/gang</I> plugin
compares these newly allocated resources with the resources already maintained
in the "active bitmap". If these two sets of resources are disjoint then the new
job is added to the "active bitmap". If these two sets of resources overlap then
the new job is suspended. All jobs are tracked in a per-partition job queue
within the <I>sched/gang</I> plugin.
</P>
<P>
A separate <I>timeslicer thread</I> is spawned by the <I>sched/gang</I> plugin
on startup. This thread sleeps for the configured <I>SchedulerTimeSlice</I>
interval. When it wakes up, it checks each partition for suspended jobs. If
suspended jobs are found then the <I>timeslicer thread</I> moves all running
jobs to the end of the job queue. It then reconstructs the "active bitmap" for
this partition beginning with the suspended job that has waited the longest to
run (this will be the first suspended job in the run queue). Each following job
is then compared with the new "active bitmap", and if the job can be run
concurrently with the other "active" jobs then the job is added. Once this is
complete then the <I>timeslicer thread</I> suspends any currently running jobs
that are no longer part of the "active bitmap", and resumes jobs that are new to
the "active bitmap".
</P>
<P>
This <I>timeslicer thread</I> algorithm for rotating jobs is designed to prevent
jobs from starving (remaining in the suspended state indefinitely) and to be as
fair as possible in the distribution of runtime while still keeping all of the
resources as busy as possible.
</P>
<P>
The <I>sched/gang</I> plugin suspends jobs via the same internal functions that
support <I>scontrol suspend</I> and <I>scontrol resume</I>. A good way to
observe the operation of the timeslicer is by running <I>watch squeue</I> in a
terminal window.
</P>
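<P>
For example (the refresh interval here is only a suggestion):
</P>
<PRE>
[user@n16 load]$ <B>watch -n 10 squeue</B>
</PRE>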
<H2>A Simple Example</H2>
<P>
The following example is configured with <I>select/linear</I>,
<I>sched/gang</I>, and <I>Shared=FORCE</I>. This example takes place on a small
cluster of 5 nodes:
</P>
<PRE>
[user@n16 load]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[12-16]
</PRE>
<P>
Here are the Scheduler settings (the last two settings are the relevant ones):
</P>
<PRE>
[user@n16 load]$ <B>scontrol show config | grep Sched</B>
FastSchedule = 1
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30
SchedulerType = sched/gang
[user@n16 load]$
</PRE>
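<P>
The <I>myload</I> script used in these examples is not included on this page.
A minimal, hypothetical sketch of such a script, which spins the allocated
CPUs for a requested number of seconds under <I>srun</I>, might look like:
</P>
<PRE>
#!/bin/sh
# myload &lt;seconds&gt; -- hypothetical load generator (not part of SLURM)
SECS=$1
srun bash -c "end=\$((SECONDS + $SECS)); while [ \$SECONDS -lt \$end ]; do :; done"
</PRE>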
<P>
The <I>myload</I> script launches a simple load-generating app that runs
for the given number of seconds. Submit <I>myload</I> to run on all nodes:
</P>
<PRE>
[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
sbatch: Submitted batch job 3
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
3 active myload user 0:05 5 n[12-16]
</PRE>
<P>
Submit it again and watch the <I>sched/gang</I> plugin suspend it:
</P>
<PRE>
[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
sbatch: Submitted batch job 4
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
3 active myload user R 0:13 5 n[12-16]
4 active myload user S 0:00 5 n[12-16]
</PRE>
<P>
After 30 seconds the <I>sched/gang</I> plugin swaps jobs, and now job 4 is the
active one:
</P>
<PRE>
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
4 active myload user R 0:08 5 n[12-16]
3 active myload user S 0:41 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
4 active myload user R 0:21 5 n[12-16]
3 active myload user S 0:41 5 n[12-16]
</PRE>
<P> After another 30 seconds the <I>sched/gang</I> plugin sets job 3 running again:
</P>
<PRE>
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
3 active myload user R 0:50 5 n[12-16]
4 active myload user S 0:30 5 n[12-16]
</PRE>
<P>
<B>A possible side effect of timeslicing</B>: Note that jobs that are
immediately suspended may cause their srun commands to produce the following
output:
</P>
<PRE>
[user@n16 load]$ <B>cat slurm-4.out</B>
srun: Job step creation temporarily disabled, retrying
srun: Job step creation still disabled, retrying
srun: Job step creation still disabled, retrying
srun: Job step creation still disabled, retrying
srun: Job step created
</PRE>
<P>
This occurs because <I>srun</I> is attempting to launch a jobstep in an
allocation that has been suspended. The <I>srun</I> process will continue in a
retry loop to launch the jobstep until the allocation has been resumed and the
jobstep can be launched.
</P>
<P>
When the <I>sched/gang</I> plugin is enabled, this type of output in the user
jobs should be considered benign.
</P>
<H2>More examples</H2>
<P>
The following example shows how the timeslicer algorithm keeps the resources
busy. Job 10 runs continually, while jobs 9 and 11 are timesliced:
</P>
<PRE>
[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
sbatch: Submitted batch job 9
[user@n16 load]$ <B>sbatch -N2 ./myload 300</B>
sbatch: Submitted batch job 10
[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
sbatch: Submitted batch job 11
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
9 active myload user R 0:11 3 n[12-14]
10 active myload user R 0:08 2 n[15-16]
11 active myload user S 0:00 3 n[12-14]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
10 active myload user R 0:50 2 n[15-16]
11 active myload user R 0:12 3 n[12-14]
9 active myload user S 0:41 3 n[12-14]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
10 active myload user R 1:04 2 n[15-16]
11 active myload user R 0:26 3 n[12-14]
9 active myload user S 0:41 3 n[12-14]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
9 active myload user R 0:46 3 n[12-14]
10 active myload user R 1:13 2 n[15-16]
11 active myload user S 0:30 3 n[12-14]
[user@n16 load]$
</PRE>
<P>
The next example displays "local backfilling":
</P>
<PRE>
[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
sbatch: Submitted batch job 12
[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
sbatch: Submitted batch job 13
[user@n16 load]$ <B>sbatch -N2 ./myload 300</B>
sbatch: Submitted batch job 14
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
12 active myload user R 0:14 3 n[12-14]
14 active myload user R 0:06 2 n[15-16]
13 active myload user S 0:00 5 n[12-16]
[user@n16 load]$
</PRE>
<P>
Without timeslicing and without the backfill scheduler enabled, job 14 has to
wait for job 13 to finish.
</P><P>
This is called "local" backfilling because the backfilling only occurs with jobs
close enough in the queue to get allocated by the scheduler as part of
oversubscribing the resources. Recall that the number of jobs that can
overcommit a resource is controlled by the <I>Shared=FORCE:max_share</I> value,
so this value effectively controls the scope of "local backfilling".
</P><P>
Normal backfill algorithms check <U>all</U> jobs in the wait queue.
</P>
<H2>Consumable Resource Examples</H2>
<P>
The following two examples illustrate the primary difference between
<I>CR_CPU</I> and <I>CR_Core</I> when consumable resource selection is enabled
(<I>select/cons_res</I>).
</P>
<P>
When <I>CR_CPU</I> (or <I>CR_CPU_Memory</I>) is configured then the selector
treats the CPUs as simple, <I>interchangeable</I> computing resources. However
when <I>CR_Core</I> (or <I>CR_Core_Memory</I>) is enabled the selector treats
the CPUs as individual resources that are <U>specifically</U> allocated to jobs.
This subtle difference is highlighted when timeslicing is enabled.
</P>
<P>
In both examples 6 jobs are submitted. Each job requests 2 CPUs per node, and
all of the nodes contain two quad-core processors. The timeslicer will initially
let the first 4 jobs run and suspend the last 2 jobs. The manner in which these
jobs are timesliced depends upon the configured <I>SelectTypeParameter</I>.
</P>
<P>
In the first example <I>CR_Core_Memory</I> is configured. Note that jobs 46 and
47 don't <U>ever</U> get suspended. This is because they are not sharing their
cores with any other job. Jobs 48 and 49 were allocated to the same cores as
jobs 44 and 45. The timeslicer recognizes this and timeslices only those jobs:
</P>
<PRE>
[user@n16 load]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[12-16]
[user@n16 load]$ <B>scontrol show config | grep Select</B>
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
[user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B>
NODELIST NODES CPUS S:C:T
n[12-16] 5 8 2:4:1
[user@n16 load]$
[user@n16 load]$
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 44
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 45
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 46
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 47
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 48
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 49
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
44 active myload user R 0:09 5 n[12-16]
45 active myload user R 0:08 5 n[12-16]
46 active myload user R 0:08 5 n[12-16]
47 active myload user R 0:07 5 n[12-16]
48 active myload user S 0:00 5 n[12-16]
49 active myload user S 0:00 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
46 active myload user R 0:49 5 n[12-16]
47 active myload user R 0:48 5 n[12-16]
48 active myload user R 0:06 5 n[12-16]
49 active myload user R 0:06 5 n[12-16]
44 active myload user S 0:44 5 n[12-16]
45 active myload user S 0:43 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
44 active myload user R 1:23 5 n[12-16]
45 active myload user R 1:22 5 n[12-16]
46 active myload user R 2:22 5 n[12-16]
47 active myload user R 2:21 5 n[12-16]
48 active myload user S 1:00 5 n[12-16]
49 active myload user S 1:00 5 n[12-16]
[user@n16 load]$
</PRE>
<P>
Note the runtime of all 6 jobs in the output of the last <I>squeue</I> command.
Jobs 46 and 47 have been running continuously, while jobs 44 and 45 are
splitting their runtime with jobs 48 and 49.
</P><P>
The next example has <I>CR_CPU_Memory</I> configured and the same 6 jobs are
submitted. Here the selector and the timeslicer treat the CPUs as countable
resources which results in all 6 jobs sharing time on the CPUs:
</P>
<PRE>
[user@n16 load]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[12-16]
[user@n16 load]$ <B>scontrol show config | grep Select</B>
SelectType = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
[user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B>
NODELIST NODES CPUS S:C:T
n[12-16] 5 8 2:4:1
[user@n16 load]$
[user@n16 load]$
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 51
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 52
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 53
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 54
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 55
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 56
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
51 active myload user R 0:11 5 n[12-16]
52 active myload user R 0:11 5 n[12-16]
53 active myload user R 0:10 5 n[12-16]
54 active myload user R 0:09 5 n[12-16]
55 active myload user S 0:00 5 n[12-16]
56 active myload user S 0:00 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
51 active myload user R 1:09 5 n[12-16]
52 active myload user R 1:09 5 n[12-16]
55 active myload user R 0:23 5 n[12-16]
56 active myload user R 0:23 5 n[12-16]
53 active myload user S 0:45 5 n[12-16]
54 active myload user S 0:44 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
53 active myload user R 0:55 5 n[12-16]
54 active myload user R 0:54 5 n[12-16]
55 active myload user R 0:40 5 n[12-16]
56 active myload user R 0:40 5 n[12-16]
51 active myload user S 1:16 5 n[12-16]
52 active myload user S 1:16 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
51 active myload user R 3:18 5 n[12-16]
52 active myload user R 3:18 5 n[12-16]
53 active myload user R 3:17 5 n[12-16]
54 active myload user R 3:16 5 n[12-16]
55 active myload user S 3:00 5 n[12-16]
56 active myload user S 3:00 5 n[12-16]
[user@n16 load]$
</PRE>
<P>
Note that the runtime of all 6 jobs is roughly equal. Jobs 51-54 ran first so
they're slightly ahead, but so far all jobs have run for at least 3 minutes.
</P><P>
At the core level this means that SLURM relies on the Linux kernel to move jobs
around on the cores to maximize performance. This is different from
<I>CR_Core_Memory</I>, where the jobs effectively remain
"pinned" to their specific cores for their duration. Note that
<I>CR_Core_Memory</I> supports CPU binding, while <I>CR_CPU_Memory</I> does not.
</P>
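<P>
For example, to take advantage of CPU binding with <I>CR_Core_Memory</I>, the
task affinity plugin can be enabled and tasks bound to their allocated cores.
This is only a sketch; <I>./myapp</I> is a placeholder application:
</P>
<PRE>
# slurm.conf (sketch)
TaskPlugin=task/affinity

# Bind each task to the cores it was allocated
[user@n16 load]$ <B>srun --cpu_bind=cores ./myapp</B>
</PRE>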
<H2>Future Work</H2>
<P>
Priority scheduling and preemptive scheduling are other forms of gang
scheduling that are currently under development for SLURM.
</P>
<P>
<B>Making use of swap space</B> (this topic is not currently scheduled for
development, unless someone would like to pursue it): timeslicing provides an
interesting mechanism for high-performance jobs to make use of swap space. The
optimal scenario is one in which suspended jobs are "swapped out" and active
jobs are "swapped in". The swapping activity would only occur once every
<I>SchedulerTimeSlice</I> interval.
</P>
<P>
However, SLURM should first be modified to include support for scheduling jobs
into swap space and to provide controls to prevent overcommitting swap space.
For now this idea could be experimented with by disabling memory support in the
selector and submitting appropriately sized jobs.
</P>
<p style="text-align:center;">Last modified 17 March 2008</p>
<!--#include virtual="footer.txt"-->