Commit 6b97dc6a
authored 17 years ago by Moe Jette

add gang scheduling web page from Chris Holmes with some editing by Moe Jette

parent ce7655da

Showing 1 changed file: doc/html/gang_scheduling.shtml (+462, −11)
<!--#include virtual="header.txt"-->

<H1>Gang Scheduling</H1>

<P>
SLURM version 1.2 and earlier supported only the dedication of resources
to jobs.
Beginning in SLURM version 1.3, gang scheduling is supported.
With gang scheduling, two or more jobs are allocated to the same resources
and are alternately suspended so that each job in turn has full access to
the shared resources for a period of time.
</P>
<P>
A resource manager that supports timeslicing can improve its responsiveness
and utilization by allowing more jobs to begin running sooner. Shorter-running
jobs no longer have to wait in a queue behind longer-running jobs. Instead they
can be run "in parallel" with the longer-running jobs, which will allow them
to finish quicker. Throughput is also improved because overcommitting the
resources provides opportunities for "local backfilling" to occur (see example
below).
</P>
<P>
In SLURM 1.3.0, the <I>sched/gang</I> plugin provides timeslicing. When enabled,
it monitors each of the partitions in SLURM. If a new job has been allocated to
resources in a partition that have already been allocated to an existing job,
then the plugin will suspend the new job until the configured
<I>SchedulerTimeSlice</I> interval has elapsed. Then it will suspend the
running job and let the new job make use of the resources for a
<I>SchedulerTimeSlice</I> interval. This will continue until one of the
jobs terminates.
</P>
<H2>Configuration</H2>
<P>
There are several important configuration parameters relating to
gang scheduling (a combined <I>slurm.conf</I> excerpt follows this list):
</P>
<UL>
<LI>
<B>SelectType</B>: The SLURM <I>sched/gang</I> plugin supports nodes
allocated by the <I>select/linear</I> plugin and socket/core/CPU resources
allocated by the <I>select/cons_res</I> plugin.
</LI>
<LI>
<B>SelectTypeParameter</B>: Since resources will be overcommitted with
multiple jobs, the resource selection plugin should be configured to track the
amount of memory used by each job to ensure that memory page swapping does
not occur. When <I>select/linear</I> is chosen, we recommend setting
<I>SelectTypeParameter=CR_Memory</I>. When <I>select/cons_res</I> is
chosen, we recommend including Memory as a resource (ex.
<I>SelectTypeParameter=CR_Core_Memory</I>).
</LI>
<LI>
<B>DefMemPerTask</B>: Since job requests may not explicitly specify
a memory requirement, we also recommend configuring <I>DefMemPerTask</I>
(default memory per task). It may also be desirable to configure
<I>MaxMemPerTask</I> (maximum memory per task) in <I>slurm.conf</I>.
</LI>
<LI>
<B>JobAcctGatherType and JobAcctGatherFrequency</B>:
If you wish to enforce memory limits, accounting must be enabled
using the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I>
parameters. If accounting is enabled and a job exceeds its configured
memory limits, it will be canceled in order to prevent it from
adversely affecting other jobs sharing the same resources.
</LI>
<LI>
<B>SchedulerType</B>: Configure the <I>sched/gang</I> plugin by setting
<I>SchedulerType=sched/gang</I> in <I>slurm.conf</I>.
</LI>
<LI>
<B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds.
To change this duration, set <I>SchedulerTimeSlice</I> to the desired interval
(in seconds) in <I>slurm.conf</I>. For example, to set the timeslice interval
to one minute, set <I>SchedulerTimeSlice=60</I>. Short values can increase
the overhead of gang scheduling.
</LI>
<LI>
<B>Shared</B>: Configure the <I>Shared</I> setting to
<I>FORCE</I> for all partitions in which timeslicing is to take place.
The <I>FORCE</I> option now supports an additional parameter that controls
how many jobs can share a resource (FORCE[:max_share]). By default the
max_share value is 4. To allow up to 6 jobs from this partition to be
allocated to a common resource, set <I>Shared=FORCE:6</I>.
</LI>
</UL>
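<P>
Putting these parameters together, a gang-scheduling configuration in
<I>slurm.conf</I> might contain an excerpt like the following. This is only a
sketch: the node and partition names match the examples below, and the memory
and timeslice values are illustrative rather than recommended.
</P>
<PRE>
# slurm.conf excerpt (illustrative values)
SchedulerType=sched/gang
SchedulerTimeSlice=30
SelectType=select/linear
SelectTypeParameters=CR_Memory
DefMemPerTask=512
PartitionName=active Nodes=n[12-16] Default=YES Shared=FORCE:4 State=UP
</PRE>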
<P>
In order to enable gang scheduling after making the configuration changes
described above, restart SLURM if it is already running. Any change to the
plugin settings in SLURM requires a full restart of the daemons. If you
just change the partition <I>Shared</I> setting, this can be updated with
<I>scontrol reconfig</I>.
</P>
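<P>
For example (assuming the init script distributed with SLURM is installed as
<I>/etc/init.d/slurm</I>; adjust the path for your site):
</P>
<PRE>
# After changing plugin settings (e.g. SchedulerType or SelectType),
# restart the daemons on the controller and on every compute node:
/etc/init.d/slurm restart

# After changing only a partition's Shared setting:
scontrol reconfig
</PRE>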
<P>
For an advanced topic discussion on the potential use of swap space,
see "Making use of swap space" in the "Future Work" section below.
</P>
<H2>Timeslicer Design and Operation</H2>
<P>
When enabled, the <I>sched/gang</I> plugin keeps track of the resources
allocated to all jobs. For each partition an "active bitmap" is maintained that
tracks all concurrently running jobs in the SLURM cluster. Each time a new
job is allocated to resources in a partition, the <I>sched/gang</I> plugin
compares these newly allocated resources with the resources already maintained
in the "active bitmap". If these two sets of resources are disjoint then the new
job is added to the "active bitmap". If these two sets of resources overlap then
the new job is suspended. All jobs are tracked in a per-partition job queue
within the <I>sched/gang</I> plugin.
</P>
<P>
A separate <I>timeslicer thread</I> is spawned by the <I>sched/gang</I> plugin
on startup. This thread sleeps for the configured <I>SchedulerTimeSlice</I>
interval. When it wakes up, it checks each partition for suspended jobs. If
suspended jobs are found then the <I>timeslicer thread</I> moves all running
jobs to the end of the job queue. It then reconstructs the "active bitmap" for
this partition beginning with the suspended job that has waited the longest to
run (this will be the first suspended job in the run queue). Each following job
is then compared with the new "active bitmap", and if the job can be run
concurrently with the other "active" jobs then the job is added. Once this is
complete then the <I>timeslicer thread</I> suspends any currently running jobs
that are no longer part of the "active bitmap", and resumes jobs that are new to
the "active bitmap".
</P>
<P>
This <I>timeslicer thread</I> algorithm for rotating jobs is designed to prevent
jobs from starving (remaining in the suspended state indefinitely) and to be as
fair as possible in the distribution of runtime while still keeping all of the
resources as busy as possible.
</P>
<P>
The <I>sched/gang</I> plugin suspends jobs via the same internal functions that
support <I>scontrol suspend</I> and <I>scontrol resume</I>. A good way to
observe the operation of the timeslicer is by running <I>watch squeue</I> in a
terminal window.
</P>
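<P>
For example (the refresh interval here is only a suggestion):
</P>
<PRE>
[user@n16 load]$ <B>watch -n 10 squeue</B>
</PRE>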
<H2>A Simple Example</H2>
<P>
The following example is configured with <I>select/linear</I>,
<I>sched/gang</I>, and <I>Shared=FORCE</I>. This example takes place on a small
cluster of 5 nodes:
</P>
<PRE>
[user@n16 load]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[12-16]
</PRE>
<P>
Here are the Scheduler settings (the last two settings are the relevant ones):
</P>
<PRE>
[user@n16 load]$ <B>scontrol show config | grep Sched</B>
FastSchedule = 1
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30
SchedulerType = sched/gang
[user@n16 load]$
</PRE>
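<P>
The <I>myload</I> script used in these examples is not included on this page.
A minimal, hypothetical sketch of such a script, which spins the allocated
CPUs for a requested number of seconds under <I>srun</I>, might look like:
</P>
<PRE>
#!/bin/sh
# myload &lt;seconds&gt; -- hypothetical load generator (not part of SLURM)
SECS=$1
srun bash -c "end=\$((SECONDS + $SECS)); while [ \$SECONDS -lt \$end ]; do :; done"
</PRE>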
<P>
The <I>myload</I> script launches a simple load-generating app that runs
for the given number of seconds. Submit <I>myload</I> to run on all nodes:
</P>
<PRE>
[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
sbatch: Submitted batch job 3
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
3 active myload user 0:05 5 n[12-16]
</PRE>
<P>
Submit it again and watch the <I>sched/gang</I> plugin suspend it:
</P>
<PRE>
[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
sbatch: Submitted batch job 4
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
3 active myload user R 0:13 5 n[12-16]
4 active myload user S 0:00 5 n[12-16]
</PRE>
<P>
After 30 seconds the <I>sched/gang</I> plugin swaps jobs, and now job 4 is the
active one:
</P>
<PRE>
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
4 active myload user R 0:08 5 n[12-16]
3 active myload user S 0:41 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
4 active myload user R 0:21 5 n[12-16]
3 active myload user S 0:41 5 n[12-16]
</PRE>
<P> After another 30 seconds the <I>sched/gang</I> plugin sets job 3 running again:
</P>
<PRE>
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
3 active myload user R 0:50 5 n[12-16]
4 active myload user S 0:30 5 n[12-16]
</PRE>
<P>
<B>A possible side effect of timeslicing</B>: Note that jobs that are
immediately suspended may cause their srun commands to produce the following
output:
</P>
<PRE>
[user@n16 load]$ <B>cat slurm-4.out</B>
srun: Job step creation temporarily disabled, retrying
srun: Job step creation still disabled, retrying
srun: Job step creation still disabled, retrying
srun: Job step creation still disabled, retrying
srun: Job step created
</PRE>
<P>
This occurs because <I>srun</I> is attempting to launch a jobstep in an
allocation that has been suspended. The <I>srun</I> process will continue in a
retry loop to launch the jobstep until the allocation has been resumed and the
jobstep can be launched.
</P>
<P>
When the <I>sched/gang</I> plugin is enabled, this type of output in the user
jobs should be considered benign.
</P>
<H2>More examples</H2>
<P>
The following example shows how the timeslicer algorithm keeps the resources
busy. Job 10 runs continually, while jobs 9 and 11 are timesliced:
</P>
<PRE>
[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
sbatch: Submitted batch job 9
[user@n16 load]$ <B>sbatch -N2 ./myload 300</B>
sbatch: Submitted batch job 10
[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
sbatch: Submitted batch job 11
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
9 active myload user R 0:11 3 n[12-14]
10 active myload user R 0:08 2 n[15-16]
11 active myload user S 0:00 3 n[12-14]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
10 active myload user R 0:50 2 n[15-16]
11 active myload user R 0:12 3 n[12-14]
9 active myload user S 0:41 3 n[12-14]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
10 active myload user R 1:04 2 n[15-16]
11 active myload user R 0:26 3 n[12-14]
9 active myload user S 0:41 3 n[12-14]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
9 active myload user R 0:46 3 n[12-14]
10 active myload user R 1:13 2 n[15-16]
11 active myload user S 0:30 3 n[12-14]
[user@n16 load]$
</PRE>
<P>
The next example displays "local backfilling":
</P>
<PRE>
[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
sbatch: Submitted batch job 12
[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
sbatch: Submitted batch job 13
[user@n16 load]$ <B>sbatch -N2 ./myload 300</B>
sbatch: Submitted batch job 14
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
12 active myload user R 0:14 3 n[12-14]
14 active myload user R 0:06 2 n[15-16]
13 active myload user S 0:00 5 n[12-16]
[user@n16 load]$
</PRE>
<P>
Without timeslicing and without the backfill scheduler enabled, job 14 has to
wait for job 13 to finish.
</P><P>
This is called "local" backfilling because the backfilling only occurs with jobs
close enough in the queue to get allocated by the scheduler as part of
oversubscribing the resources. Recall that the number of jobs that can
overcommit a resource is controlled by the <I>Shared=FORCE:max_share</I> value,
so this value effectively controls the scope of "local backfilling".
</P><P>
Normal backfill algorithms check <U>all</U> jobs in the wait queue.
</P>
<H2>Consumable Resource Examples</H2>
<P>
The following two examples illustrate the primary difference between
<I>CR_CPU</I> and <I>CR_Core</I> when consumable resource selection is enabled
(<I>select/cons_res</I>).
</P>
<P>
When <I>CR_CPU</I> (or <I>CR_CPU_Memory</I>) is configured then the selector
treats the CPUs as simple, <I>interchangeable</I> computing resources. However
when <I>CR_Core</I> (or <I>CR_Core_Memory</I>) is enabled the selector treats
the CPUs as individual resources that are <U>specifically</U> allocated to jobs.
This subtle difference is highlighted when timeslicing is enabled.
</P>
<P>
In both examples 6 jobs are submitted. Each job requests 2 CPUs per node, and
all of the nodes contain two quad-core processors. The timeslicer will initially
let the first 4 jobs run and suspend the last 2 jobs. The manner in which these
jobs are timesliced depends upon the configured <I>SelectTypeParameter</I>.
</P>
<P>
In the first example <I>CR_Core_Memory</I> is configured. Note that jobs 46 and
47 don't <U>ever</U> get suspended. This is because they are not sharing their
cores with any other job. Jobs 48 and 49 were allocated to the same cores as
jobs 44 and 45. The timeslicer recognizes this and timeslices only those jobs:
</P>
<PRE>
[user@n16 load]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[12-16]
[user@n16 load]$ <B>scontrol show config | grep Select</B>
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
[user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B>
NODELIST NODES CPUS S:C:T
n[12-16] 5 8 2:4:1
[user@n16 load]$
[user@n16 load]$
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 44
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 45
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 46
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 47
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 48
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 49
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
44 active myload user R 0:09 5 n[12-16]
45 active myload user R 0:08 5 n[12-16]
46 active myload user R 0:08 5 n[12-16]
47 active myload user R 0:07 5 n[12-16]
48 active myload user S 0:00 5 n[12-16]
49 active myload user S 0:00 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
46 active myload user R 0:49 5 n[12-16]
47 active myload user R 0:48 5 n[12-16]
48 active myload user R 0:06 5 n[12-16]
49 active myload user R 0:06 5 n[12-16]
44 active myload user S 0:44 5 n[12-16]
45 active myload user S 0:43 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
44 active myload user R 1:23 5 n[12-16]
45 active myload user R 1:22 5 n[12-16]
46 active myload user R 2:22 5 n[12-16]
47 active myload user R 2:21 5 n[12-16]
48 active myload user S 1:00 5 n[12-16]
49 active myload user S 1:00 5 n[12-16]
[user@n16 load]$
</PRE>
<P>
Note the runtime of all 6 jobs in the output of the last <I>squeue</I> command.
Jobs 46 and 47 have been running continuously, while jobs 44 and 45 are
splitting their runtime with jobs 48 and 49.
</P><P>
The next example has <I>CR_CPU_Memory</I> configured and the same 6 jobs are
submitted. Here the selector and the timeslicer treat the CPUs as countable
resources which results in all 6 jobs sharing time on the CPUs:
</P>
<PRE>
[user@n16 load]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[12-16]
[user@n16 load]$ <B>scontrol show config | grep Select</B>
SelectType = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
[user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B>
NODELIST NODES CPUS S:C:T
n[12-16] 5 8 2:4:1
[user@n16 load]$
[user@n16 load]$
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 51
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 52
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 53
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 54
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 55
[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
sbatch: Submitted batch job 56
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
51 active myload user R 0:11 5 n[12-16]
52 active myload user R 0:11 5 n[12-16]
53 active myload user R 0:10 5 n[12-16]
54 active myload user R 0:09 5 n[12-16]
55 active myload user S 0:00 5 n[12-16]
56 active myload user S 0:00 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
51 active myload user R 1:09 5 n[12-16]
52 active myload user R 1:09 5 n[12-16]
55 active myload user R 0:23 5 n[12-16]
56 active myload user R 0:23 5 n[12-16]
53 active myload user S 0:45 5 n[12-16]
54 active myload user S 0:44 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
53 active myload user R 0:55 5 n[12-16]
54 active myload user R 0:54 5 n[12-16]
55 active myload user R 0:40 5 n[12-16]
56 active myload user R 0:40 5 n[12-16]
51 active myload user S 1:16 5 n[12-16]
52 active myload user S 1:16 5 n[12-16]
[user@n16 load]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
51 active myload user R 3:18 5 n[12-16]
52 active myload user R 3:18 5 n[12-16]
53 active myload user R 3:17 5 n[12-16]
54 active myload user R 3:16 5 n[12-16]
55 active myload user S 3:00 5 n[12-16]
56 active myload user S 3:00 5 n[12-16]
[user@n16 load]$
</PRE>
<P>
Note that the runtime of all 6 jobs is roughly equal. Jobs 51-54 ran first so
they're slightly ahead, but so far all jobs have run for at least 3 minutes.
</P><P>
At the core level this means that SLURM relies on the Linux kernel to move jobs
around on the cores to maximize performance. This is different from
<I>CR_Core_Memory</I>, where the jobs effectively remain
"pinned" to their specific cores for their duration. Note that
<I>CR_Core_Memory</I> supports CPU binding, while <I>CR_CPU_Memory</I> does not.
</P>
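<P>
For example, to take advantage of CPU binding with <I>CR_Core_Memory</I>, the
task affinity plugin can be enabled and tasks bound to their allocated cores.
This is only a sketch; <I>./myapp</I> is a placeholder application:
</P>
<PRE>
# slurm.conf (sketch)
TaskPlugin=task/affinity

# Bind each task to the cores it was allocated
[user@n16 load]$ <B>srun --cpu_bind=cores ./myapp</B>
</PRE>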
<H2>Future Work</H2>
<P>
Priority scheduling and preemptive scheduling are other forms of gang
scheduling that are currently under development for SLURM.
</P>
<P>
<B>Making use of swap space</B> (this topic is not currently scheduled for
development, unless someone would like to pursue it): timeslicing provides an
interesting mechanism for high-performance jobs to make use of swap space. The
optimal scenario is one in which suspended jobs are "swapped out" and active
jobs are "swapped in". The swapping activity would only occur once every
<I>SchedulerTimeSlice</I> interval.
</P>
<P>
However, SLURM should first be modified to include support for scheduling jobs
into swap space and to provide controls to prevent overcommitting swap space.
For now this idea could be experimented with by disabling memory support in the
selector and submitting appropriately sized jobs.
</P>
<p style="text-align:center;">Last modified 17 March 2008</p>
<!--#include virtual="footer.txt"-->