diff --git a/doc/html/gang_scheduling.shtml b/doc/html/gang_scheduling.shtml
index df914619b3f91d102b35b17c5cf5f5ab25e91605..66c0b7cf690a86d02f163c293690400f21297874 100644
--- a/doc/html/gang_scheduling.shtml
+++ b/doc/html/gang_scheduling.shtml
@@ -1,21 +1,472 @@
 <!--#include virtual="header.txt"-->
-<h1>Gang Scheduling</h1>
-<p>SLURM version 1.2 and earlier supported dedication of resources
+<H1>Gang Scheduling</H1>
+
+<P>
+SLURM version 1.2 and earlier supported dedication of resources
 to jobs.
-Beginning in SLURM version 1.3 gang scheduling is supported. </p>
+Beginning in SLURM version 1.3, gang scheduling is supported.
+With gang scheduling, two or more jobs are allocated to the same resources
+and the jobs are alternately suspended so that all of the tasks of each
+job have full access to the shared resources for a period of time.
+</P>
+<P>
+A resource manager that supports timeslicing can improve its responsiveness
+and utilization by allowing more jobs to begin running sooner. Shorter-running
+jobs no longer have to wait in a queue behind longer-running jobs. Instead they
+can be run "in parallel" with the longer-running jobs, which allows them
+to finish sooner. Throughput is also improved because overcommitting the
+resources provides opportunities for "local backfilling" to occur (see example
+below).
+</P>
+<P>
+In SLURM 1.3.0, the <I>sched/gang</I> plugin provides timeslicing. When enabled,
+it monitors each of the partitions in SLURM. If a new job has been allocated to
+resources in a partition that have already been allocated to an existing job,
+then the plugin will suspend the new job until the configured
+<I>SchedulerTimeSlice</I> interval has elapsed. Then it will suspend the
+running job and let the new job make use of the resources for a
+<I>SchedulerTimeSlice</I> interval. This will continue until one of the
+jobs terminates.
+</P>
+
+<H2>Configuration</H2>
+<P>
+There are several important configuration parameters relating to
+gang scheduling (a combined <I>slurm.conf</I> sketch follows this list):
+</P>
+<UL>
+<LI>
+<B>SelectType</B>: The SLURM <I>sched/gang</I> plugin supports nodes
+allocated by the <I>select/linear</I> plugin and socket/core/CPU resources
+allocated by the <I>select/cons_res</I> plugin.
+</LI>
+<LI>
+<B>SelectTypeParameters</B>: Since resources will be overallocated
+with jobs, the resource selection plugin should be configured to track the
+amount of memory used by each job to ensure that memory page swapping does
+not occur. When <I>select/linear</I> is chosen, we recommend setting
+<I>SelectTypeParameters=CR_Memory</I>. When <I>select/cons_res</I> is
+chosen, we recommend including Memory as a resource (e.g.
+<I>SelectTypeParameters=CR_Core_Memory</I>).
+</LI>
+<LI>
+<B>DefMemPerTask</B>: Since job requests may not explicitly specify
+a memory requirement, we also recommend configuring <I>DefMemPerTask</I>
+(default memory per task). It may also be desirable to configure
+<I>MaxMemPerTask</I> (maximum memory per task) in <I>slurm.conf</I>.
+</LI>
+<LI>
+<B>JobAcctGatherType and JobAcctGatherFrequency</B>:
+If you wish to enforce memory limits, accounting must be enabled
+using the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I>
+parameters. If accounting is enabled and a job exceeds its configured
+memory limits, it will be canceled in order to prevent it from
+adversely affecting other jobs sharing the same resources.
+</LI>
+<LI>
+<B>SchedulerType</B>: Configure the <I>sched/gang</I> plugin by setting
+<I>SchedulerType=sched/gang</I> in <I>slurm.conf</I>.
+</LI>
+<LI>
+<B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds.
+To change this duration, set <I>SchedulerTimeSlice</I> to the desired interval
+(in seconds) in <I>slurm.conf</I>. For example, to set the timeslice interval
+to one minute, set <I>SchedulerTimeSlice=60</I>. Short values can increase
+the overhead of gang scheduling.
+</LI>
+<LI>
+<B>Shared</B>: Configure the partition's <I>Shared</I> setting to
+<I>FORCE</I> for all partitions in which timeslicing is to take place.
+The <I>FORCE</I> option now supports an additional parameter that controls
+how many jobs can share a resource (FORCE[:max_share]). By default the
+max_share value is 4. To allow up to 6 jobs from this partition to be
+allocated to a common resource, set <I>Shared=FORCE:6</I>.
+</LI>
+</UL>
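+<P>
+Putting these parameters together, the gang-scheduling-related portion of a
+<I>slurm.conf</I> might look like the following sketch. It assumes
+<I>select/linear</I> on a five-node cluster like the one used in the examples
+below; the node names, memory sizes, accounting plugin and <I>max_share</I>
+value are only illustrative and should be adapted to the local site. With
+<I>select/cons_res</I>, <I>SelectTypeParameters=CR_Core_Memory</I> would be
+used instead, as in the consumable resource examples below.
+</P>
+<PRE>
+# Illustrative gang scheduling settings (fragment of slurm.conf)
+SelectType=select/linear
+SelectTypeParameters=CR_Memory
+DefMemPerTask=512
+MaxMemPerTask=1024
+JobAcctGatherType=jobacct_gather/linux
+JobAcctGatherFrequency=30
+SchedulerType=sched/gang
+SchedulerTimeSlice=30
+NodeName=n[12-16] Procs=8 RealMemory=2048
+PartitionName=active Nodes=n[12-16] Default=YES Shared=FORCE:6 State=UP
+</PRE>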
+<P>
+To enable gang scheduling after making the configuration changes
+described above, restart SLURM if it is already running. Any change to the
+plugin settings in SLURM requires a full restart of the daemons. If you
+just change the partition <I>Shared</I> setting, this can be updated with
+<I>scontrol reconfig</I>.
+</P>
+<P>
+For an advanced topic discussion on the potential use of swap space,
+see "Making use of swap space" in the "Future Work" section below.
+</P>
+
+<H2>Timeslicer Design and Operation</H2>
+
+<P>
+When enabled, the <I>sched/gang</I> plugin keeps track of the resources
+allocated to all jobs. For each partition an "active bitmap" is maintained that
+tracks all concurrently running jobs in the SLURM cluster. Each time a new
+job is allocated to resources in a partition, the <I>sched/gang</I> plugin
+compares these newly allocated resources with the resources already maintained
+in the "active bitmap". If these two sets of resources are disjoint, then the
+new job is added to the "active bitmap". If these two sets of resources
+overlap, then the new job is suspended. All jobs are tracked in a per-partition
+job queue within the <I>sched/gang</I> plugin.
+</P>
+<P>
+A separate <I>timeslicer thread</I> is spawned by the <I>sched/gang</I> plugin
+on startup. This thread sleeps for the configured <I>SchedulerTimeSlice</I>
+interval. When it wakes up, it checks each partition for suspended jobs. If
+suspended jobs are found, then the <I>timeslicer thread</I> moves all running
+jobs to the end of the job queue. It then reconstructs the "active bitmap" for
+this partition beginning with the suspended job that has waited the longest to
+run (this will be the first suspended job in the job queue). Each following job
+is then compared with the new "active bitmap", and if the job can be run
+concurrently with the other "active" jobs then the job is added. Once this is
+complete, the <I>timeslicer thread</I> suspends any currently running jobs
+that are no longer part of the "active bitmap", and resumes jobs that are new
+to the "active bitmap".
+</P>
+<P>
+This <I>timeslicer thread</I> algorithm for rotating jobs is designed to prevent
+jobs from starving (remaining in the suspended state indefinitely) and to be as
+fair as possible in the distribution of runtime while still keeping all of the
+resources as busy as possible.
+</P>
+<P>
+The <I>sched/gang</I> plugin suspends jobs via the same internal functions that
+support <I>scontrol suspend</I> and <I>scontrol resume</I>. A good way to
+observe the operation of the timeslicer is to run <I>watch squeue</I> in a
+terminal window.
+</P>
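+<P>
+For example, either of the following commands refreshes the queue listing once
+per default timeslice; the 30 second interval here is only an illustration and
+can be matched to the configured <I>SchedulerTimeSlice</I>:
+</P>
+<PRE>
+[user@n16 load]$ <B>watch -n 30 squeue</B>
+[user@n16 load]$ <B>squeue -i 30</B>
+</PRE>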
+
+<H2>A Simple Example</H2>
+<P>
+The following example is configured with <I>select/linear</I>,
+<I>sched/gang</I>, and <I>Shared=FORCE</I>. This example takes place on a small
+cluster of 5 nodes:
+</P>
+<PRE>
+[user@n16 load]$ <B>sinfo</B>
+PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
+active* up infinite 5 idle n[12-16]
+</PRE>
+<P>
+Here are the scheduler settings (the last two settings are the relevant ones):
+</P>
+<PRE>
+[user@n16 load]$ <B>scontrol show config | grep Sched</B>
+FastSchedule = 1
+SchedulerPort = 7321
+SchedulerRootFilter = 1
+SchedulerTimeSlice = 30
+SchedulerType = sched/gang
+[user@n16 load]$
+</PRE>
+<P>
+The <I>myload</I> script launches a simple load-generating app that runs
+for the given number of seconds. Submit <I>myload</I> to run on all nodes:
+</P>
+<PRE>
+[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
+sbatch: Submitted batch job 3
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 3 active myload user R 0:05 5 n[12-16]
+</PRE>
+<P>
+Submit it again and watch the <I>sched/gang</I> plugin suspend it:
+</P>
+<PRE>
+[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
+sbatch: Submitted batch job 4
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 3 active myload user R 0:13 5 n[12-16]
+ 4 active myload user S 0:00 5 n[12-16]
+</PRE>
+<P>
+After 30 seconds the <I>sched/gang</I> plugin swaps jobs, and now job 4 is the
+active one:
+</P>
+<PRE>
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 4 active myload user R 0:08 5 n[12-16]
+ 3 active myload user S 0:41 5 n[12-16]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 4 active myload user R 0:21 5 n[12-16]
+ 3 active myload user S 0:41 5 n[12-16]
+</PRE>
+<P>
+After another 30 seconds the <I>sched/gang</I> plugin sets job 3 running again:
+</P>
+<PRE>
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 3 active myload user R 0:50 5 n[12-16]
+ 4 active myload user S 0:30 5 n[12-16]
+</PRE>
+<P>
+<B>A possible side effect of timeslicing</B>: Note that jobs that are
+immediately suspended may cause their srun commands to produce the following
+output:
+</P>
+<PRE>
+[user@n16 load]$ <B>cat slurm-4.out</B>
+srun: Job step creation temporarily disabled, retrying
+srun: Job step creation still disabled, retrying
+srun: Job step creation still disabled, retrying
+srun: Job step creation still disabled, retrying
+srun: Job step created
+</PRE>
+<P>
+This occurs because <I>srun</I> is attempting to launch a job step in an
+allocation that has been suspended. The <I>srun</I> process will continue in a
+retry loop to launch the job step until the allocation has been resumed and the
+job step can be launched.
+</P>
+<P>
+When the <I>sched/gang</I> plugin is enabled, this type of output in the user
+jobs should be considered benign.
+</P>
-<h2>Configuration</h2>
-<p>There are several important configuration parameters relating to
-gang scheduling:</p>
-<ul>
+<H2>More Examples</H2>
+<P>
+The following example shows how the timeslicer algorithm keeps the resources
+busy.
+Job 10 runs continually, while jobs 9 and 11 are timesliced:
+</P>
+<PRE>
+[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
+sbatch: Submitted batch job 9
+[user@n16 load]$ <B>sbatch -N2 ./myload 300</B>
+sbatch: Submitted batch job 10
+[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
+sbatch: Submitted batch job 11
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 9 active myload user R 0:11 3 n[12-14]
+ 10 active myload user R 0:08 2 n[15-16]
+ 11 active myload user S 0:00 3 n[12-14]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 10 active myload user R 0:50 2 n[15-16]
+ 11 active myload user R 0:12 3 n[12-14]
+ 9 active myload user S 0:41 3 n[12-14]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 10 active myload user R 1:04 2 n[15-16]
+ 11 active myload user R 0:26 3 n[12-14]
+ 9 active myload user S 0:41 3 n[12-14]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 9 active myload user R 0:46 3 n[12-14]
+ 10 active myload user R 1:13 2 n[15-16]
+ 11 active myload user S 0:30 3 n[12-14]
+[user@n16 load]$
+</PRE>
+<P>
+The next example demonstrates "local backfilling":
+</P>
+<PRE>
+[user@n16 load]$ <B>sbatch -N3 ./myload 300</B>
+sbatch: Submitted batch job 12
+[user@n16 load]$ <B>sbatch -N5 ./myload 300</B>
+sbatch: Submitted batch job 13
+[user@n16 load]$ <B>sbatch -N2 ./myload 300</B>
+sbatch: Submitted batch job 14
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 12 active myload user R 0:14 3 n[12-14]
+ 14 active myload user R 0:06 2 n[15-16]
+ 13 active myload user S 0:00 5 n[12-16]
+[user@n16 load]$
+</PRE>
+<P>
+Without timeslicing and without the backfill scheduler enabled, job 14 has to
+wait for job 13 to finish.
+</P>
+<P>
+This is called "local" backfilling because the backfilling only occurs with jobs
+close enough in the queue to get allocated by the scheduler as part of
+oversubscribing the resources. Recall that the number of jobs that can
+overcommit a resource is controlled by the <I>Shared=FORCE:max_share</I> value,
+so this value effectively controls the scope of "local backfilling".
+</P>
+<P>
+Normal backfill algorithms check <U>all</U> jobs in the wait queue.
+</P>
-<li><b>GangTimeSlice</b>:
-Description.</li>
+<H2>Consumable Resource Examples</H2>
+<P>
+The following two examples illustrate the primary difference between
+<I>CR_CPU</I> and <I>CR_Core</I> when consumable resource selection is enabled
+(<I>select/cons_res</I>).
+</P>
+<P>
+When <I>CR_CPU</I> (or <I>CR_CPU_Memory</I>) is configured, the selector
+treats the CPUs as simple, <I>interchangeable</I> computing resources. However,
+when <I>CR_Core</I> (or <I>CR_Core_Memory</I>) is enabled, the selector treats
+the CPUs as individual resources that are <U>specifically</U> allocated to jobs.
+This subtle difference is highlighted when timeslicing is enabled.
+</P>
+<P>
+In both examples 6 jobs are submitted. Each job requests 2 CPUs per node, and
+all of the nodes contain two quad-core processors. The timeslicer will initially
+let the first 4 jobs run and suspend the last 2 jobs. The manner in which these
+jobs are timesliced depends upon the configured <I>SelectTypeParameters</I>.
+</P>
+<P>
+In the first example <I>CR_Core_Memory</I> is configured. Note that jobs 46 and
+47 don't <U>ever</U> get suspended. This is because they are not sharing their
+cores with any other job.
+Jobs 48 and 49 were allocated to the same cores as
+jobs 44 and 45. The timeslicer recognizes this and timeslices only those jobs:
+</P>
+<PRE>
+[user@n16 load]$ <B>sinfo</B>
+PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
+active* up infinite 5 idle n[12-16]
+[user@n16 load]$ <B>scontrol show config | grep Select</B>
+SelectType = select/cons_res
+SelectTypeParameters = CR_CORE_MEMORY
+[user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B>
+NODELIST NODES CPUS S:C:T
+n[12-16] 5 8 2:4:1
+[user@n16 load]$
+[user@n16 load]$
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 44
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 45
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 46
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 47
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 48
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 49
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 44 active myload user R 0:09 5 n[12-16]
+ 45 active myload user R 0:08 5 n[12-16]
+ 46 active myload user R 0:08 5 n[12-16]
+ 47 active myload user R 0:07 5 n[12-16]
+ 48 active myload user S 0:00 5 n[12-16]
+ 49 active myload user S 0:00 5 n[12-16]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 46 active myload user R 0:49 5 n[12-16]
+ 47 active myload user R 0:48 5 n[12-16]
+ 48 active myload user R 0:06 5 n[12-16]
+ 49 active myload user R 0:06 5 n[12-16]
+ 44 active myload user S 0:44 5 n[12-16]
+ 45 active myload user S 0:43 5 n[12-16]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 44 active myload user R 1:23 5 n[12-16]
+ 45 active myload user R 1:22 5 n[12-16]
+ 46 active myload user R 2:22 5 n[12-16]
+ 47 active myload user R 2:21 5 n[12-16]
+ 48 active myload user S 1:00 5 n[12-16]
+ 49 active myload user S 1:00 5 n[12-16]
+[user@n16 load]$
+</PRE>
+<P>
+Note the runtime of all 6 jobs in the output of the last <I>squeue</I> command.
+Jobs 46 and 47 have been running continuously, while jobs 44 and 45 are
+splitting their runtime with jobs 48 and 49.
+</P>
+<P>
+The next example has <I>CR_CPU_Memory</I> configured and the same 6 jobs are
+submitted.
+Here the selector and the timeslicer treat the CPUs as countable
+resources, which results in all 6 jobs sharing time on the CPUs:
+</P>
+<PRE>
+[user@n16 load]$ <B>sinfo</B>
+PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
+active* up infinite 5 idle n[12-16]
+[user@n16 load]$ <B>scontrol show config | grep Select</B>
+SelectType = select/cons_res
+SelectTypeParameters = CR_CPU_MEMORY
+[user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B>
+NODELIST NODES CPUS S:C:T
+n[12-16] 5 8 2:4:1
+[user@n16 load]$
+[user@n16 load]$
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 51
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 52
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 53
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 54
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 55
+[user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B>
+sbatch: Submitted batch job 56
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 51 active myload user R 0:11 5 n[12-16]
+ 52 active myload user R 0:11 5 n[12-16]
+ 53 active myload user R 0:10 5 n[12-16]
+ 54 active myload user R 0:09 5 n[12-16]
+ 55 active myload user S 0:00 5 n[12-16]
+ 56 active myload user S 0:00 5 n[12-16]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 51 active myload user R 1:09 5 n[12-16]
+ 52 active myload user R 1:09 5 n[12-16]
+ 55 active myload user R 0:23 5 n[12-16]
+ 56 active myload user R 0:23 5 n[12-16]
+ 53 active myload user S 0:45 5 n[12-16]
+ 54 active myload user S 0:44 5 n[12-16]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 53 active myload user R 0:55 5 n[12-16]
+ 54 active myload user R 0:54 5 n[12-16]
+ 55 active myload user R 0:40 5 n[12-16]
+ 56 active myload user R 0:40 5 n[12-16]
+ 51 active myload user S 1:16 5 n[12-16]
+ 52 active myload user S 1:16 5 n[12-16]
+[user@n16 load]$ <B>squeue</B>
+JOBID PARTITION NAME USER ST TIME NODES NODELIST
+ 51 active myload user R 3:18 5 n[12-16]
+ 52 active myload user R 3:18 5 n[12-16]
+ 53 active myload user R 3:17 5 n[12-16]
+ 54 active myload user R 3:16 5 n[12-16]
+ 55 active myload user S 3:00 5 n[12-16]
+ 56 active myload user S 3:00 5 n[12-16]
+[user@n16 load]$
+</PRE>
+<P>
+Note that the runtime of all 6 jobs is roughly equal. Jobs 51-54 ran first, so
+they're slightly ahead, but so far all jobs have run for at least 3 minutes.
+</P>
+<P>
+At the core level this means that SLURM relies on the Linux kernel to move jobs
+around on the cores to maximize performance. This is different from when
+<I>CR_Core_Memory</I> was configured, where the jobs effectively remained
+"pinned" to their specific cores for the duration of the job. Note that
+<I>CR_Core_Memory</I> supports CPU binding, while <I>CR_CPU_Memory</I> does not.
+</P>
-</ul></p>
+<H2>Future Work</H2>
+
+<P>
+Priority scheduling and preemptive scheduling are other forms of gang
+scheduling that are currently under development for SLURM.
+</P>
+<P>
+<B>Making use of swap space</B>: (this topic is not currently scheduled for
+development, unless someone would like to pursue it) Timeslicing provides an
+interesting mechanism for high performance jobs to make use of swap space. The
+optimal scenario is one in which suspended jobs are "swapped out" and active
+jobs are "swapped in".
+The swapping activity would only occur once every <I>SchedulerTimeSlice</I>
+interval.
+</P>
+<P>
+However, SLURM should first be modified to include support for scheduling jobs
+into swap space and to provide controls to prevent overcommitting swap space.
+For now, this idea could be explored by disabling memory support in the
+selector and submitting appropriately sized jobs.
+</P>
-<p style="text-align:center;">Last modified 11 March 2008</p>
+<p style="text-align:center;">Last modified 17 March 2008</p>
 <!--#include virtual="footer.txt"-->