tud-zih-energy / Slurm / Commits
Commit ffeebd81
authored 16 years ago by Moe Jette
Update of docs for job preemption, docs.patch from Chris Holmes.
parent 530172f7
Showing 3 changed files with 201 additions and 117 deletions:

  doc/html/cons_res_share.shtml   (+64, -30)
  doc/html/gang_scheduling.shtml  (+13, -12)
  doc/html/preempt.shtml          (+124, -75)
doc/html/cons_res_share.shtml  (+64, -30)
...
...
@@ -41,8 +41,9 @@ The following table describes this new functionality in more detail:
<TD>Whole nodes are allocated to jobs. No node will run more than one job.</TD>
</TR><TR>
<TD>Shared=YES</TD>
<TD>Same as Shared=FORCE if job request specifies --shared option.
Otherwise same as Shared=NO.</TD>
<TD>By default same as Shared=NO. Nodes allocated to a job may be shared with
other jobs if each job allows sharing via the <CODE>srun --shared</CODE>
option.</TD>
</TR><TR>
<TD>Shared=FORCE</TD>
<TD>Whole nodes are allocated to jobs. A node may run more than one job.</TD>
...
...
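For context on the Shared=YES rows above: a job opts into sharing at submission time with the --shared option named in the new text. A minimal usage sketch (the program name and node count are placeholders, not from this patch):

    # Job willing to share its allocation; only takes effect when the
    # partition is configured Shared=YES:
    srun -N1 --shared ./my_app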
@@ -55,8 +56,9 @@ SelectTypeParameters=<B>CR_Core_Memory</B></TD>
<TD>Cores are allocated to jobs. No core will run more than one job.</TD>
</TR><TR>
<TD>Shared=YES</TD>
<TD>Allocate whole nodes if job request specifies --exclusive option.
Otherwise same as Shared=FORCE.</TD>
<TD>By default same as Shared=NO. Cores allocated to a job may be shared with
other jobs if each job allows sharing via the <CODE>srun --shared</CODE>
option.</TD>
</TR><TR>
<TD>Shared=FORCE</TD>
<TD>Cores are allocated to jobs. A core may run more than one job.</TD>
...
...
@@ -69,8 +71,9 @@ SelectTypeParameters=<B>CR_CPU_Memory</B></TD>
<TD>CPUs are allocated to jobs. No CPU will run more than one job.</TD>
</TR><TR>
<TD>Shared=YES</TD>
<TD>Allocate whole nodes if job request specifies --exclusive option.
Otherwise same as Shared=FORCE.</TD>
<TD>By default same as Shared=NO. CPUs allocated to a job may be shared with
other jobs if each job allows sharing via the <CODE>srun --shared</CODE>
option.</TD>
</TR><TR>
<TD>Shared=FORCE</TD>
<TD>CPUs are allocated to jobs. A CPU may run more than one job.</TD>
...
...
@@ -83,8 +86,9 @@ SelectTypeParameters=<B>CR_Socket_Memory</B></TD>
<TD>Sockets are allocated to jobs. No socket will run more than one job.</TD>
</TR><TR>
<TD>Shared=YES</TD>
<TD>Allocate whole nodes if job request specifies --exclusive option.
Otherwise same as Shared=FORCE.</TD>
<TD>By default same as Shared=NO. Sockets allocated to a job may be shared with
other jobs if each job allows sharing via the <CODE>srun --shared</CODE>
option.</TD>
</TR><TR>
<TD>Shared=FORCE</TD>
<TD>Sockets are allocated to jobs. A socket may run more than one job.</TD>
...
...
@@ -110,9 +114,9 @@ busy nodes that have more than half of the CPUs available for use. The
<CODE>select/linear</CODE> plugin simply counts jobs on nodes, and does not
track the CPU usage on each node.
</P><P>
This new functionality also supports the new
<CODE>Shared=FORCE:<num></CODE> syntax. If <CODE>Shared=FORCE:3</CODE> is
configured with <CODE>select/cons_res</CODE> and <CODE>CR_Core</CODE> or
This new sharing functionality in the select/cons_res plugin also supports the
new <CODE>Shared=FORCE:<num></CODE> syntax. If <CODE>Shared=FORCE:3</CODE> is
configured with <CODE>select/cons_res</CODE> and <CODE>CR_Core</CODE> or
<CODE>CR_Core_Memory</CODE>, then the <CODE>select/cons_res</CODE> plugin will
run up to 3 jobs on each <U>core</U> of each node in the partition. If
<CODE>CR_Socket</CODE> or <CODE>CR_Socket_Memory</CODE> is configured, then the
...
...
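A minimal slurm.conf sketch of the Shared=FORCE:3 case described above (partition and node names are invented for illustration, not taken from the patch):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    # Allow up to 3 jobs to run on each core of this partition's nodes:
    PartitionName=batch Shared=FORCE:3 Nodes=n[1-8]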
@@ -122,10 +126,28 @@ of each node in the partition.
<H3>Nodes in Multiple Partitions</H3>
<P>
SLURM has supported configuring nodes in more than one partition since version
0.7.0. The <CODE>Shared=FORCE</CODE> support in the <CODE>select/cons_res</CODE>
plugin accounts for this "multiple partition" support. Here are several
scenarios with the <CODE>select/cons_res</CODE> plugin enabled to help
understand how all of this works together:
0.7.0. The following table describes how nodes configured in two partitions with
different <CODE>Shared</CODE> settings will be allocated to jobs. Note that
"shared" jobs are jobs that are submitted to partitions configured with
<CODE>Shared=FORCE</CODE> or with <CODE>Shared=YES</CODE> and the job requested
sharing with the <CODE>srun --shared</CODE> option. Conversely, "non-shared"
jobs are jobs that are submitted to partitions configured with
<CODE>Shared=NO</CODE> or <CODE>Shared=YES</CODE> and the job did <U>not</U>
request sharable resources.
</P>
<TABLE CELLPADDING=3 CELLSPACING=1 BORDER=1>
<TR><TH> </TH><TH>First job "sharable"</TH><TH>First job not
"sharable"</TH></TR>
<TR><TH>Second job "sharable"</TH><TD>Both jobs can run on the same nodes and may
share resources</TD><TD>Jobs do not run on the same nodes</TD></TR>
<TR><TH>Second job not "sharable"</TH><TD>Jobs do not run on the same nodes</TD>
<TD>Jobs can run on the same nodes but will not share resources</TD></TR>
</TABLE>
<P>
The next table contains several
scenarios with the <CODE>select/cons_res</CODE> plugin enabled to further
clarify how a node is used when it is configured in more than one partition and
the partitions have different "Shared" policies:
</P>
<TABLE CELLPADDING=3 CELLSPACING=1 BORDER=1>
<TR><TH>SLURM configuration</TH>
...
...
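The table above is easier to read with a concrete configuration in mind; a hypothetical pair of partitions covering the same nodes with different sharing policies might look like:

    # Same nodes in two partitions (names and node list are illustrative):
    PartitionName=shared Shared=FORCE:4 Nodes=n[1-4]
    PartitionName=strict Shared=NO      Nodes=n[1-4]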
@@ -185,6 +207,12 @@ having memory pages swapped out and severely degraded performance.
<TD>Memory allocation is not tracked. Jobs are allocated to nodes without
considering if there is enough free memory. Swapping could occur!</TD>
</TR><TR>
<TD>SelectType=<B>select/linear</B> plus<BR>
SelectTypeParameters=<B>CR_Memory</B></TD>
<TD>Memory allocation is tracked. Nodes that do not have enough available
memory to meet the jobs memory requirement will not be allocated to the job.
</TD>
</TR><TR>
<TD>SelectType=<B>select/cons_res</B><BR>
Plus one of the following:<BR>
SelectTypeParameters=<B>CR_Core</B><BR>
...
...
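The new table row for select/linear with memory tracking corresponds to a slurm.conf fragment along these lines (a sketch based on the row text):

    SelectType=select/linear
    SelectTypeParameters=CR_Memory   # track memory; skip nodes with too little free memory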
@@ -200,32 +228,38 @@ SelectTypeParameters=<B>CR_Core_Memory</B><BR>
SelectTypeParameters=<B>CR_CPU_Memory</B><BR>
SelectTypeParameters=<B>CR_Socket_Memory</B></TD>
<TD>Memory allocation for all jobs are tracked. Nodes that do not have enough
available memory to meet the job's memory requirement will not be allocated to
available memory to meet the jobs memory requirement will not be allocated to
the job.</TD>
</TR>
</TABLE>
<P>Users can specify their job's memory requirements one of two ways.
<CODE>--mem=<num></CODE> can be used to specify the job's memory
requirement on a per allocated node basis. This option is probably best suited
for use with the <CODE>select/linear</CODE> plugin, which allocates
whole nodes to jobs.
<CODE>--mem-per-cpu=<num></CODE> can be used to specify the job's
memory requirement on a per allocated CPU basis. This is probably best suited
for use with the <CODE>select/cons_res</CODE> plugin which can
<P>Users can specify their job's memory requirements one of two ways.
The <CODE>srun --mem=<num></CODE> option can be used to specify the jobs memory
requirement on a per allocated node basis. This option is recommended
for use with the <CODE>select/linear</CODE> plugin, which allocates
whole nodes to jobs.
The <CODE>srun --mem-per-cpu=<num></CODE> option can be used to specify the jobs
memory requirement on a per allocated CPU basis. This is recommended
for use with the <CODE>select/cons_res</CODE> plugin which can
<P>Default and maximum values for memory on a per node or per CPU basis can
be configured using the following options: <CODE>DefMemPerCPU</CODE>,
<CODE>DefMemPerNode</CODE>, <CODE>MaxMemPerCPU</CODE> and <CODE>MaxMemPerNode</CODE>.
be configured by the system administrator using the following
<CODE>slurm.conf</CODE> options: <CODE>DefMemPerCPU</CODE>,
<CODE>DefMemPerNode</CODE>, <CODE>MaxMemPerCPU</CODE> and
<CODE>MaxMemPerNode</CODE>.
Users can use the <CODE>--mem</CODE> or <CODE>--mem-per-cpu</CODE> option
at job submission time to specify their memory requirements.
Enforcement of a job's memory allocation is performed by the accounting
plugin, which periodically gathers data about running jobs. Set
at job submission time to override the default value, but they cannot exceed
the maximum value.
</P><P>
Enforcement of a jobs memory allocation is performed by setting the "maximum
data segment size" and the "maximum virtual memory size" system limits to the
appropriate values before launching the tasks. Enforcement is also managed by
the accounting plugin, which periodically gathers data about running jobs. Set
<CODE>JobAcctGather</CODE> and <CODE>JobAcctFrequency</CODE> to
values suitable for your system.</P>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 8 July 2008</p>
<p style="text-align:center;">Last modified 2 December 2008</p>
<!--#include virtual="footer.txt"-->
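To illustrate the per-node and per-CPU memory options discussed in this file, a hypothetical submission plus administrator-side limits could look like the following (all numbers are examples, not defaults):

    # Per-node request, the natural fit for select/linear:
    srun -N2 --mem=2048 ./my_app          # 2048 MB per allocated node (example)
    # Per-CPU request, the natural fit for select/cons_res:
    srun -n8 --mem-per-cpu=512 ./my_app   # 512 MB per allocated CPU (example)

    # slurm.conf defaults and caps (example values):
    DefMemPerCPU=512
    MaxMemPerCPU=2048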
doc/html/gang_scheduling.shtml  (+13, -12)
...
...
@@ -3,12 +3,16 @@
<H1>Gang Scheduling</H1>
<P>
SLURM version 1.2 and earlier supported dedication of resources
to jobs.
Beginning in SLURM version 1.3, gang scheduling is supported.
Gang scheduling is when two or more jobs are allocated to the same resources
and these jobs are alternately suspended to let all of the tasks of each
job have full access to the shared resources for a period of time.
SLURM version 1.2 and earlier supported dedication of resources to jobs.
Beginning in SLURM version 1.3, timesliced gang scheduling is supported.
Timesliced gang scheduling is when two or more jobs are allocated to the same
resources and these jobs are alternately suspended to let one job at a time have
dedicated access to the resources for a configured period of time.
</P>
<P>
Preemptive priority job scheduling is another form of gang-scheduling that is
supported by SLURM. See the <a href="preempt.html">Preemption</a> document for
more information.
</P>
<P>
A resource manager that supports timeslicing can improve it's responsiveness
...
...
@@ -87,7 +91,8 @@ the overhead of gang scheduling.
The <I>FORCE</I> option now supports an additional parameter that controls
how many jobs can share a resource (FORCE[:max_share]). By default the
max_share value is 4. To allow up to 6 jobs from this partition to be
allocated to a common resource, set <I>Shared=FORCE:6</I>.
allocated to a common resource, set <I>Shared=FORCE:6</I>. To only let 2 jobs
timeslice on the same resources, set <I>Shared=FORCE:2</I>.
</LI>
</UL>
<P>
...
...
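A short sketch of the FORCE[:max_share] parameter discussed above (partition name and node list are made up):

    # Up to 6 jobs may timeslice on the same resources in this partition:
    PartitionName=batch Shared=FORCE:6 Nodes=n[1-16]
    # Or limit timeslicing to 2 jobs per resource:
    # PartitionName=batch Shared=FORCE:2 Nodes=n[1-16]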
@@ -489,10 +494,6 @@ around on the cores to maximize performance. This is different than when
<H2>Future Work</H2>
<P>
Priority scheduling and preemptive scheduling are other forms of gang
scheduling that are currently under development for SLURM.
</P>
<P>
<B>Making use of swap space</B>: (note that this topic is not currently
scheduled for development, unless someone would like to pursue this) It should
...
...
@@ -508,6 +509,6 @@ For now this idea could be experimented with by disabling memory support in the
selector and submitting appropriately sized jobs.
</P>
<p style="text-align:center;">Last modified 7 July 2008</p>
<p style="text-align:center;">Last modified 5 December 2008</p>
<!--#include virtual="footer.txt"-->
doc/html/preempt.shtml  (+124, -75)
...
...
@@ -5,10 +5,11 @@
<P>
SLURM version 1.2 and earlier supported dedication of resources
to jobs based on a simple "first come, first served" policy with backfill.
Beginning in SLURM version 1.3, priority-based <I>preemption</I> is supported.
Preemption is the act of suspending one or more "low-priority" jobs to let a
"high-priority" job run uninterrupted until it completes. Preemption provides
the ability to prioritize the workload on a cluster.
Beginning in SLURM version 1.3, priority partitions and priority-based
<I>preemption</I> are supported. Preemption is the act of suspending one or more
"low-priority" jobs to let a "high-priority" job run uninterrupted until it
completes. Preemption provides the ability to prioritize the workload on a
cluster.
</P>
<P>
The SLURM version 1.3.1 <I>sched/gang</I> plugin supports preemption.
...
...
@@ -30,13 +31,11 @@ There are several important configuration parameters relating to preemption:
<LI>
<B>SelectType</B>: The SLURM <I>sched/gang</I> plugin supports nodes
allocated by the <I>select/linear</I> plugin and socket/core/CPU resources
allocated by the <I>select/cons_res</I> plugin.
See <A HREF="#future_work">Future Work</A> below for more
information on "preemption with consumable resources".
allocated by the <I>select/cons_res</I> plugin.
</LI>
<LI>
<B>SelectTypeParameter</B>: Since resources will be getting overallocated
with jobs (the preempted job will remain in memory), the resource selection
with jobs (suspended jobs remain in memory), the resource selection
plugin should be configured to track the amount of memory used by each job to
ensure that memory page swapping does not occur. When <I>select/linear</I> is
chosen, we recommend setting <I>SelectTypeParameter=CR_Memory</I>. When
...
...
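Combining the SelectType and SelectTypeParameters recommendations in this hunk, a plausible slurm.conf fragment for preemption with whole-node allocation would be (a sketch, not taken from the patch):

    SchedulerType=sched/gang
    SelectType=select/linear
    SelectTypeParameters=CR_Memory   # track memory so suspended jobs do not cause swapping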
@@ -55,12 +54,14 @@ Users can use the <I>--mem</I> or <I>--mem-per-cpu</I> option
at job submission time to specify their memory requirements.
</LI>
<LI>
<B>JobAcctGatherType and JobAcctGatherFrequency</B>:
If you wish to enforce memory limits, accounting must be enabled
using the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I>
parameters. If accounting is enabled and a job exceeds its configured
memory limits, it will be canceled in order to prevent it from
adversely effecting other jobs sharing the same resources.
<B>JobAcctGatherType and JobAcctGatherFrequency</B>: The "maximum data segment
size" and "maximum virtual memory size" system limits will be configured for
each job to ensure that the job does not exceed its requested amount of memory.
If you wish to enable additional enforcement of memory limits, configure job
accounting with the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I>
parameters. When accounting is enabled and a job exceeds its configured memory
limits, it will be canceled in order to prevent it from adversely effecting
other jobs sharing the same resources.
</LI>
<LI>
<B>SchedulerType</B>: Configure the <I>sched/gang</I> plugin by setting
...
...
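If the additional accounting-based enforcement described above is wanted, the two named parameters would be set roughly as follows (the plugin name and 30-second interval are common choices, not mandated by the text):

    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30   # sample running jobs every 30 seconds (example)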
@@ -70,24 +71,10 @@ adversely effecting other jobs sharing the same resources.
<B>Priority</B>: Configure the partition's <I>Priority</I> setting relative to
other partitions to control the preemptive behavior. If two jobs from two
different partitions are allocated to the same resources, the job in the
partition with the greater <I>Priority</I> value will preempt the job in the
partition with the lesser <I>Priority</I> value. If the <I>Priority</I> values
of the two partitions are equal then no preemption will occur, and the two jobs
will run simultaneously on the same resources. The default <I>Priority</I> value
is 1.
</LI>
<LI>
<B>Shared</B>: Configure the partitions <I>Shared</I> setting to
<I>FORCE</I> for all partitions that will preempt or that will be preempted. The
<I>FORCE</I> setting is required to enable the select plugins to overallocate
resources. Jobs submitted to a partition that does not share it's resources will
not preempt other jobs, nor will those jobs be preempted. Instead those jobs
will wait until the resources are free for non-shared use by each job.
<BR>
The <I>FORCE</I> option now supports an additional parameter that controls
how many jobs can share a resource within the partition (FORCE[:max_share]). By
default the max_share value is 4. To disable timeslicing within a partition but
enable preemption with other partitions, set <I>Shared=FORCE:1</I>.
partition with the greater <I>Priority</I> value will preempt the job in the
partition with the lesser <I>Priority</I> value. If the <I>Priority</I> values
of the two partitions are equal then no preemption will occur. The default
<I>Priority</I> value is 1.
</LI>
<LI>
<B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds.
...
...
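Tying the partition Priority setting to the SchedulerTimeSlice interval mentioned at the end of this hunk, a minimal sketch (the partition names echo the example later in this file; the node list is illustrative):

    SchedulerTimeSlice=30   # seconds (the default noted in the text)
    PartitionName=active Priority=1 Default=YES Nodes=n[12-16]
    PartitionName=hipri  Priority=2 Nodes=n[12-16]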
@@ -113,9 +100,10 @@ SLURM requires a full restart of the daemons. If you just change the partition
When enabled, the <I>sched/gang</I> plugin keeps track of the resources
allocated to all jobs. For each partition an "active bitmap" is maintained that
tracks all concurrently running jobs in the SLURM cluster. Each partition also
maintains a job list for that partition, and a list of "shadow" jobs. These
"shadow" jobs are running jobs from higher priority partitions that "cast
shadows" on the active bitmaps of the lower priority partitions.
maintains a job list for that partition, and a list of "shadow" jobs. The
"shadow" jobs are job allocations from higher priority partitions that "cast
shadows" on the active bitmaps of the lower priority partitions. Jobs in lower
priority partitions that are caught in these "shadows" will be suspended.
</P>
<P>
Each time a new job is allocated to resources in a partition and begins running,
...
...
@@ -128,13 +116,22 @@ bitmaps of the lower priority partitions are rebuilt to see if any suspended
jobs can be resumed.
</P>
<P>
The gang scheduler plugin is primarily designed to be <I>reactive</I> to the
resource allocation decisions made by the Selector plugins. This is why
<I>Shared=FORCE</I> is required in each partition. The <I>Shared=FORCE</I>
setting enables the <I>select/linear</I> and <I>select/cons_res</I> plugins to
overallocate the resources between partitions. This keeps all of the node
placement logic in the <I>select</I> plugins, and leaves the gang scheduler in
charge of controlling which jobs should run on the overallocated resources.
The gang scheduler plugin is designed to be <I>reactive</I> to the resource
allocation decisions made by the "select" plugins. The "select" plugins have
been enhanced to recognize when "sched/gang" has been configured, and to factor
in the priority of each partition when selecting resources for a job. When
choosing resources for each job, the selector avoids resources that are in use
by other jobs (unless sharing has been configured, in which case it does some
load-balancing). However, when "sched/gang" is enabled, the select plugins may
choose resources that are already in use by jobs from partitions with a lower
priority setting, even when sharing is disabled in those partitions.
</P>
<P>
This leaves the gang scheduler in charge of controlling which jobs should run on
the overallocated resources. The <I>sched/gang</I> plugin suspends jobs via the
same internal functions that support <I>scontrol suspend</I> and <I>scontrol
resume</I>. A good way to observe the act of preemption is by running <I>watch
squeue</I> in a terminal window.
</P>
<P>
The <I>sched/gang</I> plugin suspends jobs via the same internal functions that
...
...
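The same suspend/resume machinery can be driven by hand, and preemption can be observed as suggested above (job id 490 refers to the high-priority job in the example that follows):

    scontrol suspend 490   # suspend a running job
    scontrol resume 490    # resume it
    watch squeue           # watch jobs flip between R (running) and S (suspended)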
@@ -146,9 +143,8 @@ window.
<H2>A Simple Example</H2>
<P>
The following example is configured with <I>select/linear</I>,
<I>sched/gang</I>, and <I>Shared=FORCE:1</I>. This example takes place on a
cluster of 5 nodes:
The following example is configured with <I>select/linear</I> and
<I>sched/gang</I>. This example takes place on a cluster of 5 nodes:
</P>
<PRE>
[user@n16 ~]$ <B>sinfo</B>
...
...
@@ -161,8 +157,8 @@ Here are the Partition settings:
</P>
<PRE>
[user@n16 ~]$ <B>grep PartitionName /shared/slurm/slurm.conf</B>
PartitionName=active Priority=1 Default=YES Shared=FORCE:1 Nodes=n[12-16]
PartitionName=hipri Priority=2 Shared=FORCE:1 Nodes=n[12-16]
PartitionName=active Priority=1 Default=YES Shared=NO Nodes=n[12-16]
PartitionName=hipri Priority=2 Shared=NO Nodes=n[12-16]
</PRE>
<P>
The <I>runit.pl</I> script launches a simple load-generating app that runs
...
...
@@ -180,7 +176,7 @@ sbatch: Submitted batch job 487
sbatch: Submitted batch job 488
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 489
[user@n16 ~]$ <B>squeue</B>
[user@n16 ~]$ <B>squeue -Si</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
485 active runit.pl user R 0:06 1 n12
486 active runit.pl user R 0:06 1 n13
...
...
@@ -194,13 +190,13 @@ Now submit a short-running 3-node job to the <I>hipri</I> partition:
<PRE>
[user@n16 ~]$ <B>sbatch -N3 -p hipri ./runit.pl 30</B>
sbatch: Submitted batch job 490
[user@n16 ~]$ <B>squeue</B>
[user@n16 ~]$ <B>squeue -Si</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
488 active runit.pl user R 0:29 1 n15
489 active runit.pl user R 0:28 1 n16
485 active runit.pl user S 0:27 1 n12
486 active runit.pl user S 0:27 1 n13
487 active runit.pl user S 0:26 1 n14
487 active runit.pl user S 0:26 1 n14
488 active runit.pl user R 0:29 1 n15
489 active runit.pl user R 0:28 1 n16
490 hipri runit.pl user R 0:03 3 n[12-14]
</PRE>
<P>
...
...
@@ -223,26 +219,79 @@ JOBID PARTITION NAME USER ST TIME NODES NODELIST
</PRE>
<H2><A NAME="future_work">Future Work</A></H2>
<P>
<B>Preemption with consumable resources</B>: This implementation of preemption
relies on intelligent job placement by the <I>select</I> plugins. As of SLURM
1.3.1 the consumable resource <I>select/cons_res</I> plugin still needs
additional enhancements to the job placement algorithm before it's preemption
support can be considered "competent". The mechanics of preemption work, but the
placement of preemptive jobs relative to any low-priority jobs may not be
optimal. The work to improve the placement of preemptive jobs relative to
existing jobs is currently in-progress.
</P>
<P>
<B>Requeue a preempted job</B>: In some situations is may be desirable to
requeue a low-priority job rather than suspend it. Suspending a job leaves the
job in memory. Requeuing a job involves terminating the job and resubmitting it
again. This will be investigated at some point in the future. Requeuing a
preempted job may make the most sense with <I>Shared=NO</I> partitions.
</P>
<p style="text-align:center;">Last modified 7 July 2008</p>
<!--#include virtual="footer.txt"-->
<H2><A NAME="future_work">Future Ideas</A></H2>
<P>
<B>More intelligence in the select plugins</B>: This implementation of
preemption relies on intelligent job placement by the <I>select</I> plugins. In
SLURM 1.3.1 the <I>select/linear</I> plugin has a decent preemptive placement
algorithm, but the consumable resource <I>select/cons_res</I> plugin had no
preemptive placement support. In SLURM 1.4 preemptive placement support was
added to the <I>select/cons_res</I> plugin, but there is still room for
improvement.
</P><P>
Take the following example:
</P>
<PRE>
[user@n8 ~]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[1-5]
hipri up infinite 5 idle n[1-5]
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 17
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 18
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 19
[user@n8 ~]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
17 active sleepme cholmes R 0:03 1 n1
18 active sleepme cholmes R 0:03 1 n2
19 active sleepme cholmes R 0:02 1 n3
[user@n8 ~]$ <B>sbatch -N3 -n6 -p hipri ./sleepme 20</B>
sbatch: Submitted batch job 20
[user@n8 ~]$ <B>squeue -Si</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
17 active sleepme cholmes S 0:16 1 n1
18 active sleepme cholmes S 0:16 1 n2
19 active sleepme cholmes S 0:15 1 n3
20 hipri sleepme cholmes R 0:03 3 n[1-3]
[user@n8 ~]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 3 alloc n[1-3]
active* up infinite 2 idle n[4-5]
hipri up infinite 3 alloc n[1-3]
hipri up infinite 2 idle n[4-5]
</PRE>
<P>
It would be more ideal if the "hipri" job were placed on nodes n[3-5], which
would allow jobs 17 and 18 to continue running. However, a new "intelligent"
algorithm would have to include factors such as job size and required nodes in
order to support ideal placements such as this, which can quickly complicate
the design. Any and all help is welcome here!
</P>
<P>
<B>Preemptive backfill</B>: the current backfill scheduler plugin
("sched/backfill") is a nice way to make efficient use of otherwise idle
resources. But SLURM only supports one scheduler plugin at a time. Fortunately,
given the design of the new "sched/gang" plugin, there is no direct overlap
between the backfill functionality and the gang-scheduling functionality. Thus,
it's possible that these two plugins could technically be merged into a new
scheduler plugin that supported preemption <U>and</U> backfill. <B>NOTE:</B>
this is only an idea based on a code review so there would likely need to be
some additional development, and plenty of testing!
</P><P>
</P>
<P>
<B>Requeue a preempted job</B>: In some situations is may be desirable to
requeue a low-priority job rather than suspend it. Suspending a job leaves the
job in memory. Requeuing a job involves terminating the job and resubmitting it
again. The "sched/gang" plugin would need to be modified to recognize when a job
is able to be requeued and when it can requeue a job (for preemption only, not
for timeslicing!), and perform the requeue request.
</P>
<p style="text-align:center;">Last modified 5 December 2008</p>
<!--#include virtual="footer.txt"-->
...
...