Commit 27bf1bbd authored by jette's avatar jette
Merge branch 'slurm-2.6' of https://github.com/SchedMD/slurm into slurm-2.6

parents 2f65854c d97bd588
<!--#include virtual="header.txt"-->

<h1><a name="top">Consumable Resources in Slurm</a></h1>

<p>Slurm, using the default node allocation plug-in, allocates nodes to jobs in
exclusive mode. This means that even when all the resources within a node are
not utilized by a given job, another job will not have access to these resources.
Nodes possess resources such as processors, memory, swap, local
disk, etc., and jobs consume these resources. The exclusive use default policy
in Slurm can result in inefficient utilization of the cluster and of its node
resources.
Slurm's <i>cons_res</i> or consumable resource plugin is available to
manage resources on a much more fine-grained basis, as described below.</p>
<h2>Using the Consumable Resource Allocation Plugin: <b>select/cons_res</b></h2>

<ul>
<li>Consumable resources have been enhanced with several new resources,
namely CPU (same as in previous versions), Socket, Core, and Memory,
as well as any combination of the logical processors with Memory:</li>
<ul>
<li><b>CPU</b> (<i>CR_CPU</i>): CPU as a consumable resource.</li>
<ul>
<li>No notion of sockets, cores, or threads.</li>
<li>On a multi-core system CPUs will be cores.</li>
<li>On a multi-core/hyperthreaded system CPUs will be threads.</li>
<li>On a single-core system CPUs are CPUs. ;-)</li>
</ul>
<li><b>Board</b> (<i>CR_Board</i>): Baseboard as a consumable resource.</li>
<li><b>Socket</b> (<i>CR_Socket</i>): Socket as a consumable resource.</li>
<li><b>Core</b> (<i>CR_Core</i>): Core as a consumable resource.</li>
<li><b>Memory</b> (<i>CR_Memory</i>): Memory <u>only</u> as a
consumable resource. Note: CR_Memory assumes Shared=Yes.</li>
<li><b>Socket and Memory</b> (<i>CR_Socket_Memory</i>): Socket
and Memory as consumable resources.</li>
<li><b>Core and Memory</b> (<i>CR_Core_Memory</i>): Core and
Memory as consumable resources.</li>
<li><b>CPU and Memory</b> (<i>CR_CPU_Memory</i>): CPU and Memory
as consumable resources.</li>
</ul>

<li>In the cases where Memory is the consumable resource, or one of
the two consumable resources, the <b>RealMemory</b> parameter, which
defines a node's amount of real memory in slurm.conf, must be
set when FastSchedule=1.</li>

<li>srun's <i>-E</i> extension for sockets, cores, and threads is
ignored within the node allocation mechanism when CR_CPU or
CR_CPU_Memory is selected. It is only considered when computing the total
number of tasks if -n is not specified.</li>

<li>The job submission commands (salloc, sbatch and srun) support the options
<i>--mem=MB</i> and <i>--mem-per-cpu=MB</i>, permitting users to specify
the maximum amount of real memory required per node or per allocated CPU.
This option is required in environments where Memory is a consumable
resource. It is important to specify enough memory, since Slurm will not allow
the application to use more than the requested amount of real memory. The
default value for --mem is 1 MB. See the srun man page for more details.
(A short example appears after this list.)</li>

<li><b>All CR_s assume Shared=No</b> or Shared=Force, EXCEPT for
<b>CR_Memory</b>, which <b>assumes Shared=Yes</b>.</li>

<li>The consumable resource plugin is enabled via SelectType and
SelectTypeParameters in slurm.conf, as shown in the excerpt below.</li>
<pre>
#
# Excerpts from sample slurm.conf file
#
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
</pre>
<li>Using <i>--overcommit</i> or <i>-O</i> is allowed. When process-to-logical-processor
pinning is enabled by using an appropriate TaskPlugin configuration parameter,
the extra processes will time-share the allocated resources.</li>
</ul>
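<p>To make the memory-related options above concrete, the following is a
minimal sketch of a configuration and of job requests when memory is one of the
consumable resources. The node names, memory sizes and application names are
hypothetical; see the slurm.conf, srun and sbatch man pages for the
authoritative syntax:</p>
<pre>
# Hypothetical slurm.conf excerpt: RealMemory must be defined for each
# node when FastSchedule=1 and memory is a consumable resource.
FastSchedule=1
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=node[01-04] Sockets=2 CoresPerSocket=4 RealMemory=16000

# Limit the job to 4000 MB of real memory on its node:
srun -N 1 --mem=4000 ./my_app

# Or limit it to 500 MB per allocated CPU for a 4-task job:
sbatch -n 4 --mem-per-cpu=500 my_job_script.sh
</pre>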
<h2>General Comments</h2>

<ul>
<li>Slurm's default <b>select/linear</b> plugin uses a best-fit algorithm
based on the number of consecutive nodes. The same node allocation approach is
used in <b>select/cons_res</b> for consistency.</li>

<li>The <b>select/cons_res</b> plugin is enabled or disabled cluster-wide.</li>

<li>In the case where <b>select/cons_res</b> is not enabled, the normal Slurm
behaviors are not disrupted. The only change users see when using the
<b>select/cons_res</b> plugin is that jobs can be co-scheduled on nodes when
resources permit it.
The rest of Slurm, such as srun and its switches (except srun -s ...), is not
affected by this plugin. Slurm is, from a user point of view, working the same
way as when using the default node selection scheme.</li>

<li>The <i>--exclusive</i> srun switch allows users to request nodes in
exclusive mode even when consumable resources is enabled. See "man srun"
for details.</li>

<li>srun's <i>-s</i> or <i>--share</i> option is incompatible with the consumable
resource environment and will therefore not be honored. Since in this
environment nodes are shared by default, <i>--exclusive</i> allows users to
obtain dedicated nodes.</li>
</ul>
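<p>Because the plugin choice is cluster-wide, users can check which selection
plugin and parameters are in effect by inspecting the scheduler configuration.
The output values below are only illustrative, assuming a cluster configured
as in the slurm.conf excerpt above:</p>
<pre>
# scontrol show config | grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
</pre>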
<p class="footer"><a href="#top">top</a></p> <p class="footer"><a href="#top">top</a></p>
...@@ -282,12 +177,12 @@ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
<h2>Example of Node Allocations Using Consumable Resource Plugin</h2>

<p>The following example illustrates the different ways four jobs
are allocated across a cluster using (1) Slurm's default allocation
(exclusive mode) and (2) a processor as consumable resource
approach.</p>

<p>It is important to understand that the example listed below is a
contrived example and is only given here to illustrate the use of CPU as
a consumable resource. Job 2 and Job 3 call for the node count to equal
the processor count. This would typically be done because
one task per node requires all of the memory, disk space, etc. The
bottleneck would not be processor count.</p>
<p>Trying to execute more than one job per node will almost certainly severely
impact a parallel job's performance.
The biggest beneficiary of CPUs as consumable resources will be serial jobs or
jobs with modest parallelism, which can effectively share resources. On many
systems with larger processor counts, jobs typically run one fewer task than
there are processors to minimize interference by the kernel and daemons.</p>

<p>The example cluster is composed of 4 nodes (10 CPUs in total):</p>

<ul>
<li>linux01 (with 2 processors), </li>
...@@ -322,12 +217,12 @@ there are processors to minimize interference by the kernel and daemons.</p>
<p class="footer"><a href="#top">top</a></p>
<h2>Using Slurm's Default Node Allocation (Non-shared Mode)</h2>

<p>The four jobs have been launched and 3 of the jobs are now
pending, waiting to get resources allocated to them. Only Job 2 is running
since it uses one CPU on all 4 nodes. This means that linux01 to linux03 each
have one idle CPU and linux04 has 3 idle CPUs.</p>
<pre>
# squeue
...@@ -339,7 +234,7 @@ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
</pre>
<p>Once Job 2 is finished, Job 3 is scheduled and runs on
linux01, linux02, and linux03. Job 3 is only using one CPU on each of the 3
nodes. Job 4 can be allocated onto the remaining idle node (linux04) so Job 3
and Job 4 can run concurrently on the cluster.</p>
...@@ -367,30 +262,29 @@ cannot be shared with other jobs.</p>
<p>The output of squeue shows that we
have 3 out of the 4 jobs allocated and running. This is two more running jobs
than under the default Slurm approach.</p>

<p>Job 2 is running on nodes linux01
to linux04. Job 2's allocation is the same as for Slurm's default allocation,
which is that it uses one CPU on each of the 4 nodes. Once Job 2 is scheduled
and running, nodes linux01, linux02 and linux03 still have one idle CPU each
and node linux04 has 3 idle CPUs. The main difference between this approach and
the exclusive mode approach described above is that idle CPUs within a node
are now allowed to be assigned to other jobs.</p>

<p>It is important to note that
<i>assigned</i> doesn't mean <i>oversubscription</i>. The consumable resource approach
tracks how much of each available resource (in our case CPUs) must be dedicated
to a given job. This allows us to prevent per-node oversubscription of
resources (CPUs).</p>

<p>Once Job 2 is running, Job 3 is
scheduled onto nodes linux01, linux02, and linux03 (using one CPU on each of the
nodes) and Job 4 is scheduled onto one of the remaining idle CPUs on linux04.</p>

<p>Job 2, Job 3, and Job 4 are now running concurrently on the cluster.</p>

<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    5       lsf sleep root PD 0:00     1 (Resources)
...@@ -441,11 +335,11 @@ other jobs if they do not use all of the resources on the nodes.</p>
to specify that they would like their allocated
nodes in exclusive mode. For more information see "man srun".
The reason for this is that users may have MPI/threaded/OpenMP
programs that will take advantage of all the CPUs within a node but only need
one MPI process per node.</p>
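<p>As an illustrative sketch of that use case (the program name, node count
and thread count are hypothetical), such a hybrid job could request whole
nodes even though it launches only one task per node:</p>
<pre>
# One MPI task per node, each node allocated exclusively so that the
# task's threads can use every CPU on the node.
export OMP_NUM_THREADS=4
srun -N 2 --ntasks-per-node=1 --exclusive ./hybrid_mpi_openmp_app
</pre>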
<p class="footer"><a href="#top">top</a></p> <p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 3 February 2012</p> <p style="text-align:center;">Last modified 14 August 2013</p>
<!--#include virtual="footer.txt"--> <!--#include virtual="footer.txt"-->