The Maui Scheduler</a> or
Moab Cluster Suite</a>.
Please refer to its documentation for help. For any scheduler, you can check priorities
of jobs using the command <span class="commandline">scontrol show job</span>.</p>
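<p>For example, to display a specific job's state, including its priority (job id 1234 is a placeholder):</p>
<blockquote>
<p><span class="commandline">scontrol show job 1234</span></p>
</blockquote>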
<p><a name="sharing"><b>4. Why does the srun --overcommit option not permit multiple jobs
to run on nodes?</b></a><br>
The <b>--overcommit</b> option is a means of indicating that a job or job step is willing
to execute more than one task per processor in the job's allocation. For example,
consider a cluster of two-processor nodes. The srun command line might be something
of this sort:</p>
<blockquote>
<p><span class="commandline">srun --ntasks=4 --nodes=1 a.out</span></p>
</blockquote>
<p>This will result in not one, but two nodes being allocated so that each of the four
tasks is given its own processor. Note that the srun <b>--nodes</b> option specifies
a minimum node count and optionally a maximum node count. A command line of</p>
<blockquote>
<p><span class="commandline">srun --ntasks=4 --nodes=1-1 a.out</span></p>
</blockquote>
<p>would result in the request being rejected. If the <b>--overcommit</b> option
is added to either command line, then only one node will be allocated for all
four tasks to use.</p>
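<p>For example, adding <b>--overcommit</b> to the second command line should run all four
tasks on a single node:</p>
<blockquote>
<p><span class="commandline">srun --ntasks=4 --nodes=1-1 --overcommit a.out</span></p>
</blockquote>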
<p>More than one job can execute simultaneously on the same nodes through the use
of srun's <b>--shared</b> option in conjunction with the <b>Shared</b> parameter
in SLURM's partition configuration. See the man pages for srun and slurm.conf for
more information.</p>
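<p>For example, a partition definition of this sort in <i>slurm.conf</i> (the partition
and node names are placeholders) permits jobs to share nodes:</p>
<pre>
PartitionName=debug Nodes=tux[0-15] Shared=YES Default=YES
</pre>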
<p><a name="purge"><b>5. Why is my job killed prematurely?</b></a><br>
SLURM has a job purging mechanism to remove inactive jobs (resource allocations)
before they reach their time limit, which may be infinite.
This inactivity time limit is configurable by the system administrator.
You can check its value with the command</p>
<blockquote>
<p><span class="commandline">scontrol show config | grep InactiveLimit</span></p>
</blockquote>
<p>The value of InactiveLimit is in seconds.
A zero value indicates that job purging is disabled.
A job is considered inactive if it has no active job steps or if the srun
command creating the job is not responding.</p>
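<p>For example, a system administrator could disable job purging entirely with this
<i>slurm.conf</i> entry:</p>
<pre>
InactiveLimit=0
</pre>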
Everything after the command <span class="commandline">srun</span> is
examined to determine if it is a valid option for srun. The first
token that is not a valid option for srun is considered the command
to execute and everything after that is treated as an option to
the command. For example:</p>
<blockquote>
<p><span class="commandline">srun -N2 hostname -pdebug</span></p>
</blockquote>
<p>srun processes "-N2" as an option to itself. "hostname" is the
command to execute and "-pdebug" is treated as an option to the
hostname command, which will change the name of the computer
on which SLURM executes the command. Very bad: <b>don't run
this command as user root!</b></p>
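<p>To pass "-pdebug" to srun itself (as its partition option), place it before the
command name:</p>
<blockquote>
<p><span class="commandline">srun -N2 -pdebug hostname</span></p>
</blockquote>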
<p>It was designed to perform backfill node scheduling for a homogeneous cluster.
It does not manage scheduling on individual processors (or other consumable
resources). It also does not update the required or excluded node list of
individual jobs. These are the current limitations. You can use the
scontrol show command to check whether these conditions apply, as shown
in the example after the list below.</p>
<ul>
<li>partition: State=UP</li>
<li>partition: RootOnly=NO</li>
the partition</li>
the partition</li>
<li>job: MinProcs or MinNodes not to exceed partition's MaxNodes</li>
</ul>
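<p>For example, to check these settings for the <i>debug</i> partition and job id 1234
(both placeholders):</p>
<pre>
scontrol show partition debug
scontrol show job 1234
</pre>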
<p>As soon as any priority-ordered job in the partition's queue fails to
satisfy the request, no lower-priority job in that partition's queue
will be considered as a backfill candidate. Any programmer wishing
to augment the existing code is welcome to do so.</p>
or access to compute nodes?</b></a><br>
First, enable SLURM's use of PAM by setting <i>UsePAM=1</i> in
<i>slurm.conf</i>.<br>
Second, establish a PAM configuration file for slurm in <i>/etc/pam.d/slurm</i>.
A basic configuration you might use is:</p>
<pre>
auth required pam_localuser.so
account required pam_unix.so
session required pam_limits.so
</pre>
<p>Third, set the desired limits in <i>/etc/security/limits.conf</i>.
For example, to set the locked memory limit to unlimited for all users:</p>
<pre>
* hard memlock unlimited
* soft memlock unlimited
</pre>
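<p>You can verify the limit actually seen by a job with something of this sort:</p>
<blockquote>
<p><span class="commandline">srun /bin/sh -c 'ulimit -l'</span></p>
</blockquote>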
<p>Finally, you need to disable SLURM's forwarding of the limits from the
session in which the <i>srun</i> initiating the job ran. By default,
all resource limits are propagated from that session. For example, adding
the following line to <i>slurm.conf</i> will prevent the locked memory
limit from being propagated: <i>PropagateResourceLimitsExcept=MEMLOCK</i>.</p>
<p>We also have a PAM module for SLURM that prevents users from
logging into nodes that they have not been allocated (except for user
partition. You can control the frequency of this ping with the
backup controller?</b></a><br>
If the cluster's computers used for the primary or backup controller
will be out of service for an extended period of time, it may be desirable
to relocate them. To do so, follow this procedure (a configuration sketch
follows the list):</p>
<ol>
<li>Stop all SLURM daemons</li>
<li>Modify the <i>ControlMachine</i>, <i>ControlAddr</i>,
<i>BackupController</i>, and/or <i>BackupAddr</i> in the <i>slurm.conf</i> file</li>
<li>Distribute the updated <i>slurm.conf</i> file to all nodes</li>
<li>Restart all SLURM daemons</li>
</ol>
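<p>For example, the updated controller entries in <i>slurm.conf</i> might read
(hostnames are placeholders):</p>
<pre>
ControlMachine=newprimary
BackupController=newbackup
</pre>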
<p>There should be no loss of any running or pending jobs. Ensure that
any nodes added to the cluster have a current <i>slurm.conf</i> file
installed.
<b>CAUTION:</b> If two nodes are simultaneously configured as the primary
<!--#include virtual="header.txt"-->
<h1>Mailing Lists</h1>
<p>We maintain two SLURM mailing lists:</p>
<ul>
<li><b>slurm-announce</b> is designated for communications about SLURM releases
[low traffic].</li>
<li><b>slurm-dev</b> is designated for communications to SLURM developers
[high traffic at times].</li>
</ul>
<p>To subscribe to either list, send a message to
<a href="mailto:majordomo@lists.llnl.gov">majordomo@lists.llnl.gov</a> with the body of the
message containing the word "subscribe" followed by the list name and your e-mail address
(if not the sender). For example, to join slurm-dev (the address shown is a placeholder):</p>
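<blockquote>
<p><span class="commandline">subscribe slurm-dev joe@example.com</span></p>
</blockquote>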