Commit b345c8d2 authored by Don Lipari

Fixed typos in faq.shtml

parent e0db5afd
@@ -99,12 +99,12 @@ execute on DOWN nodes?</li>
<li><a href="#batch_lost">What is the meaning of the error
&quot;Batch JobId=# missing from master node, killing it&quot;?</a></li>
<li><a href="#accept_again">What does the message
-&quot;srun: error: Unable to accept connection: Resources termporarily unavailable&quot;
+&quot;srun: error: Unable to accept connection: Resources temporarily unavailable&quot;
indicate?</a></li>
<li><a href="#task_prolog">How could I automatically print a job's
SLURM job ID to its standard output?</li>
<li><a href="#moab_start">I run SLURM with the Moab or Maui scheduler.
-How can I start a job under SLURM wihtout the scheduler?</li>
+How can I start a job under SLURM without the scheduler?</li>
<li><a href="#orphan_procs">Why are user processes and <i>srun</i>
running even though the job is supposed to be completed?</li>
<li><a href="#slurmd_oom">How can I prevent the <i>slurmd</i> and
@@ -129,7 +129,7 @@ for an extended period of time.
This may be indicative of processes hung waiting for a core file
to complete I/O or operating system failure.
If this state persists, the system administrator should check for processes
-associated with the job that can not be terminated then use the
+associated with the job that cannot be terminated then use the
<span class="commandline">scontrol</span> command to change the node's
state to DOWN (e.g. &quot;scontrol update NodeName=<i>name</i> State=DOWN Reason=hung_completing&quot;),
reboot the node, then reset the node's state to IDLE
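<p>For reference, a minimal sketch of that recovery sequence (the node name
<i>tux3</i> is only a placeholder):</p>
<pre>
# Take the hung node out of service
scontrol update NodeName=tux3 State=DOWN Reason=hung_completing
# ... reboot the node ...
# Return it to service once it responds normally again
scontrol update NodeName=tux3 State=IDLE
</pre>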
@@ -184,7 +184,7 @@ until no previously submitted job is pending. If the scheduler type is <b>backfill</b>
then jobs will generally be executed in the order of submission for a given partition
with one exception: later submitted jobs will be initiated early if doing so does
not delay the expected execution time of an earlier submitted job. In order for
-backfill scheduling to be effective, users jobs should specify reasonable time
+backfill scheduling to be effective, users' jobs should specify reasonable time
limits. If jobs do not specify time limits, then all jobs will receive the same
time limit (that associated with the partition), and the ability to backfill schedule
jobs will be limited. The backfill scheduler does not alter job specifications
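<p>For example, a job could be submitted with an explicit time limit so the
backfill scheduler can plan around it (a sketch; the 30-minute value and
program name are only illustrative):</p>
<pre>
# Request 2 nodes with a 30-minute time limit
srun -N2 -t 30 ./my_app
</pre>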
@@ -226,7 +226,7 @@ more information.</p>
SLURM has a job purging mechanism to remove inactive jobs (resource allocations)
before reaching its time limit, which could be infinite.
This inactivity time limit is configurable by the system administrator.
-You can check it's value with the command</p>
+You can check its value with the command</p>
<blockquote>
<p><span class="commandline">scontrol show config | grep InactiveLimit</span></p>
</blockquote>
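<p>If a different limit is wanted, it is set in <i>slurm.conf</i>; a sketch
(the 300-second value is only an example):</p>
<pre>
# Purge allocations with no active job steps after 5 minutes;
# a value of 0 disables this inactivity limit
InactiveLimit=300
</pre>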
@@ -251,7 +251,7 @@ the command. For example:</p>
</blockquote>
<p>srun processes "-N2" as an option to itself. "hostname" is the
command to execute and "-pdebug" is treated as an option to the
-hostname command. Which will change the name of the computer
+hostname command. This will change the name of the computer
on which SLURM executes the command - Very bad, <b>Don't run
this command as user root!</b></p>
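<p>The intended invocation keeps every srun option ahead of the program name,
for example:</p>
<pre>
# "-p debug" is now parsed by srun as its partition option
# rather than being passed as an argument to hostname
srun -N2 -p debug hostname
</pre>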
@@ -305,7 +305,7 @@ that the processes associated with the switch have been terminated
to avoid the possibility of re-using switch resources for other
jobs (even on different nodes).
SLURM considers jobs COMPLETED when all nodes allocated to the
-job are either DOWN or confirm termination of all it's processes.
+job are either DOWN or confirm termination of all its processes.
This enables SLURM to purge job information in a timely fashion
even when there are many failing nodes.
Unfortunately the job step information may persist longer.</p>
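<p>One way to see which job steps are still recorded is squeue's step display
(a sketch):</p>
<pre>
# List job steps rather than jobs
squeue -s
</pre>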
@@ -490,7 +490,7 @@ indicate?</b></a><br>
The srun command normally terminates when the standard output and
error I/O from the spawned tasks end. This does not necessarily
happen at the same time that a job step is terminated. For example,
-a file system problem could render a spawned tasks non-killable
+a file system problem could render a spawned task non-killable
at the same time that I/O to srun is pending. Alternately a network
problem could prevent the I/O from being transmitted to srun.
In any event, the srun command is notified when a job step is
@@ -526,8 +526,8 @@ If the user's resource limit is not propagated, the limit in
effect for the <i>slurmd</i> daemon will be used for the spawned job.
A simple way to control this is to insure that user <i>root</i> has a
sufficiently large resource limit and insuring that <i>slurmd</i> takes
-full advantage of this limit. For example, you can set user's root's
-locked memory limit limit to be unlimited on the compute nodes (see
+full advantage of this limit. For example, you can set user root's
+locked memory limit ulimit to be unlimited on the compute nodes (see
<i>"man limits.conf"</i>) and insuring that <i>slurmd</i> takes
full advantage of this limit (e.g. by adding something like
<i>"ulimit -l unlimited"</i> to the <i>/etc/init.d/slurm</i>
@@ -632,11 +632,11 @@ accommodate all jobs allocated to a node, either running or suspended.
<p><a name="fast_schedule"><b>2. How can I configure SLURM to use
the resources actually found on a node rather than what is defined
in <i>slurm.conf</i>?</b></a><br>
-SLURM can either base it's scheduling decisions upon the node
+SLURM can either base its scheduling decisions upon the node
configuration defined in <i>slurm.conf</i> or what each node
actually returns as available resources.
This is controlled using the configuration parameter <i>FastSchedule</i>.
-Set it's value to zero in order to use the resources actually
+Set its value to zero in order to use the resources actually
found on each node, but with a higher overhead for scheduling.
A value of one is the default and results in the node configuration
defined in <i>slurm.conf</i> being used. See &quot;man slurm.conf&quot;
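<p>A sketch of the corresponding <i>slurm.conf</i> entry:</p>
<pre>
# Schedule against the resources each node actually reports rather than
# the node definitions in slurm.conf (at some added scheduling overhead)
FastSchedule=0
</pre>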
@@ -667,7 +667,7 @@ See the slurm.conf and srun man pages for more information.</p>
<p><a name="multi_job"><b>5. How can I control the execution of multiple
jobs per node?</b></a><br>
-There are two mechanism to control this.
+There are two mechanisms to control this.
If you want to allocate individual processors on a node to jobs,
configure <i>SelectType=select/cons_res</i>.
See <a href="cons_res.html">Consumable Resources in SLURM</a>
@@ -691,7 +691,7 @@ for more information (e.g. &quot;slurmctld -Dvvvvv&quot;).
<p><a name="sigpipe"><b>7. Why are user tasks intermittently dying
at launch with SIGPIPE error messages?</b></a><br>
-If you are using ldap or some other remote name service for
+If you are using LDAP or some other remote name service for
username and groups lookup, chances are that the underlying
libc library functions are triggering the SIGPIPE. You can likely
work around this problem by setting <i>CacheGroups=1</i> in your slurm.conf
@@ -765,7 +765,7 @@ to relocate them. In order to do so, follow this procedure:</p>
<li>Stop all SLURM daemons</li>
<li>Modify the <i>ControlMachine</i>, <i>ControlAddr</i>,
<i>BackupController</i>, and/or <i>BackupAddr</i> in the <i>slurm.conf</i> file</li>
-<li>Distribute the updated <i>slurm.conf</i> file file to all nodes</li>
+<li>Distribute the updated <i>slurm.conf</i> file to all nodes</li>
<li>Restart all SLURM daemons</li>
</ol>
<p>There should be no loss of any running or pending jobs. Insure that
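<p>A sketch of the <i>slurm.conf</i> lines touched in step 2 of the procedure
above (the host names are placeholders):</p>
<pre>
ControlMachine=newprimary
ControlAddr=newprimary.example.com
BackupController=newbackup
BackupAddr=newbackup.example.com
</pre>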
@@ -803,7 +803,7 @@ cluster?</b></a><br>
Yes, this can be useful for testing purposes.
It has also been used to partition "fat" nodes into multiple SLURM nodes.
There are two ways to do this.
-The best method for most conditins is to run one <i>slurmd</i>
+The best method for most conditions is to run one <i>slurmd</i>
daemon per emulated node in the cluster as follows.
<ol>
<li>When executing the <i>configure</i> program, use the option
@@ -822,7 +822,7 @@ slurm.conf. </li>
of the node that it is supposed to serve on the execute line.</li>
</ol>
<p>It is strongly recommended that SLURM version 1.2 or higher be used
-for this due to it's improved support for multiple slurmd daemons.
+for this due to its improved support for multiple slurmd daemons.
See the
<a href="programmer_guide.html#multiple_slurmd_support">Programmers Guide</a>
for more details about configuring multiple slurmd support.</p>
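<p>A sketch of starting one of the emulated daemons by hand, assuming a build
with multiple-slurmd support and an emulated node named <i>tux12</i>:</p>
<pre>
# Run a slurmd that serves the emulated node "tux12"
slurmd -N tux12
</pre>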
@@ -983,7 +983,7 @@ This error indicates that a job credential generated by the slurmctld daemon
corresponds to a job that the slurmd daemon has already revoked.
The slurmctld daemon selects job ID values based upon the configured
value of <b>FirstJobId</b> (the default value is 1) and each job gets
-an value one large than the previous job.
+a value one larger than the previous job.
On job termination, the slurmctld daemon notifies the slurmd on each
allocated node that all processes associated with that job should be
terminated.
@@ -1031,7 +1031,7 @@ of the program.
<p><a name="rpm"><b>27. Why isn't the auth_none.so (or other file) in a
SLURM RPM?</b></a><br>
-The auth_none plugin is in a separete RPM and not built by default.
+The auth_none plugin is in a separate RPM and not built by default.
Using the auth_none plugin means that SLURM communications are not
authenticated, so you probably do not want to run in this mode of operation
except for testing purposes. If you want to build the auth_none RPM then
@@ -1080,7 +1080,7 @@ sinfo -t drain -h -o "scontrol update nodename='%N' state=drain reason='%E'"
execute on DOWN nodes?</a></b><br>
Hierarchical communications are used for sending this message. If there
are DOWN nodes in the communications hierarchy, messages will need to
-be re-routed. This limits SLURM's ability to tightly synchroize the
+be re-routed. This limits SLURM's ability to tightly synchronize the
execution of the <i>HealthCheckProgram</i> across the cluster, which
could adversely impact performance of parallel applications.
The use of CRON or node startup scripts may be better suited to insure
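<p>A sketch of the cron alternative (the script path and five-minute interval
are placeholders for a site-specific health check):</p>
<pre>
# /etc/crontab entry on each compute node
*/5 * * * *  root  /usr/sbin/node_health_check
</pre>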
@@ -1113,14 +1113,14 @@ is executing. If a batch program is expected to be running on some
node (i.e. node zero of the job's allocation) and is not found, the
message above will be logged and the job cancelled. This typically is
associated with exhausting memory on the node or some other critical
-failure that can not be recovered from. The equivalent message in
+failure that cannot be recovered from. The equivalent message in
earlier releases of slurm is
&quot;Master node lost JobId=#, killing it&quot;.
<p><a name="accept_again"><b>33. What does the messsage
-&quot;srun: error: Unable to accept connection: Resources termporarily unavailable&quot;
+&quot;srun: error: Unable to accept connection: Resources temporarily unavailable&quot;
indicate?</b></a><br>
-This has been reported on some larger clusters running Suse Linux when
+This has been reported on some larger clusters running SUSE Linux when
a user's resource limits are reached. You may need to increase limits
for locked memory and stack size to resolve this problem.
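<p>A quick way to inspect and raise those limits in the shell that launches
srun (a sketch; whether the hard limits can be raised depends on the site's
configuration):</p>
<pre>
ulimit -l            # current locked memory limit
ulimit -s            # current stack size limit
ulimit -l unlimited
ulimit -s unlimited
</pre>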
@@ -1128,7 +1128,7 @@ for locked memory and stack size to resolve this problem.
SLURM job ID to its standard output?</b></a></br>
The configured <i>TaskProlog</i> is the only thing that can write to
the job's standard output or set extra environment variables for a job
-or job step. To write to the job's standard output, preceed the message
+or job step. To write to the job's standard output, precede the message
with "print ". To export environment variables, output a line of this
form "export name=value". The example below will print a job's SLURM
job ID and allocated hosts for a batch job only.
@@ -1150,7 +1150,7 @@ fi
</pre>
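<p>Only the tail of that example is visible in this hunk; a comparable sketch
of a <i>TaskProlog</i> script (the original's exact test for the batch step is
not shown here) would be:</p>
<pre>
#!/bin/sh
# Lines beginning with "print " are copied to the task's standard output.
# Emit only from task zero to avoid duplicate output.
if [ "X$SLURM_PROCID" = "X0" ]; then
  echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
  echo "print SLURM_NODELIST = $SLURM_NODELIST"
fi
</pre>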
<p><a name="moab_start"><b>35. I run SLURM with the Moab or Maui scheduler.
-How can I start a job under SLURM wihtout the scheduler?</b></a></br>
+How can I start a job under SLURM without the scheduler?</b></a></br>
When SLURM is configured to use the Moab or Maui scheduler, all submitted
jobs have their priority initialized to zero, which SLURM treats as a held
job. The job only begins when Moab or Maui decide where and when to start
@@ -1163,7 +1163,7 @@ $ scontrol update jobid=1234 priority=1000000
</pre>
<p>Note that changes in the configured value of <i>SchedulerType</i> only
take effect when the <i>slurmctld</i> daemon is restarted (reconfiguring
-SLURM will not change this parameter. You will also manuallly need to
+SLURM will not change this parameter. You will also manually need to
modify the priority of every pending job.
When changing to Moab or Maui scheduling, set every job priority to zero.
When changing from Moab or Maui scheduling, set every job priority to a
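<p>A sketch of adjusting every pending job in bulk (the priority value shown
matches the earlier example; use zero when moving to Moab or Maui, a non-zero
value when moving away from them):</p>
<pre>
# -h suppresses the header, -t PD selects pending jobs, -o "%i" prints job IDs
for id in $(squeue -h -t PD -o "%i"); do
    scontrol update jobid=$id priority=1000000
done
</pre>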