tud-zih-energy / Slurm / Commits

Commit b345c8d2
authored 15 years ago by Don Lipari

Fixed typos in faq.shtml

parent e0db5afd
Showing 1 changed file: doc/html/faq.shtml (+26 additions, −26 deletions)
...
...
@@ -99,12 +99,12 @@ execute on DOWN nodes?</li>
<li><a href="#batch_lost">What is the meaning of the error
"Batch JobId=# missing from master node, killing it"?</a></li>
<li><a href="#accept_again">What does the messsage
"srun: error: Unable to accept connection: Resources te
r
mporarily unavailable"
"srun: error: Unable to accept connection: Resources temporarily unavailable"
indicate?</a></li>
<li><a href="#task_prolog">How could I automatically print a job's
SLURM job ID to its standard output?</li>
<li><a href="#moab_start">I run SLURM with the Moab or Maui scheduler.
-How can I start a job under SLURM wihtout the scheduler?</li>
+How can I start a job under SLURM without the scheduler?</li>
<li><a href="#orphan_procs">Why are user processes and <i>srun</i>
running even though the job is supposed to be completed?</li>
<li><a href="#slurmd_oom">How can I prevent the <i>slurmd</i> and
...
...
@@ -129,7 +129,7 @@ for an extended period of time.
This may be indicative of processes hung waiting for a core file
to complete I/O or operating system failure.
If this state persists, the system administrator should check for processes
-associated with the job that can not be terminated then use the
+associated with the job that cannot be terminated then use the
<span class="commandline">scontrol</span> command to change the node's
state to DOWN (e.g. "scontrol update NodeName=<i>name</i> State=DOWN Reason=hung_completing"),
reboot the node, then reset the node's state to IDLE
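For reference, a minimal sketch of the recovery sequence this hunk describes; the node name and the reason string are placeholders:

$ scontrol update NodeName=node01 State=DOWN Reason=hung_completing
# reboot the node, then:
$ scontrol update NodeName=node01 State=IDLE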
...
...
@@ -184,7 +184,7 @@ until no previously submitted job is pending. If the scheduler type is <b>backfi
then jobs will generally be executed in the order of submission for a given partition
with one exception: later submitted jobs will be initiated early if doing so does
not delay the expected execution time of an earlier submitted job. In order for
-backfill scheduling to be effective, users jobs should specify reasonable time
+backfill scheduling to be effective, users' jobs should specify reasonable time
limits. If jobs do not specify time limits, then all jobs will receive the same
time limit (that associated with the partition), and the ability to backfill schedule
jobs will be limited. The backfill scheduler does not alter job specifications
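A hedged illustration of the point about time limits; the script name and the 30-minute limit are placeholders, and sbatch also accepts the short form -t:

$ sbatch --time=30:00 my_job_script.sh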
...
...
@@ -226,7 +226,7 @@ more information.</p>
SLURM has a job purging mechanism to remove inactive jobs (resource allocations)
before reaching its time limit, which could be infinite.
This inactivity time limit is configurable by the system administrator.
-You can check it's value with the command</p>
+You can check its value with the command</p>
<blockquote>
<p><span class="commandline">scontrol show config | grep InactiveLimit</span></p>
</blockquote>
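A short sketch of how this limit is inspected and, on the administrator's side, where it is set; the value shown is only an example:

$ scontrol show config | grep InactiveLimit
# in slurm.conf (seconds; zero disables the purge):
InactiveLimit=300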
...
...
@@ -251,7 +251,7 @@ the command. For example:</p>
</blockquote>
<p>srun processes "-N2" as an option to itself. "hostname" is the
command to execute and "-pdebug" is treated as an option to the
-hostname command. Which will change the name of the computer
+hostname command. This will change the name of the computer
on which SLURM executes the command - Very bad, <b>Don't run
this command as user root!</b></p>
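A hedged sketch contrasting the two orderings described above; "debug" is an assumed partition name:

$ srun -N2 -pdebug hostname    # -pdebug is parsed by srun (intended)
$ srun -N2 hostname -pdebug    # -pdebug is passed to the hostname command (dangerous as root)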
...
...
@@ -305,7 +305,7 @@ that the processes associated with the switch have been terminated
to avoid the possibility of re-using switch resources for other
jobs (even on different nodes).
SLURM considers jobs COMPLETED when all nodes allocated to the
-job are either DOWN or confirm termination of all it's processes.
+job are either DOWN or confirm termination of all its processes.
This enables SLURM to purge job information in a timely fashion
even when there are many failing nodes.
Unfortunately the job step information may persist longer.</p>
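If it helps to see which jobs are still tearing down, a minimal (assumed) query for jobs in the COMPLETING state:

$ squeue --states=COMPLETING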
...
...
@@ -490,7 +490,7 @@ indicate?</b></a><br>
The srun command normally terminates when the standard output and
error I/O from the spawned tasks end. This does not necessarily
happen at the same time that a job step is terminated. For example,
-a file system problem could render a spawned tasks non-killable
+a file system problem could render a spawned task non-killable
at the same time that I/O to srun is pending. Alternately a network
problem could prevent the I/O from being transmitted to srun.
In any event, the srun command is notified when a job step is
...
...
@@ -526,8 +526,8 @@ If the user's resource limit is not propagated, the limit in
effect for the <i>slurmd</i> daemon will be used for the spawned job.
A simple way to control this is to insure that user <i>root</i> has a
sufficiently large resource limit and insuring that <i>slurmd</i> takes
-full advantage of this limit. For example, you can set user's
-root's locked memory limit limit to be unlimited on the compute nodes (see
+full advantage of this limit. For example, you can set user root's
+locked memory limit ulimit to be unlimited on the compute nodes (see
<i>"man limits.conf"</i>) and insuring that <i>slurmd</i> takes
full advantage of this limit (e.g. by adding something like
<i>"ulimit -l unlimited"</i> to the <i>/etc/init.d/slurm</i>
...
...
@@ -632,11 +632,11 @@ accommodate all jobs allocated to a node, either running or suspended.
<p><a name="fast_schedule"><b>2. How can I configure SLURM to use
the resources actually found on a node rather than what is defined
in <i>slurm.conf</i>?</b></a><br>
-SLURM can either base it's scheduling decisions upon the node
+SLURM can either base its scheduling decisions upon the node
configuration defined in <i>slurm.conf</i> or what each node
actually returns as available resources.
This is controlled using the configuration parameter <i>FastSchedule</i>.
-Set it's value to zero in order to use the resources actually
+Set its value to zero in order to use the resources actually
found on each node, but with a higher overhead for scheduling.
A value of one is the default and results in the node configuration
defined in <i>slurm.conf</i> being used. See "man slurm.conf"
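A one-line slurm.conf sketch matching the behavior described above (use the resources actually found on each node):

FastSchedule=0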
...
...
@@ -667,7 +667,7 @@ See the slurm.conf and srun man pages for more information.</p>
<p><a name="multi_job"><b>5. How can I control the execution of multiple
jobs per node?</b></a><br>
-There are two mechanism to control this.
+There are two mechanisms to control this.
If you want to allocate individual processors on a node to jobs,
configure <i>SelectType=select/cons_res</i>.
See <a href="cons_res.html">Consumable Resources in SLURM</a>
...
...
@@ -691,7 +691,7 @@ for more information (e.g. "slurmctld -Dvvvvv").
<p><a name="sigpipe"><b>7. Why are user tasks intermittently dying
at launch with SIGPIPE error messages?</b></a><br>
-If you are using ldap or some other remote name service for
+If you are using LDAP or some other remote name service for
username and groups lookup, chances are that the underlying
libc library functions are triggering the SIGPIPE. You can likely
work around this problem by setting <i>CacheGroups=1</i> in your slurm.conf
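The workaround named in the hunk, as it would appear in slurm.conf:

CacheGroups=1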
...
...
@@ -765,7 +765,7 @@ to relocate them. In order to do so, follow this procedure:</p>
<li>Stop all SLURM daemons</li>
<li>Modify the <i>ControlMachine</i>, <i>ControlAddr</i>,
<i>BackupController</i>, and/or <i>BackupAddr</i> in the <i>slurm.conf</i> file</li>
-<li>Distribute the updated <i>slurm.conf</i> file file to all nodes</li>
+<li>Distribute the updated <i>slurm.conf</i> file to all nodes</li>
<li>Restart all SLURM daemons</li>
</ol>
<p>There should be no loss of any running or pending jobs. Insure that
...
...
@@ -803,7 +803,7 @@ cluster?</b></a><br>
Yes, this can be useful for testing purposes.
It has also been used to partition "fat" nodes into multiple SLURM nodes.
There are two ways to do this.
-The best method for most conditins is to run one <i>slurmd</i>
+The best method for most conditions is to run one <i>slurmd</i>
daemon per emulated node in the cluster as follows.
<ol>
<li>When executing the <i>configure</i> program, use the option
...
...
@@ -822,7 +822,7 @@ slurm.conf. </li>
of the node that it is supposed to serve on the execute line.</li>
</ol>
<p>It is strongly recommended that SLURM version 1.2 or higher be used
-for this due to it's improved support for multiple slurmd daemons.
+for this due to its improved support for multiple slurmd daemons.
See the
<a href="programmer_guide.html#multiple_slurmd_support">Programmers Guide</a>
for more details about configuring multiple slurmd support.</p>
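A hedged sketch of starting several slurmd daemons on one host, each told which emulated node it serves; the node names are placeholders:

$ slurmd -N node01
$ slurmd -N node02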
...
...
@@ -983,7 +983,7 @@ This error indicates that a job credential generated by the slurmctld daemon
corresponds to a job that the slurmd daemon has already revoked.
The slurmctld daemon selects job ID values based upon the configured
value of <b>FirstJobId</b> (the default value is 1) and each job gets
-an value one large than the previous job.
+a value one larger than the previous job.
On job termination, the slurmctld daemon notifies the slurmd on each
allocated node that all processes associated with that job should be
terminated.
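For reference, the parameter mentioned above as it would appear in slurm.conf (the default value is shown):

FirstJobId=1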
...
...
@@ -1031,7 +1031,7 @@ of the program.
<p><a name="rpm"><b>27. Why isn't the auth_none.so (or other file) in a
SLURM RPM?</b></a><br>
-The auth_none plugin is in a separete RPM and not built by default.
+The auth_none plugin is in a separate RPM and not built by default.
Using the auth_none plugin means that SLURM communications are not
authenticated, so you probably do not want to run in this mode of operation
except for testing purposes. If you want to build the auth_none RPM then
...
...
@@ -1080,7 +1080,7 @@ sinfo -t drain -h -o "scontrol update nodename='%N' state=drain reason='%E'"
execute on DOWN nodes?</a></b><br>
Hierarchical communications are used for sending this message. If there
are DOWN nodes in the communications hierarchy, messages will need to
-be re-routed. This limits SLURM's ability to tightly synchroize the
+be re-routed. This limits SLURM's ability to tightly synchronize the
execution of the <i>HealthCheckProgram</i> across the cluster, which
could adversely impact performance of parallel applications.
The use of CRON or node startup scripts may be better suited to insure
...
...
@@ -1113,14 +1113,14 @@ is executing. If a batch program is expected to be running on some
node (i.e. node zero of the job's allocation) and is not found, the
message above will be logged and the job cancelled. This typically is
associated with exhausting memory on the node or some other critical
-failure that can not be recovered from. The equivalent message in
+failure that cannot be recovered from. The equivalent message in
earlier releases of slurm is
"Master node lost JobId=#, killing it".
<p><a name="accept_again"><b>33. What does the messsage
"srun: error: Unable to accept connection: Resources te
r
mporarily unavailable"
"srun: error: Unable to accept connection: Resources temporarily unavailable"
indicate?</b></a><br>
-This has been reported on some larger clusters running Suse Linux when
+This has been reported on some larger clusters running SUSE Linux when
a user's resource limits are reached. You may need to increase limits
for locked memory and stack size to resolve this problem.
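A minimal sketch of inspecting and raising the two limits named above for the current shell:

$ ulimit -l            # locked memory, in KB
$ ulimit -s            # stack size, in KB
$ ulimit -l unlimited
$ ulimit -s unlimited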
...
...
@@ -1128,7 +1128,7 @@ for locked memory and stack size to resolve this problem.
SLURM job ID to its standard output?</b></a></br>
The configured <i>TaskProlog</i> is the only thing that can write to
the job's standard output or set extra environment variables for a job
-or job step. To write to the job's standard output, preceed the message
+or job step. To write to the job's standard output, precede the message
with "print ". To export environment variables, output a line of this
form "export name=value". The example below will print a job's SLURM
job ID and allocated hosts for a batch job only.
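The file's own example sits in the collapsed lines that follow; as a minimal sketch of the rules just described, task zero of a job writes a "print " line and an "export " line to its standard output (MY_JOB_ID is a hypothetical variable):

#!/bin/sh
# assumed TaskProlog sketch, not the FAQ's own example
if [ "$SLURM_PROCID" = "0" ]; then
    echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
    echo "export MY_JOB_ID=$SLURM_JOB_ID"
fi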
...
...
@@ -1150,7 +1150,7 @@ fi
</pre>
<p><a name="moab_start"><b>35. I run SLURM with the Moab or Maui scheduler.
-How can I start a job under SLURM wihtout the scheduler?</b></a></br>
+How can I start a job under SLURM without the scheduler?</b></a></br>
When SLURM is configured to use the Moab or Maui scheduler, all submitted
jobs have their priority initialized to zero, which SLURM treats as a held
job. The job only begins when Moab or Maui decide where and when to start
...
...
@@ -1163,7 +1163,7 @@ $ scontrol update jobid=1234 priority=1000000
</pre>
<p>Note that changes in the configured value of <i>SchedulerType</i> only
take effect when the <i>slurmctld</i> daemon is restarted (reconfiguring
-SLURM will not change this parameter. You will also manuallly need to
+SLURM will not change this parameter. You will also manually need to
modify the priority of every pending job.
When changing to Moab or Maui scheduling, set every job priority to zero.
When changing from Moab or Maui scheduling, set every job priority to a
...
...
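Finally, a hedged sketch of the bulk priority change the last hunk describes when switching schedulers; the squeue query for pending job IDs is illustrative:

$ for j in $(squeue -h -t PD -o %i); do scontrol update jobid=$j priority=0; done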