Commit 9221bc97, authored 19 years ago by Moe Jette
Answer some more FAQs.
parent a3d47fde
doc/html/faq.shtml: 32 additions, 3 deletions
@@ -2,7 +2,7 @@
<h1>Frequently Asked Questions</h1>
<h1>Frequently Asked Questions</h1>
<ol>
<ol>
<li><a href="#comp">Why is my job/node in "completing" state?</a></li>
<li><a href="#comp">Why is my job/node in COMPLETING state?</a></li>
<li><a href="#rlimit">Why do I see the error "Can't propagate RLIMIT_..."?</a></li>
<li><a href="#rlimit">Why do I see the error "Can't propagate RLIMIT_..."?</a></li>
<li><a href="#pending">Why is my job not running?</a></li>
<li><a href="#pending">Why is my job not running?</a></li>
<li><a href="#sharing">Why does the srun --overcommit option not permit multiple jobs
<li><a href="#sharing">Why does the srun --overcommit option not permit multiple jobs
@@ -13,8 +13,12 @@ to run on nodes?</a></li>
<li><a href="#backfill">Why is the SLURM backfill scheduler not starting my
<li><a href="#backfill">Why is the SLURM backfill scheduler not starting my
job?</a></li>
job?</a></li>
<li><a href="#suspend">How is job suspend/resume useful?</a></li>
<li><a href="#suspend">How is job suspend/resume useful?</a></li>
<li><a href="#fast_schedule">How can I configure SLURM to use the resources actually
found on a node rather than what is defined in <i>slurm.conf</i>?</a></li>
<li><a href="#return_to_service">Why is a node shown in state DOWN when the node
has registered for service?</a></li>
</ol>
</ol>
<p><a name="comp"><b>1. Why is my job/node in "completing" state?</b></a><br>
<p><a name="comp"><b>1. Why is my job/node in COMPLETING state?</b></a><br>
When a job is terminating, both the job and its nodes enter the state "completing."
When a job is terminating, both the job and its nodes enter the state "completing."
As the SLURM daemon on each node determines that all processes associated with
As the SLURM daemon on each node determines that all processes associated with
the job have terminated, that node changes state to "idle" or some other
the job have terminated, that node changes state to "idle" or some other
@@ -26,7 +30,7 @@ the job and one or more nodes can remain in the completing state for an extended
period of time. This may be indicative of processes hung waiting for a core file
period of time. This may be indicative of processes hung waiting for a core file
to complete I/O or operating system failure. If this state persists, the system
to complete I/O or operating system failure. If this state persists, the system
administrator should use the <span class="commandline">scontrol</span> command
administrator should use the <span class="commandline">scontrol</span> command
to change the node's state to <i>DOWN</i> (e.g. "scontrol update
to change the node's state to DOWN (e.g. "scontrol update
NodeName=<i>name</i> State=DOWN Reason=hung_completing"), reboot the node,
NodeName=<i>name</i> State=DOWN Reason=hung_completing"), reboot the node,
then reset the node's state to IDLE (e.g. "scontrol update
then reset the node's state to IDLE (e.g. "scontrol update
NodeName=<i>name</i> State=RESUME").</p>
NodeName=<i>name</i> State=RESUME").</p>
@@ -171,6 +175,31 @@ Suspending and resuming a job makes use of the SIGSTOP and SIGCONT
signals respectively, so swap and disk space should be sufficient to
signals respectively, so swap and disk space should be sufficient to
accommodate all jobs allocated to a node, either running or suspended.
accommodate all jobs allocated to a node, either running or suspended.
<p><a name="fast_schedule"><b>10. How can I configure SLURM to use
the resources actually found on a node rather than what is defined
in <i>slurm.conf</i>?</b></a><br>
SLURM can base its scheduling decisions either upon the node
configuration defined in <i>slurm.conf</i> or upon what each node
actually reports as available resources.
This is controlled using the configuration parameter <i>FastSchedule</i>.
Set its value to zero in order to use the resources actually
found on each node, at the cost of a higher scheduling overhead.
A value of one is the default and results in the node configuration
defined in <i>slurm.conf</i> being used. See "man slurm.conf"
for more details.
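<p>As a minimal sketch, assuming an otherwise complete <i>slurm.conf</i>,
the setting described above might appear as follows (the comment line is
illustrative only):</p>
<pre>
# Schedule against the resources each node actually reports
FastSchedule=0
</pre>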
<p><a name="return_to_service"><b>11. Why is a node shown in state
DOWN when the node has registered for service?</b></a><br>
The configuration parameter <i>ReturnToService</i> in <i>slurm.conf</i>
controls how DOWN nodes are handled.
Set its value to one in order for DOWN nodes to automatically be
returned to service once the <i>slurmd</i> daemon registers
with a valid node configuration.
A value of zero is the default and results in a node staying DOWN
until an administrator explicitly returns it to service using
the command "scontrol update NodeName=whatever State=RESUME".
See "man slurm.conf" and "man scontrol" for more details.
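<p>For illustration only, and assuming no other changes to
<i>slurm.conf</i> are needed, the corresponding line might look like this
(the comment lines are illustrative):</p>
<pre>
# Return DOWN nodes to service automatically once slurmd registers
# with a valid node configuration
ReturnToService=1
</pre>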
<p style="text-align:center;">Last modified 16 January 2006</p>
<p style="text-align:center;">Last modified 16 January 2006</p>
<!--#include virtual="footer.txt"-->
<!--#include virtual="footer.txt"-->