tud-zih-energy / Slurm

Commit e5ffc592, authored 19 years ago by Moe Jette

Add description of job step allocations. Split FAQ file
into a user section and a different one for the admins.

parent fcbe2542
Changes: 1 changed file
doc/html/faq.shtml (+39 additions, −7 deletions)
<!--#include virtual="header.txt"-->
<h1>Frequently Asked Questions</h1>
<h2>For Users</h2>
<ol>
<li><a href="#comp">Why is my job/node in COMPLETING state?</a></li>
<li><a href="#rlimit">Why do I see the error "Can't propagate RLIMIT_..."?</a></li>
...
...
@@ -12,14 +13,21 @@ to run on nodes?</a></li>
<li><a href="#cred">Why are "Invalid job credential" errors generated?</a></li>
<li><a href="#backfill">Why is the SLURM backfill scheduler not starting my
job?</a></li>
<li><a href="#share">How can I control the execution of multiple jobs per node?</a></li>
<li><a href="#steps">How can I run multiple jobs from within a single script?</a></li>
</ol>
<h2>For Administrators</h2>
<ol>
<li><a href="#suspend">How is job suspend/resume useful?</a></li>
<li><a href="#fast_schedule">How can I configure SLURM to use the resources actually
found on a node rather than what is defined in <i>slurm.conf</i>?</a></li>
<li><a href="#return_to_service">Why is a node shown in state DOWN when the node
has registered for service?</a></li>
<li><a href="#down_node">What happens when a node crashes?</a></li>
<li><a href="#share">How can I control the execution of multiple jobs per node?</a></li>
</ol>
<h2>For Users</h2>
<p><a name="comp"><b>1. Why is my job/node in COMPLETING state?</b></a><br>
When a job is terminating, both the job and its nodes enter the state "completing."
As the SLURM daemon on each node determines that all processes associated with
...
...
@@ -156,7 +164,31 @@ satisfy the request, no lower priority job in that partition's queue
will be considered as a backfill candidate. Any programmer wishing
to augment the existing code is welcome to do so.
<p><a name="suspend"><b>9. How is job suspend/resume useful?</b></a><br>
<p><a name="share"><b>9. How can I control the execution of multiple
jobs per node?</b></a><br>
There are two mechanisms to control this.
If you want to allocate individual processors on a node to jobs,
configure <i>SelectType=select/cons_res</i>.
See <a href="cons_res.html">Consumable Resources in SLURM</a>
for details about this configuration.
If you want to allocate whole nodes to jobs, configure
<i>SelectType=select/linear</i>.
Each partition also has a configuration parameter <i>Shared</i>
that enables more than one job to execute on each node.
See <i>man slurm.conf</i> for more information about these
configuration parameters.</p>
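As a sketch, the two approaches above might appear in <i>slurm.conf</i> as follows (the partition and node names are illustrative, not taken from any real configuration):
<pre>
# Allocate individual processors on a node to jobs:
SelectType=select/cons_res

# Alternately, allocate whole nodes to jobs:
#SelectType=select/linear

# Permit more than one job to execute on each node in this partition:
PartitionName=debug Nodes=tux[0-15] Shared=YES
</pre>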
<p><a name="steps"><b>10. How can I run multiple jobs from within a
single script?</b></a><br>
A SLURM job is just a resource allocation. You can execute many
job steps within that allocation, either in parallel or sequentially.
Some jobs actually launch thousands of job steps this way. The job
steps will be allocated nodes that are not already allocated to
other job steps. This essentially provides a second level of resource
management within the job for the job steps.</p>
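A minimal batch script along these lines might look like the following (the program names and node counts are illustrative):
<pre>
#!/bin/sh
# Within the job's allocation, launch two job steps in parallel.
# Each srun invocation creates a job step on nodes not already
# in use by another step; "wait" blocks until both complete.
srun -N2 prog1 &
srun -N2 prog2 &
wait
</pre>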
<h2>For Administrators</h2>
<p><a name="suspend"><b>1. How is job suspend/resume useful?</b></a><br>
Job suspend/resume is most useful to get particularly large jobs initiated
in a timely fashion with minimal overhead. Say you want to get a full-system
job initiated. Normally you would need to either cancel all running jobs
...
...
@@ -177,7 +209,7 @@ Suspending and resuming a job makes use of the SIGSTOP and SIGCONT
signals respectively, so swap and disk space should be sufficient to
accommodate all jobs allocated to a node, either running or suspended.
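For example, assuming running jobs with IDs 1000 and 1001 are in the way of a pending full-system job, an administrator might suspend and later resume them with <i>scontrol</i> (see <i>man scontrol</i> for the exact syntax):
<pre>
scontrol suspend 1000
scontrol suspend 1001
# ... the full-system job runs to completion ...
scontrol resume 1000
scontrol resume 1001
</pre>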
<p><a name="fast_schedule"><b>10. How can I configure SLURM to use
<p><a name="fast_schedule"><b>2. How can I configure SLURM to use
the resources actually found on a node rather than what is defined
in <i>slurm.conf</i>?</b></a><br>
SLURM can either base its scheduling decisions upon the node
...
...
@@ -190,7 +222,7 @@ A value of one is the default and results in the node configuration
defined in <i>slurm.conf</i> being used. See "man slurm.conf"
for more details.</p>
<p><a name="return_to_service"><b>11. Why is a node shown in state
<p><a name="return_to_service"><b>3. Why is a node shown in state
DOWN when the node has registered for service?</b></a><br>
The configuration parameter <i>ReturnToService</i> in <i>slurm.conf</i>
controls how DOWN nodes are handled.
...
...
@@ -203,7 +235,7 @@ the command "scontrol update NodeName=whatever State=RESUME".
See "man slurm.conf" and "man scontrol" for more
details.</p>
<p><a name="down_node"><b>12. What happens when a node crashes?</b></a><br>
<p><a name="down_node"><b>4. What happens when a node crashes?</b></a><br>
A node is set DOWN when the slurmd daemon on it stops responding
for <i>SlurmdTimeout</i> as defined in <i>slurm.conf</i>.
The node can also be set DOWN when certain errors occur or the
...
...
@@ -213,7 +245,7 @@ with the srun option <i>--no-kill</i>.
Any active job step on that node will be killed.
See the slurm.conf and srun man pages for more information.</p>
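As an illustration, a fault-tolerant application might be launched so that its job survives the failure of an allocated node (the program name is hypothetical):
<pre>
srun -N4 --no-kill myprog
</pre>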
<p><a name="share"><b>13. How can I control the execution of multiple
<p><a name="share"><b>5. How can I control the execution of multiple
jobs per node?</b></a><br>
There are two mechanisms to control this.
If you want to allocate individual processors on a node to jobs,
...
...
@@ -227,6 +259,6 @@ that enables more than one job to execute on each node.
See <i>man slurm.conf</i> for more information about these
configuration parameters.</p>
<p style="text-align:center;">Last modified 22 February 2006</p>
<p style="text-align:center;">Last modified 17 March 2006</p>
<!--#include virtual="footer.txt"-->