Commit fb413786, authored 17 years ago by Moe Jette
describe need to cold-start all daemons at once.
parent e31b7c58
doc/html/faq.shtml: 44 additions, 19 deletions
@@ -10,7 +10,6 @@
 to run on nodes?</a></li>
 <li><a href="#purge">Why is my job killed prematurely?</a></li>
 <li><a href="#opts">Why are my srun options ignored?</a></li>
-<li><a href="#cred">Why are "Invalid job credential" errors generated?</a></li>
 <li><a href="#backfill">Why is the SLURM backfill scheduler not starting my
 job?</a></li>
 <li><a href="#steps">How can I run multiple jobs from within a single script?</a></li>
@@ -63,6 +62,11 @@ core files?</a></li>
 useful on a homogeneous cluster?</a></li>
 <li><a href="#clock">Do I need to maintain synchronized clocks
 on the cluster?</a></li>
+<li><a href="#cred_invalid">Why are "Invalid job credential" errors
+generated?</a></li>
+<li><a href="#cred_replay">Why are
+"Task launch failed on node ... Job credential replayed"
+errors generated?</a></li>
 </ol>
 <h2>For Users</h2>
@@ -200,14 +204,7 @@ hostname command. Which will change the name of the computer
 on which SLURM executes the command - Very bad, <b>Don't run
 this command as user root!</b></p>
-<p><a name="cred"><b>7. Why are "Invalid job credential" errors generated?
-</b></a><br>
-This error is indicative of SLURM's job credential files being inconsistent across
-the cluster. All nodes in the cluster must have the matching public and private
-keys as defined by <b>JobCredPrivateKey</b> and <b>JobCredPublicKey</b> in the
-slurm configuration file <b>slurm.conf</b>.
-<p><a name="backfill"><b>8. Why is the SLURM backfill scheduler not starting my job?
+<p><a name="backfill"><b>7. Why is the SLURM backfill scheduler not starting my job?
 </b></a><br>
 There are significant limitations in the current backfill scheduler plugin.
 It was designed to perform backfill node scheduling for a homogeneous cluster.
@@ -236,7 +233,7 @@ scheduling and other jobs may be scheduled ahead of these jobs.
 These jobs are subject to starvation, but will not block other
 jobs from running when sufficient resources are available for them.</p>
-<p><a name="steps"><b>9. How can I run multiple jobs from within a
+<p><a name="steps"><b>8. How can I run multiple jobs from within a
 single script?</b></a><br>
 A SLURM job is just a resource allocation. You can execute many
 job steps within that allocation, either in parallel or sequentially.
@@ -245,7 +242,7 @@ steps will be allocated nodes that are not already allocated to
 other job steps. This essentially provides a second level of resource
 management within the job for the job steps.</p>
-<p><a name="orphan"><b>10. Why do I have job steps when my job has
+<p><a name="orphan"><b>9. Why do I have job steps when my job has
 already COMPLETED?</b></a><br>
 NOTE: This only applies to systems configured with
 <i>SwitchType=switch/elan</i> or <i>SwitchType=switch/federation</i>.
@@ -262,7 +259,7 @@ This enables SLURM to purge job information in a timely fashion
 even when there are many failing nodes.
 Unfortunately the job step information may persist longer.</p>
-<p><a name="multi_batch"><b>11. How can I run a job within an existing
+<p><a name="multi_batch"><b>10. How can I run a job within an existing
 job allocation?</b></a><br>
 There is a srun option <i>--jobid</i> that can be used to specify
 a job's ID.
@@ -278,7 +275,7 @@ If you specify that a batch job should use an existing allocation,
 that job allocation will be released upon the termination of
 that batch job.</p>
-<p><a name="user_env"><b>12. How does SLURM establish the environment
+<p><a name="user_env"><b>11. How does SLURM establish the environment
 for my job?</b></a><br>
 SLURM processes are not run under a shell, but directly exec'ed
 by the <i>slurmd</i> daemon (assuming <i>srun</i> is used to launch
@@ -288,13 +285,13 @@ is executed are propagated to the spawned processes.
 The <i>~/.profile</i> and <i>~/.bashrc</i> scripts are not executed
 as part of the process launch.</p>
-<p><a name="prompt"><b>13. How can I get shell prompts in interactive
+<p><a name="prompt"><b>12. How can I get shell prompts in interactive
 mode?</b></a><br>
 <i>srun -u bash -i</i><br>
 Srun's <i>-u</i> option turns off buffering of stdout.
 Bash's <i>-i</i> option tells it to run in interactive mode (with prompts).
-<p><a name="batch_out"><b>14. How can I get the task ID in the output
+<p><a name="batch_out"><b>13. How can I get the task ID in the output
 or error file name for a batch job?</b></a><br>
 <p>If you want separate output by task, you will need to build a script
 containing this specification. For example:</p>
@@ -324,7 +321,7 @@ $ cat out_65541_2
 tdev2
 </pre>
-<p><a name="parallel_make"><b>15. Can the <i>make</i> command
+<p><a name="parallel_make"><b>14. Can the <i>make</i> command
 utilize the resources allocated to a SLURM job?</b></a><br>
 Yes. There is a patch available for GNU make version 3.81
 available as part of the SLURM distribution in the file
@@ -337,7 +334,7 @@ overhead of SLURM's task launch. Use with make's <i>-j</i> option within an
 existing SLURM allocation. Outside of a SLURM allocation, make's behavior
 will be unchanged.</p>
-<p><a name="terminal"><b>16. Can tasks be launched with a remote
+<p><a name="terminal"><b>15. Can tasks be launched with a remote
 terminal?</b></a><br>
 In SLURM version 1.3 or higher, use srun's <i>--pty</i> option.
 Until then, you can accomplish this by starting an appropriate program
@@ -812,9 +809,37 @@ clocks on the cluster?</b></a><br>
 In general, yes. Having inconsistent clocks may cause nodes to
 be unusable. SLURM log files should contain references to
 expired credentials.
+<p><a name="cred_invalid"><b>21. Why are "Invalid job credential"
+errors generated?</b></a><br>
+This error is indicative of SLURM's job credential files being inconsistent across
+the cluster. All nodes in the cluster must have the matching public and private
+keys as defined by <b>JobCredPrivateKey</b> and <b>JobCredPublicKey</b> in the
+slurm configuration file <b>slurm.conf</b>.
+<p><a name="cred_replay"><b>22. Why are
+"Task launch failed on node ... Job credential replayed"
+errors generated?</b></a><br>
+This error indicates that a job credential generated by the slurmctld daemon
+corresponds to a job that the slurmd daemon has already revoked.
+The slurmctld daemon selects job ID values based upon the configured
+value of <b>FirstJobId</b> (the default value is 1) and each job gets
+a value one larger than the previous job.
+On job termination, the slurmctld daemon notifies the slurmd on each
+allocated node that all processes associated with that job should be
+terminated.
+The slurmd daemon maintains a list of the jobs which have already been
+terminated to avoid replay of task launch requests.
+If the slurmctld daemon is cold-started (with the "-c" option
+or "/etc/init.d/slurm startclean"), it starts job ID values
+over based upon <b>FirstJobId</b>.
+If the slurmd is not also cold-started, it will reject job launch requests
+for jobs that it considers terminated.
+The solution to this problem is to cold-start all slurmd daemons whenever
+the slurmctld daemon is cold-started.
 <p class="footer"><a href="#top">top</a></p>
-<p style="text-align:center;">Last modified 1 October 2007</p>
+<p style="text-align:center;">Last modified 2 October 2007</p>
 <!--#include virtual="footer.txt"-->
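The "Job credential replayed" entry added in this commit describes an interaction between two daemons that can be easier to follow as a toy model. The sketch below is illustrative only and is not SLURM code: the class names `Slurmctld` and `Slurmd`, their methods, and the returned strings are hypothetical stand-ins for the behavior the FAQ text describes (job IDs counting up from FirstJobId, a per-node list of terminated jobs, and a cold start that resets state).

```python
class Slurmctld:
    """Toy controller: hands out job IDs starting at FirstJobId (hypothetical model)."""

    def __init__(self, first_job_id=1):
        self.first_job_id = first_job_id
        self.next_job_id = first_job_id

    def submit(self):
        # Each job gets a value one larger than the previous job.
        job_id = self.next_job_id
        self.next_job_id += 1
        return job_id

    def cold_start(self):
        # A cold start discards saved state, so numbering restarts at FirstJobId.
        self.next_job_id = self.first_job_id


class Slurmd:
    """Toy compute-node daemon: remembers which jobs were already terminated."""

    def __init__(self):
        self.terminated = set()

    def launch(self, job_id):
        # A launch request for a job already seen as terminated looks like a replay.
        if job_id in self.terminated:
            return "Job credential replayed"
        return "launched"

    def terminate(self, job_id):
        self.terminated.add(job_id)

    def cold_start(self):
        # Cold-starting the node daemon forgets the terminated-job list.
        self.terminated.clear()


def demo():
    ctld, node = Slurmctld(), Slurmd()
    results = []

    job = ctld.submit()               # first job, ID 1
    results.append(node.launch(job))
    node.terminate(job)

    ctld.cold_start()                 # only the controller is cold-started
    job = ctld.submit()               # ID 1 is handed out again
    results.append(node.launch(job))  # node still remembers job 1 as terminated

    node.cold_start()                 # cold-start the node daemon as well
    results.append(node.launch(job))
    return results


if __name__ == "__main__":
    for outcome in demo():
        print(outcome)
```

In this model the second launch is rejected because only the controller was reset, which mirrors the FAQ's advice to cold-start all slurmd daemons whenever slurmctld is cold-started.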