From 9221bc978519d5194456c14c237846ce5e8cacca Mon Sep 17 00:00:00 2001 From: Moe Jette <jette1@llnl.gov> Date: Wed, 18 Jan 2006 20:04:04 +0000 Subject: [PATCH] Answer some more FAQs. --- doc/html/faq.shtml | 35 ++++++++++++++++++++++++++++++++--- 1 file changed, 32 insertions(+), 3 deletions(-) diff --git a/doc/html/faq.shtml b/doc/html/faq.shtml index af420e6cb6d..f740c2a5e30 100644 --- a/doc/html/faq.shtml +++ b/doc/html/faq.shtml @@ -2,7 +2,7 @@ <h1>Frequently Asked Questions</h1> <ol> -<li><a href="#comp">Why is my job/node in "completing" state?</a></li> +<li><a href="#comp">Why is my job/node in COMPLETING state?</a></li> <li><a href="#rlimit">Why do I see the error "Can't propagate RLIMIT_..."?</a></li> <li><a href="#pending">Why is my job not running?</a></li> <li><a href="#sharing">Why does the srun --overcommit option not permit multiple jobs @@ -13,8 +13,12 @@ to run on nodes?</a></li> <li><a href="#backfill">Why is the SLURM backfill scheduler not starting my job?</a></li> <li><a href="#suspend">How is job suspend/resume useful?</a></li> +<li><a href="#fast_schedule">How can I configure SLURM to use the resources actually +found on a node rather than what is defined in <i>slurm.conf</i>?</a></li> +<li><a href="#return_to_service">Why is a node shown in state DOWN when the node +has registered for service?</a></li> </ol> -<p><a name="comp"><b>1. Why is my job/node in "completing" state?</b></a><br> +<p><a name="comp"><b>1. Why is my job/node in COMPLETING state?</b></a><br> When a job is terminating, both the job and its nodes enter the state "completing." As the SLURM daemon on each node determines that all processes associated with the job have terminated, that node changes state to "idle" or some other @@ -26,7 +30,7 @@ the job and one or more nodes can remain in the completing state for an extended period of time. This may be indicative of processes hung waiting for a core file to complete I/O or operating system failure. 
If this state persists, the system administrator should use the <span class="commandline">scontrol</span> command -to change the node's state to <i>DOWN</i> (e.g. "scontrol update +to change the node's state to DOWN (e.g. "scontrol update NodeName=<i>name</i> State=DOWN Reason=hung_completing"), reboot the node, then reset the node's state to IDLE (e.g. "scontrol update NodeName=<i>name</i> State=RESUME").</p> @@ -171,6 +175,31 @@ Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals respectively, so swap and disk space should be sufficient to accommodate all jobs allocated to a node, either running or suspended. +<p><a name="fast_schedule"><b>10. How can I configure SLURM to use +the resources actually found on a node rather than what is defined +in <i>slurm.conf</i>?</b></a><br> +SLURM can either base its scheduling decisions upon the node +configuration defined in <i>slurm.conf</i> or what each node +actually returns as available resources. +This is controlled using the configuration parameter <i>FastSchedule</i>. +Set its value to zero in order to use the resources actually +found on each node, but with a higher overhead for scheduling. +A value of one is the default and results in the node configuration +defined in <i>slurm.conf</i> being used. See "man slurm.conf" +for more details. + +<p><a name="return_to_service"><b>11. Why is a node shown in state +DOWN when the node has registered for service?</b></a><br> +The configuration parameter <i>ReturnToService</i> in <i>slurm.conf</i> +controls how DOWN nodes are handled. +Set its value to one in order for DOWN nodes to automatically be +returned to service once the <i>slurmd</i> daemon registers +with a valid node configuration. +A value of zero is the default and results in a node staying DOWN +until an administrator explicitly returns it to service using +the command "scontrol update NodeName=whatever State=RESUME". +See "man slurm.conf" and "man scontrol" for more details. 
+ <p style="text-align:center;">Last modified 16 January 2006</p> <!--#include virtual="footer.txt"--> -- GitLab