Commit 3577021d authored by Moe Jette

Add information about jobs/nodes hung in CG state.

parent de5ac831
@@ -10,6 +10,7 @@ may also prove useful.</p>
<ul>
<li><a href="#resp">SLURM is not responding</a></li>
<li><a href="#sched">Jobs are not getting scheduled</a></li>
<li><a href="#completing">Jobs and nodes are stuck in COMPLETING state</a></li>
<li><a href="#nodes">Nodes are getting set to a DOWN state</a></li>
<li><a href="#network">Networking and configuration problems</a></li>
</ul>
@@ -97,6 +98,26 @@ Please refer to its documentation for help.</li>
<p class="footer"><a href="#top">top</a></p>
<h2><a name="completing">Jobs and nodes are stuck in COMPLETING state</a></h2>
<p>This is typically due to non-killable processes associated with the job.
SLURM will continue to attempt terminating the processes with SIGKILL, but
some processes may be stuck performing I/O and cannot be killed until that
I/O completes.
The underlying cause is usually a file system problem, which may be
addressed in one of two ways.</p>
<ol>
<li>Fix the file system and/or reboot the node. <b>-OR-</b></li>
<li>Set the node to a DOWN state and then return it to service
("<i>scontrol update NodeName=&lt;node&gt; State=down Reason=hung_proc</i>"
and "<i>scontrol update NodeName=&lt;node&gt; State=resume</i>").
This permits other jobs to use the node, but leaves the non-killable
process in place.
If the process should ever complete the I/O, the pending SIGKILL
should terminate it immediately.</li>
</ol>
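<p>The second option above can be sketched as a small shell helper. This is
only an illustration, not part of SLURM: the names <i>run</i>,
<i>cycle_node</i> and the <i>DRY_RUN</i> variable are hypothetical, and the
only real commands used are the two <i>scontrol update</i> invocations quoted
above. By default the helper just prints the commands so they can be reviewed
first.</p>

```shell
#!/bin/sh
# Sketch: take a node with a hung process DOWN, then return it to service.
# DRY_RUN is a hypothetical variable (not a SLURM setting). It defaults to 1,
# so commands are printed rather than executed; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# cycle_node NODE [REASON]
# Hypothetical helper wrapping the two scontrol commands from the list above.
# REASON defaults to hung_proc, matching the example in the text.
cycle_node() {
    node="$1"
    reason="${2:-hung_proc}"
    if [ -z "$node" ]; then
        echo "usage: cycle_node NODE [REASON]" >&2
        return 1
    fi
    # Only resume the node if marking it DOWN succeeded.
    run scontrol update NodeName="$node" State=down Reason="$reason" &&
    run scontrol update NodeName="$node" State=resume
}
```

<p>For example, after sourcing the file, <i>cycle_node tux3</i> (with
<i>tux3</i> standing in for a real node name) prints the two commands, and
<i>DRY_RUN=0 cycle_node tux3</i> executes them. The non-killable process is
still left in place on the node, as described above.</p>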
<p class="footer"><a href="#top">top</a></p>
<h2><a name="nodes">Nodes are getting set to a DOWN state</a></h2>
<ol>
@@ -171,6 +192,6 @@ version 1.2 daemons or vice versa.</li>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 12 October 2006</p>
<p style="text-align:center;">Last modified 16 October 2006</p>
<!--#include virtual="footer.txt"-->