From 2892dc3b6eeac9918b667ff7188328be434958ef Mon Sep 17 00:00:00 2001 From: Moe Jette <jette1@llnl.gov> Date: Fri, 13 Oct 2006 17:14:30 +0000 Subject: [PATCH] Major update to troubleshooting guide, basically done now. --- doc/html/review_release.html | 1 + doc/html/troubleshoot.shtml | 109 +++++++++++++++++++++++++++++++++-- 2 files changed, 106 insertions(+), 4 deletions(-) diff --git a/doc/html/review_release.html b/doc/html/review_release.html index 5d35271eed2..84bd7933704 100644 --- a/doc/html/review_release.html +++ b/doc/html/review_release.html @@ -43,6 +43,7 @@ <li><a href="http://cmg-rr.llnl.gov/linux/slurm/switchplugins.html">switchplugins.html</a></li> <li><a href="http://cmg-rr.llnl.gov/linux/slurm/team.html">team.html</a></li> <li><a href="http://cmg-rr.llnl.gov/linux/slurm/testimonials.html">testimonials.html</a></li> +<li><a href="http://cmg-rr.llnl.gov/linux/slurm/troubleshoot.html">troubleshoot.html</a></li> </ul> </body> </html> diff --git a/doc/html/troubleshoot.shtml b/doc/html/troubleshoot.shtml index 818a23d42f1..38bca2fd613 100644 --- a/doc/html/troubleshoot.shtml +++ b/doc/html/troubleshoot.shtml @@ -1,9 +1,11 @@ <!--#include virtual="header.txt"--> -<h1>SLURM Troubleshooting Guide</h1> +<h1><a name="top">SLURM Troubleshooting Guide</a></h1> <p>This guide is meant as a tool to help system administrators -or operators troubleshoot SLURM failures and restore services.</p> +or operators troubleshoot SLURM failures and restore services. +The <a href="faq.html">Frequently Asked Questions</a> document +may also prove useful.</p> <ul> <li><a href="#resp">SLURM is not responding</a></li> @@ -12,63 +14,162 @@ or operators troubleshoot SLURM failures and restore services.</p> <li><a href="#network">Networking problems</a></li> </ul> + <h2><a name="resp">SLURM is not responding</a></h2> <ol> <li>Execute "<i>scontrol ping</i>" to determine if the primary and backup controllers are responding. 
+ <li>If it responds for you, this could be a <a href="#network">networking or configuration problem</a> specific to some user or node in the cluster.</li>
+ <li>If not responding, directly log in to the machine and try again
-to rule out <a href="#network">network and configuration problems</a>.
+to rule out <a href="#network">network and configuration problems</a>.</li>
+ <li>If still not responding, check if there is an active slurmctld daemon by executing "<i>ps -el | grep slurmctld</i>".</li>
+ <li>If slurmctld is not running, restart it (typically as user root using the command "<i>/etc/init.d/slurm start</i>"). You should check the log file (<i>SlurmctldLog</i> in the <i>slurm.conf</i> file) for an indication of why it failed. If it keeps failing, you should contact the SLURM team for help at <a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</li>
+ <li>If slurmctld is running but not responding (a very rare situation), then kill and restart it (typically as user root using the commands
-"<i>/etc/init.d/slurm stop</i>" and then "<i>/etc/init.d/slurm start</i>").
+"<i>/etc/init.d/slurm stop</i>" and then "<i>/etc/init.d/slurm start</i>").</li>
+ <li>If it hangs again, increase the verbosity of debug messages (increase <i>SlurmctldDebug</i> in the <i>slurm.conf</i> file) and restart. Again check the log file for an indication of why it failed. At this point, you should contact the SLURM team for help at <a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</li>
+ <li>If it continues to fail without an indication as to the failure mode, restart without preserving state (typically as user root using the commands "<i>/etc/init.d/slurm stop</i>" and then "<i>/etc/init.d/slurm startclean</i>").
Note: All running jobs and other state information will be lost.</li> </ol>
+<p class="footer"><a href="#top">top</a></p>
+
<h2><a name="sched">Jobs are not getting scheduled</a></h2>
+<p>This is dependent upon the scheduler used by SLURM.
+Execute the command "<i>scontrol show config | grep SchedulerType</i>"
+to determine this.
+For any scheduler, you can check priorities of jobs using the
+command "<i>scontrol show job</i>".</p>
+
+<ul>
+<li>If the scheduler type is <i>builtin</i>, then jobs will be executed
+in the order of submission for a given partition.
+Even if resources are available to initiate jobs immediately,
+their initiation will be deferred until no previously submitted job is pending.</li>
+
+<li>If the scheduler type is <i>backfill</i>, then jobs will generally
+be executed in the order of submission for a given partition with one
+exception: later submitted jobs will be initiated early if doing so
+does not delay the expected execution time of an earlier submitted job.
+In order for backfill scheduling to be effective, users' jobs should
+specify reasonable time limits.
+If jobs do not specify time limits, then all jobs will receive the
+same time limit (that associated with the partition), and the ability
+to backfill schedule jobs will be limited.
+The backfill scheduler does not alter job specifications of required
+or excluded nodes, so jobs which specify nodes will substantially
+reduce the effectiveness of backfill scheduling.
+See the <a href="faq.html#backfill">backfill documentation</a>
+for more details.</li>
+
+<li>If the scheduler type is <i>wiki</i>, this represents
+<a href="http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php">
+The Maui Scheduler</a> or
+<a href="http://www.clusterresources.com/pages/products/moab-cluster-suite.php">
+Moab Cluster Suite</a>.
+Please refer to its documentation for help.</li>
+</ul>
+<p class="footer"><a href="#top">top</a></p>
+
+
<h2><a name="nodes">Nodes are getting set to a DOWN state</a></h2>
+<ol>
+<li>Check the reason why the node is down using the command
+"<i>scontrol show node &lt;name&gt;</i>".
+This will show the reason why the node was set down and the
+time when it happened.
+If there is insufficient disk space, memory space, etc. compared
+to the parameters specified in the <i>slurm.conf</i> file, then
+either fix the node or change <i>slurm.conf</i>.</li>
+
+<li>If the reason is "Not responding", then check communications
+between the control machine and the DOWN node using the command
+"<i>ping &lt;address&gt;</i>", being sure to specify the
+NodeAddr values configured in <i>slurm.conf</i>.
+If ping fails, then fix the network or addresses in <i>slurm.conf</i>.</li>
+
+<li>Next, log in to a node that SLURM considers to be in a DOWN
+state and check if the slurmd daemon is running with the command
+"<i>ps -el | grep slurmd</i>".
+If slurmd is not running, restart it (typically as user root
+using the command "<i>/etc/init.d/slurm start</i>").
+You should check the log file (<i>SlurmdLog</i> in the
+<i>slurm.conf</i> file) for an indication of why it failed.
+If it keeps failing, you should contact the SLURM team for help at
+<a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</li>
+
+<li>If slurmd is running but not responding (a very rare situation),
+then kill and restart it (typically as user root using the commands
+"<i>/etc/init.d/slurm stop</i>" and then "<i>/etc/init.d/slurm start</i>").</li>
+
+<li>If still not responding, try again to rule out
+<a href="#network">network and configuration problems</a>.</li>
+
+<li>If still not responding, increase the verbosity of debug messages
+(increase <i>SlurmdDebug</i> in the <i>slurm.conf</i> file)
+and restart.
+Again check the log file for an indication of why it failed.
+At this point, you should contact the SLURM team for help at
+<a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</li>
+
+<li>If still not responding without an indication as to the failure
+mode, restart without preserving state (typically as user root
+using the commands "<i>/etc/init.d/slurm stop</i>"
+and then "<i>/etc/init.d/slurm startclean</i>").
+Note: All jobs and other state information on that node will be lost.</li>
+</ol>
+<p class="footer"><a href="#top">top</a></p>
+
<h2><a name="network">Networking and configuration problems</a></h2> <ol> <li>Check the controller and/or slurmd log files (<i>SlurmctldLog</i> and <i>SlurmdLog</i> in the <i>slurm.conf</i> file) for an indication of why it is failing.</li>
+
<li>Check for consistent <i>slurm.conf</i> and credential files on the node(s) experiencing problems.</li>
+
<li>If this is a user-specific problem, check that the user is configured on the controller computer(s) as well as the compute nodes. The user doesn't need to be able to log in, but the user ID must exist.</li>
+
<li>Check that a consistent version of SLURM exists on all of the nodes (execute "<i>sinfo -V</i>" or "<i>rpm -qa | grep slurm</i>"). If the first two digits of the version number match, it should work fine, but version 1.1 commands will not work with version 1.2 daemons or vice-versa.</li> </ol>
+<p class="footer"><a href="#top">top</a></p>
+
<p style="text-align:center;">Last modified 12 October 2006</p>
-- GitLab
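The guide's recurring step, checking whether slurmctld or slurmd is alive with "ps -el | grep slurmctld" and restarting it if not, can be sketched as a small shell helper. This is a minimal illustration, not part of SLURM: the function name `check_daemon` and its messages are invented here, and it assumes the init script path `/etc/init.d/slurm` mentioned in the guide.

```shell
#!/bin/sh
# Sketch of the liveness check the guide performs by hand with
# "ps -el | grep slurmctld".  check_daemon is a hypothetical helper,
# not a SLURM command.
check_daemon() {
    name="$1"
    # List only process command names, then look for an exact match,
    # avoiding the classic pitfall of grep matching its own process.
    if ps -e -o comm= | grep -qx "$name"; then
        echo "$name is running"
        return 0
    else
        echo "$name is NOT running; restart it, e.g. /etc/init.d/slurm start"
        return 1
    fi
}

# On the controller one would run:  check_daemon slurmctld
# On a compute node:                check_daemon slurmd
```

The exit status lets the check be chained into the guide's restart step, e.g. `check_daemon slurmctld || /etc/init.d/slurm start` (run as root).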