diff --git a/doc/html/cray.shtml b/doc/html/cray.shtml
index 2459d970bcee9e2716ee1ef99bcd7172ed1087f5..b538747c55b8c0ea9875981427de7b5883b47e2e 100644
--- a/doc/html/cray.shtml
+++ b/doc/html/cray.shtml
@@ -380,6 +380,12 @@ This is specified in the <i>slurm.conf</i> file by using the
 <i>FrontendName</i> and optionally the <i>FrontEndAddr</i> fields
 as seen in the examples below.</p>
 
+<p>Note that SLURM will by default kill running jobs when a node goes DOWN,
+while a DOWN node in ALPS only prevents new jobs from being scheduled on the
+node. To help avoid confusion, we recommend that <i>SlurmdTimeout</i> in the
+<i>slurm.conf</i> file be set to the same value as the <i>suspectend</i>
+parameter in ALPS' <i>nodehealth.conf</i> file.</p>
+
 <p>You need to specify the appropriate resource selection plugin (the
 <i>SelectType</i> option in SLURM's <i>slurm.conf</i> configuration file).
 Configure <i>SelectType</i> to <i>select/cray</i> The <i>select/cray</i>
@@ -450,6 +456,10 @@ SlurmdPidFile=/var/run/slurmd.pid
 # Return DOWN nodes to service when e.g. slurmd has been unresponsive
 ReturnToService=1
 
+# Configure the suspectend parameter in ALPS' nodehealth.conf file to the same
+# value as SlurmdTimeout for consistent behavior (e.g. "suspectend: 600")
+SlurmdTimeout=600
+
 # Controls how a node's configuration specifications in slurm.conf are
 # used.
 # 0 - use hardware configuration (must agree with slurm.conf)
@@ -621,6 +631,6 @@ allocation.</p>
 
 <p class="footer"><a href="#top">top</a></p>
 
-<p style="text-align:center;">Last modified 27 July 2011</p></td>
+<p style="text-align:center;">Last modified 28 July 2011</p></td>
 
 <!--#include virtual="footer.txt"-->
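
Reviewer note: the recommendation added by this patch (keep SlurmdTimeout in slurm.conf equal to suspectend in ALPS' nodehealth.conf) could be sanity-checked with a small script. The following is only a sketch and not part of the patch. The function names are hypothetical, the `suspectend: 600` syntax is taken from the comment in the patch, and the assumption that `#` starts a comment in nodehealth.conf is mine.

```python
import re


def _strip_comment(line):
    # Assumption: '#' starts a comment in both files.
    return line.split("#", 1)[0].strip()


def slurmd_timeout(slurm_conf_text):
    """Return SlurmdTimeout (seconds) from slurm.conf text, or None if unset."""
    for line in slurm_conf_text.splitlines():
        m = re.search(r"\bSlurmdTimeout\s*=\s*(\d+)",
                      _strip_comment(line), re.IGNORECASE)
        if m:
            return int(m.group(1))
    return None


def suspectend(nodehealth_conf_text):
    """Return suspectend (seconds) from nodehealth.conf text, or None if unset.

    Assumes the 'suspectend: <seconds>' form shown in the patch comment.
    """
    for line in nodehealth_conf_text.splitlines():
        m = re.match(r"suspectend\s*:\s*(\d+)", _strip_comment(line))
        if m:
            return int(m.group(1))
    return None


def timeouts_match(slurm_conf_text, nodehealth_conf_text):
    """True only if both values are present and equal."""
    a = slurmd_timeout(slurm_conf_text)
    b = suspectend(nodehealth_conf_text)
    return a is not None and a == b
```

To use it, read the two files from wherever they live on the site and pass their contents to `timeouts_match`. A `False` result means either that a value is missing or that the two timeouts differ.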