Add new document to describe large system configuration issues.

Commit 1deb3c5f authored 19 years ago by Moe Jette
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -7,6 +7,7 @@ htmldir = ${prefix}/share/doc/@PACKAGE@-@VERSION@/html
 generated_html = \
 	html/api.html \
 	html/authplugins.html \
+	html/big_sys.html \
 	html/bluegene.html \
 	html/checkpoint_plugins.html \
 	html/cons_res.html \

--- a/doc/html/big_sys.shtml
+++ b/doc/html/big_sys.shtml
+<!--#include virtual="header.txt"-->
+<h1>Large Cluster Administration Guide</h1>
+<p>This document contains SLURM administrator information specifically 
+for clusters containing 1,024 nodes or more. 
+Virtually all SLURM components have been validated (through emulation) 
+for clusters containing up to 16,384 compute nodes. 
+Getting good performance at that scale does require some tuning, and 
+this document should help you get off to a good start.
+A working knowledge of SLURM should be considered a prerequisite 
+for this material.</p>
+<h2>Node Selection Plugin (SelectType)</h2>
+<p>While allocating individual processors within a node is great 
+for smaller clusters, keeping track of the individual processors 
+and memory within each node adds significant overhead. 
+For best scalability, avoid the consumable resource plugin 
+(<i>select/cons_res</i>).</p>
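+<p>As a minimal sketch, a site that allocates whole nodes to jobs might 
+select the linear plugin in <i>slurm.conf</i>. The choice of 
+<i>select/linear</i> here is an assumption for illustration, not a 
+recommendation stated above; check the plugins available in your release.</p>
+<pre>
+# Assumed example: allocate whole nodes rather than individual processors
+SelectType=select/linear
+</pre>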
+<h2>Job Accounting Plugin (JobAcctType)</h2>
+<p>Job accounting relies upon the <i>slurmd</i> daemon on each compute 
+node periodically sampling data.
+This data collection takes compute cycles away from the application, 
+inducing what is known as <i>system noise</i>.
+For large parallel applications, this system noise can detract from 
+application scalability.
+For optimal application performance, disabling job accounting 
+is best (<i>jobacct/none</i>).
+Consider use of job completion records (<i>JobCompType</i>) for accounting 
+purposes as this entails far less overhead.
+If job accounting is required, configure the sampling interval 
+to a relatively large value (e.g. <i>JobAcctParameters="Frequency=300"</i>).
+Some experimentation may also be required to deal with collisions 
+on data transmission.</p>
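+<p>A minimal sketch of the settings described above follows. The 
+<i>jobcomp/filetxt</i> plugin and the log file path are assumptions 
+chosen for illustration; substitute whatever your site uses.</p>
+<pre>
+# Disable per-node sampling to avoid system noise
+JobAcctType=jobacct/none
+# Assumed example: record one entry per completed job instead
+JobCompType=jobcomp/filetxt
+JobCompLoc=/var/log/slurm_jobcomp.log
+# If job accounting is required, sample infrequently instead:
+# JobAcctParameters="Frequency=300"
+</pre>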
+<h2>Node Configuration</h2>
+<p>While SLURM can track the amount of memory and disk space actually found 
+on each compute node and use it for scheduling purposes, this entails 
+extra overhead. 
+Optimize performance by specifying the expected configuration using 
+the available parameters (<i>RealMemory</i>, <i>Procs</i>, and 
+<i>TmpDisk</i>). 
+If the node is found to contain fewer resources than configured, 
+it will be marked DOWN and not used. 
+Also set the <i>FastSchedule</i> parameter.
+While SLURM can easily handle a heterogeneous cluster, configuring 
+the nodes using the minimal number of lines in <i>slurm.conf</i>
+will make for both easier administration and better performance.</p>
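+<p>For illustration, a homogeneous cluster can be described on a single 
+line. The node names, counts, sizes and the <i>FastSchedule</i> value 
+below are hypothetical and should be replaced with your own.</p>
+<pre>
+# Assumed value: schedule from the configured sizes below
+FastSchedule=1
+# Hypothetical 1,024-node cluster described on one line
+NodeName=tux[0-1023] Procs=2 RealMemory=2048 TmpDisk=16384
+</pre>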
+<h2>Timers</h2>
+<p>The parameter <i>HeartBeatInterval</i> determines the interval 
+at which <i>slurmctld</i> routinely communicates with <i>slurmd</i>.
+The purpose of this is to determine when a compute node fails 
+and thus should not be allocated work. 
+Longer intervals decrease system noise on compute nodes (we do 
+synchronize these requests across the cluster, but there will 
+be some impact upon applications).
+For very large clusters, <i>HeartBeatInterval</i> values of 
+60 seconds or more are reasonable. 
+The values of <i>SlurmctldTimeout</i> and <i>SlurmdTimeout</i>
+may also need to be increased (for example, to double the value of 
+<i>HeartBeatInterval</i>).</p>
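+<p>As a sketch of the guidance above, with both timeouts set to double 
+the heartbeat interval (the exact values are illustrative and should be 
+tuned for your cluster):</p>
+<pre>
+HeartBeatInterval=60
+SlurmctldTimeout=120
+SlurmdTimeout=120
+</pre>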
+<p style="text-align:center;">Last modified 14 January 2006</p>
+<!--#include virtual="footer.txt"-->
--- a/doc/html/documentation.shtml
+++ b/doc/html/documentation.shtml
@@ -13,6 +13,7 @@
 <h2>SLURM Administrators</h2>
 <ul>
 <li><a href="quickstart_admin.shtml">Quick Start Administrator Guide</a></li>
+<li><a href="big_sys.shtml">Large Cluster Administration Guide</a></li>
 <li><a href="cons_res.shtml">Consumable Resources Guide</a></li>
 <li><a href="bluegene.shtml">Blue Gene User and Administrator Guide</a></li>
 <li><a href="ibm.shtml">IBM AIX User and Administrator Guide</a></li>