update some information about open file limits

Commit d21e3e3a authored 17 years ago by Moe Jette
--- a/doc/html/big_sys.shtml
+++ b/doc/html/big_sys.shtml
@@ -5,7 +5,7 @@
 <p>This document contains SLURM administrator information specifically 
 for clusters containing 1,024 nodes or more. 
 Virtually all SLURM components have been validated (through emulation) 
-for clusters containing up to 16,384 compute nodes. 
+for clusters containing up to 65,536 compute nodes. 
 Getting good performance at that scale does require some tuning and 
 this document should help you off to a good start.
 A working knowledge of SLURM should be considered a prerequisite 
@@ -16,8 +16,8 @@ for this material.</p>
 <p>While allocating individual processors within a node is great 
 for smaller clusters, the overhead of keeping track of the individual 
 processors and memory within each node adds significant overhead. 
-For best scalability, the consumable resource plugin (<i>select/cons_res</i>)
-is best avoided.</p>
+For best scalability, allocate whole nodes using <i>select/linear</i>
+or <i>select/bluegene</i> and avoid <i>select/cons_res</i>.</p>

 <h2>Job Accounting Gather Plugin (JobAcctGatherType)</h2>

@@ -61,7 +61,7 @@ and thus should not be allocated work.
 Longer intervals decrease system noise on compute nodes (we do 
 synchronize these requests across the cluster, but there will 
 be some impact upon applications).
-For really large clusters, <i>SlurmdTimeoutl</i> values of 
+For really large clusters, <i>SlurmdTimeout</i> values of 
 120 seconds or more are reasonable.</p> 

 <p>If MPICH-2 is used, the srun command will manage the key-pairs
@@ -72,8 +72,15 @@ This can be done by setting an environment variable PMI_TIME before
 executing srun to launch the tasks. 
 The default value of PMI_TIME is 500 and this is the number of 
 microseconds alloted to transmit each key-pair. 
-We have executed up to 16,000 tasks with a value of PMI_TIME=4000.
+We have executed up to 16,000 tasks with a value of PMI_TIME=4000.</p>

-<p style="text-align:center;">Last modified 19 September 2006</p>
+<h2>Limits</h2>
+
+<p>The srun command automatically increases its open file limit to 
+the hard limit in order to process all of the standard input and output
+connections to the launched tasks. It is recommended that you set the
+open file hard limit to 8192 across the cluster.</p>
+
+<p style="text-align:center;">Last modified 29 January 2008</p>

 <!--#include virtual="footer.txt"-->
--- a/doc/html/faq.shtml
+++ b/doc/html/faq.shtml
@@ -52,10 +52,12 @@ parallel for testing purposes?</a></li>
 <li><a href="#multi_slurmd">Can slurm emulate a larger cluster?</a></li>
 <li><a href="#extra_procs">Can SLURM emulate nodes with more 
 resources than physically exist on the node?</a></li>
-<li><a href="#credential_replayed">What does a "credential
-replayed" error in the <i>SlurmdLogFile</i> indicate?</a></li>
-<li><a href="#large_time">What does a "Warning: Note very large
-processing time" in the <i>SlurmctldLogFile</i> indicate?</a></li>
+<li><a href="#credential_replayed">What does a 
+&quot;credential replayed&quot; error in the <i>SlurmdLogFile</i> 
+indicate?</a></li>
+<li><a href="#large_time">What does 
+&quot;Warning: Note very large processing time&quot; 
+in the <i>SlurmctldLogFile</i> indicate?</a></li>
 <li><a href="#lightweight_core">How can I add support for lightweight
 core files?</a></li>
 <li><a href="#limit_propagation">Is resource limit propagation
@@ -69,6 +71,8 @@ generated?</a></li>
 errors generated?</a></li>
 <li><a href="#globus">Can SLURM be used with Globus?</li>
 <li><a href="#time_format">Can SLURM time output format include the year?</li>
+<li><a href="#file_limit">What causes the error 
+&quot;Unable to accept new connection: Too many open files&quot;?
 </ol>

 <h2>For Users</h2>
@@ -743,8 +747,9 @@ SLURM will use the resource specification for each node that is
 given in <i>slurm.conf</i> and will not check these specifications 
 against those actaully found on the node.

-<p><a name="credential_replayed"><b>16. What does a "credential 
-replayed" error in the <i>SlurmdLogFile</i> indicate?</b></a><br>
+<p><a name="credential_replayed"><b>16. What does a 
+&quot;credential replayed&quot; 
+error in the <i>SlurmdLogFile</i> indicate?</b></a><br>
 This error is indicative of the <i>slurmd</i> daemon not being able 
 to respond to job initiation requests from the <i>srun</i> command
 in a timely fashion (a few seconds).
@@ -768,8 +773,9 @@ value higher than the default 5 seconds.
 In earlier versions of Slurm, the <i>--msg-timeout</i> option 
 of <i>srun</i> serves a similar purpose.

-<p><a name="large_time"><b>17. What does a "Warning: Note very large
-processing time" in the <i>SlurmctldLogFile</i> indicate?</b></a><br>
+<p><a name="large_time"><b>17. What does 
+&quot;Warning: Note very large processing time&quot; 
+in the <i>SlurmctldLogFile</i> indicate?</b></a><br>
 This error is indicative of some operation taking an unexpectedly
 long time to complete, over one second to be specific.
 Setting the value of <i>SlurmctldDebug</i> configuration parameter 
@@ -858,6 +864,13 @@ Define &quot;ISO8601&quot; at SLURM build time to get the time format
 Note that this change in format will break anything that parses 
 SLURM output expecting the old format (e.g. LSF, Maui or Moab).

+<p><a name="file_limit"><b>25. What causes the error 
+&quot;Unable to accept new connection: Too many open files&quot;?</b><br>
+The srun command automatically increases its open file limit to 
+the hard limit in order to process all of the standard input and output
+connections to the launched tasks. It is recommended that you set the
+open file hard limit to 8192 across the cluster.
+
 <p class="footer"><a href="#top">top</a></p>

 <p style="text-align:center;">Last modified 6 December 2007</p>
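
Note on the added "Limits" text: the sketch below shows what "increases its open file limit to the hard limit" amounts to at the system-call level. It is an illustration in Python, not srun's actual code (srun is written in C). The function name `raise_nofile_to_hard_limit` and the constant `RECOMMENDED_HARD_LIMIT` are hypothetical; the value 8192 is the recommendation from the documentation text in this commit. It assumes a POSIX system where Python's `resource` module is available.

```python
import resource

# Value recommended by the documentation text in this commit.
RECOMMENDED_HARD_LIMIT = 8192


def raise_nofile_to_hard_limit():
    """Raise the soft open-file limit to the hard limit.

    Returns (old_soft, new_soft, hard). If the system refuses the
    change, the soft limit is left as it was.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_soft != hard:
        try:
            # An unprivileged process may raise its soft limit up to,
            # but not beyond, the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # keep the existing soft limit
    new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return old_soft, new_soft, hard


if __name__ == "__main__":
    old_soft, new_soft, hard = raise_nofile_to_hard_limit()
    print(f"soft limit: {old_soft} -> {new_soft} (hard limit: {hard})")
    if hard != resource.RLIM_INFINITY and hard < RECOMMENDED_HARD_LIMIT:
        print(f"hard limit is below the recommended {RECOMMENDED_HARD_LIMIT}")
```

Because a process cannot raise its own hard limit without privileges, the hard limit itself has to be configured by the administrator on each node, which is why the documentation recommends setting it across the cluster.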