tud-zih-energy / Slurm / Commits

Commit d21e3e3a, authored 17 years ago by Moe Jette (parent 6e1c50ea)

    update some information about open file limits
Changes: 2 changed files, with 34 additions and 14 deletions

    doc/html/big_sys.shtml  +13 −6
    doc/html/faq.shtml      +21 −8
doc/html/big_sys.shtml (+13 −6)
@@ -5,7 +5,7 @@
 <p>This document contains SLURM administrator information specifically
 for clusters containing 1,024 nodes or more.
 Virtually all SLURM components have been validated (through emulation)
-for clusters containing up to 16,384 compute nodes.
+for clusters containing up to 65,536 compute nodes.
 Getting good performance at that scale does require some tuning and
 this document should help you off to a good start.
 A working knowledge of SLURM should be considered a prerequisite
@@ -16,8 +16,8 @@ for this material.</p>
 <p>While allocating individual processors within a node is great
 for smaller clusters, the overhead of keeping track of the individual
 processors and memory within each node adds significant overhead.
-For best scalability, the consumable resource plugin
-(<i>select/cons_res</i>) is best avoided.</p>
+For best scalability, allocate whole nodes using <i>select/linear</i>
+or <i>select/bluegene</i> and avoid <i>select/cons_res</i>.</p>
 <h2>Job Accounting Gather Plugin (JobAcctGatherType)</h2>
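The whole-node recommendation added in this hunk maps to a single configuration line; a minimal slurm.conf sketch, assuming a SLURM version of that era (parameter name from the SLURM docs):

```ini
# slurm.conf (fragment) -- allocating whole nodes avoids the per-CPU
# bookkeeping overhead that hurts scalability on 1,024+ node clusters
SelectType=select/linear
```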
@@ -61,7 +61,7 @@ and thus should not be allocated work.
 Longer intervals decrease system noise on compute nodes (we do
 synchronize these requests across the cluster, but there will
 be some impact upon applications).
-For really large clusters, <i>SlurmdTimeoutl</i> values of
+For really large clusters, <i>SlurmdTimeout</i> values of
 120 seconds or more are reasonable.</p>
 <p>If MPICH-2 is used, the srun command will manage the key-pairs
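The timeout advice in the hunk above can be expressed directly in slurm.conf; a sketch, not a tuned value for any particular site:

```ini
# slurm.conf (fragment) -- a longer slurmd timeout reduces keep-alive
# traffic; the text above suggests 120 seconds or more for very large
# clusters
SlurmdTimeout=120
```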
@@ -72,8 +72,15 @@ This can be done by setting an environment variable PMI_TIME before
 executing srun to launch the tasks.
 The default value of PMI_TIME is 500 and this is the number of
 microseconds alloted to transmit each key-pair.
 We have executed up to 16,000 tasks with a value of PMI_TIME=4000.
 </p>
-<p style="text-align:center;">Last modified 19 September 2006</p>
+<h2>Limits</h2>
+<p>The srun command automatically increases its open file limit to
+the hard limit in order to process all of the standard input and output
+connections to the launched tasks. It is recommended that you set the
+open file hard limit to 8192 across the cluster.</p>
+<p style="text-align:center;">Last modified 29 January 2008</p>
 <!--#include virtual="footer.txt"-->
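The PMI_TIME tuning described in the hunk above is just an environment variable set before the launch; a sketch (the application name is hypothetical):

```shell
# PMI_TIME is read by srun when relaying MPICH-2 PMI key-pairs; it is
# the number of microseconds allotted to transmit each key-pair.
# Default is 500; the text above reports 4000 working for 16,000 tasks.
export PMI_TIME=4000
# then launch, e.g.:  srun -n 16000 ./my_mpi_app   (hypothetical binary)
echo "PMI_TIME=$PMI_TIME"
```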
doc/html/faq.shtml (+21 −8)
@@ -52,10 +52,12 @@ parallel for testing purposes?</a></li>
 <li><a href="#multi_slurmd">Can slurm emulate a larger cluster?</a></li>
 <li><a href="#extra_procs">Can SLURM emulate nodes with more
 resources than physically exist on the node?</a></li>
-<li><a href="#credential_replayed">What does a "credential
-replayed" error in the <i>SlurmdLogFile</i> indicate?</a></li>
-<li><a href="#large_time">What does a "Warning: Note very large
-processing time" in the <i>SlurmctldLogFile</i> indicate?</a></li>
+<li><a href="#credential_replayed">What does a
+"credential replayed" error in the <i>SlurmdLogFile</i>
+indicate?</a></li>
+<li><a href="#large_time">What does
+"Warning: Note very large processing time"
+in the <i>SlurmctldLogFile</i> indicate?</a></li>
 <li><a href="#lightweight_core">How can I add support for lightweight
 core files?</a></li>
 <li><a href="#limit_propagation">Is resource limit propagation
@@ -69,6 +71,8 @@ generated?</a></li>
 errors generated?</a></li>
 <li><a href="#globus">Can SLURM be used with Globus?</li>
 <li><a href="#time_format">Can SLURM time output format include the year?</li>
+<li><a href="#file_limit">What causes the error
+"Unable to accept new connection: Too many open files"?
 </ol>
 <h2>For Users</h2>
@@ -743,8 +747,9 @@ SLURM will use the resource specification for each node that is
 given in <i>slurm.conf</i> and will not check these specifications
 against those actaully found on the node.
-<p><a name="credential_replayed"><b>16. What does a "credential
-replayed" error in the <i>SlurmdLogFile</i> indicate?</b></a><br>
+<p><a name="credential_replayed"><b>16. What does a
+"credential replayed"
+error in the <i>SlurmdLogFile</i> indicate?</b></a><br>
 This error is indicative of the <i>slurmd</i> daemon not being able
 to respond to job initiation requests from the <i>srun</i> command
 in a timely fashion (a few seconds).
@@ -768,8 +773,9 @@ value higher than the default 5 seconds.
 In earlier versions of Slurm, the <i>--msg-timeout</i> option
 of <i>srun</i> serves a similar purpose.
-<p><a name="large_time"><b>17. What does a "Warning: Note very large
-processing time" in the <i>SlurmctldLogFile</i> indicate?</b></a><br>
+<p><a name="large_time"><b>17. What does
+"Warning: Note very large processing time"
+in the <i>SlurmctldLogFile</i> indicate?</b></a><br>
 This error is indicative of some operation taking an unexpectedly
 long time to complete, over one second to be specific.
 Setting the value of <i>SlurmctldDebug</i> configuration parameter
@@ -858,6 +864,13 @@ Define "ISO8601" at SLURM build time to get the time format
 Note that this change in format will break anything that parses
 SLURM output expecting the old format (e.g. LSF, Maui or Moab).
+<p><a name="file_limit"><b>25. What causes the error
+"Unable to accept new connection: Too many open files"?</b><br>
+The srun command automatically increases its open file limit to
+the hard limit in order to process all of the standard input and output
+connections to the launched tasks. It is recommended that you set the
+open file hard limit to 8192 across the cluster.
+<p class="footer"><a href="#top">top</a></p>
 <p style="text-align:center;">Last modified 6 December 2007</p>
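The new FAQ entry above hinges on the per-process hard limit for open files; a sketch of how an administrator might inspect and raise it (the limits.conf path assumes pam_limits is in use):

```shell
# srun raises its own soft open-file limit up to the hard limit, so the
# hard limit caps how many stdio connections to tasks it can hold open.
ulimit -Hn    # show the current hard limit on open file descriptors
# To set the recommended 8192 hard limit cluster-wide, one common
# approach is a line like this in /etc/security/limits.conf:
#   *   hard   nofile   8192
```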