Commit 1deb3c5f, authored 19 years ago by Moe Jette
Add new document to describe large system configuration issues.
Parent: 23d427b8
Showing 3 changed files with 73 additions and 0 deletions:
doc/Makefile.am (+1, -0)
doc/html/big_sys.shtml (+71, -0)
doc/html/documentation.shtml (+1, -0)
doc/Makefile.am
@@ -7,6 +7,7 @@ htmldir = ${prefix}/share/doc/@PACKAGE@-@VERSION@/html
 generated_html = \
 	html/api.html \
 	html/authplugins.html \
+	html/big_sys.html \
 	html/bluegene.html \
 	html/checkpoint_plugins.html \
 	html/cons_res.html \
 ...
doc/html/big_sys.shtml (new file, mode 100644)
<!--#include virtual="header.txt"-->
<h1>Large Cluster Administration Guide</h1>
<p>This document contains SLURM administrator information specifically
for clusters containing 1,024 nodes or more.
Virtually all SLURM components have been validated (through emulation)
for clusters containing up to 16,384 compute nodes.
Getting good performance at that scale does require some tuning, and
this document should help you get off to a good start.
A working knowledge of SLURM should be considered a prerequisite
for this material.</p>
<h2>Node Selection Plugin (SelectType)</h2>
<p>While allocating individual processors within a node is great
for smaller clusters, keeping track of the individual
processors and memory within each node adds significant overhead.
For best scalability, avoid the consumable resource plugin
(<i>select/cons_res</i>).</p>
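<p>As a minimal sketch of the advice above, a <i>slurm.conf</i> entry that
avoids <i>select/cons_res</i> might look like the following.
It assumes the whole-node plugin <i>select/linear</i> is available in
your SLURM version, so check your installation before relying on it.</p>
<pre>
# Allocate whole nodes rather than individual processors.
# Assumption: the select/linear plugin is installed at your site.
SelectType=select/linear
</pre>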
<h2>Job Accounting Plugin (JobAcctType)</h2>
<p>Job accounting relies upon the <i>slurmd</i> daemon on each compute
node periodically sampling data.
This data collection takes compute cycles away from the application,
inducing what is known as <i>system noise</i>.
For large parallel applications, this system noise can detract from
application scalability.
For optimal application performance, disable job accounting
(<i>jobacct/none</i>).
Consider using job completion records (<i>JobCompType</i>) for accounting
purposes, as this entails far less overhead.
If job accounting is required, configure a relatively long sampling
interval (e.g. <i>JobAcctParameters="Frequency=300"</i>).
Some experimentation may also be required to deal with collisions
on data transmission.</p>
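<p>The two approaches described above could be written in <i>slurm.conf</i>
roughly as follows.
This is a sketch rather than a tested configuration: the
<i>jobcomp/filetxt</i> plugin name is an assumption about which job
completion plugin your site uses, and the accounting plugin for the
second option is deliberately left for you to fill in.</p>
<pre>
# Option 1: no job accounting; rely on job completion records instead.
JobAcctType=jobacct/none
# Assumption: substitute whichever job completion plugin your site uses.
JobCompType=jobcomp/filetxt

# Option 2: keep job accounting, but sample infrequently.
# Set JobAcctType to your site's accounting plugin, then uncomment:
#JobAcctParameters="Frequency=300"
</pre>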
<h2>Node Configuration</h2>
<p>While SLURM can track the amount of memory and disk space actually found
on each compute node and use it for scheduling purposes, this entails
extra overhead.
Optimize performance by specifying the expected configuration using
the available parameters (<i>RealMemory</i>, <i>Procs</i>, and
<i>TmpDisk</i>).
If a node is found to contain fewer resources than configured,
it will be marked DOWN and not used.
Also set the <i>FastSchedule</i> parameter so that scheduling decisions
are based on the configured values rather than those reported by each node.
While SLURM can easily handle a heterogeneous cluster, configuring
the nodes using the minimal number of lines in <i>slurm.conf</i>
will make for both easier administration and better performance.</p>
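<p>For illustration, a homogeneous cluster can be described with a single
node line.
The host names, node count, and sizes below are hypothetical placeholders,
not recommendations; substitute the values that match your hardware.</p>
<pre>
# Base scheduling on the configured node values.
FastSchedule=1
# One line describes every node (hypothetical names and sizes).
NodeName=tux[0-16383] Procs=2 RealMemory=2048 TmpDisk=16384
</pre>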
<h2>Timers</h2>
<p>The parameter <i>HeartBeatInterval</i> determines the interval
at which <i>slurmctld</i> routinely communicates with <i>slurmd</i>.
The purpose of this is to determine when a compute node fails
and thus should not be allocated work.
Longer intervals decrease system noise on compute nodes (we do
synchronize these requests across the cluster, but there will
be some impact upon applications).
For very large clusters, <i>HeartBeatInterval</i> values of
60 seconds or more are reasonable.
The values of <i>SlurmctldTimeout</i> and <i>SlurmdTimeout</i>
may also need to be increased (say, to double the value of
<i>HeartBeatInterval</i>).</p>
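<p>A sketch of timer settings that follow the guidance above is shown below.
The specific numbers are illustrative assumptions only (a 60 second
heartbeat with both timeouts set to double that value) and should be
tuned for your cluster.</p>
<pre>
# Illustrative values for a very large cluster.
HeartBeatInterval=60
# Timeouts set to double the heartbeat interval.
SlurmctldTimeout=120
SlurmdTimeout=120
</pre>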
<p style="text-align:center;">Last modified 14 January 2006</p>
<!--#include virtual="footer.txt"-->
doc/html/documentation.shtml
@@ -13,6 +13,7 @@
 <h2>SLURM Administrators</h2>
 <ul>
 <li><a href="quickstart_admin.shtml">Quick Start Administrator Guide</a></li>
+<li><a href="big_sys.shtml">Large Cluster Administration Guide</a></li>
 <li><a href="cons_res.shtml">Consumable Resources Guide</a></li>
 <li><a href="bluegene.shtml">Blue Gene User and Administrator Guide</a></li>
 <li><a href="ibm.shtml">IBM AIX User and Administrator Guide</a></li>
 ...