<!--#include virtual="header.txt"-->
<h1>High Throughput Computing Administration Guide</h1>
<p>This document contains SLURM administrator information specifically
for high throughput computing, namely the execution of many short jobs.
Getting optimal performance for high throughput computing requires
some tuning, and this document should help you get off to a good start.
A working knowledge of SLURM is a prerequisite for this material.</p>
<h2>Performance Results</h2>
<p>SLURM has been validated to execute 32,000 jobs per hour on a
sustained basis (about 9 jobs per second).</p>
<h2>System Configuration</h2>
<p>Three system configuration parameters must be set to support a large number
of open files and TCP connections with large bursts of messages. To preserve
the changes across reboots, set them in the <b>/etc/rc.d/rc.local</b> script
or in <b>/etc/sysctl.conf</b>. The values can also be written directly to the
files under <b>/proc/sys</b> listed below
(e.g. <i>"echo 32832 &gt; /proc/sys/fs/file-max"</i>), but values written
this way are lost at the next reboot.</p>
<ul>
<li><b>/proc/sys/fs/file-max</b>:
The maximum number of concurrently open files.</li>
<li><b>/proc/sys/net/ipv4/tcp_max_syn_backlog</b>:
The maximum number of remembered connection requests that have not yet
received an acknowledgment from the connecting client.
The default value is 1024 for systems with more than 128 MB of memory and 128
for low-memory machines. If the server suffers from overload, try increasing
this number.</li>
<li><b>/proc/sys/net/core/somaxconn</b>:
The limit of the socket listen() backlog, known in userspace as SOMAXCONN.
The default value is 128. The value should be raised substantially to support
bursts of requests.
For example, to support a burst of 1024 requests, set somaxconn to 1024.</li>
</ul>
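<p>As a rough sketch of the settings above, the following shell function
prints lines that could be added to <b>/etc/sysctl.conf</b>. The function name
<i>print_htc_sysctl</i> is hypothetical. The values 32832 and 1024 are the
example values used in this document, while 2048 for tcp_max_syn_backlog is
only an assumed placeholder that should be tuned for your site.</p>

```shell
#!/bin/sh
# Print sysctl.conf entries for the three parameters described above.
# Optional arguments: file-max, tcp_max_syn_backlog, somaxconn.
# Defaults are illustrative: 32832 and 1024 are this document's example
# values; 2048 is an assumed placeholder, not a recommendation.
print_htc_sysctl() {
    file_max=${1:-32832}
    syn_backlog=${2:-2048}
    somaxconn=${3:-1024}
    printf 'fs.file-max = %s\n' "$file_max"
    printf 'net.ipv4.tcp_max_syn_backlog = %s\n' "$syn_backlog"
    printf 'net.core.somaxconn = %s\n' "$somaxconn"
}

# Show the fragment with the default (illustrative) values.
print_htc_sysctl
```

<p>After appending such lines to <b>/etc/sysctl.conf</b>, running
<i>"sysctl -p"</i> as root loads them without a reboot.</p>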
<h2>User Limits</h2>
<p>The <b>ulimit</b> values in effect for the <b>slurmctld</b> daemon should
be set quite high for memory size, open file count and stack size.</p>
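<p>One way to review these limits is sketched below. It reports the limits of
the current shell, which a daemon started from that shell would inherit. The
helper names <i>limit_ok</i> and <i>report_limit</i> and the minimum values are
assumptions chosen for illustration, not SLURM recommendations.</p>

```shell
#!/bin/sh
# Return success if a limit value meets a minimum; "unlimited" always passes.
limit_ok() {
    value=$1
    minimum=$2
    [ "$value" = "unlimited" ] && return 0
    [ "$value" -ge "$minimum" ]
}

# Print one limit of the current shell with an ok/low verdict.
# $1 = label, $2 = ulimit flag, $3 = assumed minimum (illustrative only)
report_limit() {
    value=$(ulimit "$2")
    if limit_ok "$value" "$3"; then status=ok; else status=low; fi
    printf '%s: %s (%s)\n' "$1" "$value" "$status"
}

report_limit "memory size (KB)" -m 1048576
report_limit "open files" -n 8192
report_limit "stack size (KB)" -s 8192
```

<p>On Linux, the limits of an already running daemon can be read from
<b>/proc/&lt;pid&gt;/limits</b>.</p>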
<h2>SLURM Configuration</h2>
<p>Several SLURM configuration parameters should be adjusted to reflect the
needs of high throughput computing.</p>
<ul>
<li><b>MaxJobCount</b>:
Controls how many jobs may be in the <b>slurmctld</b> daemon records at any
point in time (pending, running, suspended or temporarily completed).
The default value is 10,000.</li>
<li><b>MessageTimeout</b>:
Controls how long to wait for a response to messages.
The default value is 10 seconds.
While the <b>slurmctld</b> daemon is highly threaded, its responsiveness
is load dependent. This value may need to be increased substantially.
A value of 30 or 60 seconds should be sufficient in any case.</li>
<li><b>MinJobAge</b>:
Controls how soon the record of a completed job can be purged from the
<b>slurmctld</b> memory, after which it is no longer visible using the
<b>squeue</b> command.
The record of jobs run will be preserved in accounting records and logs.
The default value is 300 seconds. The value should be reduced to a few
seconds if possible.</li>
<li>Other: Configure logging, accounting and other overhead to a minimum
appropriate for your environment.</li>
</ul>
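<p>The parameters above might appear in a configuration file as in the sketch
below, which writes an example fragment to a temporary file and reads the
values back. The helper <i>slurm_param</i> is hypothetical. MessageTimeout=60
follows the guidance above, while MaxJobCount=100000 and MinJobAge=10 are
assumed values for illustration only.</p>

```shell
#!/bin/sh
# Print the value of a parameter from a slurm.conf-style file, or
# "(default)" if the parameter is not set. $1 = file, $2 = parameter name.
slurm_param() {
    awk -F= -v key="$2" '
        /^[ \t]*#/ { next }                      # skip comment lines
        { k = $1; gsub(/[ \t]/, "", k) }         # parameter name without blanks
        tolower(k) == tolower(key) {
            v = $2
            sub(/[ \t]*#.*/, "", v)              # drop a trailing comment
            gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim surrounding blanks
            found = v
        }
        END { print (found == "" ? "(default)" : found) }
    ' "$1"
}

# Example fragment; the values are illustrative assumptions.
conf=$(mktemp)
cat > "$conf" <<'EOF'
MaxJobCount=100000
MessageTimeout=60
MinJobAge=10
EOF

for p in MaxJobCount MessageTimeout MinJobAge; do
    printf '%s: %s\n' "$p" "$(slurm_param "$conf" "$p")"
done
rm -f "$conf"
```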
<p style="text-align:center;">Last modified 4 June 2010</p>
<!--#include virtual="footer.txt"-->