Commit abc7459e authored by Moe Jette

General update.

parent 6a3235c1
@@ -9,7 +9,7 @@
 <meta http-equiv="keywords" content="Simple Linux Utility for Resource Management, SLURM, resource management,
 Linux clusters, high-performance computing, Livermore Computing">
 <meta name="LLNLRandR" content="UCRL-WEB-213976">
-<meta name="LLNLRandRdate" content="25 April 2005">
+<meta name="LLNLRandRdate" content="7 December 2005">
 <meta name="distribution" content="global">
 <meta name="description" content="Simple Linux Utility for Resource Management">
 <meta name="copyright"
@@ -66,7 +66,7 @@ work.</p>
 <p>SLURM has been developed through the collaborative efforts of
 <a href="http://www.llnl.gov/">Lawrence Livermore National Laboratory (LLNL)</a>,
-<a href="http://www.hp.com/">HP</a>,
+<a href="http://www.hp.com/">Hewlett-Packard</a>,
 <a href="http://www.lnxi.com/">Linux NetworX</a>, and
 <a href="http://www.pathscale.com/">PathScale</a>.
 Linux NetworX distributes SLURM as a component in their ClusterWorX software.
@@ -79,28 +79,69 @@ event of failure. Each compute server (node) has a <b>slurmd</b> daemon, which
 can be compared to a remote shell: it waits for work, executes that work, returns
 status, and waits for more work. User tools include <b>srun</b> to initiate jobs,
 <b>scancel</b> to terminate queued or running jobs, <b>sinfo</b> to report system
-status, and <b>squeue</b> to report the status of jobs. There is also an administrative
+status, and <b>squeue</b> to report the status of jobs.
+The <b>smap</b> command graphically reports system and job status including
+network topology. There is also an administrative
 tool <b>scontrol</b> available to monitor and/or modify configuration and state
 information. APIs are available for all functions.</p>
-<p><img src="arch.gif" width="552" height="432"></p>
+<p><img src="arch.gif" width="600"></p>
+<p><b>Figure 1. SLURM components</b></p>
 <p>SLURM has a general-purpose plugin mechanism available to easily support various
-infrastructure. These plugins presently include:
+infrastructure. This permits a wide variety of SLURM configurations using a
+building block approach. These plugins presently include:
 <ul>
-<li>Authentication of communications: <a href="http://www.theether.org/authd/">authd</a>,
+<li><a href="authplugins.html">Authentication of communications</a>:
+<a href="http://www.theether.org/authd/">authd</a>,
 <a href="ftp://ftp.llnl.gov/pub/linux/munge/">munge</a>, or none (default).</li>
-<li>Checkpoint: AIX (under development) or none.</li>
-<li>Job logging: text file, arbitrary script, or none (default).</li>
-<li>Node selection: Blue Gene (a 3-D torus interconnect) or linear.</li>
-<li>Scheduler: <a href="http://supercluster.org/maui">The Maui Scheduler</a>,
-backfill, or FIFO (default).</li>
-<li>Switch or interconnect: <a href="http://www.quadrics.com/">Quadrics</a>
-(Elan3 or Elan4), Federation
-(<a href="http://publib-b.boulder.ibm.com/Redbooks.nsf/f338d71ccde39f08852568dd006f956d/55258945787efc2e85256db00051980a?OpenDocument">
-IBM High Performance Switch</a>), or none (actually means nothing requiring
-special handling, such as Ethernet or
-<a href="http://www.myricom.com/">Myrinet</a>, default).</li>
+<li><a href="checkpoint_plugins.html">Checkpoint</a>: AIX or none.</li>
+<li><a href="jobacctplugins.html">Job accounting</a>: log or none.</li>
+<li><a href="jobcompplugins.html">Job completion logging</a>: text file,
+arbitrary script, or none (default).</li>
+<li><a href="mpiplugins.html">MPI</a>: LAM, MPICH-GM, MVAPICH,
+and none (default, for most other versions of MPI).</li>
+<li><a href="selectplugins.html">Node selection</a>:
+Blue Gene (a 3-D torus interconnect),
+<a href="cons_res.html">consumable resources</a> (to allocate
+individual processors and memory), or linear (to dedicate entire nodes).</li>
+<li>Process tracking (for signaling): AIX, Linux process tree hierarchy,
+process group ID, and RMS (Quadrics Linux kernel patch).</li>
+<li><a href="schedplugins.html">Scheduler</a>:
+<a href="http://supercluster.org/maui">The Maui Scheduler</a>,
+backfill, or FIFO (default).</li>
+<li><a href="switchplugins.html">Switch or interconnect</a>:
+<a href="http://www.quadrics.com/">Quadrics</a> (Elan3 or Elan4),
+<a href="http://publib-b.boulder.ibm.com/Redbooks.nsf/f338d71ccde39f08852568dd006f956d/55258945787efc2e85256db00051980a?OpenDocument">Federation</a> (IBM High Performance Switch),
+or none (actually means nothing requiring special handling, such as Ethernet or
+<a href="http://www.myricom.com/">Myrinet</a>, default).</li>
 </ul>
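The scheduler plugins named in the list above differ mainly in how they pick the next job to start. A minimal sketch of the FIFO-versus-backfill distinction (illustrative only, not SLURM's implementation; real backfill also respects the reserved start time of the blocked job):

```python
# Jobs are (name, nodes_needed) pairs, already in queue order.

def fifo(queue, free_nodes):
    """Start jobs strictly in order; stop at the first job that does not fit."""
    started = []
    for name, need in queue:
        if need > free_nodes:
            break  # head of queue blocks everything behind it
        started.append(name)
        free_nodes -= need
    return started

def backfill(queue, free_nodes):
    """Like FIFO, but let smaller jobs run ahead of a blocked job if they fit."""
    started = []
    for name, need in queue:
        if need <= free_nodes:
            started.append(name)
            free_nodes -= need
    return started

queue = [("a", 4), ("b", 16), ("c", 2)]
print(fifo(queue, 8))      # ['a'] -- "b" blocks the queue
print(backfill(queue, 8))  # ['a', 'c'] -- "c" backfills around "b"
```

The two functions share the fit test; only the decision at a blocked job differs, which is why SLURM can swap schedulers behind one plugin interface.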
+<p>The entities managed by these SLURM daemons, shown in Figure 2, include <b>nodes</b>,
+the compute resource in SLURM; <b>partitions</b>, which group nodes into logical
+sets; <b>jobs</b>, or allocations of resources assigned to a user for
+a specified amount of time; and <b>job steps</b>, which are sets of (possibly
+parallel) tasks within a job.
+The partitions can be considered job queues, each of which has an assortment of
+constraints such as job size limit, job time limit, users permitted to use it, etc.
+Priority-ordered jobs are allocated nodes within a partition until the resources
+(nodes, processors, memory, etc.) within that partition are exhausted. Once
+a job is assigned a set of nodes, the user is able to initiate parallel work in
+the form of job steps in any configuration within the allocation. For instance,
+a single job step may be started that utilizes all nodes allocated to the job,
+or several job steps may independently use a portion of the allocation.</p>
+<p><img src="entities.gif" width="291" height="218"></p>
+<p><b>Figure 2. SLURM entities</b></p>
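The allocation rule described above, where priority-ordered jobs receive nodes within a partition until its resources are exhausted, can be sketched as follows (a hypothetical helper, not a SLURM API; only node counts and a MaxNodes-style limit are modeled):

```python
def allocate(partition_nodes, jobs, max_nodes):
    """Allocate whole nodes to jobs in priority order.

    jobs: list of (priority, nodes_wanted) tuples.
    Returns {job_index: list_of_node_names} for the jobs that fit.
    """
    free = list(partition_nodes)
    alloc = {}
    # Higher priority first, mirroring priority-ordered allocation.
    order = sorted(range(len(jobs)), key=lambda i: -jobs[i][0])
    for i in order:
        want = jobs[i][1]
        # Enforce the partition's size limit and remaining free nodes.
        if want <= len(free) and want <= max_nodes:
            alloc[i] = free[:want]
            free = free[want:]
    return alloc

# 28 nodes, like the lx[0003-0030] debug partition in the configuration example.
nodes = [f"lx{n:04d}" for n in range(3, 31)]
jobs = [(10, 4), (50, 8), (20, 30)]  # (priority, nodes wanted)
print(allocate(nodes, jobs, max_nodes=16))
```

The 30-node request is skipped (it exceeds both the free nodes and the limit), while the two smaller jobs are placed highest priority first; within its allocation a job would then launch job steps however it likes.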
<p class="footer"><a href="#top">top</a></p> <p class="footer"><a href="#top">top</a></p>
<h3>Configurability</h3> <h3>Configurability</h3>
@@ -149,19 +190,13 @@ PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
 PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
 PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096
 </pre>
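The Nodes= values in the configuration example above use SLURM's bracketed hostlist notation (e.g. lx[0031-0040]). A sketch of how such an expression expands, assuming a single numeric range with fixed zero-padding (the function name is illustrative, not a SLURM utility):

```python
import re

def expand_hostlist(expr):
    """Expand a hostlist like "lx[0031-0040]" into individual node names,
    preserving the zero padding (lx0031, lx0032, ...)."""
    m = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]  # plain hostname, nothing to expand
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)  # pad to the width of the lower bound
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

print(expand_hostlist("lx[0031-0040]"))
# ['lx0031', 'lx0032', ..., 'lx0040'] -- the ten nodes of the class partition
```

This compact notation is what lets one-line partition definitions cover thousands of nodes, as in the batch partition's lx[0041-9999].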
-<h3>Status</h3>
-<p>SLURM has been deployed on all LLNL Linux clusters having Quadrics Elan switches
-since the summer of 2003. This includes IA32 and IA64 clusters having over 1000
-nodes. Fault-tolerance has been excellent. Parallel job performance has also been
-excellent. The throughput rate of simple 2000 task jobs across 1000 nodes is over
-12 per minute or under 5 seconds per job.</p>
<p class="footer"><a href="#top">top</a></p></td> <p class="footer"><a href="#top">top</a></p></td>
</tr> </tr>
<tr> <tr>
<td colspan="3"><hr> <p>For information about this page, contact <a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</p> <td colspan="3"><hr> <p>For information about this page, contact <a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</p>
<p><a href="http://www.llnl.gov/"><img align=middle src="lll.gif" width="32" height="32" border="0"></a></p> <p><a href="http://www.llnl.gov/"><img align=middle src="lll.gif" width="32" height="32" border="0"></a></p>
<p class="footer">UCRL-WEB-213976<br> <p class="footer">UCRL-WEB-213976<br>
Last modified 25 April 2005</p></td> Last modified 7 December 2005</p></td>
</tr> </tr>
</table> </table>
</td> </td>