diff --git a/doc/html/overview.html b/doc/html/overview.html
index 9c12ef55c6a16a96e40e869941ff14e619fef959..74270239f751eaba4bbdb04933f0015e07c872ea 100644
--- a/doc/html/overview.html
+++ b/doc/html/overview.html
@@ -9,7 +9,7 @@
 <meta http-equiv="keywords" content="Simple Linux Utility for Resource Management, SLURM,
 resource management, Linux clusters, high-performance computing, Livermore Computing">
 <meta name="LLNLRandR" content="UCRL-WEB-213976">
-<meta name="LLNLRandRdate" content="25 April 2005">
+<meta name="LLNLRandRdate" content="7 December 2005">
 <meta name="distribution" content="global">
 <meta name="description" content="Simple Linux Utility for Resource Management">
 <meta name="copyright"
@@ -66,7 +66,7 @@
 work.</p>
 <p>SLURM has been developed through the collaborative efforts of
 <a href="http://www.llnl.gov/">Lawrence Livermore National Laboratory (LLNL)</a>,
-<a href="http://www.hp.com/">HP</a>,
+<a href="http://www.hp.com/">Hewlett-Packard</a>,
 <a href="http://www.lnxi.com/">Linux NetworX</a>, and
 <a href="http://www.pathscale.com/">PathScale</a>.
 Linux NetworX distributes SLURM as a component in their ClusterWorX software.
@@ -79,28 +79,69 @@
 event of failure. Each compute server (node) has a <b>slurmd</b> daemon, which can be compared
 to a remote shell: it waits for work, executes that work, returns status, and waits for more work.
 User tools include <b>srun</b> to initiate jobs, <b>scancel</b> to terminate queued or running jobs, <b>sinfo</b> to report system
-status, and <b>squeue</b> to report the status of jobs. There is also an administrative
+status, and <b>squeue</b> to report the status of jobs.
+The <b>smap</b> command graphically reports system and job status including
+network topology. There is also an administrative
 tool <b>scontrol</b> available to monitor and/or modify configuration and state
 information. APIs are available for all functions.</p>
 
-<p><img src="arch.gif" width="552" height="432"></p>
+<p><img src="arch.gif" width="600"></p>
+<p><b>Figure 1. SLURM components</b></p>
+
 <p>SLURM has a general-purpose plugin mechanism available to easily support various
-infrastructure. These plugins presently include:
+infrastructure. This permits a wide variety of SLURM configurations using a
+building block approach. These plugins presently include:
 <ul>
-<li>Authentication of communications: <a href="http://www.theether.org/authd/">authd</a>,
+<li><a href="authplugins.html">Authentication of communications</a>:
+<a href="http://www.theether.org/authd/">authd</a>,
 <a href="ftp://ftp.llnl.gov/pub/linux/munge/">munge</a>, or none (default).</li>
-<li>Checkpoint: AIX (under development) or none.</li>
-<li>Job logging: text file, arbitrary script, or none (default).</li>
-<li>Node selection: Blue Gene (a 3-D torus interconnect) or linear.</li>
-<li>Scheduler: <a href="http://supercluster.org/maui">The Maui Scheduler</a>,
+
+<li><a href="checkpoint_plugins.html">Checkpoint</a>: AIX or none.</li>
+
+<li><a href="jobacctplugins.html">Job accounting</a>: log or none.</li>
+
+<li><a href="jobcompplugins.html">Job completion logging</a>: text file,
+arbitrary script, or none (default).</li>
+
+<li><a href="mpiplugins.html">MPI</a>: LAM, MPICH-GM, MVAPICH,
+and none (default, for most other versions of MPI).</li>
+
+<li><a href="selectplugins.html">Node selection</a>:
+Blue Gene (a 3-D torus interconnect),
+<a href="cons_res.html">consumable resources</a> (to allocate
+individual processors and memory) or linear (to dedicate entire nodes).</li>
+
+<li>Process tracking (for signaling): AIX, Linux process tree hierarchy,
+process group ID, and RMS (Quadrics Linux kernel patch).</li>
+
+<li><a href="schedplugins.html">Scheduler</a>:
+<a href="http://supercluster.org/maui">The Maui Scheduler</a>,
 backfill, or FIFO (default).</li>
-<li>Switch or interconnect: <a href="http://www.quadrics.com/">Quadrics</a>
-(Elan3 or Elan4), Federation
-(<a href="http://publib-b.boulder.ibm.com/Redbooks.nsf/f338d71ccde39f08852568dd006f956d/55258945787efc2e85256db00051980a?OpenDocument">
-IBM High Performance Switch</a>), or none (actually means nothing requiring
-special handling, such as Ethernet or
+
+<li><a href="switchplugins.html">Switch or interconnect</a>:
+<a href="http://www.quadrics.com/">Quadrics</a>
+(Elan3 or Elan4),
+Federation
+(<a href="http://publib-b.boulder.ibm.com/Redbooks.nsf/f338d71ccde39f08852568dd006f956d/55258945787efc2e85256db00051980a?OpenDocument">IBM High Performance Switch</a>),
+or none (actually means nothing requiring special handling, such as Ethernet or
 <a href="http://www.myricom.com/">Myrinet</a>, default).</li>
 </ul>
+<p>The entities managed by these SLURM daemons, shown in Figure 2, include <b>nodes</b>,
+the compute resource in SLURM, <b>partitions</b>, which group nodes into logical
+sets, <b>jobs</b>, or allocations of resources assigned to a user for
+a specified amount of time, and <b>job steps</b>, which are sets of (possibly
+parallel) tasks within a job.
+The partitions can be considered job queues, each of which has an assortment of
+constraints such as job size limit, job time limit, users permitted to use it, etc.
+Priority-ordered jobs are allocated nodes within a partition until the resources
+(nodes, processors, memory, etc.) within that partition are exhausted. Once
+a job is assigned a set of nodes, the user is able to initiate parallel work in
+the form of job steps in any configuration within the allocation. For instance,
+a single job step may be started that utilizes all nodes allocated to the job,
+or several job steps may independently use a portion of the allocation.</p>
+<p><img src="entities.gif" width="291" height="218">
+<p><b>Figure 2. SLURM entities</b></p>
+
 <p class="footer"><a href="#top">top</a></p>
 
 <h3>Configurability</h3>
@@ -149,19 +190,13 @@
 PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
 PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
 PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096
 </pre>
-<h3>Status</h3>
-<p>SLURM has been deployed on all LLNL Linux clusters having Quadrics Elan switches
-since the summer of 2003. This includes IA32 and IA64 clusters having over 1000
-nodes. Fault-tolerance has been excellent. Parallel job performance has also been
-excellent. The throughput rate of simple 2000 task jobs across 1000 nodes is over
-12 per minute or under 5 seconds per job.</p>
 <p class="footer"><a href="#top">top</a></p></td>
 </tr>
 <tr>
 <td colspan="3"><hr>
 <p>For information about this page, contact <a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</p>
 <p><a href="http://www.llnl.gov/"><img align=middle src="lll.gif" width="32" height="32" border="0"></a></p>
 <p class="footer">UCRL-WEB-213976<br>
-Last modified 25 April 2005</p></td>
+Last modified 7 December 2005</p></td>
 </tr>
 </table>
 </td>
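
The user commands and the job/job-step model described in the patch can be illustrated with a short terminal session. This is only a sketch added for orientation, not part of the patch itself: the job id (42) is made up, and exact option behavior varies between SLURM releases.

    sinfo                    # report partition and node state
    srun -N2 -n4 hostname    # run a 4-task job across 2 nodes; the allocation is the
                             # job, the parallel tasks form a job step within it
    squeue                   # report the status of queued and running jobs
    smap                     # curses view of jobs, partitions, and topology
    scancel 42               # cancel queued or running job 42
    scontrol show partition  # administrative view of partition configuration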
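The plugin list added by the patch corresponds to plugin-type parameters in slurm.conf. The fragment below is a sketch of how a site might combine them; the parameter and plugin names follow the conventions of SLURM releases of roughly this vintage and should be checked against the slurm.conf man page for any particular version.

    # Illustrative plugin selections only; verify names against slurm.conf(5)
    AuthType=auth/munge              # authentication of communications
    CheckpointType=checkpoint/none
    JobCompType=jobcomp/filetxt      # job completion logging to a text file
    JobCompLoc=/var/log/slurm/job_completions
    MpiDefault=none
    ProctrackType=proctrack/pgid     # track processes by process group ID
    SchedulerType=sched/backfill
    SelectType=select/linear         # dedicate entire nodes to jobs
    SwitchType=switch/none           # Ethernet or Myrinet: no special handling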