Commit 08fccd83 authored by Moe Jette

Minor word-smithing.

parent a7f04397
\begin{abstract}
A new cluster resource management system called
Simple Linux Utility Resource Management (SLURM) is described
in this paper. SLURM, initially developed for large
Linux clusters at the Lawrence Livermore National Laboratory (LLNL),
is a simple cluster manager that can scale to thousands of processors.
SLURM is designed to be flexible and fault-tolerant and can be ported to
@@ -86,7 +86,7 @@ previously saved state information,
notifies the controller that it is active, waits for work,
executes the work, returns status, and waits for more work.
Since it initiates jobs for other users, it must run with root privilege.
%It also asynchronously exchanges node and job status information with {\tt slurmctld}.
The only job information it has at any given time pertains to its
currently executing jobs.
The \slurmd\ performs five major tasks.
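
The request-handling loop just described can be sketched as follows.
This is an illustrative sketch only; every identifier in it is a
hypothetical stub, not part of SLURM's actual source or internal API.

\begin{verbatim}
/* Hedged sketch of a slurmd-style request loop.
 * All names here are hypothetical stubs, not SLURM code. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int job_id; } request_t;

/* Stub: read the node's configuration (hypothetical). */
static void read_config(void)       { printf("config read\n"); }
/* Stub: tell the controller this node is up (hypothetical). */
static void notify_controller(void) { printf("node registered\n"); }

/* Stub: pretend two work requests arrive, then none. */
static bool wait_for_work(request_t *req)
{
    static int remaining = 2;
    if (remaining-- <= 0)
        return false;
    req->job_id = 100 + remaining;
    return true;
}

/* Stub: run the job and report its status (hypothetical). */
static void execute_and_report(const request_t *req)
{
    printf("job %d done\n", req->job_id);
}

int main(void)
{
    request_t req;

    read_config();                 /* read configuration        */
    notify_controller();           /* announce availability     */
    while (wait_for_work(&req))    /* wait for work ...         */
        execute_and_report(&req);  /* ... run it, return status */
    return 0;
}
\end{verbatim}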
@@ -124,7 +124,7 @@ Most SLURM state information is maintained by the controller, {\tt slurmctld}.
The \slurmctld\ is multi-threaded with independent read and write locks
for the various data structures to enhance scalability.
When \slurmctld\ starts, it reads the SLURM configuration file.
It can also read additional state information
from a checkpoint file generated by a previous execution of {\tt slurmctld}.
Full controller state information is written to
disk periodically, with incremental changes written to disk immediately
@@ -5,9 +5,9 @@ and portable cluster resource management system.
The contribution of this work is that we have provided an immediately available
and open-source tool that virtually anybody can use to efficiently manage clusters of
different sizes and architectures.
%We expect SLURM to begin production use on LLNL Linux clusters
%starting in March 2003 and be available for distribution shortly
%thereafter.
Looking ahead, we anticipate adding support for additional
operating systems.
@@ -8,16 +8,17 @@ The continuous decrease in the price of the COTS parts in conjunction with
the good scalability of the cluster architecture has now made it feasible to economically
build large-scale clusters with thousands of processors~\cite{MCRWeb,PCRWeb}.
An essential component needed to harness such a computer is a
resource management system.
A resource management system (or resource manager) performs such crucial tasks as
scheduling user jobs, monitoring machine and job status, launching user applications, and
managing machine configuration.
An ideal resource manager should be simple, efficient, scalable, fault-tolerant,
and portable.
Unfortunately, there are no open-source resource management systems currently available
which satisfy these requirements.
A survey~\cite{Jette02} has revealed that many existing resource managers have poor
scalability and fault tolerance, rendering them unsuitable for large clusters having
thousands of processors~\cite{LoadLevelerWeb,LoadLevelerManual}.
While some proprietary cluster managers are suitable for large clusters,
they are typically designed for particular computer systems and/or
@@ -30,7 +31,7 @@ even though the scheduler does not necessarily meet the need of organization tha
Clear separation of the cluster management functionality from scheduling policy is desired.
This observation led us to set out to design a simple, highly scalable, and
portable resource management system.
The result of this effort is Simple Linux Utility Resource Management
(SLURM\footnote{A tip of the hat to Matt Groening and creators of {\em Futurama},
where Slurm is the most popular carbonated beverage in the universe.}).
@@ -94,10 +95,9 @@ deterministic.
\end{itemize}
The main contribution of our work is that we have provided a readily available
tool that anybody can use to efficiently manage clusters of different sizes and architectures.
SLURM is highly scalable\footnote{It was observed that it took less than five seconds for SLURM to launch a 1900-task job over 950 nodes on a recently installed cluster at Lawrence Livermore National Laboratory.}.
SLURM can be easily ported to any cluster system with minimal effort using its plugin
capability, and can be used with any meta-batch scheduler or a Grid resource broker~\cite{Gridbook}
through its well-defined interfaces.
@@ -24,7 +24,7 @@ high priority queue for smaller "interactive" jobs. Signal to daemons
causes the current log file to be closed, renamed with a
time-stamp, and a new log file created.
Although PBS is portable and has a broad user base, it has significant drawbacks.
PBS is single threaded and hence exhibits poor performance on large clusters.
This is particularly problematic when a compute node in the system fails:
PBS tries to contact the down node while other activities must wait.
@@ -83,17 +83,17 @@ PBS also has a weak mechanism for starting and cleaning up parallel jobs.
\subsection{Quadrics RMS}
Quadrics RMS\cite{Quadrics02}
(Resource Management System) is for
Unix systems having Quadrics Elan interconnects.
RMS functionality and performance are excellent.
Its major limitation is the requirement for a Quadrics interconnect.
The proprietary code and cost may also pose difficulties under some
circumstances.
\subsection*{Maui Scheduler}
Maui Scheduler~\cite{Maui} is an advanced reservation HPC batch scheduler
for use with SP, O2K, and UNIX/Linux clusters.
It is widely used to extend the functionality of PBS and LoadLeveler,
which Maui requires to perform parallel job initiation and management.
@@ -128,7 +128,7 @@ overuse of a computer where not authorized.
% not authorized.
%\end{itemize}
DPCS supports only a
limited number of computer systems: IBM RS/6000 and SP, Linux,
Sun Solaris, and Compaq Alpha.
Like the Maui Scheduler, DPCS requires an underlying infrastructure for
@@ -140,10 +140,10 @@ LoadLeveler~\cite{LoadLevelerManual,LoadLevelerWeb}
is a proprietary batch system and parallel job manager by
IBM. LoadLeveler supports few non-IBM systems. Its native
scheduling software is very primitive, and other software, such as the
Maui Scheduler or DPCS, is required for reasonable performance.
LoadLeveler has a simple and very flexible queue and job class structure
operating in "matrix" fashion.
The biggest problem of LoadLeveler is its poor scalability.
It typically requires 20 minutes to execute even a trivial 500-node, 8000-task job
on the IBM SP computers at LLNL.
%In addition, all jobs must be initiated through the LoadLeveler, and a special version of
@@ -184,7 +184,7 @@ architectures, it has sophisticated scheduling software including
fair-share, backfill, consumable resources, job preemption, and a
very flexible queue structure.
It also provides good status information on nodes and LSF daemons.
While LSF is quite powerful, it is not open-source and can be costly on
larger clusters.
%The LSF share many of its shortcomings with the LoadLeveler: job initiation only
%through LSF, requirement of a special MPI library, etc.
@@ -227,17 +227,17 @@ their requirements for a broker to perform matches. The checkpoint
mechanism is used to relocate work on demand (when the "owner" of a
desktop machine wants to resume work).
%
%\subsection*{Linux PAGG Process Aggregates}
%
%PAGG~\cite{PAGG}
%consists of modifications to the linux kernel that allows
%developers to implement Process AGGregates as loadable kernel modules.
%A process aggregate is defined as a collection of processes that are
%all members of the same set. A set would be implemented as a container
%for the member processes. For instance, process sessions and groups
%could have been implemented as process aggregates.
%
\subsection*{Beowulf Distributed Process Space (BPROC)}
@@ -250,7 +250,9 @@ processes started with this mechanism appear in the process table
of the front end machine in a cluster. This allows remote process
management using the normal UNIX process control facilities. Signals
are transparently forwarded to remote processes and exit status is
received using the usual wait() mechanisms. This tight coupling of
a cluster's nodes is convenient, but high scalability can be difficult
to achieve.
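
For reference, the fragment below shows the ordinary single-machine
pattern that BPROC extends across nodes: signal a child process and
harvest its exit status with waitpid(). This is plain UNIX code for
illustration; nothing BPROC-specific appears in it.

\begin{verbatim}
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {              /* child: idle until signaled */
        pause();
        _exit(0);
    }
    sleep(1);                    /* crude: let child reach pause() */
    kill(pid, SIGTERM);          /* under BPROC, such signals are
                                    forwarded to remote processes */
    int status;
    waitpid(pid, &status, 0);    /* the usual wait() mechanism */
    if (WIFSIGNALED(status))
        printf("child ended by signal %d\n", WTERMSIG(status));
    return 0;
}
\end{verbatim}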
%\subsection{xcat}
%
@@ -269,18 +271,18 @@ received using the usual wait() mechanisms.
%NQS\footnote{http://umbc7.umbc.edu/nqs/nqsmain.html},
%the Network Queueing System, is a serial batch system.
%
%\subsection*{LAM / MPI}
%
%Local Area Multicomputer (LAM)~\cite{LAM}
%is an MPI programming environment and development system for heterogeneous
%computers on a network.
%With LAM, a dedicated cluster or an existing network
%computing infrastructure can act as one parallel computer solving
%one problem. LAM features extensive debugging support in the
%application development cycle and peak performance for production
%applications. LAM features a full implementation of the MPI
%communication standard.
%
%\subsection{MPICH}
%
%MPICH\footnote{http://www-unix.mcs.anl.gov/mpi/mpich/}