
The Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling system
for large and small Linux clusters. SLURM requires no kernel modifications for
its operation and is relatively self-contained. As a cluster resource manager,
SLURM has three key functions. First, it allocates exclusive and/or non-exclusive
access to resources (compute nodes) to users for some duration of time so they
can perform work. Second, it provides a framework for starting, executing, and
monitoring work (normally a parallel job) on the set of allocated nodes. Finally,
it arbitrates conflicting requests for resources by managing a queue of pending
work.
SLURM has been developed through the collaborative efforts of
Lawrence Livermore National Laboratory (LLNL),
Hewlett-Packard,
Linux NetworX, and
PathScale.
Linux NetworX distributes SLURM as a component in their ClusterWorX software.
HP distributes and supports SLURM as a component in their XC System Software.
Architecture
SLURM has a centralized manager, slurmctld, to monitor resources and
work. There may also be a backup manager to assume those responsibilities in the
event of failure. Each compute server (node) has a slurmd daemon, which
can be compared to a remote shell: it waits for work, executes that work, returns
status, and waits for more work. User tools include srun to initiate jobs,
scancel to terminate queued or running jobs, sinfo to report system
status, and squeue to report the status of jobs.
The smap command graphically reports system and job status including
network topology. There is also an administrative
tool scontrol available to monitor and/or modify configuration and state
information. APIs are available for all functions.

Figure 1. SLURM components
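As a brief sketch of typical command-line usage (option details appear in each command's
man page; the job ID and program below are only illustrative):

sinfo                      # report the state of partitions and nodes
srun -N2 -n4 hostname      # run a 4-task parallel job across 2 nodes
squeue                     # list pending and running jobs
scancel 1234               # cancel the job with ID 1234
scontrol show partition    # display detailed partition information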
SLURM has a general-purpose plugin mechanism available to easily support various
infrastructures. This permits a wide variety of SLURM configurations using a
building block approach. These plugins presently include (see the configuration
sketch following this list):
- Authentication of communications:
authd,
munge, or none (default).
- Checkpoint: AIX or none.
- Job accounting: log or none.
- Job completion logging: text file,
arbitrary script, or none (default).
- MPI: LAM, MPICH-GM, MVAPICH,
or none (the default, for most other versions of MPI).
- Node selection:
Blue Gene (a 3-D torus interconnect),
consumable resources (to allocate
individual processors and memory), or linear (to dedicate entire nodes).
- Process tracking (for signaling): AIX, Linux process tree hierarchy,
process group ID, or RMS (Quadrics Linux kernel patch).
- Scheduler:
The Maui Scheduler,
backfill, or FIFO (default).
- Switch or interconnect:
Quadrics
(Elan3 or Elan4),
Federation (IBM High Performance Switch),
or none (the default, meaning an interconnect requiring no special handling, such as
Ethernet or Myrinet).
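Each plugin type is selected with a parameter in the SLURM configuration file (see the
sample slurm.conf below). The following sketch shows illustrative values only; the exact
parameter and plugin names recognized depend on the SLURM version installed.

# Illustrative plugin selections
AuthType=auth/munge            # authenticate communications using munge
SchedulerType=sched/backfill   # backfill scheduling
SelectType=select/linear       # dedicate entire nodes to jobs
SwitchType=switch/elan         # Quadrics Elan interconnect
MpiDefault=none                # default MPI type for srun
ProctrackType=proctrack/pgid   # track processes by process group ID
JobCompType=jobcomp/none       # no job completion logging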
The entities managed by these SLURM daemons, shown in Figure 2, include nodes (the
compute resource in SLURM), partitions (which group nodes into logical sets), jobs
(allocations of resources assigned to a user for a specified amount of time), and
job steps (sets of possibly parallel tasks within a job).
The partitions can be considered job queues, each of which has an assortment of
constraints such as job size limit, job time limit, users permitted to use it, etc.
Priority-ordered jobs are allocated nodes within a partition until the resources
(nodes, processors, memory, etc.) within that partition are exhausted. Once
a job is assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For instance,
a single job step may be started that utilizes all nodes allocated to the job,
or several job steps may independently use a portion of the allocation.
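For example, assuming a job has been allocated four nodes, the user could launch a single
job step spanning the entire allocation, or several concurrent smaller steps (the program
names here are hypothetical):

srun -N4 big_app               # one job step using all four allocated nodes
srun -N2 app_one &             # two concurrent job steps, each using
srun -N2 app_two &             # two of the four allocated nodes
wait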
Figure 2. SLURM entities
Configurability
The node state monitored includes: count of processors, size of real memory, size
of temporary disk space, and state (UP, DOWN, etc.). Additional node information
includes weight (preference in being allocated work) and features (arbitrary information
such as processor speed or type). Nodes are grouped into disjoint partitions.
Partition information includes: name, list of associated nodes, state (UP or DOWN),
maximum job time limit, maximum node count per job, group access list, and shared
node access (YES, NO, or FORCE). Bit maps are used to represent nodes, so scheduling
decisions can be made by performing a small number of comparisons and a series
of fast bit map manipulations. A sample (partial) SLURM configuration file follows.
#
# Sample /etc/slurm.conf
#
ControlMachine=linux0001
BackupController=linux0002
#
AuthType=auth/munge
Epilog=/usr/local/slurm/sbin/epilog
HeartbeatInterval=60
PluginDir=/usr/local/slurm/lib
Prolog=/usr/local/slurm/sbin/prolog
SlurmctldPort=7002
SlurmctldTimeout=120
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=120
StateSaveLocation=/usr/local/slurm/slurm.state
SwitchType=switch/elan
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096