tud-zih-energy / Slurm · Commits

Commit 4bf00f78
Authored 19 years ago by Moe Jette
Parent: f0bfee44

Major revisions.
Showing 1 changed file: doc/sigops/report.tex (+39 additions, −27 deletions)
@@ -127,7 +127,7 @@ Upon the program termination, its allocated resources
 are released for use by other programs.
 All of these operations are straightforward, but performing
 resource management on clusters containing thousands of
-nodes and over 100,000 processor cores requires more
+nodes and over 130,000 processor cores requires more
 than a high degree of parallelism.
 In many respects, data management and fault-tolerance issues
 are paramount.
@@ -135,7 +135,7 @@ are paramount.
 SLURM is a resource manager jointly developed by Lawrence
 Livermore National Laboratory (LLNL),
 Hewlett-Packard, and Linux NetworX
-\cite{SLURM2003,Yoo2003,SlurmWeb}.
+~\cite{SLURM2003,Yoo2003,SlurmWeb}.
 SLURM's general characteristics include:
 \begin{itemize}
@@ -148,7 +148,7 @@ workload prioritization.
 \item {\tt Open Source}: SLURM is available to everyone and
 will remain free.
 Its source code is distributed under the GNU General Public
-License \cite{GPL2002}.
+License~\cite{GPL2002}.
 \item {\tt Portability}: SLURM is written in the C language,
 with a GNU {\em autoconf} configuration engine.
@@ -165,8 +165,9 @@ is only 2 seconds for 4,800 tasks on 2,400 nodes. Clusters
 containing up to 16,384 nodes have been emulated with highly
 scalable performance.
-\item {\tt Fault Tolerance}: SLURM can handle a variety of failure
-modes without terminating workloads.
+\item {\tt Fault Tolerance}: SLURM can handle a variety of failures
+in hardware or the infrastructure without inducing failures in
+the workload.
 \item {\tt Security}: SLURM employs crypto technology to authenticate
 users to services and services to each other with a variety of options
@@ -189,7 +190,7 @@ interfaces are usable by scripts and its behavior is highly deterministic.
 \end{figure}
 SLURM's commands and daemons are illustrated in Figure~\ref{arch}.
-The main SLURM control program {\tt slurmctld} orchestrates
+The main SLURM control program, {\tt slurmctld}, orchestrates
 activities throughout the cluster. While highly optimized,
 {\tt slurmctld} is best run on a dedicated node of the cluster for optimal performance.
 In addition, SLURM provides the option of running a backup controller
@@ -235,8 +236,8 @@ to a user for a specified amount of time, and
 Each node must be capable of independent scheduling and job execution
 \footnote{On BlueGene computers, the c-nodes can not be independently
 scheduled. Each midplane or base partition is considered a SLURM node
-with 1,024 processors. SLURM configuration supports small bglblocks
-and permits the execution of more than one job per node.}.
+with 1,024 processors. SLURM supports the execution of more than one
+job per BlueGene node.}.
 Each job in the priority-ordered queue is allocated nodes within a single
 partition.
 Since nodes can be in multiple partitions, one can think of them as
@@ -254,7 +255,7 @@ While allocation of entire nodes to jobs is still a recommended mode of
 operation for very large clusters, an alternate SLURM plugin provides
 resource management down to the resolution of individual processors.
-The SLURM daemons and the command {\tt srun} are extensively
+SLURM's {\tt srun} command and daemons are extensively
 multi-threaded.
 {\tt slurmctld} also maintains independent read and
 write locks for critical data structures.
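Editor's note: the read/write locking mentioned in this hunk is the classic many-readers/one-writer pattern. A minimal C sketch with POSIX rwlocks is below; the node table and function names are hypothetical illustrations, not SLURM's internals.

    #include <pthread.h>

    /* Hypothetical shared node table; SLURM's real structures differ. */
    static int node_state[4096];
    static pthread_rwlock_t node_lock = PTHREAD_RWLOCK_INITIALIZER;

    /* Many RPC-handling threads may read node state concurrently. */
    int read_node_state(int node)
    {
        pthread_rwlock_rdlock(&node_lock);
        int state = node_state[node];
        pthread_rwlock_unlock(&node_lock);
        return state;
    }

    /* An updater (e.g. a job allocation) takes the exclusive lock. */
    void set_node_state(int node, int state)
    {
        pthread_rwlock_wrlock(&node_lock);
        node_state[node] = state;
        pthread_rwlock_unlock(&node_lock);
    }

The point of the design is that read-mostly queries (status commands, scheduling scans) never block one another; only state changes serialize.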
@@ -281,11 +282,12 @@ can communicate directly with {\tt slurmd} daemons on 32 nodes
 (the degree of fanout in communications is configurable).
 Each of those {\tt slurmd} will simultaneously forward the request
 to {\tt slurmd} programs on another 32 nodes.
-This improves performance by distributing the workload.
-Note that every communication is authenticated and acknowledged.
+This improves performance by distributing the communication workload.
+Note that every communication is authenticated and acknowledged
+for fault-tolerance.
 A number of interesting papers
-\cite{Jones2003,Kerbyson2001 Petrini2003,Phillips2003,Tsafrir2005}
+~\cite{Jones2003,Kerbyson2001,Petrini2003,Phillips2003,Tsafrir2005}
 have recently been written about
 the impact of system daemons and other system overhead on
 parallel job performance. This {\tt system noise} can have a
@@ -309,8 +311,9 @@ during the entire job execution period
 highly synchronized fashion across all nodes
 \end{itemize}
-In addition, the default mode of operation is to allocate entire
-nodes to applications rather than individual processors on each node.
-This eliminates the possibility of interference between jobs
+In addition, the default mode of operation is to allocate entire
+nodes with all of their processors to applications rather than
+individual processors on each node.
+This eliminates the possibility of interference between jobs,
 which could severely degrade performance of parallel applications.
 Allocation of resources to the resolution of individual processors
 on each node is supported by SLURM, but this comes at a higher cost
@@ -329,8 +332,9 @@ a prefix of "linux" and numeric suffix from 1 to 4096.
 This naming convention permits even the largest clusters
 to be described in a configuration file containing only a
 couple of dozen lines.
-State information output from various SLURM commands is
-equally terse.
+State information output from various SLURM commands uses
+the same convention to maintain a modest volume of output
+on even large clusters.
 Extensive use is made of bitmaps to represent nodes in the cluster.
 For example, bitmaps are maintained for each unique node configuration,
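Editor's note: the bitmap technique named here (and in the next hunk's header, "very rapid AND and OR operations on those bitmaps") can be sketched in a few lines of C. The types and names below are illustrative assumptions, not SLURM's internals.

    #include <stdint.h>

    /* One bit per node, as in the text; 4,096 nodes fit in 64 words. */
    #define NODES 4096
    #define WORDS (NODES / 64)
    typedef struct { uint64_t w[WORDS]; } node_bitmap_t;

    /* Candidate nodes for a job: in the requested partition AND idle
     * AND of a configuration meeting the job's requirements. Node
     * selection reduces to word-at-a-time AND operations. */
    static void candidates(node_bitmap_t *out,
                           const node_bitmap_t *partition,
                           const node_bitmap_t *idle,
                           const node_bitmap_t *config_ok)
    {
        for (int i = 0; i < WORDS; i++)
            out->w[i] = partition->w[i] & idle->w[i] & config_ok->w[i];
    }

A 4,096-node scheduling pass is thus 64 machine-word operations per bitmap rather than a walk over per-node records.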
@@ -341,20 +345,26 @@ scheduling operations to very rapid AND and OR operations on those bitmaps.
 \section{Application Launch}
 To better illustrate SLURM's operation, the execution of an
-application is detailed below.
+application is detailed below and illustrated in Figure~\ref{launch}.
 This example is based upon a typical configuration and the
 {\em interactive} mode, in which stdout and
 stderr are displayed on the user's terminal in real time, and stdin and
 signals may be forwarded from the terminal transparently to the remote
 tasks.
+\begin{figure}[tb]
+\centerline{\epsfig{file=../figures/arch.eps,scale=0.35}}
+\caption{\small SLURM Job Launch}
+\label{launch}
+\end{figure}
 The task launch request is initiated by a user's execution of the
 {\tt srun} command. {\tt Srun} has a multitude of options to specify
 resource requirements such as minimum memory per node, minimum
 temporary disk space per node, features associated with nodes,
 partition to use, node count, task count, etc.
 {\tt Srun} gets a credential to identify the user and his group
-then sends the request to {\tt slurmctld}.
+then sends the request to {\tt slurmctld} (message 1).
 {\tt Slurmctld} authenticates the request and identifies the resources
 to be allocated using a series of bitmap operations.
@@ -369,11 +379,12 @@ The requested node and/or processor count is then satisfied from
 the nodes identified with the resulting bitmap.
 This completes the job allocation process, but for interactive
 mode, a job step credential is also constructed for the allocation
-and sent to {\tt srun} in the launch reply.
-The {\tt srun} command opens sockets for task input and output then
-gets a second credential and sends the job step credential directly
-to the {\tt slurmd} daemons in order to launch the tasks.
+and sent to {\tt srun} in the reply (message 2).
+The {\tt srun} command opens sockets for task input and output then
+sends the job step credential directly to the {\tt slurmd} daemons
+(message 3) in order to launch the tasks, which is acknowledged
+(message 4).
 Note the {\tt slurmctld} and {\tt slurmd} daemons do not directly
 communicate during the task launch operation in order to minimize the
 workload on the {\tt slurmctld}, which has to manage the entire
@@ -381,12 +392,13 @@ cluster.
 Task termination is communicated to {\tt srun} over the same
 socket used for input and output.
-When all tasks have terminated, {\tt srun} gets its third credential
-and notifies {\tt slurmctld} of the job step termination.
-{\tt Slurmctld} authenticates the request and sends messages to
-the {\tt slurmd} daemons to ensure that all processes associated
-with the job have terminated.
-Upon receipt of job termination confirmation on each node,
+When all tasks have terminated, {\tt srun} notifies {\tt slurmctld}
+of the job step termination (message 5).
+{\tt Slurmctld} authenticates the request, acknowledges it
+(message 6) and sends messages to the {\tt slurmd} daemons to
+ensure that all processes associated with the job have
+terminated (message 7).
+Upon receipt of job termination confirmation on each node (message 8),
 {\tt slurmctld} releases the resources for use by another job.
 The full time for execution of a simple parallel application across
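Editor's note: the numbered messages this commit adds across the last three hunks form one protocol. The C sketch below simply replays that sequence as described in the text; msg() is a stand-in for real RPCs, and no SLURM API is implied.

    #include <stdio.h>

    static void msg(int n, const char *what)
    {
        printf("message %d: %s\n", n, what);
    }

    int main(void)
    {
        msg(1, "srun -> slurmctld: resource allocation request");
        msg(2, "slurmctld -> srun: reply with job step credential");
        /* srun opens sockets for task input and output, then: */
        msg(3, "srun -> slurmd: job step credential, launch tasks");
        msg(4, "slurmd -> srun: launch acknowledged");
        /* ... tasks run; termination arrives over the I/O socket ... */
        msg(5, "srun -> slurmctld: job step termination");
        msg(6, "slurmctld -> srun: termination acknowledged");
        msg(7, "slurmctld -> slurmd: ensure all job processes terminated");
        msg(8, "slurmd -> slurmctld: termination confirmed");
        /* slurmctld then releases the resources for another job */
        return 0;
    }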