Commit ac5dc247, authored 23 years ago by Moe Jette (parent ed16324e)

Minor updates - Jette
Major updates to controller description in report.tex. - Jette

1 changed file: doc/pubdesign/report.tex (+72 additions, -59 deletions)
@@ -195,7 +195,7 @@ copied'' to the {\tt srun} command during job execution.
 \subsubsection{Controller}
-Most SLURM state exists in the controller, {\tt slurmctld}.
+Most SLURM state information exists in the controller, {\tt slurmctld}.
 When {\tt slurmctld} starts, it reads its configuration from a file:
 {\tt /etc/slurmctld.conf}. It also can read additional state from a
 checkpoint file left over from a previous {\tt slurmctld}.
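To make this start-up sequence concrete, a minimal sketch follows: read the configuration file, then overlay any state saved by a previous {\tt slurmctld}. The helper names and the checkpoint path are assumptions for illustration, not SLURM's actual code.
\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

#define SLURMCTLD_CONF  "/etc/slurmctld.conf"   /* per the text above  */
#define CHECKPOINT_FILE "/var/slurm/checkpoint" /* hypothetical path   */

static int read_config(const char *path) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;          /* cannot run without a configuration */
    /* ... parse keyword=value pairs into node/partition tables ... */
    fclose(fp);
    return 0;
}

static void restore_checkpoint(const char *path) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return;             /* checkpoint state is optional */
    /* ... merge saved job and node state over the configuration ... */
    fclose(fp);
}

int main(void) {
    if (read_config(SLURMCTLD_CONF) < 0) {
        fprintf(stderr, "slurmctld: cannot read %s\n", SLURMCTLD_CONF);
        exit(1);
    }
    restore_checkpoint(CHECKPOINT_FILE);
    /* ... begin servicing requests ... */
    return 0;
}
\end{verbatim}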
@@ -285,7 +285,7 @@ authentication.
 \item {\tt slurmadmin}: Perform privileged administrative commands
 such as draining a partition in preparation for maintenance, or terminating
-jobs. Must be run as the root user.
+jobs. It must be run as the root user.
 \end{itemize}
@@ -415,8 +415,6 @@ a SIGINT resulting from a Control-C), it is sent to each {\tt slurmd} which
 terminates the individual tasks and reports this to the job status manager,
 which cleans up the job.
-\marginpar{This is as far as I got in the ``big picture'' update --JG}
 \section{Controller Design}
 The controller will be modular and multi-threaded.
@@ -453,21 +451,44 @@ Node information that we intend to monitor includes:
 \item Count of processors on the node
 \item Size of real memory on the node
 \item Size of temporary disk storage
-\item State of node (RUN, IDLE, DRAIN, etc.)
+\item State of node (RUN, IDLE, DRAINED, etc.)
+\item Weight (preference in being allocated work)
+\item Feature (arbitrary description)
 \end{itemize}
 The SLURM administrator could at a minimum specify a list of system node
 names using a regular expression (e.g. "NodeName=linux[001-512] CPUs=4
 RealMemory=1024 TmpDisk=4096").
-This would be considered the minimal node configuration values which are
-acceptable for the node to enter into service.
+These values for CPUs, RealMemory, and TmpDisk would be considered the
+minimal node configuration values which are acceptable for the node to
+enter into service.
+If a node registers with fewer resources, it will be placed in the DOWN
+state and the event will be logged.
+Note that the regular expression node name syntax permits even very large
+heterogeneous clusters to be described in only a few lines.
+In fact, a smaller number of unique configurations provides SLURM with
+greater efficiency in scheduling work.
+The weight is used to order available nodes when assigning work to them.
+In a heterogeneous cluster, more capable nodes (e.g. those with larger
+memory or faster processors) should be assigned a larger weight.
+The units used are arbitrary and should reflect the priorities of the
+resources.
+Pending jobs will be assigned the least capable nodes (i.e. lowest
+weight) which satisfy their requirements.
+This will tend to leave the more capable nodes available for those jobs
+requiring those capabilities.
+The feature is an arbitrary string describing the node, such as a
+particular software package or processor speed.
+While the feature does not have a numeric value, one might include a
+numeric value within the feature name (e.g. "1200MHz" or "16GB\_Swap").
 The partition manager will identify groups of nodes to be used for
 execution of user jobs. Data to be associated with a partition will include:
 \begin{itemize}
 \item Name
-\item Access controlled by key granted to key (so support external schedulers)
+\item Access controlled by key granted to user root (to support external schedulers)
-\item List of associated nodes
+\item List of associated nodes (may use regular expression)
 \item State of partition (UP or DOWN)
 \item Maximum time limit for any job
 \item Maximum nodes allocated to any single job
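Similarly, the partition data listed above could be pictured as a record like the following; the field names are hypothetical, since the report commits only to the data items, not their representation.
\begin{verbatim}
struct part_record {
    char name[32];       /* e.g. "batch"                               */
    int  key_required;   /* access controlled by key granted to root   */
    char nodes[128];     /* associated nodes, e.g. "lx[0041-9999]"     */
    int  up;             /* partition state: UP or DOWN                */
    int  max_time;       /* maximum time limit for any job, minutes    */
    int  max_nodes;      /* maximum nodes allocated to any single job  */
};
\end{verbatim}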
@@ -475,8 +496,8 @@ execution of user jobs. Data to be associated with a partition will include:
 \end{itemize}
 It will be possible to alter this data in real-time in order to affect the
-scheduling of pending jobs (currently executing jobs would continue). Unlike some
-other parallel job management systems, we believe this information can be
+scheduling of pending jobs (currently executing jobs would continue).
+We believe this information can be
 confined to the SLURM control machine for better scalability. It would be used
 by the Job Manager (and possibly an external scheduler), which either exist only
 on the control machine or communicate only with the control machine. An API to
@@ -498,12 +519,44 @@ such a capability at a later time if so desired.
 Future enhancements could include constraining jobs to a specific CPU count
 or memory size within a node, which could be used to space-share the node.
+Bit maps are used to indicate which nodes are up, idle, associated with
+each partition, and associated with each unique node configuration.
+This technique permits scheduling decisions to be made by performing a
+small number of tests followed by fast bit map manipulations.
+A sample configuration file follows.
+\begin{verbatim}
+#
+# Sample /etc/SLURM.conf
+# Author: John Doe
+# Date: 11/06/2001
+#
+ControlMachine=lx0001
+BackupController=lx0002
+#
+# Node Configurations
+#
+NodeName=DEFAULT TmpDisk=16384
+NodeName=lx[0001-0002] State=DRAINED
+NodeName=lx[0003-8000] CPUs=16 RealMemory=2048 Weight=16
+NodeName=lx[8001-9999] CPUs=32 RealMemory=4096 Weight=40 Feature=1200MHz
+#
+# Partition Configurations
+#
+PartitionName=DEFAULT MaxTime=30 MaxNodes=2
+PartitionName=login Nodes=lx[0001-0002] State=DOWN # Don't schedule work here
+PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
+PartitionName=class Nodes=lx[0031-0040] AllowGroups=students,teachers
+PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096 Key=YES
+\end{verbatim}
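The bit map technique described just before the sample configuration can be illustrated with a short sketch: one bit per node, so the set of candidate nodes is computed with a few word-wide AND operations. The array sizes and names are assumptions for illustration, not SLURM's implementation.
\begin{verbatim}
#include <stdint.h>

#define MAX_NODES 10000
#define WORDS ((MAX_NODES + 63) / 64)

uint64_t up_map[WORDS];      /* bit set if node is up          */
uint64_t idle_map[WORDS];    /* bit set if node is idle        */
uint64_t part_map[WORDS];    /* bit set if node in a partition */

/* Nodes that could be allocated right now: up AND idle AND in the
 * partition. One pass over about 157 64-bit words covers the
 * 10,000-node cluster in the sample configuration above. */
void runnable_nodes(uint64_t *result) {
    for (int i = 0; i < WORDS; i++)
        result[i] = up_map[i] & idle_map[i] & part_map[i];
}
\end{verbatim}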
 \subsection{Job Manager}
 The core functions to be supported by the job manager include:
 \begin{itemize}
 \item Queue job request
-\item Order job queue (under control of external scheduler)
+\item Reset priority of jobs (for external scheduler to order queue)
+\item Allocate nodes to job
 \item Initiate job
 \item Will job run query (test if "Initiate job" request would succeed)
 \item Status job (including node list, memory and CPU use data)
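One way to picture the core functions above is as a small C interface; all names and signatures here are illustrative assumptions, since the report specifies functionality rather than an API.
\begin{verbatim}
struct job_request;   /* script, node/CPU/memory needs, time limits */
struct job_status;    /* node list, memory and CPU use data         */

int job_queue(const struct job_request *req, int *job_id);
int job_set_priority(int job_id, int prio); /* for external scheduler */
int job_allocate_nodes(int job_id);
int job_initiate(int job_id);
int job_will_run(const struct job_request *req); /* would initiate
                                                    succeed? */
int job_status_get(int job_id, struct job_status *status);
\end{verbatim}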
@@ -543,7 +596,7 @@ DPCS\footnote{http://www.llnl.gov/icc/lc/dpcs/dpcs\_overview.html}.
 DPCS has flexible scheduling algorithms that suit our needs well and
 provide the scalability required for this application. Most of the resource
 accounting and some of the job management functions presently within DPCS would
-be moved into the proposed SLURM Job Management and Job Status components.
+be moved into the proposed SLURM Job Management component.
 DPCS will require some modification to operate within this new, richer
 environment. The DPCS Central Manager would also require porting to Linux.
@@ -554,8 +607,9 @@ We are not contemplating making this database software available through SLURM,
 but might consider writing this data to an open source database if so desired.
 System specific scripts can be executed prior to the initiation of a user job
-and after the termination of a user job (e.g. prolog and epilog). These scripts
-can be used to establish an appropriate environment for the user (e.g. permit
+and after the termination of a user job (e.g. prolog and epilog).
+These scripts are executed as user root and can be used to establish an
+appropriate environment for the user (e.g. permit
 logins, disable logins, terminate "orphan" processes, etc.).
 An API for all functions would be developed initially, followed by a
 command-line tool utilizing the API.
@@ -575,7 +629,7 @@ three.
 Slurmd is a multi-threaded daemon for managing user jobs and
 monitoring system state.
-Upon initiation it will read the /etc/SLURM.conf file, capture
+Upon initiation it will read the /etc/slurmd.conf file, capture
 system state, and await requests from the SLURM control daemon
 (slurmctld).
@@ -589,8 +643,8 @@ Differences in resource utilization values from process table
 snapshot to snapshot will be accumulated. Slurmd will
 ensure these accumulated values are not decremented if resource
 consumption for a user happens to decrease from snapshot to
-snapshot, which would simply reflect the termination of some
-processes.
+snapshot, which would simply reflect the termination of one or more
+processes.
 Both the memory high-water mark and the integral of memory
 consumption (e.g. megabyte-hours) will be recorded.
 Resource consumption will be grouped by user ID and
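A sketch of the accumulation rule just described follows: per-snapshot deltas are added only when positive, so a drop caused by process termination never decrements the running totals, while the memory high-water mark and usage integral are tracked alongside. Field and function names are assumptions for illustration.
\begin{verbatim}
struct usage {
    double cpu_seconds;   /* accumulated CPU time            */
    double mem_peak_mb;   /* memory high-water mark          */
    double mem_mb_hours;  /* integral of memory consumption  */
};

void account_snapshot(struct usage *u, double prev_cpu, double now_cpu,
                      double mem_mb, double interval_hours) {
    double delta = now_cpu - prev_cpu;
    if (delta > 0)                  /* never decrement on a decrease */
        u->cpu_seconds += delta;
    if (mem_mb > u->mem_peak_mb)    /* high-water mark */
        u->mem_peak_mb = mem_mb;
    u->mem_mb_hours += mem_mb * interval_hours; /* megabyte-hours */
}
\end{verbatim}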
@@ -703,47 +757,6 @@ Phase three will add Quadrics Elan3 switch support and overall documentation.
 Phase four rounds out SLURM with job accounting, fault-tolerance,
 and full integration with DPCS (Distributed Production Control System).
-\section{Costs}
-Very preliminary effort estimates are provided below. More research should
-still be performed to investigate the availability of open source code. More
-design work is also required to establish more accurate effort estimates.
-\begin{center}
-\begin{tabular}{|l|c|} \hline
-\multicolumn{2}{|c|}{\em I - Basic communication and node status} \\ \hline
-Communications Library & 1.0 FTE month \\
-Machine Status Collection & 1.0 FTE month \\
-Machine Status Manager & 1.0 FTE month \\
-Machine Status Tool & 0.5 FTE month \\
-{\em TOTAL Phase I} & {\em 3.5 FTE months} \\ \hline
-\multicolumn{2}{|c|}{\em II - Basic job initiation} \\ \hline
-Communications Library Enhancement & 1.0 FTE month \\
-Job Management Daemon & 1.0 FTE month \\
-Job Manager & 2.0 FTE months \\
-Partition Manager & 1.0 FTE month \\
-{\em TOTAL Phase II} & {\em 5.0 FTE months} \\ \hline
-\multicolumn{2}{|c|}{\em III - Switch support and documentation} \\ \hline
-Communications Library Security & 1.0 FTE month \\
-Job Status Daemon & 1.0 FTE month \\
-Basic Switch Daemon & 2.0 FTE months \\
-MPI Interface to SLURM & 2.0 FTE months \\
-Switch Health Monitor & 2.0 FTE months \\
-User and Admin Documentation & 1.0 FTE month \\
-DPCS uses SLURM Job Manager & 1.0 FTE month \\
-{\em TOTAL Phase III} & {\em 10.0 FTE months} \\ \hline
-\multicolumn{2}{|c|}{\em IV - Switch health, and DPCS on Linux} \\ \hline
-Job Accounting & 1.5 FTE months \\
-Fault-tolerant SLURM Managers & 3.0 FTE months \\
-Direct SLURM Switch Use (optional) & 2.0 FTE months \\
-DPCS uses SLURM Job Status & 1.5 FTE months \\
-DPCS Controller on Linux & 0.5 FTE months \\
-{\em TOTAL Phase IV} & {\em 8.5 FTE months} \\ \hline
-{\em GRAND TOTAL} & {\em 27.0 FTE months} \\ \hline
-\end{tabular}
-\end{center}
 \appendix
 \newpage