Commit 60627207
authored 22 years ago by Moe Jette
Moved remaining information from admin.guide into quick.start.guide.
parent 4dabf832
Showing 2 changed files with 13 additions and 697 deletions:
doc/html/admin.guide.html (0 additions, 696 deletions)
doc/html/quickstart.html (13 additions, 1 deletion)
doc/html/admin.guide.html deleted 100644 → 0 (696 deletions)
<html>
<head>
<title>
SLURM Administrator's Guide
</title>
</head>
<body>
<h1>
SLURM Administrator's Guide
</h1>
<h2>
Overview
</h2>
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters having
thousands of nodes. Components include machine status, partition
management, job management, scheduling and stream copy modules.
<h2>
Build Information
</h2>
TBD
Include PKI build instructions.
<h2>
Configuration
</h2>
There is a single SLURM configuration file containing:
overall SLURM options, node configurations, and partition configuration.
This file is located at "/etc/slurm.conf" by default.
The file location can be modified at system build time using the
DEFAULT_SLURM_CONF parameter.
The overall SLURM configuration options specify the control and backup
control machines.
The locations of daemons, state information storage, and other details
are specified at build time.
See the
<a
href=
"#Build"
>
Build Parameters
</a>
section for details.
The node configuration tells SLURM which nodes it is to manage as well as
their expected configuration.
The partition configuration permits you to define sets (or partitions)
of nodes and establish distinct job limits or access control for them.
Configuration information may be read or updated using SLURM APIs.
This configuration file or a copy of it must be accessible on every computer under
SLURM management.
<p>
The following parameters may be specified:
<dl>
<dt>
BackupController
<dd>
The name of the machine where SLURM control functions are to be
executed in the event that ControlMachine fails. This node
may also be used as a compute server if so desired. It will come into service
as a controller only upon the failure of ControlMachine and will revert
to a "standby" mode when the ControlMachine becomes available once again.
This should be a node name without the full domain name (e.g. "lx0002").
While not essential, it is highly recommended that you specify a backup controller.
<dt>
ControlMachine
<dd>
The name of the machine where SLURM control functions are executed.
This should be a node name without the full domain name (e.g. "lx0001").
This value must be specified.
<dt>
Epilog
<dd>
Fully qualified pathname of a program to execute as user root on every
node when a user's job completes (e.g. "/usr/local/slurm/epilog"). This may
be used to purge files, disable user login, etc. By default there is no epilog.
<dt>
FastSchedule
<dd>
If set to 1, then consider the configuration of each node to be that
specified in the configuration file. If set to 0, then base scheduling
decisions upon the actual configuration of each node. If the number of
node configuration entries in the configuration file is significantly
lower than the number of nodes, setting FastSchedule to 1 will permit
much faster scheduling decisions to be made. The default value is 1.
<dt>
FirstJobId
<dd>
The job id to be used for the first job submitted to SLURM without a
specific requested value. Job id values generated will be incremented by 1
for each subsequent job. This may be used to provide a meta-scheduler
with a job id space which is disjoint from the interactive jobs.
The use of node names containing a numeric suffix will provide faster
operation for larger clusters. The default value is 10.
<dt>
HashBase
<dd>
If the node names include a sequence number, this value defines the
base to be used in building a hash table based upon node name. Values of 8
and 10 are recognized for octal and decimal sequence numbers respectively.
The value of zero is also recognized for node names lacking a sequence number.
The default value is 10.
<dt>
HeartbeatInterval
<dd>
The interval, in seconds, at which the SLURM controller tests the
status of other daemons. The default value is 30 seconds.
<dt>
InactiveLimit
<dd>
The interval, in seconds, a job is permitted to be inactive (with
no active job steps) before it is terminated. This permits forgotten
jobs to be purged in a timely fashion without waiting for their time
limit to be reached. The default value is unlimited (zero).
<dt>
JobCredentialPrivateKey
<dd>
Fully qualified pathname of a file containing a private key used for
authentication by Slurm daemons.
<dt>
JobCredentialPublicCertificate
<dd>
Fully qualified pathname of a file containing a public key used for
authentication by Slurm daemons.
<dt>
KillWait
<dd>
The interval, in seconds, given to a job's processes between the
SIGTERM and SIGKILL signals upon reaching its time limit.
If the job fails to terminate gracefully
in the interval specified, it will be forcibly terminated. The default
value is 30 seconds.
<dt>
Prioritize
<dd>
Fully qualified pathname of a program to execute in order to establish
the initial priority of a newly submitted job. By default there is no
prioritization program and each job gets a priority lower than that of
any existing jobs.
<dt>
Prolog
<dd>
Fully qualified pathname of a program to execute as user root on every
node when a user's job begins execution (e.g. "/usr/local/slurm/prolog").
This may be used to purge files, enable user login, etc. By default there
is no prolog.
<dt>
ReturnToService
<dd>
If set to 1, then a DOWN node will become available for use
upon registration. The default value is 0, which
means that a node will remain DOWN until a system administrator explicitly
makes it available for use.
<dt>
SlurmctldPort
<dd>
The port number that the SLURM controller,
<i>
slurmctld
</i>
, listens
to for work. The default value is SLURMCTLD_PORT as established at system
build time.
<dt>
SlurmctldTimeout
<dd>
The interval, in seconds, that the backup controller waits for the
primary controller to respond before assuming control. The default value
is 300 seconds.
<dt>
SlurmdPort
<dd>
The port number that the SLURM compute node daemon,
<i>
slurmd
</i>
, listens
to for work. The default value is SLURMD_PORT as established at system
build time.
<dt>
SlurmdTimeout
<dd>
The interval, in seconds, that the SLURM controller waits for
<i>
slurmd
</i>
to respond before configuring that node's state to DOWN. The default value
is 300 seconds.
<dt>
StateSaveLocation
<dd>
Fully qualified pathname of a directory into which the slurm controller,
<i>
slurmctld
</i>
, saves its state (e.g. "/usr/local/slurm/checkpoint"). SLURM
state will be saved here to recover from system failures. The default value is "/tmp".
If any slurm daemons terminate abnormally, their core files will also be written
into this directory.
<dt>
TmpFS
<dd>
Fully qualified pathname of the file system available to user jobs for
temporary storage. This parameter is used in establishing a node's
<i>
TmpDisk
</i>
space. The default value is "/tmp".
</dl>
Any text after "#" until the end of the line in the configuration file
will be considered a comment.
If you need to use "#" in a value within the configuration file, precede
it with a backslash ("\").
The configuration file should contain a keyword followed by an
equal sign, followed by the value.
Keyword value pairs should be separated from each other by white space.
The field descriptor keywords are case sensitive.
The size of each line in the file is limited to 1024 characters.
A sample SLURM configuration file (without node or partition information)
follows.
<pre>
ControlMachine=lx0001 BackupController=lx0002
Epilog=/usr/local/slurm/epilog Prolog=/usr/local/slurm/prolog
FastSchedule=1
FirstJobId=65536
HashBase=10
HeartbeatInterval=60
InactiveLimit=120
KillWait=30
Prioritize=/usr/local/maui/priority
SlurmctldPort=7002 SlurmdPort=7003
SlurmctldTimeout=300 SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
TmpFS=/tmp
</pre>
<p>
The node configuration permits you to identify the nodes (or machines)
to be managed by Slurm. You may also identify the characteristics of the
nodes in the configuration file. Slurm operates in a heterogeneous environment
and users are able to specify resource requirements for each job.
Slurm is optimized for scheduling systems in which the number of
unique configurations is small. It is recommended that the system
node configuration be provided in a minimal number of entries.
In many systems, this may be accomplished in only a few lines.
The node configuration specifies the following information:
<dl>
<dt>
NodeName
<dd>
Name of a node as returned by hostname (e.g. "lx0012").
<a
name=
"NodeExp"
>
A simple regular expression may optionally
be used to specify ranges
of nodes to avoid building a configuration file with large numbers
of entries. The regular expression can contain one
pair of square brackets with a sequence of comma separated
numbers and/or ranges of numbers separated by a "-"
(e.g. "linux[0-64,128]", or "lx[15,18,32-33]").
</a>
If the NodeName is "DEFAULT", the values specified
with that record will apply to subsequent node specifications
unless explicitly set to other values in that node record or
replaced with a different set of default values.
For architectures in which the node order is significant,
nodes will be considered consecutive in the order defined.
For example, if the configuration for NodeName=charlie immediately
follows the configuration for NodeName=baker they will be
considered adjacent in the computer.
<dt>
Feature
<dd>
A comma delimited list of arbitrary strings indicative of some
characteristic associated with the node.
There is no value associated with a feature at this time; a node
either has a feature or it does not.
If desired a feature may contain a numeric component indicating,
for example, processor speed.
By default a node has no features.
<dt>
RealMemory
<dd>
Size of real memory on the node in MegaBytes (e.g. "2048").
The default value is 1.
<dt>
Procs
<dd>
Number of processors on the node (e.g. "2").
The default value is 1.
<dt>
State
<dd>
State of the node with respect to the initiation of user jobs.
Acceptable values are "DOWN", "UNKNOWN", "IDLE", and "DRAINING".
The
<a
href=
"#NodeStates"
>
node states
</a>
are fully described below.
The default value is "UNKNOWN".
<dt>
TmpDisk
<dd>
Total size of temporary disk storage in TmpFS in MegaBytes
(e.g. "16384"). TmpFS (for "Temporary File System")
identifies the location which jobs should use for temporary storage.
Note this does not indicate the amount of free
space available to the user on the node, only the total file
system size. The system administrator should ensure this file
system is purged as needed so that user jobs have access to
most of this space.
The Prolog and/or Epilog programs (specified in the configuration file)
might be used to ensure the file system is kept clean.
The default value is 1.
<dt>
Weight
<dd>
The priority of the node for scheduling purposes.
All things being equal, jobs will be allocated the nodes with
the lowest weight which satisfies their requirements.
For example, a heterogeneous collection of nodes might
be placed into a single partition for greater system
utilization, responsiveness and capability. It would be
preferable to allocate smaller memory nodes rather than larger
memory nodes if either will satisfy a job's requirements.
The units of weight are arbitrary, but larger weights
should be assigned to nodes with more processors, memory,
disk space, higher processor speed, etc.
Weight is an integer value with a default value of 1.
</dl>
<p>
Only the NodeName must be supplied in the configuration file; all other
items are optional.
It is advisable to establish baseline node configurations in
the configuration file, especially if the cluster is heterogeneous.
Nodes which register to the system with less than the configured resources
(e.g. too little memory) will be placed in the "DOWN" state to
avoid scheduling jobs on them.
Establishing baseline configurations will also speed SLURM's
scheduling process by permitting it to compare job requirements
against these (relatively few) configuration parameters and
possibly avoid having to check job requirements
against every individual node's configuration.
The resources checked at node registration time are: Procs,
RealMemory and TmpDisk.
While baseline values for each of these can be established
in the configuration file, the actual values upon node
registration are recorded and these actual values may be
used for scheduling purposes (depending upon the value of
<i>
FastSchedule
</i>
in the configuration file).
Default values can be specified with a record in which
"NodeName" is "DEFAULT".
The default entry values will apply only to lines following it in the
configuration file and the default values can be reset multiple times
in the configuration file with multiple entries where "NodeName=DEFAULT".
The "NodeName=" specification must be placed on every line
describing the configuration of nodes.
In fact, it is generally possible and desirable to define the
configurations of all nodes in only a few lines.
This convention permits significant optimization in the scheduling
of larger clusters.
The field descriptors above are case sensitive.
In order to support the concept of jobs requiring consecutive nodes
on some architectures,
node specifications should be placed in this file in consecutive order.
The size of each line in the file is limited to 1024 characters.
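<p>
For illustration, the "NodeName=DEFAULT" record described above may be reset
part way through the node list; the node names and values below are
hypothetical, in the style of the sample that follows later:
<pre>
NodeName=DEFAULT Procs=16 RealMemory=2048 TmpDisk=16384
NodeName=lx[0001-4000] Weight=16
NodeName=DEFAULT Procs=32 RealMemory=4096 TmpDisk=16384
NodeName=lx[4001-8000] Weight=40
</pre>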
<p>
<a
name=
"NodeStates"
>
The node states have the following meanings:
</a>
<dl>
<dt>
BUSY
<dd>
The node has been allocated work (one or more user jobs).
<dt>
DOWN
<dd>
The node is unavailable for use. It has been explicitly configured
DOWN or failed to respond to system state inquiries or has
explicitly removed itself from service due to a failure. This state
typically indicates some problem requiring administrator intervention.
<dt>
DRAINED
<dd>
The node is idle, but not available for use. The state of a node
will automatically change from DRAINING to DRAINED when user job(s) executing
on that node terminate. Since this state is entered by explicit
administrator request, additional SLURM administrator intervention is typically
not required.
<dt>
DRAINING
<dd>
The node has been made unavailable for new work by explicit administrator
intervention. It is processing some work at present and will enter state
"DRAINED" when that work has been completed. This might be used to
prepare some nodes for maintenance work.
<dt>
IDLE
<dd>
The node is idle and available for use.
<dt>
UNKNOWN
<dd>
Default initial node state upon startup of SLURM.
An attempt will be made to contact the node and acquire current state information.
</dl>
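<p>
For illustration, an administrator might use the scontrol tool (described
below) to explicitly change a node's state, for example returning a DOWN
node to service after repair (the node name here is hypothetical):
<pre>
scontrol: update NodeName=lx0123 State=IDLE
</pre>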
<p>
SLURM uses a hash table in order to locate table entries rapidly.
Each table entry can be directly accessed without any searching
if the name contains a sequence number suffix. The value of
<i>
HashBase
</i>
in the configuration file specifies the hashing algorithm.
Possible values are "10" and "8" for names containing
decimal and octal sequence numbers respectively
or "0" which processes mixed alpha-numeric without sequence numbers.
The default value of
<i>
HashBase
</i>
is "10".
If you use a naming convention lacking a sequence number, it may be
desirable to review the hashing function
<i>
hash_index
</i>
in the
node_mgr.c module. This is especially important in clusters having
large numbers of nodes. The sequence numbers can start at any
desired number, but should contain consecutive numbers. The
sequence number portion may contain leading zeros for a consistent
name length, if so desired. Note that correct operation
will be provided with any node names, but performance will suffer
without this optimization.
A sample SLURM configuration file (node information only) follows.
<pre>
#
# Node Configurations
#
NodeName=DEFAULT TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz,VizTools
</pre>
<p>
The partition configuration permits you to establish different job
limits or access controls for various groups (or partitions) of nodes.
Nodes may be in only one partition. The partition configuration
file contains the following information:
<dl>
<dt>
AllowGroups
<dd>
Comma separated list of group IDs which may use the partition.
If at least one group associated with the user submitting the
job is in AllowGroups, he will be permitted to use this partition.
The default value is "ALL".
<dt>
Default
<dd>
If this keyword is set, jobs submitted without a partition
specification will utilize this partition.
Possible values are "YES" and "NO".
The default value is "NO".
<dt>
RootOnly
<dd>
Specifies if only user ID zero (or user
<i>
root
</i>
) may
initiate jobs in this partition.
Possible values are "YES" and "NO".
The default value is "NO".
<dt>
MaxNodes
<dd>
Maximum count of nodes which may be allocated to any single job.
The default value is "UNLIMITED", which is represented internally as -1.
<dt>
MaxTime
<dd>
Maximum wall-time limit for any job in minutes. The default
value is "UNLIMITED", which is represented internally as -1.
<dt>
Nodes
<dd>
Comma separated list of nodes which are associated with this
partition. Node names may be specified using the
<a
href=
"#NodeExp"
>
regular expression syntax
</a>
described above. A blank list of nodes
(i.e. "Nodes= ") can be used if one wants a partition to exist,
but have no resources (possibly on a temporary basis).
<dt>
PartitionName
<dd>
Name by which the partition may be referenced (e.g. "Interactive").
This name can be specified by users when submitting jobs.
<dt>
Shared
<dd>
Ability of the partition to execute more than one job at a
time on each node. Shared nodes will offer unpredictable performance
for application programs, but can provide higher system utilization
and responsiveness than otherwise possible.
Possible values are "FORCE", "YES", and "NO".
The default value is "NO".
<dt>
State
<dd>
State of partition or availability for use. Possible values
are "UP" or "DOWN". The default value is "UP".
</dl>
<p>
Only the PartitionName must be supplied in the configuration file.
Other parameters will assume default values if not specified.
The default values can be specified with a record in which
"PartitionName" is "DEFAULT" if non-standard default values are desired.
The default entry values will apply only to lines following it in the
configuration file and the default values can be reset multiple times
in the configuration file with multiple entries where "PartitionName=DEFAULT".
The configuration of each partition should be specified on a single line.
The field descriptors above are case sensitive.
The size of each line in the file is limited to 1024 characters.
A sample SLURM configuration file (partition information only) follows.
<p>
A single job may be allocated nodes from only one partition and
must satisfy the configuration specifications for that partition.
The job may specify a particular PartitionName, if so desired,
or use the system's default partition.
<pre>
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096 RootOnly=YES
</pre>
<p>
APIs and an administrative tool can be used to alter the SLURM
configuration in real time.
When the SLURM controller restarts, its state will be restored
to that at the time it terminated unless the SLURM configuration
file is newer, in which case the configuration will be rebuilt
from that file.
State information not incorporated in the configuration file,
such as job state, will be preserved.
A
<a
href=
"#SampleConfig"
>
SLURM configuration file
</a>
is included
at the end of this document.
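<p>
For example, after a change to partition limits in the configuration file,
the running controller can be told to re-read that file with the scontrol
reconfigure command described below (an illustrative session; the edited
file contents are not shown):
<pre>
# scontrol
scontrol: reconfigure
scontrol: quit
</pre>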
<h3>
Job Configuration
</h3>
The job configuration format specified below is used by the
scontrol administration tool to modify job state information:
<dl>
<dt>
Contiguous
<dd>
Determines whether the nodes allocated to the job must be contiguous.
Acceptable values are "YES" and "NO" with the default being "NO".
<dt>
Features
<dd>
Required features of nodes to be allocated to this job.
Features may be combined using "|" for OR, "&" for AND,
and square brackets.
For example, "Features=1000MHz|1200MHz&CoolTool".
The feature list is processed left to right except for
the grouping by brackets.
Square brackets are used to identify alternate features,
but ones that must apply to every node allocated to the job.
For example, some clusters are configured with more than
one parallel file system. These parallel file systems
may be accessible only to a subset of the nodes in a cluster.
The application may not care which parallel file system
is used, but all nodes allocated to it must be in the
subset of nodes accessing a single parallel file system.
This might be specified with a specification of
"Features=[PFS1|PFS2|PFS3|PFS4]".
<dt>
JobName
<dd>
Name to be associated with the job.
<dt>
JobId
<dd>
Identification for the job, a sequence number.
<dt>
MinMemory
<dd>
Minimum number of megabytes of real memory per node.
<dt>
MinProcs
<dd>
Minimum number of processors per node.
<dt>
MinTmpDisk
<dd>
Minimum number of megabytes of temporary disk storage per node.
<dt>
ReqNodes
<dd>
The total number of nodes required to execute this job.
<dt>
ReqNodeList
<dd>
A comma separated list of nodes to be allocated to the job.
The nodes may be specified using regular expressions (e.g.
"lx[0010-0020,0033-0040]" or "baker,charlie,delta").
<dt>
ReqProcs
<dd>
The total number of processors required to execute this job.
<dt>
Partition
<dd>
Name of the partition in which this job should execute.
<dt>
Priority
<dd>
Integer priority of the pending job. The value may
be specified by user root initiated jobs, otherwise SLURM will
select a value. Generally, higher priority jobs will be initiated
before lower priority jobs.
<dt>
Shared
<dd>
Job can share nodes with other jobs. Possible values are YES and NO.
<dt>
State
<dd>
State of the job. Possible values are "PENDING", "STARTING",
"RUNNING", and "ENDING".
<dt>
TimeLimit
<dd>
Maximum wall-time limit for the job in minutes. An "UNLIMITED"
value is represented internally as -1.
</dl>
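<p>
As an illustration of this format, a pending job's limits might be adjusted
through the scontrol update command described below (the job id and values
here are hypothetical):
<pre>
scontrol: update job ID=1236 TimeLimit=120 Priority=200
scontrol: show job 1236
</pre>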
<a
name=
"Build"
><h2>
Build Parameters
</h2></a>
The following configuration parameters are established at SLURM build time.
State and configuration information may be read or updated using SLURM APIs.
<dl>
<dt>
SLURMCTLD_PATH
<dd>
The fully qualified pathname of the file containing the SLURM daemon
to execute on the ControlMachine,
<i>
slurmctld
</i>
. The default value is "/usr/local/slurm/bin/slurmctld".
This file must be accessible to the ControlMachine and BackupController.
<dt>
SLURMD_PATH
<dd>
The fully qualified pathname of the file containing the SLURM daemon
to execute on every compute server node. The default value is "/usr/local/slurm/bin/slurmd".
This file must be accessible to every SLURM compute server.
<dt>
SLURM_CONF
<dd>
The fully qualified pathname of the SLURM configuration
file. The default value is "/etc/SLURM.conf".
<dt>
SLURMCTLD_PORT
<dd>
The port number that the SLURM controller,
<i>
slurmctld
</i>
, listens
to for work.
<dt>
SLURMD_PORT
<dd>
The port number that the SLURM compute node daemon,
<i>
slurmd
</i>
, listens
to for work.
</dl>
<h2>
scontrol Administration Tool
</h2>
The tool you will primarily use in the administration of SLURM is scontrol.
It provides the means of viewing and updating node and partition
configurations. It can also be used to update some job state
information. You can execute scontrol with a single keyword on
the command line, or it will query you for input and process those
keywords on an interactive basis. The scontrol keywords are shown below.
A
<a
href=
"#SampleAdmin"
>
sample scontrol session
</a>
with examples is appended.
<p>
Usage: scontrol [-q | -v] [&lt;keyword&gt;]
<br>
-q is equivalent to the "quiet" keyword
<br>
-v is equivalent to the "verbose" keyword
<br>
<dl>
<dt>
abort
<dd>
Cause
<i>
slurmctld
</i>
to terminate immediately and generate a core file.
<dt>
exit
<dd>
Terminate scontrol.
<dt>
help
<dd>
Display this list of scontrol commands and options.
<dt>
quiet
<dd>
Print no messages other than error messages.
<dt>
quit
<dd>
Terminate scontrol.
<dt>
reconfigure
<dd>
The SLURM control daemon re-reads its configuration files.
<dt>
show &lt;entity&gt; [&lt;ID&gt;]
<dd>
Show the configuration for a given entity. Entity must
be "config", "job", "node", "partition" or "step" for SLURM
configuration parameters, job, node, partition, and job step
information respectively.
By default, state information for all records is reported.
If you only wish to see the state of one entity record,
specify either its ID number (assumed if entirely numeric)
or its name.
<a
href=
"#NodeExp"
>
Regular expressions
</a>
may
be used to identify node names.
<dt>
shutdown
<dd>
Cause
<i>
slurmctld
</i>
to save state and terminate.
<dt>
update &lt;options&gt;
<dd>
Update the configuration information.
Options are of the same format as the configuration file
and the output of the
<i>
scontrol show
</i>
command.
Not all configuration information can be modified using
this mechanism; for example, the configuration of a node
cannot be changed after it has registered (only a node's state can be modified).
One can always modify the SLURM configuration file and
use the reconfigure command to rebuild all controller
information if required.
This command can only be issued by user
<b>
root
</b>
.
<dt>
verbose
<dd>
Enable detailed logging of scontrol execution state information.
<dt>
version
<dd>
Display the scontrol tool version number.
</dl>
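<p>
For illustration, the show keyword can be combined with the regular
expression syntax described earlier to report on a range of nodes
(the names below are hypothetical):
<pre>
scontrol: show node lx[0010-0020]
scontrol: show partition debug
</pre>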
<h2>
Miscellaneous
</h2>
There is no necessity for synchronized clocks on the nodes.
Events occur either in real-time based upon message traffic
or based upon changes in the time on a node. However, synchronized
clocks will permit easier analysis of SLURM logs from multiple
nodes.
<p>
SLURM uses the syslog function to record events. It uses a
range of importance levels for these messages. Be certain
that your system's syslog functionality is operational.
<a
name=
"SampleConfig"
><h2>
Sample Configuration File
</h2></a>
<pre>
#
# Sample /etc/slurm.conf
# Author: John Doe
# Date: 11/06/2001
#
ControlMachine=lx0001 BackupController=lx0002
Epilog="" Prolog=""
FastSchedule=1
FirstJobId=65536
HashBase=10
HeartbeatInterval=60
InactiveLimit=120
KillWait=30
Prioritize=/usr/local/maui/priority
SlurmctldPort=7002 SlurmdPort=7003
SlurmctldTimeout=300 SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096 RootOnly=YES
</pre>
<a
name=
"SampleAdmin"
><h2>
Sample scontrol Execution
</h2></a>
<pre>
Remove node lx0030 from service, removing jobs as needed:
# scontrol
scontrol: update NodeName=lx0030 State=DRAINING
scontrol: show job
ID=1234 Name=Simulation MaxTime=100 Nodes=lx[0029-0030] State=RUNNING User=smith
ID=1235 Name=MyBigTest MaxTime=100 Nodes=lx0020,lx0023 State=RUNNING User=smith
scontrol: update job ID=1234 State=ENDING
scontrol: show job 1234
Job 1234 not found
scontrol: show node lx0030
Name=lx0030 Partition=class State=DRAINED Procs=16 RealMemory=2048 TmpDisk=16384
scontrol: quit
</pre>
<hr>
URL = http://www-lc.llnl.gov/dctg-lc/slurm/admin.guide.html
<p>
Last Modified October 22, 2002
</p>
<address>
Maintained by
<a
href=
"mailto:slurm-dev@lists.llnl.gov"
>
slurm-dev@lists.llnl.gov
</a></address>
</body>
</html>
doc/html/quickstart.html (13 additions, 1 deletion)
...
...
@@ -100,6 +100,8 @@ it is restored to service. The controller saves its state to disk
whenever there is a change. This state can be recovered by the controller
at startup time.
<b>
slurmctld
</b>
would typically execute as a
special user specifically for this purpose (not user root).
State changes are saved so that jobs and other state can be
preserved when slurmctld moves or is restarted.
<p>
The
<b>
slurmd
</b>
daemon executes on every compute node.
It resembles a remote shell daemon to export control to SLURM.
...
...
@@ -199,12 +201,22 @@ The remaining information provides basic SLURM administration information.
Individuals only interested in making use of SLURM need not read
further.
<h3>
Authentication
</h3>
<h3>
Infrastructure
</h3>
All communications between SLURM components are authenticated.
The authentication infrastructure used is specified in the SLURM
configuration file and options include:
<a
href=
"http://www.theether.org/authd/"
>
authd
</a>
, munged and none.
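<p>
A minimal sketch of how such a selection might appear in the SLURM
configuration file; the parameter name AuthType and the plugin string below
are assumptions, not taken from this document:
<pre>
# hypothetical authentication selection in /etc/slurm.conf
AuthType=auth/authd
</pre>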
<p>
SLURM uses the syslog function to record events. It uses a
range of importance levels for these messages. Be certain
that your system's syslog functionality is operational.
<p>
There is no necessity for synchronized clocks on the nodes.
Events occur in real-time based upon message traffic.
However, synchronized clocks will permit easier analysis of
SLURM logs from multiple nodes.
<h3>
Configuration
</h3>
...
...