tud-zih-energy / Slurm
Commit 953b8cac, authored 13 years ago by Morris Jette

Major clean-up of bluegene web page

Fix typos, punctuation problems, grammar, formatting, etc. Minor changes to content.

Parent: edbf5513

Showing 1 changed file with 124 additions and 121 deletions:
doc/html/bluegene.shtml (+124 −121), viewed at 953b8cac
@@ -41,9 +41,10 @@ to represent multiples of 1024 or "m" for multiples of 1,048,576 (1024 x 1024).
 For example, "2k" is equivalent to "2048".</p>
 <p>If you are running a system that is smaller than 1 midplane (a
-nodecard/nodeboard or such you can set your system up like this in
-your bluegene.conf. Below is an example on a Q system.
+nodecard/nodeboard or such) you can configure your system up like
+this in the bluegene.conf file. Below is an example for a BlueGene/Q system:</p>
 <pre>
+# Excerpt from bluegene.conf file for BlueGene/Q system
 ...
 BasePartitionNodeCnt=512
 NodeCardNodeCnt=32
@@ -52,53 +53,52 @@ LayoutMode=STATIC
 MPs=0000 type=small 32cnblocks=16
 ...
 </pre>
-This will create a small block on each nodeboard on the system. If your
-system is different than this adjust appropriately. The idea is SLURM
+<p>This will create a small block on each nodeboard on the system. If your
+system is different than this, adjust appropriately. The idea is SLURM
 will create the smallest block possible on every possible hardware
 location. The system will then check for missing hardware and remove
-the blocks are are invaild. This will get around the problem if you
-have for instance the 4th nodeboard populated instead of the 1st.
+blocks that are invaild. This will get around the problem if you
+have, for instance, the 4th nodeboard populated instead of the 1st.
 </p>
 <h2>User Tools</h2>
-<p>The normal set of SLURM user tools: sbatch, scancel, sinfo, squeue, and
-scontrol provide all of the expected services except support for job steps,
-which is detailed later.
-<ul>
-Seven sbatch options are available:
+<p>The normal set of SLURM user tools: <i>sbatch</i>, <i>scancel</i>,
+<i>sinfo</i>, <i>squeue</i>, and <i>scontrol</i> provide all of the expected
+services except support for job steps,
+which is detailed later.</p>
+<p>Seven job submission options are available
+exclusively on BlueGene systems:</p>
 <table>
 <tr VALIGN=TOP><td><i>--geometry</i></td><td>Specify job size in each dimension,
 (i.e. 1x4x4 = 16 nodes)</td></tr>
 <tr VALIGN=TOP><td><i>--no-rotate</i></td><td>Disable rotation of geometry, by default
-1x4x4 could be manipulated to be 4x1x4)</td>
+1x4x4 could be rotated to be 4x1x4)</td>
 <tr VALIGN=TOP><td><i>--conn-type</i></td><td>Specify interconnect
-type between midplanes, mesh or torus, on BlueGene/Q you can
+type between midplanes, mesh or torus. On BlueGene/Q systems you can
 specify a different conn-type for each dimension, TTMT would
-give you Torus in all dimensions except Y where it would be
-Mesh.</td></tr>
-<tr VALIGN=TOP><td><i>--blrts-image</i></td><td>(BGL only) Specify alternative
+give you Torus in all dimensions except the Y dimension, where
+it would be Mesh.</td></tr>
+<tr VALIGN=TOP><td><i>--blrts-image</i></td><td>(BlueGene/L systems only)
+Specify alternative
 blrts image for bluegene block. Default if not set.</td></tr>
-<tr VALIGN=TOP><td><i>--cnload-image</i></td><td>(BGP only) Specify
+<tr VALIGN=TOP><td><i>--cnload-image</i></td><td>(BlueGene/P systems only) Specify
 alternative c-node image for bluegene block. Default if not set.</td></tr>
-<tr VALIGN=TOP><td><i>--ioload-image</i></td><td>(BGP only) Specify
+<tr VALIGN=TOP><td><i>--ioload-image</i></td><td>(BlueGene/P systems only) Specify
 alternative io image for bluegene block. Default if not set.</td></tr>
-<tr VALIGN=TOP><td><i>--linux-image</i></td><td>(BGL only) Specify alternative
+<tr VALIGN=TOP><td><i>--linux-image</i></td><td>(BlueGene/L systems only)
+Specify alternative
 linux image for bluegene block. Default if not set.</td></tr>
 <tr VALIGN=TOP><td><i>--mloader-image</i></td><td>Specify
 alternative mloader image for bluegene block. Default if not set.</td></tr>
-<tr VALIGN=TOP><td><i>--ramdisk-image</i></td><td>(BGPL only) Specify
+<tr VALIGN=TOP><td><i>--ramdisk-image</i></td><td>(BlueGene/L or P systems only)
+Specify
 alternative ramdisk image for bluegene block. Default if not set.</td></tr>
 </table>
-The <i>--nodes</i> option with a minimum and (optionally) maximum node count continues
-to be available.
+<p>The <i>--nodes</i> option with a minimum and (optionally) maximum node count
+continues to be available.
 Note that this is a c-node count.</p>
 <h3>Task Launch on BlueGene/Q only</h3>
-<p>Use SLURM's srun command to launch tasks (srun is a wrapper for IBM's
+<p>Use SLURM's <i>srun</i> command to launch tasks (<i>srun</i> is a wrapper for IBM's
 <i>runjob</i> command.
 SLURM job step information, including accounting, functions as expected.</p>
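Reviewer note: the --geometry and --no-rotate rows in the option table of this hunk can be illustrated with a small sketch. It parses a geometry string, multiplies the dimensions to get the job size, and lists the orientations rotation could produce. It assumes rotation may reorder the dimensions freely, which the page does not state. The names `parse_geometry`, `node_count`, and `rotations` are hypothetical and not part of SLURM.

```python
from itertools import permutations
from math import prod


def parse_geometry(spec):
    """Split a geometry string such as "1x4x4" into a tuple of ints."""
    dims = tuple(int(part) for part in spec.split("x"))
    if any(d < 1 for d in dims):
        raise ValueError(f"dimensions must be positive: {spec!r}")
    return dims


def node_count(dims):
    """Job size is the product of the per-dimension sizes."""
    return prod(dims)


def rotations(dims):
    """Distinct reorderings of the dimensions (assumed rotation model)."""
    return sorted(set(permutations(dims)))


if __name__ == "__main__":
    geometry = parse_geometry("1x4x4")
    print(node_count(geometry))   # 16, matching "1x4x4 = 16 nodes"
    print(rotations(geometry))    # includes (4, 1, 4), the example in the table
```

Under this assumption, passing --no-rotate would correspond to keeping only the first orientation the user wrote.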
@@ -107,12 +107,13 @@ SLURM job step information, including accounting, functions as expected.</p>
 <p>SLURM performs resource allocation for the job, but initiation of tasks is
 performed using the <i>mpirun</i> command. SLURM has no concept of a job step
 on BlueGene/L or BlueGene/P systems.
-To reiterate: salloc or sbatch are used to create a job allocation, but
-<i>mpirun</i> is used to launch the parallel tasks.
+To reiterate: <u><i>salloc</i> or <i>sbatch</i> are used to create a job allocation, but
+<i>mpirun</i> is used to launch the parallel tasks.</u>
 The script that you submit to SLURM can contain multiple invocations of mpirun
 as well as any desired commands for pre- and post-processing.
 The mpirun command will get its <i>bgblock</i> information from the
-<i>MPIRUN_PARTITION</i> as set by SLURM. A sample script is shown below.</p>
+<i>MPIRUN_PARTITION</i> environment variable as set by SLURM. A sample script
+is shown below.</p>
 <pre>
 #!/bin/bash
 # pre-processing
@@ -139,10 +140,10 @@ bgp630, bgp631, bgp720, bgp721, bgp730 and bgp731).</p>
 <p><b>IMPORTANT:</b> SLURM can support up to 36 elements in each
 BlueGene dimension by supporting "A-Z" as valid numbers. SLURM requires the
 prefix to be lower case and any letters in the suffix must always be upper
-case. This schema must be used in both the slurm.conf and bluegene.conf
+case. This schema must be used in both the <i>slurm.conf</i> and bluegene.conf
 configuration files when specifying midplane/node names (the prefix is
 optional). This schema should also be used to specify midplanes or locations
-in configure mode of smap:
+in configure mode of <i>smap</i>:
 <br>
 valid: bgl[000xC44], bgl000, bglZZZ
 <br>
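Reviewer note: the naming rule in this hunk (optional lower-case prefix, coordinates drawn from "0"-"9" and "A"-"Z", giving 36 values per dimension) can be checked mechanically. The sketch below is only an illustration. It assumes three coordinate characters per name, as in the examples quoted in the page, and handles only the single-name and bracketed-range forms shown there. `is_valid_name` and `coord_to_ints` are hypothetical helpers, not SLURM functions.

```python
import re

# Assumption: three coordinate characters (one per dimension), as in the
# examples quoted in the page; other systems may use a different count.
COORD = r"[0-9A-Z]{3}"
# Optional lower-case prefix, then one coordinate or a bracketed range.
NAME_RE = re.compile(rf"[a-z]*(?:{COORD}|\[{COORD}x{COORD}\])")


def is_valid_name(name):
    """Return True if the whole name follows the schema described above."""
    return NAME_RE.fullmatch(name) is not None


def coord_to_ints(coord):
    """Map each base-36 character ("0"-"9", "A"-"Z") to a value in 0..35."""
    return tuple(int(ch, 36) for ch in coord)


if __name__ == "__main__":
    for name in ["bgl[000xC44]", "bgl000", "bglZZZ",
                 "BGL[000xC44]", "BglC00", "bglb00", "Bglzzz"]:
        print(name, is_valid_name(name))
    print(coord_to_ints("C44"))
```

The first three names are the page's valid examples and the last four are its invalid ones; the sketch accepts and rejects them accordingly.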
@@ -150,7 +151,7 @@ invalid: BGL[000xC44], BglC00, bglb00, Bglzzz
 </p>
 <p>In a system configured with <i>small blocks</i> (any block less
-than a full midplane) there will be divisions in the midplane
+than a full midplane), there will be divisions in the midplane
 notation. On BlueGene/L and BlueGene/P systems, the midplane name may
 be followed by a square bracket enclosing ID numbers of the IO nodes associated
 with the block. For example, if there are 64 psets in a BlueGene/L
@@ -166,7 +167,7 @@ one in each of the five dimensions.</p>
 <p>Two topology-aware graphical user interfaces are provided: <i>smap</i> and
 <i>sview</i> (<i>sview</i> provides more viewing and configuring options).
 See each command's man page for details.
-A sample of smap output is provided below showing the location of five jobs.
+A sample of <i>smap</i> output is provided below showing the location of five jobs.
 Note the format of the list of midplanes allocated to each job.
 Also note that idle (unassigned) midplanes are indicated by a period.
 Down and drained midplanes (those not available for use) are
@@ -210,7 +211,7 @@ You can identify the bgblock associated with your job using the command
 <i>smap -Db -c</i>.
 The time to boot a bgblock is related to its size, but should range from
 from a few minutes to about 15 minutes for a bgblock containing 128
-midplanes (BGL).
+midplanes (on a BlueGene/L system).
 Only after the bgblock is READY will your job's output file be created
 and the script execution begin.
 If the bgblock boot fails, SLURM will attempt to reboot several times (3)
@@ -223,10 +224,10 @@ five minutes.
 In summary, your job may appear in SLURM as RUNNING for 15 minutes
 before the script actually begins to 5 minutes after it completes.
 These delays are the result of the BlueGene infrastructure issues and are
-not due to anything in SLURM. In later BlueGene infrastructures P/Q
-these times have gotten much better.</p>
-<p>When using smap in default output mode you can scroll through
+not due to anything in SLURM. These times have improved considerably on the
+more recent BlueGene/P and BlueGene/Q systems.</p>
+<p>When using <i>smap</i> in default output mode you can scroll through
 the different windows using the arrow keys.
 The <b>up</b> and <b>down</b> arrow keys scroll
 the window containing the grid, and the <b>left</b> and <b>right</b> arrow
@@ -236,26 +237,27 @@ keys scroll the window containing the text information.</p>
 <h2>System Administration for BlueGene/Q only</h2>
-<p>In order to make srun work correctly with the underlying system
-and to ensure security for new mpi jobs running on your system you
-will need to enable the SLURM plugin for the IBM runjob_mux. This can
+<p>In order to make <i>srun</i> operate correctly with the underlying system
+and to ensure security for new MPI jobs, it is necessary to enable the
+SLURM plugin for the IBM runjob_mux. This can
 be done by altering the bg.properties file. In the [runjob.mux]
 section of the bg.properties file change the plugin option to
-$prefix/lib/slurm/runjob_plugin.so and also set the plugin_flags
-option to 0x0101 (RTLD_LAZY | RTLD_GLOBAL) which allows the
+<i>$prefix/lib/slurm/runjob_plugin.so</i> and also set the plugin_flags
+option to <i>0x0101</i> (RTLD_LAZY | RTLD_GLOBAL) which allows the
 forwarding of symbols to shared objects like SLURM uses for plugins.
+</p>
 <pre>
 [runjob.mux]
 ...
 plugin = /usr/lib64/slurm/runjob_plugin.so
-# Path to the plugin used for communicating with a job scheduler.
-# This value can be updated by the runjob_mux_refresh_config command on the
+# Path to the plugin used for communicating with a
+# job scheduler. This value can be updated by the
+# runjob_mux_refresh_config command on the
 # Login Node where a runjob_mux process runs.
 ...
 plugin_flags = 0x0101 # RTLD_LAZY | RTLD_GLOBAL
 </pre>
-After these settings are set (re)start each runjob_mux running on your
+<p>After these settings are set (re)start each runjob_mux running on your
 system.</p>
 <p>When a new version of SLURM is installed it is a wise idea to "refresh" the
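Reviewer note: the plugin_flags value 0x0101 in this hunk is the bitwise OR of the two dlopen flags named in the comment. The arithmetic is spelled out below as a sketch. The constant values are an assumption taken from the Linux/glibc `<dlfcn.h>` header and can differ on other platforms.

```python
# Assumption: Linux/glibc values from <dlfcn.h>; other platforms differ.
RTLD_LAZY = 0x0001    # resolve symbols lazily, on first use
RTLD_GLOBAL = 0x0100  # make the plugin's symbols visible to later-loaded objects

# Combine the two flags the same way the bg.properties comment describes.
plugin_flags = RTLD_LAZY | RTLD_GLOBAL
print(f"plugin_flags = {plugin_flags:#06x}")
```

With these values the script prints `plugin_flags = 0x0101`, which is the setting shown in the [runjob.mux] excerpt.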
@@ -273,7 +275,7 @@ when finishing not being known. This is expected and can usually be ignored.
 <i>configure</i> program locating some expected files.
 In particular for a BlueGene/L system, the configure script searches
 for <i>libdb2.so</i> in the
-directories <i>/bgl/BlueLight/ppcfloor/bglsys</i> <i>/opt/IBM/db2/V8.1</i>
+directories <i>/bgl/BlueLight/ppcfloor/bglsys</i>, <i>/opt/IBM/db2/V8.1</i>
 <i>/home/bgdb2cli/sqllib</i> and <i>/u/bgdb2cli/sqllib</i>. If your
 DB2 library file is in a different location, use the configure
 option <i>--with-db2-dir=PATH</i> to specify the parent directory.
@@ -281,25 +283,25 @@ This option does not apply to any other BlueGene arch.
 If you have the same version of the operating system on both the
 Service Node (SN) and the Front End Nodes (FEN) then you can configure
 and build one set of files on the SN and install them on both the SN and FEN.
-Note that all smap functionality will be provided on the FEN
+Note that all <i>smap</i> functionality will be provided on the FEN
 except for the ability to map SLURM node names to and from
 row/rack/midplane data, which requires direct use of the Bridge API
-calls only available on the SN.</p>
-<p>The slurmctld daemon should execute on the system's service node.
+calls only available on the Service Node.</p>
+<p>The <i>slurmctld</i> daemon should execute on the system's service node.
 If an optional backup daemon is used, it must be in some location where
 it is capable of executing Bridge APIs.
-The slurmd daemons executes the user scripts and there must be at least one
+The <i>slurmd</i> daemons executes the user scripts and there must be at least one
 front end node configured for this purpose. Multiple front end nodes may be
-configured for slurmd use to improve performance and fault tolerance.
-Each slurmd can execute jobs for every midplane and the work will be
-distributed among the slurmd daemons to balance the workload.
-You can use the scontrol command to drain individual compute nodes as desired
+configured for <i>slurmd</i> use to improve performance and fault tolerance.
+Each <i>slurmd</i> can execute jobs for every midplane and the work will be
+distributed among the <i>slurmd</i> daemons to balance the workload.
+You can use the <i>scontrol</i> command to drain individual compute nodes as desired
 and return them to service.</p>
 <p>The <i>slurm.conf</i> (configuration) file needs to have the value of
 <i>InactiveLimit</i> set to zero or not specified (it defaults to a value of zero).
-This is because if there are no job steps, we don't want to purge jobs prematurely.
+This is because we don't want to purge jobs prematurely if there are no job steps.
 The value of <i>SelectType</i> must be set to "select/bluegene" in order to have
 node selection performed using a system aware of the system's topography
 and interfaces.
@@ -312,7 +314,7 @@ will wait until the bgblock identified by the MPIRUN_PARTITION environment
 variable is no longer usable by this job. It is recommended that you construct a script
 that serves this function and calls the supplied program <i>sbin/slurm_epilog</i>.
 The prolog and epilog programs are used to insure proper synchronization
-between the slurmctld daemon, the user job, and MMCS.
+between the <i>slurmctld</i> daemon, the user job, and MMCS.
 A multitude of other functions may also be placed into the prolog and
 epilog as desired (e.g. enabling/disabling user logins, purging file systems,
 etc.). Sample prolog and epilog scripts follow. </p>
@@ -355,9 +357,9 @@ is enabled to execute jobs only at certain times; while a default partition
 could be configured to execute jobs at other times.
 Jobs could still be queued in a partition that is configured in a DOWN
 state and scheduled to execute when changed to an UP state.
-midplanes can also be moved between slurm partitions either by changing
-the <i>slurm.conf</i> file and restarting the slurmctld daemon or by using
-the scontrol reconfig command. </p>
+Midplanes can also be moved between SLURM partitions either by changing
+the <i>slurm.conf</i> file and restarting the <i>slurmctld</i> daemon or by using
+the <i>scontrol</i> reconfig command. </p>
 <p>SLURM node and partition descriptions should make use of the
 <a href="#naming">naming</a> conventions described above. For example,
@@ -367,18 +369,18 @@ in an 8 by 4 by 4 matrix. The node name prefix of "bg" defined by
 NodeName can be anything you want, but needs to be consistent
 throughout the <i>slurm.conf</i> file. No computer is actually
 expected to a hostname of "bg000" and no attempt will be made to route
-message traffic to this address. Starting in 2.4 SLURM can gather how many
-Sockets, CoresPerSocket, and ThreadsPerCore are available on each
+message traffic to this address. Starting in version 2.4, SLURM can determine
+how many Sockets, CoresPerSocket, and ThreadsPerCore are available on each
 midplane, so no configuration is needed to determine how many cores
 are on each midplane.</p>
-<p>Front end nodes used for executing the slurmd daemons must also be defined
+<p>Front end nodes used for executing the <i>slurmd</i> daemons must also be defined
 in the <i>slurm.conf</i> file.
 It is recommended that at least two front end nodes be dedicated to use by
-the slurmd daemons for fault tolerance.
+the <i>slurmd</i> daemons for fault tolerance.
 For example:
 "FrontendName=frontend[00-03] State=UNKNOWN"
-is used to define four front end nodes for running slurmd daemons.</p>
+is used to define four front end nodes for running <i>slurmd</i> daemons.</p>
 <pre>
 # Portion of slurm.conf for BlueGene system
@@ -393,14 +395,14 @@ NodeName=bg[000x733] State=UNKNOWN
 <p>While users are unable to initiate SLURM job steps on BlueGene/L or BlueGene/P
 systems, this restriction does not apply to user root or <i>SlurmUser</i>.
-Be advised that the slurmd daemon is unable to manage a large number of job
+Be advised that the <i>slurmd</i> daemon is unable to manage a large number of job
 steps, so this ability should be used only to verify normal SLURM operation.
-If large numbers of job steps are initiated by slurmd, expect the daemon to
+If large numbers of job steps are initiated by <i>slurmd</i>, expect the daemon to
 fail due to lack of memory or other resources.
-It is best to minimize other work on the front end nodes executing slurmd
+It is best to minimize other work on the front end nodes executing <i>slurmd</i>
 so as to maximize its performance and minimize other risk factors.</p>
-<a name="bluegene-conf"><h2>Bluegene.conf File Creation</h2></a>
+<a name="bluegene-conf"><h2>bluegene.conf File Creation</h2></a>
 <p>In addition to the normal <i>slurm.conf</i> file, a new
 <i>bluegene.conf</i> configuration file is required with information pertinent
 to the system.
@@ -411,9 +413,9 @@ System administrators should use the <i>smap</i> tool to build appropriate
 configuration file for static partitioning.
 Note that <i>smap -Dc</i> can be run without the SLURM daemons
 active to establish the initial configuration.
-Note that the bgblocks defined using smap may not overlap (except for the
+Note that the bgblocks defined using <i>smap</i> may not overlap (except for the
 full-system bgblock, which is implicitly created).
-See the smap man page for more information.</p>
+See the <i>smap</i> man page for more information.</p>
 <p>There are 3 different modes which the system administrator can define
 BlueGene partitions (or bgblocks) available to execute jobs: static,
@@ -465,28 +467,27 @@ if resources are available and prevent larger jobs from running.
 Bgblocks need not be assigned in the <i>bluegene.conf</i> file
 for this mode.</p>
-<p>Blocks can be freed or set in an error state with scontrol,
-(i.e. "<i>scontrol update BlockName=RMP0 state=error</i>").
-This will end any job on the block and set the state of the block to ERROR
+<p>Blocks can be freed or set in an error state using the <i>scontrol</i>,
+command (i.e. "<i>scontrol update BlockName=RMP0 state=error</i>").
+This will terminate any job on the block and set the state of the block to ERROR
 making it so no job will run on the block. To set it back to a usable
-state, you can resume the block with state=resume (i.e.
-"<i>scontrol update BlockName=RMP0 state=resume</i>"). This is handy
+state, you can resume the block with the <i>scontrol</i> option state=resume
+(i.e. "<i>scontrol update BlockName=RMP0 state=resume</i>"). This is useful
 if you temporarily put the block in an error state and the block is
 really booted and ready to start jobs. You can also put the block
-in free state with the state=free. Valid states are "Error, Free,
-Recreate, Remove, Resume".
+in free state using the state=free option. Valid states are Error, Free,
+Recreate, Remove, Resume.
 <p>Alternatively, if only part of a midplane needs to be put
 into an error state which isn't already in a block of the size you
-need, you can set a collection of IO nodes into an error state using scontrol
-(i.e. "<i>scontrol update submpname=bg000[0-3] state=error</i>").
+need, you can set a collection of IO nodes into an error state using
+<i>scontrol</i> (i.e. "<i>scontrol update submpname=bg000[0-3] state=error</i>").
 This will end any job on the nodes listed, create a block there, and set
 the state of the block to ERROR making it so no job will run on the
 block. Then resume the block when it is ready to be used again (i.e.
 "<i>scontrol update BlockName=RMP0 state=resume</i>"). This is
 helpful to allow other jobs to run on the unaffected nodes in
-the midplane.
+the midplane.</p>
 <p>One of these modes must be defined in the <i>bluegene.conf</i> file
 with the option <i>LayoutMode=MODE</i> (where MODE=STATIC, DYNAMIC or OVERLAP).</p>
@@ -497,8 +498,8 @@ This is done using the keywords <i>MidplaneNodeCnt=NODE_COUNT</i>
 and <i>NodeCardNodeCnt=NODE_COUNT</i> respectively in the <i>bluegene.conf</i>
 file (i.e. <i>MidplaneNodeCnt=512</i> and <i>NodeCardNodeCnt=32</i>).</p>
-<p>Note that the <i>IONodesPerMP</i> values defined in
-<i>bluegene.conf</i> is used only when SLURM creates bgblocks this
+<p>Note that the <i>IONodesPerMP</i> value defined in
+<i>bluegene.conf</i> is used only when SLURM creates bgblocks and this
 determines if the system is IO rich or not. For most BlueGene/L
 systems this value is either 8 (for IO poor systems) or 64 (for IO rich
 systems).</p>
...
@@ -507,7 +508,7 @@ systems).</p>
...
@@ -507,7 +508,7 @@ systems).</p>
booting a bgblock and the valid images are different for each BlueGene system
booting a bgblock and the valid images are different for each BlueGene system
type (e.g. L, P and Q). Their values can change during job allocation based on
type (e.g. L, P and Q). Their values can change during job allocation based on
input from the user.
input from the user.
If you change the bgblock layout, then slurmctld and slurmd should
If you change the bgblock layout, then <i>slurmctld</i> and <i>slurmd</i> should
both be cold-started (without preserving any state information,
both be cold-started (without preserving any state information,
"/etc/init.d/slurm startclean").</p>
"/etc/init.d/slurm startclean").</p>
...
@@ -519,7 +520,7 @@ additional bgblock is created containing all resources defined
...
@@ -519,7 +520,7 @@ additional bgblock is created containing all resources defined
all of the other defined bgblocks.
all of the other defined bgblocks.
Make use of the SLURM partition mechanism to control access to these
Make use of the SLURM partition mechanism to control access to these
bgblocks.
bgblocks.
A sample <i>bluegene.conf</i> file is shown below.
A sample <i>bluegene.conf</i> file is shown below.
</p>
<pre>
<pre>
###############################################################################
###############################################################################
# Global specifications for a BlueGene/L system
# Global specifications for a BlueGene/L system
...
@@ -539,7 +540,7 @@ A sample <i>bluegene.conf</i> file is shown below.
...
@@ -539,7 +540,7 @@ A sample <i>bluegene.conf</i> file is shown below.
# AltMloaderImage: Alternative MloaderImage(s).
# AltMloaderImage: Alternative MloaderImage(s).
# AltRamDiskImage: Alternative RamDiskImage(s).
# AltRamDiskImage: Alternative RamDiskImage(s).
#
#
# LayoutMode: Mode in which slurm will create blocks:
# LayoutMode: Mode in which SLURM will create blocks:
# STATIC: Use defined non-overlapping bgblocks
# STATIC: Use defined non-overlapping bgblocks
# OVERLAP: Use defined bgblocks, which may overlap
# OVERLAP: Use defined bgblocks, which may overlap
# DYNAMIC: Create bgblocks as needed for each job
# DYNAMIC: Create bgblocks as needed for each job
...
@@ -626,8 +627,7 @@ BPs=[001x001] Type=SMALL 32CNBlocks=4 128CNBlocks=3 # 1x1x1 = 4-Nodecard sized
...
@@ -626,8 +627,7 @@ BPs=[001x001] Type=SMALL 32CNBlocks=4 128CNBlocks=3 # 1x1x1 = 4-Nodecard sized
# c-node blocks 3-Base
# c-node blocks 3-Base
# Partition Quarter sized
# Partition Quarter sized
# c-node blocks
# c-node blocks
</pre>
</pre></p>
<p>The above <i>bluegene.conf</i> file defines multiple bgblocks to be
<p>The above <i>bluegene.conf</i> file defines multiple bgblocks to be
created in a single midplane (see the "SMALL" option).
created in a single midplane (see the "SMALL" option).
...
@@ -644,33 +644,33 @@ scheduler performance.
...
@@ -644,33 +644,33 @@ scheduler performance.
As in all SLURM configuration files, parameters and values
As in all SLURM configuration files, parameters and values
are case insensitive.</p>
are case insensitive.</p>
<p>The valid image names on a BlueGene/P system are CnloadImage, MloaderImage,
<p>The valid image names on a BlueGene/P system are <i>CnloadImage</i>,
and IoloadImage. The only image name on BlueGene/Q systems is MloaderImage.
<i>MloaderImage</i>, and <i>IoloadImage</i>. The only image name on BlueGene/Q
Alternate images may be specified as described above for all BlueGene system
systems is <i>MloaderImage</i>. Alternate images may be specified as described
types.</p>
above for all BlueGene system types.</p>
<p>One more thing is required to support SLURM interactions with
<p>One more thing is required to support SLURM interactions with
the DB2 database (at least as of the time this was written).
the DB2 database (at least as of the time this was written).
DB2 database access is required by the slurmctld daemon only.
DB2 database access is required by the <i>slurmctld</i> daemon only.
All other SLURM daemons and commands interact with DB2 using
All other SLURM daemons and commands interact with DB2 using
remote procedure calls, which are processed by slurmctld.
remote procedure calls, which are processed by <i>slurmctld</i>.
DB2 access is dependent upon the environment variable
DB2 access is dependent upon the environment variable
<i>BRIDGE_CONFIG_FILE</i>.
<i>BRIDGE_CONFIG_FILE</i>.
Make sure this is set appropriate before initiating the
Make sure this is set appropriate before initiating the
slurmctld daemon.
<i>slurmctld</i> daemon.
If desired, this environment variable and any other logic
If desired, this environment variable and any other logic
can be executed through the script <i>/etc/sysconfig/slurm</i>,
can be executed through the script <i>/etc/sysconfig/slurm</i>,
which is automatically executed by <i>/etc/init.d/slurm</i>
which is automatically executed by <i>/etc/init.d/slurm</i>
prior to initiating the SLURM daemons.</p>
prior to initiating the SLURM daemons.</p>
<p>When slurmctld is initially started on an idle system, the bgblocks
<p>When <i>slurmctld</i> is initially started on an idle system, the bgblocks
already defined in MMCS are read using the Bridge APIs.
already defined in MMCS are read using the Bridge APIs.
If these bgblocks do not correspond to those defined in the <i>bluegene.conf</i>
If these bgblocks do not correspond to those defined in the <i>bluegene.conf</i>
file, the old bgblocks with a prefix of "RMP" are destroyed and new ones
file, the old bgblocks with a prefix of "RMP" are destroyed and new ones
created.
created.
When a job is scheduled, the appropriate bgblock is identified,
When a job is scheduled, the appropriate bgblock is identified,
its user set, and it is booted.
its user set, and it is booted.
Node use (virtual or coprocessor) is set from the mpirun command line now,
Node use (virtual or coprocessor) is set from the mpirun command line;
SLURM has nothing to do with setting the node use.
SLURM has nothing to do with setting the node use.
Subsequent jobs use this same bgblock without rebooting by changing
Subsequent jobs use this same bgblock without rebooting by changing
the associated user field.
the associated user field.
...
@@ -694,20 +694,23 @@ repeated reboots and the likely failure of user jobs.
...
@@ -694,20 +694,23 @@ repeated reboots and the likely failure of user jobs.
A system administrator should address the problem before returning
A system administrator should address the problem before returning
the midplanes to service.</p>
the midplanes to service.</p>
<p>If the slurmctld daemon is cold-started (<b>/etc/init.d/slurm startclean</b>
<p>If the <i>slurmctld</i> daemon is cold-started (<i>/etc/init.d/slurm startclean</i>
or <b>slurmctld -c</b>) it is recommended that the slurmd daemon(s) be
or <i>slurmctld -c</i>) it is recommended that the <i>slurmd</i> daemon(s) be
cold-started at the same time.
cold-started at the same time.
Failure to do so may result in errors being reported by both slurmd
Failure to do so may result in errors being reported by both <i>slurmd</i>
and slurmctld due to bgblocks that previously existed being deleted.</p>
and <i>slurmctld</i> due to bgblocks that previously existed being deleted.</p>
<h4>Resource Reservations</h4>
<h4>Resource Reservations</h4>
<p>SLURM's advance reservation mechanism can accept a node count specification
<p>SLURM's advance reservation mechanism can accept a node count specification
as input rather than identification of specific nodes/midplanes. In that case,
as input rather than identification of specific nodes/midplanes. In SLURM
SLURM may reserve nodes/midplanes which may not be formed into an appropriate
version 2.4, an attempt will be made to select nodes which can be used to
bgblock. Work is planned for SLURM version 2.4 to remedy this problem. Until
create a single block of the specified size. Multiple block sizes can also be
that time, identifying the specific nodes/midplanes to be included in an
specified and a reservation will be made that includes those block sizes
advanced reservation may be necessary.</p>
(e.g. <i>scontrol create reservation nodecnt=4k,2k ...</i>). In earlier
versions of SLURM, the nodes/midplanes selected for a reservation when
specifying a node count might not be suitable for creating block(s) of the
desired size(s).</p>
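To make the <i>nodecnt=4k,2k</i> notation concrete, here is a minimal Python sketch that expands such a list into plain node counts. It relies on the suffix rule stated earlier in this document ("k" for multiples of 1024, "m" for multiples of 1,048,576). The function names are hypothetical, and this is an illustration of the notation rather than the parsing code used by <i>scontrol</i>.

```python
# Suffix multipliers as described in the documentation.
SUFFIX_MULTIPLIERS = {"k": 1024, "m": 1024 * 1024}


def parse_node_count(token):
    """Convert one count such as '4k' or '512' into an integer."""
    token = token.strip().lower()
    if token and token[-1] in SUFFIX_MULTIPLIERS:
        return int(token[:-1]) * SUFFIX_MULTIPLIERS[token[-1]]
    return int(token)


def parse_node_counts(spec):
    """Convert a comma-separated list such as '4k,2k' into a list of integers."""
    return [parse_node_count(token) for token in spec.split(",")]
```

Under these rules, the reservation example above asks for one block of 4096 c-nodes and one block of 2048 c-nodes.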
<p>SLURM's advance reservation mechanism is designed to reserve resources
<p>SLURM's advance reservation mechanism is designed to reserve resources
at the level of whole nodes, which on a BlueGene systems would represent
at the level of whole nodes, which on a BlueGene systems would represent
...
@@ -723,15 +726,15 @@ explicitly reserved are available to any job.</p>
...
@@ -723,15 +726,15 @@ explicitly reserved are available to any job.</p>
"<i>Licenses=cnode*512</i>". Then create an advanced reservation with a
"<i>Licenses=cnode*512</i>". Then create an advanced reservation with a
command like this:<br>
command like this:<br>
"<i>scontrol create reservation licenses="cnode*32" starttime=now duration=30:00 users=joe</i>".<br>
"<i>scontrol create reservation licenses="cnode*32" starttime=now duration=30:00 users=joe</i>".<br>
Jobs run in this reservation will then have <b>at least</b> 32 c-nodes
Jobs run in this reservation will then have <u>at least</u> 32 c-nodes
available for their use, but could use more given an appropriate workload.</p>
available for their use, but could use more given an appropriate workload.</p>
<p>There is also a job_submit/cnode plugin available for use that will
<p>There is also a job_submit/cnode plugin available for use that will
automatically set a job's license specification to match its c-node request
automatically set a job's license specification to match its c-node request
(i.e. a command like<br>
(i.e. a command like<br>
"<i>sbatch -N32 my.sh</i>" would automatically be translated to<br>
"<i>sbatch -N32 my.sh</i>" would automatically be translated to<br>
"<i>sbatch -N32 --licenses=cnode*32 my.sh</i>" by the slurmctld daemon.
"<i>sbatch -N32 --licenses=cnode*32 my.sh</i>" by the <i>slurmctld</i> daemon.
Enable this plugin in the slurm.conf configuration file with the option
Enable this plugin in the <i>slurm.conf</i> configuration file with the option
"<i>JobSubmitPlugins=cnode</i>".</p>
"<i>JobSubmitPlugins=cnode</i>".</p>
<h4>Debugging</h4>
<h4>Debugging</h4>
...
@@ -747,26 +750,26 @@ On BlueGene systems, there is also a <i>BridgeAPILogFile</i> defined
...
@@ -747,26 +750,26 @@ On BlueGene systems, there is also a <i>BridgeAPILogFile</i> defined
in <i>bluegene.conf</i> which can be configured to contain detailed
in <i>bluegene.conf</i> which can be configured to contain detailed
information about every Bridge API call issued.</p>
information about every Bridge API call issued.</p>
<p>Note that slurmcltld log messages of the sort
<p>Note that <i>slurmctld</i> log messages of the sort
<i>Nodes bg[000x133] not responding</i> are indicative of the slurmd
<i>Nodes bg[000x133] not responding</i> are indicative of the <i>slurmd</i>
daemon serving as a front-end to those midplanes is not responding (on
daemon serving as a front-end to those midplanes is not responding (on
non-BlueGene systems, the slurmd actually does run on the compute
non-BlueGene systems, the <i>slurmd</i> actually does run on the compute
nodes, so the message is more meaningful there). </p>
nodes, so the message is more meaningful there). </p>
<p>Note that you can emulate a BlueGene/L system on stand-alone Linux
<p>Note that you can emulate a BlueGene/L system on stand-alone Linux
system.
system.
Run <b>configure</b> with the <b>--enable-bgl-emulation</b> option.
Run <i>configure</i> with the <i>--enable-bgl-emulation</i> option.
This will define "HAVE_BG", "HAVE_BGL", and "HAVE_FRONT_END" in the
This will define "HAVE_BG", "HAVE_BGL", and "HAVE_FRONT_END" in the
config.h file.
config.h file.
You can also emulate a BlueGene/P system with
You can also emulate a BlueGene/P system with
the <b>--enable-bgp-emulation</b> option.
the <i>--enable-bgp-emulation</i> option.
This will define "HAVE_BG", "HAVE_BGP", and "HAVE_FRONT_END" in the
This will define "HAVE_BG", "HAVE_BGP", and "HAVE_FRONT_END" in the
config.h file.
config.h file.
You can also emulate a BlueGene/Q system using
You can also emulate a BlueGene/Q system using
the <b>--enable-bgq-emulation</b> option.
the <i>--enable-bgq-emulation</i> option.
This will define "HAVE_BG", "HAVE_BGQ", and "HAVE_FRONT_END" in the
This will define "HAVE_BG", "HAVE_BGQ", and "HAVE_FRONT_END" in the
config.h file.
config.h file.
Then execute <b>make</b> normally.
Then execute <i>make</i> normally.
These variables will build the code as if it were running
These variables will build the code as if it were running
on an actual BlueGene computer, but avoid making calls to the
on an actual BlueGene computer, but avoid making calls to the
Bridge library (that is controlled by the variable "HAVE_BG_FILES",
Bridge library (that is controlled by the variable "HAVE_BG_FILES",
...
@@ -775,6 +778,6 @@ scheduling logic, etc. </p>
...
@@ -775,6 +778,6 @@ scheduling logic, etc. </p>
<p class="footer"><a href="#top">top</a></p>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 16 August 2011</p>
<p style="text-align:center;">Last modified 30 January 2012</p>
<!--#include virtual="footer.txt"-->
<!--#include virtual="footer.txt"-->