Commit 6d57b632
authored 11 years ago by Morris Jette
Merge branch 'slurm-2.6'
parents 47c92666 3c4892ad
Showing 2 changed files with 78 additions and 75 deletions
doc/html/hdf5_profile_user_guide.shtml: 72 additions, 72 deletions
doc/html/slurm.shtml: 6 additions, 3 deletions
doc/html/hdf5_profile_user_guide.shtml (+72 −72)
...
@@ -9,78 +9,78 @@
...
@@ -9,78 +9,78 @@
<a href="#Administration">Administration</a><br>
<a href="#Administration">Administration</a><br>
<a href="#Profiling">Profiling Jobs</a><br>
<a href="#Profiling">Profiling Jobs</a><br>
<a href="#HDF5">HDF5</a><br>
<a href="#HDF5">HDF5</a><br>
<a href="#DataSeries">Data Series</a><br>
<a href="#DataSeries">Data Structure</a><br>
<a id="Overview"></a>
<a id="Overview"></a>
<h2>Overview</h2>
<h2>Overview</h2>
The AcctGatherProfileType/hdf5 plugin allows SLURM to coordinate collecting
<p>
The AcctGatherProfileType/hdf5 plugin allows SLURM to coordinate collecting
data on jobs it runs on a cluster that is more detailed than is practical to
data on jobs it runs on a cluster that is more detailed than is practical to
include in its database. The data comes from periodically sampling various
include in its database. The data comes from periodically sampling various
performance data either collected by SLURM, the operating system, or
performance data either collected by SLURM, the operating system, or
component software. The plugin will record the data from each source
component software. The plugin will record the data from each source
as a <b>Time Series</b> and also accumulate totals for each statistic for
as a <b>Time Series</b> and also accumulate totals for each statistic for
the job.
the job.
</p>
<p>Time Series are energy data collected by an acct_gather_energy plugin,
<p>Time Series are energy data collected by an acct_gather_energy plugin,
I/O data from a network interface collected by an acct_gather_infiniband plugin,
I/O data from a network interface collected by an acct_gather_infiniband plugin,
I/O data from parallel file systems such as Lustre collected by an
I/O data from parallel file systems such as Lustre collected by an
acct_gather_filesystem plugin, and task performance data such as local disk I/O,
acct_gather_filesystem plugin, and task performance data such as local disk I/O,
cpu consumption, and memory use from a jobacct_gather plugin.
cpu consumption, and memory use from a jobacct_gather plugin.
Data from other sources may be added in the future.
Data from other sources may be added in the future.
</p>
<p>The data is collected into a file on a shared file system for each step on
<p>The data is collected into a file on a shared file system for each step on
each allocated node of a job and then merged into a HDF5 file.
each allocated node of a job and then merged into a HDF5 file.
Individual files on a shared file system was chosen because it is possible
Individual files on a shared file system was chosen because it is possible
that the data is voluminous so solutions that pass data to the SLURM control
that the data is voluminous so solutions that pass data to the SLURM control
daemon via RPC may not scale to very large clusters or jobs with
daemon via RPC may not scale to very large clusters or jobs with
many allocated nodes.
many allocated nodes.
</p>
<p>A separate <a href="acct_gather_profile_plugins.html">
<p>A separate <a href="acct_gather_profile_plugins.html">
SLURM Profile Accounting Plugin API (AcctGatherProfileType)</a> documents how
SLURM Profile Accounting Plugin API (AcctGatherProfileType)</a> documents how
write other Profile Accounting plugins.
write other Profile Accounting plugins.
</P>
<a id="Administration"></a>
<a id="Administration"></a>
<h2>Administration</h2>
<h2>Administration</h2>
<h3>Shared File System</h3>
<h3>Shared File System</h3>
<div style="margin-left: 20px;">
<div style="margin-left: 20px;">
The HDF5 Profile Plugin requires a common shared file system on all the compute
<p>The HDF5 Profile Plugin requires a common shared file system on all
nodes. While a job is running, the plugin writes a file into this file
the compute nodes. While a job is running, the plugin writes a
system for each step of the job on each node. When the job ends,
file into this file system for each step of the job on each node. When
the merge process is launched and the node-step files are combined into one
the job ends, the merge process is launched and the node-step files
HDF5 file for the job.
are combined into one HDF5 file for the job.</p>
<p>
The root of the directory structure is declared in the <b>ProfileHDF5Dir</b>
<p>
The root of the directory structure is declared in the <b>ProfileHDF5Dir</b>
option in the acct_gather.conf file. The directory will be created by SLURM
option in the acct_gather.conf file. The directory will be created by SLURM
if it doesn't exist.
if it doesn't exist.
</p>
<p>
Each user that creates a profile will have a subdirector to the profile
<p>
Each user that creates a profile will have a subdirector to the profile
directory that has read/write permission only for the user.
directory that has read/write permission only for the user.
</p>
</span>
</span>
</div>
</div>
<h3>Configuration parameters</h3>
<h3>Configuration parameters</h3>
<div style="margin-left: 20px;">
<p>
<div style="margin-left: 20px;">
The profile plugin is enabled in the
<p>
The profile plugin is enabled in the
<a href="slurm.conf.html">slurm.conf</a> file, but is internally
<a href="slurm.conf.html">slurm.conf</a> file, but is internally
configured in the
configured in the
<a href="acct_gather.conf.html">acct_gather.conf</a> file.
<a href="acct_gather.conf.html">acct_gather.conf</a> file.
</p>
</div>
</div>
<div style="margin-left: 20px;">
<div style="margin-left: 20px;">
<h4>slurm.conf parameters</h4>
<h4>slurm.conf parameters</h4>
<div style="margin-left: 20px;">
<div style="margin-left: 20px;">
<br><b>AcctGatherProfileType=acct_gather_profile/hdf5</b> enables the HDF5
<p><b>AcctGatherProfileType=acct_gather_profile/hdf5</b> enables the HDF5
plugin.
plugin.</p>
<br><b>JobAcctGatherFrequency=[energy=freq[,lustre=freq[,network=freq[task=freq]]]]</b>
<p><b>JobAcctGatherFrequency=[energy=freq[,lustre=freq[,network=freq[task=freq]]]]</b>
sets default sample frequencies for data types.
sets default sample frequencies for data types.</p>
</div>
</div>
</div>
</div>
<div style="margin-left: 20px;">
<div style="margin-left: 20px;">
<h4>act_gather.conf parameters</h4>
<h4>act_gather.conf parameters</h4>
<div style="margin-left: 20px;">
<div style="margin-left: 20px;">
These parameters are directly used by the HDF5 Profile Plugin.
<p>
These parameters are directly used by the HDF5 Profile Plugin.
</p>
<dl>
<dl>
<dt><B>ProfileHDF5Dir</B>=<path></dt>
<dt><B>ProfileHDF5Dir</B>=<path></dt>
<dd>This parameter is the path to the shared folder into which the
<dd>This parameter is the path to the shared folder into which the
...
@@ -104,54 +104,52 @@ add the --profile option to the launch scripts.</dd>
...
@@ -104,54 +104,52 @@ add the --profile option to the launch scripts.</dd>
<div style="margin-left: 20px;">
<div style="margin-left: 20px;">
<h4>Time Series Control Parameters</h4>
<h4>Time Series Control Parameters</h4>
<div style="margin-left: 20px;">
<div style="margin-left: 20px;">
Other plugins add time series data to the HDF5 collection. They typically
<p>
Other plugins add time series data to the HDF5 collection. They typically
have a default polling frequency specified in slurm.conf in the
have a default polling frequency specified in slurm.conf in the
JobAcctGatherFrequency parameter. The polling frequency can be overridden
JobAcctGatherFrequency parameter. The polling frequency can be overridden
using the --acctg-freq
using the --acctg-freq
<a href="srun.html">srun</a> parameter.
<a href="srun.html">srun</a> parameter.
They are both of the form task=sec,energy=sec,luster=sec,network=sec.
They are both of the form task=sec,energy=sec,luster=sec,network=sec.
<p>
<p>
The IPMI energy plugin also needs the EnergyIPMIFrequency value set
<p>
The IPMI energy plugin also needs the EnergyIPMIFrequency value set
in the acct_gather.conf file. This sets the rate at which the plugin samples
in the acct_gather.conf file. This sets the rate at which the plugin samples
the external sensors. This value should be the same as the energy=sec in
the external sensors. This value should be the same as the energy=sec in
either JobAcctGatherFrequency or --acctg-freq.
either JobAcctGatherFrequency or --acctg-freq.
</p>
<p>
Note that the IPMI and profile sampling are not synchronous.
<p>
Note that the IPMI and profile sampling are not synchronous.
The profile sample simply takes the last available IPMI sample value.
The profile sample simply takes the last available IPMI sample value.
If the profile energy sample is more frequent than the IPMI sample rate,
If the profile energy sample is more frequent than the IPMI sample rate,
the IPMI value will be repeated. If the profile energy sample is greater
the IPMI value will be repeated. If the profile energy sample is greater
than the IPMI rate, IPMI values will be lost.
than the IPMI rate, IPMI values will be lost.
</p>
<p>
Also note that smallest effective IPMI (EnergyIPMIFrequency) sample rate
<p>
Also note that smallest effective IPMI (EnergyIPMIFrequency) sample rate
for 2013 era Intel processors is 3 seconds.
for 2013 era Intel processors is 3 seconds.
</p>
<p>
</div>
</div>
</div>
</div>
<a id="Profiling"></a>
<a id="Profiling"></a>
<h2>Profiling Jobs</h2>
<h2>Profiling Jobs</h2>
<h3>Data Collection</h3>
<h3>Data Collection</h3>
The --profile option on salloc|sbatch|srun controls whether data is
<p>
The --profile option on salloc|sbatch|srun controls whether data is
collected and what type of data is collected. If --profile is not specified
collected and what type of data is collected. If --profile is not specified
no data collected unless the <B>ProfileHDF5CollectDefault</B>
no data collected unless the <B>ProfileHDF5CollectDefault</B>
option is used in acct_gather.conf. --profile on the command line overrides
option is used in acct_gather.conf. --profile on the command line overrides
any value specified in the configuration file.<p>
any value specified in the configuration file.</p>
<DT><B>--profile</B>=<all|none|[energy[,|task[,|lustre[,|network]]]]>
<DT><B>--profile</B>=<all|none|[energy[,|task[,|lustre[,|network]]]]>
<DD>
<DD>
enables detailed data collection by the acct_gather_profile plugin.
<p>
enables detailed data collection by the acct_gather_profile plugin.
Detailed data are typically time-series that are stored in a HDF5 file for
Detailed data are typically time-series that are stored in a HDF5 file for
the job.</DD>
the job.</p></DD>
</DT>
</DT>
<P>
<div style="margin-left: 20px;">
<div style="margin-left: 20px;">
<DL>
<DL>
<DT><B>All</B>
<DT><B>All</B>
<DD>All data types are collected. (Cannot be combined with other values.)
<DD>All data types are collected. (Cannot be combined with other values.)
</DD></DT>
</DD></DT>
<P>
<DT><B>None</B>
<DT><B>None</B>
<DD>No data types are collected. This is the default. (Cannot be combined with
<DD>No data types are collected. This is the default. (Cannot be
other values.)
combined with other values.)
</DD></DT>
</DD></DT>
<DT><B>Energy</B>
<DT><B>Energy</B>
...
@@ -170,37 +168,37 @@ other values.)
...
@@ -170,37 +168,37 @@ other values.)
</div>
</div>
<h3>Data Consolidation</h3>
<h3>Data Consolidation</h3>
The node-step files are merged into one HDF5 file for the job using the
<p>
The node-step files are merged into one HDF5 file for the job using the
<a href="sh5util.html">sh5util</a>.
<a href="sh5util.html">sh5util</a>.
</p>
<p>If the job is started with sbatch, the command line may added to the normal
<p>If the job is started with sbatch, the command line may added to the normal
launch script, For example;
launch script, For example:</p>
<pre>
<pre>
sbatch -n1 -d$SLURM_JOB_ID --wrap="sh5util -j $SLURM_JOB_ID"
sbatch -n1 -d$SLURM_JOB_ID --wrap="sh5util -j $SLURM_JOB_ID"
</pre>
</pre>
<h3>Data Extraction</h3>
<h3>Data Extraction</h3>
The <a href="sh5util.html">sh5util</a> program can also be used to extract
<p>
The <a href="sh5util.html">sh5util</a> program can also be used to extract
specific data from the HDF5 file and write it in <i>comma separated value (csv)</i>
specific data from the HDF5 file and write it in <i>comma separated value (csv)</i>
form for importation into other analysis tools such as spreadsheets.
form for importation into other analysis tools such as spreadsheets.
</p>
<a id="HDF5"></a>
<a id="HDF5"></a>
<h2>HDF5</h2>
<h2>HDF5</h2>
HDF5 is a well known structured data set that allows heterogeneous but
<p>
HDF5 is a well known structured data set that allows heterogeneous but
related data to be stored in one file.
related data to be stored in one file.
(.i.e. sections for energy statistics, network I/O, Task data, …)
(.i.e. sections for energy statistics, network I/O, Task data, etc.)
Its internal structure resembles a
Its internal structure resembles a
file system with <b>groups</b> being similar to <i>directories</i> and
file system with <b>groups</b> being similar to <i>directories</i> and
<b>data sets</b> being similar to <i>files</i>. It also allows <b>attributes</b>
<b>data sets</b> being similar to <i>files</i>. It also allows <b>attributes</b>
to be attached to groups to store application defined properties.
to be attached to groups to store application defined properties.
</p>
<p>There are commodity programs, notably
<p>There are commodity programs, notably
<a href="http://www.hdfgroup.org/hdf-java-html/hdfview/index.html">
<a href="http://www.hdfgroup.org/hdf-java-html/hdfview/index.html">
HDFView</a> for viewing and manipulating these files.
HDFView</a> for viewing and manipulating these files.
<p>Below is a screen shot from HDFView expanding the job tree and showing the
<p>Below is a screen shot from HDFView expanding the job tree and showing the
attributes for a specific task.
attributes for a specific task.
</p>
<p>
<br>
<img src="hdf5_task_attr.png" width="275" height="275" >
<img src="hdf5_task_attr.png" width="275" height="275" >
...
@@ -212,8 +210,8 @@ attributes for a specific task.
...
@@ -212,8 +210,8 @@ attributes for a specific task.
<td><img src="hdf5_job_outline.png" width="205" height="570"></td>
<td><img src="hdf5_job_outline.png" width="205" height="570"></td>
<td style="vertical-align: top;">
<td style="vertical-align: top;">
<div style="margin-left: 5px;">
<div style="margin-left: 5px;">
In the job file, there will be a group for each <b>step</b> of the job.
<p>
In the job file, there will be a group for each <b>step</b> of the job.
Within each step, there will be a group for nodes, and a group for tasks.
Within each step, there will be a group for nodes, and a group for tasks.
</p>
</div>
</div>
<ul>
<ul>
<li>
<li>
...
@@ -240,13 +238,13 @@ executed. This set of groups is essentially a cross reference table.
...
@@ -240,13 +238,13 @@ executed. This set of groups is essentially a cross reference table.
</table>
</table>
<h3>Energy Data</h3>
<h3>Energy Data</h3>
<b>AcctGatherEnergyType=acct_gather_energy/ipmi</b>
<p>
<b>AcctGatherEnergyType=acct_gather_energy/ipmi</b>
is required in slurm.conf to collect energy data.
is required in slurm.conf to collect energy data.
Appropriately set energy=freq in either JobAcctGatherFrequency in slurm.conf
Appropriately set energy=freq in either JobAcctGatherFrequency in slurm.conf
or in --acctg-freq on the command line.
or in --acctg-freq on the command line.
Also appropriately set EnergyIPMIFrequency in acct_gather.conf.
Also appropriately set EnergyIPMIFrequency in acct_gather.conf.
</p>
<p>Each data sample in the Energe Time Series contains the following data items.
<p>Each data sample in the Energe Time Series contains the following data items.
<DL>
</p>
<DL>
<DT><B>Date Time</B>
<DT><B>Date Time</B>
<DD>Time of day at which the data sample was taken. This can be used to
<DD>Time of day at which the data sample was taken. This can be used to
correlate activity with other sources such as logs.</DD></DT>
correlate activity with other sources such as logs.</DD></DT>
...
@@ -259,13 +257,13 @@ correlate activity with other sources such as logs.</DD></DT>
...
@@ -259,13 +257,13 @@ correlate activity with other sources such as logs.</DD></DT>
</DL>
</DL>
<h3>Luster Data</h3>
<h3>Luster Data</h3>
<b>AcctGatherFilesystemType=acct_gather_filesystem/lustre</b>
<p>
<b>AcctGatherFilesystemType=acct_gather_filesystem/lustre</b>
is required in slurm.conf to collect task data.
is required in slurm.conf to collect task data.
Appropriately set luster=freq in either JobAcctGatherFrequency in slurm.conf
Appropriately set luster=freq in either JobAcctGatherFrequency in slurm.conf
or in --acctg-freq on the command line.
or in --acctg-freq on the command line.
</p>
<p>
Each data sample in the Lustre Time Series contains the following data items.
<p>
Each data sample in the Lustre Time Series contains the following data items.
<DL>
</p>
<DL>
<DT><B>Date Time</B>
<DT><B>Date Time</B>
<DD>Time of day at which the data sample was taken. This can be used to
<DD>Time of day at which the data sample was taken. This can be used to
correlate activity with other sources such as logs.</DD></DT>
correlate activity with other sources such as logs.</DD></DT>
...
@@ -282,11 +280,12 @@ correlate activity with other sources such as logs.</DD></DT>
...
@@ -282,11 +280,12 @@ correlate activity with other sources such as logs.</DD></DT>
</DL>
</DL>
<h3>Network (Infiniband Data)</h3>
<h3>Network (Infiniband Data)</h3>
<b>JobAcctInfinibandType=acct_gather_infiniband/ofed</b>
<p>
<b>JobAcctInfinibandType=acct_gather_infiniband/ofed</b>
is required in slurm.conf to collect task data.
is required in slurm.conf to collect task data.
Appropriately set network=freq in either JobAcctGatherFrequency in slurm.conf
Appropriately set network=freq in either JobAcctGatherFrequency in slurm.conf
or in --acctg-freq on the command line.
or in --acctg-freq on the command line.</p>
<p>Each data sample in the Network Time Series contains the following data items.
<p>Each data sample in the Network Time Series contains the following
data items.</p>
<DL>
<DL>
<DT><B>Date Time</B>
<DT><B>Date Time</B>
<DD>Time of day at which the data sample was taken. This can be used to
<DD>Time of day at which the data sample was taken. This can be used to
...
@@ -304,11 +303,12 @@ correlate activity with other sources such as logs.</DD></DT>
...
@@ -304,11 +303,12 @@ correlate activity with other sources such as logs.</DD></DT>
</DL>
</DL>
<h3>Task Data</h3>
<h3>Task Data</h3>
<b>JobAcctGatherType=jobacct_gather/linux</b>
<p>
<b>JobAcctGatherType=jobacct_gather/linux</b>
is required in slurm.conf to collect task data.
is required in slurm.conf to collect task data.
Appropriately set task=freq in either JobAcctGatherFrequency in slurm.conf
Appropriately set task=freq in either JobAcctGatherFrequency in slurm.conf
or in --acctg-freq on the command line.
or in --acctg-freq on the command line.</p>
<p>Each data sample in the Task Time Series contains the following data items.
<p>Each data sample in the Task Time Series contains the following data
items.</p>
<DL>
<DL>
<DT><B>Date Time</B>
<DT><B>Date Time</B>
<DD>Time of day at which the data sample was taken. This can be used to
<DD>Time of day at which the data sample was taken. This can be used to
...
@@ -336,6 +336,6 @@ correlate activity with other sources such as logs.</DD></DT>
...
@@ -336,6 +336,6 @@ correlate activity with other sources such as logs.</DD></DT>
<p class="footer"><a href="#top">top</a></p>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 12 June 2013</p>
<p style="text-align:center;">Last modified 1 July 2013</p>
<!--#include virtual="footer.txt"-->
<!--#include virtual="footer.txt"-->
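Reviewer note: the guide above says JobAcctGatherFrequency and --acctg-freq share the form task=sec,energy=sec,... and that EnergyIPMIFrequency should equal the energy=sec value. The following is a minimal sketch of that consistency check, not part of Slurm or of this commit. The function names parse_acctg_freq and ipmi_matches_energy are hypothetical, and the parser treats data type names as opaque strings rather than validating them against Slurm's accepted list.

```python
def parse_acctg_freq(spec):
    """Parse a string such as 'task=30,energy=3' into {'task': 30, 'energy': 3}.

    Hypothetical helper for illustration; Slurm does its own parsing.
    """
    freqs = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty fields such as a trailing comma
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected name=seconds, got {item!r}")
        freqs[name.strip()] = int(value)
    return freqs


def ipmi_matches_energy(spec, energy_ipmi_frequency):
    """Return True when the energy=sec entry equals EnergyIPMIFrequency."""
    return parse_acctg_freq(spec).get("energy") == energy_ipmi_frequency
```

For example, ipmi_matches_energy("task=30,energy=3", 3) holds, while a spec with no energy entry, or with a different energy interval, does not.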
doc/html/slurm.shtml (+6 −3)
...
@@ -53,7 +53,10 @@ help identify load imbalances and other anomalies.</li>
...
@@ -53,7 +53,10 @@ help identify load imbalances and other anomalies.</li>
</ul></p>
</ul></p>
<p>Slurm provides workload management on many of the most powerful computers in
<p>Slurm provides workload management on many of the most powerful computers in
the world including:
the world. On the June 2013 <a href="http://www.top500.org">Top500</a> list,
five of the ten top systems use Slurm including the number one system.
These five systems alone contain over 5.7 million cores.
A few of the systems using Slurm are listed below:
<ul>
<ul>
<li><a href="http://www.top500.org/blog/lists/2013/06/press-release/">
<li><a href="http://www.top500.org/blog/lists/2013/06/press-release/">
Tianhe-2</a> designed by
Tianhe-2</a> designed by
...
@@ -74,7 +77,7 @@ is a <a herf="http://www.dell.com">Dell</a> with over
...
@@ -74,7 +77,7 @@ is a <a herf="http://www.dell.com">Dell</a> with over
80,000 <a href="http://www.intel.com">Intel</a> Xeon cores,
80,000 <a href="http://www.intel.com">Intel</a> Xeon cores,
Intel Phi co-processors, plus
Intel Phi co-processors, plus
128 <a href="http://www.nvidia.com">NVIDIA</a> GPUs
128 <a href="http://www.nvidia.com">NVIDIA</a> GPUs
delivering 2.66 Petaflops.</li>
delivering 5.17 Petaflops.</li>
<li><a href="http://www-hpc.cea.fr/en/complexe/tgcc-curie.htm">TGCC Curie</a>,
<li><a href="http://www-hpc.cea.fr/en/complexe/tgcc-curie.htm">TGCC Curie</a>,
owned by <a href="http://www.genci.fr">GENCI</a> and operated in the TGCC by
owned by <a href="http://www.genci.fr">GENCI</a> and operated in the TGCC by
...
@@ -110,6 +113,6 @@ named after Monte Rosa in the Swiss-Italian Alps, elevation 4,634m.
...
@@ -110,6 +113,6 @@ named after Monte Rosa in the Swiss-Italian Alps, elevation 4,634m.
</ul>
</ul>
<p style="text-align:center;">Last modified 24 June 2013</p>
<p style="text-align:center;">Last modified 1 July 2013</p>
<!--#include virtual="footer.txt"-->
<!--#include virtual="footer.txt"-->