diff --git a/slurm/doc/html/hdf5_profile_user_guide.shtml b/slurm/doc/html/hdf5_profile_user_guide.shtml
deleted file mode 100644
index 73c6502f3a0e1bb564ae1ff4563f3bce80c2b7b6..0000000000000000000000000000000000000000
--- a/slurm/doc/html/hdf5_profile_user_guide.shtml
+++ /dev/null
@@ -1,336 +0,0 @@
-<!--#include virtual="header.txt"-->
-<!-- Copyright (C) 2013 Bull S. A. S.
- Bull, Rue Jean Jaures, B.P.68, 78340, Les Clayes-sous-Bois. -->
-
-<h1>Profiling Using HDF5 User Guide</h1>
-
-<h2>Contents</h2>
-<a href="#Overview">Overview</a><br>
-<a href="#Administration">Administration</a><br>
-<a href="#Profiling">Profiling Jobs</a><br>
-<a href="#HDF5">HDF5</a><br>
-<a href="#DataSeries">Data Structure</a><br>
-
-
-
-<a id="Overview"></a>
-<h2>Overview</h2>
-The AcctGatherProfileType/hdf5 plugin allows SLURM to coordinate collecting
-data on jobs it runs on a cluster that is more detailed than is practical to
-include in its database. The data comes from periodically sampling various
-performance data collected by SLURM, the operating system, or
-component software. The plugin will record the data from each source
-as a <b>Time Series</b> and also accumulate totals for each statistic for
-the job.
-
-<p>Time Series include energy data collected by an AcctGatherEnergy plugin,
-I/O data from a network interface collected by an AcctGatherInfiniband plugin,
-I/O data from parallel file systems such as Lustre,
-and task performance data such as local disk I/O, CPU consumption,
-and memory use, as well as potential data from other sources.
-
-<p>The data is collected into a file on a shared file system for each step on
-each allocated node of a job and then merged into an HDF5 file.
-Individual files on a shared file system were chosen because the data may be
-voluminous, so solutions that pass data to the SLURM control
-daemon via RPC may not scale to very large clusters or jobs with
-many allocated nodes.
-
-<p>A separate <a href="acct_gather_profile_plugins.html">
-SLURM Profile Accounting Plugin API (AcctGatherProfileType)</a> documents how
-to write other Profile Accounting plugins.
-
-<a id="Administration"></a>
-<h2>Administration</h2>
-
-<h3>Shared File System</h3>
-<div style="margin-left: 20px;">
-The HDF5 Profile Plugin requires a common shared file system on all the compute
-nodes. While a job is running, the plugin writes a file into this file
-system for each step of the job on each node. When the job ends,
-the merge process is launched and the node-step files are combined into one
-HDF5 file for the job.
-<p>
-The root of the directory structure is declared in the <b>ProfileHDF5Dir</b>
-option in the acct_gather.conf file. The directory will be created by SLURM
-if it doesn't exist.
-<p>
-Each user who creates a profile will have a subdirectory of the profile
-directory that has read/write permission only for the user.
-</span>
-</div>
-<h3>Configuration parameters</h3>
-
-<div style="margin-left: 20px;">
-The profile plugin is enabled in the
-<a href="slurm.conf.html">slurm.conf</a> file, but is internally
-configured in the
-<a href="acct_gather.conf.html">acct_gather.conf</a> file.
-</div>
-<div style="margin-left: 20px;">
-<h4>slurm.conf parameters</h4>
-<div style="margin-left: 20px;">
-This line in slurm.conf enables the HDF5 Profile Plugin.
-<br><b>AcctGatherProfileType=acct_gather_profile/hdf5</b>
-</div>
-</div>
-<div style="margin-left: 20px;">
-<h4>acct_gather.conf parameters</h4>
-<div style="margin-left: 20px;">
-These parameters are used directly by the HDF5 Profile Plugin.
-<dl>
-<dt><B>ProfileHDF5Dir</B>=&lt;path&gt;</dt>
-<dd>This parameter is the path to the shared folder into which the
-acct_gather_profile plugin will write detailed data as an HDF5 file.
-The directory is assumed to be on a file system shared by the controller and
-all compute nodes. This is a required parameter.
-<dt><B>ProfileHDF5CollectDefault</B>=opt{,opt{,opt}}</dt>
-<dd>Default <b>--profile</b> value for data types collected for each job
-submission. It is a comma-separated list of data streams.
-Use this option with caution. A node-step file will be created on every
-node for every step of every job. They will not automatically be merged
-into job files. (Even job files for small jobs would fill the
-file system.) This option is intended for test environments where you
-might want to profile a series of jobs but do not want to have to
-add the --profile option to the launch scripts.</dd>
-</dl>
-</div>
-</div>
-
-
-<div style="margin-left: 20px;">
-<h4>Time Series Control Parameters</h4>
-<div style="margin-left: 20px;">
-Other plugins add time series data to the HDF5 collection. They typically
-have a polling frequency specified in one of the above configuration files.
-<p>
-The following table summarizes the parameters that control sample frequency.
-<p>
-<table border="1" style="margin-left: 20px;
-padding: 5px;" >
-<tr><th>Conf file</th><th>Parameter</th><th>Time Series</th></tr>
-<tr><td>slurm.conf</td><td>JobAcctGatherFrequency</td><td>Task, Lustre</td></tr>
-<tr><td>acct_gather.conf</td><td>EnergyIPMIFrequency</td><td>Energy</td></tr>
-<tr><td>acct_gather.conf</td><td>InfinibandOFEDFrequency</td>
-<td>Network</td></tr>
-</table>
-</div>
-</div>
-<a id="Profiling"></a>
-<h2>Profiling Jobs</h2>
-<h3>Data Collection</h3>
-The --profile option on salloc|sbatch|srun controls whether data is
-collected and what type of data is collected. If --profile is not specified,
-the default is no data collected (unless the <B>ProfileHDF5CollectDefault</B>
-option is used in acct_gather.conf). --profile on the command line overrides
-any value specified in the configuration file.<p>
-
-<DT><B>--profile</B>=&lt;all|none|[energy[,|task[,|lustre[,|network]]]]&gt;
-<DD>
-Enables detailed data collection by the acct_gather_profile plugin.
-Detailed data are typically time series that are stored in an HDF5 file for
-the job.</DD>
-</DT>
-<P>
-<div style="margin-left: 20px;">
-<DL>
-<DT><B>All</B>
-<DD>All data types are collected. (Cannot be combined with other values.)
-</DD></DT>
-<P>
-<DT><B>None</B>
-<DD>No data types are collected. This is the default.
-<BR> (Cannot be combined with other values.)
-</DD></DT>
-
-<DT><B>Energy</B>
-<DD>Energy data is collected.</DD></DT>
-
-<DT><B>Task</B>
-<DD>Task (I/O, Memory, ...) data is collected.</DD></DT>
-
-<DT><B>Lustre</B>
-<DD>Lustre data is collected.</DD></DT>
-
-<DT><B>Network</B>
-<DD>Network (InfiniBand) data is collected.</DD></DT>
-
-</DL>
-</div>
-
-<h3>Data Consolidation</h3>
-The node-step files are merged into one HDF5 file for the job using
-<a href="sh5util.html">sh5util</a>.
-
-<p>The command line may be added to the normal launch script if the job is
-started with sbatch. For example:
-<pre>
-sbatch -n1 -d$last_job_id --wrap="sh5util --profile=none -j $last_job_id"
-</pre>
-Note that --profile=none is required if the enclosing sbatch command included
-a --profile parameter.
-
-<h3>Data Extraction</h3>
-The <a href="sh5util.html">sh5util</a> program can also be used to extract
-specific data from the HDF5 file and write it in <i>comma separated value</i>
-format for import into other analysis tools such as spreadsheets.
-
-<a id="HDF5"></a>
-<h2>HDF5</h2>
-HDF5 is a well-known structured data format that allows heterogeneous but
-related data to be stored in one file
-(i.e. sections for energy statistics, sections for network I/O,
-sections for Task data, …).
-Its internal structure resembles a
-file system with <b>groups</b> being similar to <i>directories</i> and
-<b>data sets</b> being similar to <i>files</i>. It also allows <b>attributes</b>
-to be attached to groups to store application defined properties.
-
-<p>There are commodity programs, notably
-<a href="http://www.hdfgroup.org/hdf-java-html/hdfview/index.html">
-HDFView</a>, for viewing and manipulating these files.
-
-<p>Below is a screen shot from HDFView expanding the job tree and showing the
-attributes for a specific task.
-<p>
-<img src="hdf5_task_attr.png" width="275" height="275" >
-
-
-<a id="DataSeries"></a>
-<h2>Data Structure</h2>
-
-<table>
-<tr>
-<td><img src="hdf5_job_outline.png" width="205" height="570"></td>
-<td style="vertical-align: top;">
-<div style="margin-left: 5px;">
-In the job file, there will be a group for each <b>step</b> of the job.
-Within each step, there will be a group for nodes and a group for tasks.
-</div>
-<ul>
-<li>
-The <b>nodes</b> group will have a group for each node in the step allocation.
-For each node group, there is a sub-group for Time Series and another
-for Totals.
-<ul>
-<li>
-The <b>Time Series</b> group
-contains a group/dataset containing the time series for each collector.
-</li>
-<li>
-The <b>Totals</b> group contains a corresponding group/dataset that has the
-Minimum, Average, Maximum, and Sum Total for each item in the time series.
-</li>
-</ul>
-<li>
-The <b>Tasks</b> group contains only a subgroup for each task.
-Each subgroup primarily contains an attribute stating the node on which the
-task was executed. This set of groups is essentially a cross reference table.
-</li>
-</ul>
-</td></tr>
-</table>
-
-<h3>Energy Data</h3>
-<b>AcctGatherEnergyType=acct_gather_energy/ipmi</b>
-is required in slurm.conf to collect energy data.
-Also appropriately set
-<b>EnergyIPMIFrequency</b>
-in acct_gather.conf.
-<DL>
-<DT><B>Date Time</B>
-<DD>Time of day at which the data sample was taken. This can be used to
-correlate activity with other sources such as logs.</DD></DT>
-<DT><B>Time</B>
-<DD>Elapsed time since the beginning of the step.</DD></DT>
-<DT><B>Power</B>
-<DD>Power consumption during the interval.</DD></DT>
-<DT><B>CPU Frequency</B>
-<DD>CPU Frequency at time of sample in kilohertz.</DD></DT>
-</DL>
-
-<h3>Infiniband Data</h3>
-<b>AcctGatherInfinibandType=acct_gather_infiniband/ofed</b>
-is required in slurm.conf to collect network data.
-Also appropriately set
-<b>InfinibandOFEDFrequency</b>
-in acct_gather.conf.
-Each data sample in the Network Time Series contains the following data items.
-<DL>
-<DT><B>Date Time</B>
-<DD>Time of day at which the data sample was taken. This can be used to
-correlate activity with other sources such as logs.</DD></DT>
-<DT><B>Time</B>
-<DD>Elapsed time since the beginning of the step.</DD></DT>
-<DT><B>Packets In</B>
-<DD>Number of packets coming in.</DD></DT>
-<DT><B>Megabytes Read</B>
-<DD>Number of megabytes coming in through the interface.</DD></DT>
-<DT><B>Packets Out</B>
-<DD>Number of packets going out.</DD></DT>
-<DT><B>Megabytes Write</B>
-<DD>Number of megabytes going out through the interface.</DD></DT>
-</DL>
-
-<h3>Lustre Data</h3>
-<b>JobAcctGatherType=jobacct_gather/linux</b>
-is required in slurm.conf to collect Lustre data.
-Also appropriately set
-<b>JobAcctGatherFrequency</b>
-in slurm.conf.
-<p>
-Each data sample in the Lustre Time Series contains the following data items.
-<DL>
-<DT><B>Date Time</B>
-<DD>Time of day at which the data sample was taken. This can be used to
-correlate activity with other sources such as logs.</DD></DT>
-<DT><B>Time</B>
-<DD>Elapsed time since the beginning of the step.</DD></DT>
-<DT><B>Reads</B>
-<DD>Number of read operations.</DD></DT>
-<DT><B>Megabytes Read</B>
-<DD>Number of megabytes read.</DD></DT>
-<DT><B>Writes</B>
-<DD>Number of write operations.</DD></DT>
-<DT><B>Megabytes Write</B>
-<DD>Number of megabytes written.</DD></DT>
-</DL>
-
-<h3>Task Data</h3>
-<b>JobAcctGatherType=jobacct_gather/linux</b>
-is required in slurm.conf to collect task data.
-Also appropriately set
-<b>JobAcctGatherFrequency</b>
-in slurm.conf.
-<DL>
-<DT><B>Date Time</B>
-<DD>Time of day at which the data sample was taken. This can be used to
-correlate activity with other sources such as logs.</DD></DT>
-<DT><B>Time</B>
-<DD>Elapsed time since the beginning of the step.</DD></DT>
-<DT><B>CPU Frequency</B>
-<DD>CPU Frequency at time of sample.</DD></DT>
-<DT><B>CPU Time</B>
-<DD>Seconds of CPU time used during the sample.</DD></DT>
-<DT><B>CPU Utilization</B>
-<DD>CPU Utilization during the interval.</DD></DT>
-
-
-<DT><B>RSS</B>
-<DD>Value of RSS at time of sample.</DD></DT>
-<DT><B>VM Size</B>
-<DD>Value of VM Size at time of sample.</DD></DT>
-<DT><B>Pages</B>
-<DD>Pages used in sample.</DD></DT>
-<DT><B>Read Megabytes</B>
-<DD>Number of megabytes read from local disk.</DD></DT>
-<DT><B>Write Megabytes</B>
-<DD>Number of megabytes written to local disk.</DD></DT>
-</DL>
-
-
-<p class="footer"><a href="#top">top</a></p>
-
-<p style="text-align:center;">Last modified 17 May 2013</p>
-
-<!--#include virtual="footer.txt"-->