Commit 005b0e77 authored by Alejandro Sanchez

acct_gather_profile/influxdb - add documentation.

Bug 2693.
parent f32e4667
......@@ -16,7 +16,8 @@ be profiled.)
A seperate
<a href="hdf5_profile_user_guide.html">User Guide</a> documents how to use
-the hdf5 version of the plugin.
+the hdf5 version of the plugin. An influxdb plugin is also available since
+Slurm 17.11.
<p>The plugin provides an API for making calls to store data at various
points in a step's lifecycle. It collects data periodically from potentially
......@@ -29,12 +30,12 @@ avoids having to transfer files back to the controller at step end. Data is
typically gathered at job_acct_gather interval or acct_gather_energy interval
and the volume is not expected to be burdensome.
-<p>The <i>hdf5</i> implementation records I/O counts from the
-network interface (Interconnect), I/O counts from the node from the Lustre
+<p>The <i>hdf5</i> and <i>influxdb</i> implementations record I/O counts from
+the network interface (Interconnect), I/O counts from the node from the Lustre
parallel file system, disk I/O counts, cpu and memory utilization
for each task, and a record of energy use.
-<p>This implementation stores this data in a HDF5 file for each step
+<p>The <i>hdf5</i> implementation stores this data in a HDF5 file for each step
on each node for the jobs. A separate program
(<a href="sh5util.html">sh5util</a>) is provided to
consolidate all the node-step files in one container for the job.
......
......@@ -1198,7 +1198,7 @@ Only Slurm operators and administrators can set the priority of a job.
\fB\-\-profile\fR=<all|none|[energy[,|task[,|lustre[,|network]]]]>
enables detailed data collection by the acct_gather_profile plugin.
Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
.RS
.TP 10
......
......@@ -1305,7 +1305,7 @@ Only Slurm operators and administrators can set the priority of a job.
\fB\-\-profile\fR=<all|none|[energy[,|task[,|lustre[,|network]]]]>
enables detailed data collection by the acct_gather_profile plugin.
Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
.RS
.TP 10
......
......@@ -1676,7 +1676,7 @@ This option applies to job allocations only.
\fB\-\-profile\fR=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
enables detailed data collection by the acct_gather_profile plugin.
Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
.RS
.TP 10
......
......@@ -131,6 +131,89 @@ Task (I/O, Memory, ...) data is collected.
.RE
.RE
.TP
\fBProfileInfluxDB\fR
Options used for AcctGatherProfileType/influxdb are as follows:
.RS
.TP 10
\fBProfileInfluxDBDatabase\fR
InfluxDB database name where profiling information is to be written.
.TP
\fBProfileInfluxDBDefault\fR
A comma delimited list of data types to be collected for each job submission.
Allowed values are:
.RS
.TP 8
\fBAll\fR
All data types are collected. (Cannot be combined with other values.)
.TP
\fBNone\fR
No data types are collected. This is the default.
(Cannot be combined with other values.)
.TP
\fBEnergy\fR
Energy data is collected.
.TP
\fBFilesystem\fR
File system (Lustre) data is collected.
.TP
\fBNetwork\fR
Network (InfiniBand) data is collected.
.TP
\fBTask\fR
Task (I/O, Memory, ...) data is collected.
.RE
.TP
\fBProfileInfluxDBHost\fR=<hostname>:<port>
The hostname of the machine where the influxd instance is executed and the port
used by the HTTP API. The port used by the HTTP API is the one configured
through the bind-address influxdb.conf option in the [http] section. Example:
ProfileInfluxDBHost=myinfluxhost:8086
.TP
\fBProfileInfluxDBRTPolicy\fR
The InfluxDB retention policy name for the database configured in
ProfileInfluxDBDatabase option.
.RE
.TP
NOTE:
This plugin requires the libcurl development files to be installed.
.TP
NOTE:
Information on how to install and configure InfluxDB and manage databases,
retention policies and such is available on the official webpage.
.TP
NOTE:
Collected information is written from every compute node where a job runs to
the influxd instance listening on the ProfileInfluxDBHost. In order to avoid
overloading the influxd instance with incoming connection requests, the plugin
uses an internal buffer which is filled with samples. Once the buffer is full, a
HTTP API write request is performed and the buffer is emptied to hold subsequent
samples. A final request is also performed when a task ends even if the buffer
isn't full.
.TP
NOTE:
Failed HTTP API write requests are discarded. This means that collected profile
information in the plugin buffer is lost if it can't be written to the influxd
database for any reason.
.TP
NOTE:
Plugin messages are logged along with the slurmstepd logs to SlurmdLogFile. In
order to troubleshoot any issues, it is recommended to temporarily increase
the slurmd debug level to debug3 and add Profile to the debug flags. This can
be accomplished by setting the slurm.conf SlurmdDebug and DebugFlags
respectively or dynamically through scontrol setdebug and setdebugflags.
.TP
\fBInfinibandOFED\fR
......
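The buffering behaviour described in the NOTE entries of this hunk (samples accumulate in an internal buffer, one HTTP API write request is issued when the buffer fills, a final request is issued when a task ends, and failed writes are discarded) can be sketched as follows. This is an illustrative Python model, not the plugin's C code. The `SampleBuffer` class, its `capacity` counted in samples, and the `send` callback are hypothetical names introduced here for illustration only.

```python
from typing import Callable, List


class SampleBuffer:
    """Toy model of the buffering described in the NOTE entries (hypothetical)."""

    def __init__(self, capacity: int, send: Callable[[str], bool]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity  # assumption: counted in samples, not bytes
        self._send = send         # stand-in for one HTTP API write request
        self._samples: List[str] = []

    @property
    def pending(self) -> int:
        """Number of samples currently waiting in the buffer."""
        return len(self._samples)

    def add(self, sample: str) -> None:
        """Store one sample and trigger a write once the buffer is full."""
        self._samples.append(sample)
        if len(self._samples) >= self.capacity:
            self.flush()

    def flush(self) -> bool:
        """Send everything in one request; also called when a task ends."""
        if not self._samples:
            return True
        payload = "\n".join(self._samples)
        # The buffer is emptied whether or not the write succeeds, so a
        # failed request loses the samples it carried.
        self._samples = []
        try:
            return bool(self._send(payload))
        except Exception:
            return False
```

With a capacity of 3, adding seven samples triggers two write requests and leaves one sample pending until `flush()` is called at task end. A `send` callback that returns `False` models a failed request: the buffer is still emptied, which matches the documented data loss.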
......@@ -263,6 +263,11 @@ No profile data is collected.
This enables the HDF5 plugin. The directory where the profile files
are stored and which values are collected are configured in the
acct_gather.conf file.
.TP
\fBacct_gather_profile/influxdb\fR
This enables the influxdb plugin. The influxdb instance host, port, database,
retention policy and which values are collected are configured in the
acct_gather.conf file.
.RE
.TP
......
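A minimal configuration sketch combining the options documented in this commit might look like the following. The database name `slurm_profile` and the retention policy name `autogen` are placeholder assumptions, the list of collected data types is an arbitrary choice, and the host value is the example given in the man page text. The two debug settings are the temporary troubleshooting values the NOTE recommends and are not required for normal operation.

```
# slurm.conf
AcctGatherProfileType=acct_gather_profile/influxdb
# Temporary troubleshooting settings (optional)
SlurmdDebug=debug3
DebugFlags=Profile

# acct_gather.conf
ProfileInfluxDBHost=myinfluxhost:8086
ProfileInfluxDBDatabase=slurm_profile
ProfileInfluxDBRTPolicy=autogen
ProfileInfluxDBDefault=Energy,Task
```

The database and retention policy named here would need to exist on the influxd instance before jobs run, since failed write requests are discarded.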