From 005b0e7732ba91aafb99f2cdcf8f7a3d6f127890 Mon Sep 17 00:00:00 2001
From: Alejandro Sanchez <alex@schedmd.com>
Date: Wed, 13 Sep 2017 19:34:56 +0200
Subject: [PATCH] acct_gather_profile/influxdb - add documentation.

Bug 2693.
---
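
Note for reviewers (this text sits after the "---" separator, so it is not part
of the commit message): the new ProfileInfluxDBHost, ProfileInfluxDBDatabase and
ProfileInfluxDBRTPolicy options together identify where samples are written.
Below is a minimal Python sketch of how they could combine into an InfluxDB 1.x
HTTP API write URL. The function name, the "http" scheme and the example
database name are assumptions made for illustration. The "/write" endpoint with
"db" and "rp" query parameters comes from the InfluxDB 1.x HTTP API; that the
plugin builds its request exactly this way is not confirmed by this patch.

```python
from urllib.parse import urlencode


def influx_write_url(host_port: str, database: str, rt_policy: str) -> str:
    """Build a write URL from acct_gather.conf-style values (hypothetical helper)."""
    # ProfileInfluxDBHost is documented as <hostname>:<port>.
    host, sep, port = host_port.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected <hostname>:<port>, got {host_port!r}")
    # db and rp select the database and the retention policy to write to.
    query = urlencode({"db": database, "rp": rt_policy})
    return f"http://{host}:{port}/write?{query}"


if __name__ == "__main__":
    # "slurm_profile" is a made-up database name; "autogen" is the default
    # retention policy name in InfluxDB 1.x.
    print(influx_write_url("myinfluxhost:8086", "slurm_profile", "autogen"))
```

A value without a port, such as "myinfluxhost", raises ValueError in this
sketch instead of falling back to a default port.
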
 doc/html/acct_gather_profile_plugins.shtml |  9 +--
 doc/man/man1/salloc.1                      |  2 +-
 doc/man/man1/sbatch.1                      |  2 +-
 doc/man/man1/srun.1                        |  2 +-
 doc/man/man5/acct_gather.conf.5            | 83 ++++++++++++++++++++++
 doc/man/man5/slurm.conf.5                  |  5 ++
 6 files changed, 96 insertions(+), 7 deletions(-)
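
The NOTE entries this patch adds to acct_gather.conf.5 describe a buffer that is
written out when it fills, flushed once more when a task ends, and emptied even
when the write fails. Below is a minimal Python model of that behaviour, not the
plugin's C code. The class name, the sample-count capacity and the send()
callback are assumptions made for illustration.

```python
from typing import Callable, List


class SampleBuffer:
    """Hypothetical model of the buffering described in the man page NOTEs."""

    def __init__(self, capacity: int, send: Callable[[str], bool]) -> None:
        self.capacity = capacity          # assumed limit, counted in samples
        self.send = send                  # stands in for the HTTP API write request
        self.samples: List[str] = []
        self.lost = 0                     # samples dropped after failed writes

    def add(self, sample: str) -> None:
        """Store one sample and write the buffer out once it is full."""
        self.samples.append(sample)
        if len(self.samples) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        """Issue one write request, then empty the buffer whatever the result."""
        if not self.samples:
            return
        ok = self.send("\n".join(self.samples))
        if not ok:
            # Failed writes are discarded, not retried.
            self.lost += len(self.samples)
        self.samples.clear()

    def task_end(self) -> None:
        """Final write when a task ends, even if the buffer is not full."""
        self.flush()


if __name__ == "__main__":
    sent: List[str] = []
    buf = SampleBuffer(capacity=3, send=lambda body: sent.append(body) or True)
    for i in range(4):
        buf.add(f"sample{i}")
    buf.task_end()
    print(len(sent), "write requests,", buf.lost, "samples lost")
```

Because flush() clears the buffer regardless of the result, an unreachable
influxd instance costs the samples in flight but never grows memory use on the
compute node.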

diff --git a/doc/html/acct_gather_profile_plugins.shtml b/doc/html/acct_gather_profile_plugins.shtml
index 57e63573f6c..ae52d594447 100644
--- a/doc/html/acct_gather_profile_plugins.shtml
+++ b/doc/html/acct_gather_profile_plugins.shtml
@@ -16,7 +16,8 @@ be profiled.)
 
 A seperate
 <a href="hdf5_profile_user_guide.html">User Guide</a> documents how to use
-the hdf5 version of the plugin.
+the hdf5 version of the plugin. An influxdb plugin has also been available
+since Slurm 17.11.
 
 <p>The plugin provides an API for making calls to store data at various
 points in a step's lifecycle. It collects data periodically from potentially
@@ -29,12 +30,12 @@ avoids having to transfer files back to the controller at step end. Data is
 typically gathered at job_acct_gather interval or acct_gather_energy interval
 and the volume is not expected to be burdensome.
 
-<p>The <i>hdf5</i> implementation records I/O counts from the
-network interface (Interconnect), I/O counts from the node from the Lustre
+<p>The <i>hdf5</i> and <i>influxdb</i> implementations record I/O counts from
+the network interface (Interconnect), I/O counts from the node from the Lustre
 parallel file system, disk I/O counts, cpu and memory utilization
 for each task, and a record of energy use.
 
-<p>This implementation stores this data in a HDF5 file for each step
+<p>The <i>hdf5</i> implementation stores this data in an HDF5 file for each step
 on each node for the jobs. A separate program
 (<a href="sh5util.html">sh5util</a>) is provided to
 consolidate all the node-step files in one container for the job.
diff --git a/doc/man/man1/salloc.1 b/doc/man/man1/salloc.1
index 1ef2be9cf24..80d6be6c4bc 100644
--- a/doc/man/man1/salloc.1
+++ b/doc/man/man1/salloc.1
@@ -1198,7 +1198,7 @@ Only Slurm operators and administrators can set the priority of a job.
 \fB\-\-profile\fR=<all|none|[energy[,|task[,|lustre[,|network]]]]>
 enables detailed data collection by the acct_gather_profile plugin.
 Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
 
 .RS
 .TP 10
diff --git a/doc/man/man1/sbatch.1 b/doc/man/man1/sbatch.1
index 8a167dab0bd..041538b3625 100644
--- a/doc/man/man1/sbatch.1
+++ b/doc/man/man1/sbatch.1
@@ -1305,7 +1305,7 @@ Only Slurm operators and administrators can set the priority of a job.
 \fB\-\-profile\fR=<all|none|[energy[,|task[,|lustre[,|network]]]]>
 enables detailed data collection by the acct_gather_profile plugin.
 Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
 
 .RS
 .TP 10
diff --git a/doc/man/man1/srun.1 b/doc/man/man1/srun.1
index d99082d577f..7264445225d 100644
--- a/doc/man/man1/srun.1
+++ b/doc/man/man1/srun.1
@@ -1676,7 +1676,7 @@ This option applies to job allocations only.
 \fB\-\-profile\fR=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
 enables detailed data collection by the acct_gather_profile plugin.
 Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
 
 .RS
 .TP 10
diff --git a/doc/man/man5/acct_gather.conf.5 b/doc/man/man5/acct_gather.conf.5
index 12121f92205..0b6bc60705b 100644
--- a/doc/man/man5/acct_gather.conf.5
+++ b/doc/man/man5/acct_gather.conf.5
@@ -131,6 +131,89 @@ Task (I/O, Memory, ...) data is collected.
 
 .RE
 .RE
+.TP
+\fBProfileInfluxDB\fR
+Options used for acct_gather_profile/influxdb are as follows:
+
+.RS
+.TP 10
+\fBProfileInfluxDBDatabase\fR
+InfluxDB database name where profiling information is to be written.
+
+.TP
+\fBProfileInfluxDBDefault\fR
+A comma-delimited list of data types to be collected for each job submission.
+Allowed values are:
+
+.RS
+.TP 8
+\fBAll\fR
+All data types are collected. (Cannot be combined with other values.)
+
+.TP
+\fBNone\fR
+No data types are collected. This is the default.
+(Cannot be combined with other values.)
+
+.TP
+\fBEnergy\fR
+Energy data is collected.
+
+.TP
+\fBFilesystem\fR
+File system (Lustre) data is collected.
+
+.TP
+\fBNetwork\fR
+Network (InfiniBand) data is collected.
+
+.TP
+\fBTask\fR
+Task (I/O, Memory, ...) data is collected.
+.RE
+
+.TP
+\fBProfileInfluxDBHost\fR=<hostname>:<port>
+The hostname of the machine where the influxd instance is executed and the port
+used by the HTTP API. The port used by the HTTP API is the one configured
+through the bind-address influxdb.conf option in the [http] section. Example:
+
+ProfileInfluxDBHost=myinfluxhost:8086
+
+.TP
+\fBProfileInfluxDBRTPolicy\fR
+The InfluxDB retention policy name for the database configured in the
+ProfileInfluxDBDatabase option.
+.RE
+
+.TP
+NOTE:
+This plugin requires the libcurl development files to be installed.
+.TP
+NOTE:
+Information on how to install and configure InfluxDB, and on how to manage
+databases and retention policies, is available on the official InfluxDB webpage.
+.TP
+NOTE:
+Collected information is written from every compute node where a job runs to
+the influxd instance listening on the ProfileInfluxDBHost. In order to avoid
+overloading the influxd instance with incoming connection requests, the plugin
+uses an internal buffer which is filled with samples. Once the buffer is full, an
+HTTP API write request is performed and the buffer is emptied to hold subsequent
+samples. A final request is also performed when a task ends even if the buffer
+isn't full.
+.TP
+NOTE:
+Failed HTTP API write requests are discarded. This means that collected profile
+information in the plugin buffer is lost if it can't be written to the influxd
+database for any reason.
+.TP
+NOTE:
+Plugin messages are logged along with the slurmstepd logs to SlurmdLogFile. In
+order to troubleshoot any issues, it is recommended to temporarily increase
+the slurmd debug level to debug3 and add Profile to the debug flags. This can
+be accomplished by setting the SlurmdDebug and DebugFlags options in slurm.conf,
+respectively, or dynamically through scontrol setdebug and setdebugflags.
 
 .TP
 \fBInfinibandOFED\fR
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index 56351b142fb..0630deeed1f 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -263,6 +263,11 @@ No profile data is collected.
 This enables the HDF5 plugin. The directory where the profile files
 are stored and which values are collected are configured in the
 acct_gather.conf file.
+.TP
+\fBacct_gather_profile/influxdb\fR
+This enables the influxdb plugin. The influxdb instance host, port, database,
+retention policy and which values are collected are configured in the
+acct_gather.conf file.
 .RE
 
 .TP
-- 
GitLab