From 005b0e7732ba91aafb99f2cdcf8f7a3d6f127890 Mon Sep 17 00:00:00 2001
From: Alejandro Sanchez <alex@schedmd.com>
Date: Wed, 13 Sep 2017 19:34:56 +0200
Subject: [PATCH] acct_gather_profile/influxdb - add documentation.

Bug 2693.
---
 doc/html/acct_gather_profile_plugins.shtml |  9 +--
 doc/man/man1/salloc.1                      |  2 +-
 doc/man/man1/sbatch.1                      |  2 +-
 doc/man/man1/srun.1                        |  2 +-
 doc/man/man5/acct_gather.conf.5            | 83 ++++++++++++++++++++++
 doc/man/man5/slurm.conf.5                  |  5 ++
 6 files changed, 96 insertions(+), 7 deletions(-)

diff --git a/doc/html/acct_gather_profile_plugins.shtml b/doc/html/acct_gather_profile_plugins.shtml
index 57e63573f6c..ae52d594447 100644
--- a/doc/html/acct_gather_profile_plugins.shtml
+++ b/doc/html/acct_gather_profile_plugins.shtml
@@ -16,7 +16,8 @@ be profiled.)
 A seperate <a href="hdf5_profile_user_guide.html">User Guide</a> documents how to use
-the hdf5 version of the plugin.
+the hdf5 version of the plugin. An influxdb plugin is also available since
+Slurm 17.11.
 <p>The plugin provides an API for making calls to store data at various points in a step's lifecycle. It collects data periodically from potentially
@@ -29,12 +30,12 @@ avoids having to transfer files back to the controller at step end.
 Data is typically gathered at job_acct_gather interval or acct_gather_energy interval and the volume is not expected to be burdensome.
-<p>The <i>hdf5</i> implementation records I/O counts from the
-network interface (Interconnect), I/O counts from the node from the Lustre
+<p>The <i>hdf5</i> and <i>influxdb</i> implementations record I/O counts from
+the network interface (Interconnect), I/O counts from the node from the Lustre
 parallel file system, disk I/O counts, cpu and memory utilization for each task, and a record of energy use.
-<p>This implementation stores this data in a HDF5 file for each step
+<p>The <i>hdf5</i> implementation stores this data in a HDF5 file for each step
 on each node for the jobs.
 A separate program (<a href="sh5util.html">sh5util</a>) is provided to consolidate all the node-step files in one container for the job.
diff --git a/doc/man/man1/salloc.1 b/doc/man/man1/salloc.1
index 1ef2be9cf24..80d6be6c4bc 100644
--- a/doc/man/man1/salloc.1
+++ b/doc/man/man1/salloc.1
@@ -1198,7 +1198,7 @@ Only Slurm operators and administrators can set the priority of a job.
 \fB\-\-profile\fR=<all|none|[energy[,|task[,|lustre[,|network]]]]>
 enables detailed data collection by the acct_gather_profile plugin.
 Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
 .RS
 .TP 10
diff --git a/doc/man/man1/sbatch.1 b/doc/man/man1/sbatch.1
index 8a167dab0bd..041538b3625 100644
--- a/doc/man/man1/sbatch.1
+++ b/doc/man/man1/sbatch.1
@@ -1305,7 +1305,7 @@ Only Slurm operators and administrators can set the priority of a job.
 \fB\-\-profile\fR=<all|none|[energy[,|task[,|lustre[,|network]]]]>
 enables detailed data collection by the acct_gather_profile plugin.
 Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
 .RS
 .TP 10
diff --git a/doc/man/man1/srun.1 b/doc/man/man1/srun.1
index d99082d577f..7264445225d 100644
--- a/doc/man/man1/srun.1
+++ b/doc/man/man1/srun.1
@@ -1676,7 +1676,7 @@ This option applies to job allocations only.
 \fB\-\-profile\fR=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
 enables detailed data collection by the acct_gather_profile plugin.
 Detailed data are typically time-series that are stored in an HDF5 file for
-the job.
+the job or an InfluxDB database depending on the configured plugin.
 .RS
 .TP 10
diff --git a/doc/man/man5/acct_gather.conf.5 b/doc/man/man5/acct_gather.conf.5
index 12121f92205..0b6bc60705b 100644
--- a/doc/man/man5/acct_gather.conf.5
+++ b/doc/man/man5/acct_gather.conf.5
@@ -131,6 +131,89 @@ Task (I/O, Memory, ...) data is collected.
 .RE
 .RE
+.TP
+\fBProfileInfluxDB\fR
+Options used for AcctGatherProfileType/influxdb are as follows:
+
+.RS
+.TP 10
+\fBProfileInfluxDBDatabase\fR
+InfluxDB database name where profiling information is to be written.
+
+.TP
+\fBProfileInfluxDBDefault\fR
+A comma delimited list of data types to be collected for each job submission.
+Allowed values are:
+
+.RS
+.TP 8
+\fBAll\fR
+All data types are collected. (Cannot be combined with other values.)
+
+.TP
+\fBNone\fR
+No data types are collected. This is the default.
+(Cannot be combined with other values.)
+
+.TP
+\fBEnergy\fR
+Energy data is collected.
+
+.TP
+\fBFilesystem\fR
+File system (Lustre) data is collected.
+
+.TP
+\fBNetwork\fR
+Network (InfiniBand) data is collected.
+
+.TP
+\fBTask\fR
+Task (I/O, Memory, ...) data is collected.
+.RE
+
+.TP
+\fBProfileInfluxDBHost\fR=<hostname>:<port>
+The hostname of the machine where the influxd instance is executed and the port
+used by the HTTP API. The port used by the HTTP API is the one configured
+through the bind-address influxdb.conf option in the [http] section. Example:
+
+ProfileInfluxDBHost=myinfluxhost:8086
+
+.TP
+\fBProfileInfluxDBRTPolicy\fR
+The InfluxDB retention policy name for the database configured in
+ProfileInfluxDBDatabase option.
+.RE
+
+.TP
+NOTE:
+This plugin requires the libcurl development files to be installed.
+.TP
+NOTE:
+Information on how to install and configure InfluxDB and manage databases,
+retention policies and such is available on the official webpage.
+.TP
+NOTE:
+Collected information is written from every compute node where a job runs to
+the influxd instance listening on the ProfileInfluxDBHost. In order to avoid
+overloading the influxd instance with incoming connection requests, the plugin
+uses an internal buffer which is filled with samples. Once the buffer is full, a
+HTTP API write request is performed and the buffer is emptied to hold subsequent
+samples. A final request is also performed when a task ends even if the buffer
+isn't full.
+.TP
+NOTE:
+Failed HTTP API write requests are discarded. This means that collected profile
+information in the plugin buffer is lost if it can't be written to the influxd
+database for any reason.
+.TP
+NOTE:
+Plugin messages are logged along with the slurmstepd logs to SlurmdLogFile. In
+order to troubleshoot any issues, it is recommended to temporarily increase
+the slurmd debug level to debug3 and add Profile to the debug flags. This can
+be accomplished by setting the slurm.conf SlurmdDebug and DebugFlags
+respectively or dynamically through scontrol setdebug and setdebugflags.
 .TP
 \fBInfinibandOFED\fR
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index 56351b142fb..0630deeed1f 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -263,6 +263,11 @@ No profile data is collected.
 This enables the HDF5 plugin. The directory where the profile files are stored and which values are collected are configured in the acct_gather.conf file.
+.TP
+\fBacct_gather_profile/influxdb\fR
+This enables the influxdb plugin. The influxdb instance host, port, database,
+retention policy and which values are collected are configured in the
+acct_gather.conf file.
 .RE
 .TP
-- 
GitLab