# PIKA - Center-Wide and Job-Aware Cluster Monitoring

PIKA is an infrastructure for continuous monitoring and analysis of HPC systems. 
It uses the collection daemon collectd, InfluxDB to store time-series data and MariaDB to store job metadata. 
Furthermore, it provides a powerful web-frontend for the visualization of job data. 

Files required for executing the monitoring daemon (collectd) are located in the daemon folder. This includes the collectd configuration file and the LIKWID event group files, as well as scripts that are triggered periodically to perform log rotation and error detection. 
Prolog and epilog scripts ensure that the PIKA package is installed and the daemon is running. The corresponding files are located in the job_control folder. 
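A prolog-style check of this kind could be sketched as follows. The paths, lock file name, and messages are illustrative assumptions, not PIKA's actual scripts; the point is the pattern of verifying the installation and serializing the collectd start with a lock:

```shell
#!/bin/bash
# Hypothetical prolog sketch (paths are placeholders, not PIKA's real layout).
PIKA_INSTALL_PATH=${PIKA_INSTALL_PATH:-/tmp/pika_demo/install}
LOCAL_STORE=${LOCAL_STORE:-/tmp/pika_demo/store}
mkdir -p "$PIKA_INSTALL_PATH" "$LOCAL_STORE"

ensure_collectd_running() {
  # flock serializes the check-and-start across concurrent prologs on one node,
  # so two jobs starting at the same time cannot both launch the daemon.
  (
    flock -x 9
    if ! pgrep -x collectd >/dev/null 2>&1; then
      echo "collectd not running; would start it from $PIKA_INSTALL_PATH"
    fi
  ) 9>"$LOCAL_STORE/collectd_start.lock"
}

ensure_collectd_running
```

The lock file under *LOCAL_STORE* mirrors the locking role this path plays in the real setup (see the configuration section below).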
Scripts for post processing, such as the generation of footprints, are located in the post_processing folder. 
Scripts to determine the scalability and the overhead of the monitoring as well as regression tests are located in the test folder. 

## Installation
The software stack consists of several components and tools. 
To simplify the installation, appropriate install scripts are available. 
For detailed install instructions, see the [install](install/) directory.

## Configuration
The following files are used to configure the software stack: 

* *pika.conf* 
contains the global, version-independent configuration variables. It also sets some environment variables that are used in the job prolog and epilog. It uses `source` to read the environment variables from *.pika_access*.
* *.pika_access* 
exports the environment variables with the access parameters for the databases. This file should therefore have restricted read access. You can use [pika_access_template](pika_access_template) to create this file. 
* *pika-VERSION.conf* 
is used for versioning the PIKA package. It sets the PIKA package version along with the versions of collectd, LIKWID and Python to use. Finally, it uses `source` to read the environment variables from *pika.conf*. 
* *pika_utils.conf* 
provides utility functions for prolog, epilog and other bash scripts. 
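To illustrate, *.pika_access* is a `source`-able shell fragment of exports. The variable names below are placeholders (assumptions); use [pika_access_template](pika_access_template) for the actual set:

```shell
# .pika_access -- database credentials (variable names are illustrative).
# Keep read access restricted, e.g.: chmod 600 .pika_access
export PIKA_INFLUXDB_USER="pika"
export PIKA_INFLUXDB_PASSWORD="changeme"
export PIKA_MARIADB_USER="pika"
export PIKA_MARIADB_PASSWORD="changeme"
```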

Edit *pika.conf* and change the variables *LOCAL_STORE*, *PIKA_LOGPATH* and *PIKA_INSTALL_PATH* according to your needs or system setup. 
*LOCAL_STORE* specifies the path where temporary files are placed during prolog and read by the epilog script. It is also used for locking of the install and collectd start procedure. 
*PIKA_LOGPATH* specifies the path where the collectd log file *pika_collectd.log* will be written to.
*PIKA_INSTALL_PATH* specifies the path where the PIKA software (binaries, libraries, etc.) is installed to.
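A minimal excerpt of these three settings might look as follows; the paths are placeholders for your system, not defaults shipped with PIKA:

```shell
# pika.conf excerpt (paths are examples only)
LOCAL_STORE=/tmp/pika           # temp files written by prolog, read by epilog; also used for locking
PIKA_LOGPATH=/var/log/pika      # directory for pika_collectd.log
PIKA_INSTALL_PATH=/opt/pika     # binaries, libraries, etc.
```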

Edit *pika-VERSION.conf* and set the variable *PIKA_ROOT* to the path where the PIKA sources (and also the *.conf files) are located. 
This file also specifies the collectd batch size (number of metric values that are collected until being sent to the database) with the variable *PIKA_COLLECTD_BATCH_SIZE*. 
Furthermore, it does some exception handling for different types of nodes.
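Sketching the variables named here, a *pika-VERSION.conf* excerpt could look like this (values are illustrative assumptions):

```shell
# pika-1.2.conf excerpt (values are examples only)
PIKA_ROOT=/sw/pika/src          # where the PIKA sources and *.conf files live
PIKA_COLLECTD_BATCH_SIZE=200    # metric values collected before being sent to the database
source "${PIKA_ROOT}/pika.conf" # pull in the version-independent settings
```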

Finally, a symbolic link named *pika-current.conf* has to be created that points to a *pika-VERSION.conf* file. For example:

    ln -s pika-1.2.conf pika-current.conf

To create a new PIKA software package, copy an existing *pika-VERSION.conf* file to one with a new version number and change the variables *PIKA_VERSION*, *COLLECTD_VERSION*, *LIKWID_VERSION* and, if necessary, *LIKWID_VERSION_SHA*.
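The whole version-bump workflow can be sketched end to end. The file names, version numbers and contents below are examples run in a throwaway directory, not the real configuration files:

```shell
# Sketch: roll a new PIKA package version (all names/values are examples).
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for an existing versioned config.
printf 'PIKA_VERSION=1.2\nCOLLECTD_VERSION=5.12.0\nLIKWID_VERSION=5.2.2\n' > pika-1.2.conf

# Copy to a new version number and bump the version variable.
cp pika-1.2.conf pika-1.3.conf
sed -i 's/^PIKA_VERSION=.*/PIKA_VERSION=1.3/' pika-1.3.conf

# Repoint the pika-current.conf symlink at the new file.
ln -sf pika-1.3.conf pika-current.conf
readlink pika-current.conf   # -> pika-1.3.conf
```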

## How components are connected
![Flow Graph](./flow_graph.svg)

### What is written/read or sent/received?
(The environment variables SLURM_JOB_ID, SLURM_NODELIST, SLURM_JOB_USER and SLURM_JOB_PARTITION are available within prolog.)

NUM_NODES ... via `$(nodeset -c ${SLURM_NODELIST})`<br>
ARRAY_ID ... 0 for non-array jobs, otherwise ... (currently not available)

3) Update STATUS='completed|timeout', JOB_END=`date +%s`, PROPERTY_ID for a SLURM_JOB_ID and START<br>
PROPERTY_ID ... bit field which defines several properties, e.g. monitoring was disabled, incomplete Slurm data<br>
(Delete jobs shorter than one minute.)

4) Chunks/batches of time-series data. For a complete list of metrics see [daemon/collectd](daemon/collectd).

5) Update job metadata according to the SLURM backup database; see the [post_processing](post_processing/) directory.

|Job_Data (PIKA)|taurus_job_table (SLURM backup)|
|---|---|
|PROJECT|account|
|STATUS|state (convert id to string)|
|NUM_CORES|cpus_req|
|NAME|job_name|
|SUBMIT|time_submit|
|P_PARTITION|partition|
|EXCLUSIVE|nodes_alloc*(core number per partition)|
|ARRAY_ID|id_array_job|