Commit 7ee48312 authored by Frank Winkler

More documentation.

parent 0a7f4c45
PIKA is an infrastructure for continuous monitoring and analysis of HPC systems.
It uses the collectd collection daemon to gather metrics, InfluxDB to store time-series data, and MariaDB to store job metadata.
Furthermore, it provides a powerful [web-frontend](https://gitlab.hrz.tu-chemnitz.de/pika/visualization) for the visualization of job data.
- [Installation](#markdown-header-installation)
- [Configuration](#markdown-header-configuration)
- [Data Collection](#markdown-header-data-collection)
- [Job Control](#markdown-header-job-control)
- [Post-Processing](#markdown-header-post-processing)
- [How Components Are Connected](#markdown-header-how-components-are-connected)
- [Evaluation Test](#markdown-header-evaluation-test)
***
## Installation
The software stack consists of several components and tools.
To simplify the installation, appropriate install scripts are available.
Finally, a symbolic link that points to a *pika-VERSION.conf* file has to be created.
To create a new PIKA software package, copy an existing *pika-VERSION.conf* file to one with a new version number and adjust the variables *PIKA_VERSION*, *COLLECTD_VERSION*, *LIKWID_VERSION* and, if necessary, *LIKWID_VERSION_SHA*.
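For illustration, creating a new package definition could look like the following; the version numbers and the name of the symbolic link are assumptions, not taken from this repository:

```bash
# Copy an existing package definition to a new version (version numbers assumed).
cp pika-1.1.conf pika-1.2.conf

# Edit pika-1.2.conf and adjust at least:
#   PIKA_VERSION, COLLECTD_VERSION, LIKWID_VERSION and, if necessary, LIKWID_VERSION_SHA

# Point the symbolic link to the new package definition (link name assumed).
ln -sfn pika-1.2.conf pika.conf
```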
## Data Collection
Files that are required for the execution of the monitoring daemon (collectd) are located in the daemon folder. This includes the collectd configuration file, the LIKWID event group files, and scripts that are periodically triggered to perform log rotation and error detection. For detailed instructions, see the [README.md](daemon/README.md) in the daemon directory.
## Job Control
Prolog and epilog scripts ensure that the PIKA package is installed and that the monitoring daemon is running. The corresponding files are located in the job_control folder. For detailed instructions on Taurus, see the [README.md](job_control/slurm/taurus/README.md) in the job_control directory.
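A minimal sketch of such a prolog check is shown below; the paths, log file, and start command are assumptions and do not reflect the actual PIKA prolog:

```bash
#!/bin/bash
# Sketch of a prolog check (paths and commands are assumed, not PIKA's actual prolog).
PIKA_ROOT=/opt/pika                              # assumed installation prefix
LOG=/var/log/pika_prolog.log                     # assumed prolog log file

# Ensure the PIKA package is present on the compute node.
if [ ! -d "$PIKA_ROOT" ]; then
    echo "$(date): PIKA package missing on $(hostname)" >> "$LOG"
    # site-specific installation step would go here
fi

# Ensure the monitoring daemon (collectd) is running; otherwise start it.
if ! pgrep -x collectd > /dev/null; then
    "$PIKA_ROOT/bin/start_collectd.sh" >> "$LOG" 2>&1   # assumed start script
fi
```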
## Post-Processing
Post-processing includes the backup and analysis of the recorded job data. For detailed instructions see the [README.md](post-processing/README.md) in the post-processing directory.
## How Components Are Connected
![Flow Graph](./flow_graph.svg)
### What is written/read or sent/received?
PROPERTY_ID ... bit field which defines several properties, e.g. monitoring was ...
4) Chunks/batches of time-series data. For a complete list of metrics see [daemon/collectd](daemon/collectd).
<!--Update job metadata according to the SLURM backup database, see [revise_mariadb.py](post_processing/revise_mariadb.py)
|Job_Data (PIKA)|taurus_job_table (SLURM backup)|
|---|---|
|SUBMIT| time_submit|
|P_PARTITION|partition |
|EXCLUSIVE| nodes_alloc*(core number per partition) |
|ARRAY_ID|id_array_job |-->
5) From "taurus_job_table": account, state (convert id to string), cpus_req, job_name, time_submit, partition, nodes_alloc, id_array_job
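To illustrate item 5, the listed fields could be read from the SLURM backup database with a query along the following lines; this is only a sketch, not how [revise_mariadb.py](post_processing/revise_mariadb.py) is actually implemented, and the host, user, and database names are placeholders:

```bash
# Sketch: fetch the job metadata fields listed above from the SLURM backup database.
# Host, user, and database name are placeholders.
mysql -h slurm-backup-host -u pika_read -p slurm_backup -e '
    SELECT account, state, cpus_req, job_name, time_submit,
           `partition`, nodes_alloc, id_array_job
    FROM taurus_job_table;'
```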
## Evaluation Test
Scripts to determine the scalability and overhead of the monitoring, as well as regression tests, are located in the test folder.
# Error Detection Using Logrotate
To detect general errors in any part of the monitoring process, we write and analyze log files.
There are different log files for the monitoring daemon as well as the job prolog and epilog.
Currently, we simply grep for keywords in these log files, such as `error` and `failure`.
If a keyword is found, the respective log file is saved under the compute node's name to a shared file system and an email is sent to the administrators.
## Set Up Logrotate for PIKA
Adjust the paths in the following files according to your setup:
## Specify Prerotate Script
Logrotate calls [pika_collect_errors.sh](pika_collect_errors.sh) as a prerotate script. You can add `grep` keywords for the collectd and prolog/epilog log files there.
If this script detects errors, an error file is copied to `$ERROR_COLLECTION_PATH`.
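A stripped-down sketch of this check is given below; only the keywords and the `$ERROR_COLLECTION_PATH` variable name are taken from the text above; the log file location and the shared path are assumptions:

```bash
#!/bin/bash
# Sketch of the keyword check performed before log rotation (paths are assumed).
ERROR_COLLECTION_PATH=/shared/pika/errors        # assumed shared file system location
LOGFILE=/var/log/pika/collectd.log               # assumed collectd log file

# Grep for error keywords; on a hit, save the log under the compute node's name.
if grep -Eqi 'error|failure' "$LOGFILE"; then
    cp "$LOGFILE" "$ERROR_COLLECTION_PATH/$(hostname).log"
fi
```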
## Create Cronjob and Register Email Addresses
Add [pika_mail_error_info.sh](pika_mail_error_info.sh) to crontab on the respective service node.
Example:
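The schedule in the following entry is only an example and the install path is a placeholder:

`0 6 * * * /path/to/pika_mail_error_info.sh`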
# Post-Processing
Post-processing takes place every week and comprises the following steps:
1. Back up the latest shard from InfluxDB's short-term database: [backup_influxdb.sh](backup_influxdb.sh)
2. Restore the latest shard to InfluxDB's long-term database: [restore_influxdb.sh](restore_influxdb.sh) (see the sketch after this list)
3. Update/complete job metadata according to the SLURM backup database: [revise_mariadb.py](revise_mariadb.py)
4. Create performance footprints: [footprints.py](footprints.py)
5. Create job tags based on the performance footprints: [tags.py](tags.py)
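The first two steps build on InfluxDB's backup/restore mechanism. A rough sketch of the underlying commands is shown below; the database names, shard ID, and backup path are placeholders, and the actual scripts may work differently:

```bash
# Sketch only (database names, shard ID, and paths are placeholders); the real
# scripts are driven by pika_post_processing.sh, see the note below.

# Step 1: back up the latest shard of the short-term database.
influxd backup -portable -database pika_short_term -shard 123 /backup/pika/shard_123

# Step 2: a portable restore cannot write into an existing database, so the
# backup is restored under a temporary name and merged into the long-term
# database afterwards.
influxd restore -portable -db pika_short_term -newdb pika_restore_tmp /backup/pika/shard_123
```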
**Note**: **None of these scripts should be called directly!**
The script [pika_post_processing.sh](pika_post_processing.sh) performs all of the above-mentioned steps.
You can either perform a single step or all steps by passing the respective argument:

`./pika_post_processing [all|backup|restore|revise|footprint|tag]`
**But be aware that each step requires the successful completion of the previous one!**
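For example, the individual steps could be run in order, stopping at the first failure (a sketch; the script name and path are assumed to match the repository layout):

```bash
# Run the post-processing steps one by one and stop at the first failure.
for step in backup restore revise footprint tag; do
    ./pika_post_processing.sh "$step" || { echo "step '$step' failed" >&2; exit 1; }
done
```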
You can add `pika_post_processing.sh` to crontab on the respective service node.
Example:
`0 1 */2 * * /sw/taurus/tools/pika/post_processing/pika_post_processing.sh all`
**Note**: This script sends an email to all registered members if an error occurs during post-processing.
Register email addresses in [pika_emails](../../pika_emails) to get error notifications.