diff --git a/doc.zih.tu-dresden.de/docs/software/perf_tools.md b/doc.zih.tu-dresden.de/docs/software/perf_tools.md
index 897c2dbc05d30275015552e86ae5876d2c20844d..12e1a0cc70418fa08b38e433148c082b385714f9 100644
--- a/doc.zih.tu-dresden.de/docs/software/perf_tools.md
+++ b/doc.zih.tu-dresden.de/docs/software/perf_tools.md
@@ -1,10 +1,12 @@
 # Perf Tools

-## Introduction
+The Linux `perf` command provides support for sampling applications and reading performance
+counters. `perf` consists of two parts: the kernel space implementation and the userland tools.
+This compendium page focusses on the latter.

-`perf` consists of two parts: the kernel space implementation and the userland tools. This wiki
-entry focusses on the latter. These tools are installed on ZIH systems, and others and provides
-support for sampling applications and reading performance counters.
+For detailed information, please refer to the [perf
+documentation](https://perf.wiki.kernel.org/index.php/Main_Page) and the comprehensive
+[perf examples page](https://www.brendangregg.com/perf.html) of Brendan Gregg.

 ## Configuration

@@ -32,9 +34,12 @@ performance data can provide hints on the internals of the application.
 ### For Users

 Run `perf stat <Your application>`. This will provide you with a general
-overview on some counters.
+overview on some counters. The following listing holds an exemplary output for sampling the `ls`
+command.

-```Bash
+```console
+marie@compute$ perf stat ls
+[...]
 Performance counter stats for 'ls':=
 2,524235 task-clock # 0,352 CPUs utilized
 15 context-switches # 0,006 M/sec
@@ -52,7 +57,7 @@ Performance counter stats for 'ls':=

 - Generally speaking **task clock** tells you how parallel your job has
   been/how many cpus were used.
-- **[Context switches](http://en.wikipedia.org/wiki/Context_switch)**
+- [Context switches](http://en.wikipedia.org/wiki/Context_switch)
   are an information about how the scheduler treated the application.
   Also interrupts cause context switches. Lower is better.
 - **CPU migrations** are an information on whether the scheduler moved
@@ -91,23 +96,26 @@ measures the performance counters for the whole computing node over one second.
 ## Perf Record

 `perf record` provides the possibility to sample an application or a system. You can find
-performance issues and hot parts of your code. By default perf record samples your program at a 4000
+performance issues and hot parts of your code. By default `perf record` samples your program at 4000
 Hz. It records CPU, Instruction Pointer and, if you specify it, the call chain. If your code runs
-long (or often) enough, you can find hot spots in your application and external libraries. Use
-**perf report** to evaluate the result. You should have debug symbols available, otherwise you won't
-be able to see the name of the functions that are responsible for your load. You can pass one or
-multiple events to define the **sampling event**.
-
-**What is a sampling event?** Sampling reads values at a specific sampling frequency. This
-frequency is usually static and given in Hz, so you have for example 4000 events per second and a
-sampling frequency of 4000 Hz and a sampling rate of 250 microseconds. With the sampling event, the
-concept of a static sampling frequency in time is somewhat redefined. Instead of a constant factor
-in time (sampling rate) you define a constant factor in events. So instead of a sampling rate of 250
-microseconds, you have a sampling rate of 10,000 floating point operations.
-
-**Why would you need sampling events?** Passing an event allows you to find the functions
-that produce cache misses, floating point operations, ... Again, you can use events defined in `perf
-list` and raw events.
+long (or often) enough, you can find hot spots in your application and external libraries.
+Use [perf report](#perf-report) to evaluate the result. You should have debug symbols available,
+otherwise you won't be able to see the name of the functions that are responsible for your load. You
+can pass one or multiple events to define the **sampling event**.
+
+!!! note "What is a sampling event?"
+
+    Sampling reads values at a specific sampling frequency. This frequency is usually static and
+    given in Hz, so you have for example 4000 events per second and a sampling frequency of 4000 Hz
+    and a sampling rate of 250 microseconds. With the sampling event, the concept of a static
+    sampling frequency in time is somewhat redefined. Instead of a constant factor in time (sampling
+    rate) you define a constant factor in events. So instead of a sampling rate of 250 microseconds,
+    you have a sampling rate of 10,000 floating point operations.
+
+!!! note "Why would you need sampling events?"
+
+    Passing an event allows you to find the functions that produce cache misses, floating point
+    operations, ... Again, you can use events defined in `perf list` and raw events.

 Use the `-g` flag to receive a call graph.

@@ -127,7 +135,7 @@ perf record -o perf.data.$SLURM_JOB_ID.$SLURM_PROCID $@
 ```

 To start the MPI program type `srun ./perfwrapper ./myapp` on your command line. The result will be
-n independent perf.data files that can be analyzed individually with perf report.
+n independent `perf.data` files that can be analyzed individually using `perf report`.

 ### For Admins

@@ -139,14 +147,18 @@ record -a -g` to monitor the whole node.

 `perf report` is a command line UI for evaluating the results from perf record. It creates something
 like a profile from the recorded samplings. These profiles show you what the most used have been.
-If you added a callchain, it also gives you a callchain profile.\<br /> \*Disclaimer: Sampling is
-not an appropriate way to gain exact numbers. So this is merely a rough overview and not guaranteed
-to be absolutely correct.\*\<span style="font-size: 1em;"> \</span>
+If you added a callchain, it also gives you a callchain profile.
+
+!!! note "Disclaimer"
+
+    Sampling is not an appropriate way to gain exact numbers. So this is merely a rough overview and
+    not guaranteed to be absolutely correct.

-### On ZIH systems
+### On ZIH Systems

 On ZIH systems, users are not allowed to see the kernel functions. If you have multiple events
-defined, then the first thing you select in `perf report` is the type of event. Press right
+defined, then the first thing you select in `perf report` is the type of event. Press the right
+arrow key:

 ```Bash
 Available samples
@@ -154,12 +166,12 @@ Available samples
 11 cache-misse
 ```

-**Hints:**
+!!! hint

-* The more samples you have, the more exact is the profile. 96 or
-11 samples is not enough by far.
-* Repeat the measurement and set `-F 50000` to increase the sampling frequency.
-* The higher the frequency, the higher the influence on the measurement.
+    * The more samples you have, the more exact is the profile. 96 or
+      11 samples is not enough by far.
+    * Repeat the measurement and set `-F 50000` to increase the sampling frequency.
+    * The higher the frequency, the higher the influence on the measurement.

 If you'd select cycles, you would get such a screen:

@@ -173,7 +185,7 @@ Events: 96 cycles
 + 2,02% test_gcc_perf [kernel.kallsyms] [k] 0xffffffff8102e9ea
 ```

-Increased sample frequency:
+With increased sample frequency, it might look like this:

 ```Bash
 Events: 7K cycles
@@ -199,16 +211,16 @@ Events: 7K cycles
 + 0,00% test_gcc_perf libc-2.12.so [.] __execvpe
 ```

-Now you select the most often sampled function and zoom into it by pressing right. If debug symbols
-are not available, perf report will show which assembly instruction is hit most often when sampling.
-If debug symbols are available, it will also show you the source code lines for these assembly
-instructions. You can also go back and check which instruction caused the cache misses or whatever
-event you were passing to perf record.
+Now you select the most often sampled function and zoom into it by pressing the right arrow key. If
+debug symbols are not available, `perf report` will show which assembly instruction is hit most often
+when sampling. If debug symbols are available, it will also show you the source code lines for
+these assembly instructions. You can also go back and check which instruction caused the cache
+misses or whatever event you were passing to `perf record`.

 ## Perf Script

 If you need a trace of the sampled data, you can use `perf script` command, which by default prints
-all samples to stdout. You can use various interfaces (e.g., python) to process such a trace.
+all samples to stdout. You can use various interfaces (e.g., Python) to process such a trace.

 ## Perf Top