diff --git a/doc.zih.tu-dresden.de/docs/software/PapiLibrary.md b/doc.zih.tu-dresden.de/docs/software/PapiLibrary.md
new file mode 100644
index 0000000000000000000000000000000000000000..414e08a4bf0226493e10bfeabbad620df14f59c1
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/software/PapiLibrary.md
@@ -0,0 +1,40 @@
+# PAPI Library
+
+Related work:
+
+* [PAPI documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page)
+* [Intel 64 and IA-32 Architectures Software Developers Manual (Per thread/per core PMCs)](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf)
+
+Additional sources for **Haswell** processors: [Intel Xeon Processor E5-2600 v3 Product Family Uncore
+Performance Monitoring Guide (Uncore PMCs) - Download link](http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-uncore-performance-monitoring.html)
+
+## Introduction
+
+PAPI enables users and developers to monitor how their code performs on a specific architecture. To
+do so, they can register events that are counted by the hardware in performance monitoring counters
+(PMCs). These counters relate to a specific hardware unit, for example a processor core. The Intel
+processors used on Taurus support eight PMCs per processor core. As the partitions on Taurus are run
+with HyperThreading Technology (HTT) enabled, each CPU can use four of these. In addition to the
+**four core PMCs**, Intel processors also support **a number of uncore PMCs** for non-core
+resources (see the uncore manuals listed at the top of this page).
+
+## Usage
+
+[Score-P](ScoreP.md) supports per-core PMCs. To include uncore PMCs into Score-P traces, use the
+software module **scorep-uncore/2016-03-29** on the Haswell partition. If you do so, disable
+profiling to include the uncore measurements. This metric plugin is available on
+[GitHub](https://github.com/score-p/scorep_plugin_uncore/).
+
+If you want to use PAPI directly in your software, load the latest `papi` module, which establishes
+the environment variables **PAPI_INC**, **PAPI_LIB**, and **PAPI_ROOT**. Have a look at the
+[PAPI documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page) for details on the usage.
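+
+As a quick start, the following minimal sketch shows how a program that calls PAPI directly could
+be built and run. The source file name is a placeholder, and it assumes that **PAPI_INC** and
+**PAPI_LIB** point to the PAPI include and library directories:
+
+```Bash
+# provides PAPI_INC, PAPI_LIB and PAPI_ROOT
+module load papi
+# compile your PAPI-instrumented code against the PAPI headers and library
+gcc my_papi_test.c -I$PAPI_INC -L$PAPI_LIB -lpapi -o my_papi_test
+./my_papi_test
+```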
+
+## Related Software
+
+* [Score-P](ScoreP.md)
+* [Linux Perf Tools](PerfTools.md)
+
+If you just need a short summary of your job, you might want to have a look at
+[perf stat](PerfTools.md).
diff --git a/doc.zih.tu-dresden.de/docs/software/PerfTools.md b/doc.zih.tu-dresden.de/docs/software/PerfTools.md
new file mode 100644
index 0000000000000000000000000000000000000000..176c772bf4bcbe3d9cf3a1eda725d0cc7f14daac
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/software/PerfTools.md
@@ -0,0 +1,226 @@
+# Introduction
+
+`perf` consists of two parts: the kernel space implementation and the userland tools. This page
+focuses on the latter. These tools are installed on Taurus and other systems, and they provide
+support for sampling applications and reading performance counters.
+
+## Installation
+
+On Taurus, load the module via
+
+```Bash
+module load perf/r31
+```
+
+## Configuration
+
+Admins can change the behaviour of the kernel part of the perf tools via the following interfaces:
+
+| File Name | Description |
+|-----------|-------------|
+| `/proc/sys/kernel/perf_event_max_sample_rate` | Describes the maximal sample rate for `perf record` and native access. This is used to limit the performance influence of sampling. |
+| `/proc/sys/kernel/perf_event_mlock_kb` | Defines the number of pages that can be used for sampling via `perf record` or the native interface. |
+| `/proc/sys/kernel/perf_event_paranoid` | Defines access rights: |
+| | -1 - not paranoid at all |
+| | 0 - disallow raw tracepoint access for unprivileged users |
+| | 1 - disallow CPU events for unprivileged users |
+| | 2 - disallow kernel profiling for unprivileged users |
+| `/proc/sys/kernel/kptr_restrict` | Defines whether the kernel address maps are restricted. |
+
+## Perf Stat
+
+`perf stat` provides general performance statistics for a program. You can attach to a running
+(own) process, monitor a new process, or monitor the whole system. The latter is only available to
+the root user, as the performance data can provide hints on the internals of the application.
+
+### For Users
+
+Run `perf stat <Your application>`. This will provide you with a general overview of some counters.
+
+```Bash
+Performance counter stats for 'ls':
+       2,524235 task-clock               #    0,352 CPUs utilized
+             15 context-switches         #    0,006 M/sec
+              0 CPU-migrations           #    0,000 M/sec
+            292 page-faults              #    0,116 M/sec
+      6.431.241 cycles                   #    2,548 GHz
+      3.537.620 stalled-cycles-frontend  #   55,01% frontend cycles idle
+      2.634.293 stalled-cycles-backend   #   40,96% backend cycles idle
+      6.157.440 instructions             #    0,96  insns per cycle
+                                         #    0,57  stalled cycles per insn
+      1.248.527 branches                 #  494,616 M/sec
+         34.044 branch-misses            #    2,73% of all branches
+    0,007167707 seconds time elapsed
+```
+
+- Generally speaking, the **task clock** tells you how parallel your job has been/how many CPUs
+  were used.
+- **[Context switches](http://en.wikipedia.org/wiki/Context_switch)** tell you how the scheduler
+  treated the application. Interrupts also cause context switches. Lower is better.
+- **CPU migrations** indicate whether the scheduler moved the application between cores. Lower is
+  better. Please pin your programs to CPUs to avoid migrations. This can be done with environment
+  variables for OpenMP and MPI, with `likwid-pin`, `numactl` and `taskset`.
+- **[Page faults](http://en.wikipedia.org/wiki/Page_fault)** describe how well the Translation
+  Lookaside Buffers fit for the program. Lower is better.
+- **Cycles** tells you how many CPU cycles have been spent in executing the program. The normalized
+  value tells you the actual average frequency of the CPU(s) running the application.
+- **stalled-cycles-...** tell you how well the processor can execute your code. Every stall cycle
+  is a waste of CPU time and energy. The reasons for such stalls can be numerous: wrong branch
+  predictions, cache misses, occupation of CPU resources by long-running instructions, and so on.
+  If these stall cycles are too high, you might want to review your code.
+- The normalized **instructions** number tells you how well your code is running. More is better.
+  Current x86 CPUs can run 3 to 5 instructions per cycle, depending on the instruction mix. A count
+  of less than 1 is not favorable. In such a case you might want to review your code.
+- **branches** and **branch-misses** tell you how many jumps and loops are performed in your code.
+  Correctly [predicted](http://en.wikipedia.org/wiki/Branch_prediction) branches should not hurt
+  your performance; **branch-misses**, on the other hand, hurt your performance very badly and lead
+  to stall cycles.
+- Other events can be passed with the `-e` flag, as shown in the example below. For a full list of
+  predefined events run `perf list`.
+- PAPI runs on top of the same infrastructure as `perf stat`, so you might want to use its
+  meaningful event names. Otherwise you can use raw events, listed in the processor manuals.
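+
+For example, to list the events available on a node and to count a few specific ones for one
+application run, you could use the commands below. Event names and availability depend on the CPU
+generation and kernel version, so treat this only as a sketch:
+
+```Bash
+# show the predefined events perf knows about on this node
+perf list
+# count selected events for a single application run
+perf stat -e cycles,instructions,cache-references,cache-misses ./myapp
+```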
+
+### For Admins
+
+Administrators can run a system-wide performance statistic, e.g., with `perf stat -a sleep 1`,
+which measures the performance counters for the whole computing node over one second.
+
+## Perf Record
+
+`perf record` provides the possibility to sample an application or a system. You can find
+performance issues and hot parts of your code. By default, `perf record` samples your program at
+4000 Hz. It records the CPU, the instruction pointer and, if you specify it, the call chain. If
+your code runs long (or often) enough, you can find hot spots in your application and in external
+libraries. Use `perf report` to evaluate the result. You should have debug symbols available,
+otherwise you won't be able to see the names of the functions that are responsible for your load.
+You can pass one or multiple events to define the **sampling event**.
+
+**What is a sampling event?** Sampling reads values at a specific sampling frequency. This
+frequency is usually static and given in Hz: a sampling frequency of 4000 Hz, for example, means
+4000 samples per second, i.e., one sample every 250 microseconds. With a sampling event, the
+concept of a static sampling frequency in time is redefined: instead of a constant factor in time
+(the sampling rate) you define a constant factor in events. So instead of taking a sample every
+250 microseconds, you take, for example, a sample every 10,000 floating point operations.
+
+**Why would you need sampling events?** Passing an event allows you to find the functions that
+produce cache misses, floating point operations, ... Again, you can use events defined in
+`perf list` and raw events.
+
+Use the `-g` flag to receive a call graph (see the examples at the end of this section).
+
+### For Users
+
+Just run `perf record ./myapp` or attach to a running process.
+
+#### Using Perf with MPI
+
+Perf can also be used to record data for individual MPI processes. This requires a wrapper script
+(`perfwrapper`) with the following content. Also make sure that the wrapper script is executable
+(`chmod +x`).
+
+```Bash
+#!/bin/bash
+# write one perf.data file per MPI rank
+perf record -o perf.data.$SLURM_JOB_ID.$SLURM_PROCID "$@"
+```
+
+To start the MPI program, type `srun ./perfwrapper ./myapp` on your command line. The result will
+be `n` independent `perf.data` files that can be analyzed individually with `perf report`.
+
+### For Admins
+
+This tool is very effective if you want to help users find performance problems and hot spots in
+their code, but it also helps to find OS daemons that disturb such applications. You would start
+`perf record -a -g` to monitor the whole node.
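+
+To summarize the options discussed in this section, a few typical invocations are sketched below.
+The event name and the sample period are only examples; adjust them to your needs:
+
+```Bash
+# time-based sampling with call graphs at the default frequency
+perf record -g ./myapp
+# use an event as the sampling trigger, e.g. take one sample every 10,000 cache misses
+perf record -e cache-misses -c 10000 ./myapp
+# sample the whole node for ten seconds (admins only)
+perf record -a -g sleep 10
+```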
+
+## Perf Report
+
+`perf report` is a command line UI for evaluating the results from `perf record`. It creates
+something like a profile from the recorded samplings. These profiles show you which functions have
+been used the most. If you added a call chain, it also gives you a call chain profile.
+
+*Disclaimer: Sampling is not an appropriate way to gain exact numbers. It gives a rough overview
+and is not guaranteed to be absolutely correct.*
+
+### On Taurus
+
+On Taurus, users are not allowed to see the kernel functions. If you have multiple events defined,
+then the first thing you select in `perf report` is the type of event. Press the right arrow key to
+select one:
+
+```Bash
+Available samples
+96 cycles
+11 cache-misses
+```
+
+**Hints:**
+
+* The more samples you have, the more exact the profile is. 96 or 11 samples are not nearly enough.
+* Repeat the measurement and set `-F 50000` to increase the sampling frequency.
+* The higher the frequency, the higher the influence on the measurement.
+
+If you select cycles, you will get a screen like this:
+
+```Bash
+Events: 96 cycles
++  49,13%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.0
++  34,48%  test_gcc_perf  test_gcc_perf      [.]
++   6,92%  test_gcc_perf  test_gcc_perf      [.] omp_get_thread_num@plt
++   5,20%  test_gcc_perf  libgomp.so.1.0.0   [.] omp_get_thread_num
++   2,25%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.1
++   2,02%  test_gcc_perf  [kernel.kallsyms]  [k] 0xffffffff8102e9ea
+```
+
+Increased sample frequency:
+
+```Bash
+Events: 7K cycles
++  42,61%  test_gcc_perf  test_gcc_perf      [.] p
++  40,28%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.0
++   6,07%  test_gcc_perf  test_gcc_perf      [.] omp_get_thread_num@plt
++   5,95%  test_gcc_perf  libgomp.so.1.0.0   [.] omp_get_thread_num
++   4,14%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.1
++   0,69%  test_gcc_perf  [kernel.kallsyms]  [k] 0xffffffff8102e9ea
++   0,04%  test_gcc_perf  ld-2.12.so         [.] check_match.12442
++   0,03%  test_gcc_perf  libc-2.12.so       [.] printf
++   0,03%  test_gcc_perf  libc-2.12.so       [.] vfprintf
++   0,03%  test_gcc_perf  libc-2.12.so       [.] __strchrnul
++   0,03%  test_gcc_perf  libc-2.12.so       [.] _dl_addr
++   0,02%  test_gcc_perf  ld-2.12.so         [.] do_lookup_x
++   0,01%  test_gcc_perf  libc-2.12.so       [.] _int_malloc
++   0,01%  test_gcc_perf  libc-2.12.so       [.] free
++   0,01%  test_gcc_perf  libc-2.12.so       [.] __sigprocmask
++   0,01%  test_gcc_perf  libgomp.so.1.0.0   [.] 0x87de
++   0,01%  test_gcc_perf  libc-2.12.so       [.] __sleep
++   0,01%  test_gcc_perf  ld-2.12.so         [.] _dl_check_map_versions
++   0,01%  test_gcc_perf  ld-2.12.so         [.] local_strdup
++   0,00%  test_gcc_perf  libc-2.12.so       [.] __execvpe
+```
+
+Now you select the most often sampled function and zoom into it by pressing the right arrow key. If
+debug symbols are not available, `perf report` will show which assembly instruction is hit most
+often when sampling. If debug symbols are available, it will also show you the source code lines
+for these assembly instructions. You can also go back and check which instruction caused the cache
+misses or whatever event you were passing to `perf record`.
+
+## Perf Script
+
+If you need a trace of the sampled data, you can use the `perf script` command, which by default
+prints all samples to stdout. You can use various interfaces (e.g., Python) to process such a
+trace.
+
+## Perf Top
+
+`perf top` is only available for admins, as long as the paranoid flag is not changed (see
+configuration).
+
+It behaves like the `top` command, but it gives you not only an overview of the processes and the
+time they are consuming, but also of the functions that are being executed by them.
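+
+To wrap up the user-facing commands from this page, the following sketch shows how they could be
+combined in a batch job on Taurus. The resource requests and the application name are placeholders;
+adjust them to your job:
+
+```Bash
+#!/bin/bash
+#SBATCH --ntasks=1
+#SBATCH --time=00:30:00
+
+module load perf/r31
+
+# get a general overview of the hardware counters first ...
+perf stat ./myapp
+
+# ... then sample the hot spots, including call graphs
+perf record -g ./myapp
+
+# inspect the resulting perf.data afterwards with: perf report
+```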
diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index 37da298857672ade7803554f33729eb23b46b3d3..d80e9f3f8d666093981a3e0ed5c37b1d39c13599 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -53,6 +53,8 @@ nav: - Debuggers: software/Debuggers.md - MPI Error Detection: software/MPIUsageErrorDetection.md - Score-P: software/ScoreP.md + - PAPI Library: software/PapiLibrary.md + - Perf Tools: software/PerfTools.md - Data Management: - Overview: data_management/DataManagement.md - Announcement of Quotas: data_management/AnnouncementOfQuotas.md diff --git a/twiki2md/root/DebuggingTools/MPIUsageErrorDetection.md b/twiki2md/root/DebuggingTools/MPIUsageErrorDetection.md deleted file mode 100644 index 2b72f35df97902b323fb5cb589387acbb52cd6d6..0000000000000000000000000000000000000000 --- a/twiki2md/root/DebuggingTools/MPIUsageErrorDetection.md +++ /dev/null @@ -1,81 +0,0 @@ -# Introduction - -MPI as the de-facto standard for parallel applications of the the -massage passing paradigm offers more than one hundred different API -calls with complex restrictions. As a result, developing applications -with this interface is error prone and often time consuming. Some usage -errors of MPI may only manifest on some platforms or some application -runs, which further complicates the detection of these errors. Thus, -special debugging tools for MPI applications exist that automatically -check whether an application conforms to the MPI standard and whether -its MPI calls are safe. At ZIH, we maintain and support MUST for this -task, though different types of these tools exist (see last section). - -# MUST - -MUST checks if your application conforms to the MPI standard and will -issue warnings if there are errors or non-portable constructs. You can -apply MUST without modifying your source code, though we suggest to add -the debugging flag "-g" during compilation. - -- [MUST introduction slides](%ATTACHURL%/parallel_debugging_must.pdf) - -## Setup and Modules - -You need to load a module file in order to use MUST. Each MUST -installation uses a specific combination of a compiler and an MPI -library, make sure to use a combination that fits your needs. Right now -we only provide a single combination on each system, contact us if you -need further combinations. You can query for the available modules with: - - module avail must - -You can load a MUST module as follows: - - module load must - -Besides loading a MUST module, no further changes are needed during -compilation and linking. - -## Running with MUST - -In order to run with MUST you need to replace the mpirun/mpiexec command -with mustrun: - - mustrun -np <NPROC> ./a.out - -Besides replacing the mpiexec command you need to be aware that **MUST -always allocates an extra process**. I.e. if you issue a "mustrun -np 4 -./a.out" then MUST will start 5 processes instead. This is usually not -critical, however in batch jobs **make sure to allocate space for this -extra task**. - -Finally, MUST assumes that your application may crash at any time. To -still gather correctness results under this assumption is extremely -expensive in terms of performance overheads. Thus, if your application -does not crashs, you should add an "--must:nocrash" to the mustrun -command to make MUST aware of this knowledge. Overhead is drastically -reduced with this switch. - -## Result Files - -After running your application with MUST you will have its output in the -working directory of your application. 
The output is named -"MUST_Output.html". Open this files in a browser to anlyze the results. -The HTML file is color coded: Entries in green represent notes and -useful information. Entries in yellow represent warnings, and entries in -red represent errors. - -# Other MPI Correctness Tools - -Besides MUST, there exist further MPI correctness tools, these are: - -- Marmot (predecessor of MUST) -- MPI checking library of the Intel Trace Collector -- ISP (From Utah) -- Umpire (predecessor of MUST) - -ISP provides a more thorough deadlock detection as it investigates -alternative execution paths, however its overhead is drastically higher -as a result. Contact our support if you have a specific use cases that -needs one of these tools. diff --git a/twiki2md/root/PerformanceTools/PapiLibrary.md b/twiki2md/root/PerformanceTools/PapiLibrary.md deleted file mode 100644 index 5516c7e310d68e7509384ef081b89816ebe41dde..0000000000000000000000000000000000000000 --- a/twiki2md/root/PerformanceTools/PapiLibrary.md +++ /dev/null @@ -1,55 +0,0 @@ -# PAPI Library - -Related work: [PAPI -documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page), -[Intel 64 and IA-32 Architectures Software Developers Manual (Per -thread/per core -PMCs)](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf) - -Additional sources for **Sandy Bridge** Processors: [Intel Xeon -Processor E5-2600 Product Family Uncore Performance Monitoring Guide -(Uncore -PMCs)](http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf) - -Additional sources for **Haswell** Processors: [Intel Xeon Processor -E5-2600 v3 Product Family Uncore Performance Monitoring Guide (Uncore -PMCs) - Download -link](http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-uncore-performance-monitoring.html) - -## Introduction - -PAPI enables users and developers to monitor how their code performs on -a specific architecture. To do so, they can register events that are -counted by the hardware in performance monitoring counters (PMCs). These -counters relate to a specific hardware unit, for example a processor -core. Intel Processors used on taurus support eight PMCs per processor -core. As the partitions on taurus are run with HyperThreading Technology -(HTT) enabled, each CPU can use four of these. In addition to the **four -core PMCs**, Intel processors also support **a number of uncore PMCs** -for non-core resources. (see the uncore manuals listed in top of this -documentation). - -## Usage - -[Score-P](ScoreP) supports per-core PMCs. To include uncore PMCs into -Score-P traces use the software module **scorep-uncore/2016-03-29**on -the Haswell partition. If you do so, disable profiling to include the -uncore measurements. This metric plugin is available at -[github](https://github.com/score-p/scorep_plugin_uncore/). - -If you want to use PAPI directly in your software, load the latest papi -module, which establishes the environment variables **PAPI_INC**, -**PAPI_LIB**, and **PAPI_ROOT**. Have a look at the [PAPI -documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page) for -details on the usage. 
- -## Related Software - -[Score-P](ScoreP) - -[Linux Perf Tools](PerfTools) - -If you just need a short summary of your job, you might want to have a -look at [perf stat](PerfTools) - --- Main.UlfMarkwardt - 2012-10-09 diff --git a/twiki2md/root/PerformanceTools/PerfTools.md b/twiki2md/root/PerformanceTools/PerfTools.md deleted file mode 100644 index 1a9fd98517e5274f88be0f55e7d6002f907d0b5c..0000000000000000000000000000000000000000 --- a/twiki2md/root/PerformanceTools/PerfTools.md +++ /dev/null @@ -1,236 +0,0 @@ -(This page is under construction) - -# Introduction - -perf consists of two parts: the kernel space implementation and the -userland tools. This wiki entry focusses on the latter. These tools are -installed on taurus, and others and provides support for sampling -applications and reading performance counters. - -# Installation - -On taurus load the module via - - module load perf/r31 - -# Configuration - -Admins can change the behaviour of the perf tools kernel part via the -following interfaces - -| | | -|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------| -| File Name | Description | -| /proc/sys/kernel/perf_event_max_sample_rate | describes the maximal sample rate for perf record and native access. This is used to limit the performance influence of sampling. | -| /proc/sys/kernel/perf_event_mlock_kb | defines the number of pages that can be used for sampling via perf record or the native interface | -| /proc/sys/kernel/perf_event_paranoid | defines access rights: | -| | -1 - Not paranoid at all | -| | 0 - Disallow raw tracepoint access for unpriv | -| | 1 - Disallow cpu events for unpriv | -| | 2 - Disallow kernel profiling for unpriv | -| /proc/sys/kernel/kptr_restrict | Defines whether the kernel address maps are restricted | - -# perf stat - -`perf stat` provides a general performance statistic for a program. You -can attach to a running (own) process, monitor a new process or monitor -the whole system. The latter is only available for root user, as the -performance data can provide hints on the internals of the application. - -## For users - -Run `perf stat <Your application>`. This will provide you with a general -overview on some counters. - - Performance counter stats for 'ls':= - 2,524235 task-clock # 0,352 CPUs utilized - 15 context-switches # 0,006 M/sec - 0 CPU-migrations # 0,000 M/sec - 292 page-faults # 0,116 M/sec - 6.431.241 cycles # 2,548 GHz - 3.537.620 stalled-cycles-frontend # 55,01% frontend cycles idle - 2.634.293 stalled-cycles-backend # 40,96% backend cycles idle - 6.157.440 instructions # 0,96 insns per cycle - # 0,57 stalled cycles per insn - 1.248.527 branches # 494,616 M/sec - 34.044 branch-misses # 2,73% of all branches - 0,007167707 seconds time elapsed - -- Generally speaking **task clock** tells you how parallel your job - has been/how many cpus were used. -- **[Context switches](http://en.wikipedia.org/wiki/Context_switch)** - are an information about how the scheduler treated the application. - Also interrupts cause context switches. Lower is better. -- **CPU migrations** are an information on whether the scheduler moved - the application between cores. Lower is better. Please pin your - programs to CPUs to avoid migrations. This can be done with - environment variables for OpenMP and MPI, with `likwid-pin`, - `numactl` and `taskset`. 
-- **[Page faults](http://en.wikipedia.org/wiki/Page_fault)** describe - how well the Translation Lookaside Buffers fit for the program. - Lower is better. -- **Cycles** tells you how many CPU cycles have been spent in - executing the program. The normalized value tells you the actual - average frequency of the CPU(s) running the application. -- **stalled-cycles-...** tell you how well the processor can execute - your code. Every stall cycle is a waste of CPU time and energy. The - reason for such stalls can be numerous. It can be wrong branch - predictions, cache misses, occupation of CPU resources by long - running instructions and so on. If these stall cycles are to high - you might want to review your code. -- The normalized **instructions** number tells you how well your code - is running. More is better. Current x86 CPUs can run 3 to 5 - instructions per cycle, depending on the instruction mix. A count of - less then 1 is not favorable. In such a case you might want to - review your code. -- **branches** and **branch-misses** tell you how many jumps and loops - are performed in your code. Correctly - [predicted](http://en.wikipedia.org/wiki/Branch_prediction) branches - should not hurt your performance, **branch-misses** on the other - hand hurt your performance very badly and lead to stall cycles. -- Other events can be passed with the `-e` flag. For a full list of - predefined events run `perf list` -- PAPI runs on top of the same infrastructure as perf stat, so you - might want to use their meaningful event names. Otherwise you can - use raw events, listed in the processor manuals. ( - [Intel](http://download.intel.com/products/processor/manual/325384.pdf), - [AMD](http://support.amd.com/us/Processor_TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf)) - -## For admins - -Administrators can run a system wide performance statistic, e.g., with -`perf stat -a sleep 1` which measures the performance counters for the -whole computing node over one second.\<span style="font-size: 1em;"> -\</span> - -# perf record - -`perf record` provides the possibility to sample an application or a -system. You can find performance issues and hot parts of your code. By -default perf record samples your program at a 4000 Hz. It records CPU, -Instruction Pointer and, if you specify it, the call chain. If your code -runs long (or often) enough, you can find hot spots in your application -and external libraries. Use **perf report** to evaluate the result. You -should have debug symbols available, otherwise you won't be able to see -the name of the functions that are responsible for your load. You can -pass one or multiple events to define the **sampling event**. \<br /> -**What is a sampling event?** \<br /> Sampling reads values at a -specific sampling frequency. This frequency is usually static and given -in Hz, so you have for example 4000 events per second and a sampling -frequency of 4000 Hz and a sampling rate of 250 microseconds. With the -sampling event, the concept of a static sampling frequency in time is -somewhat redefined. Instead of a constant factor in time (sampling rate) -you define a constant factor in events. So instead of a sampling rate of -250 microseconds, you have a sampling rate of 10,000 floating point -operations. \<br /> **Why would you need sampling events?** \<br /> -Passing an event allows you to find the functions that produce cache -misses, floating point operations, ... Again, you can use events defined -in `perf list` and raw events. 
\<br />\<br /> Use the `-g` flag to -receive a call graph. - -## For users - -Just run `perf record ./myapp` or attach to a running process. - -### Using perf with MPI - -Perf can also be used to record data for indivdual MPI processes. This -requires a wrapper script (perfwrapper) with the following content. Also -make sure that the wrapper script is executable (chmod +x). - - #!/bin/bash - <span style="font-size: 1em;">perf record -o perf.data.$SLURM_JOB_ID.$SLURM_PROCID $@</span> - -To start the MPI program type \<span>srun ./perfwrapper ./myapp -\</span>on your command line. The result will be n independent perf.data -files that can be analyzed individually with perf report. - -## For admins - -This tool is very effective, if you want to help users find performance -problems and hot-spots in their code but also helps to find OS daemons -that disturb such applications. You would start `perf record -a -g` to -monitor the whole node. - -# perf report - -perf report is a command line UI for evaluating the results from perf -record. It creates something like a profile from the recorded samplings. -These profiles show you what the most used have been. If you added a -callchain, it also gives you a callchain profile.\<br /> \*Disclaimer: -Sampling is not an appropriate way to gain exact numbers. So this is -merely a rough overview and not guaranteed to be absolutely -correct.\*\<span style="font-size: 1em;"> \</span> - -## On taurus - -On taurus, users are not allowed to see the kernel functions. If you -have multiple events defined, then the first thing you select in -`perf report` is the type of event. Press right - - Available samples - 96 cycles - 11 cache-misse - -**Hint: The more samples you have, the more exact is the profile. 96 or -11 samples is not enough by far.** I repeat the measurement and set -`-F 50000` to increase the sampling frequency. **Hint: The higher the -frequency, the higher the influence on the measurement.** If youd'd -select cycles, you would get such a screen: - - Events: 96 cycles - + 49,13% test_gcc_perf test_gcc_perf [.] main.omp_fn.0 - + 34,48% test_gcc_perf test_gcc_perf [.] - + 6,92% test_gcc_perf test_gcc_perf [.] omp_get_thread_num@plt - + 5,20% test_gcc_perf libgomp.so.1.0.0 [.] omp_get_thread_num - + 2,25% test_gcc_perf test_gcc_perf [.] main.omp_fn.1 - + 2,02% test_gcc_perf [kernel.kallsyms] [k] 0xffffffff8102e9ea - -Increased sample frequency: - - Events: 7K cycles - + 42,61% test_gcc_perf test_gcc_perf [.] p - + 40,28% test_gcc_perf test_gcc_perf [.] main.omp_fn.0 - + 6,07% test_gcc_perf test_gcc_perf [.] omp_get_thread_num@plt - + 5,95% test_gcc_perf libgomp.so.1.0.0 [.] omp_get_thread_num - + 4,14% test_gcc_perf test_gcc_perf [.] main.omp_fn.1 - + 0,69% test_gcc_perf [kernel.kallsyms] [k] 0xffffffff8102e9ea - + 0,04% test_gcc_perf ld-2.12.so [.] check_match.12442 - + 0,03% test_gcc_perf libc-2.12.so [.] printf - + 0,03% test_gcc_perf libc-2.12.so [.] vfprintf - + 0,03% test_gcc_perf libc-2.12.so [.] __strchrnul - + 0,03% test_gcc_perf libc-2.12.so [.] _dl_addr - + 0,02% test_gcc_perf ld-2.12.so [.] do_lookup_x - + 0,01% test_gcc_perf libc-2.12.so [.] _int_malloc - + 0,01% test_gcc_perf libc-2.12.so [.] free - + 0,01% test_gcc_perf libc-2.12.so [.] __sigprocmask - + 0,01% test_gcc_perf libgomp.so.1.0.0 [.] 0x87de - + 0,01% test_gcc_perf libc-2.12.so [.] __sleep - + 0,01% test_gcc_perf ld-2.12.so [.] _dl_check_map_versions - + 0,01% test_gcc_perf ld-2.12.so [.] local_strdup - + 0,00% test_gcc_perf libc-2.12.so [.] 
__execvpe - -Now you select the most often sampled function and zoom into it by -pressing right. If debug symbols are not available, perf report will -show which assembly instruction is hit most often when sampling. If -debug symbols are available, it will also show you the source code lines -for these assembly instructions. You can also go back and check which -instruction caused the cache misses or whatever event you were passing -to perf record. - -# perf script - -If you need a trace of the sampled data, you can use perf script -command, which by default prints all samples to stdout. You can use -various interfaces (e.g., python) to process such a trace. - -# perf top - -perf top is only available for admins, as long as the paranoid flag is -not changed (see configuration). - -It behaves like the top command, but gives you not only an overview of -the processes and the time they are consuming but also on the functions -that are processed by these. - --- Main.RobertSchoene - 2013-04-29