Commit 6eb9a1e2, authored by Martin Schroschk: Transfer content to new wiki
# PAPI Library
Related work:

* [PAPI documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page)
* [Intel 64 and IA-32 Architectures Software Developer's Manual (per-thread/per-core PMCs)](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf)
* Additional source for **Sandy Bridge** processors: [Intel Xeon Processor E5-2600 Product Family Uncore Performance Monitoring Guide (uncore PMCs)](http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf)
* Additional source for **Haswell** processors: [Intel Xeon Processor E5-2600 v3 Product Family Uncore Performance Monitoring Guide (uncore PMCs)](http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-uncore-performance-monitoring.html)
## Introduction

PAPI enables users and developers to monitor how their code performs on a specific architecture. To do so, they can register events that are counted by the hardware in performance monitoring counters (PMCs). These counters relate to a specific hardware unit, for example a processor core. The Intel processors used on Taurus support eight PMCs per processor core. As the partitions on Taurus run with Hyper-Threading Technology (HTT) enabled, each (logical) CPU can use four of these. In addition to these **four core PMCs**, Intel processors also support **a number of uncore PMCs** for non-core resources (see the uncore manuals listed at the top of this documentation).
## Usage

[Score-P](ScoreP.md) supports per-core PMCs. To include uncore PMCs in Score-P traces, use the software module **scorep-uncore/2016-03-29** on the Haswell partition. If you do so, disable profiling to include the uncore measurements. This metric plugin is available on [GitHub](https://github.com/score-p/scorep_plugin_uncore/).
If you want to use PAPI directly in your software, load the latest `papi` module, which sets the environment variables **PAPI_INC**, **PAPI_LIB**, and **PAPI_ROOT**. Have a look at the [PAPI documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page) for details on usage.
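As a sketch, a build against PAPI using these environment variables might look as follows (the source file name `my_prog.c` is illustrative, and the exact compiler invocation depends on your toolchain):

```Bash
module load papi
# PAPI_INC and PAPI_LIB are set by the module; link against libpapi
gcc -I"$PAPI_INC" my_prog.c -L"$PAPI_LIB" -lpapi -o my_prog
```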
## Related Software
* [Score-P](ScoreP.md)
* [Linux Perf Tools](PerfTools.md)
If you just need a short summary of your job, you might want to have a look at
[perf stat](PerfTools.md).
# Introduction
`perf` consists of two parts: the kernel-space implementation and the userland tools. This wiki entry focuses on the latter. These tools are installed on Taurus and provide support for sampling applications and reading performance counters.
## Installation
On Taurus, load the module via
```Bash
module load perf/r31
```
## Configuration

Admins can change the behavior of the perf tools kernel part via the following interfaces:

| File Name | Description |
|-----------|-------------|
| `/proc/sys/kernel/perf_event_max_sample_rate` | Maximal sample rate for `perf record` and native access. This is used to limit the performance influence of sampling. |
| `/proc/sys/kernel/perf_event_mlock_kb` | Number of pages that can be used for sampling via `perf record` or the native interface. |
| `/proc/sys/kernel/perf_event_paranoid` | Access rights: `-1` = not paranoid at all; `0` = disallow raw tracepoint access for unprivileged users; `1` = disallow CPU events for unprivileged users; `2` = disallow kernel profiling for unprivileged users. |
| `/proc/sys/kernel/kptr_restrict` | Whether the kernel address maps are restricted. |
## Perf Stat
`perf stat` provides general performance statistics for a program. You can attach to a running (own) process, monitor a new process, or monitor the whole system. The latter is only available to the root user, as the performance data can provide hints on the internals of the application.
### For Users
Run `perf stat <Your application>`. This will provide you with a general overview of some counters.
```Bash
Performance counter stats for 'ls':
2,524235 task-clock # 0,352 CPUs utilized
15 context-switches # 0,006 M/sec
0 CPU-migrations # 0,000 M/sec
292 page-faults # 0,116 M/sec
6.431.241 cycles # 2,548 GHz
3.537.620 stalled-cycles-frontend # 55,01% frontend cycles idle
2.634.293 stalled-cycles-backend # 40,96% backend cycles idle
6.157.440 instructions # 0,96 insns per cycle
# 0,57 stalled cycles per insn
1.248.527 branches # 494,616 M/sec
34.044 branch-misses # 2,73% of all branches
0,007167707 seconds time elapsed
```
- Generally speaking, **task-clock** tells you how parallel your job has been, i.e., how many CPUs were used.
- **[Context switches](http://en.wikipedia.org/wiki/Context_switch)** indicate how the scheduler treated the application. Interrupts also cause context switches. Lower is better.
- **CPU migrations** indicate whether the scheduler moved the application between cores. Lower is better. Please pin your programs to CPUs to avoid migrations. This can be done with environment variables for OpenMP and MPI, or with `likwid-pin`, `numactl`, and `taskset`.
- **[Page faults](http://en.wikipedia.org/wiki/Page_fault)** describe how well the Translation Lookaside Buffers fit the program. Lower is better.
- **Cycles** tells you how many CPU cycles have been spent in
executing the program. The normalized value tells you the actual average frequency of the CPU(s)
running the application.
- **stalled-cycles-...** tell you how well the processor can execute your code. Every stall cycle is a waste of CPU time and energy. The reasons for such stalls can be numerous: wrong branch predictions, cache misses, occupation of CPU resources by long-running instructions, and so on. If these stall cycles are too high, you might want to review your code.
- The normalized **instructions** number tells you how well your code is running. More is better. Current x86 CPUs can run 3 to 5 instructions per cycle, depending on the instruction mix. A count of less than 1 is not favorable. In such a case you might want to review your code.
- **branches** and **branch-misses** tell you how many jumps and loops are performed in your code. Correctly [predicted](http://en.wikipedia.org/wiki/Branch_prediction) branches should not hurt your performance; **branch-misses**, on the other hand, hurt your performance very badly and lead to stall cycles.
- Other events can be passed with the `-e` flag. For a full list of predefined events, run `perf list`.
- PAPI runs on top of the same infrastructure as `perf stat`, so you might want to use its meaningful event names. Otherwise you can use raw events, listed in the processor manuals.
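For example, you could combine several predefined events in one run (`./myapp` is a placeholder for your application; the raw event code below only illustrates the `r<umask><eventcode>` syntax and is not a recommended event):

```Bash
# Predefined events, names as listed by `perf list`
perf stat -e cache-references,cache-misses,instructions,cycles ./myapp
# Raw event, encoded from the processor manual
perf stat -e r01a2 ./myapp
```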
### For Admins
Administrators can run system-wide performance statistics, e.g., with `perf stat -a sleep 1`, which measures the performance counters for the whole compute node over one second.
## Perf Record
`perf record` provides the possibility to sample an application or a system. You can find performance issues and hot parts of your code. By default, `perf record` samples your program at 4000 Hz. It records the CPU, the instruction pointer and, if you specify it, the call chain. If your code runs long (or often) enough, you can find hot spots in your application and in external libraries. Use `perf report` to evaluate the result. You should have debug symbols available, otherwise you won't be able to see the names of the functions that are responsible for your load. You can pass one or multiple events to define the **sampling event**.
**What is a sampling event?** Sampling reads values at a specific sampling frequency. This frequency is usually static and given in Hz: with a sampling frequency of 4000 Hz, you take 4000 samples per second, i.e., one sample every 250 microseconds. With a sampling event, the concept of a static sampling frequency in time is somewhat redefined: instead of a constant factor in time (the sampling rate), you define a constant factor in events. So instead of one sample every 250 microseconds, you take, for example, one sample every 10,000 floating point operations.
**Why would you need sampling events?** Passing an event allows you to find the functions
that produce cache misses, floating point operations, ... Again, you can use events defined in `perf
list` and raw events.
Use the `-g` flag to receive a call graph.
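Putting this together (with `./myapp` again as a placeholder), a time-based and an event-based sampling run might look like:

```Bash
# Sample by time: 50000 samples per second, with call graph
perf record -F 50000 -g ./myapp
# Sample by event: one sample every 10000 cache misses
perf record -e cache-misses -c 10000 ./myapp
```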
### For Users
Just run `perf record ./myapp` or attach to a running process.
#### Using Perf with MPI
Perf can also be used to record data for individual MPI processes. This requires a wrapper script (`perfwrapper`) with the following content. Also make sure that the wrapper script is executable (`chmod +x perfwrapper`).
```Bash
#!/bin/bash
# Record one perf.data file per MPI rank; quote "$@" to preserve arguments
perf record -o perf.data.$SLURM_JOB_ID.$SLURM_PROCID "$@"
```
To start the MPI program, type `srun ./perfwrapper ./myapp` on your command line. The result will be one `perf.data` file per process, each of which can be analyzed individually with `perf report`.
### For Admins
This tool is very effective if you want to help users find performance problems and hot spots in their code, but it also helps to find OS daemons that disturb such applications. You would start `perf record -a -g` to monitor the whole node.
## Perf Report

`perf report` is a command line UI for evaluating the results from `perf record`. It creates something like a profile from the recorded samples. These profiles show you which functions have been used the most. If you added a call chain, it also gives you a call-chain profile.

*Disclaimer: Sampling is not an appropriate way to gain exact numbers. This is merely a rough overview and not guaranteed to be absolutely correct.*
### On Taurus
On Taurus, users are not allowed to see the kernel functions. If you have multiple events defined, the first thing you select in `perf report` is the type of event. Press the right arrow key to select one:

```Bash
Available samples
96 cycles
11 cache-misses
```
**Hints:**

* The more samples you have, the more exact the profile. 96 or 11 samples is not nearly enough.
* Repeat the measurement and set `-F 50000` to increase the sampling frequency.
* The higher the frequency, the higher the influence on the measurement.
If you select cycles, you get a screen like this:
```Bash
Events: 96 cycles
+ 49,13% test_gcc_perf test_gcc_perf [.] main.omp_fn.0
+ 34,48% test_gcc_perf test_gcc_perf [.]
+ 6,92% test_gcc_perf test_gcc_perf [.] omp_get_thread_num@plt
+ 5,20% test_gcc_perf libgomp.so.1.0.0 [.] omp_get_thread_num
+ 2,25% test_gcc_perf test_gcc_perf [.] main.omp_fn.1
+ 2,02% test_gcc_perf [kernel.kallsyms] [k] 0xffffffff8102e9ea
```
Increased sample frequency:
```Bash
Events: 7K cycles
+ 42,61% test_gcc_perf test_gcc_perf [.] p
+ 40,28% test_gcc_perf test_gcc_perf [.] main.omp_fn.0
+ 6,07% test_gcc_perf test_gcc_perf [.] omp_get_thread_num@plt
+ 5,95% test_gcc_perf libgomp.so.1.0.0 [.] omp_get_thread_num
+ 4,14% test_gcc_perf test_gcc_perf [.] main.omp_fn.1
+ 0,69% test_gcc_perf [kernel.kallsyms] [k] 0xffffffff8102e9ea
+ 0,04% test_gcc_perf ld-2.12.so [.] check_match.12442
+ 0,03% test_gcc_perf libc-2.12.so [.] printf
+ 0,03% test_gcc_perf libc-2.12.so [.] vfprintf
+ 0,03% test_gcc_perf libc-2.12.so [.] __strchrnul
+ 0,03% test_gcc_perf libc-2.12.so [.] _dl_addr
+ 0,02% test_gcc_perf ld-2.12.so [.] do_lookup_x
+ 0,01% test_gcc_perf libc-2.12.so [.] _int_malloc
+ 0,01% test_gcc_perf libc-2.12.so [.] free
+ 0,01% test_gcc_perf libc-2.12.so [.] __sigprocmask
+ 0,01% test_gcc_perf libgomp.so.1.0.0 [.] 0x87de
+ 0,01% test_gcc_perf libc-2.12.so [.] __sleep
+ 0,01% test_gcc_perf ld-2.12.so [.] _dl_check_map_versions
+ 0,01% test_gcc_perf ld-2.12.so [.] local_strdup
+ 0,00% test_gcc_perf libc-2.12.so [.] __execvpe
```
Now select the most often sampled function and zoom into it by pressing the right arrow key. If debug symbols are not available, `perf report` will show which assembly instruction is hit most often when sampling. If debug symbols are available, it will also show you the source code lines for these assembly instructions. You can also go back and check which instruction caused the cache misses or whatever event you passed to `perf record`.
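If you prefer non-interactive output, e.g., to save a profile to a file, `perf report` can also print plain text; the input-file name below follows the MPI wrapper script above and is illustrative:

```Bash
# Non-interactive, plain-text profile of ./perf.data
perf report --stdio
# Analyze a specific result file, e.g., one per MPI rank
perf report --stdio -i perf.data.<jobid>.<rank>
```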
## Perf Script
If you need a trace of the sampled data, you can use the `perf script` command, which by default prints all samples to stdout. You can use various interfaces (e.g., Python) to process such a trace.
## Perf Top
`perf top` is only available to admins, as long as the paranoid flag is not changed (see Configuration above).

It behaves like the `top` command, but gives you not only an overview of the processes and the time they are consuming, but also of the functions being processed by these.
# Introduction

MPI, as the de-facto standard for parallel applications of the message passing paradigm, offers more than one hundred different API calls with complex restrictions. As a result, developing applications with this interface is error prone and often time consuming. Some usage errors of MPI may only manifest on some platforms or in some application runs, which further complicates the detection of these errors. Thus, special debugging tools for MPI applications exist that automatically check whether an application conforms to the MPI standard and whether its MPI calls are safe. At ZIH, we maintain and support MUST for this task, though different types of these tools exist (see the last section).
# MUST
MUST checks if your application conforms to the MPI standard and will issue warnings if there are errors or non-portable constructs. You can apply MUST without modifying your source code, though we suggest adding the debugging flag `-g` during compilation.

- [MUST introduction slides](%ATTACHURL%/parallel_debugging_must.pdf)
## Setup and Modules

You need to load a module file in order to use MUST. Each MUST installation uses a specific combination of a compiler and an MPI library; make sure to use a combination that fits your needs. Right now we only provide a single combination on each system; contact us if you need further combinations. You can query for the available modules with:

```Bash
module avail must
```

You can load a MUST module as follows:

```Bash
module load must
```

Besides loading a MUST module, no further changes are needed during compilation and linking.
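A minimal sketch of preparing an MPI program for MUST (the source file name `my_mpi_app.c` and the MPI compiler wrapper `mpicc` are illustrative; use the wrapper of the loaded MPI library):

```Bash
module load must
# -g adds debug symbols so that MUST reports can reference source lines
mpicc -g my_mpi_app.c -o my_mpi_app
```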
## Running with MUST

In order to run with MUST, you need to replace the `mpirun`/`mpiexec` command with `mustrun`:

```Bash
mustrun -np <NPROC> ./a.out
```

Besides replacing the `mpiexec` command, you need to be aware that **MUST always allocates an extra process**, i.e., if you issue `mustrun -np 4 ./a.out`, MUST will start 5 processes instead. This is usually not critical; however, in batch jobs **make sure to allocate space for this extra task**.
Finally, MUST assumes that your application may crash at any time. Gathering correctness results under this assumption is extremely expensive in terms of performance overhead. Thus, if your application does not crash, you should add `--must:nocrash` to the `mustrun` command to make MUST aware of this. Overhead is drastically reduced with this switch.
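In a Slurm batch script, this could be sketched as follows (a hypothetical job with 4 application ranks); note the extra task reserved for MUST's own process:

```Bash
#!/bin/bash
#SBATCH --ntasks=5        # 4 application ranks + 1 extra task for MUST
mustrun --must:nocrash -np 4 ./a.out
```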
## Result Files

After running your application with MUST, you will find its output in the working directory of your application. The output is named `MUST_Output.html`. Open this file in a browser to analyze the results. The HTML file is color coded: entries in green represent notes and useful information, entries in yellow represent warnings, and entries in red represent errors.
# Other MPI Correctness Tools

Besides MUST, further MPI correctness tools exist:

- Marmot (predecessor of MUST)
- MPI checking library of the Intel Trace Collector
- ISP (from Utah)
- Umpire (predecessor of MUST)

ISP provides a more thorough deadlock detection, as it investigates alternative execution paths; however, its overhead is drastically higher as a result. Contact our support if you have a specific use case that needs one of these tools.
-- Main.UlfMarkwardt - 2012-10-09
-- Main.RobertSchoene - 2013-04-29