diff --git a/doc.zih.tu-dresden.de/docs/software/PapiLibrary.md b/doc.zih.tu-dresden.de/docs/software/PapiLibrary.md
new file mode 100644
index 0000000000000000000000000000000000000000..414e08a4bf0226493e10bfeabbad620df14f59c1
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/software/PapiLibrary.md
@@ -0,0 +1,40 @@
+# PAPI Library
+
+Related work:
+
+* [PAPI documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page)
+* [Intel 64 and IA-32 Architectures Software Developer's Manual (Per thread/per core
+  PMCs)](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf)
+
+Additional sources for **Haswell** processors: [Intel Xeon Processor E5-2600 v3 Product Family Uncore
+Performance Monitoring Guide (Uncore PMCs) - Download
+link](http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-uncore-performance-monitoring.html)
+
+## Introduction
+
+PAPI enables users and developers to monitor how their code performs on a specific architecture. To
+do so, they can register events that are counted by the hardware in performance monitoring counters
+(PMCs). These counters relate to a specific hardware unit, for example a processor core. Intel
+processors used on Taurus support eight PMCs per processor core. As the partitions on Taurus run
+with Hyper-Threading Technology (HTT) enabled, each logical CPU can use four of these. In addition
+to the **four core PMCs**, Intel processors also support **a number of uncore PMCs** for non-core
+resources (see the uncore manuals listed at the top of this page).
+
+## Usage
+
+[Score-P](ScoreP.md) supports per-core PMCs. To include uncore PMCs in Score-P traces, use the
+software module **scorep-uncore/2016-03-29** on the Haswell partition. If you do so, disable
+profiling to include the uncore measurements. This metric plugin is available on
+[GitHub](https://github.com/score-p/scorep_plugin_uncore/).
+
+If you want to use PAPI directly in your software, load the latest papi module, which establishes
+the environment variables **PAPI_INC**, **PAPI_LIB**, and **PAPI_ROOT**. Have a look at the
+[PAPI documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page) for details on the usage.
+
+## Related Software
+
+* [Score-P](ScoreP.md)
+* [Linux Perf Tools](PerfTools.md)
+
+If you just need a short summary of your job, you might want to have a look at
+[perf stat](PerfTools.md).
diff --git a/doc.zih.tu-dresden.de/docs/software/PerfTools.md b/doc.zih.tu-dresden.de/docs/software/PerfTools.md
new file mode 100644
index 0000000000000000000000000000000000000000..176c772bf4bcbe3d9cf3a1eda725d0cc7f14daac
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/software/PerfTools.md
@@ -0,0 +1,226 @@
+# Introduction
+
+`perf` consists of two parts: the kernel space implementation and the userland tools. This page
+focuses on the latter. These tools are installed on Taurus and other systems and provide support
+for sampling applications and reading performance counters.
+
+## Installation
+
+On Taurus, load the module via:
+
+```Bash
+module load perf/r31
+```
+
+## Configuration
+
+Admins can change the behaviour of the kernel part of the perf tools.
+The following interfaces are available, all located in
+`/proc/sys/kernel`:
+
+| File Name                                     | Description |
+|-----------------------------------------------|-------------|
+| `/proc/sys/kernel/perf_event_max_sample_rate` | Maximum sample rate for `perf record` and native access. This limits the performance impact of sampling. |
+| `/proc/sys/kernel/perf_event_mlock_kb`        | Amount of memory (in KiB) that can be used for sampling buffers via `perf record` or the native interface. |
+| `/proc/sys/kernel/perf_event_paranoid`        | Defines access rights: |
+|                                               | `-1`: not paranoid at all |
+|                                               | `0`: disallow raw tracepoint access for unprivileged users |
+|                                               | `1`: disallow CPU events for unprivileged users |
+|                                               | `2`: disallow kernel profiling for unprivileged users |
+| `/proc/sys/kernel/kptr_restrict`              | Defines whether the kernel address maps are restricted. |
+
+## Perf Stat
+
+`perf stat` provides general performance statistics for a program. You
+can attach to one of your own running processes, monitor a new process,
+or monitor the whole system. The latter is only available to root, as the
+performance data can reveal internals of other users' applications.
+
+### For Users
+
+Run `perf stat <Your application>`. This will provide you with a general
+overview of some counters.
+
+```Bash
+Performance counter stats for 'ls':
+          2,524235 task-clock                #    0,352 CPUs utilized           
+                15 context-switches          #    0,006 M/sec                   
+                 0 CPU-migrations            #    0,000 M/sec                   
+               292 page-faults               #    0,116 M/sec                   
+         6.431.241 cycles                    #    2,548 GHz                     
+         3.537.620 stalled-cycles-frontend   #   55,01% frontend cycles idle    
+         2.634.293 stalled-cycles-backend    #   40,96% backend  cycles idle    
+         6.157.440 instructions              #    0,96  insns per cycle         
+                                             #    0,57  stalled cycles per insn 
+         1.248.527 branches                  #  494,616 M/sec                   
+            34.044 branch-misses             #    2,73% of all branches         
+       0,007167707 seconds time elapsed
+```
+
+- Generally speaking, **task-clock** tells you how parallel your job
+  has been, i.e., how many CPUs were used.
+- **[Context switches](http://en.wikipedia.org/wiki/Context_switch)**
+  indicate how the scheduler treated the application. Interrupts also cause context switches.
+  Lower is better.
+- **CPU migrations** indicate whether the scheduler moved
+  the application between cores. Lower is better. Please pin your programs to CPUs to avoid
+  migrations. This can be done with environment variables for OpenMP and MPI, with `likwid-pin`,
+  `numactl`, or `taskset`.
+- **[Page faults](http://en.wikipedia.org/wiki/Page_fault)** count how often the program accessed
+  a memory page that the operating system first had to map into physical memory. Lower is better.
+- **Cycles** tells you how many CPU cycles have been spent in
+  executing the program. The normalized value tells you the actual average frequency of the CPU(s)
+  running the application.
+- **stalled-cycles-...** tell you how well the processor can execute
+  your code. Every stall cycle is a waste of CPU time and energy. Such stalls can have numerous
+  causes, e.g., wrong branch predictions, cache misses, or occupation of CPU resources by
+  long-running instructions. If the number of stall cycles is too high, you might want to review
+  your code.
+- The normalized **instructions** number tells you how well your code
+  is running. More is better. Current x86 CPUs can run 3 to 5 instructions per cycle, depending on
+  the instruction mix. A value of less than 1 is not favorable. In such a case, you might want to
+  review your code.
+- **branches** and **branch-misses** tell you how many jumps and loops
+  are performed in your code. Correctly [predicted](http://en.wikipedia.org/wiki/Branch_prediction)
+  branches should not hurt your performance. **branch-misses**, on the other hand, hurt your
+  performance very badly and lead to stall cycles.
+- Other events can be passed with the `-e` flag. For a full list of
+  predefined events, run `perf list`.
+- PAPI runs on top of the same infrastructure as `perf stat`, so you
+  might want to use their meaningful event names. Otherwise you can use raw events, listed in the
+  processor manuals.
+
+### For Admins
+
+Administrators can run a system wide performance statistic, e.g., with `perf stat -a sleep 1` which
+measures the performance counters for the whole computing node over one second.
+
+## Perf Record
+
+`perf record` provides the possibility to sample an application or a system. You can find
+performance issues and hot parts of your code. By default, `perf record` samples your program at
+4000 Hz. It records CPU, instruction pointer and, if you specify it, the call chain. If your code
+runs long (or often) enough, you can find hot spots in your application and external libraries. Use
+`perf report` to evaluate the result. You should have debug symbols available, otherwise you won't
+be able to see the names of the functions that are responsible for your load. You can pass one or
+multiple events to define the **sampling event**.
+
+**What is a sampling event?** Sampling reads values at a specific sampling frequency. This
+frequency is usually static and given in Hz. For example, a sampling frequency of 4000 Hz yields
+4000 samples per second, i.e., a sampling period of 250 microseconds. With the sampling event, the
+concept of a static sampling frequency in time is somewhat redefined. Instead of a constant period
+in time, you define a constant period in events. So instead of one sample every 250 microseconds,
+you get, for example, one sample every 10,000 floating point operations.
+
+**Why would you need sampling events?** Passing an event allows you to find the functions
+that produce cache misses, floating point operations, etc. Again, you can use events defined in
+`perf list` and raw events.
+
+Use the `-g` flag to receive a call graph.
+
+### For Users
+
+Just run `perf record ./myapp` or attach to a running process.
+
+#### Using Perf with MPI
+
+Perf can also be used to record data for individual MPI processes. This requires a wrapper script
+(`perfwrapper`) with the following content. Also make sure that the wrapper script is executable
+(`chmod +x perfwrapper`).
+
+```Bash
+#!/bin/bash
+perf record -o "perf.data.${SLURM_JOB_ID}.${SLURM_PROCID}" "$@"
+```
+
+To start the MPI program, type `srun ./perfwrapper ./myapp` on your command line. The result will be
+n independent `perf.data` files that can be analyzed individually with `perf report -i <file>`.
+
+### For Admins
+
+This tool is very effective if you want to help users find performance problems and hot spots in
+their code. It also helps to find OS daemons that disturb such applications. Run
+`perf record -a -g` to monitor the whole node.
+
+## Perf Report
+
+`perf report` is a command line UI for evaluating the results from `perf record`. It creates
+something like a profile from the recorded samples. These profiles show you which functions have
+been used most. If you added a call chain, it also gives you a call chain profile.
+*Disclaimer: Sampling is not an appropriate way to gain exact numbers. So this is merely a rough
+overview and not guaranteed to be absolutely correct.*
+
+### On Taurus
+
+On Taurus, users are not allowed to see the kernel functions. If you have multiple events defined,
+the first thing you select in `perf report` is the type of event (press the right arrow key):
+
+```Bash
+Available samples
+96 cycles
+11 cache-misses
+```
+
+**Hints:**
+
+* The more samples you have, the more exact the profile is. 96 or
+  11 samples are not enough by far.
+* Repeat the measurement and set `-F 50000` to increase the sampling frequency.
+* The higher the frequency, the higher the influence on the measurement.
+
+If you select `cycles`, you get a screen like this:
+
+```Bash
+Events: 96  cycles
++  49,13%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.0
++  34,48%  test_gcc_perf  test_gcc_perf      [.] 
++   6,92%  test_gcc_perf  test_gcc_perf      [.] omp_get_thread_num@plt
++   5,20%  test_gcc_perf  libgomp.so.1.0.0   [.] omp_get_thread_num
++   2,25%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.1
++   2,02%  test_gcc_perf  [kernel.kallsyms]  [k] 0xffffffff8102e9ea
+```
+
+Increased sample frequency:
+
+```Bash
+Events: 7K cycles
++  42,61%  test_gcc_perf  test_gcc_perf      [.] p
++  40,28%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.0
++   6,07%  test_gcc_perf  test_gcc_perf      [.] omp_get_thread_num@plt
++   5,95%  test_gcc_perf  libgomp.so.1.0.0   [.] omp_get_thread_num
++   4,14%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.1
++   0,69%  test_gcc_perf  [kernel.kallsyms]  [k] 0xffffffff8102e9ea
++   0,04%  test_gcc_perf  ld-2.12.so         [.] check_match.12442
++   0,03%  test_gcc_perf  libc-2.12.so       [.] printf
++   0,03%  test_gcc_perf  libc-2.12.so       [.] vfprintf
++   0,03%  test_gcc_perf  libc-2.12.so       [.] __strchrnul
++   0,03%  test_gcc_perf  libc-2.12.so       [.] _dl_addr
++   0,02%  test_gcc_perf  ld-2.12.so         [.] do_lookup_x
++   0,01%  test_gcc_perf  libc-2.12.so       [.] _int_malloc
++   0,01%  test_gcc_perf  libc-2.12.so       [.] free
++   0,01%  test_gcc_perf  libc-2.12.so       [.] __sigprocmask
++   0,01%  test_gcc_perf  libgomp.so.1.0.0   [.] 0x87de
++   0,01%  test_gcc_perf  libc-2.12.so       [.] __sleep
++   0,01%  test_gcc_perf  ld-2.12.so         [.] _dl_check_map_versions
++   0,01%  test_gcc_perf  ld-2.12.so         [.] local_strdup
++   0,00%  test_gcc_perf  libc-2.12.so       [.] __execvpe
+```
+
+Now you select the most often sampled function and zoom into it by pressing the right arrow key. If
+debug symbols are not available, `perf report` will show which assembly instruction is hit most
+often when sampling. If debug symbols are available, it will also show you the source code lines
+for these assembly instructions. You can also go back and check which instruction caused the cache
+misses or whatever event you were passing to `perf record`.
+
+## Perf Script
+
+If you need a trace of the sampled data, you can use the `perf script` command, which by default
+prints all samples to stdout. You can use various interfaces (e.g., Python) to process such a trace.
+
+## Perf Top
+
+`perf top` is only available for admins, as long as the paranoid flag is not changed (see
+[Configuration](#configuration)).
+
+It behaves like the `top` command, but gives you an overview not only of the processes and the time
+they consume but also of the functions that are executed by them.
diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml
index 37da298857672ade7803554f33729eb23b46b3d3..d80e9f3f8d666093981a3e0ed5c37b1d39c13599 100644
--- a/doc.zih.tu-dresden.de/mkdocs.yml
+++ b/doc.zih.tu-dresden.de/mkdocs.yml
@@ -53,6 +53,8 @@ nav:
       - Debuggers: software/Debuggers.md
       - MPI Error Detection: software/MPIUsageErrorDetection.md
       - Score-P: software/ScoreP.md
+      - PAPI Library: software/PapiLibrary.md
+      - Perf Tools: software/PerfTools.md
   - Data Management:
     - Overview: data_management/DataManagement.md
     - Announcement of Quotas: data_management/AnnouncementOfQuotas.md
diff --git a/twiki2md/root/DebuggingTools/MPIUsageErrorDetection.md b/twiki2md/root/DebuggingTools/MPIUsageErrorDetection.md
deleted file mode 100644
index 2b72f35df97902b323fb5cb589387acbb52cd6d6..0000000000000000000000000000000000000000
--- a/twiki2md/root/DebuggingTools/MPIUsageErrorDetection.md
+++ /dev/null
@@ -1,81 +0,0 @@
-# Introduction
-
-MPI as the de-facto standard for parallel applications of the the
-massage passing paradigm offers more than one hundred different API
-calls with complex restrictions. As a result, developing applications
-with this interface is error prone and often time consuming. Some usage
-errors of MPI may only manifest on some platforms or some application
-runs, which further complicates the detection of these errors. Thus,
-special debugging tools for MPI applications exist that automatically
-check whether an application conforms to the MPI standard and whether
-its MPI calls are safe. At ZIH, we maintain and support MUST for this
-task, though different types of these tools exist (see last section).
-
-# MUST
-
-MUST checks if your application conforms to the MPI standard and will
-issue warnings if there are errors or non-portable constructs. You can
-apply MUST without modifying your source code, though we suggest to add
-the debugging flag "-g" during compilation.
-
--   [MUST introduction slides](%ATTACHURL%/parallel_debugging_must.pdf)
-
-## Setup and Modules
-
-You need to load a module file in order to use MUST. Each MUST
-installation uses a specific combination of a compiler and an MPI
-library, make sure to use a combination that fits your needs. Right now
-we only provide a single combination on each system, contact us if you
-need further combinations. You can query for the available modules with:
-
-    module avail must
-
-You can load a MUST module as follows:
-
-    module load must
-
-Besides loading a MUST module, no further changes are needed during
-compilation and linking.
-
-## Running with MUST
-
-In order to run with MUST you need to replace the mpirun/mpiexec command
-with mustrun:
-
-    mustrun -np <NPROC> ./a.out
-
-Besides replacing the mpiexec command you need to be aware that **MUST
-always allocates an extra process**. I.e. if you issue a "mustrun -np 4
-./a.out" then MUST will start 5 processes instead. This is usually not
-critical, however in batch jobs **make sure to allocate space for this
-extra task**.
-
-Finally, MUST assumes that your application may crash at any time. To
-still gather correctness results under this assumption is extremely
-expensive in terms of performance overheads. Thus, if your application
-does not crashs, you should add an "--must:nocrash" to the mustrun
-command to make MUST aware of this knowledge. Overhead is drastically
-reduced with this switch.
-
-## Result Files
-
-After running your application with MUST you will have its output in the
-working directory of your application. The output is named
-"MUST_Output.html". Open this files in a browser to anlyze the results.
-The HTML file is color coded: Entries in green represent notes and
-useful information. Entries in yellow represent warnings, and entries in
-red represent errors.
-
-# Other MPI Correctness Tools
-
-Besides MUST, there exist further MPI correctness tools, these are:
-
--   Marmot (predecessor of MUST)
--   MPI checking library of the Intel Trace Collector
--   ISP (From Utah)
--   Umpire (predecessor of MUST)
-
-ISP provides a more thorough deadlock detection as it investigates
-alternative execution paths, however its overhead is drastically higher
-as a result. Contact our support if you have a specific use cases that
-needs one of these tools.
diff --git a/twiki2md/root/PerformanceTools/PapiLibrary.md b/twiki2md/root/PerformanceTools/PapiLibrary.md
deleted file mode 100644
index 5516c7e310d68e7509384ef081b89816ebe41dde..0000000000000000000000000000000000000000
--- a/twiki2md/root/PerformanceTools/PapiLibrary.md
+++ /dev/null
@@ -1,55 +0,0 @@
-# PAPI Library
-
-Related work: [PAPI
-documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page),
-[Intel 64 and IA-32 Architectures Software Developers Manual (Per
-thread/per core
-PMCs)](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf)
-
-Additional sources for **Sandy Bridge** Processors: [Intel Xeon
-Processor E5-2600 Product Family Uncore Performance Monitoring Guide
-(Uncore
-PMCs)](http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf)
-
-Additional sources for **Haswell** Processors: [Intel Xeon Processor
-E5-2600 v3 Product Family Uncore Performance Monitoring Guide (Uncore
-PMCs) - Download
-link](http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-uncore-performance-monitoring.html)
-
-## Introduction
-
-PAPI enables users and developers to monitor how their code performs on
-a specific architecture. To do so, they can register events that are
-counted by the hardware in performance monitoring counters (PMCs). These
-counters relate to a specific hardware unit, for example a processor
-core. Intel Processors used on taurus support eight PMCs per processor
-core. As the partitions on taurus are run with HyperThreading Technology
-(HTT) enabled, each CPU can use four of these. In addition to the **four
-core PMCs**, Intel processors also support **a number of uncore PMCs**
-for non-core resources. (see the uncore manuals listed in top of this
-documentation).
-
-## Usage
-
-[Score-P](ScoreP) supports per-core PMCs. To include uncore PMCs into
-Score-P traces use the software module **scorep-uncore/2016-03-29**on
-the Haswell partition. If you do so, disable profiling to include the
-uncore measurements. This metric plugin is available at
-[github](https://github.com/score-p/scorep_plugin_uncore/).
-
-If you want to use PAPI directly in your software, load the latest papi
-module, which establishes the environment variables **PAPI_INC**,
-**PAPI_LIB**, and **PAPI_ROOT**. Have a look at the [PAPI
-documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page) for
-details on the usage.
-
-## Related Software
-
-[Score-P](ScoreP)
-
-[Linux Perf Tools](PerfTools)
-
-If you just need a short summary of your job, you might want to have a
-look at [perf stat](PerfTools)
-
--- Main.UlfMarkwardt - 2012-10-09
diff --git a/twiki2md/root/PerformanceTools/PerfTools.md b/twiki2md/root/PerformanceTools/PerfTools.md
deleted file mode 100644
index 1a9fd98517e5274f88be0f55e7d6002f907d0b5c..0000000000000000000000000000000000000000
--- a/twiki2md/root/PerformanceTools/PerfTools.md
+++ /dev/null
@@ -1,236 +0,0 @@
-(This page is under construction)
-
-# Introduction
-
-perf consists of two parts: the kernel space implementation and the
-userland tools. This wiki entry focusses on the latter. These tools are
-installed on taurus, and others and provides support for sampling
-applications and reading performance counters.
-
-# Installation
-
-On taurus load the module via
-
-    module load perf/r31
-
-# Configuration
-
-Admins can change the behaviour of the perf tools kernel part via the
-following interfaces
-
-|                                             |                                                                                                                                   |
-|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
-| File Name                                   | Description                                                                                                                       |
-| /proc/sys/kernel/perf_event_max_sample_rate | describes the maximal sample rate for perf record and native access. This is used to limit the performance influence of sampling. |
-| /proc/sys/kernel/perf_event_mlock_kb        | defines the number of pages that can be used for sampling via perf record or the native interface                                 |
-| /proc/sys/kernel/perf_event_paranoid        | defines access rights:                                                                                                            |
-|                                             | -1 - Not paranoid at all                                                                                                          |
-|                                             | 0 - Disallow raw tracepoint access for unpriv                                                                                     |
-|                                             | 1 - Disallow cpu events for unpriv                                                                                                |
-|                                             | 2 - Disallow kernel profiling for unpriv                                                                                          |
-| /proc/sys/kernel/kptr_restrict              | Defines whether the kernel address maps are restricted                                                                            |
-
-# perf stat
-
-`perf stat` provides a general performance statistic for a program. You
-can attach to a running (own) process, monitor a new process or monitor
-the whole system. The latter is only available for root user, as the
-performance data can provide hints on the internals of the application.
-
-## For users
-
-Run `perf stat <Your application>`. This will provide you with a general
-overview on some counters.
-
-    Performance counter stats for 'ls':= 
-              2,524235 task-clock                #    0,352 CPUs utilized           
-                    15 context-switches          #    0,006 M/sec                   
-                     0 CPU-migrations            #    0,000 M/sec                   
-                   292 page-faults               #    0,116 M/sec                   
-             6.431.241 cycles                    #    2,548 GHz                     
-             3.537.620 stalled-cycles-frontend   #   55,01% frontend cycles idle    
-             2.634.293 stalled-cycles-backend    #   40,96% backend  cycles idle    
-             6.157.440 instructions              #    0,96  insns per cycle         
-                                                 #    0,57  stalled cycles per insn 
-             1.248.527 branches                  #  494,616 M/sec                   
-                34.044 branch-misses             #    2,73% of all branches         
-           0,007167707 seconds time elapsed 
-
--   Generally speaking **task clock** tells you how parallel your job
-    has been/how many cpus were used.
--   **[Context switches](http://en.wikipedia.org/wiki/Context_switch)**
-    are an information about how the scheduler treated the application.
-    Also interrupts cause context switches. Lower is better.
--   **CPU migrations** are an information on whether the scheduler moved
-    the application between cores. Lower is better. Please pin your
-    programs to CPUs to avoid migrations. This can be done with
-    environment variables for OpenMP and MPI, with `likwid-pin`,
-    `numactl` and `taskset`.
--   **[Page faults](http://en.wikipedia.org/wiki/Page_fault)** describe
-    how well the Translation Lookaside Buffers fit for the program.
-    Lower is better.
--   **Cycles** tells you how many CPU cycles have been spent in
-    executing the program. The normalized value tells you the actual
-    average frequency of the CPU(s) running the application.
--   **stalled-cycles-...** tell you how well the processor can execute
-    your code. Every stall cycle is a waste of CPU time and energy. The
-    reason for such stalls can be numerous. It can be wrong branch
-    predictions, cache misses, occupation of CPU resources by long
-    running instructions and so on. If these stall cycles are to high
-    you might want to review your code.
--   The normalized **instructions** number tells you how well your code
-    is running. More is better. Current x86 CPUs can run 3 to 5
-    instructions per cycle, depending on the instruction mix. A count of
-    less then 1 is not favorable. In such a case you might want to
-    review your code.
--   **branches** and **branch-misses** tell you how many jumps and loops
-    are performed in your code. Correctly
-    [predicted](http://en.wikipedia.org/wiki/Branch_prediction) branches
-    should not hurt your performance, **branch-misses** on the other
-    hand hurt your performance very badly and lead to stall cycles.
--   Other events can be passed with the `-e` flag. For a full list of
-    predefined events run `perf list`
--   PAPI runs on top of the same infrastructure as perf stat, so you
-    might want to use their meaningful event names. Otherwise you can
-    use raw events, listed in the processor manuals. (
-    [Intel](http://download.intel.com/products/processor/manual/325384.pdf),
-    [AMD](http://support.amd.com/us/Processor_TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf))
-
-## For admins
-
-Administrators can run a system wide performance statistic, e.g., with
-`perf stat -a sleep 1` which measures the performance counters for the
-whole computing node over one second.\<span style="font-size: 1em;">
-\</span>
-
-# perf record
-
-`perf record` provides the possibility to sample an application or a
-system. You can find performance issues and hot parts of your code. By
-default perf record samples your program at a 4000 Hz. It records CPU,
-Instruction Pointer and, if you specify it, the call chain. If your code
-runs long (or often) enough, you can find hot spots in your application
-and external libraries. Use **perf report** to evaluate the result. You
-should have debug symbols available, otherwise you won't be able to see
-the name of the functions that are responsible for your load. You can
-pass one or multiple events to define the **sampling event**. \<br />
-**What is a sampling event?** \<br /> Sampling reads values at a
-specific sampling frequency. This frequency is usually static and given
-in Hz, so you have for example 4000 events per second and a sampling
-frequency of 4000 Hz and a sampling rate of 250 microseconds. With the
-sampling event, the concept of a static sampling frequency in time is
-somewhat redefined. Instead of a constant factor in time (sampling rate)
-you define a constant factor in events. So instead of a sampling rate of
-250 microseconds, you have a sampling rate of 10,000 floating point
-operations. \<br /> **Why would you need sampling events?** \<br />
-Passing an event allows you to find the functions that produce cache
-misses, floating point operations, ... Again, you can use events defined
-in `perf list` and raw events. \<br />\<br /> Use the `-g` flag to
-receive a call graph.
-
-## For users
-
-Just run `perf record ./myapp` or attach to a running process.
-
-### Using perf with MPI
-
-Perf can also be used to record data for indivdual MPI processes. This
-requires a wrapper script (perfwrapper) with the following content. Also
-make sure that the wrapper script is executable (chmod +x).
-
-    #!/bin/bash
-    <span style="font-size: 1em;">perf record -o perf.data.$SLURM_JOB_ID.$SLURM_PROCID $@</span>
-
-To start the MPI program type \<span>srun ./perfwrapper ./myapp
-\</span>on your command line. The result will be n independent perf.data
-files that can be analyzed individually with perf report.
-
-## For admins
-
-This tool is very effective, if you want to help users find performance
-problems and hot-spots in their code but also helps to find OS daemons
-that disturb such applications. You would start `perf record -a -g` to
-monitor the whole node.
-
-# perf report
-
-perf report is a command line UI for evaluating the results from perf
-record. It creates something like a profile from the recorded samplings.
-These profiles show you what the most used have been. If you added a
-callchain, it also gives you a callchain profile.\<br /> \*Disclaimer:
-Sampling is not an appropriate way to gain exact numbers. So this is
-merely a rough overview and not guaranteed to be absolutely
-correct.\*\<span style="font-size: 1em;"> \</span>
-
-## On taurus
-
-On taurus, users are not allowed to see the kernel functions. If you
-have multiple events defined, then the first thing you select in
-`perf report` is the type of event. Press right
-
-    Available samples
-    96 cycles
-    11 cache-misse
-
-**Hint: The more samples you have, the more exact is the profile. 96 or
-11 samples is not enough by far.** I repeat the measurement and set
-`-F 50000` to increase the sampling frequency. **Hint: The higher the
-frequency, the higher the influence on the measurement.** If youd'd
-select cycles, you would get such a screen:
-
-    Events: 96  cycles
-    +  49,13%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.0
-    +  34,48%  test_gcc_perf  test_gcc_perf      [.] 
-    +   6,92%  test_gcc_perf  test_gcc_perf      [.] omp_get_thread_num@plt
-    +   5,20%  test_gcc_perf  libgomp.so.1.0.0   [.] omp_get_thread_num
-    +   2,25%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.1
-    +   2,02%  test_gcc_perf  [kernel.kallsyms]  [k] 0xffffffff8102e9ea
-
-Increased sample frequency:
-
-    Events: 7K cycles
-    +  42,61%  test_gcc_perf  test_gcc_perf      [.] p
-    +  40,28%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.0
-    +   6,07%  test_gcc_perf  test_gcc_perf      [.] omp_get_thread_num@plt
-    +   5,95%  test_gcc_perf  libgomp.so.1.0.0   [.] omp_get_thread_num
-    +   4,14%  test_gcc_perf  test_gcc_perf      [.] main.omp_fn.1
-    +   0,69%  test_gcc_perf  [kernel.kallsyms]  [k] 0xffffffff8102e9ea
-    +   0,04%  test_gcc_perf  ld-2.12.so         [.] check_match.12442
-    +   0,03%  test_gcc_perf  libc-2.12.so       [.] printf
-    +   0,03%  test_gcc_perf  libc-2.12.so       [.] vfprintf
-    +   0,03%  test_gcc_perf  libc-2.12.so       [.] __strchrnul
-    +   0,03%  test_gcc_perf  libc-2.12.so       [.] _dl_addr
-    +   0,02%  test_gcc_perf  ld-2.12.so         [.] do_lookup_x
-    +   0,01%  test_gcc_perf  libc-2.12.so       [.] _int_malloc
-    +   0,01%  test_gcc_perf  libc-2.12.so       [.] free
-    +   0,01%  test_gcc_perf  libc-2.12.so       [.] __sigprocmask
-    +   0,01%  test_gcc_perf  libgomp.so.1.0.0   [.] 0x87de
-    +   0,01%  test_gcc_perf  libc-2.12.so       [.] __sleep
-    +   0,01%  test_gcc_perf  ld-2.12.so         [.] _dl_check_map_versions
-    +   0,01%  test_gcc_perf  ld-2.12.so         [.] local_strdup
-    +   0,00%  test_gcc_perf  libc-2.12.so       [.] __execvpe
-
-Now you select the most often sampled function and zoom into it by
-pressing right. If debug symbols are not available, perf report will
-show which assembly instruction is hit most often when sampling. If
-debug symbols are available, it will also show you the source code lines
-for these assembly instructions. You can also go back and check which
-instruction caused the cache misses or whatever event you were passing
-to perf record.
-
-# perf script
-
-If you need a trace of the sampled data, you can use perf script
-command, which by default prints all samples to stdout. You can use
-various interfaces (e.g., python) to process such a trace.
-
-# perf top
-
-perf top is only available for admins, as long as the paranoid flag is
-not changed (see configuration).
-
-It behaves like the top command, but gives you not only an overview of
-the processes and the time they are consuming but also on the functions
-that are processed by these.
-
--- Main.RobertSchoene - 2013-04-29