## Latency and bandwidth benchmarks for multicore and multiprocessor x86-based systems

A set of memory benchmarks for analyzing complex memory hierarchies, tailored to measuring the performance of data exchange between the cores of a multicore processor and between processors. Provided under the BSD license without any warranty. Use at your own risk.

Contact: daniel.molka(at)tu-dresden.de

## Prerequisites

- Linux operating system
- glibc >= 2.6
- 64-bit x86 processors
  - Supposed to work on Intel and AMD processors up to Westmere and Magny-Cours, respectively
  - More recent CPUs might need adaptation of the hardware detection
  - VIA CPUs are currently not supported

## Recommended

- Kernel with hugetlbfs support
- PAPI or PAPI-C to access PMU information
- Power-management-invariant Time Stamp Counter (constant rate)
  - Should be available on all server CPUs (Xeon, Opteron) except dual-core Opterons (K8)
  - Desktop and especially mobile CPUs might report odd results
  - Completely disabling power management in the BIOS can help to get useful results
  - TSC synchronization can be forced with the BENCHIT_KERNEL_TSC_SYNC option of the kernels
    - A system-wide synchronous TSC is recommended for the multiple-* kernels
    - Set BENCHIT_KERNEL_TSC_SYNC to "disabled" if not available

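
A quick way to check for a constant-rate TSC before running anything is to look at the CPU flags Linux reports. The sketch below assumes that the kernel exposes the flags `constant_tsc` and `nonstop_tsc` in /proc/cpuinfo, which is the usual case on x86 Linux; the function name `has_invariant_tsc` is made up for this example and is not part of BenchIT.

```sh
# Succeed if a cpuinfo-style file lists both constant_tsc and nonstop_tsc.
# $1: path to the file (defaults to /proc/cpuinfo)
has_invariant_tsc() (
    cpuinfo=${1:-/proc/cpuinfo}
    # Only the flags line of the first logical CPU is inspected.
    flags=$(grep -m1 '^flags' "$cpuinfo" 2>/dev/null) || return 1
    # Strip the "flags :" prefix and pad with spaces for whole-word matching.
    flags=" ${flags#*:} "
    case "$flags" in
        *" constant_tsc "*) ;;
        *) return 1 ;;
    esac
    case "$flags" in
        *" nonstop_tsc "*) return 0 ;;
        *) return 1 ;;
    esac
)

if has_invariant_tsc; then
    echo "invariant TSC flags found"
else
    echo "invariant TSC flags not found, results may be unreliable"
fi
```

The flags only describe a single CPU's counter. They say nothing about whether the TSCs of different sockets are synchronized, so BENCHIT_KERNEL_TSC_SYNC may still be needed.
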
## Installation

- Download benchit-snapshot.tar.gz
  - Extract the files and run <your-BenchIT-inst-dir>/tools/FIRSTTIME to set up BenchIT
    - Answer the questions carefully, as this affects result file naming
    - Rerun this if you get result files called "unknown_unknown[...].bit"
  - Use the information from <your-BenchIT-inst-dir>/tools/hw_detect/cpuinfo to fill out LOCALDEFS/<your_machine_name>_input_architecture
    - Annoying, but can be very useful later on
    - The more information you add, the easier it is to interpret results when you no longer know which machine they were created on
- Download x86_membench.tar.gz from https://fusionforge.zih.tu-dresden.de/frs/?group_id=885
  - Add the content of the kernel/ and tools/ folders from the archive to your BenchIT directory (omit the tools/ folder if you are using a snapshot of BenchIT's development version)
  - Optionally copy the example output
    - The examples show what the results of each kernel should look like if executed correctly
    - If your results look totally different, don't hesitate to ask for assistance in setting up PARAMETERS properly
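
The copy step for the x86_membench archive could be scripted roughly as follows. This is only a sketch: it assumes the archive has already been extracted and that kernel/ and tools/ sit directly in the extraction directory. The function name `merge_membench` and the `skip-tools` switch are invented for this example.

```sh
# Copy kernel/ (and optionally tools/) from an extracted x86_membench
# archive into a BenchIT directory.
# $1: directory the archive was extracted to (layout is an assumption)
# $2: BenchIT installation directory
# $3: "skip-tools" to leave tools/ untouched (development snapshots)
merge_membench() (
    src=$1
    dst=$2
    mode=${3:-with-tools}
    [ -d "$src/kernel" ] || { echo "no kernel/ folder in $src" >&2; return 1; }
    mkdir -p "$dst/kernel"
    # "dir/." copies the folder's content, including hidden files.
    cp -R "$src/kernel/." "$dst/kernel/"
    if [ "$mode" != "skip-tools" ] && [ -d "$src/tools" ]; then
        mkdir -p "$dst/tools"
        cp -R "$src/tools/." "$dst/tools/"
    fi
)

# Example calls (paths are placeholders):
#   merge_membench /tmp/x86_membench "$HOME/benchit"
#   merge_membench /tmp/x86_membench "$HOME/benchit" skip-tools
```
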
- Run the hardware detection tool to check whether your system is supported properly
  - cd <your-BenchIT-inst-dir>/tools/hw_detect
  - sh compile.sh
  - ./cpuinfo
  - If the output reports a wrong clock frequency, a wrong topology, or odd cache and TLB information, the benchmark results are likely to be wrong
  - If the output looks OK, chances are good that the benchmarks will work correctly

## Usage

### DISABLE DYNAMIC FREQUENCY SCALING

- Use the cpufreq governor "performance"
  - echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor (as root)
- Disable Intel Turbo Boost if the processor has this feature (e.g. Core i7, Xeon 5500 series, Xeon E3/E5/E7)
  - Disable it in the BIOS, or force the second-highest available frequency if Turbo is enabled in the BIOS
    - cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies lists the frequencies
    - Limit /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq if appropriate
- Disable AMD Turbo Core if the processor has this feature (e.g. Phenom X6, Opteron 6200 series)
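
Forcing the second-highest frequency as described above can be sketched like this. The helper names `second_highest_freq` and `limit_max_freq` are made up for this example. The sketch assumes the cpufreq driver exposes scaling_available_frequencies as a whitespace-separated list of kHz values, which not every driver does.

```sh
# Print the second-highest value from a whitespace-separated list of
# frequencies in kHz (the format of scaling_available_frequencies).
second_highest_freq() {
    printf '%s\n' "$1" | tr -s ' \t' '\n\n' | grep -E '^[0-9]+$' \
        | sort -rn | uniq | sed -n '2p'
}

# Write one value to scaling_max_freq of every CPU below a sysfs root.
# $1: frequency in kHz
# $2: sysfs CPU root (defaults to the real one, which requires root)
limit_max_freq() (
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        if [ -w "$f" ]; then
            echo "$1" > "$f"
        fi
    done
)

# Demonstration with a fixed list; prints 2667000.
second_highest_freq "2668000 2667000 2533000 1600000"

# On a real system (as root):
#   freqs=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies)
#   limit_max_freq "$(second_highest_freq "$freqs")"
```

The loop writes to each file separately because a shell redirection cannot target several files at once.
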
- If possible, set up hugepages (e.g. with hugeadm) and mount hugetlbfs
  - More than 100 2 MiB pages per core are recommended (reduce BENCHIT_KERNEL_MAX if there are not enough 2 MiB pages available)
  - Set BENCHIT_KERNEL_HUGEPAGES to "0" if hugetlbfs is not mounted
  - If hugeadm is not available, execute as root:
    - Create hugepages:
      - mkdir -p /mnt/huge
      - echo <num_pages> >/proc/sys/vm/nr_hugepages
      - mount -t hugetlbfs nodev /mnt/huge
      - chmod 777 /mnt/huge
    - Free hugepages:
      - umount /mnt/huge
      - echo 0 >/proc/sys/vm/nr_hugepages
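
To pick a value for <num_pages>, the per-core recommendation above can be turned into a small calculation. The sketch below is an illustration only: the helper names are invented, and the default of 128 pages per core is an assumed value chosen simply because it lies above the recommended minimum of 100.

```sh
# Number of 2 MiB pages to reserve for a given core count.
# $1: number of cores
# $2: pages per core (default 128, an assumption of this sketch)
hugepages_needed() {
    echo $(( $1 * ${2:-128} ))
}

# Print the HugePages_Free value from a meminfo-style file.
# $1: path (defaults to /proc/meminfo)
hugepages_free() {
    awk '/^HugePages_Free:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Demonstration for an 8-core machine; prints 1024.
hugepages_needed 8

# On a real system (as root); note that getconf counts logical CPUs:
#   echo "$(hugepages_needed "$(getconf _NPROCESSORS_ONLN)")" >/proc/sys/vm/nr_hugepages
#   hugepages_free   # check how many pages the kernel could actually reserve
```

The kernel may reserve fewer pages than requested when memory is fragmented, so comparing the request with HugePages_Free afterwards is worthwhile.
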
- Important parameters (edit the PARAMETERS files in BENCHITROOT/kernel/arch_x86_64/memory_{bandwidth|latency}/C/pthread/{0|SSE2}/<benchmark_name>)
  - BENCHIT_KERNEL_{MIN|MAX|STEPS}
    - Configure the data set sizes
    - Data set sizes are chosen automatically so that they are suitable for display on a logarithmic scale
    - Alternatively, use BENCHIT_KERNEL_PROBLEMLIST to select specific data set sizes
  - BENCHIT_KERNEL_CPU_LIST
    - Select the cores to run the benchmark on
    - It is suggested to run each benchmark with all available cores once and then remove cores with redundant results
  - BENCHIT_KERNEL_USE_MODE
    - Choose the initial coherency state of the data
    - It is suggested to try all settings supported by your hardware
    - However, do not use "Owned" on Intel or "Forward" on AMD processors, as these fall back to "Modified" or "Shared", respectively
  - BENCHIT_KERNEL_SHARE_CPU
    - Required for the USE_MODE settings shared, forward, and owned
    - Should be as far away (maximum number of QPI/HT hops) from the first selected CPU as possible
    - MUST NOT BE INCLUDED in CPU_LIST
  - BENCHIT_KERNEL_FLUSH_L{1|2|3}
    - Enable/disable cache flushes
    - Disabled by default; sometimes necessary to clearly distinguish between cache levels
  - BENCHIT_KERNEL_ALLOC
    - Select the NUMA behaviour
  - BENCHIT_KERNEL_{HUGEPAGES|HUGEPAGE_DIR}
    - Set up hugetlbfs usage
    - STRONGLY RECOMMENDED: if you have hugepages, use them
    - Latency results without hugepages are basically useless (memory latency increases monotonically as the average page table walk penalty rises)
  - BENCHIT_KERNEL_INSTRUCTION
    - Bandwidth kernels only
    - Select the instruction used for data transfers (mov, movdqa, movntdq, …)
  - Additional single-r1w1 parameters
    - BENCHIT_KERNEL_LAYOUT - alignment of read and write buffers in memory
    - Separate USE_MODE for the read and write streams
    - BENCHIT_KERNEL_{READ|WRITE}_LOCAL - tie one stream to local memory (producer-consumer behavior)
    - BENCHIT_KERNEL_METHOD - select the operation performed on the read stream before writing
  - Default values are strongly recommended for the remaining parameters
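
Two of the parameter rules above can be illustrated with small helpers. `log_sizes` shows what sizes spaced evenly on a logarithmic axis look like; it is a generic geometric progression and not necessarily how BenchIT derives its sizes from MIN, MAX, and STEPS. `share_cpu_ok` checks that a SHARE_CPU value is not contained in a CPU list; it assumes a plain comma-separated list of CPU ids, which may not cover every format the PARAMETERS file accepts. Both function names are invented for this sketch.

```sh
# Print $3 sizes spaced evenly on a logarithmic axis between $1 and $2.
# Generic geometric progression, not BenchIT's own routine.
log_sizes() {
    awk -v min="$1" -v max="$2" -v steps="$3" 'BEGIN {
        if (steps < 2) { printf "%.0f\n", min; exit }
        for (i = 0; i < steps; i++)
            printf "%.0f\n", min * exp(log(max / min) * i / (steps - 1))
    }'
}

# Succeed if CPU id $1 does not occur in the comma-separated list $2.
# The plain comma-separated format is an assumption of this sketch.
share_cpu_ok() {
    case ",$2," in
        *",$1,"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Three sizes between 1 KiB and 4 KiB; prints 1024, 2048 and 4096.
log_sizes 1024 4096 3

# CPU 7 is not part of the list, so this combination is acceptable.
if share_cpu_ok 7 "0,1,2,3"; then
    echo "SHARE_CPU 7 is not in CPU_LIST"
fi
```

Each size is a constant factor larger than the previous one, which is why the points appear equidistant on a logarithmic x-axis.
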
- Compile and run like any other BenchIT kernel: COMPILE.SH/RUN.SH from the command line or via the GUI (recommended)
- The standard comment in the result plots is the date; this can be changed by editing the comment in the Config tab. If you change it to "<comment>", a summary of the parameters used for this measurement will be displayed
- Use BenchIT's mixer feature to create combined graphs from multiple measurements

## Publications

- D. Molka, D. Hackenberg, R. Schöne and M. S. Müller, Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System, In Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques (PACT'09), pages 261-270, IEEE, 2009
  - http://ieeexplore.ieee.org/search/srchabstract.jsp?tp=&arnumber=5260544
  - [2009_Molka_PACT_slides.pdf](uploads/6b20e7a5fb750adf31354d808e522bcd/2009_Molka_PACT_slides.pdf)
- D. Hackenberg, D. Molka and W. E. Nagel, Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems, In Proceedings of the 42nd International Symposium on Microarchitecture (MICRO'09), pages 413-422, ACM, 2009
  - http://portal.acm.org/citation.cfm?id=1669165&dl=GUIDE&coll=GUIDE&CFID=81286494&CFTOKEN=18541064
- D. Molka, D. Hackenberg, R. Schöne and M. S. Müller, Characterizing the Energy Consumption of Data Transfers and Arithmetic Operations on x86-64 Processors, In Proceedings of the 1st International Green Computing Conference (IGCC'10), pages 123-133, IEEE, 2010
  - http://ieeexplore.ieee.org/search/srchabstract.jsp?tp=&arnumber=5598316
  - [2010_Molka_IGCC_slides.pdf](uploads/ab15fd88c57a47552813ff8ec468cc44/2010_Molka_IGCC_slides.pdf)
- D. Molka, R. Schöne, D. Hackenberg and M. S. Müller, Memory Performance and SPEC OpenMP Scalability on Quad-Socket x86_64 Systems, In Proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP'11), October 24-26, 2011, Melbourne, Australia
  - http://www.springerlink.com/content/2k52134047538083/
  - [2011_Molka_ICA3PP.pdf](uploads/ec9e891b75a884435763eb584ae35315/2011_Molka_ICA3PP.pdf)
- R. Schöne, D. Hackenberg and D. Molka, Simultaneous Multithreading on x86_64 Systems: An Energy Efficiency Evaluation, In Proceedings of the 4th Workshop on Power-Aware Computing and Systems (HotPower'11), October 23-26, 2011, Cascais, Portugal
  - http://dl.acm.org/citation.cfm?doid=2039252.2039262
- R. Schöne, D. Hackenberg and D. Molka, Memory performance at reduced CPU clock speeds: an analysis of current x86_64 processors, In Proceedings of the 2012 USENIX conference on Power-Aware Computing and Systems (HotPower'12), October 7, 2012, Hollywood, USA
  - http://dl.acm.org/citation.cfm?id=2387869.2387878
- D. Molka, D. Hackenberg and R. Schöne, Main Memory and Cache Performance of Intel Sandy Bridge and AMD Bulldozer, In Proceedings of the 2014 ACM SIGPLAN workshop on Memory Systems Performance and Correctness (MSPC'14), ACM, 2014
  - http://dl.acm.org/citation.cfm?doid=2618128.2618129
  - [2014_Molka_MSPC.pdf](uploads/9e4f49599e376733e0e366ee6190a3b6/2014_Molka_MSPC.pdf)
- D. Molka, D. Hackenberg, R. Schöne and W. E. Nagel, Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture, In Proceedings of the 44th International Conference on Parallel Processing (ICPP'15), IEEE, 2015
  - http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7349629
  - [2015_Molka_ICPP.pdf](uploads/8eeda80a4c6919030ac7fb0191e189e8/2015_Molka_ICPP.pdf)