Commit c5713b05 authored by Martin Schroschk

Brief review w.r.t. markdown

parent c696d145

## Hardware

- Slurm partition: `romeo`
- Module architecture: `rome`
- 192 nodes `taurusi[7001-7192]`, each:
    - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Simultaneous Multithreading (SMT)
    - 512 GB RAM
    - 200 GB SSD disk mounted on `/tmp`

## Usage

There is a total of 128 physical cores in each node. SMT is also active, so in total, 256 logical
cores are available per node.

!!! note

    Multithreading is disabled per default in a job. To make use of it, include the Slurm parameter
    `--hint=multithread` in your job script or command line, or set the environment variable
    `SLURM_HINT=multithread` before job submission.

Each node brings 512 GB of main memory, so you can request roughly 1972 MB per logical core (using
`--mem-per-cpu`). Note that you will always get the memory for the logical core sibling too, even if
you do not intend to use SMT.

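As an illustration of both options together (a sketch only; the executable name `my_app` is a
placeholder and not part of this page), an interactive launch using SMT and the per-core memory
limit could look like:

```console
marie@login$ srun --partition=romeo --ntasks=1 --cpus-per-task=256 --hint=multithread --mem-per-cpu=1972M ./my_app
```
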
!!! note

    If you are running a job here with only ONE process (maybe multiple cores), please explicitly
    set the option `-n 1`!

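A brief sketch of such a single-process call (the executable name is a placeholder, not from this
page):

```console
marie@login$ srun --partition=romeo -n 1 -c 64 ./my_threaded_app
```
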
Be aware that software built with Intel compilers and `-x*` optimization flags will not run on those
AMD processors! That's why most older modules built with Intel toolchains are not available on
partition `romeo`.

We provide the script `ml_arch_avail` that can be used to check if a certain module is available on
`rome` architecture.

## Example, running CP2K on Rome

First, check what CP2K modules are available in general:
`module spider CP2K` or `module avail CP2K`.

You will see that there are several different CP2K versions available, built with different
toolchains. Now let's assume you have decided to run at least CP2K version 6, so to check if those
modules are built for rome, use:

```console
marie@login$ ml_arch_avail CP2K/6
CP2K/6.1-foss-2019a: haswell, rome
CP2K/6.1-foss-2019a-spglib: haswell, rome
CP2K/6.1-intel-2018a: sandy, haswell
CP2K/6.1-intel-2018a-spglib: haswell
```

There you will see that only the modules built with toolchain `foss` are available on architecture
`rome`, not the ones built with `intel`. So you can load, e.g., `ml CP2K/6.1-foss-2019a`.

Then, when writing your batch script, you have to specify the partition `romeo`. Also, if e.g. you
wanted to use an entire Rome node (no SMT) and fill it with MPI ranks, it could look like this:

```bash
#!/bin/bash

#SBATCH --partition=romeo       # the Rome nodes
#SBATCH --nodes=1               # an entire node
#SBATCH --ntasks-per-node=128   # one MPI rank per physical core (no SMT)

srun cp2k.popt input.inp
```

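If you wanted SMT and one rank on every logical core instead, a variant of this script (a sketch
combining it with the `--hint=multithread` note above, not taken verbatim from this page) might
look like:

```bash
#!/bin/bash

#SBATCH --partition=romeo
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=256   # one MPI rank per logical core
#SBATCH --hint=multithread      # SMT is disabled per default

srun cp2k.popt input.inp
```
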
## Using the Intel Toolchain on Rome

Currently, we have only newer toolchains starting at `intel/2019b` installed for the Rome nodes.
Even though they have AMD CPUs, you can still use the Intel compilers on there and they don't even
create bad-performing code. When using the Intel Math Kernel Library (MKL) up to version 2019,
though, you should set the following environment variable to make sure that AVX2 is used:

```bash
export MKL_DEBUG_CPU_TYPE=5
```

Without it, the MKL does a CPUID check and disables AVX2/FMA on non-Intel CPUs, leading to much
worse performance.

!!! note

    In version 2020, Intel has removed this environment variable and added separate Zen codepaths
    to the library. However, they are still incomplete and do not cover every BLAS function. Also,
    the Intel AVX2 codepaths still seem to provide somewhat better performance, so a new workaround
    would be to overwrite the `mkl_serv_intel_cpu_true` symbol with a custom function:

```c
// fakeintel.c: make MKL believe it is running on an Intel CPU
int mkl_serv_intel_cpu_true() {
    return 1;
}
```

Compile this into a shared library and preload it:

```console
marie@login$ gcc -shared -fPIC -o libfakeintel.so fakeintel.c
marie@login$ export LD_PRELOAD=libfakeintel.so
```

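Keep in mind that the dynamic linker resolves `LD_PRELOAD` entries without a slash via the regular
library search paths, so the preload may silently fail if the library is not found there; a hedged
suggestion (not part of the original instructions) is to reference the library by its full path:

```console
marie@login$ export LD_PRELOAD=$PWD/libfakeintel.so
```
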
As for compiler optimization flags, `-xHOST` does not seem to produce best-performing code in every
case on Rome. You might want to try `-mavx2 -fma` instead.

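For example, a compile line with these flags could look like the following (a sketch; the source
and binary names `my_app.c`/`my_app` are placeholders):

```console
marie@login$ icc -O2 -mavx2 -fma -o my_app my_app.c
```
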
### Intel MPI

We have seen only half the theoretical peak bandwidth via InfiniBand between two nodes, whereas
OpenMPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
`romeo` if your application heavily relies on MPI communication, until this issue is resolved.

...
Amdahl's
analytics
anonymized
APIs
AVX
BeeGFS
benchmarking
BLAS
...
Chemnitz
citable
conda
CPU
CPUID
CPUs
css
CSV
...
FFTW
filesystem
filesystems
Flink
FMA
foreach
Fortran
Gaussian
...
mpifort
mpirun
multicore
multithreaded
Multithreading
NAMD
natively
NCCL
...
PowerAI
ppc
Preload
preloaded
preloading
PSOCK
Pthreads
pymdownx
...
Theano
tmp
todo
ToDo
toolchain
toolchains
tracefile
tracefiles
transferability
...