diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
index a6cdfba8bd47659bc3a14473cad74c10b73089d0..57ab511938f3eb515b9e38ca831e91cede692418 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
@@ -2,50 +2,48 @@
 
 ## Hardware
 
-- Slurm partiton: romeo
-- Module architecture: rome
-- 192 nodes taurusi[7001-7192], each:
-  - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, MultiThreading
+- Slurm partition: `romeo`
+- Module architecture: `rome`
+- 192 nodes `taurusi[7001-7192]`, each:
+  - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Simultaneous Multithreading (SMT)
   - 512 GB RAM
-  - 200 GB SSD disk mounted on /tmp
+  - 200 GB SSD disk mounted on `/tmp`
 
 ## Usage
 
-There is a total of 128 physical cores in each
-node. SMT is also active, so in total, 256 logical cores are available
-per node.
+There is a total of 128 physical cores in each node. SMT is also active, so in total, 256 logical
+cores are available per node.
 
 !!! note
 
-    Multithreading is disabled per default in a job. To make use of it
-    include the Slurm parameter `--hint=multithread` in your job script
-    or command line, or set
-    the environment variable `SLURM_HINT=multithread` before job submission.
-Each node brings 512 GB of main memory, so you can request roughly
-1972MB per logical core (using --mem-per-cpu). Note that you will always
-get the memory for the logical core sibling too, even if you do not
-intend to use SMT.
+    Multithreading is disabled by default in a job. To make use of it, include the Slurm parameter
+    `--hint=multithread` in your job script or command line, or set the environment variable
+    `SLURM_HINT=multithread` before job submission.
+
+Each node has 512 GB of main memory, so you can request roughly 1972 MB per logical core (using
+`--mem-per-cpu`). Note that you will always get the memory for the logical core sibling too, even if
+you do not intend to use SMT.
 
 !!! note
 
-    If you are running a job here with only ONE process (maybe
-    multiple cores), please explicitly set the option `-n 1` !
-Be aware that software built with Intel compilers and `-x*` optimization
-flags will not run on those AMD processors! That's why most older
-modules built with intel toolchains are not available on **romeo**.
+    If you are running a job here with only ONE process (maybe multiple cores), please explicitly
+    set the option `-n 1`!
+
+Be aware that software built with Intel compilers and `-x*` optimization flags will not run on those
+AMD processors! That's why most older modules built with Intel toolchains are not available on
+partition `romeo`.
 
-We provide the script: `ml_arch_avail` that you can use to check if a
-certain module is available on rome architecture.
+We provide the script `ml_arch_avail` that can be used to check if a certain module is available on
+the `rome` architecture.
 
 ## Example, running CP2K on Rome
 
 First, check what CP2K modules are available in general:
 `module load spider CP2K` or `module avail CP2K`.
 
-You will see that there are several different CP2K versions avail, built
-with different toolchains. Now let's assume you have to decided you want
-to run CP2K version 6 at least, so to check if those modules are built
-for rome, use:
+You will see that there are several different CP2K versions available, built with different
+toolchains. Now let's assume you have decided to run at least CP2K version 6, so to check if those
+modules are built for the `rome` architecture, use:
 
 ```console
 marie@login$ ml_arch_avail CP2K/6
@@ -55,13 +53,11 @@ CP2K/6.1-intel-2018a: sandy, haswell
 CP2K/6.1-intel-2018a-spglib: haswell
 ```
 
-There you will see that only the modules built with **foss** toolchain
-are available on architecture "rome", not the ones built with **intel**.
-So you can load e.g. `ml CP2K/6.1-foss-2019a`.
+There you will see that only the modules built with toolchain `foss` are available on architecture
+`rome`, not the ones built with `intel`. So you can load, e.g., `ml CP2K/6.1-foss-2019a`.
 
-Then, when writing your batch script, you have to specify the **romeo**
-partition. Also, if e.g. you wanted to use an entire ROME node (no SMT)
-and fill it with MPI ranks, it could look like this:
+Then, when writing your batch script, you have to specify the partition `romeo`. Also, if e.g. you
+wanted to use an entire Rome node (no SMT) and fill it with MPI ranks, it could look like this:
 
 ```bash
 #!/bin/bash
@@ -73,27 +69,26 @@ and fill it with MPI ranks, it could look like this:
 srun cp2k.popt input.inp
 ```
 
-## Using the Intel toolchain on Rome
+## Using the Intel Toolchain on Rome
 
-Currently, we have only newer toolchains starting at `intel/2019b`
-installed for the Rome nodes. Even though they have AMD CPUs, you can
-still use the Intel compilers on there and they don't even create
-bad-performing code. When using the MKL up to version 2019, though,
-you should set the following environment variable to make sure that AVX2
-is used:
+Currently, we have only newer toolchains starting at `intel/2019b` installed for the Rome nodes.
+Even though they have AMD CPUs, you can still use the Intel compilers there and they don't even
+create badly performing code. When using the Intel Math Kernel Library (MKL) up to version 2019,
+though, you should set the following environment variable to make sure that AVX2 is used:
 
 ```bash
 export MKL_DEBUG_CPU_TYPE=5
 ```
 
-Without it, the MKL does a CPUID check and disables AVX2/FMA on
-non-Intel CPUs, leading to much worse performance.
+Without it, the MKL does a CPUID check and disables AVX2/FMA on non-Intel CPUs, leading to much
+worse performance.
+
 !!! note
-    In version 2020, Intel has removed this environment variable and added separate Zen
-    codepaths to the library. However, they are still incomplete and do not
-    cover every BLAS function. Also, the Intel AVX2 codepaths still seem to
-    provide somewhat better performance, so a new workaround would be to
-    overwrite the `mkl_serv_intel_cpu_true` symbol with a custom function:
+
+    In version 2020, Intel removed this environment variable and added separate Zen codepaths to
+    the library. However, they are still incomplete and do not cover every BLAS function. Also, the
+    Intel AVX2 codepaths still seem to provide somewhat better performance, so a new workaround
+    would be to overwrite the `mkl_serv_intel_cpu_true` symbol with a custom function:
 
 ```c
 int mkl_serv_intel_cpu_true() {
@@ -108,13 +103,19 @@ marie@login$ gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 marie@login$ export LD_PRELOAD=libfakeintel.so
 ```
 
-As for compiler optimization flags, `-xHOST` does not seem to produce
-best-performing code in every case on Rome. You might want to try
-`-mavx2 -fma` instead.
+As for compiler optimization flags, `-xHOST` does not seem to produce best-performing code in every
+case on Rome. You might want to try `-mavx2 -fma` instead.
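+
+For example, an Intel compiler call with these flags might look like the following sketch, where
+`example.c` is just a placeholder for your own source file:
+
+```console
+marie@login$ ml intel/2019b
+marie@login$ icc -O2 -mavx2 -fma -o example example.c
+```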
 
 ### Intel MPI
 
-We have seen only half the theoretical peak bandwidth via Infiniband
-between two nodes, whereas OpenMPI got close to the peak bandwidth, so
-you might want to avoid using Intel MPI on romeo if your application
-heavily relies on MPI communication until this issue is resolved.
+We have seen only half the theoretical peak bandwidth via Infiniband between two nodes, whereas
+OpenMPI got close to the peak bandwidth. Until this issue is resolved, you might want to avoid
+using Intel MPI on partition `romeo` if your application heavily relies on MPI communication.
diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell
index c8b2c530bf1ddad273c373899e9c40387e24216b..744930ee80302a6f7192744e4f2030c035ec41e2 100644
--- a/doc.zih.tu-dresden.de/wordlist.aspell
+++ b/doc.zih.tu-dresden.de/wordlist.aspell
@@ -6,6 +6,7 @@ Amdahl's
 analytics
 anonymized
 APIs
+AVX
 BeeGFS
 benchmarking
 BLAS
@@ -22,6 +23,7 @@ Chemnitz
 citable
 conda
 CPU
+CPUID
 CPUs
 css
 CSV
@@ -56,6 +58,7 @@ FFTW
 filesystem
 filesystems
 Flink
+FMA
 foreach
 Fortran
 Gaussian
@@ -130,6 +133,7 @@ mpifort
 mpirun
 multicore
 multithreaded
+Multithreading
 NAMD
 natively
 NCCL
@@ -175,6 +179,7 @@ PowerAI
 ppc
 Preload
 preloaded
+preloading
 PSOCK
 Pthreads
 pymdownx
@@ -236,6 +241,8 @@ Theano
 tmp
 todo
 ToDo
+toolchain
+toolchains
 tracefile
 tracefiles
 transferability