Commit 8d3bb533 authored by Martin Schroschk

Review hardware documentation pages

* Resolve issue #434
* Remove hardware documentation section from specific pages of SD Flex,
  Alpha Centauri and Rome nodes, because it is error-prone to have this
  documentation in two places. The specific pages better cover the
  usage.
* Fix typos and adjust whitespaces before units
* Some linting of markdown text
parent e4a01809
Merge requests: !754 "Automated merge from preview to main", !743 "Review hardware documentation pages"
# Alpha Centauri
The multi-GPU sub-cluster "Alpha Centauri" has been installed for AI-related computations (ScaDS.AI).
The up-to-date hardware specification is documented on
[this site](hardware_overview.md#amd-rome-cpus-nvidia-a100).
## Usage
!!! note

    The NVIDIA A100 GPUs may only be used with **CUDA 11** or later. Earlier versions do not
    recognize the new hardware properly. Make sure the software you are using is built with CUDA 11.
### Modules
The easiest way is using the [module system](../software/modules.md).
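For orientation, a minimal sketch of checking for a CUDA 11 toolchain via the module system; the module names and version strings below are illustrative assumptions, not the actual module tree:

```
module avail CUDA         # list the CUDA toolkits provided by the module system
module load CUDA/11.4.1   # hypothetical version string; any CUDA >= 11 works with the A100 GPUs
nvcc --version            # verify that the loaded toolkit reports CUDA 11 or later
```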
@@ -8,80 +8,76 @@ analytics, and artificial intelligence methods with extensive capabilities for e
and performance monitoring provides ideal conditions to achieve the ambitious research goals of the
users and the ZIH.
## Login and Export Nodes
- 4 Login-Nodes `tauruslogin[3-6].hrsk.tu-dresden.de`
    - Each login node is equipped with 2 x Intel(R) Xeon(R) CPU E5-2680 v3 with 24 cores in total @
      2.50 GHz, Multithreading disabled, 64 GB RAM, 128 GB SSD local disk
    - IPs: 141.30.73.\[102-105\]
- 2 Data-Transfer-Nodes `taurusexport[3-4].hrsk.tu-dresden.de`
    - DNS Alias `taurusexport.hrsk.tu-dresden.de`
    - 2 Servers without interactive login, only available via file transfer protocols
      (`rsync`, `ftp`)
    - IPs: 141.30.73.82/83
- Direct access to these nodes is granted via IP whitelisting (contact
hpcsupport@zih.tu-dresden.de) - otherwise use TU Dresden VPN.
!!! warning "Run time limit"

    Any process on login nodes is stopped after 5 minutes.
- Further information on the usage is documented on the site
  [Export Nodes](../data_transfer/export_nodes.md); a short transfer example follows below
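The transfer example referenced above is a hedged sketch using `rsync`; the login name and the remote target path are placeholders, not actual locations:

```
# Copy a local directory to the HPC filesystems via the export nodes (paths are placeholders).
rsync -avP ./input-data/ <zih-login>@taurusexport.hrsk.tu-dresden.de:/path/to/target/
```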
## AMD Rome CPUs + NVIDIA A100
- 34 nodes, each with
    - 8 x NVIDIA A100-SXM4 Tensor Core GPUs
    - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading disabled
    - 1 TB RAM
    - 3.5 TB local storage on NVMe device at `/tmp`
- Hostnames: `taurusi[8001-8034]`
- Slurm partition: `alpha`
- Further information on the usage is documented on the site
  [Alpha Centauri](alpha_centauri.md); a job script sketch follows below
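The job script sketch referenced above assumes only the partition name from the list; all resource values and the executable are placeholders rather than recommendations (see the sub-cluster's page for authoritative usage notes):

```
#!/bin/bash
#SBATCH --partition=alpha       # partition from the list above
#SBATCH --nodes=1
#SBATCH --gres=gpu:1            # request one A100 GPU (placeholder count)
#SBATCH --cpus-per-task=6       # placeholder value
#SBATCH --mem=64G               # placeholder value
#SBATCH --time=01:00:00

srun ./my_gpu_application       # placeholder executable
```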
## Island 7 - AMD Rome CPUs
- 192 nodes, each with
    - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading enabled
    - 512 GB RAM
    - 200 GB local storage on SSD at `/tmp`
- Hostnames: `taurusi[7001-7192]`
- Slurm partition: `romeo`
- Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
## Large SMP System HPE Superdome Flex
- 1 node, with
    - 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20 GHz (28 cores)
    - 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
    - 370 TB of fast NVMe storage available at `/nvme/<projectname>`
- Configured as one single node
- Hostname: `taurussmp8`
- Slurm partition: `julia`
- Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)
## IBM Power9 Nodes for Machine Learning
For machine learning, we have IBM AC922 nodes installed with this configuration:
- 32 nodes, each with
    - 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores)
    - 256 GB RAM DDR4 2666 MHz
    - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
    - NVLINK bandwidth 150 GB/s between GPUs and host
- Hostnames: `taurusml[1-32]`
- Slurm partition: `ml`
## Island 6 - Intel Haswell CPUs
- 612 nodes, each with
    - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
    - 128 GB local storage on SSD
    - Varying amounts of main memory (selected automatically by the batch system for you according
      to your job requirements; see the sketch after this list)
        - 594 nodes with 2.67 GB RAM per core (64 GB in total): `taurusi[6001-6540,6559-6612]`
        - 18 nodes with 10.67 GB RAM per core (256 GB in total): `taurusi[6541-6558]`
- Hostnames: `taurusi[6001-6612]`
- Slurm partition: `haswell`
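The sketch referenced in the list above: a hedged example of how the memory request selects the larger-memory nodes; the values and the program name are placeholders, not recommendations.

```
# Requesting more than 2.67 GB per core should steer the job to the 256 GB nodes automatically.
srun --partition=haswell --ntasks=1 --mem-per-cpu=10000M --time=00:30:00 ./my_program
```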
??? hint "Node topology"
@@ -91,23 +87,21 @@ For machine learning, we have 32 IBM AC922 nodes installed with this configurati
## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs
- 64 nodes, each with
    - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
    - 64 GB RAM (2.67 GB per core)
    - 128 GB local storage on SSD
    - 4 x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs
- Hostnames: `taurusi[2045-2108]`
- Slurm partition: `gpu2`
- Node topology, same as [island 4 - 6](#island-6-intel-haswell-cpus)
## SMP Nodes - up to 2 TB RAM
- 5 Nodes, each with
    - 4 x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ 2.20 GHz, Multithreading disabled
    - 2 TB RAM
- Hostnames: `taurussmp[3-7]`
- Slurm partition: `smp2`
??? hint "Node topology"
...
# Island 7 - AMD Rome Nodes
## Hardware
The up-to-date hardware documentation is provided at
[this site](hardware_overview.md#island-7-amd-rome-cpus).
## Usage
@@ -4,16 +4,10 @@ The HPE Superdome Flex is a large shared memory node. It is especially well suit
intensive application scenarios, for example to process extremely large data sets completely in main
memory or in very fast NVMe memory.
## Configuration Details
The hardware configuration is documented on
[this site](hardware_overview.md#large-smp-system-hpe-superdome-flex).
## Local Temporary NVMe Storage
There are 370 TB of NVMe devices installed. For immediate access for all projects, a volume of 87 TB
of fast NVMe storage is available at `/nvme/1/<projectname>`. A quota of
@@ -28,7 +22,13 @@ project's quota can be increased or dedicated volumes of up to the full capacity
- Granularity should be a socket (28 cores)
- Can be used for OpenMP applications with large memory demands
- To use OpenMPI it is necessary to export the following environment
  variables, so that OpenMPI uses shared memory instead of InfiniBand
  for message transport:

  ```
  export OMPI_MCA_pml=ob1
  export OMPI_MCA_mtl=^mxm
  ```
- Use `I_MPI_FABRICS=shm` so that Intel MPI doesn't even consider
  using InfiniBand devices itself, but only shared memory instead (see the sketch below)
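To complement the Open MPI settings above, a hedged sketch of the Intel MPI variable and of pinning an OpenMP run to a single socket (28 cores); the `srun` options and the executable name are placeholders:

```
# Restrict Intel MPI to shared memory, as recommended above.
export I_MPI_FABRICS=shm

# Illustrative socket-granular OpenMP run (placeholder values, not recommendations).
export OMP_NUM_THREADS=28
export OMP_PLACES=sockets
export OMP_PROC_BIND=close
srun --partition=julia --nodes=1 --ntasks=1 --cpus-per-task=28 ./my_openmp_app
```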