Commit 8d3bb533 authored by Martin Schroschk

Review hardware documentation pages

* Resolve issue #434
* Remove hardware documentation section from specific pages of SD Flex,
  Alpha Centauri and Rome nodes, because it is error-prone to have this
  documentation in two places. The specific pages better cover the
  usage.
* Fix typos and adjust whitespaces before units
* Some linting of markdown text
parent e4a01809
Merge requests: !754 "Automated merge from preview to main", !743 "Review hardware documentation pages"
# Alpha Centauri
The multi-GPU sub-cluster "Alpha Centauri" has been installed for AI-related computations (ScaDS.AI).
The up-to-date hardware specification is documented on
[this site](hardware_overview.md#amd-rome-cpus-nvidia-a100).
## Usage
!!! note

    The NVIDIA A100 GPUs may only be used with **CUDA 11** or later. Earlier versions do not
    recognize the new hardware properly. Make sure the software you are using is built with CUDA 11.
### Modules
The easiest way is using the [module system](../software/modules.md).
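For orientation, a minimal sketch of checking for a CUDA 11 toolchain via the module system; the module names and version strings below are illustrative assumptions, not the actual module tree:

```
module avail CUDA         # list the CUDA toolkits provided by the module system
module load CUDA/11.4.1   # hypothetical version string; any CUDA >= 11 works with the A100 GPUs
nvcc --version            # verify that the loaded toolkit reports CUDA 11 or later
```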
@@ -8,80 +8,76 @@ analytics, and artificial intelligence methods with extensive capabilities for e
and performance monitoring provides ideal conditions to achieve the ambitious research goals of the
users and the ZIH.
## Login and Export Nodes
- 4 Login-Nodes `tauruslogin[3-6].hrsk.tu-dresden.de`
    - Each login node is equipped with 2 x Intel(R) Xeon(R) CPU E5-2680 v3 with 24 cores in total @
      2.50 GHz, Multithreading disabled, 64 GB RAM, 128 GB SSD local disk
    - IPs: 141.30.73.\[102-105\]
- 2 Data-Transfer-Nodes `taurusexport[3-4].hrsk.tu-dresden.de`
    - DNS Alias `taurusexport.hrsk.tu-dresden.de`
    - 2 Servers without interactive login, only available via file transfer protocols
      (`rsync`, `ftp`)
    - IPs: 141.30.73.82/83
- Direct access to these nodes is granted via IP whitelisting (contact
hpcsupport@zih.tu-dresden.de) - otherwise use TU Dresden VPN.
!!! warning "Run time limit"

    Any process on login nodes is stopped after 5 minutes.
- Further information on the usage is documented on the site
  [Export Nodes](../data_transfer/export_nodes.md); a short transfer example follows below
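The transfer example referenced above is a hedged sketch using `rsync`; the login name and the remote target path are placeholders, not actual locations:

```
# Copy a local directory to the HPC filesystems via the export nodes (paths are placeholders).
rsync -avP ./input-data/ <zih-login>@taurusexport.hrsk.tu-dresden.de:/path/to/target/
```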
## AMD Rome CPUs + NVIDIA A100
- 34 nodes, each with
    - 8 x NVIDIA A100-SXM4 Tensor Core GPUs
    - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading disabled
    - 1 TB RAM
    - 3.5 TB local storage on NVMe device at `/tmp`
- Hostnames: `taurusi[8001-8034]`
- Slurm partition: `alpha`
- Further information on the usage is documented on the site
  [Alpha Centauri](alpha_centauri.md); a job script sketch follows below
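The job script sketch referenced above assumes only the partition name from the list; all resource values and the executable are placeholders rather than recommendations (see the sub-cluster's page for authoritative usage notes):

```
#!/bin/bash
#SBATCH --partition=alpha       # partition from the list above
#SBATCH --nodes=1
#SBATCH --gres=gpu:1            # request one A100 GPU (placeholder count)
#SBATCH --cpus-per-task=6       # placeholder value
#SBATCH --mem=64G               # placeholder value
#SBATCH --time=01:00:00

srun ./my_gpu_application       # placeholder executable
```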
## Island 7 - AMD Rome CPUs
- 192 nodes, each with
    - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading enabled
    - 512 GB RAM
    - 200 GB local storage on SSD at `/tmp`
- Hostnames: `taurusi[7001-7192]`
- Slurm partition: `romeo`
- Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
## Large SMP System HPE Superdome Flex
- 1 node, with
    - 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20 GHz (28 cores)
    - 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
    - 370 TB of fast NVMe storage available at `/nvme/<projectname>`
- Configured as one single node
- Hostname: `taurussmp8`
- Slurm partition: `julia`
- Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)
## IBM Power9 Nodes for Machine Learning
For machine learning, we have IBM AC922 nodes installed with this configuration:
- 32 nodes, each with
    - 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores)
    - 256 GB RAM DDR4 2666 MHz
    - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
    - NVLINK bandwidth 150 GB/s between GPUs and host
- Hostnames: `taurusml[1-32]`
- Slurm partition: `ml`
## Island 6 - Intel Haswell CPUs
- 612 nodes, each with
    - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
    - 128 GB local storage on SSD
    - Varying amounts of main memory (selected automatically by the batch system for you according
      to your job requirements; see the sketch after this list)
        - 594 nodes with 2.67 GB RAM per core (64 GB in total): `taurusi[6001-6540,6559-6612]`
        - 18 nodes with 10.67 GB RAM per core (256 GB in total): `taurusi[6541-6558]`
- Hostnames: `taurusi[6001-6612]`
- Slurm partition: `haswell`
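The sketch referenced in the list above: a hedged example of how the memory request selects the larger-memory nodes; the values and the program name are placeholders, not recommendations.

```
# Requesting more than 2.67 GB per core should steer the job to the 256 GB nodes automatically.
srun --partition=haswell --ntasks=1 --mem-per-cpu=10000M --time=00:30:00 ./my_program
```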
??? hint "Node topology"
@@ -91,23 +87,21 @@ For machine learning, we have 32 IBM AC922 nodes installed with this configurati
## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs
- 64 nodes, each with
    - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
    - 64 GB RAM (2.67 GB per core)
    - 128 GB local storage on SSD
    - 4 x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs
- Hostnames: `taurusi[2045-2108]`
- Slurm partition: `gpu2`
- Node topology, same as [island 4 - 6](#island-6-intel-haswell-cpus)
## SMP Nodes - up to 2 TB RAM
- 5 Nodes, each with
    - 4 x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ 2.20 GHz, Multithreading disabled
    - 2 TB RAM
- Hostnames: `taurussmp[3-7]`
- Slurm partition: `smp2`
??? hint "Node topology"
...
# Island 7 - AMD Rome Nodes
## Hardware
The up-to-date hardware documentation is provided at
[this site](hardware_overview.md#island-7-amd-rome-cpus).
## Usage
@@ -4,16 +4,10 @@ The HPE Superdome Flex is a large shared memory node. It is especially well suit
intensive application scenarios, for example to process extremely large data sets completely in main
memory or in very fast NVMe memory.
## Configuration Details
The hardware configuration is documented on
[this site](hardware_overview.md#large-smp-system-hpe-superdome-flex).
## Local Temporary NVMe Storage
There are 370 TB of NVMe devices installed. For immediate access for all projects, a volume of 87 TB
of fast NVMe storage is available at `/nvme/1/<projectname>`. A quota of
@@ -28,7 +22,13 @@ project's quota can be increased or dedicated volumes of up to the full capacity
- Granularity should be a socket (28 cores)
- Can be used for OpenMP applications with large memory demands
- To use OpenMPI it is necessary to export the following environment
  variables, so that OpenMPI uses shared memory instead of InfiniBand
  for message transport:

  ```
  export OMPI_MCA_pml=ob1
  export OMPI_MCA_mtl=^mxm
  ```
- Use `I_MPI_FABRICS=shm` so that Intel MPI doesn't even consider
  using InfiniBand devices itself, but only shared memory instead (see the sketch below)
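To complement the Open MPI settings above, a hedged sketch of the Intel MPI variable and of pinning an OpenMP run to a single socket (28 cores); the `srun` options and the executable name are placeholders:

```
# Restrict Intel MPI to shared memory, as recommended above.
export I_MPI_FABRICS=shm

# Illustrative socket-granular OpenMP run (placeholder values, not recommendations).
export OMP_NUM_THREADS=28
export OMP_PLACES=sockets
export OMP_PROC_BIND=close
srun --partition=julia --nodes=1 --ntasks=1 --cpus-per-task=28 ./my_openmp_app
```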