Commit 1ea3278e authored by Natalie Breidenbach, committed by Sebastian Döbel

Update 7 files

- /doc.zih.tu-dresden.de/docs/jobs_and_resources/barnard.md
- /doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md
- /doc.zih.tu-dresden.de/docs/jobs_and_resources/julia.md
- /doc.zih.tu-dresden.de/docs/jobs_and_resources/power9.md
- /doc.zih.tu-dresden.de/docs/jobs_and_resources/romeo.md
- /doc.zih.tu-dresden.de/docs/jobs_and_resources/capella.md
- /doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
parent 22d5bb38
2 merge requests: !1164 Automated merge from preview to main, !1152 Add page for Capella
# GPU Cluster Alpha Centauri
## Overview
The multi-GPU cluster `Alpha Centauri` has been installed for AI-related computations (ScaDS.AI).
The hardware specification is documented on the page
[HPC Resources](hardware_overview.md#alpha-centauri).
## Details
- 34 nodes, each with
    - 8 x NVIDIA A100-SXM4 Tensor Core-GPUs
    - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available
    - 1 TB RAM (16 x 32 GB DDR4-2933 MT/s per socket)
    - 3.5 TB local storage on NVMe device at `/tmp`
- Login nodes: `login[1-2].alpha.hpc.tu-dresden.de`
- Hostnames: `i[8001-8037].alpha.hpc.tu-dresden.de`
- Operating system: Rocky Linux 8.9
- Further information on the usage is documented below
## Filesystems
......
@@ -6,10 +6,19 @@ In 2023, Taurus was replaced by the cluster Barnard. Barnard is a general purpos
and is based on Intel Sapphire Rapids CPUs.
The cluster consists of 630 nodes; see the [hardware specifications](hardware_overview.md#barnard) for details.
## Usage
Barnard has four [login nodes](hardware_overview.md#barnard).
The [filesystems](../data_lifecycle/file_systems.md)
(`/home`, `/software`, `/data/horse`, `/data/walrus`, etc.) are available.
## Details
- 630 nodes, each with
    - 2 x Intel Xeon Platinum 8470 (52 cores) @ 2.00 GHz, Multithreading enabled
    - 512 GB RAM (8 x 32 GB DDR5-4800 MT/s per socket)
- 12 nodes provide 1.8 TB local storage on NVMe device at `/tmp`
- All other nodes are diskless and have no or very limited local storage (i.e. `/tmp`)
- Login nodes: `login[1-4].barnard.hpc.tu-dresden.de`
- Hostnames: `n[1001-1630].barnard.hpc.tu-dresden.de`
- Operating system: Red Hat Enterprise Linux 8.9
The [filesystems](../data_lifecycle/file_systems.md) are available on `barnard`
(`/home`, `/software`, `/data/horse`, `/data/walrus`, etc.).
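Available software modules can be searched with `module spider`, for example: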
```console
marie@login.barnard$ module spider <module_name>
......
# GPU Cluster Capella
## Overview
The multi-GPU cluster `Capella` has been installed for AI-related computations and traditional
HPC simulations.
The hardware specification is documented on the page
[HPC Resources](hardware_overview.md#capella).
## Details
Capella has two [login nodes](hardware_overview.md#capella).
- 144 nodes, each with
    - 4 x NVIDIA H100-SXM5 Tensor Core-GPUs
    - 2 x AMD EPYC CPU 9334 (32 cores) @ 2.7 GHz, Multithreading disabled
    - 768 GB RAM (12 x 32 GB DDR5-4800 MT/s per socket)
    - 800 GB local storage on NVMe device at `/tmp`
- Login nodes: `login[1-2].capella.hpc.tu-dresden.de`
- Hostnames: `c[1-144].capella.hpc.tu-dresden.de`
- Operating system: Alma Linux 9.4
## Filesystems
......
@@ -8,20 +8,6 @@ analytics, and artificial intelligence methods with extensive capabilities for e
and performance monitoring provides ideal conditions to achieve the ambitious research goals of the
users and the ZIH.
HPC resources at ZIH comprise a total of **six systems**:
| Name | Description | Year of Installation | DNS |
| ----------------------------------- | ----------------------| -------------------- | --- |
| [`Capella`](#capella) | GPU cluster | 2024 | `c[1-144].capella.hpc.tu-dresden.de` |
| [`Barnard`](#barnard) | CPU cluster | 2023 | `n[1001-1630].barnard.hpc.tu-dresden.de` |
| [`Alpha Centauri`](#alpha-centauri) | GPU cluster | 2021 | `i[8001-8037].alpha.hpc.tu-dresden.de` |
| [`Julia`](#julia) | Single SMP system | 2021 | `julia.hpc.tu-dresden.de` |
| [`Romeo`](#romeo) | CPU cluster | 2020 | `i[8001-8190].romeo.hpc.tu-dresden.de` |
| [`Power9`](#power9) | IBM Power/GPU cluster | 2018 | `ml[1-29].power9.hpc.tu-dresden.de` |
All clusters will run with their own [Slurm batch system](slurm.md) and job submission is possible
only from their respective login nodes.
## Architectural Design
Over the last decade we have been running our HPC system of high heterogeneity with a single
@@ -38,10 +24,25 @@ permanent filesystems on the page [Filesystems](../data_lifecycle/file_systems.m
![Architecture overview 2024](../jobs_and_resources/misc/architecture_2024.png)
{: align=center}
HPC resources at ZIH comprise a total of **six systems**:
| Name | Description | Year of Installation | DNS |
| ----------------------------------- | ----------------------| -------------------- | --- |
| [`Capella`](capella.md) | GPU cluster | 2024 | `c[1-144].capella.hpc.tu-dresden.de` |
| [`Barnard`](barnard.md) | CPU cluster | 2023 | `n[1001-1630].barnard.hpc.tu-dresden.de` |
| [`Alpha Centauri`](alpha_centauri.md) | GPU cluster | 2021 | `i[8001-8037].alpha.hpc.tu-dresden.de` |
| [`Julia`](julia.md) | Single SMP system | 2021 | `julia.hpc.tu-dresden.de` |
| [`Romeo`](romeo.md) | CPU cluster | 2020 | `i[7001-7190].romeo.hpc.tu-dresden.de` |
| [`Power9`](power9.md) | IBM Power/GPU cluster | 2018 | `ml[1-29].power9.hpc.tu-dresden.de` |
All clusters run their own [Slurm batch system](slurm.md), and job submission is possible
only from their respective login nodes.
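For example, an interactive job on `Barnard` is requested directly from one of its login nodes
(a minimal sketch; the resource and time values are placeholders, adjust them to your needs):

```console
marie@login.barnard$ srun --nodes=1 --ntasks=1 --cpus-per-task=4 --time=01:00:00 --pty bash
```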
## Login and Dataport Nodes
- Login-Nodes
- Individual for each cluster. See sections below.
- Individual for each cluster. See the specifics in each cluster chapter.
- 2 Data-Transfer-Nodes
- 2 servers without interactive login, only available via file transfer protocols
(`rsync`, `ftp`)
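A transfer from a local machine to one of the HPC filesystems via a data transfer node could look
like the following (a minimal sketch; the hostname `dataport1.hpc.tu-dresden.de` and the target
workspace path are placeholders, not verified names):

```console
marie@local$ rsync -avP my_dataset/ marie@dataport1.hpc.tu-dresden.de:/data/horse/ws/marie-my_workspace/
```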
@@ -52,87 +53,34 @@ permanent filesystems on the page [Filesystems](../data_lifecycle/file_systems.m
## Barnard
The cluster `Barnard` is a general purpose cluster by Bull. It is based on Intel Sapphire Rapids
CPUs.
- 630 nodes, each with
- 2 x Intel Xeon Platinum 8470 (52 cores) @ 2.00 GHz, Multithreading enabled
- 512 GB RAM (8 x 32 GB DDR5-4800 MT/s per socket)
- 12 nodes provide 1.8 TB local storage on NVMe device at `/tmp`
- All other nodes are diskless and have no or very limited local storage (i.e. `/tmp`)
- Login nodes: `login[1-4].barnard.hpc.tu-dresden.de`
- Hostnames: `n[1001-1630].barnard.hpc.tu-dresden.de`
- Operating system: Red Hat Enterpise Linux 8.9
The cluster `Barnard` is a general purpose cluster by Bull. It is based on Intel Sapphire Rapids CPUs.
Further details in [`Barnard` Chapter](barnard.md).
## Alpha Centauri
The cluster `Alpha Centauri` (short: `Alpha`) by NEC provides AMD Rome CPUs and NVIDIA A100 GPUs
and is designed for AI and ML tasks.
- 34 nodes, each with
- 8 x NVIDIA A100-SXM4 Tensor Core-GPUs
- 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available
- 1 TB RAM (16 x 32 GB DDR4-2933 MT/s per socket)
- 3.5 TB local storage on NVMe device at `/tmp`
- Login nodes: `login[1-2].alpha.hpc.tu-dresden.de`
- Hostnames: `i[8001-8037].alpha.hpc.tu-dresden.de`
- Operating system: Rocky Linux 8.9
- Further information on the usage is documented on the site [GPU Cluster Alpha Centauri](alpha_centauri.md)
Further details in [`Alpha Centauri` Chapter](alpha_centauri.md).
## Capella
The cluster `Capella` by MEGWARE provides AMD Genoa CPUs and NVIDIA H100 GPUs
and is designed for AI and ML tasks.
- 144 nodes, each with
- 4 x NVIDIA H100-SXM5 Tensor Core-GPUs
- 2 x AMD EPYC CPU 9334 (32 cores) @ 2.7 GHz, Multithreading disabled
- 768 GB RAM (12 x 32 GB DDR5-4800 MT/s per socket)
- 800 GB local storage on NVMe device at `/tmp`
- Login nodes: `login[1-2].capella.hpc.tu-dresden.de`
- Hostnames: `c[1-144].capella.hpc.tu-dresden.de`
- Operating system: Alma Linux 9.4
Further details in [`Capella` Chapter](capella.md).
## Romeo
The cluster `Romeo` is a general purpose cluster by NEC based on AMD Rome CPUs.
- 192 nodes, each with
- 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
- 512 GB RAM (8 x 32 GB DDR4-3200 MT/s per socket)
- 200 GB local storage on SSD at `/tmp`
- Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
- Hostnames: `i[7001-7190].romeo.hpc.tu-dresden.de`
- Operating system: Rocky Linux 8.9
- Further information on the usage is documented on the site [CPU Cluster Romeo](romeo.md)
Further details in [`Romeo` Chapter](romeo.md).
## Julia
The cluster `Julia` is a large SMP (shared memory parallel) system by HPE based on Superdome Flex
architecture.
- 1 node, with
- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20 GHz (28 cores)
- 47 TB RAM (12 x 128 GB DDR4-2933 MT/s per socket)
- Configured as one single node
- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
- 370 TB of fast NVME storage available at `/nvme/<projectname>`
- Login node: `julia.hpc.tu-dresden.de`
- Hostname: `julia.hpc.tu-dresden.de`
- Operating system: Rocky Linux 8.7
- Further information on the usage is documented on the site [SMP System Julia](julia.md)
Further details in [`Julia` Chapter](julia.md).
## Power9
The cluster `Power9` by IBM is based on Power9 CPUs and provides NVIDIA V100 GPUs.
`Power9` is specifically designed for machine learning (ML) tasks.
- 32 nodes, each with
- 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores)
- 256 GB RAM (8 x 16 GB DDR4-2666 MT/s per socket)
- 6 x NVIDIA VOLTA V100 with 32 GB HBM2
- NVLINK bandwidth 150 GB/s between GPUs and host
- Login nodes: `login[1-2].power9.hpc.tu-dresden.de`
- Hostnames: `ml[1-29].power9.hpc.tu-dresden.de`
- Operating system: Alma Linux 8.7
- Further information on the usage is documented on the site [GPU Cluster Power9](power9.md)
Further details in [`Power9` Chapter](power9.md).
# SMP Cluster Julia
## Overview
The HPE Superdome Flex is a large shared-memory node. It is especially well suited for
data-intensive application scenarios, for example processing extremely large data sets completely
in main memory or in very fast NVMe storage.
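Since the whole system is managed as a single node by its own [Slurm batch system](slurm.md),
large-memory jobs are requested simply by asking Slurm for a correspondingly high amount of memory
(a minimal sketch; the resource values are placeholders):

```console
marie@julia$ srun --ntasks=1 --cpus-per-task=28 --mem=2T --time=04:00:00 --pty bash
```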
## Hardware Resources
The hardware specification is documented on the page
[HPC Resources](hardware_overview.md#julia).
## Details
- 1 node, with
    - 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20 GHz (28 cores)
    - 48 TB RAM (12 x 128 GB DDR4-2933 MT/s per socket), of which 47 TB are usable (one TB is used for cache coherence protocols)
- Configured as one single node
- 370 TB of fast NVMe storage available at `/nvme/<projectname>`
- Login node: `julia.hpc.tu-dresden.de`
- Hostname: `julia.hpc.tu-dresden.de`
- Operating system: Rocky Linux 8.7
- Further information on the usage is documented below
!!! note
......
# GPU Cluster Power9
## Overview
The multi-GPU cluster `Power9` was installed in 2018. Until the end of 2023, it was available as
partition `power` within the now decommissioned `Taurus` system. With the decommissioning of `Taurus`,
`Power9` has been re-engineered and is now a homogeneous, standalone cluster with its own
[Slurm batch system](slurm.md) and its own login nodes.
## Hardware Resources
## Details
- 32 nodes, each with
    - 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores)
    - 256 GB RAM (8 x 16 GB DDR4-2666 MT/s per socket)
    - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
    - NVLINK bandwidth 150 GB/s between GPUs and host
- Login nodes: `login[1-2].power9.hpc.tu-dresden.de`
- Hostnames: `ml[1-29].power9.hpc.tu-dresden.de`
- Operating system: Alma Linux 8.7
- Further information on the usage is documented below
The hardware specification is documented on the page [HPC Resources](hardware_overview.md#power9).
## Usage
If you want to use containers on `Power9`, please refer to the page
[Singularity for Power9 Architecture](../software/singularity_power9.md).
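As a rough sketch of running a containerized command (the image name is a placeholder; note that
containers for `Power9` must be built for the ppc64le architecture, see the page linked above):

```console
marie@login.power9$ singularity exec my_container_ppc64le.sif python3 my_script.py
```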
The compute nodes of the cluster `power` are built on the base of
The compute nodes of the cluster `power9` are built on the base of
[Power9 architecture](https://www.ibm.com/it-infrastructure/power/power9) from IBM. The system was created
for AI challenges, analytics, data-intensive workloads, and accelerated databases.
The main feature of the nodes is the ability to work with the
[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link**
support that allows a total bandwidth of up to 300 GB/s. Each node on the
cluster `power` has 6x Tesla V-100 GPUs. You can find a detailed specification of the cluster in our
cluster `power9` has 6x Tesla V-100 GPUs. You can find a detailed specification of the cluster in our
[Power9 documentation](../jobs_and_resources/hardware_overview.md).
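A batch job that uses all six GPUs of one node could be submitted as follows (a minimal sketch;
the jobscript name and resource values are placeholders, and it is assumed that the GPUs are
requested via Slurm's generic `--gres=gpu` option):

```console
marie@login.power9$ sbatch --nodes=1 --ntasks=6 --gres=gpu:6 --time=02:00:00 my_jobscript.sh
```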
......
@@ -7,10 +7,16 @@ of 2023, it was available as partition `romeo` within `Taurus`. With the decommi
`Romeo` has been re-engineered and is now a homogeneous, standalone cluster with its own
[Slurm batch system](slurm.md) and its own login nodes.
## Hardware Resources
The hardware specification is documented on the page
[HPC Resources](hardware_overview.md#romeo).
## Details
- 192 nodes, each with
    - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
    - 512 GB RAM (8 x 32 GB DDR4-3200 MT/s per socket)
    - 200 GB local storage on SSD at `/tmp`
- Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
- Hostnames: `i[7001-7190].romeo.hpc.tu-dresden.de`
- Operating system: Rocky Linux 8.9
- Further information on the usage is documented below
## Usage
......