Commit 74378c00 authored by Ulf Markwardt

step by step

parent 84690f3b
2 merge requests: !850 Automated merge from preview to main, !845 Barnard
# HPC Resources # HPC Resources
The architecture specifically tailored to data-intensive computing, Big Data HPC resources in ZIH systems comprise the *High Performance Computing and Storage Complex* and its
analytics, and artificial intelligence methods with extensive capabilities for performance monitoring provides ideal conditions to achieve the ambitious research goals of the users and the ZIH. extension *High Performance Computing – Data Analytics*. In total it offers scientists
about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating point
## Overview operations per second. The architecture specifically tailored to data-intensive computing, Big Data
analytics, and artificial intelligence methods with extensive capabilities for energy measurement
From the users' perspective, there are separate clusters, all of them with their subdomains: and performance monitoring provides ideal conditions to achieve the ambitious research goals of the
users and the ZIH.
| Name | Description | Year| DNS |
| --- | --- | --- | --- | ## Login and Export Nodes
| **Barnard** | CPU cluster |2023| *.barnard.hpc.tu-dresden.de |
| **Romeo** | CPU cluster |2020|*.romeo.hpc.tu-dresden.de | - 4 Login-Nodes `tauruslogin[3-6].hrsk.tu-dresden.de`
| **Alpha Centauri** | GPU cluster |2021|*.alpha.hpc.tu-dresden.de | - Each login node is equipped with 2x Intel(R) Xeon(R) CPU E5-2680 v3 with 24 cores in total @
| **Julia** | single SMP system |2021|julia.hpc.tu-dresden.de | 2.50 GHz, Multithreading disabled, 64 GB RAM, 128 GB SSD local disk
| **Power** | IBM Power/GPU system |2018|*.power.hpc.tu-dresden.de | - IPs: 141.30.73.\[102-105\]
- 2 Data-Transfer-Nodes `taurusexport[3-4].hrsk.tu-dresden.de`
- DNS Alias `taurusexport.hrsk.tu-dresden.de`
They run with their own Slurm batch system. Job submission is possible only from their login nodes. - 2 Servers without interactive login, only available via file transfer protocols
(`rsync`, `ftp`)
All clusters have access to these shared parallel file systems: - IPs: 141.30.73.\[82,83\]
- Further information on the usage is documented on the site
| File system | Usable directory | Capacity | Purpose | [Export Nodes](../data_transfer/export_nodes.md)
| --- | --- | --- | --- |
| `Lustre` | `/lustre/bulk` | 20 PB |
| `Lustre` | `/lustre/fast` | 2 PB |
| `Weka` | `/weka` | 232 TB |
| `Home` | `/home` | 40 TB |
## Barnard - Intel Sapphire Rapids CPUs
- 630 diskless nodes, each with
- 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (52 cores) @ 2.50 GHz, Multithreading enabled
- 512 GB RAM
- Hostnames: `n1[001-630].barnard.hpc.tu-dresden.de`
- Login nodes: `login[1-4].barnard.hpc.tu-dresden.de`
## AMD Rome CPUs + NVIDIA A100 ## AMD Rome CPUs + NVIDIA A100
...@@ -43,8 +29,8 @@ All clusters have access to these shared parallel file systems: ...@@ -43,8 +29,8 @@ All clusters have access to these shared parallel file systems:
- 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available
- 1 TB RAM - 1 TB RAM
- 3.5 TB local memory on NVMe device at `/tmp` - 3.5 TB local memory on NVMe device at `/tmp`
- Hostnames: `taurusi[8001-8034]` -> `n[1-37].alpha.hpc.tu-dresden.de` - Hostnames: `taurusi[8001-8034]`
- Login nodes: `login[1-2].alpha.hpc.tu-dresden.de` - Slurm partition: `alpha`
- Further information on the usage is documented on the site [Alpha Centauri Nodes](alpha_centauri.md) - Further information on the usage is documented on the site [Alpha Centauri Nodes](alpha_centauri.md)
## Island 7 - AMD Rome CPUs ## Island 7 - AMD Rome CPUs
...@@ -53,8 +39,8 @@ All clusters have access to these shared parallel file systems: ...@@ -53,8 +39,8 @@ All clusters have access to these shared parallel file systems:
- 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
- 512 GB RAM - 512 GB RAM
- 200 GB local memory on SSD at `/tmp` - 200 GB local memory on SSD at `/tmp`
- Hostnames: `taurusi[7001-7192]` -> `n[1-190].romeo.hpc.tu-dresden.de` - Hostnames: `taurusi[7001-7192]`
- Login nodes: `login[1-2].romeo.hpc.tu-dresden.de` - Slurm partition: `romeo`
- Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md) - Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
## Large SMP System HPE Superdome Flex ## Large SMP System HPE Superdome Flex
...@@ -81,4 +67,43 @@ For machine learning, we have IBM AC922 nodes installed with this configuration: ...@@ -81,4 +67,43 @@ For machine learning, we have IBM AC922 nodes installed with this configuration:
- Hostnames: `taurusml[1-32]` - Hostnames: `taurusml[1-32]`
- Slurm partition: `ml` - Slurm partition: `ml`
## Island 6 - Intel Haswell CPUs
- 612 nodes, each with
- 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
- 128 GB local memory on SSD
- Varying amounts of main memory (selected automatically by the batch system for you according to
your job requirements)
* 594 nodes with 2.67 GB RAM per core (64 GB in total): `taurusi[6001-6540,6559-6612]`
- 18 nodes with 10.67 GB RAM per core (256 GB in total): `taurusi[6541-6558]`
- Hostnames: `taurusi[6001-6612]`
- Slurm Partition: `haswell`
??? hint "Node topology"
![Node topology](misc/i4000.png)
{: align=center}
## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs
- 64 nodes, each with
- 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
- 64 GB RAM (2.67 GB per core)
- 128 GB local memory on SSD
- 4 x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs
- Hostnames: `taurusi[2045-2108]`
- Slurm Partition: `gpu2`
- Node topology, same as [island 4 - 6](#island-6-intel-haswell-cpus)
## SMP Nodes - up to 2 TB RAM
- 5 Nodes, each with
- 4 x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ 2.20 GHz, Multithreading disabled
- 2 TB RAM
- Hostnames: `taurussmp[3-7]`
- Slurm partition: `smp2`
??? hint "Node topology"
![Node topology](misc/smp2.png)
{: align=center}
# HPC Resources # HPC Resources
HPC resources in ZIH systems comprise the *High Performance Computing and Storage Complex* and its The architecture specifically tailored to data-intensive computing, Big Data
extension *High Performance Computing – Data Analytics*. In total it offers scientists analytics, and artificial intelligence methods with extensive capabilities
about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating point for performance monitoring provides ideal conditions to achieve the ambitious
operations per second. The architecture specifically tailored to data-intensive computing, Big Data research goals of the users and the ZIH.
analytics, and artificial intelligence methods with extensive capabilities for energy measurement
and performance monitoring provides ideal conditions to achieve the ambitious research goals of the ## Overview
users and the ZIH.
From the users' perspective, there are separate clusters, all of them with their subdomains:
## Login and Export Nodes
| Name | Description | Year| DNS |
- 4 Login-Nodes `tauruslogin[3-6].hrsk.tu-dresden.de` | --- | --- | --- | --- |
- Each login node is equipped with 2x Intel(R) Xeon(R) CPU E5-2680 v3 with 24 cores in total @ | **Barnard** | CPU cluster |2023| n[1001-1630].barnard.hpc.tu-dresden.de |
2.50 GHz, Multithreading disabled, 64 GB RAM, 128 GB SSD local disk | **Romeo** | CPU cluster |2020|r[001-192].romeo.hpc.tu-dresden.de |
- IPs: 141.30.73.\[102-105\] | **Alpha Centauri** | GPU cluster |2021|a[001-039].alpha.hpc.tu-dresden.de |
- 2 Data-Transfer-Nodes `taurusexport[3-4].hrsk.tu-dresden.de` | **Julia** | single SMP system |2021|julia.hpc.tu-dresden.de |
- DNS Alias `taurusexport.hrsk.tu-dresden.de` | **Power** | IBM Power/GPU system |2018|m[001-032].power.hpc.tu-dresden.de |
- 2 Servers without interactive login, only available via file transfer protocols
(`rsync`, `ftp`) They run with their own Slurm batch system. Job submission is possible only from
- IPs: 141.30.73.\[82,83\] their respective login nodes.
- Further information on the usage is documented on the site
[Export Nodes](../data_transfer/export_nodes.md) All clusters have access to these shared parallel file systems:
| File system | Usable directory | Type | Capacity | Purpose |
| --- | --- | --- | --- | --- |
| `Home` | `/home` | Lustre | 40 TB |
| `Project` | `/projects` | NFS | 40 TB |
| `Scratch for large data / streams` | `/data/horse` | Lustre | 20 PB |
| `Scratch for random access` | `/data/rabbit` | Lustre | 2 PB |
| `Scratch for random access` | `/data/weasel` | WEKA | 232 TB |
## Barnard - Intel Sapphire Rapids CPUs
- 630 diskless nodes, each with
- 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (52 cores) @ 2.50 GHz, Multithreading enabled
- 512 GB RAM
- Hostnames: `n1[001-630].barnard.hpc.tu-dresden.de`
- Login nodes: `login[1-4].barnard.hpc.tu-dresden.de`
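The hostname entries in these listings use a bracketed range notation such as `n1[001-630]` or `taurusi[6001-6540,6559-6612]`. The following is a minimal sketch of how such a pattern could be expanded into individual hostnames. The function name `expand_hostlist` is hypothetical, and it assumes at most one bracket group per pattern with zero-padded numeric ranges. On a cluster, `scontrol show hostnames` does the same job.

```python
import re

# One optional prefix, one bracket group, one optional suffix.
# Assumption: a pattern contains at most one bracket group.
_HOSTLIST = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hostlist(pattern: str) -> list[str]:
    """Expand e.g. 'login[1-4].example' into a list of hostnames (hypothetical helper)."""
    match = _HOSTLIST.match(pattern)
    if match is None:
        # No bracket group: the pattern is already a single hostname.
        return [pattern]

    hosts = []
    for part in match["body"].split(","):
        # A part is either a single number ("7") or a range ("001-630").
        start, _, end = part.partition("-")
        end = end or start
        width = len(start)  # keep the zero padding of the range start
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{match['prefix']}{number:0{width}d}{match['suffix']}")
    return hosts


if __name__ == "__main__":
    barnard = expand_hostlist("n1[001-630].barnard.hpc.tu-dresden.de")
    print(len(barnard), barnard[0], barnard[-1])
```

Counting the expanded entries is a quick way to check a listing against its stated node count, for example 630 for the Barnard compute nodes.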
## AMD Rome CPUs + NVIDIA A100 ## AMD Rome CPUs + NVIDIA A100
...@@ -29,8 +46,8 @@ users and the ZIH. ...@@ -29,8 +46,8 @@ users and the ZIH.
- 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available
- 1 TB RAM - 1 TB RAM
- 3.5 TB local memory on NVMe device at `/tmp` - 3.5 TB local memory on NVMe device at `/tmp`
- Hostnames: `taurusi[8001-8034]` - Hostnames: `taurusi[8001-8034]` -> `a[1-37].alpha.hpc.tu-dresden.de`
- Slurm partition: `alpha` - Login nodes: `login[1-2].alpha.hpc.tu-dresden.de`
- Further information on the usage is documented on the site [Alpha Centauri Nodes](alpha_centauri.md) - Further information on the usage is documented on the site [Alpha Centauri Nodes](alpha_centauri.md)
## Island 7 - AMD Rome CPUs ## Island 7 - AMD Rome CPUs
...@@ -39,8 +56,8 @@ users and the ZIH. ...@@ -39,8 +56,8 @@ users and the ZIH.
- 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
- 512 GB RAM - 512 GB RAM
- 200 GB local memory on SSD at `/tmp` - 200 GB local memory on SSD at `/tmp`
- Hostnames: `taurusi[7001-7192]` - Hostnames: `taurusi[7001-7192]` -> `r[1-190].romeo.hpc.tu-dresden.de`
- Slurm partition: `romeo` - Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
- Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md) - Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
## Large SMP System HPE Superdome Flex ## Large SMP System HPE Superdome Flex
...@@ -51,8 +68,7 @@ users and the ZIH. ...@@ -51,8 +68,7 @@ users and the ZIH.
- Configured as one single node - Configured as one single node
- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols) - 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
- 370 TB of fast NVME storage available at `/nvme/<projectname>` - 370 TB of fast NVME storage available at `/nvme/<projectname>`
- Hostname: `taurussmp8` - Hostname: `taurussmp8` -> `julia.hpc.tu-dresden.de`
- Slurm partition: `julia`
- Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md) - Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)
## IBM Power9 Nodes for Machine Learning ## IBM Power9 Nodes for Machine Learning
...@@ -64,46 +80,5 @@ For machine learning, we have IBM AC922 nodes installed with this configuration: ...@@ -64,46 +80,5 @@ For machine learning, we have IBM AC922 nodes installed with this configuration:
- 256 GB RAM DDR4 2666 MHz - 256 GB RAM DDR4 2666 MHz
- 6 x NVIDIA VOLTA V100 with 32 GB HBM2 - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
- NVLINK bandwidth 150 GB/s between GPUs and host - NVLINK bandwidth 150 GB/s between GPUs and host
- Hostnames: `taurusml[1-32]` - Hostnames: `taurusml[1-32]` -> `p[1-29].ml.hpc.tu-dresden.de`
- Slurm partition: `ml` - Login nodes: `login[1-2].ml.hpc.tu-dresden.de`
## Island 6 - Intel Haswell CPUs
- 612 nodes, each with
- 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
- 128 GB local memory on SSD
- Varying amounts of main memory (selected automatically by the batch system for you according to
your job requirements)
* 594 nodes with 2.67 GB RAM per core (64 GB in total): `taurusi[6001-6540,6559-6612]`
- 18 nodes with 10.67 GB RAM per core (256 GB in total): `taurusi[6541-6558]`
- Hostnames: `taurusi[6001-6612]`
- Slurm Partition: `haswell`
??? hint "Node topology"
![Node topology](misc/i4000.png)
{: align=center}
## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs
- 64 nodes, each with
- 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
- 64 GB RAM (2.67 GB per core)
- 128 GB local memory on SSD
- 4 x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs
- Hostnames: `taurusi[2045-2108]`
- Slurm Partition: `gpu2`
- Node topology, same as [island 4 - 6](#island-6-intel-haswell-cpus)
## SMP Nodes - up to 2 TB RAM
- 5 Nodes, each with
- 4 x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ 2.20 GHz, Multithreading disabled
- 2 TB RAM
- Hostnames: `taurussmp[3-7]`
- Slurm partition: `smp2`
??? hint "Node topology"
![Node topology](misc/smp2.png)
{: align=center}
# HPC Resources and Jobs # HPC Resources and Jobs
ZIH operates a high performance computing (HPC) system with more than 60,000 cores, 720 GPUs, and a ZIH operates a high performance computing (HPC) system with more than 90,000 cores, 500 GPUs, and
flexible storage hierarchy with about 16 PB total capacity. The HPC system provides an optimal a flexible storage hierarchy with about 20 PB total capacity. The HPC system provides an optimal
research environment especially in the area of data analytics and machine learning as well as for research environment especially in the area of data analytics and machine learning as well as for
processing extremely large data sets. Moreover it is also a perfect platform for highly scalable, processing extremely large data sets. Moreover it is also a perfect platform for highly scalable,
data-intensive and compute-intensive applications. data-intensive and compute-intensive applications.
...@@ -58,12 +58,12 @@ automatically select a suitable partition depending on your memory and GPU requi ...@@ -58,12 +58,12 @@ automatically select a suitable partition depending on your memory and GPU requi
**MPI jobs:** MPI jobs typically allocate one core per task. Several nodes could be allocated **MPI jobs:** MPI jobs typically allocate one core per task. Several nodes could be allocated
if it is necessary. The batch system [Slurm](slurm.md) will automatically find suitable hardware. if it is necessary. The batch system [Slurm](slurm.md) will automatically find suitable hardware.
Normal compute nodes are perfect for this task.
**OpenMP jobs:** SMP-parallel applications can only run **within a node**, so it is necessary to **OpenMP jobs:** SMP-parallel applications can only run **within a node**, so it is necessary to
include the [batch system](slurm.md) options `-N 1` and `-n 1`. Using `--cpus-per-task N` Slurm will include the [batch system](slurm.md) options `-N 1` and `-n 1`. Using `--cpus-per-task N` Slurm will
start one task and you will have `N` CPUs. The maximum number of processors for an SMP-parallel start one task and you will have `N` CPUs. The maximum number of processors for an SMP-parallel
program is 896 on partition `julia`, see [partitions](partitions_and_limits.md). program is 896 on partition `julia`, see [partitions](partitions_and_limits.md) (be aware that
the application has to be developed with that large number of threads in mind).
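The OpenMP options described here can be illustrated with a small generator for a Slurm job file. This is only a sketch under assumptions: the function name `openmp_job_script` and the application path `./my_openmp_app` are hypothetical, and the partition name and the limit of 896 CPUs are taken from the text above rather than queried from the batch system.

```python
def openmp_job_script(cpus: int, command: str,
                      partition: str = "julia", max_cpus: int = 896) -> str:
    """Return the text of a single-node, single-task OpenMP job file (hypothetical helper)."""
    if not 1 <= cpus <= max_cpus:
        raise ValueError(f"cpus must be between 1 and {max_cpus}, got {cpus}")

    lines = [
        "#!/bin/bash",
        "#SBATCH --nodes=1",                 # long form of -N 1
        "#SBATCH --ntasks=1",                # long form of -n 1
        f"#SBATCH --cpus-per-task={cpus}",   # N CPUs for the one task
        f"#SBATCH --partition={partition}",
        # Let the OpenMP runtime use exactly the CPUs Slurm allocated.
        "export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # './my_openmp_app' is a placeholder for your own binary.
    print(openmp_job_script(cpus=64, command="./my_openmp_app"))
```

The generated text could be saved to a file and submitted with `sbatch`. The upper bound is checked locally so that an oversized request fails before it reaches the scheduler.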
Partitions with GPUs are best suited for **repetitive** and **highly-parallel** computing tasks. If Partitions with GPUs are best suited for **repetitive** and **highly-parallel** computing tasks. If
you have a task with potential [data parallelism](../software/gpu_programming.md) most likely that you have a task with potential [data parallelism](../software/gpu_programming.md) most likely that
...@@ -71,7 +71,9 @@ you need the GPUs. Beyond video rendering, GPUs excel in tasks such as machine ...@@ -71,7 +71,9 @@ you need the GPUs. Beyond video rendering, GPUs excel in tasks such as machine
simulations and risk modeling. Use the partitions `gpu2` and `ml` only if you need GPUs! Otherwise simulations and risk modeling. Use the partitions `gpu2` and `ml` only if you need GPUs! Otherwise
using the x86-based partitions most likely would be more beneficial. using the x86-based partitions most likely would be more beneficial.
**Interactive jobs:** Slurm can forward your X11 credentials to the first node (or even all) for a job **Interactive jobs:** An interactive job is the best choice for testing and development. See
[interactive-jobs](slurm.md).
Slurm can forward your X11 credentials to the first node (or even all) for a job
with the `--x11` option. To use an interactive job you have to specify the `-X` flag for the ssh login. with the `--x11` option. To use an interactive job you have to specify the `-X` flag for the ssh login.
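As a rough illustration of the two steps named here (ssh login with `-X`, then a Slurm job with `--x11`), the sketch below only assembles the two command lines as strings. The helper name `interactive_x11_commands` and the user name `marie` are hypothetical. The login host is one of the Barnard login nodes listed in the hardware overview, and the resource options are an assumed minimal request.

```python
import shlex


def interactive_x11_commands(user: str, login_host: str,
                             shell: str = "bash") -> tuple[str, str]:
    """Build the ssh and srun command lines for an X11-forwarded interactive job."""
    # Step 1: log in with X11 forwarding enabled.
    ssh_cmd = ["ssh", "-X", f"{user}@{login_host}"]
    # Step 2: on the login node, start an interactive shell on a compute node
    # and let Slurm forward the X11 connection to it.
    srun_cmd = ["srun", "--nodes=1", "--ntasks=1", "--x11", "--pty", shell]
    return shlex.join(ssh_cmd), shlex.join(srun_cmd)


if __name__ == "__main__":
    ssh_line, srun_line = interactive_x11_commands(
        "marie", "login1.barnard.hpc.tu-dresden.de"
    )
    print(ssh_line)
    print(srun_line)
```

The second command is meant to be run on the login node after the first one has succeeded. Time and memory limits are left out and would follow the cluster's own defaults.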
## Interactive vs. Batch Mode ## Interactive vs. Batch Mode
......
...@@ -99,7 +99,6 @@ nav: ...@@ -99,7 +99,6 @@ nav:
- HPC Resources and Jobs: - HPC Resources and Jobs:
- Overview: jobs_and_resources/overview.md - Overview: jobs_and_resources/overview.md
- HPC Resources: - HPC Resources:
- Overview: jobs_and_resources/hardware_overview.md
- AMD Rome Nodes: jobs_and_resources/rome_nodes.md - AMD Rome Nodes: jobs_and_resources/rome_nodes.md
- NVMe Storage: jobs_and_resources/nvme_storage.md - NVMe Storage: jobs_and_resources/nvme_storage.md
- Alpha Centauri: jobs_and_resources/alpha_centauri.md - Alpha Centauri: jobs_and_resources/alpha_centauri.md
...@@ -111,7 +110,6 @@ nav: ...@@ -111,7 +110,6 @@ nav:
- Partitions and Limits: jobs_and_resources/partitions_and_limits.md - Partitions and Limits: jobs_and_resources/partitions_and_limits.md
- Slurm Job File Generator: jobs_and_resources/slurm_generator.md - Slurm Job File Generator: jobs_and_resources/slurm_generator.md
- Checkpoint/Restart: jobs_and_resources/checkpoint_restart.md - Checkpoint/Restart: jobs_and_resources/checkpoint_restart.md
- Job Profiling: jobs_and_resources/slurm_profiling.md
- Binding and Distribution of Tasks: jobs_and_resources/binding_and_distribution_of_tasks.md - Binding and Distribution of Tasks: jobs_and_resources/binding_and_distribution_of_tasks.md
- User Support: support/support.md - User Support: support/support.md
- Archive: - Archive:
...@@ -124,7 +122,9 @@ nav: ...@@ -124,7 +122,9 @@ nav:
- Platform LSF: archive/platform_lsf.md - Platform LSF: archive/platform_lsf.md
- BeeGFS Filesystem on Demand: archive/beegfs_on_demand.md - BeeGFS Filesystem on Demand: archive/beegfs_on_demand.md
- Jupyter Installation: archive/install_jupyter.md - Jupyter Installation: archive/install_jupyter.md
- Job Profiling: jobs_and_resources/slurm_profiling.md
- Switched-Off Systems: - Switched-Off Systems:
- Overview 2022: archive/hardware_overview_2022.md
- Overview: archive/systems_switched_off.md - Overview: archive/systems_switched_off.md
- Migration From Deimos to Atlas: archive/migrate_to_atlas.md - Migration From Deimos to Atlas: archive/migrate_to_atlas.md
- System Altix: archive/system_altix.md - System Altix: archive/system_altix.md
......