From ebca616a70510b4f433a0ebdcc50ad98f5e45ab9 Mon Sep 17 00:00:00 2001
From: Ulf Markwardt <ulf.markwardt@tu-dresden.de>
Date: Thu, 22 Jun 2023 14:06:24 +0200
Subject: [PATCH] update

---
 .../jobs_and_resources/hardware_overview.md | 117 +++++++++++-------
 1 file changed, 70 insertions(+), 47 deletions(-)

diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
index 14a39baba..538296b4e 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
@@ -1,45 +1,26 @@
 # HPC Resources
 
-The architecture specifically tailored to data-intensive computing, Big Data
-analytics, and artificial intelligence methods with extensive capabilities
-for performance monitoring provides ideal conditions to achieve the ambitious
-research goals of the users and the ZIH.
-
-## Overview
-
-From the users' pespective, there are seperate clusters, all of them with their subdomains:
-
-| Name | Description | Year| DNS |
-| --- | --- | --- | --- |
-| **Barnard** | CPU cluster |2023| n[1001-1630].barnard.hpc.tu-dresden.de |
-| **Romeo** | CPU cluster |2020|i[8001-8190].romeo.hpc.tu-dresden.de |
-| **Alpha Centauri** | GPU cluster |2021|i[8001-8037].alpha.hpc.tu-dresden.de |
-| **Julia** | single SMP system |2021|smp8.julia.hpc.tu-dresden.de |
-| **Power** | IBM Power/GPU system |2018|ml[1-29].power9.hpc.tu-dresden.de |
-
-They run with their own Slurm batch system. Job submission is possible only from
-their respective login nodes.
-
-All clusters have access to these shared parallel file systems:
-
-| File system | Usable directory | Type | Capacity | Purpose |
-| --- | --- | --- | --- | --- |
-| Home | `/home` | Lustre | quota per user: 20 GB | permanant user data |
-| Project | `/projects` | Lustre | quota per project | permanant project data |
-| Scratch for large data / streaming | `/data/horse` | Lustre | 20 PB | h
-| Scratch for random access | `/data/rabbit` | Lustre | 2 PB |
-
-These mount points are planned (September 2023):
-| Scratch for random access | `/data/weasel` | WEKA | 232 TB |
-| Scratch for random access | `/data/squirrel` | BeeGFS | xxx TB |
-
-## Barnard - Intel Sapphire Rapids CPUs
-
-- 630 diskless nodes, each with
-  - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (52 cores) @ 2.50 GHz, Multithreading enabled
-  - 512 GB RAM
-- Hostnames: `n1[001-630].barnard.hpc.tu-dresden.de`
-- Login nodes: `login[1-4].barnard.hpc.tu-dresden.de`
+HPC resources in ZIH systems comprise the *High Performance Computing and Storage Complex* and its
+extension *High Performance Computing – Data Analytics*. In total, it offers scientists
+about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating-point
+operations per second. The architecture, specifically tailored to data-intensive computing, Big
+Data analytics, and artificial intelligence methods, with extensive capabilities for energy
+measurement and performance monitoring, provides ideal conditions to achieve the ambitious
+research goals of the users and the ZIH.
+
+## Login and Export Nodes
+
+- 4 Login-Nodes `tauruslogin[3-6].hrsk.tu-dresden.de`
+    - Each login node is equipped with 2x Intel(R) Xeon(R) CPU E5-2680 v3 with 24 cores in total @
+      2.50 GHz, Multithreading disabled, 64 GB RAM, 128 GB SSD local disk
+    - IPs: 141.30.73.\[102-105\]
+- 2 Data-Transfer-Nodes `taurusexport[3-4].hrsk.tu-dresden.de`
+    - DNS Alias `taurusexport.hrsk.tu-dresden.de`
+    - 2 Servers without interactive login, only available via file transfer protocols
+      (`rsync`, `ftp`)
+    - IPs: 141.30.73.\[82,83\]
+    - Further information on the usage is documented on the site
+      [Export Nodes](../data_transfer/export_nodes.md)
 
 ## AMD Rome CPUs + NVIDIA A100
 
@@ -48,8 +29,8 @@ These mount points are planned (September 2023):
 - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available
 - 1 TB RAM
 - 3.5 TB local memory on NVMe device at `/tmp`
-- Hostnames: `taurusi[8001-8034]` -> `i[8001-8037].alpha.hpc.tu-dresden.de`
-- Login nodes: `login[1-2].alpha.hpc.tu-dresden.de`
+- Hostnames: `taurusi[8001-8034]`
+- Slurm partition: `alpha`
 - Further information on the usage is documented on the site [Alpha Centauri Nodes](alpha_centauri.md)
 
 ## Island 7 - AMD Rome CPUs
 
@@ -58,8 +39,8 @@ These mount points are planned (September 2023):
 - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
 - 512 GB RAM
 - 200 GB local memory on SSD at `/tmp`
-- Hostnames: `taurusi[7001-7192]` -> `i[7001-7190].romeo.hpc.tu-dresden.de`
-- Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
+- Hostnames: `taurusi[7001-7192]`
+- Slurm partition: `romeo`
 - Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
 
 ## Large SMP System HPE Superdome Flex
 
@@ -70,7 +51,8 @@ These mount points are planned (September 2023):
 - Configured as one single node
 - 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
 - 370 TB of fast NVME storage available at `/nvme/<projectname>`
-- Hostname: `taurussmp8` -> `smp8.julia.hpc.tu-dresden.de`
+- Hostname: `taurussmp8`
+- Slurm partition: `julia`
 - Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)
 
 ## IBM Power9 Nodes for Machine Learning
 
@@ -82,5 +64,46 @@ For machine learning, we have IBM AC922 nodes installed with this configuration:
 - 256 GB RAM DDR4 2666 MHz
 - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
 - NVLINK bandwidth 150 GB/s between GPUs and host
-- Hostnames: `taurusml[1-32]` -> `ml[1-29].power9.hpc.tu-dresden.de`
-- Login nodes: `login[1-2].power9.hpc.tu-dresden.de``
+- Hostnames: `taurusml[1-32]`
+- Slurm partition: `ml`
+
+## Island 6 - Intel Haswell CPUs
+
+- 612 nodes, each with
+    - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
+    - 128 GB local memory on SSD
+- Varying amounts of main memory (selected automatically by the batch system for you according to
+  your job requirements)
+    - 594 nodes with 2.67 GB RAM per core (64 GB in total): `taurusi[6001-6540,6559-6612]`
+    - 18 nodes with 10.67 GB RAM per core (256 GB in total): `taurusi[6541-6558]`
+- Hostnames: `taurusi[6001-6612]`
+- Slurm partition: `haswell`
+
+??? hint "Node topology"
+
+    ![](misc/i4000.png)
+    {: align=center}
+
+## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs
+
+- 64 nodes, each with
+    - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
+    - 64 GB RAM (2.67 GB per core)
+    - 128 GB local memory on SSD
+    - 4 x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs
+- Hostnames: `taurusi[2045-2108]`
+- Slurm partition: `gpu2`
+- Node topology, same as [island 4 - 6](#island-6-intel-haswell-cpus)
+
+## SMP Nodes - up to 2 TB RAM
+
+- 5 nodes, each with
+    - 4 x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ 2.20 GHz, Multithreading disabled
+    - 2 TB RAM
+- Hostnames: `taurussmp[3-7]`
+- Slurm partition: `smp2`
+
+??? hint "Node topology"
+
+    ![](misc/smp2.png)
+    {: align=center}
--
GitLab
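
The nodes described in this patch are addressed through Slurm partitions. As a minimal sketch outside the patch itself (job name and resource numbers are placeholders, not taken from the documentation), a batch script targeting one of the partitions listed above might look like:

```bash
#!/bin/bash
#SBATCH --job-name=example   # placeholder name
#SBATCH --partition=romeo    # any partition above: alpha, romeo, julia, ml, haswell, gpu2, smp2
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:30:00

# Print the allocated node's hostname once per task.
srun hostname
```

Per the text above, such a script would be submitted with `sbatch` from a login node (`tauruslogin[3-6].hrsk.tu-dresden.de`), while bulk data transfers go through the export nodes instead, e.g. `rsync -av results/ taurusexport.hrsk.tu-dresden.de:~/results/`.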
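
Hostnames in this document use Slurm's compact bracket notation (e.g. `taurusi[6541-6558]`); on the system itself, `scontrol show hostnames` expands such lists. As an illustrative sketch only (not an official tool, and handling only the simple single-range form used here, not comma-separated lists):

```python
import re

def expand_hostlist(spec):
    """Expand a simple Slurm-style bracketed range such as
    'taurusi[6541-6558]' into a list of hostnames."""
    m = re.fullmatch(r"([^\[]+)\[(\d+)-(\d+)\](.*)", spec)
    if not m:
        return [spec]  # no (or unsupported) bracket: return as-is
    prefix, start, end, suffix = m.groups()
    width = len(start)  # preserve zero padding, e.g. n1[001-630]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(start), int(end) + 1)]

hosts = expand_hostlist("taurusi[6541-6558]")
print(len(hosts), hosts[0], hosts[-1])  # prints: 18 taurusi6541 taurusi6558
```

The length of the expansion matches the node count stated above (18 nodes with 256 GB RAM in total).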