ZIH / hpcsupport / hpc-compendium · Commits

Commit 2698d527, authored 3 years ago by Martin Schroschk

Review: make node topo graphics foldable
Parent: 52c3794d
No related branches or tags found.
Part of 4 merge requests: !392 "Merge preview into contrib guide for browser users", !333 "Draft: update NGC containers", !327 "Merge preview into main", !317 "Jobs and resources"
Showing 1 changed file with 74 additions and 73 deletions:
doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md (view file @ 2698d527)
# ZIH Systems
ZIH systems comprise the *High Performance Computing and Storage Complex* (HRSK-II) and its extension *High Performance Computing – Data Analytics* (HPC-DA). In total, they offer scientists about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating point operations per second. The architecture, specifically tailored to data-intensive computing, Big Data analytics, and artificial intelligence methods, and equipped with extensive capabilities for energy measurement and performance monitoring, provides ideal conditions to achieve the ambitious research goals of the users and the ZIH.
## Login Nodes
- Login-Nodes (`tauruslogin[3-6].hrsk.tu-dresden.de`)
    - each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50GHz, MultiThreading Disabled, 64 GB RAM, 128 GB SSD local disk
    - IPs: 141.30.73.\[102-105\]
- Transfer-Nodes (`taurusexport3/4.hrsk.tu-dresden.de`, DNS Alias `taurusexport.hrsk.tu-dresden.de`)
    - 2 Servers without interactive login, only available via file transfer protocols (`rsync`, `ftp`)
    - IPs: 141.30.73.82/83
- Direct access to these nodes is granted via IP whitelisting (contact hpcsupport@zih.tu-dresden.de), otherwise use TU Dresden VPN; see the access sketch below.
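A minimal access sketch, assuming a valid HPC login; the username `marie`, the source directory, and the target path are placeholders:

```bash
# Interactive login (requires TU Dresden VPN or a whitelisted IP)
ssh marie@tauruslogin4.hrsk.tu-dresden.de

# File transfer via the export nodes (no interactive login there)
rsync -avP ./input_data/ marie@taurusexport.hrsk.tu-dresden.de:/path/to/target/
```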
## AMD Rome CPUs + NVIDIA A100
- 32 nodes, each with
    - 8 x NVIDIA A100-SXM4
    - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, MultiThreading disabled
    - 1 TB RAM
    - 3.5 TB local storage on NVMe device at `/tmp`
- Hostnames: `taurusi[8001-8034]`
- Slurm partition `alpha` (usage sketch below)
- Dedicated mostly for ScaDS-AI
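A hedged sketch of a batch job targeting this partition; the GPU count, resource values, and the script `train.py` are illustrative, and the generic resource name `gpu` is an assumption:

```bash
#!/bin/bash
#SBATCH --partition=alpha        # A100 partition listed above
#SBATCH --nodes=1
#SBATCH --gres=gpu:1             # one of the eight A100 GPUs (gres name assumed)
#SBATCH --cpus-per-task=6
#SBATCH --mem=80G
#SBATCH --time=02:00:00

srun python train.py             # train.py is a placeholder
```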
## Island 7 - AMD Rome CPUs
- 192 nodes, each with
    - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, MultiThreading enabled
    - 512 GB RAM
    - 200 GB `/tmp` on local SSD disk
- Hostnames: `taurusi[7001-7192]`
- Slurm partition `romeo` (usage sketch below)
- More information under [RomeNodes](rome_nodes.md)
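Since MultiThreading (SMT) is enabled on these nodes, a job may want to pin one task per physical core. A minimal sketch, assuming an MPI binary `./my_mpi_app` (placeholder):

```bash
# One Rome node, one task per physical core, hyperthreads left idle
srun --partition=romeo --nodes=1 --ntasks-per-node=128 \
     --hint=nomultithread --time=01:00:00 ./my_mpi_app
```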
## Large SMP System HPE Superdome Flex
- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores)
- 47 TB RAM
- Currently configured as one single node
- Hostname: `taurussmp8`
- Slurm partition `julia` (usage sketch below)
- More information under [HPE SD Flex](sd_flex.md)
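This machine is intended for jobs that need far more memory than a regular node offers. A hedged sketch with illustrative values and a placeholder binary `./my_smp_app`:

```bash
# Large shared-memory job on the single Superdome Flex node
srun --partition=julia --ntasks=1 --cpus-per-task=32 \
     --mem=2T --time=04:00:00 ./my_smp_app
```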
## IBM Power9 Nodes for Machine Learning
For machine learning, we have 32 IBM AC922 nodes installed with this configuration:

- 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores)
- 256 GB RAM DDR4 2666MHz
- 6x NVIDIA VOLTA V100 with 32GB HBM2
- NVLINK bandwidth 150 GB/s between GPUs and host
- Slurm partition `ml` (usage sketch below)
- Hostnames: `taurusml[1-32]`
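These are ppc64le (Power) nodes, so x86-64 binaries and containers will not run on them. A minimal GPU allocation sketch with illustrative values; the generic resource name `gpu` is an assumption:

```bash
# Request one V100 on the Power9 partition for interactive work
srun --partition=ml --nodes=1 --gres=gpu:1 \
     --cpus-per-task=7 --mem=30G --time=01:00:00 --pty bash
```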
## Island 4 to 6 - Intel Haswell CPUs
- 1456 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50GHz, MultiThreading disabled, 128 GB SSD local disk
- Hostnames: `taurusi4[001-232]`, `taurusi5[001-612]`, `taurusi6[001-612]`
- Varying amounts of main memory (selected automatically by the batch system for you according to your job requirements; see the sketch after this list)
    - 1328 nodes with 2.67 GB RAM per core (64 GB total): `taurusi[4001-4104,5001-5612,6001-6612]`
    - 84 nodes with 5.34 GB RAM per core (128 GB total): `taurusi[4105-4188]`
    - 44 nodes with 10.67 GB RAM per core (256 GB total): `taurusi[4189-4232]`
- Slurm partition `haswell`

??? hint "Node topology"

    ![Node topology]
    {: align=center}
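The per-core memory request is what steers a job onto the right Haswell nodes. A hedged sketch with illustrative numbers; `job.sh` is a placeholder:

```bash
# Fits the 64 GB nodes (about 2.67 GB per core available)
sbatch --partition=haswell --nodes=2 --ntasks-per-node=24 --mem-per-cpu=2500M job.sh

# A larger per-core request restricts the job to the 128 GB or 256 GB nodes
sbatch --partition=haswell --nodes=1 --ntasks-per-node=24 --mem-per-cpu=10000M job.sh
```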
### Extension of Island 4 with Broadwell CPUs
* 32 nodes, each with 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (**14 cores**), MultiThreading disabled, 64 GB RAM, 256 GB SSD local disk
* From the users' perspective: Broadwell is like Haswell
* Hostname: `taurusi[4233-4264]`
* Slurm partition `broadwell`
...
## SMP Nodes - up to 2 TB RAM
- 5 nodes, each with 4x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ 2.20GHz, MultiThreading Disabled, 2 TB RAM
- Hostname: `taurussmp[3-7]`
- Slurm partition `smp2`

??? hint "Node topology"

    ![Node topology]
    {: align=center}
## Island 2 Phase 1 - Intel Sandybridge CPUs + NVIDIA K20x GPUs
- 44 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2450 (8 cores) @ 2.10GHz, MultiThreading Disabled, 48 GB RAM (3 GB per core), 128 GB SSD local disk, 2x NVIDIA Tesla K20x (6 GB GDDR RAM) GPUs
- Hostname: `taurusi2[001-044]`
- Slurm partition `gpu1`

??? hint "Node topology"

    ![Node topology]
    {: align=center}