# Introduction HPC Resources and Jobs

ZIH operates high performance computing (HPC) systems with more than 90,000 cores, 500 GPUs, and a
flexible storage hierarchy with about 20 PB total capacity. The HPC system provides an optimal
research environment especially in the area of data analytics, artificial intelligence methods and
machine learning as well as for processing extremely large data sets. Moreover, it is also a
perfect platform for highly scalable, data-intensive and compute-intensive applications and has
extensive capabilities for energy measurement and performance monitoring. It therefore provides
ideal conditions to achieve the ambitious research goals of the users and the ZIH.

The HPC system, redesigned in December 2023, consists of five homogeneous clusters with their own
[Slurm](../jobs_and_resources/slurm.md) instances and cluster-specific
[login nodes](hardware_overview.md#login-nodes). The clusters share one
[filesystem](../data_lifecycle/file_systems.md), which enables users to easily switch between the
components.

## Selection of Suitable Hardware

The five clusters [`barnard`](barnard.md), [`alpha`](alpha_centauri.md), [`romeo`](romeo.md),
[`power`](power9.md) and [`julia`](julia.md) differ, among other things, in the number of nodes,
cores per node, GPUs and memory. These particular [characteristics](hardware_overview.md) qualify
them for different applications.

### Which Cluster Do I Need?

The majority of basic tasks can be executed on the conventional nodes, e.g. on `barnard`. When
logging in to ZIH systems, you are placed on a login node where you can execute short tests and
compile moderate projects. The login nodes cannot be used for real experiments and computations.
Long and extensive computational work and experiments have to be encapsulated into so-called
**jobs** and scheduled to the compute nodes.

There is no such thing as a free lunch at ZIH systems. Since compute nodes are operated in
multi-user mode by default, jobs of several users can run at the same time on the very same node,
sharing resources like memory (but not CPU). On the other hand, a higher throughput can be reached
by smaller jobs. Thus, restrictions w.r.t. [memory](#memory-limits) and
[runtime limits](#runtime-limits) have to be respected when submitting jobs.
The following questions may help to decide which cluster to use:

- my application
    - is [interactive or a batch job](../jobs_and_resources/slurm.md)?
    - requires [parallelism](#parallel-jobs)?
    - requires [multithreading (SMT)](#multithreading)?
- Do I need [GPUs](#what-do-i-need-a-cpu-or-gpu)?
- How much [run time](#runtime-limits) do I need?
- How many [cores](#how-many-cores-do-i-need) do I need?
- How much [memory](#how-much-memory-do-i-need) do I need?
- Which [software](#available-software) is required?

<!-- cluster_overview_table -->
|Name|Description| DNS | Nodes | # Nodes | Cores per Node | Threads per Core | Memory per Node [in MB] | Memory per Core [in MB] | GPUs per Node |
|---|---|----|:---|---:|---:|---:|---:|---:|---:|
|**Power**<br>_2018_|IBM Power/GPU system |`ml[node].power9.hpc.tu-dresden.de`|taurusml[3-32] | 30 | 44 | 4 | 254,000 | 1,443 | 6 |
{: summary="cluster overview table" align="bottom"}

### Interactive or Batch Mode

**Interactive jobs:** An interactive job is the best choice for testing and development. See
[interactive-jobs](slurm.md).
Slurm can forward your X11 credentials to the first node (or even all nodes) of a job.
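
A minimal sketch of requesting such an interactive session is shown below; the resource and time
values are placeholders to adapt to your needs (add `--x11` if you need X11 forwarding):

```console
marie@login$ srun --ntasks=1 --cpus-per-task=4 --time=01:00:00 --pty bash -l
```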

Apart from short test runs, it is recommended to encapsulate your experiments and tasks into batch
jobs and submit them to the batch system. For that, you can conveniently put the parameters
directly into the job file which you can submit using `sbatch [options] <job file>`.
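
A minimal sketch of such a job file is shown below; the job name, resource values and application
are placeholders and have to be adapted to your project:

```bash
#!/bin/bash
#SBATCH --job-name=my_experiment    # name shown in the queue
#SBATCH --time=01:00:00             # maximum runtime (hh:mm:ss), see runtime limits
#SBATCH --ntasks=1                  # number of tasks (processes)
#SBATCH --cpus-per-task=4           # cores per task
#SBATCH --mem-per-cpu=2000          # memory per core in MB, see memory limits

srun ./my_application
```

The job file is then submitted from a login node, e.g. `sbatch my_jobfile.sh`.
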
### Parallel Jobs
**MPI jobs:** For MPI jobs, one typically allocates one core per task. Several nodes can be
allocated for multi-node MPI jobs.

If you have a task with potential data parallelism, most likely you need the GPUs. Beyond video
rendering, GPUs excel in tasks such as machine learning, simulations and risk modeling. Use the
cluster `power` only if you need GPUs! Otherwise, using the x86-based partitions most likely would
be more beneficial.
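
As an illustration, a multi-node MPI job could be requested as sketched below; the node and task
counts as well as the module name are assumptions and depend on the chosen cluster:

```bash
#!/bin/bash
#SBATCH --nodes=2                 # two full nodes
#SBATCH --ntasks-per-node=104     # e.g. one MPI rank per core on a `barnard` node
#SBATCH --time=02:00:00

module load <mpi_module>          # placeholder for the MPI module provided on the cluster
srun ./my_mpi_application         # srun starts one process per allocated task
```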
### Multithreading
Some clusters/nodes have Simultaneous Multithreading (SMT) enabled, e.g. [`alpha`](slurm.md). You
can request these additional threads using the Slurm option `--hint=multithread` or by setting the
environment variable `SLURM_HINT=multithread`. Besides using the threads to speed up the
computations, the memory of the other threads is allocated implicitly, too, and you will always get
`Memory per Core` * `number of threads` as memory pledge.
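
As an illustration, hardware threads can be requested like this (task and core counts are
placeholders):

```console
marie@login$ srun --ntasks=1 --cpus-per-task=8 --hint=multithread ./my_application
```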

### What Do I Need, a CPU or GPU?

If an application is designed to run on GPUs, this is normally announced unmistakably, since the
effort of adapting existing software to make use of a GPU can be overwhelming. A simple way to
decide is to compare a typical computation on a normal node and on a GPU node: if the execution
time with GPU is better by a significant factor, then the GPU might be the obvious choice. Keep in
mind that GPUs process data with massive parallelism, but the amount of data which a single GPU
core can handle is small, so GPUs are not as versatile as CPUs.

### How Much Time Do I Need?

#### Runtime Limits

!!! warning "Runtime limits on login nodes"

### How Many Cores Do I Need?

ZIH systems are focused on data-intensive computing. They are meant to be used for highly
parallelized code. Please take that into account when migrating sequential code from a local
machine to our HPC systems. To estimate your execution time when executing your previously
sequential program in parallel, you can use
[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law). Think in advance about the
parallelization strategy for your project and how to effectively use HPC resources.
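
For orientation, Amdahl's law bounds the achievable speedup: if a fraction $p$ of the program can
be parallelized and $N$ cores are used (symbols chosen here only for illustration), then

$$
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
$$

For example, with $p = 0.9$ and $N = 16$ the speedup is at most $1 / (0.1 + 0.9/16) = 6.4$.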

However, this highly depends on the software used, so investigate whether your application
supports parallel execution.

### How Much Memory Do I Need?

#### Memory Limits

Follow the page [Slurm](slurm.md) for comprehensive documentation on using the batch system at ZIH
systems. There is also a page with an extensive set of [Slurm examples](slurm_examples.md).
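
As an illustration, memory can be requested at submission time with Slurm's `--mem-per-cpu` or
`--mem` options; the values below are placeholders and must stay within the per-node limits:

```console
marie@login$ srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=2000 --time=00:30:00 ./my_application
```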

### Which Software Is Required?

#### Available Software

Pre-installed software on our HPC systems is managed via [modules](../software/modules.md).
You can see the
[list of software that's already installed and accessible via modules](https://gauss-allianz.de/de/application?organizations%5B0%5D=1200).
However, there are many different variants of these modules available. Each cluster has its own set
of installed modules, [depending on their purpose](doc.zih.tu-dresden.de/docs/software/.md).
Specific modules can be found with:
```console
marie@compute$ module spider <software_name>
```
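
Once you have identified a suitable module and version, it can be loaded in the same way; the
module name and version below are placeholders:

```console
marie@compute$ module load <software_name>/<version>
```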

ZIH provides a broad variety of compute resources ranging from normal server CPUs of different
manufacturers, large shared memory nodes, and GPU-assisted nodes up to highly specialized resources
for [Machine Learning](../software/machine_learning.md) and AI.
## Barnard
The cluster **Barnard** is a general purpose cluster by Bull. It is based on Intel Sapphire Rapids
CPUs.

## Power9

The cluster **power** by IBM is based on Power9 CPUs and provides NVIDIA V100 GPUs.

- Login nodes: `login[1-2].power9.hpc.tu-dresden.de`
- Further information on the usage is documented on the site [GPU Cluster Power9](power9.md)
## Processing of Data for Input and Output
Pre-processing and post-processing of the data is a crucial part for the majority of data-dependent
projects.