Commit f4cdaa9c authored by Danny Marc Rotscher

Merge branch 'capella-nutzerbetrieb' into 'preview'

Start regular user operation

See merge request !1179
parents 53218f0e ec0aaed4
@@ -31,12 +31,12 @@ Please also find out the other ways you could contribute in our
## News
* **2024-12-13** Regular user operation of the new
[GPU cluster `Capella`](jobs_and_resources/capella.md) started
* **2024-11-18** The GPU cluster [`Capella`](jobs_and_resources/hardware_overview.md#capella) was
ranked #51 in the [TOP500](https://top500.org/system/180298/), #3 among German systems, and #5 in
the [GREEN500](https://top500.org/lists/green500/list/2024/11/) lists of the world's fastest
computers in November 2024.
* **2024-11-08** Early access phase of the
[new GPU cluster `Capella`](jobs_and_resources/capella.md) started
* **2024-11-04** Slides from the HPC Introduction tutorial in October
2024 are available for [download now](misc/HPC-Introduction.pdf)
# GPU Cluster Capella
## Overview
The Lenovo multi-GPU cluster `Capella` has been installed by MEGWARE for
AI-related computations and traditional
HPC simulations. Capella is fully integrated into the ZIH HPC infrastructure.
Therefore, the usage should be similar to the other clusters.
In November 2024, Capella was ranked #51 in the [TOP500](https://top500.org/system/180298/),
which is #3 among German
systems, and #5 in the [GREEN500](https://top500.org/lists/green500/list/2024/11/) lists of the
world's fastest computers. Background information on how Capella reached these positions can be
found in this
@@ -53,44 +36,42 @@ provide further information on these two topics.
As with all other clusters, your `/home` directory is also available on `Capella`.
For convenience, the filesystems `horse` and `walrus` are also accessible.
Please note that the filesystem `horse` **should not be used** as the working
filesystem on the cluster `Capella`, because a better-suited filesystem is available (see below).
### Cluster-Specific Filesystem `cat`
With `Capella` comes the new filesystem `cat`, designed to meet the high I/O requirements of AI
and ML workflows. It is a WEKAio filesystem and mounted under `/data/cat`. It is **only available**
on the cluster `Capella` and the [Datamover nodes](../data_transfer/datamover.md).
!!! hint "Main working filesystem is `cat`"
The filesystem `cat` should be used as the
main working filesystem and has to be used with [workspaces](../data_lifecycle/file_systems.md).
Workspaces on the filesystem `cat` can only be created on the login and compute nodes, not on
the other clusters since `cat` is not available there.
Although all other [filesystems](../data_lifecycle/workspaces.md)
(`/home`, `/software`, `/data/horse`, `/data/walrus`, etc.) are also available.
The filesystem `cat` should be used as the
main working filesystem and has to be used with [workspaces](../data_lifecycle/file_systems.md).
Workspaces on the filesystem `cat` can only be created on the login and compute nodes, not on
the other clusters since `cat` is not available there.
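A minimal sketch of how this could look on a `Capella` login node (the workspace name and
duration are only examples, and the prompt is illustrative; the created path is printed by the
tool, see the workspace documentation linked above for all options):

```console
# Allocate a 30-day workspace on the filesystem cat (name and duration are examples)
marie@login.capella$ ws_allocate --filesystem cat my_experiment 30
```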
!!! hint "Data transfer to and from `/data/cat`"
`cat` has only limited capacity, hence workspace duration is significantly shorter than
in other filesystems. We recommend that you only store actively used data there.
To transfer input and result data from and to the filesystems `horse` and `walrus`, respectively,
you will need to use the [Datamover nodes](../data_transfer/datamover.md). Regardless of the
direction of transfer, you should pack your data into archives (,e.g., using `dttar` command)
for the transfer.
Please utilize the new filesystem `cat` as the working filesystem on `Capella`. It has limited
capacity, so we advise you to only hold hot data on `cat`.
To transfer input and result data from and to the filesystems `horse` and `walrus`, respectively,
you will need to use the [Datamover nodes](../data_transfer/datamover.md). Regardless of the
direction of transfer, you should pack your data into archives (e.g., using the `dttar` command)
for the transfer.
**Do not** invoke data transfer to the filesystems `horse` and `walrus` from login nodes.
Both login nodes are part of the cluster. Failures, reboots and other work
might affect your data transfer resulting in data corruption.
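A sketch of how such an archived transfer might be issued through the Datamover wrappers (all
workspace paths are placeholders; please check the Datamover page linked above for the exact
options supported by `dttar`):

```console
# Pack a result directory on cat into an archive on walrus via the Datamover (paths are placeholders)
marie@login.capella$ dttar -czf /data/walrus/ws/marie-archive/run42.tar.gz /data/cat/ws/marie-my_experiment/run42
```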
All other shared [filesystems](../data_lifecycle/workspaces.md)
(`/home`, `/software`, `/data/horse`, `/data/walrus`, etc.) are also mounted.
## Software and Modules
The most straightforward method for utilizing the software is through the well-known
[module system](../software/modules.md).
All software available from the module system has been **specifically built** for the cluster
`Capella`, i.e., with optimizations for the Zen4 (Genoa) microarchitecture and CUDA support enabled.
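For example, discovering and loading this software could look as follows (the module names are
purely illustrative; use `module avail` to see what is actually provided on `Capella`):

```console
# Show the modules built for Capella
marie@login.capella$ module avail

# Load an illustrative compiler and a CUDA-enabled package (names are examples only)
marie@login.capella$ module load GCC/13.2.0 PyTorch/2.1.2-CUDA-12.1.1
```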
### Python Virtual Environments
[Virtual environments](../software/python_virtual_environments.md) allow you to install
additional Python packages and create an isolated runtime environment. We recommend using
@@ -100,7 +81,7 @@ additional Python packages and create an isolated runtime environment. We recomm
We recommend using [workspaces](../data_lifecycle/workspaces.md) for your virtual environments.
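A minimal sketch of creating such an environment inside a workspace (the module name and the
workspace path are placeholders; use the path printed by `ws_allocate`):

```console
# Load a Python module first (name is illustrative)
marie@login.capella$ module load Python

# Create and activate a virtual environment inside a workspace on cat (path is a placeholder)
marie@login.capella$ python -m venv /data/cat/ws/marie-python_env/env
marie@login.capella$ source /data/cat/ws/marie-python_env/env/bin/activate
(env) marie@login.capella$ pip install --upgrade pip
```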
## Batch System
The batch system Slurm may be used as usual. Please refer to the page [Batch System Slurm](slurm.md)
for detailed information. In addition, the page [Job Examples with GPU](slurm_examples_with_gpu.md)
@@ -130,7 +111,7 @@ You need to add `#SBATCH --partition=capella-interactive` to your job file and
to address this partition.
The partition `capella-interactive` is configured to use a [MIG](#virtual-gpus-mig) configuration of 1/7.
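A sketch of a minimal job file addressing this partition (only the `--partition` line is taken
from the text above; account, resource requests, time limit, and the script name are placeholders
to adapt to your project):

```bash
#!/bin/bash
#SBATCH --partition=capella-interactive   # 1/7 MIG slice of a GPU (see below)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4                 # placeholder values
#SBATCH --gres=gpu:1                      # request one (virtual) GPU; adjust to the cluster's conventions
#SBATCH --time=01:00:00
#SBATCH --job-name=mig_test

# Load the software you need via the module system (name illustrative)
module load Python

srun python my_script.py
```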
## Virtual GPUs-MIG
Starting with the Capella cluster, we introduce virtual GPUs. They are based on
[Nvidia's MIG technology](https://www.nvidia.com/de-de/technologies/multi-instance-gpu/).
@@ -148,11 +129,10 @@ Since a GPU in the `Capella` cluster offers 3.2-3.5x more peak performance compa
in the cluster [`Alpha Centauri`](hardware_overview.md#alpha-centauri), a 1/7 shard of a GPU in
Capella is about half the performance of a GPU in `Alpha Centauri`.
At the moment we only have a partitioning of 7 in the `capella-interactive` partition,
but we are free to create more configurations in the future.
For this, users' demands and the expected high utilization of the smaller GPUs are essential.
| Configuration Name | Compute Resources | Memory in GiB | Accounted GPU hour |
| ------------------------| --------------------| ------------- |---------------------|
| `capella-interactive` | 1 / 7 | 11 | 1/7 |
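As a worked example of this accounting factor (assuming accounted GPU hours scale linearly with
the wall time of the job): a job that occupies one such 1/7 slice for 7 hours is accounted as
7 h × 1/7 = 1 GPU hour.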
@@ -7,6 +7,7 @@ ACLs
Addon
Addons
AFIPS
allocatable
ALLREDUCE
Altix
Altra
@@ -136,6 +137,7 @@ GitLab's
glibc
Gloo
gnuplot
Golem
gpu
GPU
GPUs
@@ -208,6 +210,7 @@ LAMMPS
LAPACK
lapply
Leichtbau
Lenovo
LINPACK
linter
Linter
@@ -287,6 +290,7 @@ Nutzungsbedingungen
nvcc
Nvidia
NVIDIA
Nvidia's
NVLINK
NVMe
nvprof