Commit f4cdaa9c authored by Danny Marc Rotscher

Merge branch 'capella-nutzerbetrieb' into 'preview'

Start regular user operation

See merge request !1179
parents 53218f0e ec0aaed4
@@ -31,12 +31,12 @@ Please also find out the other ways you could contribute in our
## News
* **2024-12-13** Regular user operation of the new
[GPU cluster `Capella`](jobs_and_resources/capella.md) started
* **2024-11-18** The GPU cluster [`Capella`](jobs_and_resources/hardware_overview.md#capella) was
ranked #51 in the [TOP500](https://top500.org/system/180298/), #3 among German systems, and #5 in
the [GREEN500](https://top500.org/lists/green500/list/2024/11/) lists of the world's fastest
computers in November 2024.
* **2024-11-08** Early access phase of the
[new GPU cluster `Capella`](jobs_and_resources/capella.md) started
* **2024-11-04** Slides from the HPC Introduction tutorial in October
2024 are available for [download now](misc/HPC-Introduction.pdf)
# GPU Cluster Capella
## Overview
The Lenovo multi-GPU cluster `Capella` has been installed by MEGWARE for
AI-related computations and traditional
HPC simulations. Capella is fully integrated into the ZIH HPC infrastructure.
Therefore, the usage should be similar to the other clusters.
In November 2024, Capella was ranked #51 in the [TOP500](https://top500.org/system/180298/),
which is #3 among German
systems, and #5 in the [GREEN500](https://top500.org/lists/green500/list/2024/11/) lists of the
world's fastest computers. Background information on how Capella reached these positions can be
found in this
@@ -53,44 +36,42 @@ provide further information on these two topics.
As with all other clusters, your `/home` directory is also available on `Capella`.
For convenience, the filesystems `horse` and `walrus` are also accessible.
Please note that the filesystem `horse` **should not be used** as the working
filesystem on the cluster `Capella`, because a better-suited filesystem is available (see below).
### Cluster-Specific Filesystem `cat`
With `Capella` comes the new filesystem `cat`, designed to meet the high I/O requirements of AI
and ML workflows. It is a WEKAio filesystem and mounted under `/data/cat`. It is **only available**
on the cluster `Capella` and the [Datamover nodes](../data_transfer/datamover.md).
!!! hint "Main working filesystem is `cat`"
The filesystem `cat` should be used as the
main working filesystem and has to be used with [workspaces](../data_lifecycle/file_systems.md).
Workspaces on the filesystem `cat` can only be created on the login and compute nodes, not on
the other clusters since `cat` is not available there.
Although all other [filesystems](../data_lifecycle/workspaces.md)
(`/home`, `/software`, `/data/horse`, `/data/walrus`, etc.) are also available.
The filesystem `cat` should be used as the
main working filesystem and has to be used with [workspaces](../data_lifecycle/file_systems.md).
Workspaces on the filesystem `cat` can only be created on the login and compute nodes, not on
the other clusters since `cat` is not available there.
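A minimal sketch of how this could look on a `Capella` login node (the workspace name and
duration are only examples, and the prompt is illustrative; the created path is printed by the
tool, see the workspace documentation linked above for all options):

```console
# Allocate a 30-day workspace on the filesystem cat (name and duration are examples)
marie@login.capella$ ws_allocate --filesystem cat my_experiment 30
```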
!!! hint "Data transfer to and from `/data/cat`"
`cat` has only limited capacity, hence workspace duration is significantly shorter than
in other filesystems. We recommend that you only store actively used data there.
To transfer input and result data from and to the filesystems `horse` and `walrus`, respectively,
you will need to use the [Datamover nodes](../data_transfer/datamover.md). Regardless of the
direction of transfer, you should pack your data into archives (,e.g., using `dttar` command)
for the transfer.
Please utilize the new filesystem `cat` as the working filesystem on `Capella`. It has limited
capacity, so we advise you to only hold hot data on `cat`.
To transfer input and result data from and to the filesystems `horse` and `walrus`, respectively,
you will need to use the [Datamover nodes](../data_transfer/datamover.md). Regardless of the
direction of transfer, you should pack your data into archives (e.g., using the `dttar` command)
for the transfer.
**Do not** invoke data transfer to the filesystems `horse` and `walrus` from login nodes.
Both login nodes are part of the cluster. Failures, reboots and other work
might affect your data transfer resulting in data corruption.
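A sketch of how such an archived transfer might be issued through the Datamover wrappers (all
workspace paths are placeholders; please check the Datamover page linked above for the exact
options supported by `dttar`):

```console
# Pack a result directory on cat into an archive on walrus via the Datamover (paths are placeholders)
marie@login.capella$ dttar -czf /data/walrus/ws/marie-archive/run42.tar.gz /data/cat/ws/marie-my_experiment/run42
```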
All other shared [filesystems](../data_lifecycle/workspaces.md)
(`/home`, `/software`, `/data/horse`, `/data/walrus`, etc.) are also mounted.
## Software and Modules
The most straightforward method for utilizing the software is through the well-known
[module system](../software/modules.md).
All software available from the module system has been **specifically built** for the cluster
`Capella`, i.e., with optimizations for the Zen4 (Genoa) microarchitecture and CUDA support enabled.
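For example, discovering and loading this software could look as follows (the module names are
purely illustrative; use `module avail` to see what is actually provided on `Capella`):

```console
# Show the modules built for Capella
marie@login.capella$ module avail

# Load an illustrative compiler and a CUDA-enabled package (names are examples only)
marie@login.capella$ module load GCC/13.2.0 PyTorch/2.1.2-CUDA-12.1.1
```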
### Python Virtual Environments
[Virtual environments](../software/python_virtual_environments.md) allow you to install
additional Python packages and create an isolated runtime environment. We recommend using
@@ -100,7 +81,7 @@ additional Python packages and create an isolated runtime environment. We recomm
We recommend using [workspaces](../data_lifecycle/workspaces.md) for your virtual environments.
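A minimal sketch of creating such an environment inside a workspace (the module name and the
workspace path are placeholders; use the path printed by `ws_allocate`):

```console
# Load a Python module first (name is illustrative)
marie@login.capella$ module load Python

# Create and activate a virtual environment inside a workspace on cat (path is a placeholder)
marie@login.capella$ python -m venv /data/cat/ws/marie-python_env/env
marie@login.capella$ source /data/cat/ws/marie-python_env/env/bin/activate
(env) marie@login.capella$ pip install --upgrade pip
```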
## Batch System
The batch system Slurm may be used as usual. Please refer to the page [Batch System Slurm](slurm.md)
for detailed information. In addition, the page [Job Examples with GPU](slurm_examples_with_gpu.md)
@@ -130,7 +111,7 @@ You need to add `#SBATCH --partition=capella-interactive` to your job file and
to address this partition.
The partition `capella-interactive` is configured to use a [MIG](#virtual-gpus-mig) configuration of 1/7.
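A sketch of a minimal job file addressing this partition (only the `--partition` line is taken
from the text above; account, resource requests, time limit, and the script name are placeholders
to adapt to your project):

```bash
#!/bin/bash
#SBATCH --partition=capella-interactive   # 1/7 MIG slice of a GPU (see below)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4                 # placeholder values
#SBATCH --gres=gpu:1                      # request one (virtual) GPU; adjust to the cluster's conventions
#SBATCH --time=01:00:00
#SBATCH --job-name=mig_test

# Load the software you need via the module system (name illustrative)
module load Python

srun python my_script.py
```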
## Virtual GPUs-MIG
Starting with the Capella cluster, we introduce virtual GPUs. They are based on
[Nvidia's MIG technology](https://www.nvidia.com/de-de/technologies/multi-instance-gpu/).
@@ -148,11 +129,10 @@ Since a GPU in the `Capella` cluster offers 3.2-3.5x more peak performance compa
in the cluster [`Alpha Centauri`](hardware_overview.md#alpha-centauri), a 1/7 shard of a GPU in
Capella is about half the performance of a GPU in `Alpha Centauri`.
At the moment we only have a partitioning of 7 in the `capella-interactive` partition,
but we are free to create more configurations in the future.
For this, users' demands and the expected high utilization of the smaller GPUs are essential.
| Configuration Name | Compute Resources | Memory in GiB | Accounted GPU hour |
| ------------------------| --------------------| ------------- |---------------------|
| `capella-interactive` | 1 / 7 | 11 | 1/7 |
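As a worked example of this accounting factor (assuming accounted GPU hours scale linearly with
the wall time of the job): a job that occupies one such 1/7 slice for 7 hours is accounted as
7 h × 1/7 = 1 GPU hour.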
@@ -7,6 +7,7 @@ ACLs
Addon
Addons
AFIPS
allocatable
ALLREDUCE
Altix
Altra
@@ -136,6 +137,7 @@ GitLab's
glibc
Gloo
gnuplot
Golem
gpu
GPU
GPUs
@@ -208,6 +210,7 @@ LAMMPS
LAPACK
lapply
Leichtbau
Lenovo
LINPACK
linter
Linter
@@ -287,6 +290,7 @@ Nutzungsbedingungen
nvcc
Nvidia
NVIDIA
Nvidia's
NVLINK
NVMe
nvprof