Commit c97a06b8 authored by Jan Frenzel

Merge branch 'merge-preview-in-main-' into 'main'

Automated merge from preview to main

See merge request !1173
parents 42c759f5 5cdfb927
......@@ -14,15 +14,14 @@ Please follow this standard Git procedure for working with a local clone:
or request access to the project.
1. Change to a local (unencrypted) filesystem. (We have seen problems running the container on an
ecryptfs filesystem. So you might want to use e.g. `/tmp` as the start directory.)
1. Create a new directory, e.g. with `mkdir hpc-wiki`
1. Change into the new directory, e.g. `cd hpc-wiki`
1. Clone the Git repository:
1. `git clone git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git .` (don't forget the
dot)
1. if you forked the repository, use
`git clone git@gitlab.hrz.tu-chemnitz.de:<YOUR_LOGIN>/hpc-compendium.git .` (don't forget the dot).
Add the original repository as a so-called remote:
`git remote add upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git`
1. `git clone git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git`
1. If you forked the repository, instead use:
- `git clone git@gitlab.hrz.tu-chemnitz.de:<YOUR_LOGIN>/hpc-compendium.git`
1. Change into the new directory:
- `cd hpc-compendium`
1. If you forked the repository, add the original repository as a so-called remote:
- `git remote add upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git`
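Taken together, the fork-based setup from the steps above might look like this minimal sketch (replace `<YOUR_LOGIN>` with your GitLab login):

```bash
# clone your fork into a new directory named hpc-compendium
git clone git@gitlab.hrz.tu-chemnitz.de:<YOUR_LOGIN>/hpc-compendium.git
cd hpc-compendium

# register the original repository as the remote 'upstream-zih'
git remote add upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git

# verify that both remotes ('origin' and 'upstream-zih') are configured
git remote -v
```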
## Working with your Local Clone
......
......@@ -31,6 +31,10 @@ Please also find out the other ways you could contribute in our
## News
* **2024-11-18** The GPU cluster [`Capella`](jobs_and_resources/hardware_overview.md#capella) was
ranked #51 in the [TOP500](https://top500.org/system/180298/), #3 among the German systems and #5 in
the [GREEN500](https://top500.org/lists/green500/list/2024/11/) lists of the world's fastest
computers in November 2024.
* **2024-11-08** Early access phase of the
[new GPU cluster `Capella`](jobs_and_resources/capella.md) started.
* **2024-11-04** Slides from the HPC Introduction tutorial in October
......
......@@ -9,7 +9,7 @@
Do not yet move your "production" to `Capella`, but feel free to test it using moderately sized
workloads. Please read this page carefully to understand what you need to adapt in your
existing workflows w.r.t. [filesystem](#filesystems), [software and
modules](#software-and-modules) and [batch jobs](#batchsystem).
modules](#software-and-modules) and [batch jobs](#batch-system).
We highly appreciate your hints and would be pleased to receive your comments and experiences
regarding its operation via e-mail to
......@@ -25,6 +25,12 @@ The multi-GPU cluster `Capella` has been installed for AI-related computations a
HPC simulations. Capella is fully integrated into the ZIH HPC infrastructure.
Therefore, the usage should be similar to the other clusters.
Capella was ranked #51 in the [TOP500](https://top500.org/system/180298/), which is #3 among the German
systems, and #5 in the [GREEN500](https://top500.org/lists/green500/list/2024/11/) lists of the
world's fastest computers. Background information on how Capella reached these positions can be
found in this
[Golem article](https://www.golem.de/news/effiziente-grossrechner-wie-man-einen-supercomputer-in-die-green500-bekommt-2411-190925.html).
## Hardware Specifications
The hardware specification is documented on the page
......@@ -64,7 +70,7 @@ on the cluster `Capella` and the [Datamover nodes](../data_transfer/datamover.md
However, all other [filesystems](../data_lifecycle/workspaces.md)
(`/home`, `/software`, `/data/horse`, `/data/walrus`, etc.) are also available.
!!! hint "Datatransfer to and from `/data/cat`"
!!! hint "Data transfer to and from `/data/cat`"
Please utilize the new filesystem `cat` as the working filesystem on `Capella`. It has limited
capacity, so we advise you to only hold hot data on `cat`.
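A minimal sketch of such a transfer using the Datamover commands (the workspace paths are hypothetical):

```console
marie@login$ dtcp -r /data/horse/ws/marie-input-data /data/cat/ws/marie-input-data
marie@login$ dtls /data/cat/ws/marie-input-data
```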
......@@ -94,10 +100,10 @@ additional Python packages and create an isolated runtime environment. We recomm
We recommend using [workspaces](../data_lifecycle/workspaces.md) for your virtual environments.
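A minimal sketch, assuming a workspace on the `cat` filesystem (the workspace name, the Python module, and the path printed by `ws_allocate` are placeholders):

```console
marie@login$ ws_allocate -F cat python-env 30                    # prints the workspace path
marie@login$ module load Python                                  # load a suitable Python module
marie@login$ python -m venv /data/cat/ws/marie-python-env/env    # use the printed path
marie@login$ source /data/cat/ws/marie-python-env/env/bin/activate
```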
### Batchsystem
### Batch System
The batch system Slurm may be used as usual. Please refer to the page [Batch System Slurm](slurm.md)
for detailed information. In addition, the page [Job Examples](slurm_examples.md#requesting-gpus)
for detailed information. In addition, the page [Job Examples with GPU](slurm_examples_with_gpu.md)
provides examples on GPU allocation with Slurm.
You can find out about upcoming reservations (e.g., for acceptance benchmarks) via `sinfo -T`.
......@@ -117,7 +123,36 @@ Acceptance has priority, so your reservation requests can currently not be consi
checkpoint/restart, please use `/data/cat` for temporary data. Remove these data afterwards!
The partition `capella-interactive` can be used for your small tests and compilation of software.
You need to add `#SBATCH --partition=capella-interactive` to your jobfile and
In addition, JupyterHub instances that require only low GPU utilization, or use GPUs only for a short
period of time in their allocation, are intended to use this partition.
You need to add `#SBATCH --partition=capella-interactive` to your job file and
`--partition=capella-interactive` to your `sbatch`, `srun` and `salloc` command line, respectively,
to address this partition. The partition's configuration might be adapted during the acceptance phase.
You get the current settings via `scontrol show partitions capella-interactive`.
to address this partition.
The partition `capella-interactive` is configured to use a [MIG](#virtual-gpus-mig) configuration of 1/7.
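A minimal job file sketch addressing this partition (the resource values and the project name are placeholders; the exact GRES specification for MIG slices may differ):

```Bash
#!/bin/bash
#SBATCH --partition=capella-interactive   # address the interactive (MIG) partition
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1                      # one virtual GPU, i.e. a 1/7 slice here
#SBATCH --time=00:30:00                   # short test run
#SBATCH --account=p_number_crunch         # account to project p_number_crunch

srun ./your/application
```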
### Virtual GPUs (MIG)
Starting with the Capella cluster, we introduce virtual GPUs. They are based on
[Nvidia's MIG technology](https://www.nvidia.com/de-de/technologies/multi-instance-gpu/).
From an application point of view, each virtual GPU looks like a normal physical GPU, but offers
only a fraction of the compute resources and of the maximum allocatable memory on the device.
We also account only a fraction of a full GPU hour to your project.
By using virtual GPUs, we expect to improve overall system utilization for jobs that cannot take
advantage of a full H100 GPU.
In addition, we can provide you with more resources and therefore shorter waiting times.
We intend to use these partitions for all applications that cannot use a full H100 GPU, such as
Jupyter notebooks.
Users can check the compute and memory usage of the GPU with the help of the
[job monitoring system PIKA](../software/performance_engineering_overview.md#pika).
Since a GPU in the `Capella` cluster offers 3.2-3.5x more peak performance compared to an A100 GPU
in the cluster [`Alpha Centauri`](hardware_overview.md#alpha-centauri), a 1/7 shard of a GPU in
Capella is about half the performance of a GPU in `Alpha Centauri`.
At the moment, we only offer a partitioning into sevenths in the `capella-interactive` partition, but we
are evaluating whether to provide more configurations in the future.
| Configuration Name | Compute Resources | Memory in GiB | Accounted GPU hour |
| --------------------------------------| --------------------| ------------- |---------------------|
| `capella-interactive`, `capella-mig7` | 1 / 7 | 11 | 0.14285714285714285 |
| `capella-mig3` | 3 / 7 | 33 | 0.42857142857142855 |
| `capella-mig2` | 2 / 7 | 22 | 0.28571428571428570 |
......@@ -13,7 +13,7 @@ users and the ZIH.
Over the last decade we have been running our highly heterogeneous HPC system with a single
Slurm batch system. This made things very complicated, especially for inexperienced users. With
the replacement of the Taurus system by the cluster [Barnard](#barnard) in 2023 we have a new
archtictural design comprising **six homogeneous clusters with their own Slurm instances and with
architectural design comprising **six homogeneous clusters with their own Slurm instances and with
cluster-specific login nodes** running on the same CPU. Job submission is possible only from
within the corresponding cluster (compute or login node).
......@@ -48,7 +48,7 @@ only from their respective login nodes.
- `dataport[3-4].hpc.tu-dresden.de`
- IPs: 141.30.73.\[4,5\]
- Further information on the usage is documented on the site
[dataport Nodes](../data_transfer/dataport_nodes.md)
[Dataport Nodes](../data_transfer/dataport_nodes.md)
## Barnard
......@@ -91,6 +91,7 @@ and is designed for AI and ML tasks.
- Login nodes: `login[1-2].capella.hpc.tu-dresden.de`
- Hostnames: `c[1-144].capella.hpc.tu-dresden.de`
- Operating system: Alma Linux 9.4
- Offers fractions of full GPUs via [Nvidia's MIG mechanism](capella.md#virtual-gpus-mig)
- Further information on the usage is documented on the site [GPU Cluster Capella](capella.md)
## Romeo
......
......@@ -67,7 +67,7 @@ This page provides a brief overview on
* how to [manage and control your jobs](#manage-and-control-jobs).
If you are already familiar with Slurm, you might be more interested in our collection of
[job examples](slurm_examples.md).
[job examples](slurm_examples.md) and [job examples for GPU usage](slurm_examples_with_gpu.md).
There are also plenty of external resources regarding Slurm. We recommend these links for detailed
information:
......
......@@ -58,8 +58,7 @@ For MPI-parallel jobs one typically allocates one core per task that has to be s
In this short example, our goal is to run four instances of a program concurrently in a **single**
batch script. Of course, we could also start a batch script four times with `sbatch`, but this is
not what we want to do here. However, you can also find an example of
[how to run GPU programs simultaneously in a single job](#running-multiple-gpu-applications-simultaneously-in-a-batch-job)
below.
[how to run GPU programs simultaneously in a single job](slurm_examples_with_gpu.md#running-multiple-gpu-applications-simultaneously-in-a-batch-job)
!!! example " "
......@@ -105,147 +104,6 @@ But, do you need to request tasks or CPUs from Slurm in order to provide resourc
marie@compute$ make -j 16
```
## Requesting GPUs
Slurm will allocate one or many GPUs for your job if requested.
Please note that GPUs are only available in the GPU clusters, like
[`Alpha`](hardware_overview.md#alpha-centauri), [`Capella`](hardware_overview.md#capella)
and [`Power9`](hardware_overview.md#power9).
The option for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NODE]`,
where `NUM_PER_NODE` is the number of GPUs **per node** that will be used for the job.
!!! example "Job file to request a GPU"
```Bash
#!/bin/bash
#SBATCH --nodes=2 # request 2 nodes
#SBATCH --mincpus=1 # allocate one task per node...
#SBATCH --ntasks=2 # ...which means 2 tasks in total (see note below)
#SBATCH --cpus-per-task=6 # use 6 threads per task
#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task)
#SBATCH --time=01:00:00 # run for 1 hour
#SBATCH --account=p_number_crunch # account CPU time to project p_number_crunch
srun ./your/cuda/application # start your application (probably requires MPI to use both nodes)
```
!!! note
Due to an unresolved issue concerning the Slurm job scheduling behavior, it is currently not
practical to use `--ntasks-per-node` together with GPU jobs. If you want to use multiple nodes,
please use the parameters `--ntasks` and `--mincpus` instead. The value of `mincpus`*`nodes`
has to equal `ntasks` in this case.
### Limitations of GPU Job Allocations
The number of cores per node that are currently allowed to be allocated for GPU jobs is limited
depending on how many GPUs are being requested.
This is because we do not want GPUs to become unusable when all cores on a node are occupied by
a single job that does not, at the same time, request all GPUs.
E.g., if you specify `--gres=gpu:2`, your total number of cores per node (meaning:
`ntasks`*`cpus-per-task`) may not exceed 12 on [`Alpha`](alpha_centauri.md) or on
[`Capella`](capella.md).
Note that this also has implications for the use of the `--exclusive` parameter.
Since this sets the number of allocated cores to the maximum, you also **must** request all GPUs,
otherwise your job will not start.
In the case of `--exclusive`, it won't be denied on submission,
because this is evaluated in a later scheduling step.
Jobs that directly request too many cores per GPU will be denied with the error message:
```console
Batch job submission failed: Requested node configuration is not available
```
Similarly, it is not allowed to start CPU-only jobs on the GPU clusters.
I.e., you must request at least one GPU there, or you will get this error message:
```console
srun: error: QOSMinGRES
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
### Running Multiple GPU Applications Simultaneously in a Batch Job
Our starting point is a (serial) program that needs a single GPU and four CPU cores to perform its
task (e.g. TensorFlow). The following batch script shows how to run such a job on any of
the GPU clusters `Power9`, `Alpha` or `Capella`.
!!! example
```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --gpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1443
srun some-gpu-application
```
When `srun` is used within a submission script, it inherits parameters from `sbatch`, including
`--ntasks=1`, `--cpus-per-task=4`, etc. So we actually implicitly run the following
```bash
srun --ntasks=1 --cpus-per-task=4 [...] some-gpu-application
```
Now, our goal is to run four instances of this program concurrently in a single batch script. Of
course we could also start the above script multiple times with `sbatch`, but this is not what we
want to do here.
#### Solution
In order to run multiple programs concurrently in a single batch script/allocation we have to do
three things:
1. Allocate enough resources to accommodate multiple instances of our program. This can be achieved
with an appropriate batch script header (see below).
1. Start job steps with `srun` as background processes. This is achieved by adding an ampersand at
the end of the `srun` command.
1. Make sure that each background process gets its private resources. We need to set the resource
fraction needed for a single run in the corresponding `srun` command. The total aggregated
resources of all job steps must fit in the allocation specified in the batch script header.
Additionally, the option `--exclusive` is needed to make sure that each job step is provided with
its private set of CPU and GPU resources. The following example shows how four independent
instances of the same program can be run concurrently from a single batch script. Each instance
(task) is equipped with 4 CPUs (cores) and one GPU.
!!! example "Job file simultaneously executing four independent instances of the same program"
```Bash
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4
#SBATCH --gpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1443
srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
echo "Waiting for all job steps to complete..."
wait
echo "All jobs completed!"
```
In practice, it is possible to leave out resource options in `srun` that do not differ from the ones
inherited from the surrounding `sbatch` context. The following line would be sufficient to do the
job in this example:
```bash
srun --exclusive --gres=gpu:1 --ntasks=1 some-gpu-application &
```
Yet, it adds some extra safety to leave them in, enabling the Slurm batch system to complain if not
enough resources in total were specified in the header of the batch script.
## Exclusive Jobs for Benchmarking
Jobs on ZIH systems run, by default, in shared-mode, meaning that multiple jobs (from different
......@@ -419,3 +277,8 @@ In the following we provide two examples for scripts that submit chain jobs.
Job 3/3: jobfile_c.sh
Dependency: after job 2963709
```
## Requesting GPUs
Examples of jobs that require the use of GPUs can be found in the
[Job Examples with GPU](slurm_examples_with_gpu.md) section.
# Job Examples with GPU
General information on how to request resources via the Slurm batch system can be found in the
[Job Examples](slurm_examples.md) section.
## Requesting GPUs
Slurm will allocate one or many GPUs for your job if requested.
Please note that GPUs are only available in the GPU clusters, like
[`Alpha`](hardware_overview.md#alpha-centauri), [`Capella`](hardware_overview.md#capella)
and [`Power9`](hardware_overview.md#power9).
The option for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NODE]`,
where `NUM_PER_NODE` is the number of GPUs **per node** that will be used for the job.
!!! example "Job file to request a GPU"
```Bash
#!/bin/bash
#SBATCH --nodes=2 # request 2 nodes
#SBATCH --mincpus=1 # allocate one task per node...
#SBATCH --ntasks=2 # ...which means 2 tasks in total (see note below)
#SBATCH --cpus-per-task=6 # use 6 threads per task
#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task)
#SBATCH --time=01:00:00 # run for 1 hour
#SBATCH --account=p_number_crunch # account CPU time to project p_number_crunch
srun ./your/cuda/application # start your application (probably requires MPI to use both nodes)
```
!!! note
Due to an unresolved issue concerning the Slurm job scheduling behavior, it is currently not
practical to use `--ntasks-per-node` together with GPU jobs. If you want to use multiple nodes,
please use the parameters `--ntasks` and `--mincpus` instead. The value of `mincpus`*`nodes`
has to equal `ntasks` in this case.
### Limitations of GPU Job Allocations
The number of cores per node that are currently allowed to be allocated for GPU jobs is limited
depending on how many GPUs are being requested.
This is because we do not want GPUs to become unusable when all cores on a node are occupied by
a single job that does not, at the same time, request all GPUs.
E.g., if you specify `--gres=gpu:2`, your total number of cores per node (meaning:
`ntasks`*`cpus-per-task`) may not exceed 12 on [`Alpha`](alpha_centauri.md) or on
[`Capella`](capella.md).
Note that this also has implications for the use of the `--exclusive` parameter.
Since this sets the number of allocated cores to the maximum, you also **must** request all GPUs,
otherwise your job will not start.
In the case of `--exclusive`, it won't be denied on submission,
because this is evaluated in a later scheduling step.
Jobs that directly request too many cores per GPU will be denied with the error message:
```console
Batch job submission failed: Requested node configuration is not available
```
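For illustration, a request that stays within this limit could look like the following sketch (two GPUs with at most 12 cores in total on one node; the project name is a placeholder):

```Bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:2            # two GPUs on the node...
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=6       # ...and 2*6 = 12 cores, i.e. within the limit
#SBATCH --time=01:00:00
#SBATCH --account=p_number_crunch

srun ./your/gpu/application
```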
Similarly, it is not allowed to start CPU-only jobs on the GPU clusters.
I.e., you must request at least one GPU there, or you will get this error message:
```console
srun: error: QOSMinGRES
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
### Running Multiple GPU Applications Simultaneously in a Batch Job
Our starting point is a (serial) program that needs a single GPU and four CPU cores to perform its
task (e.g. TensorFlow). The following batch script shows how to run such a job on any of
the GPU clusters `Power9`, `Alpha` or `Capella`.
!!! example
```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --gpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1443
srun some-gpu-application
```
When `srun` is used within a submission script, it inherits parameters from `sbatch`, including
`--ntasks=1`, `--cpus-per-task=4`, etc. So we actually implicitly run the following
```bash
srun --ntasks=1 --cpus-per-task=4 [...] some-gpu-application
```
Now, our goal is to run four instances of this program concurrently in a single batch script. Of
course we could also start the above script multiple times with `sbatch`, but this is not what we
want to do here.
#### Solution
In order to run multiple programs concurrently in a single batch script/allocation we have to do
three things:
1. Allocate enough resources to accommodate multiple instances of our program. This can be achieved
with an appropriate batch script header (see below).
1. Start job steps with `srun` as background processes. This is achieved by adding an ampersand at
the end of the `srun` command.
1. Make sure that each background process gets its private resources. We need to set the resource
fraction needed for a single run in the corresponding `srun` command. The total aggregated
resources of all job steps must fit in the allocation specified in the batch script header.
Additionally, the option `--exclusive` is needed to make sure that each job step is provided with
its private set of CPU and GPU resources. The following example shows how four independent
instances of the same program can be run concurrently from a single batch script. Each instance
(task) is equipped with 4 CPUs (cores) and one GPU.
!!! example "Job file simultaneously executing four independent instances of the same program"
```Bash
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4
#SBATCH --gpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1443
srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
echo "Waiting for all job steps to complete..."
wait
echo "All jobs completed!"
```
In practice, it is possible to leave out resource options in `srun` that do not differ from the ones
inherited from the surrounding `sbatch` context. The following line would be sufficient to do the
job in this example:
```bash
srun --exclusive --gres=gpu:1 --ntasks=1 some-gpu-application &
```
Yet, it adds some extra safety to leave them in, enabling the Slurm batch system to complain if not
enough resources in total were specified in the header of the batch script.
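To submit such a job file and check on its job steps afterwards, something like the following can be used (the file name is hypothetical):

```console
marie@login$ sbatch four_gpu_instances.sh
marie@login$ squeue --me                                             # show the state of your jobs
marie@login$ sacct -j <jobid> --format=JobID,JobName,State,Elapsed   # inspect the individual job steps
```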
......@@ -110,6 +110,7 @@ nav:
- Running Jobs:
- Batch System Slurm: jobs_and_resources/slurm.md
- Job Examples: jobs_and_resources/slurm_examples.md
- Job Examples with GPU: jobs_and_resources/slurm_examples_with_gpu.md
- Slurm Resource Limits: jobs_and_resources/slurm_limits.md
- Slurm Job File Generator: jobs_and_resources/slurm_generator.md
- Checkpoint/Restart: jobs_and_resources/checkpoint_restart.md
......
......@@ -63,6 +63,7 @@ DataFrames
Dataheap
Datamover
DataParallel
Dataport
dataset
Dataset
datasets
......