Commit d725159e authored by Taras Lazariv

Merge branch 'fix-checks' into 'preview'

Fix checks

See merge request !342
parents a0bd3124 e413ef32
Part of 5 merge requests: !392 Merge preview into contrib guide for browser users, !356 Merge preview in main, !355 Merge preview in main, !342 Fix checks, !333 Draft: update NGC containers
Showing changes with 47 additions and 48 deletions
......@@ -61,8 +61,8 @@ Check the status of the job with `squeue -u \<username>`.
## Mount BeeGFS Filesystem
You can mount BeeGFS filesystem on the ML partition (PowerPC architecture) or on the Haswell
[partition](../jobs_and_resources/partitions_and_limits.md) (x86_64 architecture)
You can mount BeeGFS filesystem on the partition ml (PowerPC architecture) or on the
partition haswell (x86_64 architecture), more information about [partitions](../jobs_and_resources/partitions_and_limits.md).
### Mount BeeGFS Filesystem on the Partition `ml`
......
......@@ -131,7 +131,7 @@ c.NotebookApp.allow_remote_access = True
```console
#!/bin/bash -l
#SBATCH --gres=gpu:1 # request GPU
#SBATCH --partition=gpu2 # use GPU partition
#SBATCH --partition=gpu2 # use partition GPU 2
#SBATCH --output=notebook_output.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
......
......@@ -11,5 +11,5 @@ Most of the UNICORE features are also available using its REST API.
* [https://sourceforge.net/p/unicore/wiki/REST_API/](https://sourceforge.net/p/unicore/wiki/REST_API/)
* Some useful examples of job submission via REST are available at:
* [https://sourceforge.net/p/unicore/wiki/REST_API_Examples/](https://sourceforge.net/p/unicore/wiki/REST_API_Examples/)
* The base address for the Taurus system at the ZIH is:
* *unicore.zih.tu-dresden.de:8080/TAURUS/rest/core*
* The base address for the system at the ZIH is:
* `unicore.zih.tu-dresden.de:8080/TAURUS/rest/core`
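For illustration only (the prompt, the placeholders, and the authentication method are assumptions, not an official recipe): the base address above can be queried with any HTTP client, e.g. `curl`:

```console
marie@local$ curl -k -H "Accept: application/json" -u <username>:<password> \
      https://unicore.zih.tu-dresden.de:8080/TAURUS/rest/core
```

The core endpoint typically answers with a JSON document linking to the available resources (jobs, storages, factories), as described in the UNICORE REST API documentation linked above.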
......@@ -19,7 +19,7 @@ It has 34 nodes, each with:
### Modules
The easiest way is using the [module system](../software/modules.md).
The software for the `alpha` partition is available in `modenv/hiera` module environment.
The software for the partition alpha is available in the `modenv/hiera` module environment.
To check the available modules for `modenv/hiera`, use the command
......
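A hedged sketch of checking the hierarchical environment (the exact command is in the elided part of this page; the listed modules depend on the currently loaded compiler/MPI level):

```console
marie@login$ module load modenv/hiera     # switch to the hierarchical module environment
marie@login$ module available             # list the modules visible at the current level of the hierarchy
```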
......@@ -56,8 +56,8 @@ checkpoint/restart bits transparently to your batch script. You just have to spe
total runtime of your calculation and the interval in which you wish to do checkpoints. The latter
(plus the time it takes to write the checkpoint) will then be the runtime of the individual jobs.
This should be targeted at below 24 hours in order to be able to run on all
[haswell64 partitions](../jobs_and_resources/partitions_and_limits.md#runtime-limits). For increased
fault-tolerance, it can be chosen even shorter.
[partitions haswell64](../jobs_and_resources/partitions_and_limits.md#runtime-limits). For
increased fault-tolerance, it can be chosen even shorter.
To use it, first add a `dmtcp_launch` before your application call in your batch script. In the case
of MPI applications, you have to add the parameters `--ib --rm` and put it between `srun` and your
......
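A minimal, hedged sketch of such a batch script (partition, runtime, and application names are assumptions; the full checkpoint/restart page documents the exact interface for setting total runtime and checkpoint interval):

```bash
#!/bin/bash
#SBATCH --time=08:00:00        # runtime of one chunk, kept below 24 h (value assumed)
#SBATCH --ntasks=1

# serial/threaded application: prefix the call with dmtcp_launch
dmtcp_launch ./my_application

# MPI application (sketch): dmtcp_launch with --ib --rm goes between srun and the program
# srun dmtcp_launch --ib --rm ./my_mpi_application
```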
# ZIH Systems
ZIH systems comprises the *High Performance Computing and Storage Complex* (HRSK-II) and its
extension *High Performance Computing – Data Analytics* (HPC-DA). In totoal it offers scientists
about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating point operations
per second. The architecture specifically tailored to data-intensive computing, Big Data analytics,
and artificial intelligence methods with extensive capabilities for energy measurement and
performance monitoring provides ideal conditions to achieve the ambitious research goals of the
ZIH systems comprises the *High Performance Computing and Storage Complex* and its
extension *High Performance Computing – Data Analytics*. In total it offers scientists
about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating point
operations per second. The architecture specifically tailored to data-intensive computing, Big Data
analytics, and artificial intelligence methods with extensive capabilities for energy measurement
and performance monitoring provides ideal conditions to achieve the ambitious research goals of the
users and the ZIH.
## Login Nodes
......
......@@ -62,9 +62,9 @@ Normal compute nodes are perfect for this task.
**OpenMP jobs:** SMP-parallel applications can only run **within a node**, so it is necessary to
include the [batch system](slurm.md) options `-N 1` and `-n 1`. Using `--cpus-per-task N` Slurm will
start one task and you will have `N` CPUs. The maximum number of processors for an SMP-parallel
program is 896 on [partition `julia`](partitions_and_limits.md).
program is 896 on partition `julia`, see [partitions](partitions_and_limits.md).
**GPUs** partitions are best suited for **repetitive** and **highly-parallel** computing tasks. If
Partitions with GPUs are best suited for **repetitive** and **highly-parallel** computing tasks. If
you have a task with potential [data parallelism](../software/gpu_programming.md), you most likely need the GPUs. Beyond video rendering, GPUs excel in tasks such as machine learning, financial
simulations and risk modeling. Use the partitions `gpu2` and `ml` only if you need GPUs! Otherwise
......
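A hedged sketch of an SMP/OpenMP batch script following the `-N 1`/`-n 1`/`--cpus-per-task` rules above (partition, core count, runtime, and program name are assumptions):

```bash
#!/bin/bash
#SBATCH --nodes=1                # SMP job: exactly one node
#SBATCH --ntasks=1               # exactly one task ...
#SBATCH --cpus-per-task=8        # ... with 8 CPUs for the OpenMP threads
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_openmp_program
```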
# Large Shared-Memory Node - HPE Superdome Flex
- Hostname: `taurussmp8`
- Access to all shared file systems
- Access to all shared filesystems
- Slurm partition `julia`
- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores)
- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
......
......@@ -238,13 +238,13 @@ resources.
Setting `--exclusive` **only** makes sure that there will be **no other jobs running on your nodes**.
It does not, however, mean that you automatically get access to all the resources which the node
might provide without explicitly requesting them, e.g. you still have to request a GPU via the
generic resources parameter (`gres`) to run on the GPU partitions, or you still have to request all
cores of a node if you need them. CPU cores can either to be used for a task (`--ntasks`) or for
multi-threading within the same task (`--cpus-per-task`). Since those two options are semantically
different (e.g., the former will influence how many MPI processes will be spawned by `srun` whereas
the latter does not), Slurm cannot determine automatically which of the two you might want to use.
Since we use cgroups for separation of jobs, your job is not allowed to use more resources than
requested.*
generic resources parameter (`gres`) to run on the partitions with GPU, or you still have to
request all cores of a node if you need them. CPU cores can either be used for a task
(`--ntasks`) or for multi-threading within the same task (`--cpus-per-task`). Since those two
options are semantically different (e.g., the former will influence how many MPI processes will be
spawned by `srun` whereas the latter does not), Slurm cannot determine automatically which of the
two you might want to use. Since we use cgroups for separation of jobs, your job is not allowed to
use more resources than requested.*
If you just want to use all available cores in a node, you have to specify how Slurm should organize
them, like with `-p haswell -c 24` or `-p haswell --ntasks-per-node=24`.
......
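A hedged illustration of the two layouts mentioned above (`my_threaded_app` and `my_mpi_app` are placeholders):

```console
marie@login$ # one task that multi-threads across all 24 cores of a haswell node
marie@login$ srun -p haswell --nodes=1 --ntasks=1 --cpus-per-task=24 ./my_threaded_app

marie@login$ # 24 single-core MPI tasks on one haswell node
marie@login$ srun -p haswell --nodes=1 --ntasks-per-node=24 ./my_mpi_app
```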
# Job Profiling
Slurm offers the option to gather profiling data from every task/node of the job. Analyzing this
data allows for a better understanding of your jobs in terms of elapsed time, runtime and IO
data allows for a better understanding of your jobs in terms of elapsed time, runtime and I/O
behavior, and many more.
The following data can be gathered:
......
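A hedged sketch with stock Slurm tooling (whether and how profiling is enabled on the system is described in the elided part of this page; the application name and output file are placeholders):

```console
marie@login$ srun --profile=task ./my_application     # gather per-task profiling data
marie@login$ sh5util -j <jobid> -o profile.h5         # merge the per-node HDF5 profiles after the job
```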
......@@ -6,8 +6,8 @@
[Apache Spark](https://spark.apache.org/), [Apache Flink](https://flink.apache.org/)
and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating
Big Data. These frameworks are also offered as software [modules](modules.md) on both `ml` and
`scs5` partition. You can check module versions and availability with the command
Big Data. These frameworks are also offered as software [modules](modules.md) in both `ml` and
`scs5` software environments. You can check module versions and availability with the command
```console
marie@login$ module avail Spark
......@@ -46,20 +46,20 @@ as via [Jupyter notebook](#jupyter-notebook). All three ways are outlined in the
### Default Configuration
The Spark module is available for both `scs5` and `ml` partitions.
The Spark module is available in both `scs5` and `ml` environments.
Thus, Spark can be executed using different CPU architectures, e.g., Haswell and Power9.
Let us assume that two nodes should be used for the computation. Use a
`srun` command similar to the following to start an interactive session
using the Haswell partition. The following code snippet shows a job submission
to Haswell nodes with an allocation of two nodes with 60 GB main memory
using the partition haswell. The following code snippet shows a job submission
to haswell nodes with an allocation of two nodes with 60 GB main memory
exclusively for one hour:
```console
marie@login$ srun --partition=haswell -N 2 --mem=60g --exclusive --time=01:00:00 --pty bash -l
```
The command for different resource allocation on the `ml` partition is
The command for different resource allocation on the partition `ml` is
similar, e. g. for a job submission to `ml` nodes with an allocation of one
node, one task per node, two CPUs per task, one GPU per node, with 10000 MB for one hour:
......
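The concrete command is elided here; a sketch reconstructed from the parameters listed above could look like this (not necessarily the exact line from the documentation):

```console
marie@login$ srun --partition=ml --nodes=1 --ntasks-per-node=1 --cpus-per-task=2 --gres=gpu:1 --mem=10000 --time=01:00:00 --pty bash -l
```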
......@@ -65,9 +65,9 @@ parameters: `--ntasks-per-node` -parameter to the number of GPUs you use
per node. Also, it could be useful to increase the `memory/cpu` parameters
if you run larger models. Memory can be set up to:
`--mem=250000` and `--cpus-per-task=7` for the `ml` partition.
`--mem=250000` and `--cpus-per-task=7` for the partition `ml`.
`--mem=60000` and `--cpus-per-task=6` for the `gpu2` partition.
`--mem=60000` and `--cpus-per-task=6` for the partition `gpu2`.
Keep in mind that only one memory parameter (`--mem-per-cpu=<MB>` or `--mem=<MB>`) can be
specified
......
# Machine Learning
This is an introduction to running machine learning applications on ZIH systems.
For machine learning purposes, we recommend to use the partitions [Alpha](#alpha-partition) and/or
[ML](#ml-partition).
For machine learning purposes, we recommend using the partitions `alpha` and/or `ml`.
## ML Partition
## Partition `ml`
The compute nodes of the partition ML are based on the
[Power9 architecture](https://www.ibm.com/it-infrastructure/power/power9) from IBM. The system was created
......@@ -36,7 +35,7 @@ The following have been reloaded with a version change: 1) modenv/scs5 => moden
There are tools provided by IBM that work on partition ML and are related to AI tasks.
For more information see our [Power AI documentation](power_ai.md).
## Alpha Partition
## Partition: Alpha
Another partition for machine learning tasks is Alpha. It is mainly dedicated to
[ScaDS.AI](https://scads.ai/) topics. Each node on Alpha has 2x AMD EPYC CPUs, 8x NVIDIA A100-SXM4
......@@ -45,7 +44,7 @@ partition in our [Alpha Centauri](../jobs_and_resources/alpha_centauri.md) docum
### Modules
On the partition **Alpha** load the module environment:
On the partition alpha load the module environment:
```console
marie@alpha$ module load modenv/hiera
......
......@@ -5,7 +5,7 @@ the PowerAI Framework for Machine Learning. In the following the links
are valid for PowerAI version 1.5.4.
!!! warning
The information provided here is available from IBM and can be used on `ml` partition only!
The information provided here is available from IBM and can be used on partition ml only!
## General Overview
......@@ -47,7 +47,7 @@ are valid for PowerAI version 1.5.4.
(Open Neural Network Exchange) provides support for moving models
between those frameworks.
- [Distributed Deep Learning](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_ddl.html?view=kc)
Distributed Deep Learning (DDL). Works on up to 4 nodes on `ml` partition.
Distributed Deep Learning (DDL). Works on up to 4 nodes on partition `ml`.
## PowerAI Container
......
......@@ -15,14 +15,14 @@ marie@login$ module spider pytorch
to find out which PyTorch modules are available on your partition.
We recommend using **Alpha** and/or **ML** partitions when working with machine learning workflows
We recommend using partitions alpha and/or ml when working with machine learning workflows
and the PyTorch library.
You can find detailed hardware specification in our
[hardware documentation](../jobs_and_resources/hardware_overview.md).
## PyTorch Console
On the **Alpha** partition, load the module environment:
On the partition `alpha`, load the module environment:
```console
marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission on alpha nodes with 1 gpu on 1 node with 800 Mb per CPU
......@@ -33,8 +33,8 @@ Die folgenden Module wurden in einer anderen Version erneut geladen:
Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies loaded.
```
??? hint "Torchvision on alpha partition"
On the **Alpha** partition, the module torchvision is not yet available within the module
??? hint "Torchvision on partition `alpha`"
On the partition `alpha`, the module torchvision is not yet available within the module
system. (19.08.2021)
Torchvision can be made available by using a virtual environment:
......@@ -47,7 +47,7 @@ Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies
Using the **--no-deps** option for "pip install" is necessary here as otherwise the PyTorch
version might be replaced and you will run into trouble with the cuda drivers.
On the **ML** partition:
On the partition `ml`:
```console
marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission in ml nodes with 1 gpu on 1 node with 800 Mb per CPU
......
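A hedged sketch of the virtual-environment approach from the torchvision hint above (the path and the `--system-site-packages` flag are assumptions):

```console
marie@alpha$ python -m venv --system-site-packages /scratch/marie/torchvision_env   # location is an assumption
marie@alpha$ source /scratch/marie/torchvision_env/bin/activate
marie@alpha$ pip install --no-deps torchvision    # --no-deps keeps the module-provided PyTorch untouched
```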
......@@ -45,10 +45,10 @@ times till it succeeds.
bash-4.2$ cat /tmp/marie_2759627/activate
#!/bin/bash
if ! grep -q -- "Key for the VM on the ml partition" "/home/rotscher/.ssh/authorized_keys" >& /dev/null; then
if ! grep -q -- "Key for the VM on the partition ml" "/home/rotscher/.ssh/authorized_keys" >& /dev/null; then
cat "/tmp/marie_2759627/kvm.pub" >> "/home/marie/.ssh/authorized_keys"
else
sed -i "s|.*Key for the VM on the ml partition.*|ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3siZfQ6vQ6PtXPG0RPZwtJXYYFY73TwGYgM6mhKoWHvg+ZzclbBWVU0OoU42B3Ddofld7TFE8sqkHM6M+9jh8u+pYH4rPZte0irw5/27yM73M93q1FyQLQ8Rbi2hurYl5gihCEqomda7NQVQUjdUNVc6fDAvF72giaoOxNYfvqAkw8lFyStpqTHSpcOIL7pm6f76Jx+DJg98sXAXkuf9QK8MurezYVj1qFMho570tY+83ukA04qQSMEY5QeZ+MJDhF0gh8NXjX/6+YQrdh8TklPgOCmcIOI8lwnPTUUieK109ndLsUFB5H0vKL27dA2LZ3ZK+XRCENdUbpdoG2Czz Key for the VM on the ml partition|" "/home/marie/.ssh/authorized_keys"
sed -i "s|.*Key for the VM on the partition ml.*|ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3siZfQ6vQ6PtXPG0RPZwtJXYYFY73TwGYgM6mhKoWHvg+ZzclbBWVU0OoU42B3Ddofld7TFE8sqkHM6M+9jh8u+pYH4rPZte0irw5/27yM73M93q1FyQLQ8Rbi2hurYl5gihCEqomda7NQVQUjdUNVc6fDAvF72giaoOxNYfvqAkw8lFyStpqTHSpcOIL7pm6f76Jx+DJg98sXAXkuf9QK8MurezYVj1qFMho570tY+83ukA04qQSMEY5QeZ+MJDhF0gh8NXjX/6+YQrdh8TklPgOCmcIOI8lwnPTUUieK109ndLsUFB5H0vKL27dA2LZ3ZK+XRCENdUbpdoG2Czz Key for the VM on the partition ml|" "/home/marie/.ssh/authorized_keys"
fi
ssh -i /tmp/marie_2759627/kvm root@192.168.0.6
......
......@@ -15,7 +15,7 @@ basedir=`dirname "$basedir"`
ruleset="i \<io\> \.io
s \<SLURM\>
i file \+system HDFS
i \<taurus\> taurus\.hrsk /taurus
i \<taurus\> taurus\.hrsk /taurus /TAURUS
i \<hrskii\>
i hpc[ -]\+da\>
i \(alpha\|ml\|haswell\|romeo\|gpu\|smp\|julia\|hpdlf\|scs5\)-\?\(interactive\)\?[^a-z]*partition
......