Commit a9ef5d63 authored by Martin Schroschk

Review: Line breaks; wording w.r.t. partitions, fix links
# Machine Learning
This is an introduction to running machine learning applications on ZIH systems. For machine
learning purposes, we recommend using the partitions [Alpha](#alpha-partition) and/or
[ML](#ml-partition).
## ML Partition
The compute nodes of the partition ML are built on the
[Power9 architecture](https://www.ibm.com/it-infrastructure/power/power9) from IBM. The system was
created for AI challenges, analytics, and working with data-intensive workloads and accelerated
databases.

The main feature of the nodes is the ability to work with the
[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link**
support, which allows a total bandwidth of up to 300 GB/s. Each node on the partition ML has 6x
Tesla V100 GPUs. You can find a detailed specification of the partition in our
[Power9 documentation](../jobs_and_resources/power9.md).
!!! note

    The partition ML is based on the Power9 architecture, which means that software built for
    x86_64 will not work on this partition. Also, users need to use the modules specifically built
    for this architecture (from `modenv/ml`).
### Modules
On the partition ML load the module environment:
```console
marie@ml$ module load modenv/ml

The following have been reloaded with a version change:  1) modenv/scs5 => modenv/ml
```
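With the environment loaded, work is submitted through Slurm. The following is a minimal sketch of
a batch script for the partition ML; the GPU count, time limit, and the training script `train.py`
are placeholder assumptions, not prescribed values:

```shell
# Write a hypothetical Slurm batch script for partition ml (all values are placeholders)
cat > train_ml.sbatch <<'EOF'
#!/bin/bash
#SBATCH --partition=ml         # Power9 nodes with 6x V100 GPUs each
#SBATCH --nodes=1
#SBATCH --gres=gpu:1           # request one V100 GPU
#SBATCH --time=01:00:00
module load modenv/ml          # module environment for the Power9 architecture
srun python train.py           # placeholder training script
EOF
```

The script would then be submitted with `sbatch train_ml.sbatch`.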
### Power AI
There are tools provided by IBM that work on partition ML and are related to AI tasks. For more
information, see our [Power AI documentation](power_ai.md).
## Alpha Partition
Another partition for machine learning tasks is Alpha. It is mainly dedicated to
[ScaDS.AI](https://scads.ai/) topics. Each node on Alpha has 2x AMD EPYC CPUs, 8x NVIDIA A100-SXM4
GPUs, 1 TB RAM and 3.5 TB local space (`/tmp`) on an NVMe device. You can find more details of the
partition in our [Alpha Centauri](../jobs_and_resources/alpha_centauri.md) documentation.
### Modules
On the partition **Alpha** load the module environment:
```console
marie@alpha$ module load modenv/hiera

The following have been reloaded with a version change:  1) modenv/ml => modenv/hiera
```
!!! note

    On partition Alpha, the most recent modules are built in `hiera`. Alternative modules might be
    built in `scs5`.
## Machine Learning via Console
R also supports machine learning via console. It does not require a virtual environment due to its
different package management.
For more details on machine learning or data science with R, see
[data analytics with R](data_analytics_with_r.md#r-console).
## Machine Learning with Jupyter
The [Jupyter Notebook](https://jupyter.org/) is an open-source web application that allows you to
create documents containing live code, equations, visualizations, and narrative text.
[JupyterHub](../access/jupyterhub.md) allows you to work with machine learning frameworks (e.g.
TensorFlow or PyTorch) on ZIH systems and to run your Jupyter notebooks on HPC nodes.
After accessing JupyterHub, you can start a new session and configure it. For machine learning
purposes, select either partition **Alpha** or **ML** and the resources your application requires.
In your session you can use [Python](data_analytics_with_python.md#jupyter-notebooks),
[R](data_analytics_with_r.md#r-in-jupyterhub) or [RStudio](data_analytics_with_rstudio.md) for your
machine learning and data science topics.
## Machine Learning with Containers
Some machine learning tasks require using containers. In the HPC domain, the
[Singularity](https://singularity.hpcng.org/) container system is a widely used tool. Docker
containers can also be used by Singularity. You can find further information on working with
containers on ZIH systems in our [containers documentation](containers.md).
There are two sources of containers with TensorFlow and PyTorch for the Power9 architecture:
* [TensorFlow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le):
  Community-supported `ppc64le` docker container for TensorFlow.
* Official Docker container with TensorFlow, PyTorch and many other packages.
!!! note

    You can find other software versions of a container on the "Tags" tab of its Docker Hub page.
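Pulling one of these Docker images into a Singularity image could look like the following sketch;
the image tag `latest` and the file names are assumptions, and the commands should run on a node of
the matching (ppc64le) architecture:

```shell
# Sketch: convert the community TensorFlow ppc64le Docker image into a Singularity image.
cat > build_container.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# Pull the Docker image from Docker Hub and convert it to a SIF file
singularity pull tensorflow-ppc64le.sif docker://ibmcom/tensorflow-ppc64le:latest
# Quick smoke test: import TensorFlow inside the container
singularity exec tensorflow-ppc64le.sif python3 -c 'import tensorflow as tf; print(tf.__version__)'
EOF
chmod +x build_container.sh
```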
The following NVIDIA libraries are available on all nodes:

| Library | Path |
|---------|------|
| cuDNN | `/usr/local/cuda/targets/ppc64le-linux` |
!!! note

    For optimal NCCL performance, it is recommended to set the `NCCL_MIN_NRINGS` environment
    variable during execution. You can try different values, but 4 should be a pretty good
    starting point.
    ```console
    marie@compute$ export NCCL_MIN_NRINGS=4
    ```
### HPC-Related Software
The following HPC-related software is installed on all nodes:
There are many different datasets designed for research purposes. If you would like to download
some of them, keep in mind that many machine learning libraries have direct access to public
datasets without downloading them, e.g. [TensorFlow Datasets](https://www.tensorflow.org/datasets).
If you still need to download some datasets, use the [datamover](../data_transfer/datamover.md)
machine.
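The datamover is driven by `dt*` wrapper commands (e.g. `dtcp` for copying, as described in the
datamover documentation). A sketch of staging a downloaded archive into a workspace; all paths and
the workspace name are placeholder assumptions:

```shell
# Sketch: stage a downloaded dataset via the datamover (paths are placeholders)
cat > stage_dataset.sh <<'EOF'
#!/bin/bash
# Copy the archive from the home filesystem to a hypothetical scratch workspace
dtcp ~/downloads/my-dataset.tar.gz /scratch/ws/0/marie-myws/
EOF
chmod +x stage_dataset.sh
```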
### The ImageNet Dataset
The ImageNet project is a large visual database designed for use in visual object recognition
software research. In order to save filesystem space by avoiding multiple duplicates, we have put a
copy of the ImageNet database (ILSVRC2012 and ILSVR2017) under `/scratch/imagenet`, which you can
use without having to download it again. In the future, the ImageNet dataset will be available in
the [Warm Archive](../data_lifecycle/workspaces.md#mid-term-storage). ILSVR2017 also includes a
dataset for object recognition from video. Please respect the corresponding
[Terms of Use](https://image-net.org/download.php).