refactored ML part

still TODO: check examples, outsourcing virtual environments

refactored ML part
still TODO: check examples, outsourcing virtual environments
9417e68a · Elias Werner · 36f0aa87 · 9417e68a
Commit 9417e68a authored 3 years ago by Elias Werner
--- a/doc.zih.tu-dresden.de/docs/software/machine_learning.md
+++ b/doc.zih.tu-dresden.de/docs/software/machine_learning.md
 # Machine Learning
-On the machine learning nodes, you can use the tools from [IBM Power
+This is an introduction of how to run machine learning applications on ZIH systems.
-AI](power_ai.md).
+For machine learning purposes, we recommend to use the **Alpha** and/or **ML** partitions.
-## Get started with HPC-DA
+## ML partition
-HPC-DA (High-Performance Computing and Data Analytics) is a part of TU-Dresden general purpose HPC
+The compute nodes of the ML partition are built on the base of [Power9](https://www.ibm.com/it-infrastructure/power/power9)
-cluster (Taurus). HPC-DA is the best **option** for **Machine learning, Deep learning** applications
+architecture from IBM. The system was created for AI challenges, analytics and working with, 
-and tasks connected with the big data.
+Machine learning, data-intensive workloads, deep-learning frameworks and accelerated databases. 
-**This is an introduction of how to run machine learning applications on the HPC-DA system.**
+The main feature of the nodes is the ability to work with the
+[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link**
-The main **aim** of this guide is to help users who have started working with Taurus and focused on
+support that allows a total bandwidth with up to 300 gigabytes per second (GB/sec). Each node on the
-working with Machine learning frameworks such as TensorFlow or Pytorch.
+ml partition has 6x Tesla V-100 GPUs. You can find a detailed specification of the partition [here](../jobs_and_resources/power9.md).
-**Prerequisites:** To work with HPC-DA, you need [Login](../access/login.md) for the Taurus system
-and preferably have basic knowledge about High-Performance computers and Python.
-**Disclaimer:** This guide provides the main steps on the way of using Taurus, for details please
-follow links in the text.
-You can also find the information you need on the
-[HPC-Introduction] **todo** %ATTACHURL%/HPC-Introduction.pdf?t=1585216700 and
-[HPC-DA-Introduction] *todo** %ATTACHURL%/HPC-DA-Introduction.pdf?t=1585162693 presentation slides.
-## Why should I use HPC-DA? The architecture and feature of the HPC-DA
-HPC-DA built on the base of [Power9](https://www.ibm.com/it-infrastructure/power/power9)
+**Note:** The ML partition is based on the PowerPC Architecture, which means that the software built
-architecture from IBM. HPC-DA created from
+for x86_64 will not work on this partition. Also, users need to use the modules which are
-[AC922 IBM servers](https://www.ibm.com/ie-en/marketplace/power-systems-ac922), which was created
+specially made for the ml partition (from modenv/ml).
-for AI challenges, analytics and working with, Machine learning, data-intensive workloads,
-deep-learning frameworks and accelerated databases. POWER9 is the processor with state-of-the-art
-I/O subsystem technology, including next-generation NVIDIA NVLink, PCIe Gen4 and OpenCAPI.
-[Here](../jobs_and_resources/power9.md) you could find a detailed specification of the TU Dresden
-HPC-DA system.
-The main feature of the Power9 architecture (ppc64le) is the ability to work the
+### Modules
-[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link**
-support. NV-Link technology allows increasing a total bandwidth of 300 gigabytes per second (GB/sec)
- 10X the bandwidth of PCIe Gen 3. The bandwidth is a crucial factor for deep learning and machine
+On the **ML** partition load the module environment:
-    learning applications.
-**Note:** The Power9 architecture not so common as an x86 architecture. This means you are not so
+```console
-flexible with choosing applications for your projects. Even so, the main tools and applications are
+marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash    #Job submission in ml nodes with 1 gpu on 1 node with 8000 Mb per CPU
-available. See available modules here.
+marie@ml$ module load modenv/ml    #example output: The following have been reloaded with a version change:  1) modenv/scs5 => modenv/ml
+```
-**Please use the ml partition if you need GPUs!** Otherwise using the x86 partitions (e.g Haswell)
+## Alpha partition
-most likely would be more beneficial.
-## Start your application
+- describe alpha partition
-As stated before HPC-DA was created for deep learning, machine learning applications. Machine
+### Modules
-learning frameworks as TensorFlow and PyTorch are industry standards now.
-There are three main options on how to work with Tensorflow and PyTorch:
+On the **Alpha** partition load the module environment:
-1. **Modules**
+```console
-1. **JupyterNotebook**
+marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash   #Job submission on alpha nodes with 1 gpu on 1 node with 8000 Mb per CPU
-1. **Containers**
+marie@romeo$ module load modenv/scs5
+```
-### Modules
-The easiest way is using the [modules system](modules.md) and Python virtual environment. Modules
+## Machine Learning Console and Virtual Environment
-are a way to use frameworks, compilers, loader, libraries, and utilities. The module is a user
-interface that provides utilities for the dynamic modification of a user's environment without
-manual modifications. You could use them for srun , bath jobs (sbatch) and the Jupyterhub.
 A virtual environment is a cooperatively isolated runtime environment that allows Python users and
 applications to install and update Python distribution packages without interfering with the
 behaviour of other Python applications running on the same system. At its core, the main purpose of
-Python virtual environments is to create an isolated environment for Python projects.
+Python virtual environments is to create an isolated environment for Python projects. 
+### Conda virtual environment
-**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. We recommend
-using venv to work with Tensorflow and Pytorch on Taurus. It has been integrated into the standard
-library under the [venv module](https://docs.python.org/3/library/venv.html). However, if you have
-reasons (previously created environments etc) you could easily use conda. The conda is the second
-way to use a virtual environment on the Taurus.
 [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
 is an open-source package management system and environment management system from the Anaconda.
-As was written in the previous chapter, to start the application (using
+```console
-modules) and to run the job exist two main options:
+marie@login$ srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash   #job submission in ml nodes with allocating: 1 node, 1 task per node, 2 CPUs per task, 1 gpu per node, with 8000 mb on 1 hour.
+marie@ml$ module load modenv/ml                    #example output: The following have been reloaded with a version change:  1) modenv/scs5 =&gt; modenv/ml
+marie@ml$ mkdir python-virtual-environments        #create folder for your environments
+marie@ml$ cd python-virtual-environments           #go to folder
+marie@ml$ which python                             #check which python are you using
+marie@ml$ python3 -m venv --system-site-packages env                         #create virtual environment "env" which inheriting with global site packages
+marie@ml$ source env/bin/activate                                            #activate virtual environment "env". Example output: (env) bash-4.2$
+```
- The `srun` command:**
+The inscription (env) at the beginning of each line represents that now you are in the virtual
+environment.
-```Bash
+### Python virtual environment
-srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash   #job submission in ml nodes with allocating: 1 node, 1 task per node, 2 CPUs per task, 1 gpu per node, with 8000 mb on 1 hour.
-module load modenv/ml                    #example output: The following have been reloaded with a version change:  1) modenv/scs5 =&gt; modenv/ml
+**Virtualenv (venv)** is a standard Python tool to create isolated Python environments. 
+It has been integrated into the standard library under the [venv module](https://docs.python.org/3/library/venv.html).
-mkdir python-virtual-environments        #create folder for your environments
+```console
-cd python-virtual-environments           #go to folder
+marie@login$ srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash   #job submission in ml nodes with allocating: 1 node, 1 task per node, 2 CPUs per task, 1 gpu per node, with 8000 mb on 1 hour.
-module load TensorFlow                   #load TensorFlow module to use python. Example output: Module Module TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4 and 31 dependencies loaded.
+marie@ml$ module load modenv/ml                    #example output: The following have been reloaded with a version change:  1) modenv/scs5 =&gt; modenv/ml
-which python                             #check which python are you using
+marie@ml$ mkdir python-virtual-environments        #create folder for your environments
-python3 -m venv --system-site-packages env                         #create virtual environment "env" which inheriting with global site packages
+marie@ml$ cd python-virtual-environments           #go to folder
-source env/bin/activate                                            #activate virtual environment "env". Example output: (env) bash-4.2$
+marie@ml$ which python                             #check which python are you using
+marie@ml$ python3 -m venv --system-site-packages env                         #create virtual environment "env" which inheriting with global site packages
+marie@ml$ source env/bin/activate                                            #activate virtual environment "env". Example output: (env) bash-4.2$
 ```
 The inscription (env) at the beginning of each line represents that now you are in the virtual
 environment.
-Now you can check the working capacity of the current environment.
+Note: However in case of using [sbatch files](link) to send your job you usually don't need a 
+virtual environment.
-```Bash
+## Machine Learning with Jupyter
-python                                                           # start python
-import tensorflow as tf
-print(tf.__version__)                                            # example output: 2.1.0
-```
-The second and main option is using batch jobs (`sbatch`). It is used to submit a job script for
+The [Jupyter Notebook](https://jupyter.org/) is an open-source web application that allows you to 
-later execution. Consequently, it is **recommended to launch your jobs into the background by using
+create documents containing live code, equations, visualizations, and narrative text. [JupyterHub](../access/jupyterhub.md) 
-batch jobs**. To launch your machine learning application as well to srun job you need to use
+allows to work with machine learning frameworks (e.g. TensorFlow or Pytorch) on Taurus and to run 
-modules. See the previous chapter with the sbatch file example.
+your Jupyter notebooks on HPC nodes.
-Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are available. (25.02.20)
+After accessing JupyterHub, you can start a new session and configure it. For machine learning 
+purposes, select either **Alpha** or **ML** partition and the resources, your application requires. 
-Note: However in case of using sbatch files to send your job you usually don't need a virtual
-environment.
-### JupyterNotebook
+## Machine Learning with Containers
-The Jupyter Notebook is an open-source web application that allows you to create documents
+Some machine learning tasks require using containers. In the HPC domain, the [Singularity](https://singularity.hpcng.org/)
-containing live code, equations, visualizations, and narrative text. Jupyter notebook allows working
+container system is a widely used tool. Docker containers can also be used by Singularity. You can 
-with TensorFlow on Taurus with GUI (graphic user interface) in a **web browser** and the opportunity
+find further information on working with containers on ZIH systems [here](containers.md)
-to see intermediate results step by step of your work. This can be useful for users who dont have
-huge experience with HPC or Linux.
-There is [JupyterHub](../access/jupyterhub.md) on Taurus, where you can simply run your Jupyter
-notebook on HPC nodes. Also, for more specific cases you can run a manually created remote jupyter
-server. You can find the manual server setup [here](deep_learning.md). However, the simplest option
-for beginners is using JupyterHub.
-JupyterHub is available at
-[taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter)
-After logging, you can start a new session and configure it. There are simple and advanced forms to
-set up your session. On the simple form, you have to choose the "IBM Power (ppc64le)" architecture.
-You can select the required number of CPUs and GPUs. For the acquaintance with the system through
-the examples below the recommended amount of CPUs and 1 GPU will be enough.
-With the advanced form, you can use
-the configuration with 1 GPU and 7 CPUs. To access for all your workspaces use " / " in the
-workspace scope. Please check updates and details [here](../access/jupyterhub.md).
-Several Tensorflow and PyTorch examples for the Jupyter notebook have been prepared based on some
-simple tasks and models which will give you an understanding of how to work with ML frameworks and
-JupyterHub. It could be found as the [attachment] **todo** %ATTACHURL%/machine_learning_example.py
-in the bottom of the page. A detailed explanation and examples for TensorFlow can be found
-[here](tensor_flow_on_jupyter_notebook.md). For the Pytorch - [here](py_torch.md).  Usage information
-about the environments for the JupyterHub could be found [here](../access/jupyterhub.md) in the chapter
-*Creating and using your own environment*.
-Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are
-available. (25.02.20)
-### Containers
-Some machine learning tasks such as benchmarking require using containers. A container is a standard
-unit of software that packages up code and all its dependencies so the application runs quickly and
-reliably from one computing environment to another.  Using containers gives you more flexibility
-working with modules and software but at the same time requires more effort.
-On Taurus [Singularity](https://sylabs.io/) is used as a standard container solution.  Singularity
-enables users to have full control of their environment.  This means that **you dont have to ask an
-HPC support to install anything for you - you can put it in a Singularity container and run!**As
-opposed to Docker (the beat-known container solution), Singularity is much more suited to being used
-in an HPC environment and more efficient in many cases. Docker containers also can easily be used by
-Singularity from the [DockerHub](https://hub.docker.com) for instance. Also, some containers are
-available in [Singularity Hub](https://singularity-hub.org/).
-The simplest option to start working with containers on HPC-DA is importing from Docker or
-SingularityHub container with TensorFlow. It does **not require root privileges** and so works on
-Taurus directly:
-```Bash
-srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash           #allocating resourses from ml nodes to start the job to create a container.<br />singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le             #create a container from the DockerHub with the last TensorFlow version<br />singularity run --nv my-ML-container.sif                                            #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec
-```
 There are two sources for containers for Power9 architecture with
 Tensorflow and PyTorch on the board:
@@ -189,43 +116,18 @@ Tensorflow and PyTorch on the board:
 Note: You could find other versions of software in the container on the "tag" tab on the docker web
 page of the container.
-To use not a pure Tensorflow, PyTorch but also with some Python packages
+In the following example, we build a Singularity container with TensorFlow from the DockerHub and
-you have to use the definition file to create the container
+start it:
-(bootstrapping). For details please see the [Container](containers.md) page
-from our wiki. Bootstrapping **has required root privileges** and
-Virtual Machine (VM) should be used! There are two main options on how
-to work with VM on Taurus: [VM tools](vm_tools.md) - automotive algorithms
-for using virtual machines; [Manual method](virtual_machines.md) - it requires more
-operations but gives you more flexibility and reliability.
-## Interactive Session Examples
-### Tensorflow-Test
+```console
+marie@login$ srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash           #allocating resourses from ml nodes to start the job to create a container.
-    tauruslogin6 :~> srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=10000 bash
+marie@ml$ singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le             #create a container from the DockerHub with the last TensorFlow version
-    srun: job 4374195 queued and waiting for resources
+marie@ml$ singularity run --nv my-ML-container.sif                                            #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec
-    srun: job 4374195 has been allocated resources
+```
-    taurusml22 :~> ANACONDA2_INSTALL_PATH='/opt/anaconda2'
-    taurusml22 :~> ANACONDA3_INSTALL_PATH='/opt/anaconda3'
-    taurusml22 :~> export PATH=$ANACONDA3_INSTALL_PATH/bin:$PATH
-    taurusml22 :~> source /opt/DL/tensorflow/bin/tensorflow-activate
-    taurusml22 :~> tensorflow-test
-    Basic test of tensorflow - A Hello World!!!...
-    #or:
-    taurusml22 :~> module load TensorFlow/1.10.0-PythonAnaconda-3.6
-Or to use the whole node: `--gres=gpu:6 --exclusive --pty`
-### In Singularity container:
-    rotscher@tauruslogin6:~&gt; srun -p ml --gres=gpu:6 --pty bash
-    [rotscher@taurusml22 ~]$ singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img
-    Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~&gt; export PATH=/opt/anaconda3/bin:$PATH
-    Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~&gt; . /opt/DL/tensorflow/bin/tensorflow-activate
-    Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~&gt; tensorflow-test
-## Additional libraries
+## Additional Libraries for Machine Learning
 The following NVIDIA libraries are available on all nodes:
@@ -238,9 +140,11 @@ Note: For optimal NCCL performance it is recommended to set the
 **NCCL_MIN_NRINGS** environment variable during execution. You can try
 different values but 4 should be a pretty good starting point.
-    export NCCL_MIN_NRINGS=4
+```console
+marie@compute$ export NCCL_MIN_NRINGS=4
+```
-\<span style="color: #222222; font-size: 1.385em;">HPC\</span>
+### HPC
 The following HPC related software is installed on all nodes: