Commit 84944d39 authored by Martin Schroschk

GetStartedWithHPCDA: Fix checks
# Get started with HPC-DA

HPC-DA (High-Performance Computing and Data Analytics) is a part of the TU Dresden general purpose
HPC cluster (Taurus). HPC-DA is the best **option** for **Machine learning, Deep learning**
applications and tasks connected with big data.

**This is an introduction on how to run machine learning applications on the HPC-DA system.**

The main **aim** of this guide is to help users who have started working with Taurus and focus on
working with Machine learning frameworks such as TensorFlow or PyTorch.

**Prerequisites:** To work with HPC-DA, you need a [login](../access/Login.md) for the Taurus
system and preferably basic knowledge about high-performance computing and Python.

**Disclaimer:** This guide provides the main steps on the way of using Taurus; for details please
follow the links in the text.
You can also find the information you need in the
[HPC-Introduction](%ATTACHURL%/HPC-Introduction.pdf?t=1585216700) and
[HPC-DA-Introduction](%ATTACHURL%/HPC-DA-Introduction.pdf?t=1585162693) presentation slides.
## Why should I use HPC-DA? The architecture and features of HPC-DA

HPC-DA is built on the [Power9](https://www.ibm.com/it-infrastructure/power/power9) architecture
from IBM. HPC-DA consists of
[AC922 IBM servers](https://www.ibm.com/ie-en/marketplace/power-systems-ac922), which were created
for AI challenges, analytics, machine learning, data-intensive workloads, deep-learning frameworks
and accelerated databases. POWER9 is a processor with state-of-the-art I/O subsystem technology,
including next-generation NVIDIA NVLink, PCIe Gen4 and OpenCAPI.
[Here](../use_of_hardware/Power9.md) you can find a detailed specification of the TU Dresden
HPC-DA system.

The main feature of the Power9 architecture (ppc64le) is the ability to work with the
[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link**
support. NV-Link technology provides a total bandwidth of 300 gigabytes per second (GB/sec) - 10X
the bandwidth of PCIe Gen 3. The bandwidth is a crucial factor for deep learning and machine
learning applications.

**Note:** The Power9 architecture is not as common as the x86 architecture. This means you are not
as flexible in choosing applications for your projects. Even so, the main tools and applications
are available. See available modules here.

**Please use the ml partition if you need GPUs!** Otherwise using the x86 partitions (e.g. Haswell)
would most likely be more beneficial.
## Login

### SSH Access
The recommended way is to connect to the HPC login servers directly via ssh:

```Bash
ssh <zih-login>@taurus.hrsk.tu-dresden.de
```
Type this command in the terminal and replace `<zih-login>` with the login that you received during
the access procedure. Accept the host verification and enter your password.
This method requires two conditions: a Linux OS and a workstation within the campus network. For
other options and details check the [login page](../access/Login.md).
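If you connect regularly, you can optionally define a host alias in the SSH configuration on your
local machine. This is a sketch: the alias name `taurus` is just an example, and `<zih-login>` is
again the placeholder for your own login.

```Bash
# ~/.ssh/config on your local machine (optional convenience entry)
Host taurus
    HostName taurus.hrsk.tu-dresden.de
    User <zih-login>
```

After that, `ssh taurus` is enough to log in.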
## Data management

### Workspaces
As soon as you have access to HPC-DA you have to manage your data. The main method of working with
data on Taurus is using workspaces. You could work with simple examples in your home directory
(where you land by default). However, in accordance with the
[storage concept](../data_management/HPCStorageConcept2019.md), **please use** a
[workspace](../data_management/Workspaces.md) for your study and work projects.
You should create your workspace with a similar command:

```Bash
ws_allocate -F scratch Machine_learning_project 50    #Allocate a workspace in the scratch file system for 50 days
```
After the command, you will get an output with the path of the workspace, which is based on
scratch. Use it to store the main data of your project.
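The workspace tools provide further commands, e.g. for listing, extending, and releasing
workspaces. This is a sketch; the exact options may differ, so check the help output (e.g.
`ws_allocate -h`) on Taurus:

```Bash
ws_list                                            #List your current workspaces and their remaining lifetimes
ws_extend -F scratch Machine_learning_project 50   #Extend the workspace for another 50 days
ws_release -F scratch Machine_learning_project     #Release the workspace when the project is done
```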
For different purposes, you should use different storage systems. To work as efficiently as
possible, consider the following points:
- Save source code etc. in `/home` or `/projects/...`
- Store checkpoints and other massive but temporary data with workspaces in: `/scratch/ws/...`
- For data that seldom changes but consumes a lot of space, use mid-term storage with workspaces:
  `/warm_archive/...`
- For large parallel applications where using the fastest file system is a necessity, use
  workspaces in: `/lustre/ssd/...`
- Compilation in `/dev/shm` or `/tmp`
### Data moving

#### Moving data to/from the HPC machines
To copy data to/from the HPC machines, the Taurus [export nodes](../data_moving/ExportNodes.md)
should be used. They are the preferred way to transfer your data. There are three possibilities for
exchanging data between your local machine (lm) and the HPC machines (hm): **SCP, RSYNC, SFTP**.
Type the following commands in the local directory of the local machine. In the examples below, the
`scp` command is used.
#### Copy data from lm to hm

```Bash
scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>          #Copy a file from your local machine. For example: scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mustermann-Machine_learning_project/
scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>  #Copy a directory from your local machine.
```
#### Copy data from hm to lm

```Bash
scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location>          #Copy a file. For example: scp mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mustermann-Machine_learning_project/helloworld.txt /home/mustermann/Downloads
scp -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location>  #Copy a directory
```
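Since the export nodes also support `rsync` (one of the three options named above), large transfers
can be made resumable. This is a sketch with the same placeholders as in the `scp` examples:

```Bash
rsync -avP <directory>/ <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>/   #Archive mode with progress; keeps partial files, so re-running resumes an interrupted transfer
```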
#### Moving data inside the HPC machines. Datamover

The best way to transfer data inside Taurus is the [data mover](../data_moving/DataMover.md). It is
a special data transfer machine providing the global file systems of each ZIH HPC system. The
datamover provides the best transfer speed. To load, move, copy etc. files from one file system to
another file system, you have to use commands with the **dt** prefix, such as:
`dtcp, dtwget, dtmv, dtrm, dtrsync, dttar, dtls`
These commands submit a job to the data transfer machines that executes the selected command.
Except for the `dt` prefix, their syntax is the same as the shell command without the `dt`.
```Bash
dtcp -r /scratch/ws/<name_of_your_workspace>/results /lustre/ssd/ws/<name_of_your_workspace>  #Copy from a workspace in scratch to ssd.
dtwget https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz                               #Download the CIFAR-100 archive.
```
## BatchSystems. SLURM
After logging in and preparing your data for further work, the next logical step is to start your
job. For these purposes, SLURM is used. Slurm (Simple Linux Utility for Resource Management) is an
open-source job scheduler that allocates compute resources on clusters for queued defined jobs. By
default, after logging in, you are using the login nodes. The intended purpose of these nodes
speaks for itself. Applications on an HPC system can not be run there! They have to be submitted to
compute nodes (ml nodes for HPC-DA) with dedicated resources for user jobs.

Job submission can be done with the command: `srun [options] <command>`

This is a simple example which you could use for your start. The `srun` command is used to submit a
job for execution in real-time, designed for interactive use, with monitoring of the output. For
some details please check [the Slurm page](../jobs/Slurm.md).

```Bash
srun -p ml -N 1 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash   #Job submission on ml nodes, allocating: 1 node, 1 GPU per node, 8000 MB per CPU, for 1 hour.
```

However, using srun directly on the shell will block it and launch an interactive job. Apart from
short test runs, it is **recommended to launch your jobs in the background by using batch jobs**.
For that, you can conveniently put the parameters directly into a job file, which you can submit
using `sbatch [options] <job file>`.
This is an example of an sbatch file to run your application:

```Bash
#!/bin/bash
#SBATCH --mem=8GB                      # specify the needed memory
#SBATCH -p ml                          # specify ml partition
#SBATCH --gres=gpu:1                   # use 1 GPU per node (i.e. use one GPU per task)
#SBATCH --nodes=1                      # request 1 node
#SBATCH --time=00:15:00                # runs for 15 minutes
#SBATCH -c 1                           # how many cores per task allocated
#SBATCH -o HLR_name_your_script.out    # save output message under HLR_${SLURMJOBID}.out
#SBATCH -e HLR_name_your_script.err    # save error messages under HLR_${SLURMJOBID}.err

module load modenv/ml
module load TensorFlow

python machine_learning_example.py

## when finished writing, submit with: sbatch <script_name> For example: sbatch machine_learning_script.slurm
```

The `machine_learning_example.py` contains a simple ml application based on the mnist model to test
your sbatch file. It can be found as the [attachment](%ATTACHURL%/machine_learning_example.py) at
the bottom of the page.
## Start your application

As stated before, HPC-DA was created for deep learning and machine learning applications. Machine
learning frameworks such as TensorFlow and PyTorch are industry standards now.
There are three main options on how to work with Tensorflow and PyTorch:

1. **Modules**
1. **JupyterNotebook**
1. **Containers**
### Modules

The easiest way is using the [modules system](modules.md) and a Python virtual environment. Modules
are a way to use frameworks, compilers, loaders, libraries, and utilities. A module is a user
interface that provides utilities for the dynamic modification of a user's environment without
manual modifications. You could use them for srun, batch jobs (sbatch) and JupyterHub.
A virtual environment is a cooperatively isolated runtime environment that allows Python users and
applications to install and update Python distribution packages without interfering with the
behaviour of other Python applications running on the same system. At its core, the main purpose of
Python virtual environments is to create an isolated environment for Python projects.

**Virtualenv (venv)** is a standard Python tool to create isolated Python environments. We
recommend using venv to work with Tensorflow and Pytorch on Taurus. It has been integrated into the
standard library under the [venv module](https://docs.python.org/3/library/venv.html). However, if
you have reasons (previously created environments etc.) you could easily use conda. Conda is the
second way to use a virtual environment on Taurus.
[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
is an open-source package management system and environment management system from Anaconda.
As was written in the previous chapter, to start the application (using modules) and to run the
job, there are two main options:

- The `srun` command:
```Bash
srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash   #Job submission on ml nodes, allocating: 1 node, 1 task per node, 2 CPUs per task, 1 GPU per node, 8000 MB per CPU, for 1 hour.
module load modenv/ml                        #Example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml
mkdir python-virtual-environments            #Create a folder for your environments
cd python-virtual-environments               #Go to the folder
module load TensorFlow                       #Load the TensorFlow module to use python. Example output: Module TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4 and 31 dependencies loaded.
which python                                 #Check which python you are using
python3 -m venv --system-site-packages env   #Create the virtual environment "env", which inherits the global site packages
source env/bin/activate                      #Activate the virtual environment "env". Example output: (env) bash-4.2$
```
The prefix (env) at the beginning of each line shows that you are now in the virtual environment.
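A virtual environment can be left and re-entered at any time; this is standard venv behaviour and
works the same way in a later session. A sketch, repeating the creation step from above so it is
self-contained:

```Bash
python3 -m venv --system-site-packages env   # create the environment (as above)
source env/bin/activate                      # activate it; the prompt now shows (env)
deactivate                                   # leave the virtual environment
source env/bin/activate                      # re-activate the same environment, e.g. in a later job
```

Packages installed with `pip` while the environment is active stay inside `env` and are available
again after re-activation.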
Now you can check that the current environment works:
```Bash
python                   # start python
import tensorflow as tf
print(tf.__version__)    # example output: 2.1.0
```

The second and main option is using batch jobs (`sbatch`). It is used to submit a job script for
later execution. Consequently, it is **recommended to launch your jobs in the background by using
batch jobs**. To launch your machine learning application, as with the srun job, you need to use
modules. See the previous chapter for the sbatch file example.
Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are available. (25.02.20)

Note: When using sbatch files to submit your job, you usually don't need a virtual environment.
### JupyterNotebook
The Jupyter Notebook is an open-source web application that allows you to create documents
containing live code, equations, visualizations, and narrative text. Jupyter notebook allows
working with TensorFlow on Taurus with a GUI (graphic user interface) in a **web browser**, with
the opportunity to see intermediate results of your work step by step. This can be useful for users
who don't have much experience with HPC or Linux.
There is [JupyterHub](JupyterHub.md) on Taurus, where you can simply run your Jupyter notebook on
HPC nodes. Also, for more specific cases you can run a manually created remote jupyter server. You
can find the manual server setup [here](DeepLearning.md). However, the simplest option for
beginners is using JupyterHub.
JupyterHub is available at
[taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter)
After logging in, you can start a new session and configure it. There are simple and advanced forms
to set up your session. In the simple form, you have to choose the "IBM Power (ppc64le)"
architecture. You can select the required number of CPUs and GPUs. For getting acquainted with the
system through the examples below, the recommended number of CPUs and 1 GPU will be enough. With
the advanced form, you can use the configuration with 1 GPU and 7 CPUs. To access all your
workspaces, use " / " in the workspace scope. Please check updates and details
[here](JupyterHub.md).
Several Tensorflow and PyTorch examples for the Jupyter notebook have been prepared based on some
simple tasks and models which will give you an understanding of how to work with ML frameworks and
JupyterHub. They can be found as the [attachment](%ATTACHURL%/machine_learning_example.py) at the
bottom of the page. A detailed explanation and examples for TensorFlow can be found
[here](TensorFlowOnJupyterNotebook.md). For Pytorch, see [here](PyTorch.md). Usage information
about the environments for JupyterHub can be found [here](JupyterHub.md) in the chapter
*Creating and using your own environment*.
### Containers
Some machine learning tasks such as benchmarking require the use of containers. A container is a
standard unit of software that packages up code and all its dependencies so the application runs
quickly and reliably from one computing environment to another. Using containers gives you more
flexibility working with modules and software but at the same time requires more effort.

On Taurus, [Singularity](https://sylabs.io/) is used as the standard container solution.
Singularity enables users to have full control of their environment. This means that **you don't
have to ask HPC support to install anything for you - you can put it in a Singularity container and
run!** As opposed to Docker (the best-known container solution), Singularity is much more suited to
being used in an HPC environment and more efficient in many cases. Docker containers can also
easily be used by Singularity, from [DockerHub](https://hub.docker.com) for instance. Also, some
containers are available in [Singularity Hub](https://singularity-hub.org/).
**3.** **Containers** The simplest option to start working with containers on HPC-DA is importing from Docker or
SingularityHub container with TensorFlow. It does **not require root privileges** and so works on
Some machine learning tasks such as benchmarking require using Taurus directly:
containers. A container is a standard unit of software that packages up
code and all its dependencies so the application runs quickly and ```Bash
reliably from one computing environment to another. \<span srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash #allocating resourses from ml nodes to start the job to create a container.<br />singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version<br />singularity run --nv my-ML-container.sif #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec
style="font-size: 1em;">Using containers gives you more flexibility ```
working with modules and software but at the same time requires more
effort.\</span>
On Taurus \<a href="<https://sylabs.io/>"
target="\_blank">Singularity\</a> used as a standard container solution.
Singularity enables users to have full control of their environment.
This means that **you dont have to ask an HPC support to install
anything for you - you can put it in a Singularity container and
run!**As opposed to Docker (the beat-known container solution),
Singularity is much more suited to being used in an HPC environment and
more efficient in many cases. Docker containers also can easily be used
by Singularity from the [DockerHub](https://hub.docker.com) for
instance. Also, some containers are available in \<a
href="<https://singularity-hub.org/>"
target="\_blank">SingularityHub\</a>.
\<span style="font-size: 1em;">The simplest option to start working with
containers on HPC-DA is i\</span>\<span style="font-size: 1em;">mporting
from Docker or SingularityHub container with TensorFlow. It does
\</span> **not require root privileges** \<span style="font-size: 1em;">
and so works on Taurus directly\</span>\<span style="font-size: 1em;">:
\</span>
srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash #allocating resourses from ml nodes to start the job to create a container.<br />singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version<br />singularity run --nv my-ML-container.sif #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec
There are two sources for containers for Power9 architecture with
Tensorflow and PyTorch on the board:

* [Tensorflow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le):
  Community-supported ppc64le docker container for TensorFlow.
* [PowerAI container](https://hub.docker.com/r/ibmcom/powerai/):
  Official Docker container with Tensorflow, PyTorch and many other packages.
  Heavy container. It requires a lot of space. Could be found on Taurus.

Note: You could find other versions of software in the container on the "tag" tab on the docker web
page of the container.
To use not only pure TensorFlow or PyTorch but also additional Python packages,
you have to use a definition file to create the container
(bootstrapping). For details, please see the [Container](containers.md) page
from our wiki. Bootstrapping **requires root privileges**, so a
Virtual Machine (VM) should be used! There are two main options on how
to work with VMs on Taurus: [VM tools](VMTools.md) - automated algorithms
for using virtual machines; [Manual method](Cloud.md) - it requires more
operations but gives you more flexibility and reliability.
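A minimal sketch of such a definition file (the extra packages are hypothetical examples; the build
itself must be done inside the VM, or on any machine where you have root):

```
Bootstrap: docker
From: ibmcom/tensorflow-ppc64le

%post
    # Runs once at build time with root inside the container:
    # install additional Python packages next to TensorFlow.
    pip install scikit-learn pandas

%runscript
    # Executed on "singularity run my-ML-container.sif"
    exec python "$@"
```

Build it with `sudo singularity build my-ML-container.sif my-ML-container.def` inside the VM, copy
the resulting `.sif` file to Taurus, and run it there with `singularity run --nv`.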
-- Main.AndreiPolitov - 2020-02-05

- [machine_learning_example.py](%ATTACHURL%/machine_learning_example.py)
- [example_TensofFlow_MNIST.zip](%ATTACHURL%/example_TensofFlow_MNIST.zip)
- [example_Pytorch_MNIST.zip](%ATTACHURL%/example_Pytorch_MNIST.zip)
- [example_Pytorch_image_recognition.zip](%ATTACHURL%/example_Pytorch_image_recognition.zip)
- [example_TensorFlow_Automobileset.zip](%ATTACHURL%/example_TensorFlow_Automobileset.zip)
- [HPC-Introduction.pdf](%ATTACHURL%/HPC-Introduction.pdf)
- [HPC-DA-Introduction.pdf](%ATTACHURL%/HPC-DA-Introduction.pdf)