Commit 84944d39 authored by Martin Schroschk

GetStartedWithHPCDA: Fix checks
# Get started with HPC-DA

HPC-DA (High-Performance Computing and Data Analytics) is a part of the TU Dresden general purpose
HPC cluster (Taurus). HPC-DA is the best **option** for **Machine learning, Deep learning**
applications and tasks connected with big data.

**This is an introduction on how to run machine learning applications on the HPC-DA system.**

The main **aim** of this guide is to help users who have started working with Taurus and focus on
working with Machine learning frameworks such as TensorFlow or PyTorch.

**Prerequisites:** To work with HPC-DA, you need a [login](../access/Login.md) for the Taurus
system and preferably basic knowledge about high-performance computing and Python.

**Disclaimer:** This guide provides the main steps on the way of using Taurus; for details please
follow the links in the text.
You can also find the information you need in the
[HPC-Introduction](%ATTACHURL%/HPC-Introduction.pdf?t=1585216700) and
[HPC-DA-Introduction](%ATTACHURL%/HPC-DA-Introduction.pdf?t=1585162693) presentation slides.
## Why should I use HPC-DA? The architecture and features of HPC-DA

HPC-DA is built on the [Power9](https://www.ibm.com/it-infrastructure/power/power9) architecture
from IBM. HPC-DA consists of
[AC922 IBM servers](https://www.ibm.com/ie-en/marketplace/power-systems-ac922), which were created
for AI challenges, analytics, machine learning, data-intensive workloads, deep-learning frameworks
and accelerated databases. POWER9 is a processor with state-of-the-art I/O subsystem technology,
including next-generation NVIDIA NVLink, PCIe Gen4 and OpenCAPI.
[Here](../use_of_hardware/Power9.md) you can find a detailed specification of the TU Dresden
HPC-DA system.

The main feature of the Power9 architecture (ppc64le) is the ability to work with the
[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link**
support. NV-Link technology provides a total bandwidth of 300 gigabytes per second (GB/sec) - 10X
the bandwidth of PCIe Gen 3. The bandwidth is a crucial factor for deep learning and machine
learning applications.

**Note:** The Power9 architecture is not as common as the x86 architecture. This means you are not
as flexible in choosing applications for your projects. Even so, the main tools and applications
are available. See available modules here.

**Please use the ml partition if you need GPUs!** Otherwise using the x86 partitions (e.g. Haswell)
would most likely be more beneficial.
## Login

### SSH Access
The recommended way is to connect to the HPC login servers directly via ssh:

```Bash
ssh <zih-login>@taurus.hrsk.tu-dresden.de
```
Type this command in the terminal and replace `<zih-login>` with the login that you received during
the access procedure. Accept the host verification and enter your password.
This method requires two conditions: a Linux OS and a workstation within the campus network. For
other options and details check the [login page](../access/Login.md).
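If you connect regularly, you can optionally define a host alias in the SSH configuration on your
local machine. This is a sketch: the alias name `taurus` is just an example, and `<zih-login>` is
again the placeholder for your own login.

```Bash
# ~/.ssh/config on your local machine (optional convenience entry)
Host taurus
    HostName taurus.hrsk.tu-dresden.de
    User <zih-login>
```

After that, `ssh taurus` is enough to log in.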
## Data management

### Workspaces
As soon as you have access to HPC-DA you have to manage your data. The main method of working with
data on Taurus is using workspaces. You could work with simple examples in your home directory
(where you land by default). However, in accordance with the
[storage concept](../data_management/HPCStorageConcept2019.md), **please use** a
[workspace](../data_management/Workspaces.md) for your study and work projects.
You should create your workspace with a similar command:

```Bash
ws_allocate -F scratch Machine_learning_project 50    #Allocate a workspace in the scratch file system for 50 days
```
After the command, you will get an output with the path of the workspace, which is based on
scratch. Use it to store the main data of your project.
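The workspace tools provide further commands, e.g. for listing, extending, and releasing
workspaces. This is a sketch; the exact options may differ, so check the help output (e.g.
`ws_allocate -h`) on Taurus:

```Bash
ws_list                                            #List your current workspaces and their remaining lifetimes
ws_extend -F scratch Machine_learning_project 50   #Extend the workspace for another 50 days
ws_release -F scratch Machine_learning_project     #Release the workspace when the project is done
```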
For different purposes, you should use different storage systems. To work as efficiently as
possible, consider the following points:
- Save source code etc. in `/home` or `/projects/...`
- Store checkpoints and other massive but temporary data with workspaces in: `/scratch/ws/...`
- For data that seldom changes but consumes a lot of space, use mid-term storage with workspaces:
  `/warm_archive/...`
- For large parallel applications where using the fastest file system is a necessity, use
  workspaces in: `/lustre/ssd/...`
- Compilation in `/dev/shm` or `/tmp`
### Data moving

#### Moving data to/from the HPC machines
To copy data to/from the HPC machines, the Taurus [export nodes](../data_moving/ExportNodes.md)
should be used. They are the preferred way to transfer your data. There are three possibilities for
exchanging data between your local machine (lm) and the HPC machines (hm): **SCP, RSYNC, SFTP**.
Type the following commands in the local directory of the local machine. In the examples below, the
`scp` command is used.
#### Copy data from lm to hm

```Bash
scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>          #Copy a file from your local machine. For example: scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mustermann-Machine_learning_project/
scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>  #Copy a directory from your local machine.
```
#### Copy data from hm to lm

```Bash
scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location>          #Copy a file. For example: scp mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mustermann-Machine_learning_project/helloworld.txt /home/mustermann/Downloads
scp -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location>  #Copy a directory
```
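Since the export nodes also support `rsync` (one of the three options named above), large transfers
can be made resumable. This is a sketch with the same placeholders as in the `scp` examples:

```Bash
rsync -avP <directory>/ <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>/   #Archive mode with progress; keeps partial files, so re-running resumes an interrupted transfer
```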
#### Moving data inside the HPC machines. Datamover

The best way to transfer data inside Taurus is the [data mover](../data_moving/DataMover.md). It is
a special data transfer machine providing the global file systems of each ZIH HPC system. The
datamover provides the best transfer speed. To load, move, copy etc. files from one file system to
another file system, you have to use commands with the **dt** prefix, such as:
`dtcp, dtwget, dtmv, dtrm, dtrsync, dttar, dtls`
These commands submit a job to the data transfer machines that executes the selected command.
Except for the `dt` prefix, their syntax is the same as the shell command without the `dt`.
```Bash
dtcp -r /scratch/ws/<name_of_your_workspace>/results /lustre/ssd/ws/<name_of_your_workspace>  #Copy from a workspace in scratch to ssd.
dtwget https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz                               #Download the CIFAR-100 archive.
```
## BatchSystems. SLURM
After logging in and preparing your data for further work, the next logical step is to start your
job. For these purposes, SLURM is used. Slurm (Simple Linux Utility for Resource Management) is an
open-source job scheduler that allocates compute resources on clusters for queued defined jobs. By
default, after logging in, you are using the login nodes. The intended purpose of these nodes
speaks for itself. Applications on an HPC system can not be run there! They have to be submitted to
compute nodes (ml nodes for HPC-DA) with dedicated resources for user jobs.

Job submission can be done with the command: `srun [options] <command>`

This is a simple example which you could use for your start. The `srun` command is used to submit a
job for execution in real-time, designed for interactive use, with monitoring of the output. For
some details please check [the Slurm page](../jobs/Slurm.md).

```Bash
srun -p ml -N 1 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash   #Job submission on ml nodes, allocating: 1 node, 1 GPU per node, 8000 MB per CPU, for 1 hour.
```

However, using srun directly on the shell will block it and launch an interactive job. Apart from
short test runs, it is **recommended to launch your jobs in the background by using batch jobs**.
For that, you can conveniently put the parameters directly into a job file, which you can submit
using `sbatch [options] <job file>`.
This is an example of an sbatch file to run your application:

```Bash
#!/bin/bash
#SBATCH --mem=8GB                      # specify the needed memory
#SBATCH -p ml                          # specify ml partition
#SBATCH --gres=gpu:1                   # use 1 GPU per node (i.e. use one GPU per task)
#SBATCH --nodes=1                      # request 1 node
#SBATCH --time=00:15:00                # runs for 15 minutes
#SBATCH -c 1                           # how many cores per task allocated
#SBATCH -o HLR_name_your_script.out    # save output message under HLR_${SLURMJOBID}.out
#SBATCH -e HLR_name_your_script.err    # save error messages under HLR_${SLURMJOBID}.err

module load modenv/ml
module load TensorFlow

python machine_learning_example.py

## when finished writing, submit with: sbatch <script_name> For example: sbatch machine_learning_script.slurm
```

The `machine_learning_example.py` contains a simple ml application based on the mnist model to test
your sbatch file. It can be found as the [attachment](%ATTACHURL%/machine_learning_example.py) at
the bottom of the page.
## Start your application

As stated before, HPC-DA was created for deep learning and machine learning applications. Machine
learning frameworks such as TensorFlow and PyTorch are industry standards now.
There are three main options on how to work with Tensorflow and PyTorch:

1. **Modules**
1. **JupyterNotebook**
1. **Containers**
### Modules

The easiest way is using the [modules system](modules.md) and a Python virtual environment. Modules
are a way to use frameworks, compilers, loaders, libraries, and utilities. A module is a user
interface that provides utilities for the dynamic modification of a user's environment without
manual modifications. You could use them for srun, batch jobs (sbatch) and JupyterHub.
A virtual environment is a cooperatively isolated runtime environment that allows Python users and
applications to install and update Python distribution packages without interfering with the
behaviour of other Python applications running on the same system. At its core, the main purpose of
Python virtual environments is to create an isolated environment for Python projects.

**Virtualenv (venv)** is a standard Python tool to create isolated Python environments. We
recommend using venv to work with Tensorflow and Pytorch on Taurus. It has been integrated into the
standard library under the [venv module](https://docs.python.org/3/library/venv.html). However, if
you have reasons (previously created environments etc.) you could easily use conda. Conda is the
second way to use a virtual environment on Taurus.
[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
is an open-source package management system and environment management system from Anaconda.
As was written in the previous chapter, to start the application (using modules) and to run the
job, there are two main options:

- The `srun` command:
```Bash
srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash   #Job submission on ml nodes, allocating: 1 node, 1 task per node, 2 CPUs per task, 1 GPU per node, 8000 MB per CPU, for 1 hour.
module load modenv/ml                        #Example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml
mkdir python-virtual-environments            #Create a folder for your environments
cd python-virtual-environments               #Go to the folder
module load TensorFlow                       #Load the TensorFlow module to use python. Example output: Module TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4 and 31 dependencies loaded.
which python                                 #Check which python you are using
python3 -m venv --system-site-packages env   #Create the virtual environment "env", which inherits the global site packages
source env/bin/activate                      #Activate the virtual environment "env". Example output: (env) bash-4.2$
```
The prefix (env) at the beginning of each line shows that you are now in the virtual environment.
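A virtual environment can be left and re-entered at any time; this is standard venv behaviour and
works the same way in a later session. A sketch, repeating the creation step from above so it is
self-contained:

```Bash
python3 -m venv --system-site-packages env   # create the environment (as above)
source env/bin/activate                      # activate it; the prompt now shows (env)
deactivate                                   # leave the virtual environment
source env/bin/activate                      # re-activate the same environment, e.g. in a later job
```

Packages installed with `pip` while the environment is active stay inside `env` and are available
again after re-activation.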
Now you can check that the current environment works:
```Bash
python                   # start python
import tensorflow as tf
print(tf.__version__)    # example output: 2.1.0
```

The second and main option is using batch jobs (`sbatch`). It is used to submit a job script for
later execution. Consequently, it is **recommended to launch your jobs in the background by using
batch jobs**. To launch your machine learning application, as with the srun job, you need to use
modules. See the previous chapter for the sbatch file example.
Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are available. (25.02.20)

Note: When using sbatch files to submit your job, you usually don't need a virtual environment.
### JupyterNotebook
The Jupyter Notebook is an open-source web application that allows you to create documents
containing live code, equations, visualizations, and narrative text. Jupyter notebook allows
working with TensorFlow on Taurus with a GUI (graphic user interface) in a **web browser**, with
the opportunity to see intermediate results of your work step by step. This can be useful for users
who don't have much experience with HPC or Linux.
There is [JupyterHub](JupyterHub.md) on Taurus, where you can simply run your Jupyter notebook on
HPC nodes. Also, for more specific cases you can run a manually created remote jupyter server. You
can find the manual server setup [here](DeepLearning.md). However, the simplest option for
beginners is using JupyterHub.
JupyterHub is available at
[taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter)
After logging in, you can start a new session and configure it. There are simple and advanced forms
to set up your session. In the simple form, you have to choose the "IBM Power (ppc64le)"
architecture. You can select the required number of CPUs and GPUs. For getting acquainted with the
system through the examples below, the recommended number of CPUs and 1 GPU will be enough. With
the advanced form, you can use the configuration with 1 GPU and 7 CPUs. To access all your
workspaces, use " / " in the workspace scope. Please check updates and details
[here](JupyterHub.md).
Several Tensorflow and PyTorch examples for the Jupyter notebook have been prepared based on some
simple tasks and models which will give you an understanding of how to work with ML frameworks and
JupyterHub. They can be found as the [attachment](%ATTACHURL%/machine_learning_example.py) at the
bottom of the page. A detailed explanation and examples for TensorFlow can be found
[here](TensorFlowOnJupyterNotebook.md). For Pytorch, see [here](PyTorch.md). Usage information
about the environments for JupyterHub can be found [here](JupyterHub.md) in the chapter
*Creating and using your own environment*.
### Containers
Some machine learning tasks such as benchmarking require the use of containers. A container is a
standard unit of software that packages up code and all its dependencies so the application runs
quickly and reliably from one computing environment to another. Using containers gives you more
flexibility working with modules and software but at the same time requires more effort.

On Taurus, [Singularity](https://sylabs.io/) is used as the standard container solution.
Singularity enables users to have full control of their environment. This means that **you don't
have to ask HPC support to install anything for you - you can put it in a Singularity container and
run!** As opposed to Docker (the best-known container solution), Singularity is much more suited to
being used in an HPC environment and more efficient in many cases. Docker containers can also
easily be used by Singularity, from [DockerHub](https://hub.docker.com) for instance. Also, some
containers are available in [Singularity Hub](https://singularity-hub.org/).
**3.** **Containers** The simplest option to start working with containers on HPC-DA is importing from Docker or
SingularityHub container with TensorFlow. It does **not require root privileges** and so works on
Some machine learning tasks such as benchmarking require using Taurus directly:
containers. A container is a standard unit of software that packages up
code and all its dependencies so the application runs quickly and ```Bash
reliably from one computing environment to another. \<span srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash #allocating resourses from ml nodes to start the job to create a container.<br />singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version<br />singularity run --nv my-ML-container.sif #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec
style="font-size: 1em;">Using containers gives you more flexibility ```
working with modules and software but at the same time requires more
effort.\</span>
On Taurus \<a href="<https://sylabs.io/>"
target="\_blank">Singularity\</a> used as a standard container solution.
Singularity enables users to have full control of their environment.
This means that **you dont have to ask an HPC support to install
anything for you - you can put it in a Singularity container and
run!**As opposed to Docker (the beat-known container solution),
Singularity is much more suited to being used in an HPC environment and
more efficient in many cases. Docker containers also can easily be used
by Singularity from the [DockerHub](https://hub.docker.com) for
instance. Also, some containers are available in \<a
href="<https://singularity-hub.org/>"
target="\_blank">SingularityHub\</a>.
\<span style="font-size: 1em;">The simplest option to start working with
containers on HPC-DA is i\</span>\<span style="font-size: 1em;">mporting
from Docker or SingularityHub container with TensorFlow. It does
\</span> **not require root privileges** \<span style="font-size: 1em;">
and so works on Taurus directly\</span>\<span style="font-size: 1em;">:
\</span>
srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash #allocating resourses from ml nodes to start the job to create a container.<br />singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version<br />singularity run --nv my-ML-container.sif #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec
There are two sources for containers for Power9 architecture with
Tensorflow and PyTorch on the board:

* [Tensorflow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le):
  Community-supported ppc64le docker container for TensorFlow.
* [PowerAI container](https://hub.docker.com/r/ibmcom/powerai/):
  Official Docker container with Tensorflow, PyTorch and many other packages.
  Heavy container. It requires a lot of space. Could be found on Taurus.

Note: You could find other versions of software in the container on the "tag" tab on the docker web
page of the container.
To use not only pure TensorFlow or PyTorch but also additional Python packages,
you have to use a definition file to create the container
(bootstrapping). For details, please see the [Container](containers.md) page
from our wiki. Bootstrapping **requires root privileges**, so a
Virtual Machine (VM) should be used! There are two main options on how
to work with VMs on Taurus: [VM tools](VMTools.md) - automated algorithms
for using virtual machines; [Manual method](Cloud.md) - it requires more
operations but gives you more flexibility and reliability.
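A minimal sketch of such a definition file (the extra packages are hypothetical examples; the build
itself must be done inside the VM, or on any machine where you have root):

```
Bootstrap: docker
From: ibmcom/tensorflow-ppc64le

%post
    # Runs once at build time with root inside the container:
    # install additional Python packages next to TensorFlow.
    pip install scikit-learn pandas

%runscript
    # Executed on "singularity run my-ML-container.sif"
    exec python "$@"
```

Build it with `sudo singularity build my-ML-container.sif my-ML-container.def` inside the VM, copy
the resulting `.sif` file to Taurus, and run it there with `singularity run --nv`.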
-- Main.AndreiPolitov - 2020-02-05

- [machine_learning_example.py](%ATTACHURL%/machine_learning_example.py)
- [example_TensofFlow_MNIST.zip](%ATTACHURL%/example_TensofFlow_MNIST.zip)
- [example_Pytorch_MNIST.zip](%ATTACHURL%/example_Pytorch_MNIST.zip)
- [example_Pytorch_image_recognition.zip](%ATTACHURL%/example_Pytorch_image_recognition.zip)
- [example_TensorFlow_Automobileset.zip](%ATTACHURL%/example_TensorFlow_Automobileset.zip)
- [HPC-Introduction.pdf](%ATTACHURL%/HPC-Introduction.pdf)
- [HPC-DA-Introduction.pdf](%ATTACHURL%/HPC-DA-Introduction.pdf)