**Please use a [workspace](../data_management/Workspaces.md) for your study and work projects.**
You should create your workspace with a similar command:
```Bash
ws_allocate -F scratch Machine_learning_project 50      #allocate a workspace in the scratch file system for 50 days
```
After the command, you will have an output with the address of the workspace based on scratch. Use it to store the main data of your project.
For different purposes, you should use different storage systems. To work as efficiently as possible, consider the following points (a short workspace handling example follows the list):
- Save source code etc. in `/home` or `/projects/...`
- Store checkpoints and other massive but temporary data with workspaces in: `/scratch/ws/...`
- For data that seldom changes but consumes a lot of space, use mid-term storage with workspaces: `/warm_archive/...`
- For large parallel applications where using the fastest file system is a necessity, use workspaces in: `/lustre/ssd/...`
- Compile in `/dev/shm` or `/tmp`
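Workspaces on these file systems are managed with the `ws_*` tools. The following is only a sketch; the exact file system names (e.g. `warm_archive`) and the maximum lifetimes depend on the current system configuration, so please check the [workspace documentation](../data_management/Workspaces.md) first:

```Bash
ws_list                                               #list your existing workspaces and their remaining lifetimes
ws_allocate -F warm_archive ml_datasets 100           #hypothetical example: mid-term workspace for rarely changing data
ws_extend -F scratch Machine_learning_project 50      #extend the scratch workspace created above by another 50 days
ws_release -F scratch Machine_learning_project        #release the workspace when the project is finished
```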
### Data moving
#### Moving data to/from the HPC machines
To copy data to/from the HPC machines, the Taurus [export nodes](../data_moving/ExportNodes.md) should be used. They are the preferred way to transfer your data. There are three ways to exchange data between your local machine (lm) and the HPC machines (hm): **SCP, RSYNC, SFTP**.
Type the following commands in the local directory of the local machine. The examples below use the `scp` command; an `rsync` sketch follows them.
#### Copy data from lm to hm
```Bash
scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>          #Copy a file from your local machine. For example: scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mustermann-Machine_learning_project/
scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>  #Copy a directory from your local machine.
```
#### Copy data from hm to lm
```Bash
scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location>  #Copy a file. For example: scp mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mustermann-Machine_learning_project/helloworld.txt /home/mustermann/Downloads
```
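The export nodes can also be used with `rsync`, which is convenient for synchronizing whole directories and for resuming interrupted transfers. A sketch (the flags shown are common choices, not a site requirement):

```Bash
rsync -avz <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>        #copy a local directory to the HPC machines (archive mode, verbose, compressed)
rsync -avz <zih-user>@taurusexport.hrsk.tu-dresden.de:<source-location> <target-directory> #copy data from the HPC machines back to your local machine
```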
Job submission can be done with the command `srun [options] <command>`:

```Bash
srun -p ml -N 1 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash    #Job submission on the ml partition, allocating: 1 node, 1 GPU per node, 8000 MB memory per CPU, for 1 hour.
```
This is a simple example which you can use to get started. The `srun` command is used to submit a job for execution in real time; it is designed for interactive use with monitoring of the output. For details, please check [the Slurm page](Slurm).

However, using `srun` directly on the shell will block the shell and launch an interactive job. Apart from short test runs, it is **recommended to launch your jobs in the background by using batch jobs**. For that, you can conveniently put the parameters directly into a job file, which you can submit using `sbatch [options] <job file>`.
This is an example of an sbatch file to run your application:
```Bash
#!/bin/bash
#SBATCH --mem=8GB                            # specify the needed memory
#SBATCH -p ml                                # specify ml partition
#SBATCH --gres=gpu:1                         # use 1 GPU per node (i.e. use one GPU per task)
#SBATCH --nodes=1                            # request 1 node
#SBATCH --time=00:15:00                      # runs for 15 minutes
#SBATCH -c 1                                 # how many cores per task allocated
#SBATCH -o HLR_name_your_script.out          # save output messages under HLR_name_your_script.out
#SBATCH -e HLR_name_your_script.err          # save error messages under HLR_name_your_script.err

module load modenv/ml
module load TensorFlow

python machine_learning_example.py

## when finished writing, submit with: sbatch <script_name>  For example: sbatch machine_learning_script.slurm
```
The file `machine_learning_example.py` contains a simple machine learning application based on the MNIST model which you can use to test your sbatch file. It can be found as the [attachment](%ATTACHURL%/machine_learning_example.py) at the bottom of the page.
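The attachment itself is not reproduced on this page. Purely as an illustration (this is **not** the actual attachment, just a minimal script in the spirit of the standard TensorFlow/Keras MNIST quickstart; downloading the data set requires internet access on the node where it runs):

```Bash
cat > machine_learning_example.py << 'EOF'
import tensorflow as tf

# load and normalize the MNIST data set
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# small fully connected classifier
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)
EOF
```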
## Start your application
As stated before, HPC-DA was created for deep learning and machine learning applications. Machine learning frameworks such as TensorFlow and PyTorch are industry standards now.
There are three main options on how to work with TensorFlow and PyTorch:

1. Modules
2. JupyterHub
3. Containers
As was written in the previous chapter, to start the application (using modules) and to run the job, there are two main options:
- The `srun` command:
```Bash
srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash   #job submission on the ml partition, allocating: 1 node, 1 task per node, 2 CPUs per task, 1 GPU per node, 8000 MB memory per CPU, for 1 hour

module load modenv/ml                        #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml

mkdir python-virtual-environments            #create a folder for your environments
cd python-virtual-environments               #go to the folder
module load TensorFlow                       #load the TensorFlow module to use python. Example output: Module TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4 and 31 dependencies loaded.
which python                                 #check which python you are using
python3 -m venv --system-site-packages env   #create the virtual environment "env", which inherits the global site packages
source env/bin/activate                      #activate the virtual environment "env". Example output: (env) bash-4.2$
```
The prefix `(env)` at the beginning of each line shows that you are now in the virtual environment.
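Inside the environment, additional Python packages can be installed with `pip`. A sketch (the package name is just an example; this assumes the node can reach the package index or a local mirror):

```Bash
pip install --upgrade pip                               #optional: update pip inside the environment
pip install pandas                                      #hypothetical example: install an additional package into "env"
python -c "import pandas; print(pandas.__version__)"    #verify that the package is picked up from the environment
```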
Now you can check that the current environment is working:
```Bash
python                    #start python
import tensorflow as tf
print(tf.__version__)     #example output: 2.1.0
```
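After leaving the Python prompt (`exit()` or `Ctrl-D`), you can also verify from the shell that the GPU of the allocated node is visible. This is a sketch assuming the TensorFlow 2.1 module loaded above:

```Bash
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"   #should list one GPU device on the allocated ml node
```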
- The second and main option is using batch jobs (`sbatch`). It is used to submit a job script for later execution. Consequently, it is **recommended to launch your jobs into the background by using batch jobs**. To launch your machine learning application, as for the srun job, you need to use modules. See the previous chapter for the sbatch file example; a short submission and monitoring sketch follows below.
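Once the job file is written, it can be submitted and monitored with the usual Slurm commands. A sketch (the file name is the example name used above):

```Bash
sbatch machine_learning_script.slurm     #submit the job file from the example above
squeue -u $USER                          #check the state of your pending and running jobs
scancel <jobid>                          #cancel a job if necessary
```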
Usage information about the environments for the JupyterHub can be found on [the JupyterHub page](JupyterHub) in the chapter "Creating and using your own environment".

Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are available. (25.02.20)

**3. Containers**

Some machine learning tasks such as benchmarking require using containers. A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. Using containers gives you more flexibility working with modules and software, but at the same time requires more effort.

On Taurus, [Singularity](https://sylabs.io/) is used as the standard container solution. Singularity enables users to have full control of their environment. This means that **you don't have to ask HPC support to install anything for you - you can put it in a Singularity container and run it!** As opposed to Docker (the best-known container solution), Singularity is much better suited to an HPC environment and more efficient in many cases. Docker containers can also easily be used by Singularity, e.g. from [DockerHub](https://hub.docker.com). Also, some containers are available in [Singularity Hub](https://singularity-hub.org/).
The simplest option to start working with containers on HPC-DA is to import a container with TensorFlow from DockerHub or Singularity Hub. It does **not require root privileges** and so works on Taurus directly:
```Bash
srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash   #allocate resources on the ml partition to start a job for creating the container
singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le     #create a container from DockerHub with the latest TensorFlow version for ppc64le
singularity run --nv my-ML-container.sif                                     #run the my-ML-container.sif container with NVIDIA GPU support. You can also interact with your container via: singularity shell, singularity exec
```
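For example, your own script can be executed with the Python installation inside the container (a sketch; `machine_learning_example.py` refers to the example script mentioned above, and depending on the image the interpreter may be called `python` or `python3`):

```Bash
singularity exec --nv my-ML-container.sif python machine_learning_example.py   #run the script with the python from inside the container, with GPU support enabled
```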
There are two sources of containers for the Power9 architecture with TensorFlow and PyTorch on board: