# Alpha Centauri - Multi-GPU cluster with NVIDIA A100
The sub-cluster "AlphaCentauri" has been installed for AI-related computations (ScaDS.AI).
## Hardware
- 34 nodes, each with
    - 8 x NVIDIA A100-SXM4 (40 GB RAM)
    - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, MultiThreading enabled
    - 1 TB RAM
    - 3.5 TB `/tmp` on local NVMe device
- Hostnames: `taurusi[8001-8034]`
- Slurm partition **`alpha`**
## Usage hints
These nodes of the cluster can be used like other "normal" GPU nodes (ml, gpu2).
**Attention:** These GPUs may only be used with **CUDA 11** or later. Earlier versions do not
recognize the new hardware properly or cannot fully utilize it. Make sure the software you are using
is built against this CUDA version.
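A quick way to check this is to query the CUDA version at runtime. A minimal sketch, assuming an
environment with PyTorch as set up in the sections below:

```Bash
nvidia-smi                                            # shows the GPUs and the driver's CUDA support on the node
python -c "import torch; print(torch.version.cuda)"   # CUDA version PyTorch was built against; should be 11.x or later
```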
## Typical tasks
Machine learning frameworks such as TensorFlow and PyTorch are industry standards now. Working with
PyTorch on the new AlphaCentauri sub-cluster is illustrated below in brief examples.
There are four main options on how to work with TensorFlow and PyTorch on the Alpha Centauri
cluster:
1. **Modules**
1. **Virtual Environments (manual software installation)**
1. [JupyterHub](https://taurus.hrsk.tu-dresden.de/)
1. [Containers](../software/containers.md)
### Modules
The easiest way is using the [module system](../software/modules.md) and Python virtual environments.
Modules are a way to use frameworks, compilers, loaders, libraries, and utilities. The software
environment for the **alpha** partition is available under the name **hiera**:
```Bash
module load modenv/hiera
```
The machine learning frameworks **PyTorch** and **TensorFlow** are available for the **alpha**
partition as modules with CUDA 11, GCC 10 and OpenMPI 4:
```Bash
module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.7.1
module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 TensorFlow/2.4.1
```
**Hint**: To check the available modules for the **hiera** software environment, use the command:
```Bash
module available
```
To show all the dependencies you need to load for the core module, use the command:
```Bash
module spider <name_of_the_module>
```
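For example, to see which toolchain modules are required for the PyTorch module used above, one
could run:

```Bash
module spider PyTorch/1.7.1   # lists the modules (GCC, CUDA, OpenMPI) that must be loaded first
```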
### Virtual Environments
It is necessary to use virtual environments for your work with Python. A virtual environment is a
cooperatively isolated runtime environment. There are two main options for managing virtual
environments:
1. **Virtualenv** is a standard Python tool to create isolated Python environments. It is the
**preferred** interface for managing installations and virtual environments on Taurus and part of
the Python modules.
1. [Conda](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#activating-an-environment)
is an alternative method for managing installations and virtual environments on Taurus. Conda is an
open-source package management system and environment management system from Anaconda. The conda
manager is included in all versions of Anaconda and Miniconda.
**Note**: There are two sub-partitions of the alpha partition: `alpha` and `alpha-interactive`.
Please use `alpha-interactive` for interactive jobs and `alpha` for batch jobs.
Examples with conda and venv will be presented below. Also, there is an example of an interactive
job for the AlphaCentauri sub-cluster using the `alpha-interactive` partition:
```Bash
srun -p alpha-interactive -N 1 -n 1 --gres=gpu:1 --time=01:00:00 --pty bash  # Job submission on alpha nodes with 1 GPU on 1 node

mkdir conda-virtual-environments            # create a folder, please use workspaces!
cd conda-virtual-environments               # go to folder
which python                                # check which python you are using
ml modenv/hiera
ml Miniconda3
which python                                # check which python you are using now
conda create -n conda-testenv python=3.8    # create virtual environment with the name conda-testenv and Python version 3.8
conda activate conda-testenv                # activate conda-testenv virtual environment
conda deactivate                            # leave the virtual environment
```
New software for data analytics is emerging faster than we can install it. If you urgently need a
certain version, we advise you to manually install it (the machine learning frameworks and required
packages) in your virtual environment (or use a [container](../software/containers.md)).
The **Virtualenv** example:
```Bash
srun -p alpha-interactive -N 1 -n 1 --gres=gpu:1 --time=01:00:00 --pty bash  # Job submission on alpha nodes with 1 GPU on 1 node

mkdir python-environments && cd "$_"        # Optional: create folder, please use workspaces!
module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 Python/3.8.6   # Change the environment and load necessary modules
which python                                # Check which python you are using
virtualenv --system-site-packages python-environments/envtest   # Create virtual environment
source python-environments/envtest/bin/activate                 # Activate virtual environment. Example output: (envtest) bash-4.2$
```
Example of using **Conda** with a PyTorch and Pillow installation:
```Bash
conda activate conda-testenv
conda install pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia
conda install -c anaconda pillow
```
Verify installation for the **Virtualenv** example:
```Bash
python                                            # Start python
from time import gmtime, strftime
print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))    # Example output: 2019-11-18 13:54:16
deactivate                                        # Leave the virtual environment
```
Verify installation for the **Conda** example:
```Bash
python                      # Start python
import torch
torch.version.__version__   # Example output: 1.8.1
```
Here is an example batch script for typical usage of the Alpha Centauri cluster:
```Bash
#!/bin/bash
#SBATCH --mem=40GB                    # specify the needed memory. Same amount of memory as on the GPU
#SBATCH -p alpha                      # specify Alpha-Centauri partition
#SBATCH --gres=gpu:1                  # use 1 GPU per node (i.e. use one GPU per task)
#SBATCH --nodes=1                     # request 1 node
#SBATCH --time=00:15:00               # runs for 15 minutes
#SBATCH -c 2                          # how many cores per task allocated
#SBATCH -o HLR_name_your_script.out   # save output message under HLR_${SLURMJOBID}.out
#SBATCH -e HLR_name_your_script.err   # save error messages under HLR_${SLURMJOBID}.err

module load modenv/hiera
eval "$(conda shell.bash hook)"
conda activate conda-testenv && python machine_learning_example.py

## when finished writing, submit with: sbatch <script_name>, e.g. sbatch machine_learning_script.sh
```
The Alpha Centauri sub-cluster has NVIDIA A100-SXM4 GPUs with 40 GB RAM each. Thus, it is prudent to
request the same amount of memory on the host (CPU) side. The number of cores is free for the users
to define, at the moment.
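Following this rule of thumb, the host memory request can be scaled with the number of requested
GPUs. A minimal sketch of the relevant lines (the 1:1 scaling is a recommendation, not a hard
requirement):

```Bash
#SBATCH --gres=gpu:2   # two A100 with 40 GB device memory each ...
#SBATCH --mem=80GB     # ... so request 2 x 40 GB of host memory
```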
### JupyterHub
There is [JupyterHub](../software/JupyterHub.md) on Taurus, where you can simply run your Jupyter
notebook on the Alpha-Centauri sub-cluster. Also, for more specific cases, you can run a manually
created remote Jupyter server. You can find the manual server setup
[here](../software/DeepLearning.md). However, the simplest option for beginners is using JupyterHub.
JupyterHub is available at
[taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter).
After logging in, you can start a new session and configure it. There are simple and advanced forms
to set up your session. The `alpha` partition is available in the advanced form; you have to choose
it in the partition field. The resource recommendations are the same as described above for the
batch script example (do not confuse the `--mem-per-cpu` parameter with `--mem`).
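To sketch the difference: `--mem` sets the memory for the whole allocation, while `--mem-per-cpu`
is multiplied by the number of allocated cores. With the values from the batch script example above,
the two requests below are roughly equivalent:

```Bash
#SBATCH -c 2
#SBATCH --mem=40GB           # 40 GB in total for the allocation ...
# is roughly equivalent to
#SBATCH --mem-per-cpu=20GB   # ... i.e. 20 GB for each of the 2 cores
```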
### Containers
On Taurus [Singularity](https://sylabs.io/) is used as a standard container
solution. It can be run on the `alpha` partition as well. Singularity enables users to have full
control of their environment. Detailed information about containers can be found
[here](../software/containers.md).
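A minimal sketch of running a containerized GPU workload on the `alpha` partition (the image and
script names are hypothetical):

```Bash
srun -p alpha-interactive -N 1 --gres=gpu:1 --time=00:30:00 --pty bash   # allocate one GPU interactively
singularity exec --nv my_image.sif python my_script.py                   # --nv makes the NVIDIA GPU visible inside the container
```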
Nvidia
[NGC](https://developer.nvidia.com/blog/how-to-run-ngc-deep-learning-containers-with-singularity/)
containers can be used as an effective solution for machine learning related tasks. (Downloading
containers requires registration). Nvidia-prepared containers with software solutions for specific
scientific problems can simplify the deployment of deep learning workloads on HPC. NGC containers
have shown consistent performance compared to directly run code.
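For example, an NGC image can be pulled and run with Singularity roughly as follows (a sketch; the
exact image tag is an assumption, and downloading requires NGC registration):

```Bash
singularity pull docker://nvcr.io/nvidia/pytorch:21.05-py3   # hypothetical tag of the NGC PyTorch image
singularity exec --nv pytorch_21.05-py3.sif python -c "import torch; print(torch.cuda.is_available())"
```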
## Examples
There is a test example of a deep learning task that can be used for testing. For it to work
correctly, the PyTorch and Pillow packages must be installed in your virtual environment (as shown
above in the interactive job example).
- [example_pytorch_image_recognition.zip]**todo attachment**
<!--%ATTACHURL%/example_pytorch_image_recognition.zip:-->
<!--example_pytorch_image_recognition.zip-->
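As an illustration, a minimal sketch of such an image recognition task (assuming the PyTorch,
torchvision and Pillow installation from the conda example above; the input file name is
hypothetical):

```Bash
python                                       # start python in the prepared environment
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(pretrained=True)     # download and load a pretrained network
model.eval()
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # hypothetical input image as a batch of one
with torch.no_grad():
    print(model(image).argmax().item())      # index of the predicted ImageNet class
```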