# TensorFlow
## Introduction
[TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source software
library for dataflow and differentiable programming across many tasks. It is a symbolic math
library, used primarily for machine learning applications. It has a comprehensive, flexible
ecosystem of tools, libraries and community resources. On Taurus, it is available along with
other common machine learning packages like Pillow, SciPy and NumPy.

This page introduces how to start working with TensorFlow and run machine learning applications
on the [HPC-DA](../jobs_and_resources/hpcda.md) system of Taurus. On the machine learning nodes
(machine learning partition), you can additionally use the tools from [IBM PowerAI](power_ai.md),
an enterprise software distribution that combines popular open-source deep learning frameworks
and efficient AI development tools (TensorFlow, Caffe, etc.).
[PowerAI version 1.5.4](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html)
was used for the examples on this page.

Please check the software modules list via

```Bash
module spider TensorFlow
```

to find out which TensorFlow modules are available on your partition.
On ZIH systems, TensorFlow 2 is the default module version. For compatibility hints between TF2 and
TF1, see the corresponding [section](#compatibility-tf2-and-tf1) below.
**Prerequisites:** To work with TensorFlow on Taurus, you need [access](../access/ssh_login.md)
to the Taurus system and basic knowledge of Python and the Slurm batch system.

We recommend using the **Alpha** and/or **ML** partitions when working with machine learning
workflows and the TensorFlow library. You can find the detailed hardware specification
[here](../jobs_and_resources/hardware_taurus.md).
## TensorFlow Console
There are three main options for working with TensorFlow on the HPC-DA system:
**1.** modules, **2.** Jupyter notebooks via JupyterHub, **3.** containers. The best option is to
use the [module system](../software/runtime_environment.md#Module_Environments) together with a
Python virtual environment; please see the following chapters and the [Python page](python.md)
for the HPC-DA system. Information about Jupyter notebooks and **JupyterHub** can be found
[here](../access/jupyterhub.md), and the use of containers is described
[here](tensorflow_container_on_hpcda.md).

On the **Alpha** partition, load the module environment:
```Bash
tauruslogin:~> srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission on alpha nodes with 1 GPU on 1 node with 8000 MB memory per CPU
taurus-rome:~> module load modenv/scs5
```
On Taurus, there exist different module environments, each containing a set of software modules.
The default is `modenv/scs5`, which is already loaded. For the HPC-DA system using the **ML**
partition, however, you need to use `modenv/ml`. To check which module environment you are
currently using, run `ml list`.

On the **ML** partition, load the module environment:
```Bash
tauruslogin:~> srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission on ml nodes with 1 GPU on 1 node with 8000 MB memory per CPU
taurus-ml:~> module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml
```
The machine learning partition is based on the PowerPC architecture (`ppc64le`, Power9
processors), which means that software built for x86_64 will not work on this partition, so you
most likely cannot use your locally installed packages on Taurus. Also, users need to use the
modules that are specifically built for the ml partition (from `modenv/ml`) and not those for the
rest of Taurus (e.g. from `modenv/scs5`).
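If you are unsure which architecture you are currently working on, a quick check is to print the
machine hardware name of the allocated node (the prompt below is just an example):

```Bash
taurus-ml:~> uname -m   # prints "ppc64le" on the ml partition, "x86_64" on most other partitions
```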
Each node on the ml partition has 6 Tesla V100 GPUs, 176 parallel threads on 44 cores
(Simultaneous Multithreading (SMT) enabled) and 256 GB RAM. The detailed specification can be
found [here](../jobs_and_resources/power9.md).

The following example shows how to load the TensorFlow module:
```Bash
taurus-ml:~> module load TensorFlow #load TensorFlow module. example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded.
```
**Note:** Users should not reserve more than 28 threads per GPU device, so that other users on
the same node still have enough CPUs left for their computations.
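For example, a job that reserves two GPUs on the ml partition should request at most 56 CPUs.
The following sketch follows this rule of thumb; adjust memory and task counts to your needs:

```Bash
tauruslogin:~> srun -p ml --gres=gpu:2 -n 1 -c 56 --pty --mem-per-cpu=8000 bash   # 2 GPUs with at most 28 threads each
```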
Now we check that we can access TensorFlow. One example is `tensorflow-test`:
```Bash
taurus-ml:~> tensorflow-test #example output: Basic test of tensorflow - A Hello World!!!...
```
As another example, we use a Python virtual environment and import TensorFlow. Please check the
[Python page](python.md) for details about working with virtual environments.
```Bash
taurus-ml:~> mkdir python-environments #create folder
taurus-ml:~> which python #check which python you are using
taurus-ml:~> virtualenv --system-site-packages python-environments/env #create virtual environment "env" that inherits the global site packages
taurus-ml:~> source python-environments/env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$
taurus-ml:~> python #start python
>>> import tensorflow as tf
>>> print(tf.VERSION) #example output: 1.10.0
```
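Note that `tf.VERSION` only exists in TensorFlow 1. If you load a TensorFlow 2 module instead,
use the `tf.__version__` attribute, which works in both major versions:

```Python
import tensorflow as tf

print(tf.__version__)  # version string of the loaded TensorFlow module
```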
## TensorFlow in JupyterHub
In addition to interactive and batch jobs, it is possible to work with TensorFlow using
JupyterHub. The production and test environments of JupyterHub contain Python and R kernels that
both come with TensorFlow support.
![TensorFlow module in JupyterHub](misc/tensorflow_jupyter_module.png)
{: align="center"}
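A quick way to verify inside a notebook that TensorFlow is available and the GPU is visible is to
run a small cell like the following (a sketch assuming a TensorFlow 2 Python kernel):

```Python
import tensorflow as tf

print(tf.__version__)                           # version of the TensorFlow used by the kernel
print(tf.config.list_physical_devices('GPU'))   # should list at least one GPU device
```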
## TensorFlow in Containers

Another option for using TensorFlow is containers. In the HPC domain, the
[Singularity](https://singularity.hpcng.org/) container system is a widely used tool. In the
following example, we run `tensorflow-test` in a Singularity container:
```Bash
tauruslogin:~> srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission on ml nodes with 1 GPU on 1 node with 8000 MB memory per CPU
taurus-ml:~> singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img
taurus-ml:~> export PATH=/opt/anaconda3/bin:$PATH
taurus-ml:~> source activate /opt/anaconda3 #activate conda environment
taurus-ml:~> . /opt/DL/tensorflow/bin/tensorflow-activate
taurus-ml:~> tensorflow-test #example output: Basic test of tensorflow - A Hello World!!!...
```
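To reserve a whole node instead, use `--gres=gpu:6 --exclusive --pty`. The same steps can also be
run non-interactively with `singularity exec`, for example to execute `tensorflow-test` in one
command (a minimal sketch that mirrors the interactive session above):

```Bash
taurus-ml:~> singularity exec --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img \
    bash -c 'export PATH=/opt/anaconda3/bin:$PATH && . /opt/DL/tensorflow/bin/tensorflow-activate && tensorflow-test'
```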
## TensorFlow with Python or R
For further information on TensorFlow in combination with Python, see
[here](data_analytics_with_python.md); for R, see [here](data_analytics_with_r.md).
## Compatibility TF2 and TF1
[TensorFlow 2.0](https://blog.tensorflow.org/2019/09/tensorflow-20-is-now-available.html) is a
significant milestone for TensorFlow and the community: it removes redundant APIs, makes APIs
more consistent (unified RNNs, unified optimizers), integrates better with the Python runtime
through eager execution, and offers many performance improvements on GPUs.

At the same time, TensorFlow 2.0 includes many API changes, such as reordering arguments,
renaming symbols, and changing default values for parameters. Thus, in some cases, code written
for TensorFlow 1 is not compatible with TensorFlow 2. However, if you are using the high-level
APIs (`tf.keras`), there may be little or no action needed to make your code fully
[TensorFlow 2.0](https://www.tensorflow.org/guide/migrate) compatible. It is still possible to
run 1.X code, unmodified (except for
[contrib](https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md)),
in TensorFlow 2.0:
```Python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior() #instead of "import tensorflow as tf"
```
To make the transition to TF 2.0 as seamless as possible, the TensorFlow team has created the
[`tf_upgrade_v2`](https://www.tensorflow.org/guide/upgrade) utility to help transition legacy
code to the new API.
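For example, with a TensorFlow 2 module loaded, single scripts or whole project trees can be
converted as follows (the file and directory names are placeholders):

```Bash
tf_upgrade_v2 --infile my_model_v1.py --outfile my_model_v2.py   # convert a single script
tf_upgrade_v2 --intree my_project_v1/ --outtree my_project_v2/   # convert a whole directory
```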
## Additional libraries
The following NVIDIA libraries are available on all nodes:

| Library | Path                                  |
|---------|---------------------------------------|
| NCCL    | /usr/local/cuda/targets/ppc64le-linux |
| cuDNN   | /usr/local/cuda/targets/ppc64le-linux |

The following HPC related software is installed on all nodes:

| Software         | Path                   |
|------------------|------------------------|
| IBM Spectrum MPI | /opt/ibm/spectrum_mpi/ |
| PGI compiler     | /opt/pgi/              |
| IBM XLC Compiler | /opt/ibm/xlC/          |
| IBM XLF Compiler | /opt/ibm/xlf/          |
| IBM ESSL         | /opt/ibmmath/essl/     |
| IBM PESSL        | /opt/ibmmath/pessl/    |

Note: For optimal NCCL performance, it is recommended to set the **NCCL_MIN_NRINGS** environment
variable during execution. You can try different values, but 4 should be a good starting point:
```Bash
export NCCL_MIN_NRINGS=4
```

## FAQ

**Q:** Which module environment should I use: `modenv/ml`, `modenv/scs5` or `modenv/hiera`?

**A:** On the ml partition use `modenv/ml`, on the rome and gpu3 partitions use `modenv/hiera`,
otherwise stay with the default `modenv/scs5`.

**Q:** How do I change the module environment and learn more about modules?

**A:** See [Modules](../software/runtime_environment.md#Modules).