diff --git a/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png b/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png index dc6fa60eebfacfb66a6735e826e2a5ed1cf60e42..1327ee6304faf4b293c385981a750f362063ecbf 100644 Binary files a/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png and b/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow.md b/doc.zih.tu-dresden.de/docs/software/tensorflow.md index 0d5ef7503d3283376d0eaeb400fde1529fa95f08..d9b488007026fc6450c42220caa477006e1670e1 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow.md @@ -1,111 +1,108 @@ # TensorFlow -## Introduction +TensorFlow is a free end-to-end open-source software library for dataflow and differentiable +programming across many tasks. It is a symbolic math library, used primarily for machine learning +applications. It has a comprehensive, flexible ecosystem of tools, libraries and community +resources. -This is an introduction of how to start working with TensorFlow and run -machine learning applications on the [HPC-DA](../jobs_and_resources/hpcda.md) system of Taurus. - -\<span style="font-size: 1em;">On the machine learning nodes (machine -learning partition), you can use the tools from [IBM PowerAI](power_ai.md) or the other -modules. PowerAI is an enterprise software distribution that combines popular open-source -deep learning frameworks, efficient AI development tools (Tensorflow, Caffe, etc). For -this page and examples was used [PowerAI version 1.5.4](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html) +Please check the software modules list via -[TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source -software library for dataflow and differentiable programming across many -tasks. 
It is a symbolic math library, used primarily for machine -learning applications. It has a comprehensive, flexible ecosystem of tools, libraries and -community resources. It is available on taurus along with other common machine -learning packages like Pillow, SciPY, Numpy. + module spider TensorFlow + +to find out which TensorFlow modules are available on your partition. + +On ZIH systems, TensorFlow 2 is the default module version. For compatibility hints between TF2 and +TF1, see the corresponding [section](#compatibility-tf2-and-tf1) below. -**Prerequisites:** To work with Tensorflow on Taurus, you obviously need -[access](../access/ssh_login.md) for the Taurus system and basic knowledge about Python, SLURM system. +We recommend using the **Alpha** and/or **ML** partitions when working with machine learning workflows +and the TensorFlow library. You can find detailed hardware specifications +[here](../jobs_and_resources/hardware_taurus.md). -**Aim** of this page is to introduce users on how to start working with -TensorFlow on the \<a href="HPCDA" target="\_self">HPC-DA\</a> system - -part of the TU Dresden HPC system. +## TensorFlow Console -There are three main options on how to work with Tensorflow on the -HPC-DA: **1.** **Modules,** **2.** **JupyterNotebook, 3. Containers**. The best option is -to use [module system](../software/runtime_environment.md#Module_Environments) and -Python virtual environment. Please see the next chapters and the [Python page](python.md) for the -HPC-DA system. +On the **Alpha** partition, load the module environment: -The information about the Jupyter notebook and the **JupyterHub** could -be found [here](../access/jupyterhub.md). The use of -Containers is described [here](tensorflow_container_on_hpcda.md). +```Bash +tauruslogin:~> srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission on alpha nodes with 1 GPU on 1 node and 8000 MB of memory per CPU.
+taurus-rome:~> module load modenv/scs5 +``` -On Taurus, there exist different module environments, each containing a set -of software modules. The default is *modenv/scs5* which is already loaded, -however for the HPC-DA system using the "ml" partition you need to use *modenv/ml*. -To find out which partition are you using use: `ml list`. -You can change the module environment with the command: +On the **ML** partition, load the module environment: - module load modenv/ml +```Bash +tauruslogin:~> srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission on ml nodes with 1 GPU on 1 node and 8000 MB of memory per CPU. +taurus-ml:~> module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml +``` -The machine learning partition is based on the PowerPC Architecture (ppc64le) -(Power9 processors), which means that the software built for x86_64 will not -work on this partition, so you most likely can't use your already locally -installed packages on Taurus. Also, users need to use the modules which are -specially made for the ml partition (from modenv/ml) and not for the rest -of Taurus (e.g. from modenv/scs5). +This example shows how to load TensorFlow using the module system and start working with it: -Each node on the ml partition has 6x Tesla V-100 GPUs, with 176 parallel threads -on 44 cores per node (Simultaneous multithreading (SMT) enabled) and 256GB RAM. -The specification could be found [here](../jobs_and_resources/power9.md). +```Bash +taurus-ml:~> module load TensorFlow #load TensorFlow module. Example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. +``` -%RED%Note:<span class="twiki-macro ENDCOLOR"></span> Users should not -reserve more than 28 threads per each GPU device so that other users on -the same node still have enough CPUs for their computations left. +Now we check that we can access TensorFlow.
One example is tensorflow-test: -## Get started with Tensorflow +```Bash +taurus-ml:~> tensorflow-test #example output: Basic test of tensorflow - A Hello World!!!... +``` -This example shows how to install and start working with TensorFlow -(with using modules system) and the python virtual environment. Please, -check the next chapter for the details about the virtual environment. +As another example, we use a Python virtual environment and import TensorFlow. - srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with 1 gpu on 1 node with 8000 mb. +```Bash +taurus-ml:~> mkdir python-environments #create folder +taurus-ml:~> which python #check which python you are using +taurus-ml:~> virtualenv --system-site-packages python-environments/env #create virtual environment "env" that inherits the global site packages +taurus-ml:~> source python-environments/env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$ +taurus-ml:~> python #start python +>>> import tensorflow as tf +>>> print(tf.__version__) #example output: 1.10.0 +``` - module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml +## TensorFlow in JupyterHub - mkdir python-environments #create folder - module load TensorFlow #load TensorFlow module. Example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. - which python #check which python are you using - virtualenvv --system-site-packages python-environments/env #create virtual environment "env" which inheriting with global site packages - source python-environments/env/bin/activate #Activate virtual environment "env". Example output: (env) bash-4.2$ - python #start python - import tensorflow as tf - print(tf.VERSION) #example output: 1.10.0 +In addition to using interactive and batch jobs, it is possible to work with TensorFlow using +JupyterHub.
The production and test environments of JupyterHub contain Python and R kernels that +both come with TensorFlow support. -On the machine learning nodes, you can use the tools from [IBM Power -AI](power_ai.md). + +{: align="center"} -## Interactive Session Examples +## TensorFlow in Containers -### Tensorflow-Test +Another option for using TensorFlow is containers. In the HPC domain, the +[Singularity](https://singularity.hpcng.org/) container system is a widely used tool. In the +following example, we run tensorflow-test in a Singularity container: - tauruslogin6 :~> srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=10000 bash - srun: job 4374195 queued and waiting for resources - srun: job 4374195 has been allocated resources - taurusml22 :~> ANACONDA2_INSTALL_PATH='/opt/anaconda2' - taurusml22 :~> ANACONDA3_INSTALL_PATH='/opt/anaconda3' - taurusml22 :~> export PATH=$ANACONDA3_INSTALL_PATH/bin:$PATH - taurusml22 :~> source /opt/DL/tensorflow/bin/tensorflow-activate - taurusml22 :~> tensorflow-test - Basic test of tensorflow - A Hello World!!!... +```Bash +tauruslogin:~> srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission on ml nodes with 1 GPU on 1 node and 8000 MB of memory per CPU. +taurus-ml:~> singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img +taurus-ml:~> export PATH=/opt/anaconda3/bin:$PATH +taurus-ml:~> source activate /opt/anaconda3 #activate conda environment +taurus-ml:~> . /opt/DL/tensorflow/bin/tensorflow-activate +taurus-ml:~> tensorflow-test #example output: Basic test of tensorflow - A Hello World!!!... +``` - #or: - taurusml22 :~> module load TensorFlow/1.10.0-PythonAnaconda-3.6 +## TensorFlow with Python or R -Or to use the whole node: `--gres=gpu:6 --exclusive --pty` +For further information on TensorFlow in combination with Python see +[here](data_analytics_with_python.md), for R see [here](data_analytics_with_r.md).
-### In Singularity container: +## Compatibility TF2 and TF1 - rotscher@tauruslogin6:~> srun -p ml --gres=gpu:6 --pty bash - [rotscher@taurusml22 ~]$ singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> export PATH=/opt/anaconda3/bin:$PATH - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> . /opt/DL/tensorflow/bin/tensorflow-activate - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> tensorflow-test +TensorFlow 2.0 includes many API changes, such as reordering arguments, renaming symbols, and +changing default values for parameters. Thus, in some cases, code written for TensorFlow +1 is not compatible with TensorFlow 2. However, if you are using the high-level APIs (tf.keras), there +may be little or no action you need to take to make your code fully [TensorFlow +2.0](https://www.tensorflow.org/guide/migrate) compatible. It is still possible to run 1.X code, +unmodified (except for contrib), in TensorFlow 2.0: + +```Python +import tensorflow.compat.v1 as tf +tf.disable_v2_behavior() #instead of "import tensorflow as tf" +``` + +To make the transition to TF 2.0 as seamless as possible, the TensorFlow team has created the +tf_upgrade_v2 utility to help transition legacy code to the new API. ## Additional libraries @@ -120,70 +117,6 @@ Note: For optimal NCCL performance it is recommended to set the **NCCL_MIN_NRINGS** environment variable during execution. You can try different values but 4 should be a pretty good starting point.
- export NCCL_MIN_NRINGS=4 - -\<span style="color: #222222; font-size: 1.385em;">HPC\</span> - -The following HPC related software is installed on all nodes: - -| | | -|------------------|------------------------| -| IBM Spectrum MPI | /opt/ibm/spectrum_mpi/ | -| PGI compiler | /opt/pgi/ | -| IBM XLC Compiler | /opt/ibm/xlC/ | -| IBM XLF Compiler | /opt/ibm/xlf/ | -| IBM ESSL | /opt/ibmmath/essl/ | -| IBM PESSL | /opt/ibmmath/pessl/ | - -## TensorFlow 2 - -[TensorFlow -2.0](https://blog.tensorflow.org/2019/09/tensorflow-20-is-now-available.html) -is a significant milestone for TensorFlow and the community. There are -multiple important changes for users. TensorFlow 2.0 removes redundant -APIs, makes APIs more consistent (Unified RNNs, Unified Optimizers), and -better integrates with the Python runtime with Eager execution. Also, -TensorFlow 2.0 offers many performance improvements on GPUs. - -There are a number of TensorFlow 2 modules for both ml and scs5 modenvs -on Taurus. Please check\<a href="SoftwareModulesList" target="\_blank"> -the software modules list\</a> for the information about available -modules or use - - module spider TensorFlow - -%RED%Note:<span class="twiki-macro ENDCOLOR"></span> Tensorflow 2 will -be loaded by default when loading the Tensorflow module without -specifying the version. - -\<span style="font-size: 1em;">TensorFlow 2.0 includes many API changes, -such as reordering arguments, renaming symbols, and changing default -values for parameters. Thus in some cases, it makes code written for the -TensorFlow 1 not compatible with TensorFlow 2. However, If you are using -the high-level APIs (tf.keras) there may be little or no action you need -to take to make your code fully TensorFlow 2.0 \<a -href="<https://www.tensorflow.org/guide/migrate>" -target="\_blank">compatible\</a>. 
It is still possible to run 1.X code, -unmodified ( [except for -contrib](https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md)), -in TensorFlow 2.0:\</span> - - import tensorflow.compat.v1 as tf - tf.disable_v2_behavior() #instead of "import tensorflow as tf" - -To make the transition to TF 2.0 as seamless as possible, the TensorFlow -team has created the -[`tf_upgrade_v2`](https://www.tensorflow.org/guide/upgrade) utility to -help transition legacy code to the new API. - -## FAQ: - -Q: Which module environment should I use? modenv/ml, modenv/scs5, -modenv/hiera - -A: On the ml partition use modenv/ml, on rome and gpu3 use modenv/hiera, -else stay with the default of modenv/scs5. - -Q: How to change the module environment and know more about modules? - -A: [Modules](../software/runtime_environment.md#Modules) +```Bash +export NCCL_MIN_NRINGS=4 +``` diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow_new.md b/doc.zih.tu-dresden.de/docs/software/tensorflow_new.md deleted file mode 100644 index 3d623f1461c1f55cab180a1836338fe487aa6128..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow_new.md +++ /dev/null @@ -1,91 +0,0 @@ -# Tensorflow - -TensorFlow is a free end-to-end open-source software library for dataflow and differentiable programming across many tasks. It is a symbolic math library, used primarily for machine learning applications. It has a comprehensive, flexible ecosystem of tools, libraries and community resources. - -On taurus, Tensorflow 2 is the default module version. Please check \<a href="SoftwareModulesList" target="\_blank"> -the software modules list\</a> for information about available modules or use - - module spider TensorFlow - -For compatibility hints between TF2 and TF1, see here. - -We recommend using **Alpha** and/or **ML** partitions when working with machine learning workflows and the Tensorflow library. For more details see here. 
You can find detailed hardware specification here. - -## Tensorflow Console - -On the **ML** partition load the module environment: - - module load modenv/ml - -On the **Alpha** partition load the module environment: - - module load modenv/scs5 - -This example shows how to install and start working with TensorFlow (with using modules system) - - srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with 1 gpu on 1 node with 8000 mb. - module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - module load TensorFlow #load TensorFlow module. Example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. - -Now we check that we can access Tensorflow. One example is tensorflow-test: - - tauruslogin6 :~> srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=10000 bash - srun: job 4374195 queued and waiting for resources - srun: job 4374195 has been allocated resources - taurusml22 :~> module load TensorFlow/1.10.0-PythonAnaconda-3.6 - taurusml22 :~> tensorflow-test - Basic test of tensorflow - A Hello World!!!... - -As another example we use a python virtual environment and import Tensorflow. - - mkdir python-environments #create folder - which python #check which python are you using - virtualenvv --system-site-packages python-environments/env #create virtual environment "env" which inheriting with global site packages - source python-environments/env/bin/activate #Activate virtual environment "env". Example output: (env) bash-4.2$ - python #start python - import tensorflow as tf - print(tf.VERSION) #example output: 1.10.0 - -## Tensorflow in JupyterHub -In addition to using interactive and batch jobs, it is possible to work with Tensorflow using JupyterHub. The production and test environments of JupyterHub contain Python and R kernels, that both come with a Tensorflow support. 
- - -{: align="center"} - -## Tensorflow in Containers -Another option to use Tensorflow are containers. In the HPC domain, the [Singularity](https://singularity.hpcng.org/) container system is a widely used tool. In the following example, we use the tesnroflow-test in a Singularity container: - - rotscher@tauruslogin6:~> srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash - [rotscher@taurusml22 ~]$ singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> export PATH=/opt/anaconda3/bin:$PATH - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> . /opt/DL/tensorflow/bin/tensorflow-activate - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> tensorflow-test - - -## Tensorflow with Python or R -For further information on Tensorflow in combination with Python see [here](data_analytics_with_python.md), for R see [here](data_analytics_with_r.md). - -## Compatibility TF2 and TF1 -TensorFlow 2.0 includes many API changes, such as reordering arguments, renaming symbols, and changing default values for parameters. Thus in some cases, it makes code written for the TensorFlow 1 not compatible with TensorFlow 2. However, If you are using the high-level APIs (tf.keras) there may be little or no action you need to take to make your code fully [TensorFlow 2.0](https://www.tensorflow.org/guide/migrate) compatible. It is still possible to run 1.X code, unmodified (except for contrib), in TensorFlow 2.0: - - import tensorflow.compat.v1 as tf - tf.disable_v2_behavior() #instead of "import tensorflow as tf" - -To make the transition to TF 2.0 as seamless as possible, the TensorFlow team has created the tf_upgrade_v2 utility to help transition legacy code to the new API. 
- - -## Additional libraires - -The following NVIDIA libraries are available on all nodes: - -| | | -|-------|---------------------------------------| -| NCCL | /usr/local/cuda/targets/ppc64le-linux | -| cuDNN | /usr/local/cuda/targets/ppc64le-linux | - -Note: For optimal NCCL performance it is recommended to set the -**NCCL_MIN_NRINGS** environment variable during execution. You can try -different values but 4 should be a pretty good starting point. - - export NCCL_MIN_NRINGS=4 -