From ba36cb23b1a661cbf6dea2399cd97e8eb547e5f6 Mon Sep 17 00:00:00 2001 From: Christoph Lehmann <christoph.lehmann@tu-dresden.de> Date: Thu, 5 Aug 2021 17:27:43 +0200 Subject: [PATCH] basic stucturing ML section finished --- .../{software => archive}/deep_learning.md | 0 .../docs/software/get_started_with_hpcda.md | 350 ------------------ .../docs/software/machine_learning.md | 196 ++++++++++ .../docs/software/tensorflow.md | 187 +++------- doc.zih.tu-dresden.de/mkdocs.yml | 10 +- 5 files changed, 254 insertions(+), 489 deletions(-) rename doc.zih.tu-dresden.de/docs/{software => archive}/deep_learning.md (100%) delete mode 100644 doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md diff --git a/doc.zih.tu-dresden.de/docs/software/deep_learning.md b/doc.zih.tu-dresden.de/docs/archive/deep_learning.md similarity index 100% rename from doc.zih.tu-dresden.de/docs/software/deep_learning.md rename to doc.zih.tu-dresden.de/docs/archive/deep_learning.md diff --git a/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md b/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md deleted file mode 100644 index 8a15ce2c7..000000000 --- a/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md +++ /dev/null @@ -1,350 +0,0 @@ -# Get started with HPC-DA - -HPC-DA (High-Performance Computing and Data Analytics) is a part of TU-Dresden general purpose HPC -cluster (Taurus). HPC-DA is the best **option** for **Machine learning, Deep learning** applications -and tasks connected with the big data. - -**This is an introduction of how to run machine learning applications on the HPC-DA system.** - -The main **aim** of this guide is to help users who have started working with Taurus and focused on -working with Machine learning frameworks such as TensorFlow or Pytorch. - -**Prerequisites:** To work with HPC-DA, you need [Login](../access/ssh_login.md) for the Taurus system -and preferably have basic knowledge about High-Performance computers and Python. - -**Disclaimer:** This guide provides the main steps on the way of using Taurus, for details please -follow links in the text. - -You can also find the information you need on the -[HPC-Introduction] **todo** %ATTACHURL%/HPC-Introduction.pdf?t=1585216700 and -[HPC-DA-Introduction] *todo** %ATTACHURL%/HPC-DA-Introduction.pdf?t=1585162693 presentation slides. - -## Why should I use HPC-DA? The architecture and feature of the HPC-DA - -HPC-DA built on the base of [Power9](https://www.ibm.com/it-infrastructure/power/power9) -architecture from IBM. HPC-DA created from -[AC922 IBM servers](https://www.ibm.com/ie-en/marketplace/power-systems-ac922), which was created -for AI challenges, analytics and working with, Machine learning, data-intensive workloads, -deep-learning frameworks and accelerated databases. POWER9 is the processor with state-of-the-art -I/O subsystem technology, including next-generation NVIDIA NVLink, PCIe Gen4 and OpenCAPI. -[Here](../jobs_and_resources/power9.md) you could find a detailed specification of the TU Dresden -HPC-DA system. - -The main feature of the Power9 architecture (ppc64le) is the ability to work the -[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link** -support. NV-Link technology allows increasing a total bandwidth of 300 gigabytes per second (GB/sec) - -- 10X the bandwidth of PCIe Gen 3. The bandwidth is a crucial factor for deep learning and machine - learning applications. - -**Note:** The Power9 architecture not so common as an x86 architecture. 
This means you are not so -flexible with choosing applications for your projects. Even so, the main tools and applications are -available. See available modules here. - -**Please use the ml partition if you need GPUs!** Otherwise using the x86 partitions (e.g Haswell) -most likely would be more beneficial. - -## Login - -### SSH Access - -The recommended way to connect to the HPC login servers directly via ssh: - -```Bash -ssh <zih-login>@taurus.hrsk.tu-dresden.de -``` - -Please put this command in the terminal and replace `<zih-login>` with your login that you received -during the access procedure. Accept the host verifying and enter your password. - -This method requires two conditions: -Linux OS, workstation within the campus network. For other options and -details check the [login page](../access/ssh_login.md). - -## Data management - -### Workspaces - -As soon as you have access to HPC-DA you have to manage your data. The main method of working with -data on Taurus is using Workspaces. You could work with simple examples in your home directory -(where you are loading by default). However, in accordance with the -[storage concept](../data_lifecycle/hpc_storage_concept2019.md) -**please use** a [workspace](../data_lifecycle/workspaces.md) -for your study and work projects. - -You should create your workspace with a similar command: - -```Bash -ws_allocate -F scratch Machine_learning_project 50 #allocating workspase in scratch directory for 50 days -``` - -After the command, you will have an output with the address of the workspace based on scratch. Use -it to store the main data of your project. - -For different purposes, you should use different storage systems. To work as efficient as possible, -consider the following points: - -- Save source code etc. in `/home` or `/projects/...` -- Store checkpoints and other massive but temporary data with - workspaces in: `/scratch/ws/...` -- For data that seldom changes but consumes a lot of space, use - mid-term storage with workspaces: `/warm_archive/...` -- For large parallel applications where using the fastest file system - is a necessity, use with workspaces: `/lustre/ssd/...` -- Compilation in `/dev/shm`** or `/tmp` - -### Data moving - -#### Moving data to/from the HPC machines - -To copy data to/from the HPC machines, the Taurus [export nodes](../data_transfer/export_nodes.md) -should be used. They are the preferred way to transfer your data. There are three possibilities to -exchanging data between your local machine (lm) and the HPC machines (hm): **SCP, RSYNC, SFTP**. - -Type following commands in the local directory of the local machine. For example, the **`SCP`** -command was used. - -#### Copy data from lm to hm - -```Bash -scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> #Copy file from your local machine. For example: scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mastermann-Macine_learning_project/ - -scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> #Copy directory from your local machine. -``` - -#### Copy data from hm to lm - -```Bash -scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location> #Copy file. For example: scp mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mastermann-Macine_learning_project/helloworld.txt /home/mustermann/Downloads - -scp -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location> #Copy directory -``` - -#### Moving data inside the HPC machines. 
Datamover - -The best way to transfer data inside the Taurus is the [data mover](../data_transfer/data_mover.md). -It is the special data transfer machine providing the global file systems of each ZIH HPC system. -Datamover provides the best data speed. To load, move, copy etc. files from one file system to -another file system, you have to use commands with **dt** prefix, such as: - -`dtcp, dtwget, dtmv, dtrm, dtrsync, dttar, dtls` - -These commands submit a job to the data transfer machines that execute the selected command. Except -for the `dt` prefix, their syntax is the same as the shell command without the `dt`. - -```Bash -dtcp -r /scratch/ws/<name_of_your_workspace>/results /luste/ssd/ws/<name_of_your_workspace> #Copy from workspace in scratch to ssd.<br />dtwget https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz #Download archive CIFAR-100. -``` - -## BatchSystems. SLURM - -After logon and preparing your data for further work the next logical step is to start your job. For -these purposes, SLURM is using. Slurm (Simple Linux Utility for Resource Management) is an -open-source job scheduler that allocates compute resources on clusters for queued defined jobs. By -default, after your logging, you are using the login nodes. The intended purpose of these nodes -speaks for oneself. Applications on an HPC system can not be run there! They have to be submitted -to compute nodes (ml nodes for HPC-DA) with dedicated resources for user jobs. - -Job submission can be done with the command: `-srun [options] <command>.` - -This is a simple example which you could use for your start. The `srun` command is used to submit a -job for execution in real-time designed for interactive use, with monitoring the output. For some -details please check [the Slurm page](../jobs_and_resources/slurm.md). - -```Bash -srun -p ml -N 1 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with allocating: 1 node, 1 gpu per node, with 8000 mb on 1 hour. -``` - -However, using srun directly on the shell will lead to blocking and launch an interactive job. Apart -from short test runs, it is **recommended to launch your jobs into the background by using batch -jobs**. For that, you can conveniently put the parameters directly into the job file which you can -submit using `sbatch [options] <job file>.` - -This is the example of the sbatch file to run your application: - -```Bash -#!/bin/bash -#SBATCH --mem=8GB # specify the needed memory -#SBATCH -p ml # specify ml partition -#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task) -#SBATCH --nodes=1 # request 1 node -#SBATCH --time=00:15:00 # runs for 10 minutes -#SBATCH -c 1 # how many cores per task allocated -#SBATCH -o HLR_name_your_script.out # save output message under HLR_${SLURMJOBID}.out -#SBATCH -e HLR_name_your_script.err # save error messages under HLR_${SLURMJOBID}.err - -module load modenv/ml -module load TensorFlow - -python machine_learning_example.py - -## when finished writing, submit with: sbatch <script_name> For example: sbatch machine_learning_script.slurm -``` - -The `machine_learning_example.py` contains a simple ml application based on the mnist model to test -your sbatch file. It could be found as the [attachment] **todo** -%ATTACHURL%/machine_learning_example.py in the bottom of the page. - -## Start your application - -As stated before HPC-DA was created for deep learning, machine learning applications. Machine -learning frameworks as TensorFlow and PyTorch are industry standards now. 
- -There are three main options on how to work with Tensorflow and PyTorch: - -1. **Modules** -1. **JupyterNotebook** -1. **Containers** - -### Modules - -The easiest way is using the [modules system](modules.md) and Python virtual environment. Modules -are a way to use frameworks, compilers, loader, libraries, and utilities. The module is a user -interface that provides utilities for the dynamic modification of a user's environment without -manual modifications. You could use them for srun , bath jobs (sbatch) and the Jupyterhub. - -A virtual environment is a cooperatively isolated runtime environment that allows Python users and -applications to install and update Python distribution packages without interfering with the -behaviour of other Python applications running on the same system. At its core, the main purpose of -Python virtual environments is to create an isolated environment for Python projects. - -**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. We recommend -using venv to work with Tensorflow and Pytorch on Taurus. It has been integrated into the standard -library under the [venv module](https://docs.python.org/3/library/venv.html). However, if you have -reasons (previously created environments etc) you could easily use conda. The conda is the second -way to use a virtual environment on the Taurus. -[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) -is an open-source package management system and environment management system from the Anaconda. - -As was written in the previous chapter, to start the application (using -modules) and to run the job exist two main options: - -- The `srun` command:** - -```Bash -srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash #job submission in ml nodes with allocating: 1 node, 1 task per node, 2 CPUs per task, 1 gpu per node, with 8000 mb on 1 hour. - -module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - -mkdir python-virtual-environments #create folder for your environments -cd python-virtual-environments #go to folder -module load TensorFlow #load TensorFlow module to use python. Example output: Module Module TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4 and 31 dependencies loaded. -which python #check which python are you using -python3 -m venv --system-site-packages env #create virtual environment "env" which inheriting with global site packages -source env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$ -``` - -The inscription (env) at the beginning of each line represents that now you are in the virtual -environment. - -Now you can check the working capacity of the current environment. - -```Bash -python # start python -import tensorflow as tf -print(tf.__version__) # example output: 2.1.0 -``` - -The second and main option is using batch jobs (`sbatch`). It is used to submit a job script for -later execution. Consequently, it is **recommended to launch your jobs into the background by using -batch jobs**. To launch your machine learning application as well to srun job you need to use -modules. See the previous chapter with the sbatch file example. - -Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are available. (25.02.20) - -Note: However in case of using sbatch files to send your job you usually don't need a virtual -environment. 
- -### JupyterNotebook - -The Jupyter Notebook is an open-source web application that allows you to create documents -containing live code, equations, visualizations, and narrative text. Jupyter notebook allows working -with TensorFlow on Taurus with GUI (graphic user interface) in a **web browser** and the opportunity -to see intermediate results step by step of your work. This can be useful for users who dont have -huge experience with HPC or Linux. - -There is [JupyterHub](../access/jupyterhub.md) on Taurus, where you can simply run your Jupyter -notebook on HPC nodes. Also, for more specific cases you can run a manually created remote jupyter -server. You can find the manual server setup [here](deep_learning.md). However, the simplest option -for beginners is using JupyterHub. - -JupyterHub is available at -[taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter) - -After logging, you can start a new session and configure it. There are simple and advanced forms to -set up your session. On the simple form, you have to choose the "IBM Power (ppc64le)" architecture. -You can select the required number of CPUs and GPUs. For the acquaintance with the system through -the examples below the recommended amount of CPUs and 1 GPU will be enough. -With the advanced form, you can use -the configuration with 1 GPU and 7 CPUs. To access for all your workspaces use " / " in the -workspace scope. Please check updates and details [here](../access/jupyterhub.md). - -Several Tensorflow and PyTorch examples for the Jupyter notebook have been prepared based on some -simple tasks and models which will give you an understanding of how to work with ML frameworks and -JupyterHub. It could be found as the [attachment] **todo** %ATTACHURL%/machine_learning_example.py -in the bottom of the page. A detailed explanation and examples for TensorFlow can be found -[here](tensorflow_on_jupyter_notebook.md). For the Pytorch - [here](pytorch.md). Usage information -about the environments for the JupyterHub could be found [here](../access/jupyterhub.md) in the chapter -*Creating and using your own environment*. - -Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are -available. (25.02.20) - -### Containers - -Some machine learning tasks such as benchmarking require using containers. A container is a standard -unit of software that packages up code and all its dependencies so the application runs quickly and -reliably from one computing environment to another. Using containers gives you more flexibility -working with modules and software but at the same time requires more effort. - -On Taurus [Singularity](https://sylabs.io/) is used as a standard container solution. Singularity -enables users to have full control of their environment. This means that **you dont have to ask an -HPC support to install anything for you - you can put it in a Singularity container and run!**As -opposed to Docker (the beat-known container solution), Singularity is much more suited to being used -in an HPC environment and more efficient in many cases. Docker containers also can easily be used by -Singularity from the [DockerHub](https://hub.docker.com) for instance. Also, some containers are -available in [Singularity Hub](https://singularity-hub.org/). - -The simplest option to start working with containers on HPC-DA is importing from Docker or -SingularityHub container with TensorFlow. 
It does **not require root privileges** and so works on
-Taurus directly:
-
-```Bash
-srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash #allocating resourses from ml nodes to start the job to create a container.<br />singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version<br />singularity run --nv my-ML-container.sif #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec
-```
-
-There are two sources for containers for Power9 architecture with
-Tensorflow and PyTorch on the board:
-
-* [Tensorflow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le):
-  Community-supported ppc64le docker container for TensorFlow.
-* [PowerAI container](https://hub.docker.com/r/ibmcom/powerai/):
-  Official Docker container with Tensorflow, PyTorch and many other packages.
-  Heavy container. It requires a lot of space. Could be found on Taurus.
-
-Note: You could find other versions of software in the container on the "tag" tab on the docker web
-page of the container.
-
-To use not a pure Tensorflow, PyTorch but also with some Python packages
-you have to use the definition file to create the container
-(bootstrapping). For details please see the [Container](containers.md) page
-from our wiki. Bootstrapping **has required root privileges** and
-Virtual Machine (VM) should be used! There are two main options on how
-to work with VM on Taurus: [VM tools](vm_tools.md) - automotive algorithms
-for using virtual machines; [Manual method](virtual_machines.md) - it requires more
-operations but gives you more flexibility and reliability.
-
-- [machine_learning_example.py] **todo** %ATTACHURL%/machine_learning_example.py:
-  machine_learning_example.py
-- [example_TensofFlow_MNIST.zip] **todo** %ATTACHURL%/example_TensofFlow_MNIST.zip:
-  example_TensofFlow_MNIST.zip
-- [example_Pytorch_MNIST.zip] **todo** %ATTACHURL%/example_Pytorch_MNIST.zip:
-  example_Pytorch_MNIST.zip
-- [example_Pytorch_image_recognition.zip] **todo** %ATTACHURL%/example_Pytorch_image_recognition.zip:
-  example_Pytorch_image_recognition.zip
-- [example_TensorFlow_Automobileset.zip] **todo** %ATTACHURL%/example_TensorFlow_Automobileset.zip:
-  example_TensorFlow_Automobileset.zip
-- [HPC-Introduction.pdf] **todo** %ATTACHURL%/HPC-Introduction.pdf:
-  HPC-Introduction.pdf
-- [HPC-DA-Introduction.pdf] **todo** %ATTACHURL%/HPC-DA-Introduction.pdf :
-  HPC-DA-Introduction.pdf
diff --git a/doc.zih.tu-dresden.de/docs/software/machine_learning.md b/doc.zih.tu-dresden.de/docs/software/machine_learning.md
index e80e6c346..debb3f494 100644
--- a/doc.zih.tu-dresden.de/docs/software/machine_learning.md
+++ b/doc.zih.tu-dresden.de/docs/software/machine_learning.md
@@ -3,6 +3,202 @@
 On the machine learning nodes, you can use the tools from [IBM Power
 AI](power_ai.md).
 
+# Get started with HPC-DA
+
+HPC-DA (High-Performance Computing and Data Analytics) is a part of the TU Dresden general-purpose
+HPC cluster (Taurus). HPC-DA is the best option for machine learning and deep learning applications
+and for tasks connected with big data.
+
+**This is an introduction to running machine learning applications on the HPC-DA system.**
+
+The main aim of this guide is to help users who have just started working with Taurus and focus on
+machine learning frameworks such as TensorFlow or PyTorch.
+
+**Prerequisites:** To work with HPC-DA, you need a [login](../access/login.md) for the Taurus
+system and preferably basic knowledge about high-performance computing and Python.
+
+**Disclaimer:** This guide covers only the main steps of using Taurus. For details, please follow
+the links in the text.
+
+You can also find the information you need in the
+[HPC-Introduction] **todo** %ATTACHURL%/HPC-Introduction.pdf?t=1585216700 and
+[HPC-DA-Introduction] **todo** %ATTACHURL%/HPC-DA-Introduction.pdf?t=1585162693 presentation slides.
+
+## Why should I use HPC-DA? The architecture and features of HPC-DA
+
+HPC-DA is built on the [Power9](https://www.ibm.com/it-infrastructure/power/power9) architecture
+from IBM. It consists of [AC922 IBM servers](https://www.ibm.com/ie-en/marketplace/power-systems-ac922),
+which were designed for AI challenges, analytics, machine learning, data-intensive workloads,
+deep learning frameworks and accelerated databases. POWER9 is a processor with state-of-the-art
+I/O subsystem technology, including next-generation NVIDIA NVLink, PCIe Gen4 and OpenCAPI.
+A detailed specification of the TU Dresden HPC-DA system can be found
+[here](../jobs_and_resources/power9.md).
+
+The main feature of the Power9 architecture (ppc64le) is the ability to use the
+[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link**
+support. NV-Link technology provides a total bandwidth of up to 300 gigabytes per second (GB/s),
+about 10x the bandwidth of PCIe Gen 3. Bandwidth is a crucial factor for deep learning and machine
+learning applications.
+
+**Note:** The Power9 architecture is not as common as the x86 architecture. This means you are less
+flexible in choosing applications for your projects. Even so, the main tools and applications are
+available; you can list them with `module avail`.
+
+**Please use the ml partition if you need GPUs!** Otherwise, using the x86 partitions (e.g. Haswell)
+would most likely be more beneficial.
+
+## Start your application
+
+As stated before, HPC-DA was created for deep learning and machine learning applications. Machine
+learning frameworks such as TensorFlow and PyTorch are industry standards now.
+
+There are three main options for working with TensorFlow and PyTorch:
+
+1. **Modules**
+1. **JupyterNotebook**
+1. **Containers**
+
+### Modules
+
+The easiest way is to use the [modules system](modules.md) together with a Python virtual
+environment. Modules are a way to use frameworks, compilers, loaders, libraries, and utilities. A
+module is a user interface that provides utilities for the dynamic modification of a user's
+environment without manual modifications. You can use modules for `srun`, batch jobs (`sbatch`) and
+JupyterHub.
+
+A virtual environment is a cooperatively isolated runtime environment that allows Python users and
+applications to install and update Python distribution packages without interfering with the
+behaviour of other Python applications running on the same system. At its core, the main purpose of
+Python virtual environments is to create an isolated environment for Python projects.
+
+**Virtualenv (venv)** is a standard Python tool to create isolated Python environments. We
+recommend using venv to work with TensorFlow and PyTorch on Taurus. It has been integrated into the
+standard library under the [venv module](https://docs.python.org/3/library/venv.html). However, if
+you have reasons (e.g. previously created environments), you can also use conda, which is the
+second way to use a virtual environment on Taurus.
+[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
+is an open-source package management and environment management system from Anaconda.
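+
+If you decide to use conda, a minimal sketch of the workflow looks like the following. It assumes
+that a conda installation is already available in your session (e.g. via an Anaconda module); the
+environment name `my-ml-env` and the package names are only placeholders:
+
+```Bash
+conda create --name my-ml-env python=3.7   #create a new conda environment named "my-ml-env"
+conda activate my-ml-env                   #activate the environment
+conda install <package>                    #install the packages your project needs
+conda deactivate                           #leave the environment when you are done
+```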
+
+There are two main options to start an application (using modules) and to run the job:
+
+- The `srun` command:
+
+```Bash
+srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash #job submission on ml nodes, allocating: 1 node, 1 task per node, 2 CPUs per task, 1 GPU per node, 8000 MB memory per CPU, for 1 hour
+
+module load modenv/ml #example output: The following have been reloaded with a version change:  1) modenv/scs5 => modenv/ml
+
+mkdir python-virtual-environments #create a folder for your environments
+cd python-virtual-environments #enter the folder
+module load TensorFlow #load the TensorFlow module. Example output: Module TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4 and 31 dependencies loaded.
+which python #check which python you are using
+python3 -m venv --system-site-packages env #create virtual environment "env" which inherits the global site packages
+source env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$
+```
+
+The prefix `(env)` at the beginning of each line shows that you are now inside the virtual
+environment.
+
+Now you can check that the current environment works:
+
+```Bash
+python # start python
+import tensorflow as tf
+print(tf.__version__)  # example output: 2.1.0
+```
+
+The second and main option is using batch jobs (`sbatch`), which submit a job script for later
+execution. Consequently, it is **recommended to launch your jobs into the background by using batch
+jobs**. To launch your machine learning application as a batch job, you need to load the modules in
+the job file, just as for `srun`; a minimal example batch file is shown below.
+
+Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are available. (25.02.20)
+
+Note: When using sbatch files to submit your job, you usually don't need a virtual environment.
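+
+The following minimal batch file is adapted from the sbatch example on the former "Get started
+with HPC-DA" page; `machine_learning_example.py` stands for your own Python script, and the
+resource requests are only a starting point:
+
+```Bash
+#!/bin/bash
+#SBATCH --mem=8GB                         # specify the needed memory
+#SBATCH -p ml                             # specify ml partition
+#SBATCH --gres=gpu:1                      # use 1 GPU per node (i.e. use one GPU per task)
+#SBATCH --nodes=1                         # request 1 node
+#SBATCH --time=00:15:00                   # runs for 15 minutes
+#SBATCH -c 1                              # how many cores per task allocated
+#SBATCH -o HLR_name_your_script.out       # save output messages under HLR_name_your_script.out
+#SBATCH -e HLR_name_your_script.err       # save error messages under HLR_name_your_script.err
+
+module load modenv/ml
+module load TensorFlow
+
+python machine_learning_example.py
+
+## when finished writing, submit with: sbatch <script_name>, e.g. sbatch machine_learning_script.slurm
+```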
+
+### JupyterNotebook
+
+The Jupyter Notebook is an open-source web application that allows you to create documents
+containing live code, equations, visualizations, and narrative text. The Jupyter Notebook allows
+working with TensorFlow on Taurus through a GUI (graphical user interface) in a **web browser**,
+with the opportunity to see intermediate results of your work step by step. This can be useful for
+users who don't have much experience with HPC or Linux.
+
+There is a [JupyterHub](../access/jupyterhub.md) on Taurus, where you can simply run your Jupyter
+notebook on HPC nodes. Also, for more specific cases you can run a manually created remote Jupyter
+server. You can find the manual server setup [here](../archive/deep_learning.md). However, the
+simplest option for beginners is using JupyterHub.
+
+JupyterHub is available at
+[taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter)
+
+After logging in, you can start a new session and configure it. There are simple and advanced forms
+to set up your session. On the simple form, you have to choose the "IBM Power (ppc64le)"
+architecture. You can select the required number of CPUs and GPUs. For getting acquainted with the
+system through the examples below, the recommended number of CPUs and 1 GPU will be enough. With
+the advanced form, you can use a configuration with 1 GPU and 7 CPUs. To access all your
+workspaces, use " / " as the workspace scope. Please check updates and details
+[here](../access/jupyterhub.md).
+
+Several TensorFlow and PyTorch examples for the Jupyter Notebook have been prepared, based on some
+simple tasks and models which will give you an understanding of how to work with ML frameworks and
+JupyterHub. They can be found as the [attachment] **todo** %ATTACHURL%/machine_learning_example.py
+at the bottom of the page. A detailed explanation and examples for TensorFlow can be found
+[here](tensor_flow_on_jupyter_notebook.md), and for PyTorch [here](py_torch.md). Usage information
+about the environments for JupyterHub can be found [here](../access/jupyterhub.md) in the chapter
+*Creating and using your own environment*.
+
+Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are
+available. (25.02.20)
+
+### Containers
+
+Some machine learning tasks, such as benchmarking, require the use of containers. A container is a
+standard unit of software that packages up code and all its dependencies so the application runs
+quickly and reliably from one computing environment to another. Using containers gives you more
+flexibility in working with modules and software but at the same time requires more effort.
+
+On Taurus, [Singularity](https://sylabs.io/) is used as the standard container solution.
+Singularity enables users to have full control of their environment. This means that **you don't
+have to ask the HPC support to install anything for you - you can put it in a Singularity container
+and run!** As opposed to Docker (the best-known container solution), Singularity is much more
+suited to being used in an HPC environment and is more efficient in many cases. Docker containers
+can also easily be used by Singularity from [DockerHub](https://hub.docker.com), for instance.
+Also, some containers are available on [Singularity Hub](https://singularity-hub.org/).
+
+The simplest option to start working with containers on HPC-DA is to import a container with
+TensorFlow from Docker or Singularity Hub. This does **not require root privileges** and so works
+on Taurus directly:
+
+```Bash
+srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash #allocate resources on ml nodes to start the job for creating a container
+singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from DockerHub with the latest TensorFlow version
+singularity run --nv my-ML-container.sif #run the my-ML-container.sif container with NVIDIA GPU support. You can also interact with your container via the commands: singularity shell, singularity exec
+```
+
+There are two sources of containers for the Power9 architecture with
+TensorFlow and PyTorch on board:
+
+* [Tensorflow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le):
+  Community-supported ppc64le docker container for TensorFlow.
+* [PowerAI container](https://hub.docker.com/r/ibmcom/powerai/):
+  Official Docker container with TensorFlow, PyTorch and many other packages.
+  Heavy container; it requires a lot of space. Can be found on Taurus.
+
+Note: You can find other versions of the software on the "Tags" tab of the container's Docker Hub
+page.
+
+To use not only pure TensorFlow or PyTorch but also additional Python packages,
+you have to use a definition file to build the container
+(bootstrapping). For details please see the [Container](containers.md) page
+from our wiki.
Bootstrapping **has required root privileges** and +Virtual Machine (VM) should be used! There are two main options on how +to work with VM on Taurus: [VM tools](vm_tools.md) - automotive algorithms +for using virtual machines; [Manual method](virtual_machines.md) - it requires more +operations but gives you more flexibility and reliability. + + ## Interactive Session Examples ### Tensorflow-Test diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow.md b/doc.zih.tu-dresden.de/docs/software/tensorflow.md index 346eb9a1d..0d5ef7503 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow.md @@ -77,138 +77,63 @@ check the next chapter for the details about the virtual environment. import tensorflow as tf print(tf.VERSION) #example output: 1.10.0 -Keep in mind that using **srun** directly on the shell will be blocking -and launch an interactive job. Apart from short test runs, it is -recommended to launch your jobs into the background by using batch -jobs:\<span> **sbatch \[options\] \<job file>** \</span>. The example -will be presented later on the page. - -As a Tensorflow example, we will use a \<a -href="<https://www.tensorflow.org/tutorials>" target="\_blank">simple -mnist model\</a>. Even though this example is in Python, the information -here will still apply to other tools. - -The ml partition has very efficacious GPUs to offer. Do not assume that -more power means automatically faster computational speed. The GPU is -only one part of a typical machine learning application. Do not forget -that first the input data needs to be loaded and in most cases even -rescaled or augmented. If you do not specify that you want to use more -than the default one worker (=one CPU thread), then it is very likely -that your GPU computes faster, than it receives the input data. It is, -therefore, possible, that you will not be any faster, than on other GPU -partitions. \<span style="font-size: 1em;">You can solve this by using -multithreading when loading your input data. The \</span>\<a -href="<https://keras.io/models/sequential/#fit_generator>" -target="\_blank">fit_generator\</a>\<span style="font-size: 1em;"> -method supports multiprocessing, just set \`use_multiprocessing\` to -\`True\`, \</span>\<a href="Slurm#Job_Submission" -target="\_blank">request more Threads\</a>\<span style="font-size: -1em;"> from SLURM and set the \`Workers\` amount accordingly.\</span> - -The example below with a \<a -href="<https://www.tensorflow.org/tutorials>" target="\_blank">simple -mnist model\</a> of the python script illustrates using TF-Keras API -from TensorFlow. \<a href="<https://www.tensorflow.org/guide/keras>" -target="\_top">Keras\</a> is TensorFlows high-level API. - -**You can read in detail how to work with Keras on Taurus \<a -href="Keras" target="\_blank">here\</a>.** +On the machine learning nodes, you can use the tools from [IBM Power +AI](power_ai.md). - import tensorflow as tf - # Load and prepare the MNIST dataset. Convert the samples from integers to floating-point numbers: - mnist = tf.keras.datasets.mnist - - (x_train, y_train),(x_test, y_test) = mnist.load_data() - x_train, x_test = x_train / 255.0, x_test / 255.0 - - # Build the tf.keras model by stacking layers. 
Select an optimizer and loss function used for training - model = tf.keras.models.Sequential([ - tf.keras.layers.Flatten(input_shape=(28, 28)), - tf.keras.layers.Dense(512, activation=tf.nn.relu), - tf.keras.layers.Dropout(0.2), - tf.keras.layers.Dense(10, activation=tf.nn.softmax) - ]) - model.compile(optimizer='adam', - loss='sparse_categorical_crossentropy', - metrics=['accuracy']) - - # Train and evaluate model - model.fit(x_train, y_train, epochs=5) - model.evaluate(x_test, y_test) - -The example can train an image classifier with \~98% accuracy based on -this dataset. - -## Python virtual environment - -A virtual environment is a cooperatively isolated runtime environment -that allows Python users and applications to install and update Python -distribution packages without interfering with the behaviour of other -Python applications running on the same system. At its core, the main -purpose of Python virtual environments is to create an isolated -environment for Python projects. - -**Vitualenv**is a standard Python tool to create isolated Python -environments and part of the Python installation/module. We recommend -using virtualenv to work with Tensorflow and Pytorch on Taurus.\<br -/>However, if you have reasons (previously created environments etc) you -can also use conda which is the second way to use a virtual environment -on the Taurus. \<a -href="<https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html>" -target="\_blank">Conda\</a> is an open-source package management system -and environment management system. Note that using conda means that -working with other modules from taurus will be harder or impossible. -Hence it is highly recommended to use virtualenv. - -## Running the sbatch script on ML modules (modenv/ml) and SCS5 modules (modenv/scs5) - -Generally, for machine learning purposes the ml partition is used but -for some special issues, the other partitions can be useful also. The -following sbatch script can execute the above Python script both on ml -partition or gpu2 partition.\<br /> When not using the -TensorFlow-Anaconda modules you may need some additional modules that -are not included (e.g. when using the TensorFlow module from modenv/scs5 -on gpu2).\<br />If you have a question about the sbatch script see the -article about \<a href="Slurm" target="\_blank">SLURM\</a>. Keep in mind -that you need to put the executable file (machine_learning_example.py) -with python code to the same folder as the bash script file -\<script_name>.sh (see below) or specify the path. - - #!/bin/bash - #SBATCH --mem=8GB # specify the needed memory - #SBATCH -p ml # specify ml partition or gpu2 partition - #SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. 
use one GPU per task) - #SBATCH --nodes=1 # request 1 node - #SBATCH --time=00:10:00 # runs for 10 minutes - #SBATCH -c 7 # how many cores per task allocated - #SBATCH -o HLR_<name_your_script>.out # save output message under HLR_${SLURMJOBID}.out - #SBATCH -e HLR_<name_your_script>.err # save error messages under HLR_${SLURMJOBID}.err - - if [ "$SLURM_JOB_PARTITION" == "ml" ]; then - module load modenv/ml - module load TensorFlow/2.0.0-PythonAnaconda-3.7 - else - module load modenv/scs5 - module load TensorFlow/2.0.0-fosscuda-2019b-Python-3.7.4 - module load Pillow/6.2.1-GCCcore-8.3.0 # Optional - module load h5py/2.10.0-fosscuda-2019b-Python-3.7.4 # Optional - fi - - python machine_learning_example.py - - ## when finished writing, submit with: sbatch <script_name> - -Output results and errors file can be seen in the same folder in the -corresponding files after the end of the job. Part of the example -output: - - 1600/10000 [===>..........................] - ETA: 0s - 3168/10000 [========>.....................] - ETA: 0s - 4736/10000 [=============>................] - ETA: 0s - 6304/10000 [=================>............] - ETA: 0s - 7872/10000 [======================>.......] - ETA: 0s - 9440/10000 [===========================>..] - ETA: 0s - 10000/10000 [==============================] - 0s 38us/step +## Interactive Session Examples + +### Tensorflow-Test + + tauruslogin6 :~> srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=10000 bash + srun: job 4374195 queued and waiting for resources + srun: job 4374195 has been allocated resources + taurusml22 :~> ANACONDA2_INSTALL_PATH='/opt/anaconda2' + taurusml22 :~> ANACONDA3_INSTALL_PATH='/opt/anaconda3' + taurusml22 :~> export PATH=$ANACONDA3_INSTALL_PATH/bin:$PATH + taurusml22 :~> source /opt/DL/tensorflow/bin/tensorflow-activate + taurusml22 :~> tensorflow-test + Basic test of tensorflow - A Hello World!!!... + + #or: + taurusml22 :~> module load TensorFlow/1.10.0-PythonAnaconda-3.6 + +Or to use the whole node: `--gres=gpu:6 --exclusive --pty` + +### In Singularity container: + + rotscher@tauruslogin6:~> srun -p ml --gres=gpu:6 --pty bash + [rotscher@taurusml22 ~]$ singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img + Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> export PATH=/opt/anaconda3/bin:$PATH + Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> . /opt/DL/tensorflow/bin/tensorflow-activate + Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> tensorflow-test + +## Additional libraries + +The following NVIDIA libraries are available on all nodes: + +| | | +|-------|---------------------------------------| +| NCCL | /usr/local/cuda/targets/ppc64le-linux | +| cuDNN | /usr/local/cuda/targets/ppc64le-linux | + +Note: For optimal NCCL performance it is recommended to set the +**NCCL_MIN_NRINGS** environment variable during execution. You can try +different values but 4 should be a pretty good starting point. 
+ + export NCCL_MIN_NRINGS=4 + +\<span style="color: #222222; font-size: 1.385em;">HPC\</span> + +The following HPC related software is installed on all nodes: + +| | | +|------------------|------------------------| +| IBM Spectrum MPI | /opt/ibm/spectrum_mpi/ | +| PGI compiler | /opt/pgi/ | +| IBM XLC Compiler | /opt/ibm/xlC/ | +| IBM XLF Compiler | /opt/ibm/xlf/ | +| IBM ESSL | /opt/ibmmath/essl/ | +| IBM PESSL | /opt/ibmmath/pessl/ | ## TensorFlow 2 diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index fbd8fdc2e..41e80c674 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -41,16 +41,10 @@ nav: - FEM Software: software/fem_software.md - Visualization: software/visualization.md - Machine Learning: - - Get started with HPC-DA: software/get_started_with_hpcda.md - Overview: software/machine_learning.md - - Deep Learning: software/deep_learning.md - - TensorFlow: - - TensorFlow Overview: software/tensorflow.md - - TensorFlow in Container: software/tensorflow_container_on_hpcda.md - - TensorFlow in JupyterHub: software/tensorflow_on_jupyter_notebook.md - - Tensorboard: software/tensorboard.md - - Keras: software/keras.md + - TensorFlow: software/tensorflow.md - PyTorch: software/pytorch.md + - Tensorboard: software/tensorboard.md - Distributed Training: software/distributed_training.md - Hyperparameter Optimization (OmniOpt): software/hyperparameter_optimization.md - Data Analytics: -- GitLab