diff --git a/doc.zih.tu-dresden.de/README.md b/doc.zih.tu-dresden.de/README.md
index cbb777f1179990d32dbc50adeb8d1290bc6c5ad2..fe6487b3f1a181e1b9a2dcf4217c496a7bda2491 100644
--- a/doc.zih.tu-dresden.de/README.md
+++ b/doc.zih.tu-dresden.de/README.md
@@ -141,7 +141,7 @@ documentation.
If you want to check whether the markdown files are formatted properly, use the
following command:

```Bash
-docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium markdownlint docs
+docker run --name=hpc-compendium --rm -it -w /docs/doc.zih.tu-dresden.de --mount src="$(pwd)",target=/docs,type=bind hpc-compendium markdownlint docs
```

To check whether there are links that point to a wrong target, use
@@ -160,13 +160,13 @@ docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)"/doc.zih.
For spell-checking a single file, use:

```Bash
-docker run --name=hpc-compendium --rm -it -w /docs/doc.zih.tu-dresden.de --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./util/check-spelling.sh <file>
+docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/check-spelling.sh <file>
```

For spell-checking all files, use:

```Bash
-docker run --name=hpc-compendium --rm -it -w /docs/doc.zih.tu-dresden.de --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./util/check-spelling.sh
+docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/check-spelling.sh
```

This outputs all words of all files that are unknown to the spell checker.
diff --git a/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md b/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md
new file mode 100644
index 0000000000000000000000000000000000000000..c6924fc4f716a0f7bdab1ab5b66bdcfe71019151
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md
@@ -0,0 +1,200 @@
+# Jupyter Installation
+
+Jupyter notebooks allow you to analyze data interactively in your web browser. One advantage of
+Jupyter is that code, documentation and visualization can be combined in a single notebook, so that
+it forms a unit. Jupyter notebooks can be used for many tasks, such as data cleaning and
+transformation, numerical simulation, statistical modeling, data visualization and also machine
+learning.
+
+There are two general options for working with Jupyter notebooks on ZIH systems: a remote Jupyter
+server and JupyterHub.
+
+The following sections show how to set up and run a remote Jupyter server with GPUs within a Slurm
+job and explain which modules and packages you need for that.
+
+!!! note
+    On ZIH systems, there is a [JupyterHub](../access/jupyterhub.md), where you do not need the
+    manual server setup described below and can simply run your Jupyter notebook on HPC nodes. Keep
+    in mind that not every special setup is possible with JupyterHub. However, general
+    data analytics tools are available.
+
+A manually set up remote Jupyter server offers more freedom regarding settings and approaches.
+
+## Preparation phase (optional)
+
+On the ZIH system, start an interactive session for setting up the environment:
+
+```console
+marie@login$ srun --pty -n 1 --cpus-per-task=2 --time=2:00:00 --mem-per-cpu=2500 --x11=first bash -l -i
+```
+
+Create a new directory in your home, e.g. 
Jupyter + +```console +marie@compute$ mkdir Jupyter +marie@compute$ cd Jupyter +``` + +There are two ways how to run Anaconda. The easiest way is to load the Anaconda module. The second +one is to download Anaconda in your home directory. + +1. Load Anaconda module (recommended): + +```console +marie@compute module load modenv/scs5 +marie@compute module load Anaconda3 +``` + +1. Download latest Anaconda release (see example below) and change the rights to make it an +executable script and run the installation script: + +```console +marie@compute wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh +marie@compute chmod u+x Anaconda3-2019.03-Linux-x86_64.sh +marie@compute ./Anaconda3-2019.03-Linux-x86_64.sh +``` + +(during installation you have to confirm the license agreement) + +Next step will install the anaconda environment into the home +directory (`/home/userxx/anaconda3`). Create a new anaconda environment with the name `jnb`. + +```console +marie@compute conda create --name jnb +``` + +## Set environmental variables + +In the shell, activate previously created python environment (you can +deactivate it also manually) and install Jupyter packages for this python environment: + +```console +marie@compute source activate jnb +marie@compute conda install jupyter +``` + +If you need to adjust the configuration, you should create the template. Generate configuration +files for Jupyter notebook server: + +```console +marie@compute jupyter notebook --generate-config +``` + +Find a path of the configuration file, usually in the home under `.jupyter` directory, e.g. +`/home//.jupyter/jupyter_notebook_config.py` + +Set a password (choose easy one for testing), which is needed later on to log into the server +in browser session: + +```console +marie@compute jupyter notebook password Enter password: Verify password: +``` + +You get a message like that: + +```console +[NotebookPasswordApp] Wrote *hashed password* to +/home/<zih_user>/.jupyter/jupyter_notebook_config.json +``` + +I order to create a certificate for secure connections, you can create a self-signed +certificate: + +```console +marie@compute openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem +``` + +Fill in the form with decent values. + +Possible entries for your Jupyter configuration (`.jupyter/jupyter_notebook*config.py*`). + +```console +c.NotebookApp.certfile = u'<path-to-cert>/mycert.pem' +c.NotebookApp.keyfile = u'<path-to-cert>/mykey.key' + +# set ip to '*' otherwise server is bound to localhost only +c.NotebookApp.ip = '*' +c.NotebookApp.open_browser = False + +# copy hashed password from the jupyter_notebook_config.json +c.NotebookApp.password = u'<your hashed password here>' +c.NotebookApp.port = 9999 +c.NotebookApp.allow_remote_access = True +``` + +!!! note + `<path-to-cert>` - path to key and certificate files, for example: + (`/home/<zih_user>/mycert.pem`) + +## Slurm job file to run the Jupyter server on ZIH system with GPU (1x K80) (also works on K20) + +```console +#!/bin/bash -l +#SBATCH --gres=gpu:1 # request GPU +#SBATCH --partition=gpu2 # use GPU partition +#SBATCH --output=notebook_output.txt +#SBATCH --nodes=1 +#SBATCH --ntasks=1 +#SBATCH --time=02:30:00 +#SBATCH --mem=4000M +#SBATCH -J "jupyter-notebook" # job-name +#SBATCH -A <name_of_your_project> + +unset XDG_RUNTIME_DIR # might be required when interactive instead of sbatch to avoid 'Permission denied error' +srun jupyter notebook +``` + +Start the script above (e.g. 
with the name `jnotebook`) with the `sbatch` command:
+
+```console
+marie@login$ sbatch jnotebook.slurm
+```
+
+If you have questions about the sbatch script, see the article about [Slurm](../jobs_and_resources/slurm.md).
+
+Check the status and the **token** of the server with the command `tail notebook_output.txt`. It
+should look like this:
+
+```console
+https://(taurusi2092.taurus.hrsk.tu-dresden.de or 127.0.0.1):9999/
+```
+
+You can see the **server node's hostname** with the command `squeue -u <username>`.
+
+### Remote connect to the server
+
+There are two options for connecting to the server:
+
+1. You can create an SSH tunnel (recommended). Open another terminal and configure the SSH
+   tunnel; look up the connection values in the output file of the Slurm job:
+
+```console
+node=taurusi2092                  # see the name of the node with squeue -u <your_login>
+localport=8887                    # local port on your computer
+remoteport=9999                   # pay attention to this value; it should be the same as in notebook_output.txt
+ssh -fNL ${localport}:${node}:${remoteport} <zih_user>@taurus.hrsk.tu-dresden.de    # configure the ssh tunnel for connection to your remote server
+pgrep -f "ssh -fNL ${localport}"  # verify that the tunnel is alive
+```
+
+2. You can connect directly from your client (local machine). You need to know the **node's
+   hostname**, the **port** of the server and the **token** to log in (see paragraph above).
+
+You can connect directly if you know the IP address (just resolve the node's hostname while logged
+in on the ZIH system).
+
+```console
+# command on remote terminal
+taurusi2092$> host taurusi2092
+# copy IP address from output
+# paste IP to your browser or call on local terminal e.g.:
+local$> firefox https://<IP>:<PORT>                      # https important to use SSL cert
+```
+
+To log in to the Jupyter notebook site (e.g. `https://localhost:8887` when using the SSH tunnel),
+you have to enter the **token**. Now you can create and execute notebooks on the ZIH system with
+GPU support.
+
+!!! important
+    If you would like to use [JupyterHub](../access/jupyterhub.md) after using a manually
+    configured remote Jupyter server (example above), you need to rename the configuration file
+    (`/home//.jupyter/jupyter_notebook_config.py`) to something else.
diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md
index e20e2ace134dad1c4fbbb94b2fc3d0a0f1401df1..bdbaa5a1523ec2fc06150195e18764cf14b618ef 100644
--- a/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md
+++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md
@@ -10,7 +10,7 @@ uniformity of the project can be achieved by taking into account and setting up
The used set of software within an HPC project can be management with environments on different
levels either defined by [modules](../software/modules.md), [containers](../software/containers.md)
-or by [Python virtual environments](../software/python.md).
+or by [Python virtual environments](../software/python_virtual_environments.md).
In the following, a brief overview on relevant topics w.r.t. data life cycle management is provided.

## Data Storage and Management
@@ -21,7 +21,7 @@ properly:

* use a `/home` directory for the limited amount of personal data, simple examples and the results
  of calculations. The home directory is not a working directory! However, `/home` filesystem is
  [backed up](#backup) using snapshots;
-  * use `workspaces` as a place for working data (i.e. 
datasets); Recommendations of choosing the + * use `workspaces` as a place for working data (i.e. data sets); Recommendations of choosing the correct storage system for workspace presented below. ### Taxonomy of Filesystems @@ -30,15 +30,17 @@ It is important to design your data workflow according to characteristics, like (bandwidth/IOPS) of the application, size of the data, (number of files,) and duration of the storage to efficiently use the provided storage and filesystems. The page [filesystems](file_systems.md) holds a comprehensive documentation on the different -filesystems. <!--In general, the mechanisms of so-called--> <!--[Workspaces](workspaces.md) are -compulsory for all HPC users to store data for a defined duration ---> <!--depending on the -requirements and the storage system this time span might range from days to a few--> <!--years.--> -<!--- [HPC filesystems](file_systems.md)--> <!--- [Intermediate -Archive](intermediate_archive.md)--> <!--- [Special data containers] **todo** Special data -containers (was no valid link in old compendium)--> <!--- [Move data between filesystems] -(../data_transfer/data_mover.md)--> <!--- [Move data to/from ZIH's filesystems] -(../data_transfer/export_nodes.md)--> <!--- [Longterm Preservation for -ResearchData](preservation_research_data.md)--> +filesystems. +<!--In general, the mechanisms of +so-called--> <!--[Workspaces](workspaces.md) are compulsory for all HPC users to store data for a +defined duration ---> <!--depending on the requirements and the storage system this time span might +range from days to a few--> <!--years.--> +<!--- [HPC filesystems](file_systems.md)--> +<!--- [Intermediate Archive](intermediate_archive.md)--> +<!--- [Special data containers] **todo** Special data containers (was no valid link in old compendium)--> +<!--- [Move data between filesystems](../data_transfer/data_mover.md)--> +<!--- [Move data to/from ZIH's filesystems](../data_transfer/export_nodes.md)--> +<!--- [Longterm Preservation for ResearchData](preservation_research_data.md)--> !!! hint "Recommendations to choose of storage system" @@ -68,7 +70,7 @@ files can be restored directly by the users. Details can be found ### Folder Structure and Organizing Data -Organizing of living data using the filesystem helps for consistency and structuredness of the +Organizing of living data using the filesystem helps for consistency of the project. We recommend following the rules for your work regarding: * Organizing the data: Never change the original data; Automatize the organizing the data; Clearly @@ -79,7 +81,7 @@ project. We recommend following the rules for your work regarding: don’t replace documentation and metadata; Use standards of your discipline; Make rules for your project, document and keep them (See the [README recommendations]**todo link** below) -This is the example of an organisation (hierarchical) for the folder structure. Use it as a visual +This is the example of an organization (hierarchical) for the folder structure. Use it as a visual illustration of the above:  @@ -142,7 +144,7 @@ you don’t need throughout its life cycle. <!--### Python Virtual Environment--> -<!--If you are working with the Python then it is crucial to use the virtual environment on ZIH Systems. The--> +<!--If you are working with the Python then it is crucial to use the virtual environment on ZIH systems. 
The--> <!--main purpose of Python virtual environments (don't mess with the software environment for modules)--> <!--is to create an isolated environment for Python projects (self-contained directory tree that--> <!--contains a Python installation for a particular version of Python, plus a number of additional--> @@ -170,6 +172,5 @@ changing permission command (i.e `chmod`) valid for ZIH systems as well. The **g contains members of your project group. Be careful with 'write' permission and never allow to change the original data. -Useful links: [Data Management]**todo link**, [Filesystems]**todo link**, [Get Started with -HPC Data Analytics]**todo link**, [Project Management]**todo link**, [Preservation research -data[**todo link** +Useful links: [Data Management]**todo link**, [Filesystems]**todo link**, +[Project Management]**todo link**, [Preservation research data[**todo link** diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md deleted file mode 100644 index d7bdec9afe83de27488e712b07e5fd5bdbcfcd17..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md +++ /dev/null @@ -1,67 +0,0 @@ -# HPC for Data Analytics - -With the HPC-DA system, the TU Dresden provides infrastructure for High-Performance Computing and -Data Analytics (HPC-DA) for German researchers for computing projects with focus in one of the -following areas: - -- machine learning scenarios for large systems -- evaluation of various hardware settings for large machine learning - problems, including accelerator and compute node configuration and - memory technologies -- processing of large amounts of data on highly parallel machine - learning infrastructure. - -Currently we offer 25 Mio core hours compute time per year for external computing projects. -Computing projects have a duration of up to one year with the possibility of extensions, thus -enabling projects to continue seamlessly. Applications for regular projects on HPC-DA can be -submitted at any time via the -[online web-based submission](https://tu-dresden.de/zih/hochleistungsrechnen/zugang/hpc-da) -and review system. The reviews of the applications are carried out by experts in their respective -scientific fields. Applications are evaluated only according to their scientific excellence. - -ZIH provides a portfolio of preinstalled applications and offers support for software -installation/configuration of project-specific applications. In particular, we provide consulting -services for all our users, and advise researchers on using the resources in an efficient way. 
- -\<img align="right" alt="HPC-DA Overview" -src="%ATTACHURL%/bandwidth.png" title="bandwidth.png" width="250" /> - -## Access - -- Application for access using this - [Online Web Form](https://tu-dresden.de/zih/hochleistungsrechnen/zugang/hpc-da) - -## Hardware Overview - -- [Nodes for machine learning (Power9)](../jobs_and_resources/power9.md) -- [NVMe Storage](../jobs_and_resources/nvme_storage.md) (2 PB) -- [Warm archive](../data_lifecycle/file_systems.md#warm-archive) (10 PB) -- HPC nodes (x86) for DA (island 6) -- Compute nodes with high memory bandwidth: - [AMD Rome Nodes](../jobs_and_resources/rome_nodes.md) (island 7) - -Additional hardware: - -- [Multi-GPU-Cluster](../jobs_and_resources/alpha_centauri.md) for projects of SCADS.AI - -## File Systems and Object Storage - -- Lustre -- BeeGFS -- Quobyte -- S3 - -## HOWTOS - -- [Get started with HPC-DA](../software/get_started_with_hpcda.md) -- [IBM Power AI](../software/power_ai.md) -- [Work with Singularity Containers on Power9]**todo** Cloud -- [TensorFlow on HPC-DA (native)](../software/tensorflow.md) -- [Tensorflow on Jupyter notebook](../software/tensorflow_on_jupyter_notebook.md) -- Create and run your own TensorFlow container for HPC-DA (Power9) (todo: no link at all in old compendium) -- [TensorFlow on x86](../software/deep_learning.md) -- [PyTorch on HPC-DA (Power9)](../software/pytorch.md) -- [Python on HPC-DA (Power9)](../software/python.md) -- [JupyterHub](../access/jupyterhub.md) -- [R on HPC-DA (Power9)](../software/data_analytics_with_r.md) -- [Big Data frameworks: Apache Spark, Apache Flink, Apache Hadoop](../software/big_data_frameworks.md) diff --git a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md similarity index 84% rename from doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md rename to doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md index 59aa75e842e3875f99d458caec785c6bf9645a81..7eb432bd9f963ed2ecc0133db2ad5f04c9b67b8c 100644 --- a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md +++ b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md @@ -1,4 +1,4 @@ -# Big Data Frameworks: Apache Spark, Apache Flink, Apache Hadoop +# Big Data Frameworks: Apache Spark !!! note @@ -10,11 +10,11 @@ Big Data. These frameworks are also offered as software [modules](modules.md) on `scs5` partition. You can check module versions and availability with the command ```console -marie@login$ module av Spark +marie@login$ module avail Spark ``` The **aim** of this page is to introduce users on how to start working with -these frameworks on ZIH systems, e. g. on the [HPC-DA](../jobs_and_resources/hpcda.md) system. +these frameworks on ZIH systems. **Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH systems and basic knowledge about data analysis and the batch system @@ -94,7 +94,7 @@ The Spark processes should now be set up and you can start your application, e. g.: ```console -marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000 +marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000 ``` !!! 
warning @@ -156,43 +156,34 @@ Please use a [batch job](../jobs_and_resources/slurm.md) similar to There are two general options on how to work with Jupyter notebooks: There is [JupyterHub](../access/jupyterhub.md), where you can simply -run your Jupyter notebook on HPC nodes (the preferable way). Also, you -can run a remote Jupyter server manually within a GPU job using -the modules and packages you need. You can find the manual server -setup [here](deep_learning.md). +run your Jupyter notebook on HPC nodes (the preferable way). ### Preparation If you want to run Spark in Jupyter notebooks, you have to prepare it first. This is comparable -to the [description for custom environments](../access/jupyterhub.md#conda-environment). +to [normal Python virtual environments](../software/python_virtual_environments.md#python-virtual-environment). You start with an allocation: ```console marie@login$ srun --pty -n 1 -c 2 --mem-per-cpu=2500 -t 01:00:00 bash -l ``` -When a node is allocated, install the required package with Anaconda: +When a node is allocated, install he required packages: ```console -marie@compute$ module load Anaconda3 marie@compute$ cd -marie@compute$ mkdir user-kernel -marie@compute$ conda create --prefix $HOME/user-kernel/haswell-py3.6-spark python=3.6 -Collecting package metadata: done -Solving environment: done [...] - -marie@compute$ conda activate $HOME/user-kernel/haswell-py3.6-spark -marie@compute$ conda install ipykernel -Collecting package metadata: done -Solving environment: done [...] - -marie@compute$ python -m ipykernel install --user --name haswell-py3.6-spark --display-name="haswell-py3.6-spark" -Installed kernelspec haswell-py3.6-spark in [...] - -marie@compute$ conda install -c conda-forge findspark -marie@compute$ conda install pyspark - -marie@compute$ conda deactivate +marie@compute$ mkdir jupyter-kernel +marie@compute$ virtualenv --system-site-packages jupyter-kernel/env #Create virtual environment +[...] +marie@compute$ source jupyter-kernel/env/bin/activate #Activate virtual environment. +marie@compute$ pip install ipykernel +[...] +marie@compute$ python -m ipykernel install --user --name haswell-py3.7-spark --display-name="haswell-py3.7-spark" +Installed kernelspec haswell-py3.7-spark in [...] + +marie@compute$ pip install findspark + +marie@compute$ deactivate ``` You are now ready to spawn a notebook with Spark. @@ -206,7 +197,7 @@ to the field "Preload modules" and select one of the Spark modules. When your Jupyter instance is started, check whether the kernel that you created in the preparation phase (see above) is shown in the top right corner of the notebook. If it is not already selected, select the -kernel `haswell-py3.6-spark`. Then, you can set up Spark. Since the setup +kernel `haswell-py3.7-spark`. Then, you can set up Spark. Since the setup in the notebook requires more steps than in an interactive session, we have created an example notebook that you can use as a starting point for convenience: [SparkExample.ipynb](misc/SparkExample.ipynb) diff --git a/doc.zih.tu-dresden.de/docs/software/dask.md b/doc.zih.tu-dresden.de/docs/software/dask.md deleted file mode 100644 index 316aefe2395e077bec611fdbd0c080cce2af1940..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/dask.md +++ /dev/null @@ -1,136 +0,0 @@ -# Dask - -**Dask** is an open-source library for parallel computing. Dask is a flexible library for parallel -computing in Python. - -Dask natively scales Python. 
It provides advanced parallelism for analytics, enabling performance at -scale for some of the popular tools. For instance: Dask arrays scale Numpy workflows, Dask -dataframes scale Pandas workflows, Dask-ML scales machine learning APIs like Scikit-Learn and -XGBoost. - -Dask is composed of two parts: - -- Dynamic task scheduling optimized for computation and interactive - computational workloads. -- Big Data collections like parallel arrays, data frames, and lists - that extend common interfaces like NumPy, Pandas, or Python - iterators to larger-than-memory or distributed environments. These - parallel collections run on top of dynamic task schedulers. - -Dask supports several user interfaces: - -High-Level: - -- Arrays: Parallel NumPy -- Bags: Parallel lists -- DataFrames: Parallel Pandas -- Machine Learning : Parallel Scikit-Learn -- Others from external projects, like XArray - -Low-Level: - -- Delayed: Parallel function evaluation -- Futures: Real-time parallel function evaluation - -## Installation - -### Installation Using Conda - -Dask is installed by default in [Anaconda](https://www.anaconda.com/download/). To install/update -Dask on a Taurus with using the [conda](https://www.anaconda.com/download/) follow the example: - -```Bash -# Job submission in ml nodes with allocating: 1 node, 1 gpu per node, 4 hours -srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash -``` - -Create a conda virtual environment. We would recommend using a workspace. See the example (use -`--prefix` flag to specify the directory). - -**Note:** You could work with simple examples in your home directory (where you are loading by -default). However, in accordance with the -[HPC storage concept](../data_lifecycle/overview.md) please use a -[workspaces](../data_lifecycle/workspaces.md) for your study and work projects. - -```Bash -conda create --prefix /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test python=3.6 -``` - -By default, conda will locate the environment in your home directory: - -```Bash -conda create -n dask-test python=3.6 -``` - -Activate the virtual environment, install Dask and verify the installation: - -```Bash -ml modenv/ml -ml PythonAnaconda/3.6 -conda activate /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test python=3.6 -which python -which conda -conda install dask -python - -from dask.distributed import Client, progress -client = Client(n_workers=4, threads_per_worker=1) -client -``` - -### Installation Using Pip - -You can install everything required for most common uses of Dask (arrays, dataframes, etc) - -```Bash -srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash - -cd /scratch/ws/0/aabc1234-Workproject/python-virtual-environment/dask-test - -ml modenv/ml -module load PythonAnaconda/3.6 -which python - -python3 -m venv --system-site-packages dask-test -source dask-test/bin/activate -python -m pip install "dask[complete]" - -python -from dask.distributed import Client, progress -client = Client(n_workers=4, threads_per_worker=1) -client -``` - -Distributed scheduler - -? - -## Run Dask on Taurus - -The preferred and simplest way to run Dask on HPC systems today both for new, experienced users or -administrator is to use [dask-jobqueue](https://jobqueue.dask.org/). 
- -You can install dask-jobqueue with `pip` or `conda` - -Installation with Pip - -```Bash -srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash -cd -/scratch/ws/0/aabc1234-Workproject/python-virtual-environment/dask-test -ml modenv/ml module load PythonAnaconda/3.6 which python - -source dask-test/bin/activate pip -install dask-jobqueue --upgrade # Install everything from last released version -``` - -Installation with Conda - -```Bash -srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash - -ml modenv/ml module load PythonAnaconda/3.6 source -dask-test/bin/activate - -conda install dask-jobqueue -c conda-forge\</verbatim> -``` diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics.md b/doc.zih.tu-dresden.de/docs/software/data_analytics.md new file mode 100644 index 0000000000000000000000000000000000000000..245bd5ae1a8ea0f246bd578d4365b3d23aaaba64 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics.md @@ -0,0 +1,35 @@ +# Data Analytics + +On ZIH systems, there are many possibilities for working with tools from the field of data +analytics. The boundaries between data analytics and machine learning are fluid. +Therefore, it may be worthwhile to search for a specific issue within the data analytics and +machine learning sections. + +The following tools are available on ZIH systems, among others: + +* [Python](data_analytics_with_python.md) +* [R](data_analytics_with_r.md) +* [RStudio](data_analytics_with_rstudio.md) +* [Big Data framework Spark](big_data_frameworks_spark.md) +* [MATLAB and Mathematica](mathematics.md) + +Detailed information about frameworks for machine learning, such as [TensorFlow](tensorflow.md) +and [PyTorch](pytorch.md), can be found in the [machine learning](machine_learning.md) subsection. + +Other software, not listed here, can be searched with + +```console +marie@compute$ module spider <software_name> +``` + +Refer to the section covering [modules](modules.md) for further information on the modules system. +Additional software or special versions of [individual modules](custom_easy_build_environment.md) +can be installed individually by each user. If possible, the use of virtual environments is +recommended (e.g. for Python). Likewise, software can be used within [containers](containers.md). + +For the transfer of larger amounts of data into and within the system, the +[export nodes and datamover](../data_transfer/overview.md) should be used. +Data is stored in the [workspaces](../data_lifecycle/workspaces.md). +Software modules or virtual environments can also be installed in workspaces to enable +collaborative work even within larger groups. General recommendations for setting up workflows +can be found in the [experiments](../data_lifecycle/experiments.md) section. diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md new file mode 100644 index 0000000000000000000000000000000000000000..a1974c5d288b275a33f621044209ec0e90ce201d --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md @@ -0,0 +1,205 @@ +# Python for Data Analytics + +Python is a high-level interpreted language widely used in research and science. Using ZIH system +allows you to work with Python quicker and more effective. Here, a general introduction to working +with Python on ZIH systems is given. Further documentation is available for specific +[machine learning frameworks](machine_learning.md). 
+
+## Python Console and Virtual Environments
+
+Often, it is useful to create an isolated development environment, which can be shared among
+a research group and/or teaching class. For this purpose,
+[Python virtual environments](python_virtual_environments.md) can be used.
+
+The interactive Python interpreter can also be used on ZIH systems via an interactive job:
+
+```console
+marie@login$ srun --partition=haswell --gres=gpu:1 --ntasks=1 --cpus-per-task=7 --pty --mem-per-cpu=8000 bash
+marie@haswell$ module load Python
+marie@haswell$ python
+Python 3.8.6 (default, Feb 17 2021, 11:48:51)
+[GCC 10.2.0] on linux
+Type "help", "copyright", "credits" or "license" for more information.
+>>>
+```
+
+## Jupyter Notebooks
+
+Jupyter notebooks allow you to analyze data interactively in your web browser. One advantage of
+Jupyter is that code, documentation and visualization can be combined in a single notebook, so that
+it forms a unit. Jupyter notebooks can be used for many tasks, such as data cleaning and
+transformation, numerical simulation, statistical modeling, data visualization and also machine
+learning.
+
+On ZIH systems, a [JupyterHub](../access/jupyterhub.md) is available, which can be used to run a
+Jupyter notebook on a node, using a GPU when needed.
+
+## Parallel Computing with Python
+
+### Pandas with Pandarallel
+
+[Pandas](https://pandas.pydata.org/){:target="_blank"} is a widely used library for data
+analytics in Python.
+In many cases, existing source code using Pandas can be easily modified for parallel execution by
+using the [pandarallel](https://github.com/nalepae/pandarallel/tree/v1.5.2) module. The number of
+threads that can be used in parallel depends on the number of cores (parameter `--cpus-per-task`)
+within the Slurm request, e.g.
+
+```console
+marie@login$ srun --partition=haswell --cpus-per-task=4 --mem=2G --hint=nomultithread --pty --time=8:00:00 bash
+```
+
+The above request allows you to use 4 parallel threads.
+
+The following example shows how to parallelize the apply method for pandas dataframes with the
+pandarallel module. If the pandarallel module is not installed already, use a
+[virtual environment](python_virtual_environments.md) to install the module.
+
+??? example
+
+    ```python
+    import pandas as pd
+    import numpy as np
+    from pandarallel import pandarallel
+
+    pandarallel.initialize()
+    # note: initialize() detects the total number of physical cores of the node and does not
+    # take the cores allocated by Slurm into account, so the choice of --cpus-per-task is still relevant here
+
+    N_rows = 10**5
+    N_cols = 5
+    df = pd.DataFrame(np.random.randn(N_rows, N_cols))
+
+    # some function that is to be executed in parallel
+    def transform(x):
+        return(np.mean(x))
+
+    print('calculate with normal apply...')
+    df.apply(func=transform, axis=1)
+
+    print('calculate with pandarallel...')
+    df.parallel_apply(func=transform, axis=1)
+    ```
+
+For more examples of using pandarallel, check out
+[https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb](https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb).
+
+### Dask
+
+[Dask](https://dask.org/) is a flexible and open-source library for parallel computing in Python.
+It replaces some Python data structures with parallel versions in order to provide advanced
+parallelism for analytics, enabling performance at scale for some of the popular tools. For
+instance: Dask arrays replace NumPy arrays, Dask dataframes replace Pandas dataframes.
+Furthermore, Dask-ML scales machine learning APIs like Scikit-Learn and XGBoost.
+
+Dask is composed of two parts:
+
+- Dynamic task scheduling optimized for computation and interactive computational workloads.
+- Big Data collections like parallel arrays, data frames, and lists that extend common interfaces
+  like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments.
+  These parallel collections run on top of dynamic task schedulers.
+
+Dask supports several user interfaces:
+
+- High-Level
+    - Arrays: Parallel NumPy
+    - Bags: Parallel lists
+    - DataFrames: Parallel Pandas
+    - Machine Learning: Parallel Scikit-Learn
+    - Others from external projects, like XArray
+- Low-Level
+    - Delayed: Parallel function evaluation
+    - Futures: Real-time parallel function evaluation
+
+#### Dask Usage
+
+On ZIH systems, Dask is available as a module. Check available versions and load your preferred one:
+
+```console
+marie@compute$ module spider dask
+------------------------------------------------------------------------------------------
+  dask:
+------------------------------------------------------------------------------------------
+     Versions:
+        dask/2.8.0-fosscuda-2019b-Python-3.7.4
+        dask/2.8.0-Python-3.7.4
+        dask/2.8.0 (E)
+[...]
+marie@compute$ module load dask/2.8.0-fosscuda-2019b-Python-3.7.4
+marie@compute$ python -c "import dask; print(dask.__version__)"
+2021.08.1
+```
+
+The preferred and simplest way to run Dask on the ZIH system is using
+[dask-jobqueue](https://jobqueue.dask.org/).
+
+**TODO** create better example with jobqueue
+
+```python
+from dask.distributed import Client, progress
+client = Client(n_workers=4, threads_per_worker=1)
+client
+```
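+
+A minimal sketch of how a dask-jobqueue setup could look is shown below. This is only a sketch,
+not a tested recipe for ZIH systems: the partition, memory, number of worker jobs and walltime are
+placeholder values, and `dask-jobqueue` may first have to be installed into a
+[virtual environment](python_virtual_environments.md) (e.g. via `pip install dask-jobqueue`),
+since it is not necessarily part of the `dask` module.
+
+```python
+from dask_jobqueue import SLURMCluster
+from dask.distributed import Client
+import dask.array as da
+
+# Describe what one Slurm job (= one Dask worker job) should look like.
+# All values below are placeholders and must be adapted to your project;
+# depending on the setup, an account/project may also have to be specified.
+cluster = SLURMCluster(
+    queue='haswell',          # Slurm partition
+    cores=4,                  # cores per worker job
+    memory='8GB',             # memory per worker job
+    walltime='01:00:00',
+)
+
+cluster.scale(jobs=2)         # submit 2 worker jobs to Slurm
+client = Client(cluster)      # connect the scheduler to the workers
+
+# small demonstration: compute the mean of a large random Dask array
+x = da.random.random((20000, 20000), chunks=(2000, 2000))
+print(x.mean().compute())
+
+client.close()
+cluster.close()
+```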
+
+### mpi4py - MPI for Python
+
+The Message Passing Interface (MPI) is a standardized and portable message-passing standard,
+designed to function on a wide variety of parallel computing architectures. It is a library
+specification that allows HPC applications to exchange information between nodes and clusters.
+MPI is designed to provide access to advanced parallel hardware for end users, library
+writers and tool developers.
+
+mpi4py (MPI for Python) provides bindings of the MPI standard for the Python programming
+language, allowing any Python program to exploit multiple processors.
+
+mpi4py is based on the MPI-2 C++ bindings and supports almost all MPI calls. It is
+popular on Linux clusters and in the SciPy community. Operations are primarily methods of
+communicator objects. mpi4py supports communication of pickle-able Python objects and provides
+optimized communication of NumPy arrays.
+
+mpi4py is included in the SciPy-bundle modules on the ZIH system.
+
+```console
+marie@compute$ module load SciPy-bundle/2020.11-foss-2020b
+Module SciPy-bundle/2020.11-foss-2020b and 28 dependencies loaded.
+marie@compute$ pip list
+Package                       Version
+----------------------------- ----------
+[...]
+mpi4py                        3.0.3
+[...]
+```
+
+Other versions of the package can be found with
+
+```console
+marie@compute$ module spider mpi4py
+-----------------------------------------------------------------------------------------------------------------------------------------
+  mpi4py:
+-----------------------------------------------------------------------------------------------------------------------------------------
+     Versions:
+        mpi4py/1.3.1
+        mpi4py/2.0.0-impi
+        mpi4py/3.0.0 (E)
+        mpi4py/3.0.2 (E)
+        mpi4py/3.0.3 (E)
+
+Names marked by a trailing (E) are extensions provided by another module.
+
+-----------------------------------------------------------------------------------------------------------------------------------------
+  For detailed information about a specific "mpi4py" package (including how to load the modules) use the module's full name.
+  Note that names that have a trailing (E) are extensions provided by other modules.
+  For example:
+
+     $ module spider mpi4py/3.0.3
+-----------------------------------------------------------------------------------------------------------------------------------------
+```
+
+Check if mpi4py is working correctly:
+
+```python
+from mpi4py import MPI
+comm = MPI.COMM_WORLD
+print("%d of %d" % (comm.Get_rank(), comm.Get_size()))
+```
+
+**TODO** verify mpi4py installation
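+
+One possible way to run this check on several MPI ranks is to submit it as a batch job. The
+following sketch assumes that the snippet above is saved as `check_mpi4py.py` (a placeholder
+name); the number of tasks and the time limit are illustrative values only. Each rank should then
+report a line such as `1 of 4` in the output file.
+
+```bash
+#!/bin/bash
+#SBATCH --ntasks=4                 # spawn 4 MPI ranks
+#SBATCH --time=00:10:00
+#SBATCH --output=check_mpi4py.out
+
+module load SciPy-bundle/2020.11-foss-2020b
+
+# each rank prints its rank and the total number of ranks into check_mpi4py.out
+srun python check_mpi4py.py
+```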
diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
index 9c1e092a72d6294a9c5b91f0cd3459bc8e215ebb..21966e1f3f03416e1a080a391894f370f9f1a5a8 100644
--- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
+++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
@@ -1,53 +1,41 @@
# R for Data Analytics

[R](https://www.r-project.org/about.html) is a programming language and environment for statistical
-computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling,
-classical statistical tests, time-series analysis, classification, etc) and graphical techniques. R
-is an integrated suite of software facilities for data manipulation, calculation and
-graphing.
+computing and graphics. It provides a wide variety of statistical (linear and nonlinear modeling,
+classical statistical tests, time-series analysis, classification, etc.), machine learning
+algorithms and graphical techniques. R is an integrated suite of software facilities for data
+manipulation, calculation and graphing.

-R possesses an extensive catalogue of statistical and graphical methods. It includes machine
-learning algorithms, linear regression, time series, statistical inference.
-
-We recommend using **Haswell** and/or **Romeo** partitions to work with R. For more details
-see [here](../jobs_and_resources/hardware_taurus.md).
+We recommend using the partitions `haswell` and/or `romeo` to work with R. For more details
+see our [hardware documentation](../jobs_and_resources/hardware_taurus.md).

## R Console

-This is a quickstart example. The `srun` command is used to submit a real-time execution job
-designed for interactive use with monitoring the output. Please check
-[the Slurm page](../jobs_and_resources/slurm.md) for details. 
- -```Bash -# job submission on haswell nodes with allocating: 1 task, 1 node, 4 CPUs per task with 2541 mb per CPU(core) for 1 hour -tauruslogin$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash - -# Ensure that you are using the scs5 environment -module load modenv/scs5 -# Check all available modules for R with version 3.6 -module available R/3.6 -# Load default R module -module load R -# Checking the current R version -which R -# Start R console -R +In the following example, the `srun` command is used to start an interactive job, so that the output +is visible to the user. Please check the [Slurm page](../jobs_and_resources/slurm.md) for details. + +```console +marie@login$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash +marie@haswell$ module load modenv/scs5 +marie@haswell$ module load R/3.6 +[...] +Module R/3.6.0-foss-2019a and 56 dependencies loaded. +marie@haswell$ which R +marie@haswell$ /sw/installed/R/3.6.0-foss-2019a/bin/R ``` -Using `srun` is recommended only for short test runs, while for larger runs batch jobs should be -used. The examples can be found [here](get_started_with_hpcda.md) or -[here](../jobs_and_resources/slurm.md). +Using interactive sessions is recommended only for short test runs, while for larger runs batch jobs +should be used. Examples can be found on the [Slurm page](../jobs_and_resources/slurm.md). It is also possible to run `Rscript` command directly (after loading the module): -```Bash -# Run Rscript directly. For instance: Rscript /scratch/ws/0/marie-study_project/my_r_script.R -Rscript /path/to/script/your_script.R param1 param2 +```console +marie@haswell$ Rscript </path/to/script/your_script.R> <param1> <param2> ``` ## R in JupyterHub -In addition to using interactive and batch jobs, it is possible to work with **R** using +In addition to using interactive and batch jobs, it is possible to work with R using [JupyterHub](../access/jupyterhub.md). The production and test [environments](../access/jupyterhub.md#standard-environments) of @@ -55,66 +43,49 @@ JupyterHub contain R kernel. It can be started either in the notebook or in the ## RStudio -[RStudio](<https://rstudio.com/) is an integrated development environment (IDE) for R. It includes -a console, syntax-highlighting editor that supports direct code execution, as well as tools for -plotting, history, debugging and workspace management. RStudio is also available on Taurus. - -The easiest option is to run RStudio in JupyterHub directly in the browser. It can be started -similarly to a new kernel from [JupyterLab](../access/jupyterhub.md#jupyterlab) launcher. - - -{: align="center"} - -Please keep in mind that it is currently not recommended to use the interactive x11 job with the -desktop version of RStudio, as described, for example, in introduction HPC-DA slides. +For using R with RStudio please refer to the documentation on +[Data Analytics with RStudio](data_analytics_with_rstudio.md). ## Install Packages in R -By default, user-installed packages are saved in the users home in a subfolder depending on -the architecture (x86 or PowerPC). Therefore the packages should be installed using interactive +By default, user-installed packages are saved in the users home in a folder depending on +the architecture (`x86` or `PowerPC`). 
Therefore the packages should be installed using interactive jobs on the compute node: -```Bash -srun -p haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash - -module purge -module load modenv/scs5 -module load R -R -e 'install.packages("package_name")' #For instance: 'install.packages("ggplot2")' +```console +marie@compute$ module load R +[...] +Module R/3.6.0-foss-2019a and 56 dependencies loaded. +marie@compute$ R -e 'install.packages("ggplot2")' +[...] ``` ## Deep Learning with R The deep learning frameworks perform extremely fast when run on accelerators such as GPU. -Therefore, using nodes with built-in GPUs ([ml](../jobs_and_resources/power9.md) or -[alpha](../jobs_and_resources/alpha_centauri.md) partitions) is beneficial for the examples here. +Therefore, using nodes with built-in GPUs, e.g., partitions [ml](../jobs_and_resources/power9.md) +and [alpha](../jobs_and_resources/alpha_centauri.md), is beneficial for the examples here. ### R Interface to TensorFlow The ["TensorFlow" R package](https://tensorflow.rstudio.com/) provides R users access to the -Tensorflow toolset. [TensorFlow](https://www.tensorflow.org/) is an open-source software library +TensorFlow framework. [TensorFlow](https://www.tensorflow.org/) is an open-source software library for numerical computation using data flow graphs. -```Bash -srun --partition=ml --ntasks=1 --nodes=1 --cpus-per-task=7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash +The respective modules can be loaded with the following -module purge -ml modenv/ml -ml TensorFlow -ml R - -which python -mkdir python-virtual-environments # Create a folder for virtual environments -cd python-virtual-environments -python3 -m venv --system-site-packages R-TensorFlow #create python virtual environment -source R-TensorFlow/bin/activate #activate environment -module list -which R +```console +marie@compute$ module load R/3.6.2-fosscuda-2019b +[...] +Module R/3.6.2-fosscuda-2019b and 63 dependencies loaded. +marie@compute$ module load TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 +Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 15 dependencies loaded. ``` -Please allocate the job with respect to -[hardware specification](../jobs_and_resources/hardware_taurus.md)! Note that the nodes on `ml` -partition have 4way-SMT, so for every physical core allocated, you will always get 4\*1443Mb=5772mb. +!!! warning + + Be aware that for compatibility reasons it is important to choose [modules](modules.md) with + the same toolchain version (in this case `fosscuda/2019b`). In order to interact with Python-based frameworks (like TensorFlow) `reticulate` R library is used. To configure it to point to the correct Python executable in your virtual environment, create @@ -122,23 +93,40 @@ a file named `.Rprofile` in your project directory (e.g. 
R-TensorFlow) with the contents: ```R -Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Anaconda3/2019.03/bin/python") #assign the output of the 'which python' from above to RETICULATE_PYTHON +Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python") #assign RETICULATE_PYTHON to the python executable ``` Let's start R, install some libraries and evaluate the result: -```R -install.packages("reticulate") -library(reticulate) -reticulate::py_config() -install.packages("tensorflow") -library(tensorflow) -tf$constant("Hello Tensorflow") #In the output 'Tesla V100-SXM2-32GB' should be mentioned +```rconsole +> install.packages(c("reticulate", "tensorflow")) +Installing packages into ‘~/R/x86_64-pc-linux-gnu-library/3.6’ +(as ‘lib’ is unspecified) +> reticulate::py_config() +python: /software/rome/Python/3.7.4-GCCcore-8.3.0/bin/python +libpython: /sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/libpython3.7m.so +pythonhome: /software/rome/Python/3.7.4-GCCcore-8.3.0:/software/rome/Python/3.7.4-GCCcore-8.3.0 +version: 3.7.4 (default, Mar 25 2020, 13:46:43) [GCC 8.3.0] +numpy: /software/rome/SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/numpy +numpy_version: 1.17.3 + +NOTE: Python version was forced by RETICULATE_PYTHON + +> library(tensorflow) +2021-08-26 16:11:47.110548: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 +> tf$constant("Hello TensorFlow") +2021-08-26 16:14:00.269248: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1 +2021-08-26 16:14:00.674878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: +pciBusID: 0000:0b:00.0 name: A100-SXM4-40GB computeCapability: 8.0 +coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s +[...] +tf.Tensor(b'Hello TensorFlow', shape=(), dtype=string) ``` ??? example + The example shows the use of the TensorFlow package with the R for the classification problem - related to the MNIST dataset. + related to the MNIST data set. ```R library(tensorflow) library(keras) @@ -214,20 +202,16 @@ tf$constant("Hello Tensorflow") #In the output 'Tesla V100-SXM2-32GB' sh ## Parallel Computing with R Generally, the R code is serial. However, many computations in R can be made faster by the use of -parallel computations. Taurus allows a vast number of options for parallel computations. Large -amounts of data and/or use of complex models are indications to use parallelization. - -### General Information about the R Parallelism - -There are various techniques and packages in R that allow parallelization. This section -concentrates on most general methods and examples. The Information here is Taurus-specific. +parallel computations. This section concentrates on most general methods and examples. The [parallel](https://www.rdocumentation.org/packages/parallel/versions/3.6.2) library will be used below. -**Warning:** Please do not install or update R packages related to parallelism as it could lead to -conflicts with other pre-installed packages. +!!! warning -### Basic Lapply-Based Parallelism + Please do not install or update R packages related to parallelism as it could lead to + conflicts with other preinstalled packages. + +### Basic lapply-Based Parallelism `lapply()` function is a part of base R. lapply is useful for performing operations on list-objects. 
Roughly speaking, lapply is a vectorization of the source code and it is the first step before @@ -243,6 +227,7 @@ This is a simple option for parallelization. It doesn't require much effort to r code to use `mclapply` function. Check out an example below. ??? example + ```R library(parallel) @@ -269,9 +254,9 @@ code to use `mclapply` function. Check out an example below. list_of_averages <- mclapply(X=sample_sizes, FUN=average, mc.cores=threads) # apply function "average" 100 times ``` -The disadvantages of using shared-memory parallelism approach are, that the number of parallel -tasks is limited to the number of cores on a single node. The maximum number of cores on a single -node can be found [here](../jobs_and_resources/hardware_taurus.md). +The disadvantages of using shared-memory parallelism approach are, that the number of parallel tasks +is limited to the number of cores on a single node. The maximum number of cores on a single node can +be found in our [hardware documentation](../jobs_and_resources/hardware_taurus.md). Submitting a multicore R job to Slurm is very similar to submitting an [OpenMP Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks), @@ -305,9 +290,10 @@ running in parallel. The desired type of the cluster can be specified with a par This way of the R parallelism uses the [Rmpi](http://cran.r-project.org/web/packages/Rmpi/index.html) package and the [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (Message Passing Interface) as a -"backend" for its parallel operations. The MPI-based job in R is very similar to submitting an +"back-end" for its parallel operations. The MPI-based job in R is very similar to submitting an [MPI Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks) since both are running -multicore jobs on multiple nodes. Below is an example of running R script with the Rmpi on Taurus: +multicore jobs on multiple nodes. Below is an example of running R script with the Rmpi on the ZIH +system: ```Bash #!/bin/bash @@ -315,8 +301,8 @@ multicore jobs on multiple nodes. Below is an example of running R script with t #SBATCH --ntasks=32 # this parameter determines how many processes will be spawned, please use >=8 #SBATCH --cpus-per-task=1 #SBATCH --time=01:00:00 -#SBATCH -o test_Rmpi.out -#SBATCH -e test_Rmpi.err +#SBATCH --output=test_Rmpi.out +#SBATCH --error=test_Rmpi.err module purge module load modenv/scs5 @@ -333,10 +319,10 @@ However, in some specific cases, you can specify the number of nodes and the num tasks per node explicitly: ```Bash -#!/bin/bash #SBATCH --nodes=2 #SBATCH --tasks-per-node=16 #SBATCH --cpus-per-task=1 + module purge module load modenv/scs5 module load R @@ -348,6 +334,7 @@ Use an example below, where 32 global ranks are distributed over 2 nodes with 16 Each MPI rank has 1 core assigned to it. ??? example + ```R library(Rmpi) @@ -371,6 +358,7 @@ Each MPI rank has 1 core assigned to it. Another example: ??? example + ```R library(Rmpi) library(parallel) @@ -405,7 +393,7 @@ Another example: #snow::stopCluster(cl) # usually it hangs over here with OpenMPI > 2.0. In this case this command may be avoided, Slurm will clean up after the job finishes ``` -To use Rmpi and MPI please use one of these partitions: **haswell**, **broadwell** or **rome**. +To use Rmpi and MPI please use one of these partitions: `haswell`, `broadwell` or `rome`. Use `mpirun` command to start the R script. It is a wrapper that enables the communication between processes running on different nodes. 
It is important to use `-np 1` (the number of spawned @@ -422,6 +410,7 @@ parallel workers, you have to manually specify the number of nodes according to hardware specification and parameters of your job. ??? example + ```R library(parallel) @@ -456,7 +445,7 @@ hardware specification and parameters of your job. print(paste("Program finished")) ``` -#### FORK cluster +#### FORK Cluster The `type="FORK"` method behaves exactly like the `mclapply` function discussed in the previous section. Like `mclapply`, it can only use the cores available on a single node. However this method @@ -464,7 +453,7 @@ requires exporting the workspace data to other processes. The FORK method in a c `parLapply` function might be used in situations, where different source code should run on each parallel process. -### Other parallel options +### Other Parallel Options - [foreach](https://cran.r-project.org/web/packages/foreach/index.html) library. It is functionally equivalent to the @@ -476,7 +465,8 @@ parallel process. expression via futures - [Poor-man's parallelism](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-1-poor-man-s-parallelism) (simple data parallelism). It is the simplest, but not an elegant way to parallelize R code. - It runs several copies of the same R script where's each read different sectors of the input data + It runs several copies of the same R script where each copy reads a different part of the input + data. - [Hands-off (OpenMP)](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-2-hands-off-parallelism) method. R has [OpenMP](https://www.openmp.org/resources/) support. Thus using OpenMP is a simple method where you don't need to know much about the parallelism options in your code. Please be diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_rstudio.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_rstudio.md new file mode 100644 index 0000000000000000000000000000000000000000..51d1068e3d1c32796859037e51a37e71810259b6 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_rstudio.md @@ -0,0 +1,14 @@ +# Data Analytics with RStudio + +[RStudio](https://rstudio.com/) is an integrated development environment (IDE) for R. It includes +a console, syntax-highlighting editor that supports direct code execution, as well as tools for +plotting, history, debugging and workspace management. RStudio is also available on ZIH systems. + +The easiest option is to run RStudio in JupyterHub directly in the browser. It can be started +similarly to a new kernel from [JupyterLab](../access/jupyterhub.md#jupyterlab) launcher. + + +{: style="width:90%" } + +!!! tip + If an error "could not start RStudio in time" occurs, try reloading the web page with `F5`. diff --git a/doc.zih.tu-dresden.de/docs/software/deep_learning.md b/doc.zih.tu-dresden.de/docs/software/deep_learning.md deleted file mode 100644 index 00cbdad4bd40ee8d1fcf30aea6b0c2e215aaa467..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/deep_learning.md +++ /dev/null @@ -1,333 +0,0 @@ -# Deep learning - -**Prerequisites**: To work with Deep Learning tools you obviously need [Login](../access/ssh_login.md) -for the Taurus system and basic knowledge about Python, Slurm manager. - -**Aim** of this page is to introduce users on how to start working with Deep learning software on -both the ml environment and the scs5 environment of the Taurus system. 
- -## Deep Learning Software - -### TensorFlow - -[TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source software library -for dataflow and differentiable programming across a range of tasks. - -TensorFlow is available in both main partitions -[ml environment and scs5 environment](modules.md#module-environments) -under the module name "TensorFlow". However, for purposes of machine learning and deep learning, we -recommend using Ml partition [HPC-DA](../jobs_and_resources/hpcda.md). For example: - -```Bash -module load TensorFlow -``` - -There are numerous different possibilities on how to work with [TensorFlow](tensorflow.md) on -Taurus. On this page, for all examples default, scs5 partition is used. Generally, the easiest way -is using the [modules system](modules.md) -and Python virtual environment (test case). However, in some cases, you may need directly installed -TensorFlow stable or night releases. For this purpose use the -[EasyBuild](custom_easy_build_environment.md), [Containers](tensorflow_container_on_hpcda.md) and see -[the example](https://www.tensorflow.org/install/pip). For examples of using TensorFlow for ml partition -with module system see [TensorFlow page for HPC-DA](tensorflow.md). - -Note: If you are going used manually installed TensorFlow release we recommend use only stable -versions. - -## Keras - -[Keras](https://keras.io/) is a high-level neural network API, written in Python and capable of -running on top of [TensorFlow](https://github.com/tensorflow/tensorflow) Keras is available in both -environments [ml environment and scs5 environment](modules.md#module-environments) under the module -name "Keras". - -On this page for all examples default scs5 partition used. There are numerous different -possibilities on how to work with [TensorFlow](tensorflow.md) and Keras -on Taurus. Generally, the easiest way is using the [module system](modules.md) and Python -virtual environment (test case) to see TensorFlow part above. -For examples of using Keras for ml partition with the module system see the -[Keras page for HPC-DA](keras.md). - -It can either use TensorFlow as its backend. As mentioned in Keras documentation Keras capable of -running on Theano backend. However, due to the fact that Theano has been abandoned by the -developers, we don't recommend use Theano anymore. If you wish to use Theano backend you need to -install it manually. To use the TensorFlow backend, please don't forget to load the corresponding -TensorFlow module. TensorFlow should be loaded automatically as a dependency. 
- -Test case: Keras with TensorFlow on MNIST data - -Go to a directory on Taurus, get Keras for the examples and go to the examples: - -```Bash -git clone https://github.com/fchollet/keras.git'>https://github.com/fchollet/keras.git -cd keras/examples/ -``` - -If you do not specify Keras backend, then TensorFlow is used as a default - -Job-file (schedule job with sbatch, check the status with 'squeue -u \<Username>'): - -```Bash -#!/bin/bash -#SBATCH --gres=gpu:1 # 1 - using one gpu, 2 - for using 2 gpus -#SBATCH --mem=8000 -#SBATCH -p gpu2 # select the type of nodes (options: haswell, smp, sandy, west, gpu, ml) K80 GPUs on Haswell node -#SBATCH --time=00:30:00 -#SBATCH -o HLR_<name_of_your_script>.out # save output under HLR_${SLURMJOBID}.out -#SBATCH -e HLR_<name_of_your_script>.err # save error messages under HLR_${SLURMJOBID}.err - -module purge # purge if you already have modules loaded -module load modenv/scs5 # load scs5 environment -module load Keras # load Keras module -module load TensorFlow # load TensorFlow module - -# if you see 'broken pipe error's (might happen in interactive session after the second srun -command) uncomment line below -# module load h5py - -python mnist_cnn.py -``` - -Keep in mind that you need to put the bash script to the same folder as an executable file or -specify the path. - -Example output: - -```Bash -x_train shape: (60000, 28, 28, 1) 60000 train samples 10000 test samples Train on 60000 samples, -validate on 10000 samples Epoch 1/12 - -128/60000 [..............................] - ETA: 12:08 - loss: 2.3064 - acc: 0.0781 256/60000 -[..............................] - ETA: 7:04 - loss: 2.2613 - acc: 0.1523 384/60000 -[..............................] - ETA: 5:22 - loss: 2.2195 - acc: 0.2005 - -... - -60000/60000 [==============================] - 128s 2ms/step - loss: 0.0296 - acc: 0.9905 - -val_loss: 0.0268 - val_acc: 0.9911 Test loss: 0.02677746053306255 Test accuracy: 0.9911 -``` - -## Datasets - -There are many different datasets designed for research purposes. If you would like to download some -of them, first of all, keep in mind that many machine learning libraries have direct access to -public datasets without downloading it (for example -[TensorFlow Datasets](https://www.tensorflow.org/datasets). - -If you still need to download some datasets, first of all, be careful with the size of the datasets -which you would like to download (some of them have a size of few Terabytes). Don't download what -you really not need to use! Use login nodes only for downloading small files (hundreds of the -megabytes). For downloading huge files use [DataMover](../data_transfer/datamover.md). -For example, you can use command `dtwget` (it is an analogue of the general wget -command). This command submits a job to the data transfer machines. If you need to download or -allocate massive files (more than one terabyte) please contact the support before. - -### The ImageNet dataset - -The [ImageNet](http://www.image-net.org/) project is a large visual database designed for use in -visual object recognition software research. In order to save space in the file system by avoiding -to have multiple duplicates of this lying around, we have put a copy of the ImageNet database -(ILSVRC2012 and ILSVR2017) under `/scratch/imagenet` which you can use without having to download it -again. For the future, the ImageNet dataset will be available in `/warm_archive`. ILSVR2017 also -includes a dataset for recognition objects from a video. 
Please respect the corresponding -[Terms of Use](https://image-net.org/download.php). - -## Jupyter Notebook - -Jupyter notebooks are a great way for interactive computing in your web browser. Jupyter allows -working with data cleaning and transformation, numerical simulation, statistical modelling, data -visualization and of course with machine learning. - -There are two general options on how to work Jupyter notebooks using HPC: remote Jupyter server and -JupyterHub. - -These sections show how to run and set up a remote Jupyter server within a sbatch GPU job and which -modules and packages you need for that. - -**Note:** On Taurus, there is a [JupyterHub](../access/jupyterhub.md), where you do not need the -manual server setup described below and can simply run your Jupyter notebook on HPC nodes. Keep in -mind, that, with JupyterHub, you can't work with some special instruments. However, general data -analytics tools are available. - -The remote Jupyter server is able to offer more freedom with settings and approaches. - -### Preparation phase (optional) - -On Taurus, start an interactive session for setting up the -environment: - -```Bash -srun --pty -n 1 --cpus-per-task=2 --time=2:00:00 --mem-per-cpu=2500 --x11=first bash -l -i -``` - -Create a new subdirectory in your home, e.g. Jupyter - -```Bash -mkdir Jupyter cd Jupyter -``` - -There are two ways how to run Anaconda. The easiest way is to load the Anaconda module. The second -one is to download Anaconda in your home directory. - -1. Load Anaconda module (recommended): - -```Bash -module load modenv/scs5 module load Anaconda3 -``` - -1. Download latest Anaconda release (see example below) and change the rights to make it an -executable script and run the installation script: - -```Bash -wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh chmod 744 -Anaconda3-2019.03-Linux-x86_64.sh ./Anaconda3-2019.03-Linux-x86_64.sh - -(during installation you have to confirm the license agreement) -``` - -Next step will install the anaconda environment into the home -directory (/home/userxx/anaconda3). Create a new anaconda environment with the name "jnb". - -```Bash -conda create --name jnb -``` - -### Set environmental variables on Taurus - -In shell activate previously created python environment (you can -deactivate it also manually) and install Jupyter packages for this python environment: - -```Bash -source activate jnb conda install jupyter -``` - -If you need to adjust the configuration, you should create the template. Generate config files for -Jupyter notebook server: - -```Bash -jupyter notebook --generate-config -``` - -Find a path of the configuration file, usually in the home under `.jupyter` directory, e.g. -`/home//.jupyter/jupyter_notebook_config.py` - -Set a password (choose easy one for testing), which is needed later on to log into the server -in browser session: - -```Bash -jupyter notebook password Enter password: Verify password: -``` - -You get a message like that: - -```Bash -[NotebookPasswordApp] Wrote *hashed password* to -/home/<zih_user>/.jupyter/jupyter_notebook_config.json -``` - -I order to create an SSL certificate for https connections, you can create a self-signed -certificate: - -```Bash -openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem -``` - -Fill in the form with decent values. - -Possible entries for your Jupyter config (`.jupyter/jupyter_notebook*config.py*`). 
Uncomment below -lines: - -```Bash -c.NotebookApp.certfile = u'<path-to-cert>/mycert.pem' c.NotebookApp.keyfile = -u'<path-to-cert>/mykey.key' - -# set ip to '*' otherwise server is bound to localhost only c.NotebookApp.ip = '*' -c.NotebookApp.open_browser = False - -# copy hashed password from the jupyter_notebook_config.json c.NotebookApp.password = u'<your -hashed password here>' c.NotebookApp.port = 9999 c.NotebookApp.allow_remote_access = True -``` - -Note: `<path-to-cert>` - path to key and certificate files, for example: -(`/home/\<username>/mycert.pem`) - -### Slurm job file to run the Jupyter server on Taurus with GPU (1x K80) (also works on K20) - -```Bash -#!/bin/bash -l #SBATCH --gres=gpu:1 # request GPU #SBATCH --partition=gpu2 # use GPU partition -SBATCH --output=notebook_output.txt #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=02:30:00 -SBATCH --mem=4000M #SBATCH -J "jupyter-notebook" # job-name #SBATCH -A <name_of_your_project> - -unset XDG_RUNTIME_DIR # might be required when interactive instead of sbatch to avoid -'Permission denied error' srun jupyter notebook -``` - -Start the script above (e.g. with the name jnotebook) with sbatch command: - -```Bash -sbatch jnotebook.slurm -``` - -If you have a question about sbatch script see the article about [Slurm](../jobs_and_resources/slurm.md). - -Check by the command: `tail notebook_output.txt` the status and the **token** of the server. It -should look like this: - -```Bash -https://(taurusi2092.taurus.hrsk.tu-dresden.de or 127.0.0.1):9999/ -``` - -You can see the **server node's hostname** by the command: `squeue -u <username>`. - -Remote connect to the server - -There are two options on how to connect to the server: - -1. You can create an ssh tunnel if you have problems with the -solution above. Open the other terminal and configure ssh -tunnel: (look up connection values in the output file of Slurm job, e.g.) (recommended): - -```Bash -node=taurusi2092 #see the name of the node with squeue -u <your_login> -localport=8887 #local port on your computer remoteport=9999 -#pay attention on the value. It should be the same value as value in the notebook_output.txt ssh --fNL ${localport}:${node}:${remoteport} <zih_user>@taurus.hrsk.tu-dresden.de #configure -of the ssh tunnel for connection to your remote server pgrep -f "ssh -fNL ${localport}" -#verify that tunnel is alive -``` - -2. On your client (local machine) you now can connect to the server. You need to know the **node's - hostname**, the **port** of the server and the **token** to login (see paragraph above). - -You can connect directly if you know the IP address (just ping the node's hostname while logged on -Taurus). - -```Bash -#comand on remote terminal taurusi2092$> host taurusi2092 # copy IP address from output # paste -IP to your browser or call on local terminal e.g. local$> firefox https://<IP>:<PORT> # https -important to use SSL cert -``` - -To login into the Jupyter notebook site, you have to enter the **token**. -(`https://localhost:8887`). Now you can create and execute notebooks on Taurus with GPU support. - -If you would like to use [JupyterHub](../access/jupyterhub.md) after using a remote manually configured -Jupyter server (example above) you need to change the name of the configuration file -(`/home//.jupyter/jupyter_notebook_config.py`) to any other. - -### F.A.Q - -**Q:** - I have an error to connect to the Jupyter server (e.g. "open failed: administratively -prohibited: open failed") - -**A:** - Check the settings of your Jupyter config file. 
Is it all necessary lines uncommented, the -right path to cert and key files, right hashed password from .json file? Check is the used local -port [available](https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers) -Check local settings e.g. (`/etc/ssh/sshd_config`, `/etc/hosts`). - -**Q:** I have an error during the start of the interactive session (e.g. PMI2_Init failed to -initialize. Return code: 1) - -**A:** Probably you need to provide `--mpi=none` to avoid ompi errors (). -`srun --mpi=none --reservation \<...> -A \<...> -t 90 --mem=4000 --gres=gpu:1 ---partition=gpu2-interactive --pty bash -l` diff --git a/doc.zih.tu-dresden.de/docs/software/distributed_training.md b/doc.zih.tu-dresden.de/docs/software/distributed_training.md new file mode 100644 index 0000000000000000000000000000000000000000..1548afa1aef1dd3377490b4f6b757194f320bdea --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/distributed_training.md @@ -0,0 +1,184 @@ +# Distributed Training + +## Internal Distribution + +### Distributed TensorFlow + +TODO + +### Distributed PyTorch + +Hint: just copied some old content as starting point + +#### Using Multiple GPUs with PyTorch + +Effective use of GPUs is essential, and it implies using parallelism in +your code and model. Data Parallelism and model parallelism are effective instruments +to improve the performance of your code in case of GPU using. + +The data parallelism is a widely-used technique. It replicates the same model to all GPUs, +where each GPU consumes a different partition of the input data. You could see this method [here](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html). + +The example below shows how to solve that problem by using model +parallel, which, in contrast to data parallelism, splits a single model +onto different GPUs, rather than replicating the entire model on each +GPU. The high-level idea of model parallel is to place different sub-networks of a model onto different +devices. As the only part of a model operates on any individual device, a set of devices can +collectively serve a larger model. + +It is recommended to use [DistributedDataParallel] +(https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), +instead of this class, to do multi-GPU training, even if there is only a single node. +See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. +Check the [page](https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead) and +[Distributed Data Parallel](https://pytorch.org/docs/stable/notes/ddp.html#ddp). + +Examples: + +1\. The parallel model. The main aim of this model to show the way how +to effectively implement your neural network on several GPUs. It +includes a comparison of different kinds of models and tips to improve +the performance of your model. **Necessary** parameters for running this +model are **2 GPU** and 14 cores (56 thread). + +(example_PyTorch_parallel.zip) + +Remember that for using [JupyterHub service](../access/jupyterhub.md) +for PyTorch you need to create and activate +a virtual environment (kernel) with loaded essential modules. + +Run the example in the same way as the previous examples. + +#### Distributed data-parallel + +[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) +(DDP) implements data parallelism at the module level which can run across multiple machines. 
+Applications using DDP should spawn multiple processes and create a single DDP instance per process. +DDP uses collective communications in the [torch.distributed] +(https://pytorch.org/tutorials/intermediate/dist_tuto.html) +package to synchronize gradients and buffers. + +The tutorial could be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). + +To use distributed data parallelization on ZIH system please use following +parameters: `--ntasks-per-node` -parameter to the number of GPUs you use +per node. Also, it could be useful to increase `memomy/cpu` parameters +if you run larger models. Memory can be set up to: + +`--mem=250000` and `--cpus-per-task=7` for the `ml` partition. + +`--mem=60000` and `--cpus-per-task=6` for the `gpu2` partition. + +Keep in mind that only one memory parameter (`--mem-per-cpu` = <MB> or `--mem`=<MB>) can be +specified + +## External Distribution + +### Horovod + +[Horovod](https://github.com/horovod/horovod) is the open source distributed training +framework for TensorFlow, Keras, PyTorch. It is supposed to make it easy +to develop distributed deep learning projects and speed them up with +TensorFlow. + +#### Why use Horovod? + +Horovod allows you to easily take a single-GPU TensorFlow and PyTorch +program and successfully train it on many GPUs! In +some cases, the MPI model is much more straightforward and requires far +less code changes than the distributed code from TensorFlow for +instance, with parameter servers. Horovod uses MPI and NCCL which gives +in some cases better results than pure TensorFlow and PyTorch. + +#### Horovod as a module + +Horovod is available as a module with **TensorFlow** or **PyTorch**for **all** module environments. +Please check the [software module list](modules.md) for the current version of the software. +Horovod can be loaded like other software on ZIH system: + +```Bash +ml av Horovod #Check available modules with Python +module load Horovod #Loading of the module +``` + +#### Horovod installation + +However, if it is necessary to use Horovod with **PyTorch** or use +another version of Horovod it is possible to install it manually. To +install Horovod you need to create a virtual environment and load the +dependencies (e.g. MPI). Installing PyTorch can take a few hours and is +not recommended + +**Note:** You could work with simple examples in your home directory but **please use workspaces +for your study and work projects** (see the Storage concept). 
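+For example, a workspace for the virtual environment and your training data could be allocated as
+follows. This is only a sketch: the workspace name `horovod_env` and the duration of 30 days are
+placeholders that you should adapt to your project.
+
+```Bash
+ws_allocate -F scratch horovod_env 30   #allocate workspace "horovod_env" on scratch for 30 days (name and duration are examples)
+#the command prints the path of the new workspace, e.g. /scratch/ws/<user>-horovod_env;
+#use this path as <location_for_your_environment> in the setup below
+```
+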
+ +Setup: + +```Bash +srun -N 1 --ntasks-per-node=6 -p ml --time=08:00:00 --pty bash #allocate a Slurm job allocation, which is a set of resources (nodes) +module load modenv/ml #Load dependencies by using modules +module load OpenMPI/3.1.4-gcccuda-2018b +module load Python/3.6.6-fosscuda-2018b +module load cuDNN/7.1.4.18-fosscuda-2018b +module load CMake/3.11.4-GCCcore-7.3.0 +virtualenv --system-site-packages <location_for_your_environment> #create virtual environment +source <location_for_your_environment>/bin/activate #activate virtual environment +``` + +Or when you need to use conda: + +```Bash +srun -N 1 --ntasks-per-node=6 -p ml --time=08:00:00 --pty bash #allocate a Slurm job allocation, which is a set of resources (nodes) +module load modenv/ml #Load dependencies by using modules +module load OpenMPI/3.1.4-gcccuda-2018b +module load PythonAnaconda/3.6 +module load cuDNN/7.1.4.18-fosscuda-2018b +module load CMake/3.11.4-GCCcore-7.3.0 + +conda create --prefix=<location_for_your_environment> python=3.6 anaconda #create virtual environment + +conda activate <location_for_your_environment> #activate virtual environment +``` + +Install PyTorch (not recommended) + +```Bash +cd /tmp +git clone https://github.com/pytorch/pytorch #clone PyTorch from the source +cd pytorch #go to folder +git checkout v1.7.1 #Checkout version (example: 1.7.1) +git submodule update --init #Update dependencies +python setup.py install #install it with python +``` + +##### Install Horovod for PyTorch with python and pip + +In the example presented installation for the PyTorch without +TensorFlow. Adapt as required and refer to the Horovod documentation for +details. + +```Bash +HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod +``` + +##### Verify that Horovod works + +```Bash +python #start python +import torch #import pytorch +import horovod.torch as hvd #import horovod +hvd.init() #initialize horovod +hvd.size() +hvd.rank() +print('Hello from:', hvd.rank()) +``` + +##### Horovod with NCCL + +If you want to use NCCL instead of MPI you can specify that in the +install command after loading the NCCL module: + +```Bash +module load NCCL/2.3.7-fosscuda-2018b +HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod +``` diff --git a/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md b/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md deleted file mode 100644 index 8bf5d12783a30d92ded0e2a9f01b06a246066020..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md +++ /dev/null @@ -1,353 +0,0 @@ -# Get started with HPC-DA - -HPC-DA (High-Performance Computing and Data Analytics) is a part of TU-Dresden general purpose HPC -cluster (Taurus). HPC-DA is the best **option** for **Machine learning, Deep learning** applications -and tasks connected with the big data. - -**This is an introduction of how to run machine learning applications on the HPC-DA system.** - -The main **aim** of this guide is to help users who have started working with Taurus and focused on -working with Machine learning frameworks such as TensorFlow or Pytorch. - -**Prerequisites:** To work with HPC-DA, you need [Login](../access/ssh_login.md) for the Taurus system -and preferably have basic knowledge about High-Performance computers and Python. 
- -**Disclaimer:** This guide provides the main steps on the way of using Taurus, for details please -follow links in the text. - -You can also find the information you need on the -[HPC-Introduction] **todo** %ATTACHURL%/HPC-Introduction.pdf?t=1585216700 and -[HPC-DA-Introduction] *todo** %ATTACHURL%/HPC-DA-Introduction.pdf?t=1585162693 presentation slides. - -## Why should I use HPC-DA? The architecture and feature of the HPC-DA - -HPC-DA built on the base of [Power9](https://www.ibm.com/it-infrastructure/power/power9) -architecture from IBM. HPC-DA created from -[AC922 IBM servers](https://www.ibm.com/ie-en/marketplace/power-systems-ac922), which was created -for AI challenges, analytics and working with, Machine learning, data-intensive workloads, -deep-learning frameworks and accelerated databases. POWER9 is the processor with state-of-the-art -I/O subsystem technology, including next-generation NVIDIA NVLink, PCIe Gen4 and OpenCAPI. -[Here](../jobs_and_resources/power9.md) you could find a detailed specification of the TU Dresden -HPC-DA system. - -The main feature of the Power9 architecture (ppc64le) is the ability to work the -[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link** -support. NV-Link technology allows increasing a total bandwidth of 300 gigabytes per second (GB/sec) - -- 10X the bandwidth of PCIe Gen 3. The bandwidth is a crucial factor for deep learning and machine - learning applications. - -**Note:** The Power9 architecture not so common as an x86 architecture. This means you are not so -flexible with choosing applications for your projects. Even so, the main tools and applications are -available. See available modules here. - -**Please use the ml partition if you need GPUs!** Otherwise using the x86 partitions (e.g Haswell) -most likely would be more beneficial. - -## Login - -### SSH Access - -The recommended way to connect to the HPC login servers directly via ssh: - -```Bash -ssh <zih-login>@taurus.hrsk.tu-dresden.de -``` - -Please put this command in the terminal and replace `<zih-login>` with your login that you received -during the access procedure. Accept the host verifying and enter your password. - -This method requires two conditions: -Linux OS, workstation within the campus network. For other options and -details check the [login page](../access/ssh_login.md). - -## Data management - -### Workspaces - -As soon as you have access to HPC-DA you have to manage your data. The main method of working with -data on Taurus is using Workspaces. You could work with simple examples in your home directory -(where you are loading by default). However, in accordance with the -[storage concept](../data_lifecycle/overview.md) -**please use** a [workspace](../data_lifecycle/workspaces.md) -for your study and work projects. - -You should create your workspace with a similar command: - -```Bash -ws_allocate -F scratch Machine_learning_project 50 #allocating workspase in scratch directory for 50 days -``` - -After the command, you will have an output with the address of the workspace based on scratch. Use -it to store the main data of your project. - -For different purposes, you should use different storage systems. To work as efficient as possible, -consider the following points: - -- Save source code etc. 
in `/home` or `/projects/...` -- Store checkpoints and other massive but temporary data with - workspaces in: `/scratch/ws/...` -- For data that seldom changes but consumes a lot of space, use - mid-term storage with workspaces: `/warm_archive/...` -- For large parallel applications where using the fastest file system - is a necessity, use with workspaces: `/lustre/ssd/...` -- Compilation in `/dev/shm`** or `/tmp` - -### Data moving - -#### Moving data to/from the HPC machines - -To copy data to/from the HPC machines, the Taurus [export nodes](../data_transfer/export_nodes.md) -should be used. They are the preferred way to transfer your data. There are three possibilities to -exchanging data between your local machine (lm) and the HPC machines (hm): **SCP, RSYNC, SFTP**. - -Type following commands in the local directory of the local machine. For example, the **`SCP`** -command was used. - -#### Copy data from lm to hm - -```Bash -scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> #Copy file from your local machine. For example: scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mastermann-Macine_learning_project/ - -scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> #Copy directory from your local machine. -``` - -#### Copy data from hm to lm - -```Bash -scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location> #Copy file. For example: scp mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mastermann-Macine_learning_project/helloworld.txt /home/mustermann/Downloads - -scp -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location> #Copy directory -``` - -#### Moving data inside the HPC machines. Datamover - -The best way to transfer data inside the Taurus is the [data mover](../data_transfer/datamover.md). -It is the special data transfer machine providing the global file systems of each ZIH HPC system. -Datamover provides the best data speed. To load, move, copy etc. files from one file system to -another file system, you have to use commands with **dt** prefix, such as: - -`dtcp, dtwget, dtmv, dtrm, dtrsync, dttar, dtls` - -These commands submit a job to the data transfer machines that execute the selected command. Except -for the `dt` prefix, their syntax is the same as the shell command without the `dt`. - -```Bash -dtcp -r /scratch/ws/<name_of_your_workspace>/results /lustre/ssd/ws/<name_of_your_workspace>; #Copy from workspace in scratch to ssd. -dtwget https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz #Download archive CIFAR-100. -``` - -## BatchSystems. SLURM - -After logon and preparing your data for further work the next logical step is to start your job. For -these purposes, SLURM is using. Slurm (Simple Linux Utility for Resource Management) is an -open-source job scheduler that allocates compute resources on clusters for queued defined jobs. By -default, after your logging, you are using the login nodes. The intended purpose of these nodes -speaks for oneself. Applications on an HPC system can not be run there! They have to be submitted -to compute nodes (ml nodes for HPC-DA) with dedicated resources for user jobs. - -Job submission can be done with the command: `-srun [options] <command>.` - -This is a simple example which you could use for your start. The `srun` command is used to submit a -job for execution in real-time designed for interactive use, with monitoring the output. For some -details please check [the Slurm page](../jobs_and_resources/slurm.md). 
- -```Bash -srun -p ml -N 1 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with allocating: 1 node, 1 gpu per node, with 8000 mb on 1 hour. -``` - -However, using srun directly on the shell will lead to blocking and launch an interactive job. Apart -from short test runs, it is **recommended to launch your jobs into the background by using batch -jobs**. For that, you can conveniently put the parameters directly into the job file which you can -submit using `sbatch [options] <job file>.` - -This is the example of the sbatch file to run your application: - -```Bash -#!/bin/bash -#SBATCH --mem=8GB # specify the needed memory -#SBATCH -p ml # specify ml partition -#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task) -#SBATCH --nodes=1 # request 1 node -#SBATCH --time=00:15:00 # runs for 10 minutes -#SBATCH -c 1 # how many cores per task allocated -#SBATCH -o HLR_name_your_script.out # save output message under HLR_${SLURMJOBID}.out -#SBATCH -e HLR_name_your_script.err # save error messages under HLR_${SLURMJOBID}.err - -module load modenv/ml -module load TensorFlow - -python machine_learning_example.py - -## when finished writing, submit with: sbatch <script_name> For example: sbatch machine_learning_script.slurm -``` - -The `machine_learning_example.py` contains a simple ml application based on the mnist model to test -your sbatch file. It could be found as the [attachment] **todo** -%ATTACHURL%/machine_learning_example.py in the bottom of the page. - -## Start your application - -As stated before HPC-DA was created for deep learning, machine learning applications. Machine -learning frameworks as TensorFlow and PyTorch are industry standards now. - -There are three main options on how to work with Tensorflow and PyTorch: - -1. **Modules** -1. **JupyterNotebook** -1. **Containers** - -### Modules - -The easiest way is using the [modules system](modules.md) and Python virtual environment. Modules -are a way to use frameworks, compilers, loader, libraries, and utilities. The module is a user -interface that provides utilities for the dynamic modification of a user's environment without -manual modifications. You could use them for srun , bath jobs (sbatch) and the Jupyterhub. - -A virtual environment is a cooperatively isolated runtime environment that allows Python users and -applications to install and update Python distribution packages without interfering with the -behaviour of other Python applications running on the same system. At its core, the main purpose of -Python virtual environments is to create an isolated environment for Python projects. - -**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. We recommend -using venv to work with Tensorflow and Pytorch on Taurus. It has been integrated into the standard -library under the [venv module](https://docs.python.org/3/library/venv.html). However, if you have -reasons (previously created environments etc) you could easily use conda. The conda is the second -way to use a virtual environment on the Taurus. -[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) -is an open-source package management system and environment management system from the Anaconda. 
- -As was written in the previous chapter, to start the application (using -modules) and to run the job exist two main options: - -- The `srun` command:** - -```Bash -srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash #job submission in ml nodes with allocating: 1 node, 1 task per node, 2 CPUs per task, 1 gpu per node, with 8000 mb on 1 hour. - -module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - -mkdir python-virtual-environments #create folder for your environments -cd python-virtual-environments #go to folder -module load TensorFlow #load TensorFlow module to use python. Example output: Module Module TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4 and 31 dependencies loaded. -which python #check which python are you using -python3 -m venv --system-site-packages env #create virtual environment "env" which inheriting with global site packages -source env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$ -``` - -The inscription (env) at the beginning of each line represents that now you are in the virtual -environment. - -Now you can check the working capacity of the current environment. - -```Bash -python # start python -import tensorflow as tf -print(tf.__version__) # example output: 2.1.0 -``` - -The second and main option is using batch jobs (`sbatch`). It is used to submit a job script for -later execution. Consequently, it is **recommended to launch your jobs into the background by using -batch jobs**. To launch your machine learning application as well to srun job you need to use -modules. See the previous chapter with the sbatch file example. - -Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are available. (25.02.20) - -Note: However in case of using sbatch files to send your job you usually don't need a virtual -environment. - -### JupyterNotebook - -The Jupyter Notebook is an open-source web application that allows you to create documents -containing live code, equations, visualizations, and narrative text. Jupyter notebook allows working -with TensorFlow on Taurus with GUI (graphic user interface) in a **web browser** and the opportunity -to see intermediate results step by step of your work. This can be useful for users who dont have -huge experience with HPC or Linux. - -There is [JupyterHub](../access/jupyterhub.md) on Taurus, where you can simply run your Jupyter -notebook on HPC nodes. Also, for more specific cases you can run a manually created remote jupyter -server. You can find the manual server setup [here](deep_learning.md). However, the simplest option -for beginners is using JupyterHub. - -JupyterHub is available at -[taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter) - -After logging, you can start a new session and configure it. There are simple and advanced forms to -set up your session. On the simple form, you have to choose the "IBM Power (ppc64le)" architecture. -You can select the required number of CPUs and GPUs. For the acquaintance with the system through -the examples below the recommended amount of CPUs and 1 GPU will be enough. -With the advanced form, you can use -the configuration with 1 GPU and 7 CPUs. To access for all your workspaces use " / " in the -workspace scope. Please check updates and details [here](../access/jupyterhub.md). 
- -Several Tensorflow and PyTorch examples for the Jupyter notebook have been prepared based on some -simple tasks and models which will give you an understanding of how to work with ML frameworks and -JupyterHub. It could be found as the [attachment] **todo** %ATTACHURL%/machine_learning_example.py -in the bottom of the page. A detailed explanation and examples for TensorFlow can be found -[here](tensorflow_on_jupyter_notebook.md). For the Pytorch - [here](pytorch.md). Usage information -about the environments for the JupyterHub could be found [here](../access/jupyterhub.md) in the chapter -*Creating and using your own environment*. - -Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are -available. (25.02.20) - -### Containers - -Some machine learning tasks such as benchmarking require using containers. A container is a standard -unit of software that packages up code and all its dependencies so the application runs quickly and -reliably from one computing environment to another. Using containers gives you more flexibility -working with modules and software but at the same time requires more effort. - -On Taurus [Singularity](https://sylabs.io/) is used as a standard container solution. Singularity -enables users to have full control of their environment. This means that **you dont have to ask an -HPC support to install anything for you - you can put it in a Singularity container and run!**As -opposed to Docker (the beat-known container solution), Singularity is much more suited to being used -in an HPC environment and more efficient in many cases. Docker containers also can easily be used by -Singularity from the [DockerHub](https://hub.docker.com) for instance. Also, some containers are -available in [Singularity Hub](https://singularity-hub.org/). - -The simplest option to start working with containers on HPC-DA is importing from Docker or -SingularityHub container with TensorFlow. It does **not require root privileges** and so works on -Taurus directly: - -```Bash -srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash #allocating resourses from ml nodes to start the job to create a container. -singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version -singularity run --nv my-ML-container.sif #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec -``` - -There are two sources for containers for Power9 architecture with -Tensorflow and PyTorch on the board: - -* [Tensorflow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le): - Community-supported ppc64le docker container for TensorFlow. -* [PowerAI container](https://hub.docker.com/r/ibmcom/powerai/): - Official Docker container with Tensorflow, PyTorch and many other packages. - Heavy container. It requires a lot of space. Could be found on Taurus. - -Note: You could find other versions of software in the container on the "tag" tab on the docker web -page of the container. - -To use not a pure Tensorflow, PyTorch but also with some Python packages -you have to use the definition file to create the container -(bootstrapping). For details please see the [Container](containers.md) page -from our wiki. Bootstrapping **has required root privileges** and -Virtual Machine (VM) should be used! 
There are two main options on how -to work with VM on Taurus: [VM tools](vm_tools.md) - automotive algorithms -for using virtual machines; [Manual method](virtual_machines.md) - it requires more -operations but gives you more flexibility and reliability. - -- [machine_learning_example.py] **todo** %ATTACHURL%/machine_learning_example.py: - machine_learning_example.py -- [example_TensofFlow_MNIST.zip] **todo** %ATTACHURL%/example_TensofFlow_MNIST.zip: - example_TensofFlow_MNIST.zip -- [example_Pytorch_MNIST.zip] **todo** %ATTACHURL%/example_Pytorch_MNIST.zip: - example_Pytorch_MNIST.zip -- [example_Pytorch_image_recognition.zip] **todo** %ATTACHURL%/example_Pytorch_image_recognition.zip: - example_Pytorch_image_recognition.zip -- [example_TensorFlow_Automobileset.zip] **todo** %ATTACHURL%/example_TensorFlow_Automobileset.zip: - example_TensorFlow_Automobileset.zip -- [HPC-Introduction.pdf] **todo** %ATTACHURL%/HPC-Introduction.pdf: - HPC-Introduction.pdf -- [HPC-DA-Introduction.pdf] **todo** %ATTACHURL%/HPC-DA-Introduction.pdf : - HPC-DA-Introduction.pdf diff --git a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md new file mode 100644 index 0000000000000000000000000000000000000000..92786013f0382c841eed253c71e4a39cbc1a9b62 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md @@ -0,0 +1,365 @@ +# Hyperparameter Optimization (OmniOpt) + +Classical simulation methods as well as machine learning methods (e.g. neural networks) have a large +number of hyperparameters that significantly determine the accuracy, efficiency, and transferability +of the method. In classical simulations, the hyperparameters are usually determined by adaptation to +measured values. Esp. in neural networks, the hyperparameters determine the network architecture: +number and type of layers, number of neurons, activation functions, measures against overfitting +etc. The most common methods to determine hyperparameters are intuitive testing, grid search or +random search. + +The tool OmniOpt performs hyperparameter optimization within a broad range of applications as +classical simulations or machine learning algorithms. OmniOpt is robust and it checks and installs +all dependencies automatically and fixes many problems in the background. While OmniOpt optimizes, +no further intervention is required. You can follow the ongoing output live in the console. +Overhead of OmniOpt is minimal and virtually imperceptible. + +## Quick start with OmniOpt + +The following instructions demonstrate the basic usage of OmniOpt on the ZIH system, based on the +hyperparameter optimization for a neural network. + +The typical OmniOpt workflow comprises at least the following steps: + +1. [Prepare application script and software environment](#prepare-application-script-and-software-environment) +1. [Configure and run OmniOpt](#configure-and-run-omniopt) +1. [Check and evaluate OmniOpt results](#check-and-evaluate-omniopt-results) + +### Prepare Application Script and Software Environment + +The following example application script was created from +[https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) +as a starting point. +Therein, a neural network is trained on the MNIST Fashion data set. + +There are the following script preparation steps for OmniOpt: + +1. 
Changing hard-coded hyperparameters (chosen here: batch size, epochs, size of layer 1 and 2) into + command line parameters. Esp. for this example, the Python module `argparse` (see the docs at + [https://docs.python.org/3/library/argparse.html](https://docs.python.org/3/library/argparse.html) + is used. + + ??? note "Parsing arguments in Python" + There are many ways for parsing arguments into Python scripts. The easiest approach is + the `sys` module (see + [www.geeksforgeeks.org/how-to-use-sys-argv-in-python](https://www.geeksforgeeks.org/how-to-use-sys-argv-in-python)), + which would be fully sufficient for usage with OmniOpt. Nevertheless, this basic approach + has no consistency checks or error handling etc. + +1. Mark the output of the optimization target (chosen here: average loss) by prefixing it with the + RESULT string. OmniOpt takes the **last appearing value** prefixed with the RESULT string. In + the example, different epochs are performed and the average from the last epoch is caught by + OmniOpt. Additionally, the `RESULT` output has to be a **single line**. After all these changes, + the final script is as follows (with the lines containing relevant changes highlighted). + + ??? example "Final modified Python script: MNIST Fashion " + + ```python linenums="1" hl_lines="18-33 52-53 66-68 72 74 76 85 125-126" + #!/usr/bin/env python + # coding: utf-8 + + # # Example for using OmniOpt + # + # source code taken from: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html + # parameters under consideration:# + # 1. batch size + # 2. epochs + # 3. size output layer 1 + # 4. size output layer 2 + + import torch + from torch import nn + from torch.utils.data import DataLoader + from torchvision import datasets + from torchvision.transforms import ToTensor, Lambda, Compose + import argparse + + # parsing hpyerparameters as arguments + parser = argparse.ArgumentParser(description="Demo application for OmniOpt for hyperparameter optimization, example: neural network on MNIST fashion data.") + + parser.add_argument("--out-layer1", type=int, help="the number of outputs of layer 1", default = 512) + parser.add_argument("--out-layer2", type=int, help="the number of outputs of layer 2", default = 512) + parser.add_argument("--batchsize", type=int, help="batchsize for training", default = 64) + parser.add_argument("--epochs", type=int, help="number of epochs", default = 5) + + args = parser.parse_args() + + batch_size = args.batchsize + epochs = args.epochs + num_nodes_out1 = args.out_layer1 + num_nodes_out2 = args.out_layer2 + + # Download training data from open data sets. + training_data = datasets.FashionMNIST( + root="data", + train=True, + download=True, + transform=ToTensor(), + ) + + # Download test data from open data sets. + test_data = datasets.FashionMNIST( + root="data", + train=False, + download=True, + transform=ToTensor(), + ) + + # Create data loaders. + train_dataloader = DataLoader(training_data, batch_size=batch_size) + test_dataloader = DataLoader(test_data, batch_size=batch_size) + + for X, y in test_dataloader: + print("Shape of X [N, C, H, W]: ", X.shape) + print("Shape of y: ", y.shape, y.dtype) + break + + # Get cpu or gpu device for training. 
+ device = "cuda" if torch.cuda.is_available() else "cpu" + print("Using {} device".format(device)) + + # Define model + class NeuralNetwork(nn.Module): + def __init__(self, out1, out2): + self.o1 = out1 + self.o2 = out2 + super(NeuralNetwork, self).__init__() + self.flatten = nn.Flatten() + self.linear_relu_stack = nn.Sequential( + nn.Linear(28*28, out1), + nn.ReLU(), + nn.Linear(out1, out2), + nn.ReLU(), + nn.Linear(out2, 10), + nn.ReLU() + ) + + def forward(self, x): + x = self.flatten(x) + logits = self.linear_relu_stack(x) + return logits + + model = NeuralNetwork(out1=num_nodes_out1, out2=num_nodes_out2).to(device) + print(model) + + loss_fn = nn.CrossEntropyLoss() + optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) + + def train(dataloader, model, loss_fn, optimizer): + size = len(dataloader.dataset) + for batch, (X, y) in enumerate(dataloader): + X, y = X.to(device), y.to(device) + + # Compute prediction error + pred = model(X) + loss = loss_fn(pred, y) + + # Backpropagation + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if batch % 200 == 0: + loss, current = loss.item(), batch * len(X) + print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]") + + def test(dataloader, model, loss_fn): + size = len(dataloader.dataset) + num_batches = len(dataloader) + model.eval() + test_loss, correct = 0, 0 + with torch.no_grad(): + for X, y in dataloader: + X, y = X.to(device), y.to(device) + pred = model(X) + test_loss += loss_fn(pred, y).item() + correct += (pred.argmax(1) == y).type(torch.float).sum().item() + test_loss /= num_batches + correct /= size + print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n") + + + #print statement esp. for OmniOpt (single line!!) + print(f"RESULT: {test_loss:>8f} \n") + + for t in range(epochs): + print(f"Epoch {t+1}\n-------------------------------") + train(train_dataloader, model, loss_fn, optimizer) + test(test_dataloader, model, loss_fn) + print("Done!") + ``` + +1. Testing script functionality and determine software requirements for the chosen + [partition](../jobs_and_resources/system_taurus.md#partitions). In the following, the alpha + partition is used. Please note the parameters `--out-layer1`, `--batchsize`, `--epochs` when + calling the Python script. Additionally, note the `RESULT` string with the output for OmniOpt. + + ??? hint "Hint for installing Python modules" + + Note that for this example the module `torchvision` is not available on the partition `alpha` + and it is installed by creating a [virtual environment](python_virtual_environments.md). It is + recommended to install such a virtual environment into a + [workspace](../data_lifecycle/workspaces.md). 
+ + ```console + marie@login$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0 + marie@login$ mkdir </path/to/workspace/python-environments> #create folder + marie@login$ virtualenv --system-site-packages </path/to/workspace/python-environments/torchvision_env> + marie@login$ source </path/to/workspace/python-environments/torchvision_env>/bin/activate #activate virtual environment + marie@login$ pip install torchvision #install torchvision module + ``` + + ```console + # Job submission on alpha nodes with 1 GPU on 1 node with 800 MB per CPU + marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash + marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0 + # Activate virtual environment + marie@alpha$ source </path/to/workspace/python-environments/torchvision_env>/bin/activate + The following have been reloaded with a version change: + 1) modenv/scs5 => modenv/hiera + + Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies loaded. + marie@alpha$ python </path/to/your/script/mnistFashion.py> --out-layer1=200 --batchsize=10 --epochs=3 + [...] + Epoch 3 + ------------------------------- + loss: 1.422406 [ 0/60000] + loss: 0.852647 [10000/60000] + loss: 1.139685 [20000/60000] + loss: 0.572221 [30000/60000] + loss: 1.516888 [40000/60000] + loss: 0.445737 [50000/60000] + Test Error: + Accuracy: 69.5%, Avg loss: 0.878329 + + RESULT: 0.878329 + + Done! + ``` + +Using the modified script within OmniOpt requires configuring and loading of the software +environment. The recommended way is to wrap the necessary calls in a shell script. + +??? example "Example for wrapping with shell script" + + ```bash + #!/bin/bash -l + # ^ Shebang-Line, so that it is known that this is a bash file + # -l means 'load this as login shell', so that /etc/profile gets loaded and you can use 'module load' or 'ml' as usual + + # If you don't use this script via `./run.sh' or just `srun run.sh', but like `srun bash run.sh', please add the '-l' there too. + # Like this: + # srun bash -l run.sh + + # Load modules your program needs, always specify versions! + module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.7.1 + source </path/to/workspace/python-environments/torchvision_env>/bin/activate #activate virtual environment + + # Load your script. $@ is all the parameters that are given to this shell file. + python </path/to/your/script/mnistFashion.py> $@ + ``` + +When the wrapped shell script is running properly, the preparations are finished and the next step +is configuring OmniOpt. + +### Configure and Run OmniOpt + +Configuring OmniOpt is done via the GUI at +[https://imageseg.scads.ai/omnioptgui/](https://imageseg.scads.ai/omnioptgui/). +This GUI guides through the configuration process and as result a configuration file is created +automatically according to the GUI input. If you are more familiar with using OmniOpt later on, +this configuration file can be modified directly without using the GUI. + +A screenshot of the GUI, including a properly configuration for the MNIST fashion example is shown +below. 
The GUI, in which the below displayed values are already entered, can be reached +[here](https://imageseg.scads.ai/omnioptgui/?maxevalserror=5&mem_per_worker=1000&number_of_parameters=3¶m_0_values=10%2C50%2C100¶m_1_values=8%2C16%2C32¶m_2_values=10%2C15%2C30¶m_0_name=out-layer1¶m_1_name=batchsize¶m_2_name=batchsize&account=&projectname=mnist_fashion_optimization_set_1&partition=alpha&searchtype=tpe.suggest¶m_0_type=hp.choice¶m_1_type=hp.choice¶m_2_type=hp.choice&max_evals=1000&objective_program=bash%20%3C%2Fpath%2Fto%2Fwrapper-script%2Frun-mnist-fashion.sh%3E%20--out-layer1%3D%28%24x_0%29%20--batchsize%3D%28%24x_1%29%20--epochs%3D%28%24x_2%29&workdir=%3C%2Fscratch%2Fws%2Fomniopt-workdir%2F%3E). + +Please modify the paths for `objective program` and `workdir` according to your needs. + + +{: align="center"} + +Using OmniOpt for a first trial example, it is often sufficient to concentrate on the following +configuration parameters: + +1. **Optimization run name:** A name for an OmniOpt run given a belonging configuration. +1. **Partition:** Choose the partition on the ZIH system that fits the programs' needs. +1. **Enable GPU:** Decide whether a program could benefit from GPU usage or not. +1. **Workdir:** The directory where OmniOpt is saving its necessary files and all results. Derived + from the optimization run name, each configuration creates a single directory. + Make sure that this working directory is writable from the compute nodes. It is recommended to + use a [workspace](../data_lifecycle/workspaces.md). +1. **Objective program:** Provide all information for program execution. Typically, this will + contain the command for executing a wrapper script. +1. **Parameters:** The hyperparameters to be optimized with the names OmniOpt should use. For the + example here, the variable names are identical to the input parameters of the Python script. + However, these names can be chosen differently, since the connection to OmniOpt is realized via + the variables (`$x_0`), (`$x_1`), etc. from the GUI section "Objective program". Please note that + it is not necessary to name the parameters explicitly in your script but only within the OmniOpt + configuration. + +After all parameters are entered into the GUI, the call for OmniOpt is generated automatically and +displayed on the right. This command contains all necessary instructions (including requesting +resources with Slurm). **Thus, this command can be executed directly on a login node on the ZIH +system.** + + +{: align="center"} + +After executing this command OmniOpt is doing all the magic in the background and there are no +further actions necessary. + +??? hint "Hints on the working directory" + + 1. Starting OmniOpt without providing a working directory will store OmniOpt into the present directory. + 1. Within the given working directory, a new folder named "omniopt" as default, is created. + 1. Within one OmniOpt working directory, there can be multiple optimization projects. + 1. It is possible to have as many working directories as you want (with multiple optimization runs). + 1. It is recommended to use a [workspace](../data_lifecycle/workspaces.md) as working directory, but not the home directory. + +### Check and Evaluate OmniOpt Results + +For getting informed about the current status of OmniOpt or for looking into results, the evaluation +tool of OmniOpt is used. Switch to the OmniOpt folder and run `evaluate-run.sh`. 
+ +``` console +marie@login$ bash </scratch/ws/omniopt-workdir/>evaluate-run.sh +``` + +After initializing and checking for updates in the background, OmniOpt is asking to select the +optimization run of interest. After selecting the optimization run, there will be a menu with the +items as shown below. If OmniOpt has still running jobs there appear some menu items that refer to +these running jobs (image shown below to the right). + +evaluation options (all jobs finished) | evaluation options (still running jobs) +:--------------------------------------------------------------:|:-------------------------: + |  + +For now, we assume that OmniOpt has finished already. +In order to look into the results, there are the following basic approaches. + +1. **Graphical approach:** + There are basically two graphical approaches: two dimensional scatter plots and parallel plots. + + Below there is shown a parallel plot from the MNIST fashion example. + {: align="center"} + + ??? hint "Hints on parallel plots" + + Parallel plots are suitable especially for dealing with multiple dimensions. The parallel + plot created by OmniOpt is an interactive `html` file that is stored in the OminOpt working + directory under `projects/<name_of_optimization_run>/parallel-plot`. The interactivity + of this plot is intended to make optimal combinations of the hyperparameters visible more + easily. Get more information about this interactivity by clicking the "Help" button at the + top of the graphic (see red arrow on the image above). + + After creating a 2D scatter plot or a parallel plot, OmniOpt will try to display the + corresponding file (`html`, `png`) directly on the ZIH system. Therefore, it is necessary to + login via ssh with the option `-X` (X11 forwarding), e.g., `ssh -X taurus.hrsk.tu-dresden.de`. + Nevertheless, because of latency using x11 forwarding, it is recommended to download the created + files and explore them on the local machine (esp. for the parallel plot). The created files are + saved at `projects/<name_of_optimization_run>/{2d-scatterplots,parallel-plot}`. + +1. **Getting the raw data:** + As a second approach, the raw data of the optimization process can be exported as a CSV file. + The created output files are stored in the folder `projects/<name_of_optimization_run>/csv`. diff --git a/doc.zih.tu-dresden.de/docs/software/keras.md b/doc.zih.tu-dresden.de/docs/software/keras.md deleted file mode 100644 index 356e5b17e0ed1a3224ef815629e456391192b5ba..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/keras.md +++ /dev/null @@ -1,237 +0,0 @@ -# Keras - -This is an introduction on how to run a -Keras machine learning application on the new machine learning partition -of Taurus. - -Keras is a high-level neural network API, -written in Python and capable of running on top of -[TensorFlow](https://github.com/tensorflow/tensorflow). -In this page, [Keras](https://www.tensorflow.org/guide/keras) will be -considered as a TensorFlow's high-level API for building and training -deep learning models. Keras includes support for TensorFlow-specific -functionality, such as [eager execution](https://www.tensorflow.org/guide/keras#eager_execution) -, [tf.data](https://www.tensorflow.org/api_docs/python/tf/data) pipelines -and [estimators](https://www.tensorflow.org/guide/estimator). - -On the machine learning nodes (machine learning partition), you can use -the tools from [IBM Power AI](./power_ai.md). 
PowerAI is an enterprise -software distribution that combines popular open-source deep learning -frameworks, efficient AI development tools (Tensorflow, Caffe, etc). - -In machine learning partition (modenv/ml) Keras is available as part of -the Tensorflow library at Taurus and also as a separate module named -"Keras". For using Keras in machine learning partition you have two -options: - -- use Keras as part of the TensorFlow module; -- use Keras separately and use Tensorflow as an interface between - Keras and GPUs. - -**Prerequisites**: To work with Keras you, first of all, need -[access](../access/ssh_login.md) for the Taurus system, loaded -Tensorflow module on ml partition, activated Python virtual environment. -Basic knowledge about Python, SLURM system also required. - -**Aim** of this page is to introduce users on how to start working with -Keras and TensorFlow on the [HPC-DA](../jobs_and_resources/hpcda.md) -system - part of the TU Dresden HPC system. - -There are three main options on how to work with Keras and Tensorflow on -the HPC-DA: 1. Modules; 2. JupyterNotebook; 3. Containers. One of the -main ways is using the **TODO LINK MISSING** (Modules -system)(RuntimeEnvironment#Module_Environments) and Python virtual -environment. Please see the -[Python page](./python.md) for the HPC-DA -system. - -The information about the Jupyter notebook and the **JupyterHub** could -be found [here](../access/jupyterhub.md). The use of -Containers is described [here](tensorflow_container_on_hpcda.md). - -Keras contains numerous implementations of commonly used neural-network -building blocks such as layers, -[objectives](https://en.wikipedia.org/wiki/Objective_function), -[activation functions](https://en.wikipedia.org/wiki/Activation_function) -[optimizers](https://en.wikipedia.org/wiki/Mathematical_optimization), -and a host of tools -to make working with image and text data easier. Keras, for example, has -a library for preprocessing the image data. - -The core data structure of Keras is a -**model**, a way to organize layers. The Keras functional API is the way -to go for defining as simple (sequential) as complex models, such as -multi-output models, directed acyclic graphs, or models with shared -layers. - -## Getting started with Keras - -This example shows how to install and start working with TensorFlow and -Keras (using the module system). To get started, import [tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras) -as part of your TensorFlow program setup. -tf.keras is TensorFlow's implementation of the [Keras API -specification](https://keras.io/). This is a modified example that we -used for the [Tensorflow page](./tensorflow.md). - -```bash -srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=8000 bash - -module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - -mkdir python-virtual-environments -cd python-virtual-environments -module load TensorFlow #example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. 
-which python -python3 -m venv --system-site-packages env #create virtual environment "env" which inheriting with global site packages -source env/bin/activate #example output: (env) bash-4.2$ -module load TensorFlow -python -import tensorflow as tf -from tensorflow.keras import layers - -print(tf.VERSION) #example output: 1.10.0 -print(tf.keras.__version__) #example output: 2.1.6-tf -``` - -As was said the core data structure of Keras is a **model**, a way to -organize layers. In Keras, you assemble *layers* to build *models*. A -model is (usually) a graph of layers. For our example we use the most -common type of model is a stack of layers. The below [example](https://www.tensorflow.org/guide/keras#model_subclassing) -of using the advanced model with model -subclassing and custom layers illustrate using TF-Keras API. - -```python -import tensorflow as tf -from tensorflow.keras import layers -import numpy as np - -# Numpy arrays to train and evaluate a model -data = np.random.random((50000, 32)) -labels = np.random.random((50000, 10)) - -# Create a custom layer by subclassing -class MyLayer(layers.Layer): - - def __init__(self, output_dim, **kwargs): - self.output_dim = output_dim - super(MyLayer, self).__init__(**kwargs) - -# Create the weights of the layer - def build(self, input_shape): - shape = tf.TensorShape((input_shape[1], self.output_dim)) -# Create a trainable weight variable for this layer - self.kernel = self.add_weight(name='kernel', - shape=shape, - initializer='uniform', - trainable=True) - super(MyLayer, self).build(input_shape) -# Define the forward pass - def call(self, inputs): - return tf.matmul(inputs, self.kernel) - -# Specify how to compute the output shape of the layer given the input shape. - def compute_output_shape(self, input_shape): - shape = tf.TensorShape(input_shape).as_list() - shape[-1] = self.output_dim - return tf.TensorShape(shape) - -# Serializing the layer - def get_config(self): - base_config = super(MyLayer, self).get_config() - base_config['output_dim'] = self.output_dim - return base_config - - @classmethod - def from_config(cls, config): - return cls(**config) -# Create a model using your custom layer -model = tf.keras.Sequential([ - MyLayer(10), - layers.Activation('softmax')]) - -# The compile step specifies the training configuration -model.compile(optimizer=tf.compat.v1.train.RMSPropOptimizer(0.001), - loss='categorical_crossentropy', - metrics=['accuracy']) - -# Trains for 10 epochs(steps). -model.fit(data, labels, batch_size=32, epochs=10) -``` - -## Running the sbatch script on ML modules (modenv/ml) - -Generally, for machine learning purposes ml partition is used but for -some special issues, SCS5 partition can be useful. The following sbatch -script will automatically execute the above Python script on ml -partition. If you have a question about the sbatch script see the -article about [SLURM](./../jobs_and_resources/binding_and_distribution_of_tasks.md). -Keep in mind that you need to put the executable file (Keras_example) with -python code to the same folder as bash script or specify the path. - -```bash -#!/bin/bash -#SBATCH --mem=4GB # specify the needed memory -#SBATCH -p ml # specify ml partition -#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. 
use one GPU per task) -#SBATCH --nodes=1 # request 1 node -#SBATCH --time=00:05:00 # runs for 5 minutes -#SBATCH -c 16 # how many cores per task allocated -#SBATCH -o HLR_Keras_example.out # save output message under HLR_${SLURMJOBID}.out -#SBATCH -e HLR_Keras_example.err # save error messages under HLR_${SLURMJOBID}.err - -module load modenv/ml -module load TensorFlow - -python Keras_example.py - -## when finished writing, submit with: sbatch <script_name> -``` - -Output results and errors file you can see in the same folder in the -corresponding files after the end of the job. Part of the example -output: - -``` -...... -Epoch 9/10 -50000/50000 [==============================] - 2s 37us/sample - loss: 11.5159 - acc: 0.1000 -Epoch 10/10 -50000/50000 [==============================] - 2s 37us/sample - loss: 11.5159 - acc: 0.1020 -``` - -## Tensorflow 2 - -[TensorFlow 2.0](https://blog.tensorflow.org/2019/09/tensorflow-20-is-now-available.html) -is a significant milestone for the -TensorFlow and the community. There are multiple important changes for -users. - -Tere are a number of TensorFlow 2 modules for both ml and scs5 -partitions in Taurus (2.0 (anaconda), 2.0 (python), 2.1 (python)) -(11.04.20). Please check **TODO MISSING DOC**(the software modules list)(./SoftwareModulesList.md -for the information about available -modules. - -<span style="color:red">**NOTE**</span>: Tensorflow 2 of the -current version is loading by default as a Tensorflow module. - -TensorFlow 2.0 includes many API changes, such as reordering arguments, -renaming symbols, and changing default values for parameters. Thus in -some cases, it makes code written for the TensorFlow 1 not compatible -with TensorFlow 2. However, If you are using the high-level APIs -**(tf.keras)** there may be little or no action you need to take to make -your code fully TensorFlow 2.0 [compatible](https://www.tensorflow.org/guide/migrate). -It is still possible to run 1.X code, -unmodified ([except for contrib](https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md) -), in TensorFlow 2.0: - -```python -import tensorflow.compat.v1 as tf -tf.disable_v2_behavior() #instead of "import tensorflow as tf" -``` - -To make the transition to TF 2.0 as seamless as possible, the TensorFlow -team has created the [tf_upgrade_v2](https://www.tensorflow.org/guide/upgrade) -utility to help transition legacy code to the new API. - -## F.A.Q: diff --git a/doc.zih.tu-dresden.de/docs/software/machine_learning.md b/doc.zih.tu-dresden.de/docs/software/machine_learning.md index e80e6c346dfbeff977fdf74fc251507cc171bbcb..ecbb9e146276aff67d6079579f2163fa6d7dbf74 100644 --- a/doc.zih.tu-dresden.de/docs/software/machine_learning.md +++ b/doc.zih.tu-dresden.de/docs/software/machine_learning.md @@ -1,59 +1,170 @@ # Machine Learning -On the machine learning nodes, you can use the tools from [IBM Power -AI](power_ai.md). +This is an introduction of how to run machine learning applications on ZIH systems. +For machine learning purposes, we recommend to use the partitions [Alpha](#alpha-partition) and/or +[ML](#ml-partition). -## Interactive Session Examples +## ML Partition -### Tensorflow-Test +The compute nodes of the partition ML are built on the base of +[Power9 architecture](https://www.ibm.com/it-infrastructure/power/power9) from IBM. The system was created +for AI challenges, analytics and working with data-intensive workloads and accelerated databases. 
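+
+As a quick orientation, the Power9 nodes are reached via Slurm like any other partition. A minimal
+interactive allocation could look like the following sketch; the requested resources are only
+placeholders, and `nvidia-smi` merely serves as a check that a GPU is visible in the job:
+
+```console
+marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash  #interactive job with 1 GPU on partition ML
+marie@ml$ nvidia-smi  #check that one of the Tesla V100 GPUs is visible
+```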
- tauruslogin6 :~> srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=10000 bash - srun: job 4374195 queued and waiting for resources - srun: job 4374195 has been allocated resources - taurusml22 :~> ANACONDA2_INSTALL_PATH='/opt/anaconda2' - taurusml22 :~> ANACONDA3_INSTALL_PATH='/opt/anaconda3' - taurusml22 :~> export PATH=$ANACONDA3_INSTALL_PATH/bin:$PATH - taurusml22 :~> source /opt/DL/tensorflow/bin/tensorflow-activate - taurusml22 :~> tensorflow-test - Basic test of tensorflow - A Hello World!!!... +The main feature of the nodes is the ability to work with the +[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link** +support that allows a total bandwidth with up to 300 GB/s. Each node on the +partition ML has 6x Tesla V-100 GPUs. You can find a detailed specification of the partition in our +[Power9 documentation](../jobs_and_resources/power9.md). - #or: - taurusml22 :~> module load TensorFlow/1.10.0-PythonAnaconda-3.6 +!!! note -Or to use the whole node: `--gres=gpu:6 --exclusive --pty` + The partition ML is based on the Power9 architecture, which means that the software built + for x86_64 will not work on this partition. Also, users need to use the modules which are + specially build for this architecture (from `modenv/ml`). -### In Singularity container: +### Modules - rotscher@tauruslogin6:~> srun -p ml --gres=gpu:6 --pty bash - [rotscher@taurusml22 ~]$ singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> export PATH=/opt/anaconda3/bin:$PATH - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> . /opt/DL/tensorflow/bin/tensorflow-activate - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> tensorflow-test +On the partition ML load the module environment: -## Additional libraries +```console +marie@ml$ module load modenv/ml +The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml +``` + +### Power AI + +There are tools provided by IBM, that work on partition ML and are related to AI tasks. +For more information see our [Power AI documentation](power_ai.md). + +## Alpha Partition + +Another partition for machine learning tasks is Alpha. It is mainly dedicated to +[ScaDS.AI](https://scads.ai/) topics. Each node on Alpha has 2x AMD EPYC CPUs, 8x NVIDIA A100-SXM4 +GPUs, 1 TB RAM and 3.5 TB local space (`/tmp`) on an NVMe device. You can find more details of the +partition in our [Alpha Centauri](../jobs_and_resources/alpha_centauri.md) documentation. + +### Modules + +On the partition **Alpha** load the module environment: + +```console +marie@alpha$ module load modenv/hiera +The following have been reloaded with a version change: 1) modenv/ml => modenv/hiera +``` + +!!! note + + On partition Alpha, the most recent modules are build in `hiera`. Alternative modules might be + build in `scs5`. + +## Machine Learning via Console + +### Python and Virtual Environments + +Python users should use a [virtual environment](python_virtual_environments.md) when conducting +machine learning tasks via console. + +For more details on machine learning or data science with Python see +[data analytics with Python](data_analytics_with_python.md). + +### R + +R also supports machine learning via console. It does not require a virtual environment due to a +different package management. + +For more details on machine learning or data science with R see +[data analytics with R](data_analytics_with_r.md#r-console). 
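+
+For illustration, a short console session with R could look like the following sketch. The exact
+module name and version may differ, so check with `module spider R` first; the one-liner only runs
+a small clustering on a built-in dataset:
+
+```console
+marie@compute$ module spider R  #list the available R modules
+marie@compute$ module load R  #load the default R module (adjust to the output above)
+marie@compute$ Rscript -e 'km <- kmeans(iris[, 1:4], centers = 3); print(km$centers)'
+```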
+ +## Machine Learning with Jupyter + +The [Jupyter Notebook](https://jupyter.org/) is an open-source web application that allows you to +create documents containing live code, equations, visualizations, and narrative text. +[JupyterHub](../access/jupyterhub.md) allows to work with machine learning frameworks (e.g. +TensorFlow or PyTorch) on ZIH systems and to run your Jupyter notebooks on HPC nodes. + +After accessing JupyterHub, you can start a new session and configure it. For machine learning +purposes, select either partition **Alpha** or **ML** and the resources, your application requires. + +In your session you can use [Python](data_analytics_with_python.md#jupyter-notebooks), +[R](data_analytics_with_r.md#r-in-jupyterhub) or [RStudio](data_analytics_with_rstudio.md) for your +machine learning and data science topics. + +## Machine Learning with Containers + +Some machine learning tasks require using containers. In the HPC domain, the +[Singularity](https://singularity.hpcng.org/) container system is a widely used tool. Docker +containers can also be used by Singularity. You can find further information on working with +containers on ZIH systems in our [containers documentation](containers.md). + +There are two sources for containers for Power9 architecture with TensorFlow and PyTorch on the +board: + +* [TensorFlow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le): + Community-supported `ppc64le` docker container for TensorFlow. +* [PowerAI container](https://hub.docker.com/r/ibmcom/powerai/): + Official Docker container with TensorFlow, PyTorch and many other packages. + +!!! note + + You could find other versions of software in the container on the "tag" tab on the docker web + page of the container. + +In the following example, we build a Singularity container with TensorFlow from the DockerHub and +start it: + +```console +marie@ml$ singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version +[...] +marie@ml$ singularity run --nv my-ML-container.sif #run my-ML-container.sif container supporting the Nvidia's GPU. You can also work with your container by: singularity shell, singularity exec +[...] +``` + +## Additional Libraries for Machine Learning The following NVIDIA libraries are available on all nodes: -| | | -|-------|---------------------------------------| -| NCCL | /usr/local/cuda/targets/ppc64le-linux | -| cuDNN | /usr/local/cuda/targets/ppc64le-linux | +| Name | Path | +|-------|-----------------------------------------| +| NCCL | `/usr/local/cuda/targets/ppc64le-linux` | +| cuDNN | `/usr/local/cuda/targets/ppc64le-linux` | + +!!! note -Note: For optimal NCCL performance it is recommended to set the -**NCCL_MIN_NRINGS** environment variable during execution. You can try -different values but 4 should be a pretty good starting point. + For optimal NCCL performance it is recommended to set the + **NCCL_MIN_NRINGS** environment variable during execution. You can try + different values but 4 should be a pretty good starting point. 
-    export NCCL_MIN_NRINGS=4
+```console
+marie@compute$ export NCCL_MIN_NRINGS=4
+```
 
-\<span style="color: #222222; font-size: 1.385em;">HPC\</span>
+### HPC-Related Software
 
 The following HPC related software is installed on all nodes:
 
-| | |
-|------------------|------------------------|
-| IBM Spectrum MPI | /opt/ibm/spectrum_mpi/ |
-| PGI compiler | /opt/pgi/ |
-| IBM XLC Compiler | /opt/ibm/xlC/ |
-| IBM XLF Compiler | /opt/ibm/xlf/ |
-| IBM ESSL | /opt/ibmmath/essl/ |
-| IBM PESSL | /opt/ibmmath/pessl/ |
+| Name | Path |
+|------------------|--------------------------|
+| IBM Spectrum MPI | `/opt/ibm/spectrum_mpi/` |
+| PGI compiler | `/opt/pgi/` |
+| IBM XLC Compiler | `/opt/ibm/xlC/` |
+| IBM XLF Compiler | `/opt/ibm/xlf/` |
+| IBM ESSL | `/opt/ibmmath/essl/` |
+| IBM PESSL | `/opt/ibmmath/pessl/` |
+
+## Datasets for Machine Learning
+
+There are many different datasets designed for research purposes. Before downloading one of them,
+keep in mind that many machine learning libraries provide direct access to public datasets without
+a manual download, e.g. [TensorFlow Datasets](https://www.tensorflow.org/datasets). If you still
+need to download a dataset, use the [datamover](../data_transfer/datamover.md) machine.
+
+### The ImageNet Dataset
+
+The ImageNet project is a large visual database designed for use in visual object recognition
+software research. To save filesystem space and avoid multiple duplicates lying around, we provide
+a copy of the ImageNet database (ILSVRC2012 and ILSVRC2017) under `/scratch/imagenet`, which you
+can use without having to download it again. In the future, the ImageNet dataset will be available
+in the [Warm Archive](../data_lifecycle/workspaces.md#mid-term-storage). ILSVRC2017 also includes a
+dataset for object recognition from video. Please respect the corresponding
+[Terms of Use](https://image-net.org/download.php).
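+
+As an illustration of the direct dataset access mentioned above, a minimal Python sketch with
+TensorFlow Datasets could look as follows. It assumes an environment that provides the
+`tensorflow_datasets` package (e.g. installed into a virtual environment on top of a TensorFlow
+module); the dataset name `mnist` is only an example, and the initial download has to be prepared
+where internet access is available.
+
+```python
+import tensorflow_datasets as tfds
+
+# Load the MNIST training split directly through the library;
+# the data is fetched and cached automatically on first use.
+ds_train = tfds.load("mnist", split="train", as_supervised=True)
+
+# Inspect a single (image, label) pair.
+for image, label in ds_train.take(1):
+    print(image.shape, int(label))
+```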
diff --git a/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-evaluate-menu.png b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-evaluate-menu.png new file mode 100644 index 0000000000000000000000000000000000000000..6d425818925017b52e455ddfb92b00904a0f302d Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-evaluate-menu.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-graph-result.png b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-graph-result.png new file mode 100644 index 0000000000000000000000000000000000000000..8dbbec668465134bbd35a78d63052b7c7d253d0e Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-graph-result.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-parallel-plot.png b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-parallel-plot.png new file mode 100644 index 0000000000000000000000000000000000000000..3702d69383fe4cb248456102f97e8a7fc8127ca0 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-parallel-plot.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-still-running-jobs.png b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-still-running-jobs.png new file mode 100644 index 0000000000000000000000000000000000000000..d4cd05138805d13e6eedd61b3ad8b0c5c9416afe Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-still-running-jobs.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/Pytorch_jupyter_module.png b/doc.zih.tu-dresden.de/docs/software/misc/Pytorch_jupyter_module.png new file mode 100644 index 0000000000000000000000000000000000000000..5f3e324da2114dc24382f57dfeb14c10554d60f5 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/Pytorch_jupyter_module.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_r_RStudio_launcher.png b/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_r_RStudio_launcher.png deleted file mode 100644 index fd50be1824655ef7e39c2adf74287fa14a716148..0000000000000000000000000000000000000000 Binary files a/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_r_RStudio_launcher.png and /dev/null differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_rstudio_launcher.jpg b/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_rstudio_launcher.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8f12eb7e8afc8c1c12c1d772ccb391791ec3b550 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_rstudio_launcher.jpg differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch b/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch index 5a418a9c5e98f70b027618a4da1158010619556b..2fcf3aa39b8e66b004fa0fed621475e3200f9d76 100644 --- a/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch +++ b/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch @@ -3,10 +3,10 @@ #SBATCH --partition=haswell #SBATCH --nodes=1 #SBATCH --exclusive -#SBATCH --mem=60G +#SBATCH --mem=50G #SBATCH -J "example-spark" -ml Spark +ml Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0 function myExitHandler () { stop-all.sh @@ -20,7 +20,7 @@ trap myExitHandler EXIT start-all.sh -spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000 +spark-submit --class org.apache.spark.examples.SparkPi 
$SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000 stop-all.sh diff --git a/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-GUI.png b/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-GUI.png new file mode 100644 index 0000000000000000000000000000000000000000..c292e7cefb46224585894acc8623e1bfa9878052 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-GUI.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-final-command.png b/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-final-command.png new file mode 100644 index 0000000000000000000000000000000000000000..b0b714462939f9acbd2e25e0d0eb39b431dba5de Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-final-command.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png b/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png new file mode 100644 index 0000000000000000000000000000000000000000..1327ee6304faf4b293c385981a750f362063ecbf Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/overview.md b/doc.zih.tu-dresden.de/docs/software/overview.md index 835d22204fcda298899f49d4b2a95b092b7e3da1..f8f4bf32b66c73234ad6db3cb728662e0d33dd7e 100644 --- a/doc.zih.tu-dresden.de/docs/software/overview.md +++ b/doc.zih.tu-dresden.de/docs/software/overview.md @@ -29,11 +29,11 @@ list]**todo link**. <!--After logging in, you are on one of the login nodes. They are not meant for work, but only for the--> <!--login process and short tests. Allocating resources will be done by batch system--> -<!--[SLURM](../jobs_and_resources/slurm.md).--> +<!--[Slurm](../jobs_and_resources/slurm.md).--> ## Modules -Usage of software on HPC systems, e.g., frameworks, compilers, loader and libraries, is +Usage of software on ZIH systems, e.g., frameworks, compilers, loader and libraries, is almost always managed by a **modules system**. Thus, it is crucial to be familiar with the [modules concept and its commands](modules.md). A module is a user interface that provides utilities for the dynamic modification of a user's environment without manual modifications. @@ -47,8 +47,8 @@ The [Jupyter Notebook](https://jupyter.org/) is an open-source web application t documents containing live code, equations, visualizations, and narrative text. There is a [JupyterHub](../access/jupyterhub.md) service on ZIH systems, where you can simply run your Jupyter notebook on compute nodes using [modules](#modules), preloaded or custom virtual environments. -Moreover, you can run a [manually created remote jupyter server](deep_learning.md) for more specific -cases. +Moreover, you can run a [manually created remote jupyter server](../archive/install_jupyter.md) +for more specific cases. ## Containers diff --git a/doc.zih.tu-dresden.de/docs/software/power_ai.md b/doc.zih.tu-dresden.de/docs/software/power_ai.md index dc0fa59b3fc53e180bd620dde71df5597c33298f..37de0d0a05ecf8113b86ca9a550285184cf202a7 100644 --- a/doc.zih.tu-dresden.de/docs/software/power_ai.md +++ b/doc.zih.tu-dresden.de/docs/software/power_ai.md @@ -2,81 +2,56 @@ There are different documentation sources for users to learn more about the PowerAI Framework for Machine Learning. 
In the following the links -are valid for PowerAI version 1.5.4 +are valid for PowerAI version 1.5.4. -## General Overview: +!!! warning + The information provided here is available from IBM and can be used on `ml` partition only! -- \<a - href="<https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.3/welcome/welcome.htm>" - target="\_blank" title="Landing Page">Landing Page\</a> (note that - you can select different PowerAI versions with the drop down menu - "Change Product or version") -- \<a - href="<https://developer.ibm.com/linuxonpower/deep-learning-powerai/>" - target="\_blank" title="PowerAI Developer Portal">PowerAI Developer - Portal \</a>(Some Use Cases and examples) -- \<a - href="<https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html>" - target="\_blank" title="Included Software Packages">Included - Software Packages\</a> (note that you can select different PowerAI - versions with the drop down menu "Change Product or version") - -## Specific User Howtos. Getting started with...: +## General Overview -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted.htm>" - target="\_blank" title="Getting Started with PowerAI">PowerAI\</a> -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_caffe.html>" - target="\_blank" title="Caffe">Caffe\</a> -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorflow.html?view=kc>" - target="\_blank" title="Tensorflow">TensorFlow\</a> -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorflow_prob.html?view=kc>" - target="\_blank" title="Tensorflow Probability">TensorFlow - Probability\</a>\<br />This release of PowerAI includes TensorFlow - Probability. TensorFlow Probability is a library for probabilistic - reasoning and statistical analysis in TensorFlow. -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorboard.html?view=kc>" - target="\_blank" title="Tensorboard">TensorBoard\</a> -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_snapml.html>" - target="\_blank">Snap ML\</a>\<br />This release of PowerAI includes - Snap Machine Learning (Snap ML). Snap ML is a library for training - generalized linear models. It is being developed at IBM with the - vision to remove training time as a bottleneck for machine learning - applications. Snap ML supports many classical machine learning - models and scales gracefully to data sets with billions of examples - or features. It also offers distributed training, GPU acceleration, - and supports sparse data structures. -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_pytorch.html>" - target="\_blank">PyTorch\</a>\<br />This release of PowerAI includes - the community development preview of PyTorch 1.0 (rc1). PowerAI's - PyTorch includes support for IBM's Distributed Deep Learning (DDL) - and Large Model Support (LMS). -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_caffe2ONNX.html>" - target="\_blank">Caffe2 and ONNX\</a>\<br />This release of PowerAI - includes a Technology Preview of Caffe2 and ONNX. Caffe2 is a - companion to PyTorch. PyTorch is great for experimentation and rapid - development, while Caffe2 is aimed at production environments. 
ONNX - (Open Neural Network Exchange) provides support for moving models - between those frameworks. -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_ddl.html?view=kc>" - target="\_blank" title="Distributed Deep Learning">Distributed Deep - Learning\</a> (DDL). \<br />Works up to 4 TaurusML worker nodes. - (Larger models with more nodes are possible with PowerAI Enterprise) +- [PowerAI Introduction](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.3/welcome/welcome.htm) + (note that you can select different PowerAI versions with the drop down menu + "Change Product or version") +- [PowerAI Developer Portal](https://developer.ibm.com/linuxonpower/deep-learning-powerai/) + (Some Use Cases and examples) +- [Included Software Packages](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html) + (note that you can select different PowerAI versions with the drop down menu "Change Product + or version") + +## Specific User Guides + +- [Getting Started with PowerAI](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted.htm) +- [Caffe](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_caffe.html) +- [TensorFlow](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorflow.html?view=kc) +- [TensorFlow Probability](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorflow_prob.html?view=kc) + This release of PowerAI includes TensorFlow Probability. TensorFlow Probability is a library + for probabilistic reasoning and statistical analysis in TensorFlow. +- [TensorBoard](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorboard.html?view=kc) +- [Snap ML](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_snapml.html) + This release of PowerAI includes Snap Machine Learning (Snap ML). Snap ML is a library for + training generalized linear models. It is being developed at IBM with the + vision to remove training time as a bottleneck for machine learning + applications. Snap ML supports many classical machine learning + models and scales gracefully to data sets with billions of examples + or features. It also offers distributed training, GPU acceleration, + and supports sparse data structures. +- [PyTorch](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_pytorch.html) + This release of PowerAI includes + the community development preview of PyTorch 1.0 (rc1). PowerAI's + PyTorch includes support for IBM's Distributed Deep Learning (DDL) + and Large Model Support (LMS). +- [Caffe2 and ONNX](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_caffe2ONNX.html) + This release of PowerAI includes a Technology Preview of Caffe2 and ONNX. Caffe2 is a + companion to PyTorch. PyTorch is great for experimentation and rapid + development, while Caffe2 is aimed at production environments. ONNX + (Open Neural Network Exchange) provides support for moving models + between those frameworks. +- [Distributed Deep Learning](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_ddl.html?view=kc) + Distributed Deep Learning (DDL). Works on up to 4 nodes on `ml` partition. ## PowerAI Container We have converted the official Docker container to Singularity. 
Here is a documentation about the Docker base container, including a table with the individual software versions of the packages installed within the -container: - -- \<a href="<https://hub.docker.com/r/ibmcom/powerai/>" - target="\_blank">PowerAI Docker Container Docu\</a> +container: [PowerAI Docker Container](https://hub.docker.com/r/ibmcom/powerai/). diff --git a/doc.zih.tu-dresden.de/docs/software/python.md b/doc.zih.tu-dresden.de/docs/software/python.md deleted file mode 100644 index b9bde2e2324d2d413c65f1cb4a6b34d45f5225bf..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/python.md +++ /dev/null @@ -1,298 +0,0 @@ -# Python for Data Analytics - -Python is a high-level interpreted language widely used in research and -science. Using HPC allows you to work with python quicker and more -effective. Taurus allows working with a lot of available packages and -libraries which give more useful functionalities and allow use all -features of Python and to avoid minuses. - -**Prerequisites:** To work with PyTorch you obviously need [access](../access/ssh_login.md) for the -Taurus system and basic knowledge about Python, Numpy and SLURM system. - -**Aim** of this page is to introduce users on how to start working with Python on the -[HPC-DA](../jobs_and_resources/power9.md) system - part of the TU Dresden HPC system. - -There are three main options on how to work with Keras and Tensorflow on the HPC-DA: 1. Modules; 2. -[JupyterNotebook](../access/jupyterhub.md); 3.[Containers](containers.md). The main way is using -the [Modules system](modules.md) and Python virtual environment. - -Note: You could work with simple examples in your home directory but according to -[HPCStorageConcept2019](../data_lifecycle/overview.md) please use **workspaces** -for your study and work projects. - -## Virtual environment - -There are two methods of how to work with virtual environments on -Taurus: - -1. **Vitualenv** is a standard Python tool to create isolated Python environments. - It is the preferred interface for - managing installations and virtual environments on Taurus and part of the Python modules. - -2. **Conda** is an alternative method for managing installations and -virtual environments on Taurus. Conda is an open-source package -management system and environment management system from Anaconda. The -conda manager is included in all versions of Anaconda and Miniconda. - -**Note:** Keep in mind that you **cannot** use virtualenv for working -with the virtual environments previously created with conda tool and -vice versa! Prefer virtualenv whenever possible. - -This example shows how to start working -with **Virtualenv** and Python virtual environment (using the module system) - -```Bash -srun -p ml -N 1 -n 1 -c 7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash #Job submission in ml nodes with 1 gpu on 1 node. - -mkdir python-environments # Optional: Create folder. Please use Workspaces! - -module load modenv/ml # Changing the environment. Example output: The following have been reloaded with a version change: 1 modenv/scs5 => modenv/ml -ml av Python #Check the available modules with Python -module load Python #Load default Python. Example output: Module Python/3.7 4-GCCcore-8.3.0 with 7 dependencies loaded -which python #Check which python are you using -virtualenv --system-site-packages python-environments/envtest #Create virtual environment -source python-environments/envtest/bin/activate #Activate virtual environment. 
Example output: (envtest) bash-4.2$ -python #Start python - -from time import gmtime, strftime -print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) #Example output: 2019-11-18 13:54:16 -deactivate #Leave the virtual environment -``` - -The [virtualenv](https://virtualenv.pypa.io/en/latest/) Python module (Python 3) provides support -for creating virtual environments with their own sitedirectories, optionally isolated from system -site directories. Each virtual environment has its own Python binary (which matches the version of -the binary that was used to create this environment) and can have its own independent set of -installed Python packages in its site directories. This allows you to manage separate package -installations for different projects. It essentially allows us to create a virtual isolated Python -installation and install packages into that virtual installation. When you switch projects, you can -simply create a new virtual environment and not have to worry about breaking the packages installed -in other environments. - -In your virtual environment, you can use packages from the (Complete List of -Modules)(SoftwareModulesList) or if you didn't find what you need you can install required packages -with the command: `pip install`. With the command `pip freeze`, you can see a list of all installed -packages and their versions. - -This example shows how to start working with **Conda** and virtual -environment (with using module system) - -```Bash -srun -p ml -N 1 -n 1 -c 7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash # Job submission in ml nodes with 1 gpu on 1 node. - -module load modenv/ml -mkdir conda-virtual-environments #create a folder -cd conda-virtual-environments #go to folder -which python #check which python are you using -module load PythonAnaconda/3.6 #load Anaconda module -which python #check which python are you using now - -conda create -n conda-testenv python=3.6 #create virtual environment with the name conda-testenv and Python version 3.6 -conda activate conda-testenv #activate conda-testenv virtual environment - -conda deactivate #Leave the virtual environment -``` - -You can control where a conda environment -lives by providing a path to a target directory when creating the -environment. For example, the following command will create a new -environment in a workspace located in `scratch` - -```Bash -conda create --prefix /scratch/ws/<name_of_your_workspace>/conda-virtual-environment/<name_of_your_environment> -``` - -Please pay attention, -using srun directly on the shell will lead to blocking and launch an -interactive job. Apart from short test runs, it is **recommended to -launch your jobs into the background by using Slurm**. For that, you can conveniently put -the parameters directly into the job file which you can submit using -`sbatch [options] <job file>.` - -## Jupyter Notebooks - -Jupyter notebooks are a great way for interactive computing in your web -browser. Jupyter allows working with data cleaning and transformation, -numerical simulation, statistical modelling, data visualization and of -course with machine learning. - -There are two general options on how to work Jupyter notebooks using -HPC. - -On Taurus, there is [JupyterHub](../access/jupyterhub.md) where you can simply run your Jupyter -notebook on HPC nodes. Also, you can run a remote jupyter server within a sbatch GPU job and with -the modules and packages you need. The manual server setup you can find [here](deep_learning.md). 
- -With Jupyterhub you can work with general -data analytics tools. This is the recommended way to start working with -the Taurus. However, some special instruments could not be available on -the Jupyterhub. - -**Keep in mind that the remote Jupyter server can offer more freedom with settings and approaches.** - -## MPI for Python - -Message Passing Interface (MPI) is a standardized and portable -message-passing standard designed to function on a wide variety of -parallel computing architectures. The Message Passing Interface (MPI) is -a library specification that allows HPC to pass information between its -various nodes and clusters. MPI designed to provide access to advanced -parallel hardware for end-users, library writers and tool developers. - -### Why use MPI? - -MPI provides a powerful, efficient and portable way to express parallel -programs. -Among many parallel computational models, message-passing has proven to be an effective one. - -### Parallel Python with mpi4py - -Mpi4py(MPI for Python) package provides bindings of the MPI standard for -the python programming language, allowing any Python program to exploit -multiple processors. - -#### Why use mpi4py? - -Mpi4py based on MPI-2 C++ bindings. It supports almost all MPI calls. -This implementation is popular on Linux clusters and in the SciPy -community. Operations are primarily methods of communicator objects. It -supports communication of pickleable Python objects. Mpi4py provides -optimized communication of NumPy arrays. - -Mpi4py is included as an extension of the SciPy-bundle modules on -taurus. - -Please check the SoftwareModulesList for the modules availability. The availability of the mpi4py -in the module you can check by -the `module whatis <name_of_the module>` command. The `module whatis` -command displays a short information and included extensions of the -module. - -Moreover, it is possible to install mpi4py in your local conda -environment: - -```Bash -srun -p ml --time=04:00:00 -n 1 --pty --mem-per-cpu=8000 bash #allocate recources -module load modenv/ml -module load PythonAnaconda/3.6 #load module to use conda -conda create --prefix=<location_for_your_environment> python=3.6 anaconda #create conda virtual environment - -conda activate <location_for_your_environment> #activate your virtual environment - -conda install -c conda-forge mpi4py #install mpi4py - -python #start python - -from mpi4py import MPI #verify your mpi4py -comm = MPI.COMM_WORLD -print("%d of %d" % (comm.Get_rank(), comm.Get_size())) -``` - -### Horovod - -[Horovod](https://github.com/horovod/horovod) is the open source distributed training -framework for TensorFlow, Keras, PyTorch. It is supposed to make it easy -to develop distributed deep learning projects and speed them up with -TensorFlow. - -#### Why use Horovod? - -Horovod allows you to easily take a single-GPU TensorFlow and Pytorch -program and successfully train it on many GPUs! In -some cases, the MPI model is much more straightforward and requires far -less code changes than the distributed code from TensorFlow for -instance, with parameter servers. Horovod uses MPI and NCCL which gives -in some cases better results than pure TensorFlow and PyTorch. - -#### Horovod as a module - -Horovod is available as a module with **TensorFlow** or **PyTorch**for **all** module environments. -Please check the [software module list](modules.md) for the current version of the software. 
-Horovod can be loaded like other software on the Taurus: - -```Bash -ml av Horovod #Check available modules with Python -module load Horovod #Loading of the module -``` - -#### Horovod installation - -However, if it is necessary to use Horovod with **PyTorch** or use -another version of Horovod it is possible to install it manually. To -install Horovod you need to create a virtual environment and load the -dependencies (e.g. MPI). Installing PyTorch can take a few hours and is -not recommended - -**Note:** You could work with simple examples in your home directory but **please use workspaces -for your study and work projects** (see the Storage concept). - -Setup: - -```Bash -srun -N 1 --ntasks-per-node=6 -p ml --time=08:00:00 --pty bash #allocate a Slurm job allocation, which is a set of resources (nodes) -module load modenv/ml #Load dependencies by using modules -module load OpenMPI/3.1.4-gcccuda-2018b -module load Python/3.6.6-fosscuda-2018b -module load cuDNN/7.1.4.18-fosscuda-2018b -module load CMake/3.11.4-GCCcore-7.3.0 -virtualenv --system-site-packages <location_for_your_environment> #create virtual environment -source <location_for_your_environment>/bin/activate #activate virtual environment -``` - -Or when you need to use conda: - -```Bash -srun -N 1 --ntasks-per-node=6 -p ml --time=08:00:00 --pty bash #allocate a Slurm job allocation, which is a set of resources (nodes) -module load modenv/ml #Load dependencies by using modules -module load OpenMPI/3.1.4-gcccuda-2018b -module load PythonAnaconda/3.6 -module load cuDNN/7.1.4.18-fosscuda-2018b -module load CMake/3.11.4-GCCcore-7.3.0 - -conda create --prefix=<location_for_your_environment> python=3.6 anaconda #create virtual environment - -conda activate <location_for_your_environment> #activate virtual environment -``` - -Install Pytorch (not recommended) - -```Bash -cd /tmp -git clone https://github.com/pytorch/pytorch #clone Pytorch from the source -cd pytorch #go to folder -git checkout v1.7.1 #Checkout version (example: 1.7.1) -git submodule update --init #Update dependencies -python setup.py install #install it with python -``` - -##### Install Horovod for Pytorch with python and pip - -In the example presented installation for the Pytorch without -TensorFlow. Adapt as required and refer to the horovod documentation for -details. 
- -```Bash -HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod -``` - -##### Verify that Horovod works - -```Bash -python #start python -import torch #import pytorch -import horovod.torch as hvd #import horovod -hvd.init() #initialize horovod -hvd.size() -hvd.rank() -print('Hello from:', hvd.rank()) -``` - -##### Horovod with NCCL - -If you want to use NCCL instead of MPI you can specify that in the -install command after loading the NCCL module: - -```Bash -module load NCCL/2.3.7-fosscuda-2018b -HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod -``` diff --git a/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md b/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md new file mode 100644 index 0000000000000000000000000000000000000000..e19daeeb6731aa32eb993f2495e6ec443bebe2dd --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md @@ -0,0 +1,126 @@ +# Python Virtual Environments + +Virtual environments allow users to install additional Python packages and create an isolated +run-time environment. We recommend using `virtualenv` for this purpose. In your virtual environment, +you can use packages from the [modules list](modules.md) or if you didn't find what you need you can +install required packages with the command: `pip install`. With the command `pip freeze`, you can +see a list of all installed packages and their versions. + +There are two methods of how to work with virtual environments on ZIH systems: + +1. **virtualenv** is a standard Python tool to create isolated Python environments. + It is the preferred interface for + managing installations and virtual environments on ZIH system and part of the Python modules. + +2. **conda** is an alternative method for managing installations and +virtual environments on ZIH system. conda is an open-source package +management system and environment management system from Anaconda. The +conda manager is included in all versions of Anaconda and Miniconda. + +!!! warning + + Keep in mind that you **cannot** use virtualenv for working + with the virtual environments previously created with conda tool and + vice versa! Prefer virtualenv whenever possible. + +## Python Virtual Environment + +This example shows how to start working with **virtualenv** and Python virtual environment (using +the module system). + +!!! hint + + We recommend to use [workspaces](../data_lifecycle/workspaces.md) for your virtual + environments. + +At first, we check available Python modules and load the preferred version: + +```console +marie@compute$ module avail Python #Check the available modules with Python +[...] +marie@compute$ module load Python #Load default Python +Module Python/3.7 2-GCCcore-8.2.0 with 10 dependencies loaded +marie@compute$ which python #Check which python are you using +/sw/installed/Python/3.7.2-GCCcore-8.2.0/bin/python +``` + +Then create the virtual environment and activate it. + +```console +marie@compute$ ws_allocate -F scratch python_virtual_environment 1 +Info: creating workspace. +/scratch/ws/1/python_virtual_environment +[...] +marie@compute$ virtualenv --system-site-packages /scratch/ws/1/python_virtual_environment/env #Create virtual environment +[...] +marie@compute$ source /scratch/ws/1/python_virtual_environment/env/bin/activate #Activate virtual environment. 
Example output: (envtest) bash-4.2$ +``` + +Now you can work in this isolated environment, without interfering with other tasks running on the +system. Note that the inscription (env) at the beginning of each line represents that you are in +the virtual environment. You can deactivate the environment as follows: + +```console +(env) marie@compute$ deactivate #Leave the virtual environment +``` + +## Conda Virtual Environment + +This example shows how to start working with **conda** and virtual environment (with using module +system). At first, we use an interactive job and create a directory for the conda virtual +environment: + +```console +marie@compute$ ws_allocate -F scratch conda_virtual_environment 1 +Info: creating workspace. +/scratch/ws/1/conda_virtual_environment +[...] +``` + +Then, we load Anaconda, create an environment in our directory and activate the environment: + +```console +marie@compute$ module load Anaconda3 #load Anaconda module +marie@compute$ conda create --prefix /scratch/ws/1/conda_virtual_environment/conda-env python=3.6 #create virtual environment with Python version 3.6 +marie@compute$ conda activate /scratch/ws/1/conda_virtual_environment/conda-env #activate conda-env virtual environment +``` + +Now you can work in this isolated environment, without interfering with other tasks running on the +system. Note that the inscription (conda-env) at the beginning of each line represents that you +are in the virtual environment. You can deactivate the conda environment as follows: + +```console +(conda-env) marie@compute$ conda deactivate #Leave the virtual environment +``` + +TODO: Link to this page from other DA/ML topics. insert link in alpha centauri + +??? example + + This is an example on partition Alpha. The example creates a virtual environment, and installs + the package `torchvision` with pip. + ```console + marie@login$ srun --partition=alpha-interactive -N=1 --gres=gpu:1 --time=01:00:00 --pty bash + marie@alpha$ mkdir python-environments # please use workspaces + marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch + Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies loaded. + marie@alpha$ which python + /sw/installed/Python/3.8.6-GCCcore-10.2.0/bin/python + marie@alpha$ pip list + [...] + marie@alpha$ virtualenv --system-site-packages python-environments/my-torch-env + created virtual environment CPython3.8.6.final.0-64 in 42960ms + creator CPython3Posix(dest=~/python-environments/my-torch-env, clear=False, global=True) + seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=~/.local/share/virtualenv) + added seed packages: pip==21.1.3, setuptools==57.2.0, wheel==0.36.2 + activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator + marie@alpha$ source python-environments/my-torch-env/bin/activate + (my-torch-env) marie@alpha$ pip install torchvision + [...] + Installing collected packages: torchvision + Successfully installed torchvision-0.10.0 + [...] 
+ (my-torch-env) marie@alpha$ python -c "import torchvision; print(torchvision.__version__)" + 0.10.0+cu102 + (my-torch-env) marie@alpha$ deactivate + ``` diff --git a/doc.zih.tu-dresden.de/docs/software/pytorch.md b/doc.zih.tu-dresden.de/docs/software/pytorch.md index cd476d7296e271e6f7eecf3a84b4af1f80c4ee84..e8e2c4d5ecc7d123527a15140910005204a3d5ef 100644 --- a/doc.zih.tu-dresden.de/docs/software/pytorch.md +++ b/doc.zih.tu-dresden.de/docs/software/pytorch.md @@ -1,260 +1,98 @@ -# Pytorch for Data Analytics +# PyTorch -[PyTorch](https://pytorch.org/) is an open-source machine learning framework. +[PyTorch](https://pytorch.org/){:target="_blank"} is an open-source machine learning framework. It is an optimized tensor library for deep learning using GPUs and CPUs. -PyTorch is a machine learning tool developed by Facebooks AI division -to process large-scale object detection, segmentation, classification, etc. -PyTorch provides a core datastructure, the tensor, a multi-dimensional array that shares many +PyTorch is a machine learning tool developed by Facebooks AI division to process large-scale +object detection, segmentation, classification, etc. +PyTorch provides a core data structure, the tensor, a multi-dimensional array that shares many similarities with Numpy arrays. -PyTorch also consumed Caffe2 for its backend and added support of ONNX. -**Prerequisites:** To work with PyTorch you obviously need [access](../access/ssh_login.md) for the -Taurus system and basic knowledge about Python, Numpy and SLURM system. +Please check the software modules list via -**Aim** of this page is to introduce users on how to start working with PyTorch on the -[HPC-DA](../jobs_and_resources/power9.md) system - part of the TU Dresden HPC system. +```console +marie@login$ module spider pytorch +``` -There are numerous different possibilities of how to work with PyTorch on Taurus. -Here we will consider two main methods. +to find out, which PyTorch modules are available on your partition. -1\. The first option is using Jupyter notebook with HPC-DA nodes. The easiest way is by using -[Jupyterhub](../access/jupyterhub.md). It is a recommended way for beginners in PyTorch and users -who are just starting their work with Taurus. +We recommend using **Alpha** and/or **ML** partitions when working with machine learning workflows +and the PyTorch library. +You can find detailed hardware specification in our +[hardware documentation](../jobs_and_resources/hardware_taurus.md). -2\. The second way is using the Modules system and Python or conda virtual environment. -See [the Python page](python.md) for the HPC-DA system. +## PyTorch Console -Note: The information on working with the PyTorch using Containers could be found -[here](containers.md). +On the **Alpha** partition, load the module environment: -## Get started with PyTorch +```console +marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission on alpha nodes with 1 gpu on 1 node with 800 Mb per CPU +marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0 +Die folgenden Module wurden in einer anderen Version erneut geladen: + 1) modenv/scs5 => modenv/hiera -### Virtual environment +Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies loaded. +``` -For working with PyTorch and python packages using virtual environments (kernels) is necessary. +??? 
hint "Torchvision on alpha partition" + On the **Alpha** partition, the module torchvision is not yet available within the module + system. (19.08.2021) + Torchvision can be made available by using a virtual environment: -Creating and using your kernel (environment) has the benefit that you can install your preferred -python packages and use them in your notebooks. + ```console + marie@alpha$ virtualenv --system-site-packages python-environments/torchvision_env + marie@alpha$ source python-environments/torchvision_env/bin/activate + marie@alpha$ pip install torchvision --no-deps + ``` -A virtual environment is a cooperatively isolated runtime environment that allows Python users and -applications to install and upgrade Python distribution packages without interfering with -the behaviour of other Python applications running on the same system. So the -[Virtual environment](https://docs.python.org/3/glossary.html#term-virtual-environment) -is a self-contained directory tree that contains a Python installation for a particular version of -Python, plus several additional packages. At its core, the main purpose of -Python virtual environments is to create an isolated environment for Python projects. -Python virtual environment is the main method to work with Deep Learning software as PyTorch on the -HPC-DA system. + Using the **--no-deps** option for "pip install" is necessary here as otherwise the PyTorch + version might be replaced and you will run into trouble with the cuda drivers. -### Conda and Virtualenv +On the **ML** partition: -There are two methods of how to work with virtual environments on -Taurus: +```console +marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission in ml nodes with 1 gpu on 1 node with 800 Mb per CPU +``` -1.**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. -In general, It is the preferred interface for managing installations and virtual environments -on Taurus. -It has been integrated into the standard library under the -[venv module](https://docs.python.org/3/library/venv.html). -We recommend using **venv** to work with Python packages and Tensorflow on Taurus. +After calling -2\. The **conda** command is the interface for managing installations and virtual environments on -Taurus. -The **conda** is a tool for managing and deploying applications, environments and packages. -Conda is an open-source package management system and environment management system from Anaconda. -The conda manager is included in all versions of Anaconda and Miniconda. -**Important note!** Due to the use of Anaconda to create PyTorch modules for the ml partition, -it is recommended to use the conda environment for working with the PyTorch to avoid conflicts over -the sources of your packages (pip or conda). +```console +marie@login$ module spider pytorch +``` -**Note:** Keep in mind that you **cannot** use conda for working with the virtual environments -previously created with Vitualenv tool and vice versa +we know that we can load PyTorch (including torchvision) with -This example shows how to install and start working with PyTorch (with -using module system) +```console +marie@ml$ module load modenv/ml torchvision/0.7.0-fosscuda-2019b-Python-3.7.4-PyTorch-1.6.0 +Module torchvision/0.7.0-fosscuda-2019b-Python-3.7.4-PyTorch-1.6.0 and 55 dependencies loaded. 
+``` - srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=5772 bash #Job submission in ml nodes with 1 gpu on 1 node with 2 CPU and with 5772 mb for each cpu. - module load modenv/ml #Changing the environment. Example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - mkdir python-virtual-environments #Create folder - cd python-virtual-environments #Go to folder - module load PythonAnaconda/3.6 #Load Anaconda with Python. Example output: Module Module PythonAnaconda/3.6 loaded. - which python #Check which python are you using - python3 -m venv --system-site-packages envtest #Create virtual environment - source envtest/bin/activate #Activate virtual environment. Example output: (envtest) bash-4.2$ - module load PyTorch #Load PyTorch module. Example output: Module PyTorch/1.1.0-PythonAnaconda-3.6 loaded. - python #Start python - import torch - torch.version.__version__ #Example output: 1.1.0 +Now, we check that we can access PyTorch: -Keep in mind that using **srun** directly on the shell will lead to blocking and launch an -interactive job. Apart from short test runs, -it is **recommended to launch your jobs into the background by using batch jobs**. -For that, you can conveniently put the parameters directly into the job file -which you can submit using *sbatch [options] <job_file_name>*. +```console +marie@{ml,alpha}$ python -c "import torch; print(torch.__version__)" +``` -## Running the model and examples +The following example shows how to create a python virtual environment and import PyTorch. -Below are examples of Jupyter notebooks with PyTorch models which you can run on ml nodes of HPC-DA. +```console +marie@ml$ mkdir python-environments #create folder +marie@ml$ which python #check which python are you using +/sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python +marie@ml$ virtualenv --system-site-packages python-environments/env #create virtual environment "env" which inheriting with global site packages +[...] +marie@ml$ source python-environments/env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$ +marie@ml$ python -c "import torch; print(torch.__version__)" +``` -There are two ways how to work with the Jupyter notebook on HPC-DA system. You can use a -[remote Jupyter server](deep_learning.md) or [JupyterHub](../access/jupyterhub.md). -Jupyterhub is a simple and recommended way to use PyTorch. -We are using Jupyterhub for our examples. +## PyTorch in JupyterHub -Prepared examples of PyTorch models give you an understanding of how to work with -Jupyterhub and PyTorch models. It can be useful and instructive to start -your acquaintance with PyTorch and HPC-DA system from these simple examples. +In addition to using interactive and batch jobs, it is possible to work with PyTorch using JupyterHub. +The production and test environments of JupyterHub contain Python kernels, that come with a PyTorch support. -JupyterHub is available here: [taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter) + +{: align="center"} -After login, you can start a new session by clicking on the button. +## Distributed PyTorch -**Note:** Detailed guide (with pictures and instructions) how to run the Jupyterhub -you could find on [the page](../access/jupyterhub.md). - -Please choose the "IBM Power (ppc64le)". You need to download an example -(prepared as jupyter notebook file) that already contains all you need for the start of the work. 
-Please put the file into your previously created virtual environment in your working directory or -use the kernel for your notebook [see Jupyterhub page](../access/jupyterhub.md). - -Note: You could work with simple examples in your home directory but according to -[HPCStorageConcept2019](../data_lifecycle/overview.md) please use **workspaces** -for your study and work projects. -For this reason, you have to use advanced options of Jupyterhub and put "/" in "Workspace scope" field. - -To download the first example (from the list below) into your previously created -virtual environment you could use the following command: - - ws_list #list of your workspaces - cd <name_of_your_workspace> #go to workspace - - wget https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/PyTorch/example_MNIST_Pytorch.zip - unzip example_MNIST_Pytorch.zip - -Also, you could use kernels for all notebooks, not only for them which -placed in your virtual environment. See the [jupyterhub](../access/jupyterhub.md) page. - -Examples: - -1\. Simple MNIST model. The MNIST database is a large database of handwritten digits that is -commonly used for training various image processing systems. PyTorch allows us to import and -download the MNIST dataset directly from the Torchvision - package consists of datasets, -model architectures and transformations. -The model contains a neural network with sequential architecture and typical modules -for this kind of models. Recommended parameters for running this model are 1 GPU and 7 cores (28 thread) - -(example_MNIST_Pytorch.zip) - -### Running the model - -Open [JupyterHub](../access/jupyterhub.md) and follow instructions above. - -In Jupyterhub documents are organized with tabs and a very versatile split-screen feature. -On the left side of the screen, you can open your file. Use 'File-Open from Path' -to go to your workspace (e.g. `scratch/ws/<username-name_of_your_ws>`). -You could run each cell separately step by step and analyze the result of each step. -Default command for running one cell Shift+Enter'. Also, you could run all cells with the command ' -run all cells' in the 'Run' Tab. - -## Components and advantages of the PyTorch - -### Pre-trained networks - -The PyTorch gives you an opportunity to use pre-trained models and networks for your purposes -(as a TensorFlow for instance) especially for computer vision and image recognition. As you know -computer vision is one of the fields that have been most impacted by the advent of deep learning. - -We will use a network trained on ImageNet, taken from the TorchVision project, -which contains a few of the best performing neural network architectures for computer vision, -such as AlexNet, one of the early breakthrough networks for image recognition, and ResNet, -which won the ImageNet classification, detection, and localization competitions, in 2015. -[TorchVision](https://pytorch.org/vision/stable/index.html) also has easy access to datasets like -ImageNet and other utilities for getting up -to speed with computer vision applications in PyTorch. -The pre-defined models can be found in torchvision.models. - -**Important note**: For the ml nodes only the Torchvision 0.2.2. is available (10.11.20). -The last updates from IBM include only Torchvision 0.4.1 CPU version. -Be careful some features from modern versions of Torchvision are not available in the 0.2.2 -(e.g. some kinds of `transforms`). Always check the version with: `print(torchvision.__version__)` - -Examples: - -1. Image recognition example. 
This PyTorch script is using Resnet to single image classification. -Recommended parameters for running this model are 1 GPU and 7 cores (28 thread). - -(example_Pytorch_image_recognition.zip) - -Remember that for using [JupyterHub service](../access/jupyterhub.md) -for PyTorch you need to create and activate -a virtual environment (kernel) with loaded essential modules (see "envtest" environment form the virtual -environment example. - -Run the example in the same way as the previous example (MNIST model). - -### Using Multiple GPUs with PyTorch - -Effective use of GPUs is essential, and it implies using parallelism in -your code and model. Data Parallelism and model parallelism are effective instruments -to improve the performance of your code in case of GPU using. - -The data parallelism is a widely-used technique. It replicates the same model to all GPUs, -where each GPU consumes a different partition of the input data. You could see this method [here](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html). - -The example below shows how to solve that problem by using model -parallel, which, in contrast to data parallelism, splits a single model -onto different GPUs, rather than replicating the entire model on each -GPU. The high-level idea of model parallel is to place different sub-networks of a model onto different -devices. As the only part of a model operates on any individual device, a set of devices can -collectively serve a larger model. - -It is recommended to use [DistributedDataParallel] -(https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), -instead of this class, to do multi-GPU training, even if there is only a single node. -See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. -Check the [page](https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead) and -[Distributed Data Parallel](https://pytorch.org/docs/stable/notes/ddp.html#ddp). - -Examples: - -1\. The parallel model. The main aim of this model to show the way how -to effectively implement your neural network on several GPUs. It -includes a comparison of different kinds of models and tips to improve -the performance of your model. **Necessary** parameters for running this -model are **2 GPU** and 14 cores (56 thread). - -(example_PyTorch_parallel.zip) - -Remember that for using [JupyterHub service](../access/jupyterhub.md) -for PyTorch you need to create and activate -a virtual environment (kernel) with loaded essential modules. - -Run the example in the same way as the previous examples. - -#### Distributed data-parallel - -[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) -(DDP) implements data parallelism at the module level which can run across multiple machines. -Applications using DDP should spawn multiple processes and create a single DDP instance per process. -DDP uses collective communications in the [torch.distributed] -(https://pytorch.org/tutorials/intermediate/dist_tuto.html) -package to synchronize gradients and buffers. - -The tutorial could be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). - -To use distributed data parallelisation on Taurus please use following -parameters: `--ntasks-per-node` -parameter to the number of GPUs you use -per node. Also, it could be useful to increase `memomy/cpu` parameters -if you run larger models. 
Memory can be set up to:
-
---mem=250000 and --cpus-per-task=7 for the **ml** partition.
-
---mem=60000 and --cpus-per-task=6 for the **gpu2** partition.
-
-Keep in mind that only one memory parameter (`--mem-per-cpu` = <MB> or `--mem`=<MB>) can be specified
-
-## F.A.Q
-
-- (example_MNIST_Pytorch.zip)
-- (example_Pytorch_image_recognition.zip)
-- (example_PyTorch_parallel.zip)
+For details on how to run PyTorch with multiple GPUs and/or multiple nodes, see
+[distributed training](distributed_training.md).
diff --git a/doc.zih.tu-dresden.de/docs/software/tensorboard.md b/doc.zih.tu-dresden.de/docs/software/tensorboard.md
new file mode 100644
index 0000000000000000000000000000000000000000..a1fab030bfbca20b1a8f69cf302e95957b565185
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/software/tensorboard.md
@@ -0,0 +1,84 @@
+# TensorBoard
+
+TensorBoard is a visualization toolkit for TensorFlow and offers a variety of functionalities such
+as presentation of loss and accuracy, visualization of the model graph or profiling of the
+application.
+
+## Using JupyterHub
+
+The easiest way to use TensorBoard is via [JupyterHub](../access/jupyterhub.md). The default
+TensorBoard log directory is set to `/tmp/<username>/tf-logs` on the compute node where the
+Jupyter session is running. To display your own log directory, soft-link it to this default
+folder. Open a "New Launcher" menu (`Ctrl+Shift+L`) and select a "Terminal" session. This will
+start a new terminal on the respective compute node. Create the directory `/tmp/$USER/tf-logs`
+and link it to your log directory with
+`ln -s <your-tensorboard-target-directory> <local-tf-logs-directory>`:
+
+```Bash
+mkdir -p /tmp/$USER/tf-logs
+ln -s <your-tensorboard-target-directory> /tmp/$USER/tf-logs
+```
+
+Update the TensorBoard tab with `F5` if needed.
+
+## Using TensorBoard from Module Environment
+
+On ZIH systems, TensorBoard is also available as an extension of the TensorFlow module. To check
+whether a specific TensorFlow module provides TensorBoard, use the following command:
+
+```console hl_lines="9"
+marie@compute$ module spider TensorFlow/2.3.1
+[...]
+    Included extensions
+    ===================
+    absl-py-0.10.0, astor-0.8.0, astunparse-1.6.3, cachetools-4.1.1, gast-0.3.3,
+    google-auth-1.21.3, google-auth-oauthlib-0.4.1, google-pasta-0.2.0,
+    grpcio-1.32.0, Keras-Preprocessing-1.1.2, Markdown-3.2.2, oauthlib-3.1.0, opt-
+    einsum-3.3.0, pyasn1-modules-0.2.8, requests-oauthlib-1.3.0, rsa-4.6,
+    tensorboard-2.3.0, tensorboard-plugin-wit-1.7.0, TensorFlow-2.3.1, tensorflow-
+    estimator-2.3.0, termcolor-1.1.0, Werkzeug-1.0.1, wrapt-1.12.1
+```
+
+If TensorBoard appears in the `Included extensions` section of the output, it is available.
+
+To use TensorBoard, connect to the ZIH system via SSH as usual, schedule an interactive job, and
+load a TensorFlow module:
+
+```console
+marie@compute$ module load TensorFlow/2.3.1
+Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 47 dependencies loaded.
+```
+
+Then, create a workspace for the event data that should be visualized in TensorBoard. If you
+already have an event data directory, you can skip this step.
+
+```console
+marie@compute$ ws_allocate -F scratch tensorboard_logdata 1
+Info: creating workspace.
+/scratch/ws/1/marie-tensorboard_logdata
+[...]
+```
+
+Now, you can run your TensorFlow application. Note that you might have to adapt your code to make
+its event data accessible for TensorBoard.
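+
+The following minimal sketch illustrates one way to do this (the workspace path and the logged
+values are assumptions for demonstration only); it uses the `tf.summary` API to write event files
+into the workspace created above:
+
+```python
+import tensorflow as tf
+
+# Assumed log directory: the workspace allocated above
+log_dir = "/scratch/ws/1/marie-tensorboard_logdata"
+writer = tf.summary.create_file_writer(log_dir)
+
+# Write a scalar for a few dummy steps so that TensorBoard has something to display
+with writer.as_default():
+    for step in range(100):
+        tf.summary.scalar("example/loss", 1.0 / (step + 1), step=step)
+writer.flush()
+```
+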
Please find further information on the official [TensorBoard website](https://www.tensorflow.org/tensorboard/get_started) +Then, you can start TensorBoard and pass the directory of the event data: + +```console +marie@compute$ tensorboard --logdir /scratch/ws/1/marie-tensorboard_logdata --bind_all +[...] +TensorBoard 2.3.0 at http://taurusi8034.taurus.hrsk.tu-dresden.de:6006/ +[...] +``` + +TensorBoard then returns a server address on Taurus, e.g. `taurusi8034.taurus.hrsk.tu-dresden.de:6006` + +For accessing TensorBoard now, you have to set up some port forwarding via ssh to your local +machine: + +```console +marie@local$ ssh -N -f -L 6006:taurusi8034.taurus.hrsk.tu-dresden.de:6006 <zih-login>@taurus.hrsk.tu-dresden.de +``` + +Now, you can see the TensorBoard in your browser at `http://localhost:6006/`. + +Note that you can also use TensorBoard in an [sbatch file](../jobs_and_resources/batch_systems.md). diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow.md b/doc.zih.tu-dresden.de/docs/software/tensorflow.md index 346eb9a1da4e0728c2751773d656ac70d00a60c4..d8ad85c3b1a5f870f5ced0848274fb866bd14dff 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow.md @@ -1,264 +1,156 @@ # TensorFlow -## Introduction - -This is an introduction of how to start working with TensorFlow and run -machine learning applications on the [HPC-DA](../jobs_and_resources/hpcda.md) system of Taurus. - -\<span style="font-size: 1em;">On the machine learning nodes (machine -learning partition), you can use the tools from [IBM PowerAI](power_ai.md) or the other -modules. PowerAI is an enterprise software distribution that combines popular open-source -deep learning frameworks, efficient AI development tools (Tensorflow, Caffe, etc). For -this page and examples was used [PowerAI version 1.5.4](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html) - -[TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source -software library for dataflow and differentiable programming across many -tasks. It is a symbolic math library, used primarily for machine -learning applications. It has a comprehensive, flexible ecosystem of tools, libraries and -community resources. It is available on taurus along with other common machine -learning packages like Pillow, SciPY, Numpy. - -**Prerequisites:** To work with Tensorflow on Taurus, you obviously need -[access](../access/ssh_login.md) for the Taurus system and basic knowledge about Python, SLURM system. - -**Aim** of this page is to introduce users on how to start working with -TensorFlow on the \<a href="HPCDA" target="\_self">HPC-DA\</a> system - -part of the TU Dresden HPC system. - -There are three main options on how to work with Tensorflow on the -HPC-DA: **1.** **Modules,** **2.** **JupyterNotebook, 3. Containers**. The best option is -to use [module system](../software/runtime_environment.md#Module_Environments) and -Python virtual environment. Please see the next chapters and the [Python page](python.md) for the -HPC-DA system. - -The information about the Jupyter notebook and the **JupyterHub** could -be found [here](../access/jupyterhub.md). The use of -Containers is described [here](tensorflow_container_on_hpcda.md). - -On Taurus, there exist different module environments, each containing a set -of software modules. 
The default is *modenv/scs5* which is already loaded, -however for the HPC-DA system using the "ml" partition you need to use *modenv/ml*. -To find out which partition are you using use: `ml list`. -You can change the module environment with the command: - - module load modenv/ml - -The machine learning partition is based on the PowerPC Architecture (ppc64le) -(Power9 processors), which means that the software built for x86_64 will not -work on this partition, so you most likely can't use your already locally -installed packages on Taurus. Also, users need to use the modules which are -specially made for the ml partition (from modenv/ml) and not for the rest -of Taurus (e.g. from modenv/scs5). - -Each node on the ml partition has 6x Tesla V-100 GPUs, with 176 parallel threads -on 44 cores per node (Simultaneous multithreading (SMT) enabled) and 256GB RAM. -The specification could be found [here](../jobs_and_resources/power9.md). - -%RED%Note:<span class="twiki-macro ENDCOLOR"></span> Users should not -reserve more than 28 threads per each GPU device so that other users on -the same node still have enough CPUs for their computations left. - -## Get started with Tensorflow - -This example shows how to install and start working with TensorFlow -(with using modules system) and the python virtual environment. Please, -check the next chapter for the details about the virtual environment. - - srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with 1 gpu on 1 node with 8000 mb. - - module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - - mkdir python-environments #create folder - module load TensorFlow #load TensorFlow module. Example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. - which python #check which python are you using - virtualenvv --system-site-packages python-environments/env #create virtual environment "env" which inheriting with global site packages - source python-environments/env/bin/activate #Activate virtual environment "env". Example output: (env) bash-4.2$ - python #start python - import tensorflow as tf - print(tf.VERSION) #example output: 1.10.0 - -Keep in mind that using **srun** directly on the shell will be blocking -and launch an interactive job. Apart from short test runs, it is -recommended to launch your jobs into the background by using batch -jobs:\<span> **sbatch \[options\] \<job file>** \</span>. The example -will be presented later on the page. - -As a Tensorflow example, we will use a \<a -href="<https://www.tensorflow.org/tutorials>" target="\_blank">simple -mnist model\</a>. Even though this example is in Python, the information -here will still apply to other tools. - -The ml partition has very efficacious GPUs to offer. Do not assume that -more power means automatically faster computational speed. The GPU is -only one part of a typical machine learning application. Do not forget -that first the input data needs to be loaded and in most cases even -rescaled or augmented. If you do not specify that you want to use more -than the default one worker (=one CPU thread), then it is very likely -that your GPU computes faster, than it receives the input data. It is, -therefore, possible, that you will not be any faster, than on other GPU -partitions. \<span style="font-size: 1em;">You can solve this by using -multithreading when loading your input data. 
The \</span>\<a -href="<https://keras.io/models/sequential/#fit_generator>" -target="\_blank">fit_generator\</a>\<span style="font-size: 1em;"> -method supports multiprocessing, just set \`use_multiprocessing\` to -\`True\`, \</span>\<a href="Slurm#Job_Submission" -target="\_blank">request more Threads\</a>\<span style="font-size: -1em;"> from SLURM and set the \`Workers\` amount accordingly.\</span> - -The example below with a \<a -href="<https://www.tensorflow.org/tutorials>" target="\_blank">simple -mnist model\</a> of the python script illustrates using TF-Keras API -from TensorFlow. \<a href="<https://www.tensorflow.org/guide/keras>" -target="\_top">Keras\</a> is TensorFlows high-level API. - -**You can read in detail how to work with Keras on Taurus \<a -href="Keras" target="\_blank">here\</a>.** - - import tensorflow as tf - # Load and prepare the MNIST dataset. Convert the samples from integers to floating-point numbers: - mnist = tf.keras.datasets.mnist - - (x_train, y_train),(x_test, y_test) = mnist.load_data() - x_train, x_test = x_train / 255.0, x_test / 255.0 - - # Build the tf.keras model by stacking layers. Select an optimizer and loss function used for training - model = tf.keras.models.Sequential([ - tf.keras.layers.Flatten(input_shape=(28, 28)), - tf.keras.layers.Dense(512, activation=tf.nn.relu), - tf.keras.layers.Dropout(0.2), - tf.keras.layers.Dense(10, activation=tf.nn.softmax) - ]) - model.compile(optimizer='adam', - loss='sparse_categorical_crossentropy', - metrics=['accuracy']) - - # Train and evaluate model - model.fit(x_train, y_train, epochs=5) - model.evaluate(x_test, y_test) - -The example can train an image classifier with \~98% accuracy based on -this dataset. - -## Python virtual environment - -A virtual environment is a cooperatively isolated runtime environment -that allows Python users and applications to install and update Python -distribution packages without interfering with the behaviour of other -Python applications running on the same system. At its core, the main -purpose of Python virtual environments is to create an isolated -environment for Python projects. - -**Vitualenv**is a standard Python tool to create isolated Python -environments and part of the Python installation/module. We recommend -using virtualenv to work with Tensorflow and Pytorch on Taurus.\<br -/>However, if you have reasons (previously created environments etc) you -can also use conda which is the second way to use a virtual environment -on the Taurus. \<a -href="<https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html>" -target="\_blank">Conda\</a> is an open-source package management system -and environment management system. Note that using conda means that -working with other modules from taurus will be harder or impossible. -Hence it is highly recommended to use virtualenv. - -## Running the sbatch script on ML modules (modenv/ml) and SCS5 modules (modenv/scs5) - -Generally, for machine learning purposes the ml partition is used but -for some special issues, the other partitions can be useful also. The -following sbatch script can execute the above Python script both on ml -partition or gpu2 partition.\<br /> When not using the -TensorFlow-Anaconda modules you may need some additional modules that -are not included (e.g. when using the TensorFlow module from modenv/scs5 -on gpu2).\<br />If you have a question about the sbatch script see the -article about \<a href="Slurm" target="\_blank">SLURM\</a>. 
Keep in mind -that you need to put the executable file (machine_learning_example.py) -with python code to the same folder as the bash script file -\<script_name>.sh (see below) or specify the path. - - #!/bin/bash - #SBATCH --mem=8GB # specify the needed memory - #SBATCH -p ml # specify ml partition or gpu2 partition - #SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task) - #SBATCH --nodes=1 # request 1 node - #SBATCH --time=00:10:00 # runs for 10 minutes - #SBATCH -c 7 # how many cores per task allocated - #SBATCH -o HLR_<name_your_script>.out # save output message under HLR_${SLURMJOBID}.out - #SBATCH -e HLR_<name_your_script>.err # save error messages under HLR_${SLURMJOBID}.err - - if [ "$SLURM_JOB_PARTITION" == "ml" ]; then - module load modenv/ml - module load TensorFlow/2.0.0-PythonAnaconda-3.7 - else - module load modenv/scs5 - module load TensorFlow/2.0.0-fosscuda-2019b-Python-3.7.4 - module load Pillow/6.2.1-GCCcore-8.3.0 # Optional - module load h5py/2.10.0-fosscuda-2019b-Python-3.7.4 # Optional - fi - - python machine_learning_example.py - - ## when finished writing, submit with: sbatch <script_name> - -Output results and errors file can be seen in the same folder in the -corresponding files after the end of the job. Part of the example -output: - - 1600/10000 [===>..........................] - ETA: 0s - 3168/10000 [========>.....................] - ETA: 0s - 4736/10000 [=============>................] - ETA: 0s - 6304/10000 [=================>............] - ETA: 0s - 7872/10000 [======================>.......] - ETA: 0s - 9440/10000 [===========================>..] - ETA: 0s - 10000/10000 [==============================] - 0s 38us/step - -## TensorFlow 2 - -[TensorFlow -2.0](https://blog.tensorflow.org/2019/09/tensorflow-20-is-now-available.html) -is a significant milestone for TensorFlow and the community. There are -multiple important changes for users. TensorFlow 2.0 removes redundant -APIs, makes APIs more consistent (Unified RNNs, Unified Optimizers), and -better integrates with the Python runtime with Eager execution. Also, -TensorFlow 2.0 offers many performance improvements on GPUs. - -There are a number of TensorFlow 2 modules for both ml and scs5 modenvs -on Taurus. Please check\<a href="SoftwareModulesList" target="\_blank"> -the software modules list\</a> for the information about available -modules or use - - module spider TensorFlow - -%RED%Note:<span class="twiki-macro ENDCOLOR"></span> Tensorflow 2 will -be loaded by default when loading the Tensorflow module without -specifying the version. - -\<span style="font-size: 1em;">TensorFlow 2.0 includes many API changes, -such as reordering arguments, renaming symbols, and changing default -values for parameters. Thus in some cases, it makes code written for the -TensorFlow 1 not compatible with TensorFlow 2. However, If you are using -the high-level APIs (tf.keras) there may be little or no action you need -to take to make your code fully TensorFlow 2.0 \<a -href="<https://www.tensorflow.org/guide/migrate>" -target="\_blank">compatible\</a>. 
It is still possible to run 1.X code,
-unmodified ( [except for
-contrib](https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md)),
-in TensorFlow 2.0:\</span>
-
-    import tensorflow.compat.v1 as tf
-    tf.disable_v2_behavior() #instead of "import tensorflow as tf"
-
-To make the transition to TF 2.0 as seamless as possible, the TensorFlow
-team has created the
-[`tf_upgrade_v2`](https://www.tensorflow.org/guide/upgrade) utility to
-help transition legacy code to the new API.
-
-## FAQ:
-
-Q: Which module environment should I use? modenv/ml, modenv/scs5,
-modenv/hiera
-
-A: On the ml partition use modenv/ml, on rome and gpu3 use modenv/hiera,
-else stay with the default of modenv/scs5.
-
-Q: How to change the module environment and know more about modules?
-
-A: [Modules](../software/runtime_environment.md#Modules)
+[TensorFlow](https://www.tensorflow.org) is a free end-to-end open-source software library for data
+flow and differentiable programming across many tasks. It is a symbolic math library, used primarily
+for machine learning applications. It has a comprehensive, flexible ecosystem of tools, libraries
+and community resources.
+
+Please check the software modules list via
+
+```console
+marie@compute$ module spider TensorFlow
+[...]
+```
+
+to find out which TensorFlow modules are available on your partition.
+
+On ZIH systems, TensorFlow 2 is the default module version. For compatibility hints between
+TensorFlow 2 and TensorFlow 1, see the corresponding [section below](#compatibility-tf2-and-tf1).
+
+We recommend using partitions **Alpha** and/or **ML** when working with machine learning workflows
+and the TensorFlow library. You can find the detailed hardware specifications in our
+[Hardware](../jobs_and_resources/hardware_taurus.md) documentation.
+
+## TensorFlow Console
+
+On the partition Alpha, load the module environment:
+
+```console
+marie@alpha$ module load modenv/scs5
+```
+
+Alternatively, you can use the `modenv/hiera` module environment, where the newest versions are
+available:
+
+```console
+marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
+
+The following have been reloaded with a version change:
+  1) modenv/scs5 => modenv/hiera
+
+Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5 and 15 dependencies loaded.
+marie@alpha$ module avail TensorFlow
+
+-------------- /sw/modules/hiera/all/MPI/GCC-CUDA/10.2.0-11.1.1/OpenMPI/4.0.5 -------------------
+   Horovod/0.21.1-TensorFlow-2.4.1    TensorFlow/2.4.1
+
+[...]
+```
+
+On the partition ML, load the module environment:
+
+```console
+marie@ml$ module load modenv/ml
+The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml
+```
+
+This example shows how to install and start working with TensorFlow using the modules system.
+
+```console
+marie@ml$ module load TensorFlow
+Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 47 dependencies loaded.
+```
+
+Now we can use TensorFlow. Nevertheless, when working with Python in an interactive job, we
+recommend using a virtual environment. In the following example, we create a Python virtual
+environment and import TensorFlow:
+
+!!! example
+
+    ```console
+    marie@ml$ ws_allocate -F scratch python_virtual_environment 1
+    Info: creating workspace.
+    /scratch/ws/1/python_virtual_environment
+    [...]
+    marie@ml$ which python #check which python you are using
+    /sw/installed/Python/3.7.2-GCCcore-8.2.0
+    marie@ml$ virtualenv --system-site-packages /scratch/ws/1/python_virtual_environment/env
+    [...]
+    marie@ml$ source /scratch/ws/1/python_virtual_environment/env/bin/activate
+    marie@ml$ python -c "import tensorflow as tf; print(tf.__version__)"
+    [...]
+    2.3.1
+    ```
+
+## TensorFlow in JupyterHub
+
+In addition to interactive and batch jobs, it is possible to work with TensorFlow using
+JupyterHub. The production and test environments of JupyterHub contain Python and R kernels that
+both come with TensorFlow support. However, you can specify the TensorFlow version when spawning
+the notebook by pre-loading a specific TensorFlow module:
+
+
+{: align="center"}
+
+!!! hint
+
+    You can also define your own Jupyter kernel for more specific tasks. Please read about Jupyter
+    kernels and virtual environments in our
+    [JupyterHub](../access/jupyterhub.md#creating-and-using-your-own-environment) documentation.
+
+## TensorFlow in Containers
+
+Another option for using TensorFlow is containers. In the HPC domain, the
+[Singularity](https://singularity.hpcng.org/) container system is a widely used tool. In the
+following example, we run `tensorflow-test` in a Singularity container:
+
+```console
+marie@ml$ singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img
+Singularity>$ export PATH=/opt/anaconda3/bin:$PATH
+Singularity>$ source activate /opt/anaconda3 #activate conda environment
+(base) Singularity>$ . /opt/DL/tensorflow/bin/tensorflow-activate
+(base) Singularity>$ tensorflow-test
+Basic test of tensorflow - A Hello World!!!...
+[...]
+```
+
+## TensorFlow with Python or R
+
+For further information on TensorFlow in combination with Python see
+[data analytics with Python](data_analytics_with_python.md), for R see
+[data analytics with R](data_analytics_with_r.md).
+
+## Distributed TensorFlow
+
+For details on how to run TensorFlow with multiple GPUs and/or multiple nodes, see
+[distributed training](distributed_training.md).
+
+## Compatibility TF2 and TF1
+
+TensorFlow 2.0 includes many API changes, such as reordering arguments, renaming symbols, and
+changing default values for parameters. Thus, in some cases, code written for TensorFlow 1.X is
+not compatible with TensorFlow 2.X. However, if you are using the high-level APIs (`tf.keras`),
+there may be little or no action you need to take to make your code fully
+[TensorFlow 2.0](https://www.tensorflow.org/guide/migrate) compatible. It is still possible to
+run 1.X code, unmodified (except for `contrib`), in TensorFlow 2.0:
+
+```python
+import tensorflow.compat.v1 as tf
+tf.disable_v2_behavior() #instead of "import tensorflow as tf"
+```
+
+To make the transition to TensorFlow 2.0 as seamless as possible, the TensorFlow team has created
+the `tf_upgrade_v2` utility to help transition legacy code to the new API.
+
+## Keras
+
+[Keras](https://keras.io) is a high-level neural network API, written in Python and capable
+of running on top of TensorFlow. Please check the software modules list via
+
+```console
+marie@compute$ module spider Keras
+[...]
+```
+
+to find out which Keras modules are available on your partition. TensorFlow should be automatically
+loaded as a dependency. After loading the module, you can use Keras as usual.
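+
+A minimal Keras sketch is shown below. It is only an illustration (the model and the use of the
+MNIST dataset are assumptions borrowed from the common TensorFlow tutorials, not a prescribed
+workflow), but it runs with the TensorFlow modules described above:
+
+```python
+import tensorflow as tf
+
+# Load and normalize the MNIST dataset
+mnist = tf.keras.datasets.mnist
+(x_train, y_train), (x_test, y_test) = mnist.load_data()
+x_train, x_test = x_train / 255.0, x_test / 255.0
+
+# Build, train and evaluate a small sequential model
+model = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(input_shape=(28, 28)),
+    tf.keras.layers.Dense(512, activation="relu"),
+    tf.keras.layers.Dropout(0.2),
+    tf.keras.layers.Dense(10, activation="softmax"),
+])
+model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
+model.fit(x_train, y_train, epochs=5)
+model.evaluate(x_test, y_test)
+```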
diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow_container_on_hpcda.md b/doc.zih.tu-dresden.de/docs/software/tensorflow_container_on_hpcda.md deleted file mode 100644 index 7b77f7da32f720efa0145971b1d3b9b9612a3e92..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow_container_on_hpcda.md +++ /dev/null @@ -1,85 +0,0 @@ -# Container on HPC-DA (TensorFlow, PyTorch) - -<span class="twiki-macro RED"></span> **Note: This page is under -construction** <span class="twiki-macro ENDCOLOR"></span> - -\<span style="font-size: 1em;">A container is a standard unit of -software that packages up code and all its dependencies so the -application runs quickly and reliably from one computing environment to -another.\</span> - -**Prerequisites:** To work with Tensorflow, you need \<a href="Login" -target="\_blank">access\</a> for the Taurus system and basic knowledge -about containers, Linux systems. - -**Aim** of this page is to introduce users on how to use Machine -Learning Frameworks such as TensorFlow or PyTorch on the \<a -href="HPCDA" target="\_self">HPC-DA\</a> system - part of the TU Dresden -HPC system. - -Using a container is one of the options to use Machine learning -workflows on Taurus. Using containers gives you more flexibility working -with modules and software but at the same time required more effort. - -\<span style="font-size: 1em;">On Taurus \</span>\<a -href="<https://sylabs.io/>" target="\_blank">Singularity\</a>\<span -style="font-size: 1em;"> used as a standard container solution. -Singularity enables users to have full control of their environment. -Singularity containers can be used to package entire scientific -workflows, software and libraries, and even data. This means that -\</span>**you dont have to ask an HPC support to install anything for -you - you can put it in a Singularity container and run!**\<span -style="font-size: 1em;">As opposed to Docker (the most famous container -solution), Singularity is much more suited to being used in an HPC -environment and more efficient in many cases. Docker containers also can -easily be used in Singularity.\</span> - -Future information is relevant for the HPC-DA system (ML partition) -based on Power9 architecture. - -In some cases using Singularity requires a Linux machine with root -privileges, the same architecture and a compatible kernel. For many -reasons, users on Taurus cannot be granted root permissions. A solution -is a Virtual Machine (VM) on the ml partition which allows users to gain -root permissions in an isolated environment. There are two main options -on how to work with VM on Taurus: - -1\. [VM tools](vm_tools.md). Automative algorithms for using virtual -machines; - -2\. [Manual method](virtual_machines.md). It required more operations but gives you -more flexibility and reliability. - -Short algorithm to run the virtual machine manually: - - srun -p ml -N 1 -c 4 --hint=nomultithread --cloud=kvm --pty /bin/bash<br />cat ~/.cloud_$SLURM_JOB_ID #Example output: ssh root@192.168.0.1<br />ssh root@192.168.0.1 #Copy and paste output from the previous command <br />./mount_host_data.sh - -with VMtools: - -VMtools contains two main programs: -**\<span>buildSingularityImage\</span>** and -**\<span>startInVM.\</span>** - -Main options on how to create a container on ML nodes: - -1\. Create a container from the definition - -1.1 Create a Singularity definition from the Dockerfile. - -\<span style="font-size: 1em;">2. 
Importing container from the \</span> -[DockerHub](https://hub.docker.com/search?q=ppc64le&type=image&page=1)\<span -style="font-size: 1em;"> or \</span> -[SingularityHub](https://singularity-hub.org/) - -Two main sources for the Tensorflow containers for the Power9 -architecture: - -<https://hub.docker.com/r/ibmcom/tensorflow-ppc64le> - -<https://hub.docker.com/r/ibmcom/powerai> - -Pytorch: - -<https://hub.docker.com/r/ibmcom/powerai> - --- Main.AndreiPolitov - 2020-01-03 diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md b/doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md deleted file mode 100644 index e011dfd2dc35d7dc5ef1576d7a5dbefa5d52f6d4..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md +++ /dev/null @@ -1,252 +0,0 @@ -# Tensorflow on Jupyter Notebook - -%RED%Note: This page is under construction<span -class="twiki-macro ENDCOLOR"></span> - -Disclaimer: This page dedicates a specific question. For more general -questions please check the JupyterHub webpage. - -The Jupyter Notebook is an open-source web application that allows you -to create documents that contain live code, equations, visualizations, -and narrative text. \<span style="font-size: 1em;">Jupyter notebook -allows working with TensorFlow on Taurus with GUI (graphic user -interface) and the opportunity to see intermediate results step by step -of your work. This can be useful for users who dont have huge experience -with HPC or Linux. \</span> - -**Prerequisites:** To work with Tensorflow and jupyter notebook you need -\<a href="Login" target="\_blank">access\</a> for the Taurus system and -basic knowledge about Python, Slurm system and the Jupyter notebook. - -\<span style="font-size: 1em;"> **This page aims** to introduce users on -how to start working with TensorFlow on the [HPCDA](../jobs_and_resources/hpcda.md) system - part -of the TU Dresden HPC system with a graphical interface.\</span> - -## Get started with Jupyter notebook - -Jupyter notebooks are a great way for interactive computing in your web -browser. Jupyter allows working with data cleaning and transformation, -numerical simulation, statistical modelling, data visualization and of -course with machine learning. - -\<span style="font-size: 1em;">There are two general options on how to -work Jupyter notebooks using HPC. \</span> - -- \<span style="font-size: 1em;">There is \</span>**\<a - href="JupyterHub" target="\_self">jupyterhub\</a>** on Taurus, where - you can simply run your Jupyter notebook on HPC nodes. JupyterHub is - available [here](https://taurus.hrsk.tu-dresden.de/jupyter) -- For more specific cases you can run a manually created **remote - jupyter server.** \<span style="font-size: 1em;"> You can find the - manual server setup [here](deep_learning.md). - -\<span style="font-size: 13px;">Keep in mind that with Jupyterhub you -can't work with some special instruments. However general data analytics -tools are available. Still and all, the simplest option for beginners is -using JupyterHub.\</span> - -## Virtual environment - -\<span style="font-size: 1em;">For working with TensorFlow and python -packages using virtual environments (kernels) is necessary.\</span> - -Interactive code interpreters that are used by Jupyter Notebooks are -called kernels.\<br />Creating and using your kernel (environment) has -the benefit that you can install your preferred python packages and use -them in your notebooks. 
- -A virtual environment is a cooperatively isolated runtime environment -that allows Python users and applications to install and upgrade Python -distribution packages without interfering with the behaviour of other -Python applications running on the same system. So the [Virtual -environment](https://docs.python.org/3/glossary.html#term-virtual-environment) -is a self-contained directory tree that contains a Python installation -for a particular version of Python, plus several additional packages. At -its core, the main purpose of Python virtual environments is to create -an isolated environment for Python projects. Python virtual environment is -the main method to work with Deep Learning software as TensorFlow on the -[HPCDA](../jobs_and_resources/hpcda.md) system. - -### Conda and Virtualenv - -There are two methods of how to work with virtual environments on -Taurus. **Vitualenv (venv)** is a -standard Python tool to create isolated Python environments. We -recommend using venv to work with Tensorflow and Pytorch on Taurus. It -has been integrated into the standard library under -the [venv](https://docs.python.org/3/library/venv.html). -However, if you have reasons (previously created environments etc) you -could easily use conda. The conda is the second way to use a virtual -environment on the Taurus. -[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) -is an open-source package management system and environment management system -from the Anaconda. - -**Note:** Keep in mind that you **can not** use conda for working with -the virtual environments previously created with Vitualenv tool and vice -versa! - -This example shows how to start working with environments and prepare -environment (kernel) for working with Jupyter server - - srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with 1 gpu on 1 node with 8000 mb. - - module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - - mkdir python-virtual-environments #create folder for your environments - cd python-virtual-environments #go to folder - module load TensorFlow #load TensorFlow module. Example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. - which python #check which python are you using - python3 -m venv --system-site-packages env #create virtual environment "env" which inheriting with global site packages - source env/bin/activate #Activate virtual environment "env". Example output: (env) bash-4.2$ - module load TensorFlow #load TensorFlow module in the virtual environment - -The inscription (env) at the beginning of each line represents that now -you are in the virtual environment. - -Now you can check the working capacity of the current environment. - - python #start python - import tensorflow as tf - print(tf.VERSION) #example output: 1.14.0 - -### Install Ipykernel - -Ipykernel is an interactive Python shell and a Jupyter kernel to work -with Python code in Jupyter notebooks. The IPython kernel is the Python -execution backend for Jupyter. The Jupyter Notebook -automatically ensures that the IPython kernel is available. - -``` - (env) bash-4.2$ pip install ipykernel #example output: Collecting ipykernel - ... - #example output: Successfully installed ... ipykernel-5.1.0 ipython-7.5.0 ... 
- - (env) bash-4.2$ python -m ipykernel install --user --name env --display-name="env" - - #example output: Installed kernelspec my-kernel in .../.local/share/jupyter/kernels/env - [install now additional packages for your notebooks] -``` - -Deactivate the virtual environment - - (env) bash-4.2$ deactivate - -So now you have a virtual environment with included TensorFlow module. -You can use this workflow for your purposes particularly for the simple -running of your jupyter notebook with Tensorflow code. - -## Examples and running the model - -Below are brief explanations examples of Jupyter notebooks with -Tensorflow models which you can run on ml nodes of HPC-DA. Prepared -examples of TensorFlow models give you an understanding of how to work -with jupyterhub and tensorflow models. It can be useful and instructive -to start your acquaintance with Tensorflow and HPC-DA system from these -simple examples. - -You can use a [remote Jupyter server](../access/jupyterhub.md). For simplicity, we -will recommend using Jupyterhub for our examples. - -JupyterHub is available [here](https://taurus.hrsk.tu-dresden.de/jupyter) - -Please check updates and details [JupyterHub](../access/jupyterhub.md). However, -the general pipeline can be briefly explained as follows. - -After logging, you can start a new session and configure it. There are -simple and advanced forms to set up your session. On the simple form, -you have to choose the "IBM Power (ppc64le)" architecture. You can -select the required number of CPUs and GPUs. For the acquaintance with -the system through the examples below the recommended amount of CPUs and -1 GPU will be enough. With the advanced form, you can use the -configuration with 1 GPU and 7 CPUs. To access all your workspaces -use " / " in the workspace scope. - -You need to download the file with a jupyter notebook that already -contains all you need for the start of the work. Please put the file -into your previously created virtual environment in your working -directory or use the kernel for your notebook. - -Note: You could work with simple examples in your home directory but according to -[new storage concept](../data_lifecycle/overview.md) please use -[workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. -For this reason, you have to use advanced options and put "/" in "Workspace scope" field. - -To download the first example (from the list below) into your previously -created virtual environment you could use the following command: - -``` - ws_list - cd <name_of_your_workspace> #go to workspace - - wget https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TensorFlowOnJupyterNotebook/Mnistmodel.zip - unzip Example_TensorFlow_Automobileset.zip -``` - -Also, you could use kernels for all notebooks, not only for them which placed -in your virtual environment. See the [jupyterhub](../access/jupyterhub.md) page. - -### Examples: - -1\. Simple MNIST model. The MNIST database is a large database of -handwritten digits that is commonly used for \<a -href="<https://en.wikipedia.org/wiki/Training_set>" title="Training -set">t\</a>raining various image processing systems. This model -illustrates using TF-Keras API. \<a -href="<https://www.tensorflow.org/guide/keras>" -target="\_top">Keras\</a> is TensorFlow's high-level API. Tensorflow and -Keras allow us to import and download the MNIST dataset directly from -their API. 
Recommended parameters for running this model is 1 GPU and 7 -cores (28 thread) - -[doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TensorFlowOnJupyterNotebook/Mnistmodel.zip]**todo**(Mnistmodel.zip) - -### Running the model - -\<span style="font-size: 1em;">Documents are organized with tabs and a -very versatile split-screen feature. On the left side of the screen, you -can open your file. Use 'File-Open from Path' to go to your workspace -(e.g. /scratch/ws/\<username-name_of_your_ws>). You could run each cell -separately step by step and analyze the result of each step. Default -command for running one cell Shift+Enter'. Also, you could run all cells -with the command 'run all cells' how presented on the picture -below\</span> - -**todo** \<img alt="Screenshot_from_2019-09-03_15-20-16.png" height="250" -src="Screenshot_from_2019-09-03_15-20-16.png" -title="Screenshot_from_2019-09-03_15-20-16.png" width="436" /> - -#### Additional advanced models - -1\. A simple regression model uses [Automobile -dataset](https://archive.ics.uci.edu/ml/datasets/Automobile). In a -regression problem, we aim to predict the output of a continuous value, -in this case, we try to predict fuel efficiency. This is the simple -model created to present how to work with a jupyter notebook for the -TensorFlow models. Recommended parameters for running this model is 1 -GPU and 7 cores (28 thread) - -[doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TensorFlowOnJupyterNotebook/Example_TensorFlow_Automobileset.zip]**todo**(Example_TensorFlow_Automobileset.zip) - -2\. The regression model uses the -[dataset](https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data) -with meteorological data from the Beijing airport and the US embassy. -The data set contains almost 50 thousand on instances and therefore -needs more computational effort. Recommended parameters for running this -model is 1 GPU and 7 cores (28 threads) - -[doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TensorFlowOnJupyterNotebook/Example_TensorFlow_Meteo_airport.zip]**todo**(Example_TensorFlow_Meteo_airport.zip) - -**Note**: All examples created only for study purposes. The main aim is -to introduce users of the HPC-DA system of TU-Dresden with TensorFlow -and Jupyter notebook. Examples do not pretend to completeness or -science's significance. Feel free to improve the models and use them for -your study. 
- -- [Mnistmodel.zip]**todo**(Mnistmodel.zip): Mnistmodel.zip -- [Example_TensorFlow_Automobileset.zip]**todo**(Example_TensorFlow_Automobileset.zip): - Example_TensorFlow_Automobileset.zip -- [Example_TensorFlow_Meteo_airport.zip]**todo**(Example_TensorFlow_Meteo_airport.zip): - Example_TensorFlow_Meteo_airport.zip -- [Example_TensorFlow_3D_road_network.zip]**todo**(Example_TensorFlow_3D_road_network.zip): - Example_TensorFlow_3D_road_network.zip diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index 8cf3a2c8add9b7976c1f2c3cb1dc8e07bc6d58d6..30efe872f2903f572242a858f37802f0d3a08701 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -1,4 +1,5 @@ nav: + - Home: index.md - Application for Login and Resources: - Overview: application/overview.md @@ -26,6 +27,7 @@ nav: - Modules: software/modules.md - Runtime Environment: software/runtime_environment.md - Custom EasyBuild Modules: software/custom_easy_build_environment.md + - Python Virtual Environments: software/python_virtual_environments.md - Containers: - Singularity: software/containers.md - Singularity Recicpe Hints: software/singularity_recipe_hints.md @@ -38,21 +40,20 @@ nav: - Nanoscale Simulations: software/nanoscale_simulations.md - FEM Software: software/fem_software.md - Visualization: software/visualization.md - - HPC-DA: - - Get started with HPC-DA: software/get_started_with_hpcda.md - - Machine Learning: software/machine_learning.md - - Deep Learning: software/deep_learning.md + - Data Analytics: + - Overview: software/data_analytics.md - Data Analytics with R: software/data_analytics_with_r.md - - Data Analytics with Python: software/python.md - - TensorFlow: - - TensorFlow Overview: software/tensorflow.md - - TensorFlow in Container: software/tensorflow_container_on_hpcda.md - - TensorFlow in JupyterHub: software/tensorflow_on_jupyter_notebook.md - - Keras: software/keras.md - - Dask: software/dask.md - - Power AI: software/power_ai.md + - Data Analytics with RStudio: software/data_analytics_with_rstudio.md + - Data Analytics with Python: software/data_analytics_with_python.md + - Apache Spark: software/big_data_frameworks_spark.md + - Machine Learning: + - Overview: software/machine_learning.md + - TensorFlow: software/tensorflow.md + - TensorBoard: software/tensorboard.md - PyTorch: software/pytorch.md - - Apache Spark, Apache Flink, Apache Hadoop: software/big_data_frameworks.md + - Distributed Training: software/distributed_training.md + - Hyperparameter Optimization (OmniOpt): software/hyperparameter_optimization.md + - PowerAI: software/power_ai.md - SCS5 Migration Hints: software/scs5_software.md - Virtual Machines: software/virtual_machines.md - Virtual Desktops: software/virtual_desktops.md @@ -67,7 +68,7 @@ nav: - Score-P: software/scorep.md - PAPI Library: software/papi.md - Pika: software/pika.md - - Perf Tools: software/perf_tools.md + - Perf Tools: software/perf_tools.md - Score-P: software/scorep.md - Vampir: software/vampir.md - Data Life Cycle Management: @@ -98,17 +99,26 @@ nav: - Taurus: jobs_and_resources/system_taurus.md - Slurm Examples: jobs_and_resources/slurm_examples.md - Slurm: jobs_and_resources/slurm.md - - HPC-DA: jobs_and_resources/hpcda.md - Binding And Distribution Of Tasks: jobs_and_resources/binding_and_distribution_of_tasks.md + # - Queue Policy: jobs/policy.md + # - Examples: jobs/examples/index.md + # - Affinity: jobs/affinity/index.md + # - Interactive: jobs/interactive.md + # - Best Practices: jobs/best-practices.md + 
# - Reservations: jobs/reservations.md + # - Monitoring: jobs/monitoring.md + # - FAQs: jobs/jobs-faq.md + #- Tests: tests.md + - Support: support.md - Archive: - Overview: archive/overview.md @@ -134,36 +144,49 @@ nav: - VampirTrace: archive/vampirtrace.md - Windows Batchjobs: archive/windows_batch.md - # Project Information + site_name: ZIH HPC Compendium site_description: ZIH HPC Compendium site_author: ZIH Team site_dir: public site_url: https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium + # uncomment next 3 lines if link to repo should not be displayed in the navbar + repo_name: GitLab hpc-compendium repo_url: https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium edit_uri: blob/master/docs/ # Configuration -#strict: true + +# strict: true theme: + # basetheme + name: material + # disable fonts being loaded from google fonts + font: false language: en + # dir containing all customizations + custom_dir: tud_theme favicon: assets/images/Logo_klein.png + # logo in header and footer + logo: assets/images/TUD_Logo_weiss_57.png second_logo: assets/images/zih_weiss.png # extends base css + extra_css: + - stylesheets/extra.css markdown_extensions: @@ -180,7 +203,9 @@ extra: homepage: https://tu-dresden.de zih_homepage: https://tu-dresden.de/zih hpcsupport_mail: hpcsupport@zih.tu-dresden.de + # links in footer + footer: - link: /legal_notice name: "Legal Notice / Impressum" diff --git a/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh b/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh index aa20c5a06de665a4420d8c6d41061ee0d6459015..d059a094ae07774abeab691b1610dccab498e94f 100755 --- a/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh +++ b/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh @@ -14,7 +14,7 @@ basedir=`dirname "$basedir"` # The pattern \<io\> should not be present in any file (case-insensitive match), except when it appears as ".io". 
ruleset="i \<io\> \.io s \<SLURM\> -i file \+system +i file \+system HDFS i \<taurus\> taurus\.hrsk /taurus i \<hrskii\> i hpc \+system diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell index 062652ab8abb32197155eb94ab38f3d172f747b5..f6e63f09748fe622247ff74a6237ee48e5285478 100644 --- a/doc.zih.tu-dresden.de/wordlist.aspell +++ b/doc.zih.tu-dresden.de/wordlist.aspell @@ -1,51 +1,66 @@ -personal_ws-1.1 en 1805 +personal_ws-1.1 en 203 Altix Amdahl's analytics anonymized -Anonymized +APIs BeeGFS benchmarking BLAS +broadwell bsub ccNUMA +centauri citable +conda CPU CPUs +CSV CUDA +cuDNN CXFS +dask +dataframes +DataFrames datamover -Datamover +DataParallel +DDP DDR DFG +DistributedDataParallel +DockerHub Dockerfile dockerized EasyBuild ecryptfs engl english +env +ESSL fastfs FFT FFTW filesystem -Filesystem filesystems Flink +foreach Fortran GBit GFLOPS gfortran GiB gnuplot -Gnuplot GPU +GPUs hadoop -Haswell +haswell HDFS Horovod hostname HPC HPL +hyperparameter +hyperparameters icc icpc ifort @@ -53,6 +68,7 @@ ImageNet Infiniband inode Itanium +jobqueue jpg Jupyter JupyterHub @@ -60,19 +76,22 @@ JupyterLab Keras KNL LAPACK +lapply LINPACK Linter LoadLeveler lsf -LSF lustre +Mathematica MEGWARE MiB MIMD +Miniconda MKL +MNIST Montecito mountpoint -MPI +mpi mpicc mpiCC mpicxx @@ -81,11 +100,16 @@ mpifort mpirun multicore multithreaded +NCCL Neptun NFS +NRINGS NUMA NUMAlink +NumPy Nutzungsbedingungen +OME +OmniOpt OPARI OpenACC OpenBLAS @@ -96,20 +120,33 @@ openmpi OpenMPI OpenSSH Opteron +overfitting +pandarallel +Pandarallel PAPI parallelization +parallelize pdf Perf +PESSL +PGI PiB Pika pipelining png +PowerAI +ppc +PSOCK Pthreads queue +randint reachability +README +Rmpi rome romeo RSA +RStudio Rsync salloc Saxonid @@ -119,6 +156,8 @@ scalable ScaLAPACK Scalasca scancel +Scikit +SciPy scontrol scp scs @@ -130,27 +169,42 @@ SHMEM SLES Slurm SMP +SMT +squeue srun ssd -SSD stderr stdout SUSE TBB +TCP +TensorBoard TensorFlow TFLOPS Theano tmp +todo +ToDo tracefile tracefiles +transferability Trition uplink Vampir VampirTrace VampirTrace's +vectorization +venv +virtualenv WebVNC WinSCP +Workdir +workspace workspaces +XArray Xeon +XGBoost +XLC +XLF ZIH ZIH's