diff --git a/Compendium_attachments/PyTorch/example_PyTorch_parallel.zip b/Compendium_attachments/PyTorch/example_PyTorch_parallel.zip
deleted file mode 100644
index 05c458be14d89e7862cb547b2db8aaefd9a654a1..0000000000000000000000000000000000000000
Binary files a/Compendium_attachments/PyTorch/example_PyTorch_parallel.zip and /dev/null differ
diff --git a/doc.zih.tu-dresden.de/docs/software/distributed_training.md b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
index 98bbdfaa342d1a1a0277e06b6a5ca16b3e9ba10a..ab35c32abcb9756c70aec9caa3dc364f840cef24 100644
--- a/doc.zih.tu-dresden.de/docs/software/distributed_training.md
+++ b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
@@ -55,90 +55,91 @@ The Parameter Server holds the parameters and is responsible for updating
 the global state of the models.
 Each worker runs the training loop independently.

-#### Example
-
-In this case, we will go through an example with Multi Worker Mirrored Strategy.
-Multi-node training requires a `TF_CONFIG` environment variable to be set which will
-be different on each node.
-
-```console
-marie@compute$ TF_CONFIG='{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, "task": {"index": 0, "type": "worker"}}' python main.py
-```
-
-The `cluster` field describes how the cluster is set up (same on each node).
-Here, the cluster has two nodes referred to as workers.
-The `IP:port` information is listed in the `worker` array.
-The `task` field varies from node to node.
-It specifies the type and index of the node.
-In this case, the training job runs on worker 0, which is `10.1.10.58:12345`.
-We need to adapt this snippet for each node.
-The second node will have `'task': {'index': 1, 'type': 'worker'}`.
-
-With two modifications, we can parallelize the serial code:
-We need to initialize the distributed strategy:
-
-```python
-strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
-```
-
-And define the model under the strategy scope:
-
-```python
-with strategy.scope():
-    model = resnet.resnet56(img_input=img_input, classes=NUM_CLASSES)
-    model.compile(
-        optimizer=opt,
-        loss='sparse_categorical_crossentropy',
-        metrics=['sparse_categorical_accuracy'])
-model.fit(train_dataset,
-          epochs=NUM_EPOCHS)
-```
-
-To run distributed training, the training script needs to be copied to all nodes,
-in this case on two nodes.
-TensorFlow is available as a module.
-Check for the version.
-The `TF_CONFIG` environment variable can be set as a prefix to the command.
-Now, run the script on the partition `alpha` simultaneously on both nodes:
-
-```bash
-#!/bin/bash
-
-#SBATCH --job-name=distr
-#SBATCH --partition=alpha
-#SBATCH --output=%j.out
-#SBATCH --error=%j.err
-#SBATCH --mem=64000
-#SBATCH --nodes=2
-#SBATCH --ntasks=2
-#SBATCH --ntasks-per-node=1
-#SBATCH --cpus-per-task=14
-#SBATCH --gres=gpu:1
-#SBATCH --time=01:00:00
-
-function print_nodelist {
+??? example "Multi Worker Mirrored Strategy"
+
+    In this case, we will go through an example with Multi Worker Mirrored Strategy.
+    Multi-node training requires the `TF_CONFIG` environment variable to be set, and its value
+    will be different on each node.
+
+    ```console
+    marie@compute$ TF_CONFIG='{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, "task": {"index": 0, "type": "worker"}}' python main.py
+    ```
+
+    The `cluster` field describes how the cluster is set up (same on each node).
+    Here, the cluster has two nodes referred to as workers.
+    The `IP:port` information is listed in the `worker` array.
+    The `task` field varies from node to node.
+    It specifies the type and index of the node.
+    In this case, the training job runs on worker 0, which is `10.1.10.58:12345`.
+    We need to adapt this snippet for each node.
+    The second node will have `'task': {'index': 1, 'type': 'worker'}`.
+
+    With two modifications, we can parallelize the serial code.
+    First, we need to initialize the distributed strategy:
+
+    ```python
+    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
+    ```
+
+    Then, we define the model under the strategy scope:
+
+    ```python
+    with strategy.scope():
+        model = resnet.resnet56(img_input=img_input, classes=NUM_CLASSES)
+        model.compile(
+            optimizer=opt,
+            loss='sparse_categorical_crossentropy',
+            metrics=['sparse_categorical_accuracy'])
+    model.fit(train_dataset,
+              epochs=NUM_EPOCHS)
+    ```
+
+    To run distributed training, the training script needs to be copied to all nodes,
+    in this case to two nodes.
+    TensorFlow is available as a module.
+    Check for the available versions.
+    The `TF_CONFIG` environment variable can be set as a prefix to the command.
+    Now, run the script on the partition `alpha` simultaneously on both nodes:
+
+    ```bash
+    #!/bin/bash
+
+    #SBATCH --job-name=distr
+    #SBATCH --partition=alpha
+    #SBATCH --output=%j.out
+    #SBATCH --error=%j.err
+    #SBATCH --mem=64000
+    #SBATCH --nodes=2
+    #SBATCH --ntasks=2
+    #SBATCH --ntasks-per-node=1
+    #SBATCH --cpus-per-task=14
+    #SBATCH --gres=gpu:1
+    #SBATCH --time=01:00:00
+
+    function print_nodelist {
         scontrol show hostname $SLURM_NODELIST
-}
-NODE_1=$(print_nodelist | awk '{print $1}' | sort -u | head -n 1)
-NODE_2=$(print_nodelist | awk '{print $1}' | sort -u | tail -n 1)
-IP_1=$(dig +short ${NODE_1}.taurus.hrsk.tu-dresden.de)
-IP_2=$(dig +short ${NODE_2}.taurus.hrsk.tu-dresden.de)
+    }
+    NODE_1=$(print_nodelist | awk '{print $1}' | sort -u | head -n 1)
+    NODE_2=$(print_nodelist | awk '{print $1}' | sort -u | tail -n 1)
+    IP_1=$(dig +short ${NODE_1}.taurus.hrsk.tu-dresden.de)
+    IP_2=$(dig +short ${NODE_2}.taurus.hrsk.tu-dresden.de)

-module load modenv/hiera
-module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 TensorFlow/2.4.1
+    module load modenv/hiera
+    module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 TensorFlow/2.4.1

-# On the first node
-TF_CONFIG='{"cluster": {"worker": ["'"${NODE_1}"':33562", "'"${NODE_2}"':33561"]}, "task": {"index": 0, "type": "worker"}}' srun -w ${NODE_1} -N 1 --ntasks=1 --gres=gpu:1 python main_ddl.py &
+    # On the first node
+    TF_CONFIG='{"cluster": {"worker": ["'"${NODE_1}"':33562", "'"${NODE_2}"':33561"]}, "task": {"index": 0, "type": "worker"}}' srun -w ${NODE_1} -N 1 --ntasks=1 --gres=gpu:1 python main_ddl.py &

-# On the second node
-TF_CONFIG='{"cluster": {"worker": ["'"${NODE_1}"':33562", "'"${NODE_2}"':33561"]}, "task": {"index": 1, "type": "worker"}}' srun -w ${NODE_2} -N 1 --ntasks=1 --gres=gpu:1 python main_ddl.py &
+    # On the second node
+    TF_CONFIG='{"cluster": {"worker": ["'"${NODE_1}"':33562", "'"${NODE_2}"':33561"]}, "task": {"index": 1, "type": "worker"}}' srun -w ${NODE_2} -N 1 --ntasks=1 --gres=gpu:1 python main_ddl.py &

-wait
-```
+    wait
+    ```
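+
+    If the script is instead started once per node by a single `srun` call spanning all nodes
+    (one task per node), `TF_CONFIG` can also be built inside the Python script itself from
+    Slurm's environment variables instead of being spliced together in the job script.
+    The following lines are only a sketch of that variant (the port number is an arbitrary
+    choice); they have to run before the strategy is created:
+
+    ```python
+    import json
+    import os
+    import subprocess
+
+    # Resolve the hostnames of all nodes belonging to this Slurm job.
+    nodes = subprocess.run(
+        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
+        capture_output=True, text=True, check=True,
+    ).stdout.split()
+
+    # One worker per node; SLURM_PROCID is the index of this task within the job step.
+    os.environ["TF_CONFIG"] = json.dumps({
+        "cluster": {"worker": [f"{node}:33562" for node in nodes]},
+        "task": {"index": int(os.environ["SLURM_PROCID"]), "type": "worker"},
+    })
+    ```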

 ### Distributed PyTorch

 !!! note
+    This section is under construction

 PyTorch provides multiple ways to achieve data parallelism to train the deep learning models
@@ -179,23 +180,21 @@ See: Use `nn.parallel.DistributedDataParallel` instead of multiprocessing or `nn
 Check the [page](https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead) and
 [Distributed Data Parallel](https://pytorch.org/docs/stable/notes/ddp.html#ddp).

-Examples:
+??? example "Parallel Model"

-1. The parallel model.
-The main aim of this model to show the way how to effectively implement your
-neural network on several GPUs.
-It includes a comparison of different kinds of models and tips to improve the performance
-of your model.
-**Necessary** parameters for running this model are **2 GPU** and 14 cores.
+    The main aim of this model is to show how to effectively implement your neural network on
+    multiple GPUs. It includes a comparison of different kinds of models and tips to improve the
+    performance of your model.
+    **Necessary** parameters for running this model are **2 GPUs** and 14 cores.

-(example_PyTorch_parallel.zip)
+    Download: [example_PyTorch_parallel.zip (4.2 KB)](misc/example_PyTorch_parallel.zip)

-Remember that for using [JupyterHub service](../access/jupyterhub.md) for PyTorch you need to
-create and activate a virtual environment (kernel) with loaded essential modules.
+    Remember that when using the [JupyterHub service](../access/jupyterhub.md) for PyTorch, you need
+    to create and activate a virtual environment (kernel) with the essential modules loaded.

-Run the example in the same way as the previous examples.
+    Run the example in the same way as the previous examples.

-#### Distributed data-parallel
+#### Distributed Data-Parallel

 [DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel)
 (DDP) implements data parallelism at the module level which can run across multiple machines.
Applications using DDP should spawn multiple processes and create a single DDP instance per process.
DDP uses collective communications in the
[torch.distributed](https://pytorch.org/tutorials/intermediate/dist_tuto.html) package to
synchronize gradients and buffers.
Please also look at the [official tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
@@ -206,21 +205,21 @@ synchronize gradients and buffers.
-To use distributed data parallelism on ZIH systems, please make sure the `--ntasks-per-node`
-parameter is equal to the number of GPUs you use per node.
+To use distributed data parallelism on ZIH systems, please make sure the value of the
+parameter `--ntasks-per-node=<N>` equals the number of GPUs you use per node.
 Also, it can be useful to increase `memory/cpu` parameters if you run larger models.
 Memory can be set up to:

 - `--mem=250G` and `--cpus-per-task=7` for the partition `ml`.
 - `--mem=60G` and `--cpus-per-task=6` for the partition `gpu2`.

-Keep in mind that only one memory parameter (`--mem-per-cpu=<MB>` or `--mem=<MB>`) can be specified
+Keep in mind that only one memory parameter (`--mem-per-cpu=<MB>` or `--mem=<MB>`) can be specified.
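+
+??? example "DistributedDataParallel"
+
+    The following is only a minimal sketch, closely following the official tutorial linked above.
+    It assumes one process (Slurm task) per GPU and that the usual `MASTER_ADDR`, `MASTER_PORT`,
+    `RANK` and `WORLD_SIZE` environment variables are exported in the job script, e.g. derived
+    from `SLURM_PROCID` and `SLURM_NTASKS`. Replace the toy model and data with your own.
+
+    ```python
+    import os
+
+    import torch
+    import torch.distributed as dist
+    from torch.nn.parallel import DistributedDataParallel as DDP
+
+    # NCCL backend for GPU training; reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
+    dist.init_process_group(backend="nccl")
+
+    # With --ntasks-per-node equal to the number of GPUs per node, this maps each
+    # process to one local GPU.
+    local_gpu = int(os.environ["RANK"]) % torch.cuda.device_count()
+    torch.cuda.set_device(local_gpu)
+
+    model = torch.nn.Linear(10, 10).to(local_gpu)     # toy model, replace with your own
+    ddp_model = DDP(model, device_ids=[local_gpu])
+
+    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
+    loss_fn = torch.nn.MSELoss()
+
+    for _ in range(5):                                # toy training loop
+        optimizer.zero_grad()
+        outputs = ddp_model(torch.randn(20, 10).to(local_gpu))
+        labels = torch.randn(20, 10).to(local_gpu)
+        loss_fn(outputs, labels).backward()
+        optimizer.step()
+
+    dist.destroy_process_group()
+    ```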

 ## External Distribution

 ### Horovod

-[Horovod](https://github.com/horovod/horovod) is the open source distributed training framework
+[Horovod](https://github.com/horovod/horovod) is the open-source distributed training framework
 for TensorFlow, Keras and PyTorch.
 It makes it easier to develop distributed deep learning projects and speeds them up.
 Horovod scales well to a large number of nodes and has a strong focus on efficient training on
@@ -235,7 +234,7 @@ the distributed code from TensorFlow for instance, with parameter servers.
 Horovod uses MPI and NCCL which gives in some cases better results than pure TensorFlow and
 PyTorch.

-#### Horovod as a module
+#### Horovod as Module

 Horovod is available as a module with **TensorFlow** or **PyTorch** for **all** module
 environments.
@@ -260,19 +259,19 @@ marie@compute$ module load Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python

 Or if you want to use Horovod on the partition `alpha`, you can load it with the dependencies:

-```bash
+```console
 marie@alpha$ module spider Horovod          #Check available modules
 marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 Horovod/0.21.1-TensorFlow-2.4.1
 ```

-#### Horovod installation
+#### Horovod Installation

 However, if it is necessary to use another version of Horovod, it is possible to install it
 manually. For that, you need to create a [virtual environment](python_virtual_environments.md) and
 load the dependencies (e.g. MPI).
 Installing TensorFlow can take a few hours and is not recommended.

-##### Install Horovod for TensorFlow with python and pip
+##### Install Horovod for TensorFlow with Python and Pip

 This example shows the installation of Horovod for TensorFlow.
 Adapt as required and refer to the [Horovod documentation](https://horovod.readthedocs.io/en/stable/install_include.html)
@@ -299,13 +298,12 @@ Available Tensor Operations:
     [ ] CCL
     [X] MPI
     [ ] Gloo
-
 ```

 If you want to use OpenMPI then specify `HOROVOD_GPU_ALLREDUCE=MPI`.
 To have better performance it is recommended to use NCCL instead of OpenMPI.

-##### Verify that Horovod works
+##### Verify Horovod Works

 ```pycon
 >>> import tensorflow
@@ -320,29 +318,30 @@ To have better performance it is recommended to use NCCL instead of OpenMPI.
 Hello from: 0
 ```

-#### Example
-
-Follow the steps in the [official examples](https://github.com/horovod/horovod/tree/master/examples)
-to parallelize your code.
-In Horovod, each GPU gets pinned to a process.
-You can easily start your job with the following bash script with four processes on two nodes:
-
-```bash
-#!/bin/bash
-#SBATCH --nodes=2
-#SBATCH --ntasks=4
-#SBATCH --ntasks-per-node=2
-#SBATCH --gres=gpu:2
-#SBATCH --partition=ml
-#SBATCH --mem=250G
-#SBATCH --time=01:00:00
-#SBATCH --output=run_horovod.out
-
-module load modenv/ml
-module load Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python-3.7.4
-
-srun python <your_program.py>
-```
-
-Do not forget to specify the total number of tasks `--ntasks` and the number of tasks per node
-`--ntasks-per-node` which must match the number of GPUs per node.
+??? example
+
+    Follow the steps in the
+    [official examples](https://github.com/horovod/horovod/tree/master/examples)
+    to parallelize your code.
+    In Horovod, each GPU gets pinned to a process.
+    You can easily start your job with the following batch script, using four processes on two nodes:
+
+    ```bash
+    #!/bin/bash
+    #SBATCH --nodes=2
+    #SBATCH --ntasks=4
+    #SBATCH --ntasks-per-node=2
+    #SBATCH --gres=gpu:2
+    #SBATCH --partition=ml
+    #SBATCH --mem=250G
+    #SBATCH --time=01:00:00
+    #SBATCH --output=run_horovod.out
+
+    module load modenv/ml
+    module load Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python-3.7.4
+
+    srun python <your_program.py>
+    ```
+
+    Do not forget to specify the total number of tasks (`--ntasks`) and the number of tasks per node
+    (`--ntasks-per-node`), which must match the number of GPUs per node.
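+
+    What `<your_program.py>` has to contain is not shown here; as a rough orientation, the
+    Horovod-specific lines of a Keras training script typically look like the following sketch
+    (the tiny model and random data are only placeholders for your own code):
+
+    ```python
+    import tensorflow as tf
+    import horovod.tensorflow.keras as hvd
+
+    hvd.init()
+
+    # Pin each process to a single GPU (one process per GPU).
+    gpus = tf.config.experimental.list_physical_devices('GPU')
+    if gpus:
+        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
+
+    # Placeholder data and model; replace with your own.
+    dataset = tf.data.Dataset.from_tensor_slices(
+        (tf.random.normal([256, 32]), tf.random.uniform([256], maxval=10, dtype=tf.int64))
+    ).batch(32)
+    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
+
+    # Scale the learning rate by the number of workers and wrap the optimizer.
+    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
+    model.compile(optimizer=opt,
+                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
+
+    # Broadcast the initial state from rank 0 and keep the log output of one rank only.
+    model.fit(dataset,
+              epochs=5,
+              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
+              verbose=1 if hvd.rank() == 0 else 0)
+    ```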
diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell
index c54073a4814792a22cd904a36266f46e4cfef91f..2cc8f1197350bfb759fc76721c1ab63055b3bcf1 100644
--- a/doc.zih.tu-dresden.de/wordlist.aspell
+++ b/doc.zih.tu-dresden.de/wordlist.aspell
@@ -96,6 +96,7 @@ GitHub
 GitLab
 GitLab's
 glibc
+Gloo
 gnuplot
 gpu
 GPU