Commit ef94e413 authored by Natalie Breidenbach

Update distributed_training.md

parent d98d3720
2 merge requests: !938 Automated merge from preview to main, !936 Update to Five-Cluster-Operation
@@ -99,13 +99,12 @@ Each worker runs the training loop independently.
TensorFlow is available as a module.
Check for the version.
The `TF_CONFIG` environment variable can be set as a prefix to the command.

-Now, run the script on the partition `alpha` simultaneously on both nodes:
+Now, run the script on the cluster `alpha` simultaneously on both nodes:

```bash
#!/bin/bash
#SBATCH --job-name=distr
-#SBATCH --partition=alpha
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --mem=64000
@@ -121,8 +120,8 @@ Each worker runs the training loop independently.
}
NODE_1=$(print_nodelist | awk '{print $1}' | sort -u | head -n 1)
NODE_2=$(print_nodelist | awk '{print $1}' | sort -u | tail -n 1)
-IP_1=$(dig +short ${NODE_1}.taurus.hrsk.tu-dresden.de)
-IP_2=$(dig +short ${NODE_2}.taurus.hrsk.tu-dresden.de)
+IP_1=$(dig +short ${NODE_1}.alpha.hpc.tu-dresden.de)
+IP_2=$(dig +short ${NODE_2}.alpha.hpc.tu-dresden.de)

module load modenv/hiera
module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 TensorFlow/2.4.1
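
For context on the two hunks above: `TF_CONFIG` tells each TensorFlow worker which peers form the cluster and which index it holds, which is why the script resolves both node IPs first. Below is a minimal sketch of what the variable could carry, with placeholder IPs standing in for `${IP_1}`/`${IP_2}` and an arbitrary port 8888 (neither is prescribed by the file being edited); the `experimental` API path matches the `TensorFlow/2.4.1` module loaded above:

```python
# Sketch only: a TF_CONFIG for the two workers resolved above.
# The IPs stand in for ${IP_1}/${IP_2}; port 8888 is an assumption.
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:8888", "10.0.0.2:8888"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second node
})

import tensorflow as tf

# Constructed after TF_CONFIG is set, so the resolver picks up the cluster spec.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
```

Setting the same JSON as a prefix to the `python` call, as the text above puts it, achieves the same effect from the shell.
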
@@ -257,7 +256,7 @@ marie@compute$ module spider Horovod # Check available modules
marie@compute$ module load Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python-3.7.4
```

-Or if you want to use Horovod on the partition `alpha`, you can load it with the dependencies:
+Or if you want to use Horovod on the cluster `alpha`, you can load it with the dependencies:

```console
marie@alpha$ module spider Horovod #Check available modules
@@ -324,7 +323,7 @@ Hello from: 0
[official examples](https://github.com/horovod/horovod/tree/master/examples)
to parallelize your code.
In Horovod, each GPU gets pinned to a process.
-You can easily start your job with the following bash script with four processes on two nodes:
+You can easily start your job with the following bash script with four processes on two nodes using the cluster Power:

```bash
#!/bin/bash
@@ -332,7 +331,6 @@ Hello from: 0
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
-#SBATCH --partition=ml
#SBATCH --mem=250G
#SBATCH --time=01:00:00
#SBATCH --output=run_horovod.out
...
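
The Horovod hunks above note that each GPU gets pinned to a process; here is a minimal sketch of what that pinning looks like inside the training script, patterned after the official Horovod examples linked in the diff rather than taken from the file being edited:

```python
# Sketch of Horovod's one-process-per-GPU pinning (TensorFlow 2.x API).
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # one process per GPU; Slurm starts 4 such tasks across 2 nodes

# Pin this process to a single GPU, chosen by its node-local rank.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

print("Hello from:", hvd.rank())  # matches the "Hello from: 0" context above
```
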