diff --git a/doc.zih.tu-dresden.de/docs/software/distributed_training.md b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
index 1008e33f6a60ba3b4b189deeae2d0f2b14066ffd..98bbdfaa342d1a1a0277e06b6a5ca16b3e9ba10a 100644
--- a/doc.zih.tu-dresden.de/docs/software/distributed_training.md
+++ b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
@@ -141,6 +141,55 @@ wait
 !!! note
     This section is under construction
 
+PyTorch provides multiple ways to achieve data parallelism and thus train deep learning models
+efficiently. The required communication primitives are provided by the `torch.distributed`
+sub-package that ships with the main PyTorch package.
+
+The easiest way to quickly check whether a model can be trained in a multi-GPU setting is to
+wrap the existing model in the `torch.nn.DataParallel` class as shown below:
+
+```python
+model = torch.nn.DataParallel(model)
+```
+
+Adding this single line of code to an existing application lets PyTorch know that the model
+needs to be parallelized. However, since this method uses threading to achieve parallelism, it
+fails to achieve true parallelism due to the well-known Global Interpreter Lock (GIL) in Python.
+To work around this issue and gain the performance benefits of parallelism, the use of
+`torch.nn.parallel.DistributedDataParallel` is recommended. This involves a few more code
+changes to set up, but further increases the performance of model training. The first step is to
+initialize the process group by calling `torch.distributed.init_process_group()` with the
+appropriate back end such as NCCL, MPI or Gloo. The use of NCCL as back end is recommended, as
+it is currently the fastest back end when using GPUs.
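+
+A minimal sketch of this setup could look as follows, assuming the script is started with one
+process per GPU, e.g. via `torchrun`, which sets the `LOCAL_RANK` environment variable
+(`MyModel` is a placeholder for your own network):
+
+```python
+import os
+
+import torch
+import torch.distributed as dist
+from torch.nn.parallel import DistributedDataParallel
+
+# Initialize the process group; NCCL is the recommended back end for GPUs.
+# Rank and world size are read from environment variables set by the launcher.
+dist.init_process_group(backend="nccl")
+
+# Each process drives exactly one GPU, selected by its local rank.
+local_rank = int(os.environ["LOCAL_RANK"])
+torch.cuda.set_device(local_rank)
+
+model = MyModel()  # hypothetical model class, stands in for your own network
+model = model.to(local_rank)
+model = DistributedDataParallel(model, device_ids=[local_rank])
+
+# ... training loop as usual; DDP averages gradients across all processes ...
+
+dist.destroy_process_group()
+```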
+
 #### Using Multiple GPUs with PyTorch
 
 The example below shows how to solve that problem by using model parallelism, which in contrast to