Distributed training in NVIDIA's containers (NGC)
The page docs/software/ngc_containers.md
states that multi-node usage is not possible on ZIH systems.
However, the NGC containers for TensorFlow 1 and 2 already provide OpenMPI, NCCL, and Horovod: https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/rel_22-01.html#rel_22-01.
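As a minimal sketch of what multi-node usage with these containers could look like, assuming Slurm with Singularity/Apptainer is used to launch the container (image file name, resource numbers, and the training script `train_horovod.py` are placeholders, not taken from the docs):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8        # one MPI rank per GPU
#SBATCH --gres=gpu:8
#SBATCH --partition=alpha
#SBATCH --time=01:00:00

# Hypothetical local copy of the NGC TensorFlow image, e.g. pulled via
#   singularity pull docker://nvcr.io/nvidia/tensorflow:22.01-tf2-py3
IMAGE=tensorflow_22.01-tf2-py3.sif

# Horovod inside the container attaches to the MPI ranks started by srun,
# so no additional installation is needed on the host.
srun singularity exec --nv "$IMAGE" python train_horovod.py
```

The point is that everything Horovod needs (OpenMPI, NCCL) already ships in the image; the batch system only has to start one process per GPU.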
I've run some tests with the TensorFlow containers, which worked quite well on partition alpha.
I also compared the performance of OpenMPI and NCCL as provided by the container against the versions provided by the module environment on partition alpha,
using the OSU micro-benchmarks. From my point of view, there are no performance problems with using NGC containers.
Results can be seen here.
The PyTorch containers only include OpenMPI and NCCL, but not Horovod: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-01.html. The performance of OpenMPI and NCCL should be comparable to the TensorFlow containers.
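Since Horovod is missing from the PyTorch image, a user would fall back on `torch.distributed`, which the container supports out of the box via NCCL. A hedged sketch of a multi-node launch, again assuming Slurm with Singularity (image name, node counts, rendezvous port, and `train_ddp.py` are illustrative placeholders):

```shell
# One launcher per node via srun; torchrun then spawns one training process
# per GPU. torch.distributed talks NCCL directly, so Horovod is not needed.
HEAD_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)

srun --nodes=2 --ntasks-per-node=1 \
    singularity exec --nv pytorch_22.01-py3.sif \
    torchrun --nnodes=2 --nproc_per_node=8 \
             --rdzv_backend=c10d --rdzv_endpoint="$HEAD_NODE:29500" \
             train_ddp.py
```

So the lack of Horovod in the PyTorch image is not a blocker for distributed training, only a difference in the launch mechanism.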
The TensorFlow and PyTorch containers also provide NVIDIA Nsight for performance profiling of TensorFlow/PyTorch applications. They no longer include NVIDIA's DLProf, but it can easily be installed by the user via pip.
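The pip installation of DLProf could look like the following, assuming the package names from NVIDIA's documentation (run inside the container or a user virtualenv; the `[pytorch]` extra is for the PyTorch plugin):

```shell
# DLProf is distributed via NVIDIA's own pip index, which must be
# registered first.
pip install nvidia-pyindex
pip install nvidia-dlprof[pytorch]
```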
There should also be a hint in docs/software/distributed_training.md
about using NGC containers, since they provide an easy start with distributed training without any further installation.