I have done some tests regarding the use of NGC containers for the alpha partition and the use of NGC in general.
During the tests, I have discovered that the structure and code inside specific NGC containers (PyTorch, for example) have changed a lot between releases. From PyTorch 21.03 to PyTorch 21.06, for instance, the code regarding parallelization has changed. As far as I understand, the authors of the various PyTorch containers use different parallelization functions, so it will be difficult to suggest one way of parallelization for the different containers. The documentation regarding the advanced options is not perfect either.
I have made tests with PyTorch 21.06 on a classification task (a downscaled ImageNet dataset) with ResNet. The main metrics were the number of images per second and the latency.
The main problem so far is that I can run some containers on 2 nodes (thanks for the solution from Danny), but with the PyTorch container I did not succeed, although I have tried different approaches. It uses Apex. I can run the container on 1-8 GPUs of a single node, but that's all.
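For reference, the general shape of a two-node launch with Slurm, Singularity, and the PyTorch distributed launcher is sketched below; the SIF image, training script, dataset path, and port are placeholders, and this sketch is not a verified recipe for the NGC PyTorch container.

```
# Hypothetical sketch of a 2-node, 8-GPUs-per-node launch; the SIF image,
# training script, dataset path, and port are placeholders, and this is NOT
# a verified recipe for the NGC PyTorch container.
srun --nodes=2 --ntasks-per-node=1 --gres=gpu:8 \
  bash -c 'singularity exec --nv pytorch_21.06-py3.sif \
    python -m torch.distributed.launch \
      --nproc_per_node=8 \
      --nnodes=2 \
      --node_rank=$SLURM_NODEID \
      --master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) \
      --master_port=29500 \
      train.py --arch resnet50 /path/to/imagenet'
```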
I have compared the results obtained for the same configuration in the container and for the same code run interactively on the node via srun.
The command for one node with 8 GPUs and the NGC container:
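As a rough illustration (not the exact command), a single-node, 8-GPU run via Slurm and Singularity with the NGC PyTorch 21.06 image typically takes a form like the following; the SIF image, training script, batch size, and dataset path are placeholders.

```
# Hypothetical single-node, 8-GPU launch of the NGC PyTorch 21.06 container
# via Slurm + Singularity; the SIF image, script, and dataset path are placeholders.
srun --nodes=1 --ntasks=1 --gres=gpu:8 --time=02:00:00 \
  singularity exec --nv pytorch_21.06-py3.sif \
    python -m torch.distributed.launch --nproc_per_node=8 \
      train.py --arch resnet50 --batch-size 256 /path/to/imagenet
```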
The comparison of the results shows that the performance for the interactive runs and the container runs is quite close. The latency for the containers is worse. The parallelization gives some improvement in bandwidth (imgs/s), but this topic requires further research.
@rotscher--tu-dresden.de Dear Danny, could you please help me solve the issue with using NGC containers on multiple nodes? I have no idea how to run them on multiple nodes.
salloc only creates an allocation. To actually run something ON the nodes (and not on the login node) you need to use srun. Of course, the (CPU-only) login nodes don't have any NVIDIA driver files, hence the error.
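For example (resources, time limit, and the image name below are just placeholders):

```
# salloc reserves the resources but leaves the shell on the login node ...
salloc --nodes=1 --gres=gpu:8 --time=01:00:00

# ... so GPU commands must still be dispatched to the allocated node with srun;
# running them directly in the salloc shell hits the login node, which has no
# NVIDIA driver. (The image name is a placeholder.)
srun nvidia-smi
srun singularity exec --nv pytorch_21.06-py3.sif nvidia-smi
```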