From af8ebb90a67ea7497b4c552cf46285561a8174df Mon Sep 17 00:00:00 2001
From: lazariv <taras.lazariv@tu-dresden.de>
Date: Fri, 8 Oct 2021 13:28:23 +0000
Subject: [PATCH] Polished language and formatting. Shortened introduction.

---
 .../docs/software/ngc_containers.md           | 88 ++++++++-----------
 1 file changed, 39 insertions(+), 49 deletions(-)

diff --git a/doc.zih.tu-dresden.de/docs/software/ngc_containers.md b/doc.zih.tu-dresden.de/docs/software/ngc_containers.md
index cef1a642d..835259ce9 100644
--- a/doc.zih.tu-dresden.de/docs/software/ngc_containers.md
+++ b/doc.zih.tu-dresden.de/docs/software/ngc_containers.md
@@ -1,54 +1,32 @@
 # GPU-accelerated Containers for Deep Learning (NGC Containers)
 
-## Containers
-
-A container is an executable and portable unit of software.
-[Containerization](https://www.ibm.com/cloud/learn/containerization) means
-encapsulating or packaging up software code and all its dependencies
-to run uniformly and consistently on any infrastructure. In other words,
-it is agnostic to host specific environment like OS, etc.
-
-The entire software environment, from the deep learning framework itself,
-down to the math and communication libraries that are necessary for performance,
-is packaged into a single bundle.
-
+A [container](containers.md) is an executable and portable unit of software.
 On ZIH systems, [Singularity](https://sylabs.io/) is used as a standard container solution.
 
-## NGC Containers in General
-
 [NGC](https://developer.nvidia.com/ai-hpc-containers), a registry of highly GPU-optimized software,
-has been enabling scientists and researchers by providing regularly
-updated and validated containers of HPC and AI applications.
+provides scientists and researchers with regularly
+updated and validated containers for HPC and AI applications.
-
-Singularity supports NGC containers.
-
-NGC containers are optimized for HPC applications.
 NGC containers are **GPU-optimized** containers
-for **deep learning,** **machine learning**, visualization:
+for deep learning, machine learning, and visualization:
 
 - Built-in libraries and dependencies;
 - Faster training with Automatic Mixed Precision (AMP);
-- Opportunity to scaling up from single-node to multi-node systems;
+- Opportunity to scale up from single-node to multi-node systems;
 - Performance optimized.
 
-### Why NGC Containers?
-
-Advantages of NGC containers:
-
-- NGC containers were highly optimized for cluster usage.
-The performance provided by NGC containers is comparable to the performance
-provided by the modules on the ZIH system (which is potentially the most performant way).
-NGC containers are a quick and efficient way to apply the best models
-on your dataset on a ZIH system;
-- NGC containers allow using an exact version of the software
-without installing it with all prerequisites manually.
-Manual installation can result in poor performance (e.g. using conda to install a software).
+!!! note "Advantages of NGC containers"
+    - NGC containers are highly optimized for cluster usage.
+      The performance they provide is comparable to the performance of the
+      modules on the ZIH system (which is typically the most performant option).
+      NGC containers are a quick and efficient way to apply the best models
+      to your dataset on a ZIH system;
+    - NGC containers allow you to use an exact version of the software
+      without manually installing it and all of its prerequisites.
+      Manual installation can result in poor performance (e.g. installing software with conda).
 
 ## Run NGC Containers on the ZIH System
 
-### Preparation
-
-The first step is a choice of the necessary software (container) to run.
+The first step is to choose the software (container) you want to run.
 The [NVIDIA NGC catalog](https://ngc.nvidia.com/catalog)
 contains a host of GPU-optimized containers for deep learning,
 machine learning, visualization, and high-performance computing (HPC) applications.
 
-To find a container that fits the requirements of your task, please check
-the [official examples page](https://github.com/NVIDIA/DeepLearningExamples)
-with the list of main containers with their features and peculiarities.
+To find a container that fits the requirements of your task, please check
+the [official examples page](https://github.com/NVIDIA/DeepLearningExamples),
+which lists the main containers along with their features and peculiarities.
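+
+Once you have chosen an image, you can fetch it and convert it into a local
+Singularity image with `singularity pull` (a minimal sketch; it assumes the
+PyTorch image used in the examples below, run from your workspace directory):
+
+```console
+marie@compute$ singularity pull pytorch:21.06-py3 docker://nvcr.io/nvidia/pytorch:21.06-py3
+```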
 
-### Building and Running a Container
-
-#### Run NGC container on a Single GPU
+### Run NGC Container on a Single GPU
 
 !!! note
     Almost all NGC containers can work with a single GPU.
 
 Create a container from the image from the NGC catalog.
-(For this example, the alpha is used):
+(For this example, the partition `alpha` is used):
 
 ```console
-marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash #allocate one GPU
+marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
 
 marie@compute$ cd /scratch/ws/<name_of_your_workspace>/containers #please create a Workspace
 
 marie@compute$ 
@@ -104,22 +80,26 @@
 It is recommended to run the container with a single command.
-However, for the educational purpose, the separate commands will be presented below:
+However, for educational purposes, the separate commands are presented below:
 
 ```console
-marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash #allocate one GPU
+marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
 ```
 
 Run a shell within a container with the `singularity shell` command:
 
 ```console
-marie@compute$ singularity shell --nv -B /scratch/ws/0/anpo879a-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3
+marie@compute$ singularity shell --nv -B /scratch/imagenet:/data/imagenet pytorch:21.06-py3
 ```
 
-The flag `--nv` in the command above was used to enable Nvidia support
-for GPU usage and a flag `-B` for a user-bind path specification.
+The flag `--nv` in the command above enables Nvidia support for GPU usage,
+and the flag `-B` specifies a user-bind path.
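+
+Before starting the training, you can check that the GPU is actually visible
+inside the container (a quick sanity check; the `nvidia-smi` tool is made
+available inside the container by the `--nv` flag):
+
+```console
+marie@container$ nvidia-smi
+```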
 
 Run the training inside the container:
 
 ```console
-marie@compute$ python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 1 --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
+marie@container$ python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node=1 \
+    --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu \
+    --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 \
+    --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 \
+    --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
 ```
 
 !!! warning
@@ -132,10 +112,15 @@
 As an example,
 please find the full command to run the ResNet50 model on the ImageNet dataset
 inside the PyTorch container:
 
 ```console
-marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 singularity exec --nv -B /scratch/ws/0/anpo879a-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 1 --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
+marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 \
+    singularity exec --nv -B /scratch/ws/0/anpo879a-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
+    python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 1 \
+    --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --raport-file raport.json \
+    -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 \
+    --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
 ```
 
-#### Multi-GPU Usage
+### Multi-GPU Usage
 
 The majority of the NGC containers allow you to use multiple GPUs
 from one node
 to run the model inside the container.
@@ -151,21 +136,26 @@
 An example of using the PyTorch container
 for the training of the ResNet50 model
 on the classification task
 on the ImageNet dataset is presented below:
 
 ```console
-marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=8 --ntasks=8 --gres=gpu:8 --time=08:00:00 --pty --mem=500000 bash
+marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=8 --ntasks=8 --gres=gpu:8 --time=08:00:00 --pty --mem=700G bash
 ```
 
 ```console
-marie@compute$ singularity exec --nv -B /scratch/ws/0/marie-ImgNet/imagenet:/data/imagenet /beegfs/global0/ws/marie-beegfs_container_storage/container_storage/pytorch:21.06-py3 python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 8 --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
+marie@alpha$ singularity exec --nv -B /scratch/ws/0/marie-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
+    python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 8 \
+    --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu \
+    --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 \
+    --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 \
+    --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
 ```
 
 Please pay attention to the parameter `--nproc_per_node`.
-The value is equal to 8 because 8 GPUs per node were allocated by `srun`.
+The value is equal to 8 because 8 GPUs per node were allocated with `--gres=gpu:8`.
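+
+The same multi-GPU training can also be submitted as a batch job instead of an
+interactive session. A sketch of such a batch script, using the allocation from
+above, is shown below (the job name and output file name are assumptions):
+
+```bash
+#!/bin/bash
+# Sketch of a batch script for the multi-GPU training above.
+#SBATCH --partition=alpha
+#SBATCH --nodes=1
+#SBATCH --ntasks=8
+#SBATCH --ntasks-per-node=8
+#SBATCH --gres=gpu:8
+#SBATCH --time=08:00:00
+#SBATCH --mem=700G
+#SBATCH --job-name=resnet50-ngc
+#SBATCH --output=resnet50-ngc-%j.out
+
+# Run the training non-interactively inside the NGC PyTorch container.
+singularity exec --nv -B /scratch/ws/0/marie-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
+    python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 8 \
+    --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu \
+    --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 \
+    --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 \
+    --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
+```
+
+Such a script can then be submitted with `sbatch <script_name>.sh`.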
 
-#### Multi-node Usage
+### Multi-node Usage
 
-There are few NGC containers with Multi-node support
-[available](https://github.com/NVIDIA/DeepLearningExamples).
-Moreover, the realization of the multi-node usage depends on the authors of the exact container.
-Thus, right now, it is not possible to run NGC containers with multi-node support
-on the ZIH system without a change of the source code inside the container.
+Only a few NGC containers with multi-node support are
+[available](https://github.com/NVIDIA/DeepLearningExamples).
+Moreover, how multi-node usage is implemented depends on the authors of the particular container.
+Thus, it is currently not possible to run NGC containers with multi-node support
+on the ZIH system without changing the source code inside the container.
-- 
GitLab