# GPU-accelerated Containers for Deep Learning (NGC Containers)
A [container](containers.md) is an executable and portable unit of software.
[Containerization](https://www.ibm.com/cloud/learn/containerization) means
encapsulating or packaging up software code and all its dependencies
to run uniformly and consistently on any infrastructure. In other words,
it is agnostic to the host-specific environment, such as the operating system.
The entire software environment, from the deep learning framework itself,
down to the math and communication libraries that are necessary for performance,
is packaged into a single bundle.
On ZIH systems, [Singularity](https://sylabs.io/) is used as a standard container solution.
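To see this host-independence in practice, the same command can be run on the host and inside a container image; the reported operating systems may differ. A minimal sketch, assuming an already pulled image file (here `pytorch:21.06-py3`, as used in the examples below):

```console
marie@login$ cat /etc/os-release                                        # reports the host's operating system
marie@login$ singularity exec pytorch:21.06-py3 cat /etc/os-release     # reports the container's operating system
```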
## NGC Containers in General
[NGC](https://developer.nvidia.com/ai-hpc-containers),
a registry of highly GPU-optimized software,
has been enabling scientists and researchers by providing regularly updated
and validated containers of HPC and AI applications.
Singularity supports NGC containers, which are optimized for HPC applications.
NGC containers are **GPU-optimized** containers
for deep learning, machine learning, and visualization:
- Built-in libraries and dependencies;
- Faster training with Automatic Mixed Precision (AMP);
- Opportunity to scale up from single-node to multi-node systems;
- Performance optimized.
!!! note "Advantages of NGC containers"

    - NGC containers are highly optimized for cluster usage.
      The performance provided by NGC containers is comparable to the performance
      provided by the modules on the ZIH system (which is potentially the most performant way).
      NGC containers are a quick and efficient way to apply the best models
      on your dataset on a ZIH system.
    - NGC containers allow using an exact version of the software
      without manually installing it with all prerequisites.
      Manual installation can result in poor performance (e.g., using conda to install software).
## Run NGC Containers on the ZIH System
### Preparation
The first step is to choose the necessary software (container) to run.
The [NVIDIA NGC catalog](https://ngc.nvidia.com/catalog)
contains a host of GPU-optimized containers for deep learning,
machine learning, visualization, and high-performance computing (HPC) applications.
To find a container that fits the requirements of your task, please check
the [official examples page](https://github.com/NVIDIA/DeepLearningExamples)
with a list of the main containers and their features and peculiarities.
### Run NGC Container on a Single GPU
!!! note

    Almost all NGC containers can work with a single GPU.
Create a container from an image in the NGC catalog
(for this example, the `alpha` partition is used):
```console
marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
marie@compute$ cd /scratch/ws/<name_of_your_workspace>/containers   # please create a workspace first
```
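The image then needs to be downloaded into the workspace. A minimal sketch of pulling the PyTorch image with Singularity, assuming the image tag used in the examples on this page (the local file name is illustrative):

```console
marie@compute$ singularity pull pytorch:21.06-py3 docker://nvcr.io/nvidia/pytorch:21.06-py3
```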
It is recommended to run the container with a single command.
However, for educational purposes, the separate commands are presented below:
```console
marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
```
Run a shell within a container with the `singularity shell` command:
```console
marie@compute$ singularity shell --nv -B /scratch/imagenet:/data/imagenet pytorch:21.06-py3
```
The flag `--nv` in the command above enables Nvidia support for GPU usage,
and the flag `-B` specifies a user-bind path.
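Before starting a long training run, it can be worth verifying that the GPU is actually visible inside the container. A quick check, assuming the PyTorch container from above:

```console
Singularity> nvidia-smi                                                    # should list the allocated GPU
Singularity> python -c "import torch; print(torch.cuda.is_available())"   # should print: True
```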
Run the training inside the container:
```console
marie@container$ python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node=1 \
--node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu \
--raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 \
--arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 \
--wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
```
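The feature list above advertises faster training with Automatic Mixed Precision (AMP). For the ResNet50 example script, AMP is expected to be enabled via an `--amp` flag; this is an assumption based on the conventions of the NVIDIA DeepLearningExamples repository, so please verify the exact flag name first:

```console
marie@container$ python /workspace/examples/resnet50v1.5/main.py --help   # check the available flags, e.g. --amp
marie@container$ python /workspace/examples/resnet50v1.5/main.py --amp --data-backend dali-cpu \
-b 256 --epochs 90 /data/imagenet
```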
!!! warning
As an example, please find the full command to run the ResNet50 model
on the ImageNet dataset inside the PyTorch container:
```console
marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 \
singularity exec --nv -B /scratch/ws/0/marie-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 1 \
--node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --raport-file raport.json \
-j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 \
--lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
```
### Multi-GPU Usage
The majority of the NGC containers allow you to use multiple GPUs from one node
to run the model inside the container.
An example of using the PyTorch container for the training of the ResNet50 model
on the classification task on the ImageNet dataset is presented below:
```console
marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=8 --ntasks=8 --gres=gpu:8 --time=08:00:00 --pty --mem=700G bash
```
```console
marie@alpha$ singularity exec --nv -B /scratch/ws/0/marie-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 8 \
--node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu \
--raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 \
--arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 \
--wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
```
Please pay attention to the parameter `--nproc_per_node`.
The value is equal to 8 because 8 GPUs per node were allocated with `--gres=gpu:8`.
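To keep the command in sync with the allocation, the GPU count can also be queried at runtime instead of being hard-coded; the result can then be passed to `--nproc_per_node`. A minimal sketch, assuming the allocation above:

```console
marie@alpha$ NGPUS=$(nvidia-smi -L | wc -l)   # counts the GPUs visible on the node
marie@alpha$ echo ${NGPUS}                    # 8 for the allocation above
```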
### Multi-node Usage
There are only a few NGC containers with multi-node support
[available](https://github.com/NVIDIA/DeepLearningExamples).
Moreover, the implementation of multi-node usage depends on the authors
of the particular container.
Thus, it is currently not possible to run NGC containers with multi-node support
on the ZIH system without changing the source code inside the container.