From 1a80aea3c879229dd9627110b615a59f2ddebb54 Mon Sep 17 00:00:00 2001
From: Andrei Politov <andrei.politov@tu-dresden.de>
Date: Mon, 27 Sep 2021 10:18:34 +0200
Subject: [PATCH] Upload New File

---
 .../docs/software/ngc_containers.md | 91 +++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 doc.zih.tu-dresden.de/docs/software/ngc_containers.md

diff --git a/doc.zih.tu-dresden.de/docs/software/ngc_containers.md b/doc.zih.tu-dresden.de/docs/software/ngc_containers.md
new file mode 100644
index 000000000..e0fd7d7d2
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/software/ngc_containers.md
@@ -0,0 +1,91 @@
# GPU-accelerated containers for deep learning (NGC containers)

## Containers

Containers are portable, executable units of software in which
application code is packaged together with its
libraries and dependencies.
[Containerization](https://www.ibm.com/cloud/learn/containerization) means encapsulating or packaging up
software code and all its dependencies so that it runs uniformly and consistently
on any infrastructure; in other words, it is agnostic to the host-specific environment (OS, etc.).

Containers are a widely adopted method of taming the complexity of deploying HPC and AI software.
The entire software environment, from the deep learning framework itself
down to the math and communication libraries that are necessary for performance, is packaged into
a single bundle. Since workloads inside a container
always use the same environment, the performance is reproducible and portable.

On Taurus, [Singularity](https://sylabs.io/) is used as the standard container solution.

## NGC containers

[NGC](https://developer.nvidia.com/ai-hpc-containers), a registry of highly GPU-optimized software,
supports scientists and researchers by providing regularly updated
and validated containers of HPC and AI applications.

NGC containers support Singularity.

NGC containers are optimized for high-performance computing (HPC) applications.
NGC containers are **GPU-optimized** containers
for **deep learning**, **machine learning**, and visualization:

- Built-in libraries and dependencies

- Faster training with Automatic Mixed Precision (AMP)

- Possibility to scale up from single-node to multi-node systems

- Freedom to develop in the cloud, on premises, or at the edge

- Highly versatile, with support for various container runtimes such as Docker, Singularity, cri-o, etc.

- Performance optimized

## Run NGC containers on the ZIH system

### Preparation

The first step is to choose the software (container) you want to run.
The [NVIDIA NGC catalog](https://ngc.nvidia.com/catalog)
contains a host of GPU-optimized containers for deep learning,
machine learning, visualization, and high-performance computing (HPC) applications that are tested
for performance, security, and scalability.
Registration is necessary to get full access to the catalog.

To find a container that fits the requirements of your task, please check
this [resource](https://github.com/NVIDIA/DeepLearningExamples)
listing the main containers together with their features and peculiarities.

### Building and Running the Container

To use NGC containers, it is necessary to understand the main Singularity commands.
If you are not familiar with the Singularity syntax, please find the information [here](https://sylabs.io/guides/3.0/user-guide/quick_start.html#interact-with-images).

Create a container from an image from the NGC catalog. In the example below, the alpha partition is used.

```console
marie@login$ srun -p alpha --nodes 1 --ntasks-per-node 1 --ntasks 1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash  # allocate one GPU on the alpha partition

marie@compute$ cd /scratch/ws/<name_of_your_workspace>/containers  # please create a workspace first

marie@compute$ singularity pull pytorch:21.08-py3.sif docker://nvcr.io/nvidia/pytorch:21.08-py3
```

Now you have a fully functional PyTorch container.
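To verify that the pulled container works and can access the allocated GPU, you can, for instance, run a short check inside it. This is only a sketch, assuming the image name from the `singularity pull` step above; the `--nv` flag makes the host GPU driver available inside the container:

```console
marie@compute$ singularity exec --nv pytorch:21.08-py3.sif python -c "import torch; print(torch.cuda.is_available())"
```

If this prints `True`, the PyTorch installation inside the container can use the allocated GPU.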
In the majority of cases, the container does not contain the datasets for training models.
To download a dataset, please follow the instructions for the specific container [here](https://github.com/NVIDIA/DeepLearningExamples).
You can also find the instructions in a README file inside the container:

```console
marie@compute$ singularity exec pytorch:21.06-py3_beegfs vim /workspace/examples/resnet50v1.5/README.md
```

As an example, please find the full command to run the ResNet50 model on the ImageNet dataset
inside the PyTorch container:

```console
marie@compute$ singularity exec --nv -B /scratch/ws/0/anpo879a-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 1 --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
```

### Multi-GPU case
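The `multiproc.py` launcher used above also supports single-node multi-GPU training. The following is only a sketch of a possible invocation, derived from the single-GPU command above under the assumption that eight GPUs are requested from the alpha partition and `--nproc_per_node` is raised to match; the resource values (`--gres`, `--mem`, batch size) are illustrative and should be adapted to your job:

```console
marie@login$ srun -p alpha --nodes 1 --ntasks-per-node 1 --ntasks 1 --gres=gpu:8 --time=08:00:00 --pty --mem=50000 bash  # illustrative: request 8 GPUs on one node

marie@compute$ singularity exec --nv -B /scratch/ws/0/anpo879a-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 8 --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --arch resnet50 -b 256 --epochs 90 /data/imagenet
```

Here `--nproc_per_node 8` starts one training process per allocated GPU; please check the README of the specific container for the recommended multi-GPU settings.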