ZIH / hpcsupport / hpc-compendium · Commits

Commit bde9d645, authored 1 year ago by Natalie Breidenbach

Update ngc_containers.md

Parent: da2105d1
Merge requests: !938 "Automated merge from preview to main", !936 "Update to Five-Cluster-Operation"

1 changed file: doc.zih.tu-dresden.de/docs/software/ngc_containers.md (+11, −11)

@@ -50,14 +50,14 @@ If you are not familiar with Singularity's syntax, please find the information on
 However, some main commands will be explained.
 
 Create a container from the image from the NGC catalog.
-(For this example, the alpha is used):
+(For this example, the cluster alpha is used):
 
 ```console
-marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
+marie@login.alpha$ srun --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
 
-marie@compute$ cd /scratch/ws/<name_of_your_workspace>/containers   #please create a Workspace
+marie@alpha$ cd /data/horse/ws/<name_of_your_workspace>/containers   #please create a Workspace
 
-marie@compute$ singularity pull pytorch:21.08-py3.sif docker://nvcr.io/nvidia/pytorch:21.08-py3
+marie@alpha$ singularity pull pytorch:21.08-py3.sif docker://nvcr.io/nvidia/pytorch:21.08-py3
 ```
 
 Now, you have a fully functional PyTorch container.
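
This hunk swaps the `--partition=alpha` flag for the cluster-specific login node `login.alpha` and moves the workspace path from `/scratch` to `/data/horse`. For non-interactive use, the same allocation could also be written as a batch script; a minimal sketch built only from the values in the diff (the script itself is illustrative and not part of the commit):

```bash
#!/bin/bash
# Illustrative batch equivalent of the interactive srun above (not part of the commit).
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=08:00:00
#SBATCH --mem=50000

# Workspace placeholder kept from the diff; create the workspace first.
cd /data/horse/ws/<name_of_your_workspace>/containers
singularity pull pytorch:21.08-py3.sif docker://nvcr.io/nvidia/pytorch:21.08-py3
```
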
@@ -73,20 +73,20 @@ To download the dataset, please follow the
 Also, you can find the instructions in a README file which you can find inside the container:
 
 ```console
-marie@compute$ singularity exec pytorch:21.06-py3_beegfs vim /workspace/examples/resnet50v1.5/README.md
+marie@alpha$ singularity exec pytorch:21.06-py3_beegfs vim /workspace/examples/resnet50v1.5/README.md
 ```
 
 It is recommended to run the container with a single command.
 However, for the educational purpose, the separate commands will be presented below:
 
 ```console
-marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
+marie@login.alpha$ srun --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
 ```
 
 Run a shell within a container with the `singularity shell` command:
 
 ```console
-marie@compute$ singularity shell --nv -B /scratch/imagenet:/data/imagenet pytorch:21.06-py3
+marie@alpha$ singularity shell --nv -B /data/horse/imagenet:/data/imagenet pytorch:21.06-py3
 ```
 
 The flag `--nv` in the command above was used to enable Nvidia support for GPU usage
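
Inside the shell started by `singularity shell --nv`, GPU access can be verified before any training run; an illustrative check using the container's bundled PyTorch (not part of the commit):

```console
Singularity> python -c "import torch; print(torch.cuda.is_available())"
```
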
@@ -112,8 +112,8 @@ As an example, please find the full command to run the ResNet50 model
 on the ImageNet dataset inside the PyTorch container:
 
 ```console
-marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 \
-singularity exec --nv -B /scratch/ws/0/anpo879a-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
+marie@login.alpha$ srun --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 \
+singularity exec --nv -B /data/horse/ws/0/anpo879a-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
 python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 1 \
 --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --raport-file raport.json \
 -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 \
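
The bind mount `-B <host_path>:/data/imagenet` is what makes the dataset visible at `/data/imagenet` inside the container. A quick way to confirm the mount before starting the long run, assuming the same paths as in the diff (illustrative, not part of the commit):

```console
marie@alpha$ singularity exec -B /data/horse/ws/0/anpo879a-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 ls /data/imagenet
```
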
@@ -136,11 +136,11 @@ An example of using the PyTorch container for the training of the ResNet50 model
 on the classification task on the ImageNet dataset is presented below:
 
 ```console
-marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=8 --ntasks=8 --gres=gpu:8 --time=08:00:00 --pty --mem=700G bash
+marie@login.alpha$ srun --nodes=1 --ntasks-per-node=8 --ntasks=8 --gres=gpu:8 --time=08:00:00 --pty --mem=700G bash
 ```
 
 ```console
-marie@alpha$ singularity exec --nv -B /scratch/ws/0/marie-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
+marie@alpha$ singularity exec --nv -B /data/horse/ws/0/marie-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
 python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 8 \
 --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu \
 --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 \
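
Since this variant requests `--gres=gpu:8` and launches eight processes via `--nproc_per_node 8`, it can be useful to first confirm that all eight devices are visible inside the container; an illustrative check (not part of the commit):

```console
marie@alpha$ singularity exec --nv pytorch:21.06-py3 python -c "import torch; print(torch.cuda.device_count())"
```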