Commit 4afbc1e0 authored 3 years ago by Taras Lazariv
Move content to install_jupyter.md and delete deep_learning.md
Parent: c2658acf
5 merge requests:

- !333 Draft: update NGC containers
- !322 Merge preview into main
- !319 Merge preview into main
- !279 Draft: Machine Learning restructuring
- !258 Data Analytics restructuring
Showing 1 changed file with 181 additions and 0 deletions:

doc.zih.tu-dresden.de/docs/archive/deep_learning.md → doc.zih.tu-dresden.de/docs/archive/install_jupyter.md (+181 −0)
# Deep learning
**Prerequisites**: To work with deep learning tools, you need [login](../access/ssh_login.md) access to the ZIH system and basic knowledge of Python and the Slurm workload manager.

The **aim** of this page is to introduce users to getting started with deep learning software in both the ml environment and the scs5 environment of the system.
## Deep Learning Software
### TensorFlow
[TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source software library for dataflow and differentiable programming across a range of tasks.

TensorFlow is available in both [ml environment and scs5 environment](modules.md#module-environments) under the module name "TensorFlow". For example:
```bash
module load TensorFlow
```
There are numerous ways to work with [TensorFlow](tensorflow.md) on the ZIH system. On this page, the scs5 partition is used by default for all examples. Generally, the easiest way is to use the [module system](modules.md) and a Python virtual environment (test case). However, in some cases you may need a directly installed TensorFlow stable or nightly release. For this purpose, use [EasyBuild](custom_easy_build_environment.md) or [containers](tensorflow_container_on_hpcda.md), and see [the example](https://www.tensorflow.org/install/pip). For examples of using TensorFlow on the ml partition with the module system, see the [TensorFlow page](../software/tensorflow.md).

Note: If you are going to use a manually installed TensorFlow release, we recommend using only stable versions.
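For the virtual-environment route, a minimal sketch follows. The `modenv/scs5` module mirrors the job file later on this page, while the `Python` module name and the environment path are assumptions for the example:

```bash
# Load the scs5 module environment and a Python interpreter
module purge
module load modenv/scs5
module load Python            # module name is an assumption for this sketch

# Create and activate a virtual environment (path is an example)
python -m venv ~/venvs/tf-test
source ~/venvs/tf-test/bin/activate

# Install a stable TensorFlow release from PyPI and verify the import
pip install --upgrade pip
pip install tensorflow
python -c "import tensorflow as tf; print(tf.__version__)"
```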
### Keras
[Keras](https://keras.io/) is a high-level neural network API, written in Python and capable of running on top of [TensorFlow](https://github.com/tensorflow/tensorflow). Keras is available in both [ml environment and scs5 environment](modules.md#module-environments) under the module name "Keras".
On this page, the scs5 partition is used by default for all examples. There are numerous ways to work with [TensorFlow](../software/tensorflow.md) and Keras on the ZIH system. Generally, the easiest way is to use the [module system](modules.md) and a Python virtual environment; see the TensorFlow section above. For examples of using Keras on the ml partition with the module system, see the [Keras page](../software/keras.md).
Keras can use TensorFlow as its backend. As mentioned in the Keras documentation, Keras is also capable of running on a Theano backend. However, since Theano has been abandoned by its developers, we don't recommend using Theano anymore. If you wish to use the Theano backend, you need to install it manually. To use the TensorFlow backend, please don't forget to load the corresponding TensorFlow module; it should be loaded automatically as a dependency.
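To verify which backend Keras actually uses, a quick check such as the following can help (a minimal sketch; `KERAS_BACKEND` is the standard Keras mechanism for selecting a backend):

```bash
# Print the active Keras backend (requires the Keras module to be loaded)
python -c "from keras import backend as K; print(K.backend())"

# The backend can also be selected explicitly via an environment variable
export KERAS_BACKEND=tensorflow
```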
Test case: Keras with TensorFlow on MNIST data

Go to a directory on the ZIH system, clone the Keras repository to obtain the examples, and change into the examples directory:
```bash
git clone https://github.com/fchollet/keras.git
cd keras/examples/
```
If you do not specify a Keras backend, then TensorFlow is used as the default.

Job file (schedule the job with `sbatch`, check the status with `squeue -u <username>`):
```bash
#!/bin/bash
#SBATCH --gres=gpu:1       # 1 - using one GPU, 2 - using two GPUs
#SBATCH --mem=8000
#SBATCH -p gpu2            # select the type of nodes (options: haswell, smp, sandy, west, gpu, ml); K80 GPUs on Haswell nodes
#SBATCH --time=00:30:00
#SBATCH -o HLR_<name_of_your_script>.out  # save output under HLR_<name_of_your_script>.out
#SBATCH -e HLR_<name_of_your_script>.err  # save error messages under HLR_<name_of_your_script>.err

module purge               # purge if you already have modules loaded
module load modenv/scs5    # load scs5 environment
module load Keras          # load Keras module
module load TensorFlow     # load TensorFlow module

# if you see 'broken pipe' errors (might happen in an interactive session
# after the second srun command), uncomment the line below
# module load h5py

python mnist_cnn.py
```
Keep in mind that you need to put the batch script in the same folder as the script to be executed, or specify the path.
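Submitting the job file and checking the job status might then look like this (the file name is only an example):

```bash
sbatch keras_mnist.sbatch   # submit the job file shown above
squeue -u $USER             # check the status of your jobs
```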
Example output:
```
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
  128/60000 [..............................] - ETA: 12:08 - loss: 2.3064 - acc: 0.0781
  256/60000 [..............................] - ETA: 7:04 - loss: 2.2613 - acc: 0.1523
  384/60000 [..............................] - ETA: 5:22 - loss: 2.2195 - acc: 0.2005
...
60000/60000 [==============================] - 128s 2ms/step - loss: 0.0296 - acc: 0.9905 - val_loss: 0.0268 - val_acc: 0.9911
Test loss: 0.02677746053306255
Test accuracy: 0.9911
```
## Data Sets
There are many different data sets designed for research purposes. If you would like to download some of them, keep in mind that many machine learning libraries have direct access to public data sets without downloading them (for example, [TensorFlow data sets](https://www.tensorflow.org/datasets)).
If you still need to download a data set, first of all be careful with its size (some of them are a few terabytes large). Don't download anything you don't really need! Use the login nodes only for downloading small files (up to a few hundred megabytes). For huge files, use the [DataMover](../data_transfer/data_mover.md): for example, the command `dtwget` (an analogue of the general `wget` command) submits a download job to the data transfer machines. If you need to download or allocate massive files (more than one terabyte), please contact the support beforehand.
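A minimal sketch of such a transfer (the URL is a placeholder, not a real data set location):

```bash
# Submit a download job to the data transfer machines instead of a login node
dtwget https://example.org/datasets/some_dataset.tar.gz
```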
### The ImageNet Data Set
The [ImageNet](http://www.image-net.org/) project is a large visual database designed for use in visual object recognition software research. In order to save filesystem space by avoiding multiple duplicates lying around, we have put a copy of the ImageNet database (ILSVRC2012 and ILSVR2017) under `/scratch/imagenet`, which you can use without having to download it again. In the future, the ImageNet data set will be available in `/warm_archive`. ILSVR2017 also includes a data set for recognizing objects from videos. Please respect the corresponding [Terms of Use](https://image-net.org/download.php).
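Using the shared copy might look like the sketch below; only the path `/scratch/imagenet` comes from this page, the script name and its flag are hypothetical:

```bash
ls /scratch/imagenet                            # inspect the available data
python train.py --data-dir /scratch/imagenet    # 'train.py' and '--data-dir' are hypothetical
```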
## Jupyter Notebook
# Jupyter Installation
Jupyter notebooks are a great way for interactive computing in your web browser. Jupyter allows
working with data cleaning and transformation, numerical simulation, statistical modelling, data
...
...
@@ -148,7 +17,7 @@ analytics tools are available.
The remote Jupyter server is able to offer more freedom with settings and approaches.
-### Preparation phase (optional)
+## Preparation phase (optional)
On the ZIH system, start an interactive session for setting up the environment:
...
...
@@ -189,7 +58,7 @@ directory (/home/userxx/anaconda3). Create a new anaconda environment with the n
conda create --name jnb
```
-### Set environmental variables
+## Set environmental variables
In the shell, activate the previously created Python environment (you can also deactivate it manually) and install Jupyter packages for this Python environment:
...
...
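The installation commands themselves are collapsed in this diff; a minimal sketch of this step, assuming the conda environment `jnb` created above, might look like:

```bash
# Activate the environment created above and install Jupyter into it
# (older conda versions use 'source activate jnb' instead)
conda activate jnb
conda install jupyter
```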
@@ -247,7 +116,7 @@ hashed password here>' c.NotebookApp.port = 9999 c.NotebookApp.allow_remote_acce
Note: `<path-to-cert>` is the path to the key and certificate files, for example: `/home/<username>/mycert.pem`
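If you do not yet have a certificate, a self-signed one can be generated as sketched below (file names are examples; this is standard `openssl` usage, not a ZIH-specific requirement):

```bash
# Generate a self-signed certificate and key valid for one year
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout ~/mykey.key -out ~/mycert.pem
```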
-### Slurm job file to run the Jupyter server on ZIH system with GPU (1x K80) (also works on K20)
+## Slurm job file to run the Jupyter server on ZIH system with GPU (1x K80) (also works on K20)
```bash
#!/bin/bash -l
#SBATCH --gres=gpu:1          # request GPU
#SBATCH --partition=gpu2      # use GPU partition
...
```
@@ -310,20 +179,3 @@ To login into the Jupyter notebook site, you have to enter the **token**.
If you would like to use [JupyterHub](../access/jupyterhub.md) after using a remote, manually configured Jupyter server (example above), you need to rename the configuration file (`/home//.jupyter/jupyter_notebook_config.py`) to something else.
### F.A.Q
**Q:** I get an error connecting to the Jupyter server (e.g. "open failed: administratively prohibited: open failed").

**A:** Check the settings in your Jupyter configuration file: are all necessary lines uncommented, is the path to the certificate and key files right, and is the hashed password from the `.json` file correct? Check whether the used local port is [available](https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers), and check local settings (e.g. `/etc/ssh/sshd_config`, `/etc/hosts`).
**Q:** I get an error during the start of the interactive session (e.g. "PMI2_Init failed to initialize. Return code: 1").

**A:** You probably need to pass `--mpi=none` to avoid Open MPI errors:

`srun --mpi=none --reservation <...> -A <...> -t 90 --mem=4000 --gres=gpu:1 --partition=gpu2-interactive --pty bash -l`