diff --git a/doc.zih.tu-dresden.de/docs/software/DeepLearning.md b/doc.zih.tu-dresden.de/docs/software/DeepLearning.md
index 3d2874d0727d8800070df3420897cb02dac88be0..e9f5854c43c32d6e6cfcf303edd999cc9b2dd17f 100644
--- a/doc.zih.tu-dresden.de/docs/software/DeepLearning.md
+++ b/doc.zih.tu-dresden.de/docs/software/DeepLearning.md
@@ -1,372 +1,335 @@
# Deep learning
+**Prerequisites**: To work with the Deep Learning tools you need [Login](../access/Login.md) access
+for the Taurus system and basic knowledge about Python and the SLURM workload manager.
+**Aim** of this page is to introduce users to getting started with Deep Learning software on both
+the ml environment and the scs5 environment of the Taurus system.
-**Prerequisites**: To work with Deep Learning tools you obviously need
-\<a href="Login" target="\_blank">access\</a> for the Taurus system and
-basic knowledge about Python, SLURM manager.
+## Deep Learning Software
-**Aim** \<span style="font-size: 1em;">of this page is to introduce
-users on how to start working with Deep learning software on both the
-\</span>\<span style="font-size: 1em;">ml environment and the scs5
-environment of the Taurus system.\</span>
+### TensorFlow
-## Deep Learning Software
+[TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source software library
+for dataflow and differentiable programming across a range of tasks.
+
+TensorFlow is available in both main partitions
+[ml environment and scs5 environment](modules.md#module-environments)
+under the module name "TensorFlow". However, for machine learning and deep learning purposes, we
+recommend using the ml partition [HPC-DA](../jobs/HPCDA.md). For example:
+
+```Bash
+module load TensorFlow
+```
+
+There are numerous ways to work with [TensorFlow](TensorFlow.md) on Taurus. On this page, the
+default scs5 partition is used for all examples. Generally, the easiest way is to use the
+[module system](modules.md) and a Python virtual environment (test case). However, in some cases
+you may need a directly installed TensorFlow stable or nightly release. For this purpose, use
+[EasyBuild](CustomEasyBuildEnvironment.md) or [Containers](TensorFlowContainerOnHPCDA.md) and see
+[the pip example](https://www.tensorflow.org/install/pip). For examples of using TensorFlow on the
+ml partition with the module system, see the [TensorFlow page for HPC-DA](TensorFlow.md).
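+
+The following is only a minimal sketch of how such a module can be checked on the ml partition; the
+srun options shown here are assumptions based on this page, so adjust partition, project and time
+limit to your own allocation:
+
+```Bash
+# short interactive test allocation on the ml partition (adjust to your project)
+srun -p ml --gres=gpu:1 --ntasks=1 --time=00:10:00 --pty bash -l
+
+module load modenv/ml       # switch to the module environment of the ml partition
+module load TensorFlow      # load the default TensorFlow module
+python -c "import tensorflow as tf; print(tf.__version__)"   # verify that TensorFlow imports
+```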
-Please refer to our [List of Modules](SoftwareModulesList) page for a
-daily-updated list of the respective software versions that are
-currently installed.
-
-## TensorFlow
-
-\<a href="<https://www.tensorflow.org/guide/>"
-target="\_blank">TensorFlow\</a> is a free end-to-end open-source
-software library for dataflow and differentiable programming across a
-range of tasks.
-
-TensorFlow is available in both main partitions [ml environment and scs5
-environment](RuntimeEnvironment#Module_Environments) under the module
-name "TensorFlow". However, for purposes of machine learning and deep
-learning, we recommend using Ml partition (\<a href="HPCDA"
-target="\_blank">HPC-DA\</a>). For example:
-
-    module load TensorFlow
-
-There are numerous different possibilities on how to work with \<a
-href="TensorFlow" target="\_blank">Tensorflow\</a> on Taurus. On this
-page, for all examples default, scs5 partition is used. Generally, the
-easiest way is using the \<a
-href="RuntimeEnvironment#Module_Environments" target="\_blank">Modules
-system\</a> and Python virtual environment (test case). However, in some
-cases, you may need directly installed Tensorflow stable or night
-releases. For this purpose use the
-[EasyBuild](CustomEasyBuildEnvironment),
-[Containers](TensorFlowContainerOnHPCDA) and see [the
-example](https://www.tensorflow.org/install/pip). For examples of using
-TensorFlow for ml partition with module system see \<a href="TensorFlow"
-target="\_self">TensorFlow page for HPC-DA.\</a>
-
-Note: If you are going used manually installed Tensorflow release we
-recommend use only stable versions.
+Note: If you are going to use a manually installed TensorFlow release, we recommend using only
+stable versions.

## Keras

-\<a href="<https://keras.io/>" target="\_blank">Keras\</a>\<span
-style="font-size: 1em;"> is a high-level neural network API, written in
-Python and capable of running on top of \</span>\<a
-href="<https://github.com/tensorflow/tensorflow>"
-target="\_top">TensorFlow\</a>\<span style="font-size: 1em;">. Keras is
-available in both environments \</span> [ml environment and scs5
-environment](RuntimeEnvironment#Module_Environments)\<span
-style="font-size: 1em;"> under the module name "Keras".\</span>
+[Keras](https://keras.io/) is a high-level neural network API, written in Python and capable of
+running on top of [TensorFlow](https://github.com/tensorflow/tensorflow). Keras is available in both
+environments [ml environment and scs5 environment](modules.md#module-environments) under the module
+name "Keras".

-On this page for all examples default scs5 partition used. There are
-numerous different possibilities on how to work with \<a
-href="TensorFlow" target="\_blank">Tensorflow\</a> and Keras on Taurus.
-Generally, the easiest way is using the \<a
-href="RuntimeEnvironment#Module_Environments" target="\_blank">Modules
-system\</a> and Python virtual environment (test case) to see Tensorflow
-part above. \<span style="font-size: 1em;">For examples of using Keras
-for ml partition with the module system see the \</span> [Keras page for
-HPC-DA](Keras).
+On this page, the default scs5 partition is used for all examples. There are numerous ways to work
+with [TensorFlow](TensorFlow.md) and Keras on Taurus. Generally, the easiest way is to use the
+[module system](modules.md) and a Python virtual environment (see the TensorFlow part above).
+For examples of using Keras on the ml partition with the module system, see the
+[Keras page for HPC-DA](Keras.md).

-It can either use TensorFlow as its backend. As mentioned in Keras
-documentation Keras capable of running on Theano backend. However, due
-to the fact that Theano has been abandoned by the developers, we don't
-recommend use Theano anymore. If you wish to use Theano backend you need
-to install it manually. To use the TensorFlow backend, please don't
-forget to load the corresponding TensorFlow module. TensorFlow should be
-loaded automatically as a dependency.
+Keras uses TensorFlow as its backend. As mentioned in the Keras documentation, Keras can also run
+on a Theano backend. However, since Theano has been abandoned by its developers, we do not
+recommend using Theano anymore. If you wish to use the Theano backend, you need to install it
+manually. To use the TensorFlow backend, please don't forget to load the corresponding TensorFlow
+module. TensorFlow should be loaded automatically as a dependency.
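+
+As a quick sanity check (a sketch only; module versions may differ), you can confirm that Keras
+picks up the TensorFlow backend:
+
+```Bash
+module load modenv/scs5    # scs5 environment, as used in the examples on this page
+module load Keras          # load the Keras module
+module load TensorFlow     # load the TensorFlow backend
+python -c "import keras; print(keras.backend.backend())"   # should print: tensorflow
+```
+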
-\<span style="color: #222222; font-size: 1.385em;">Test case: Keras with
-TensorFlow on MNIST data\</span>
+### Test case: Keras with TensorFlow on MNIST data

-Go to a directory on Taurus, get Keras for the examples and go to the
-examples:
+Go to a directory on Taurus, get Keras for the examples and go to the examples:

-    git clone <a href='https://github.com/fchollet/keras.git'>https://github.com/fchollet/keras.git</a><br />cd keras/examples/
+```Bash
+git clone https://github.com/fchollet/keras.git
+cd keras/examples/
+```

-If you do not specify Keras backend, then TensorFlow is used as a
-default
+If you do not specify the Keras backend, then TensorFlow is used as the default.

-Job-file (schedule job with sbatch, check the status with 'squeue -u
-\<Username>'):
+Job file (schedule the job with `sbatch`, check the status with `squeue -u <username>`):

-    #!/bin/bash<br />#SBATCH --gres=gpu:1 # 1 - using one gpu, 2 - for using 2 gpus<br />#SBATCH --mem=8000<br />#SBATCH -p gpu2 # select the type of nodes (opitions: haswell, <code>smp</code>, <code>sandy</code>, <code>west</code>, <code>gpu, ml) </code><b>K80</b> GPUs on Haswell node<br />#SBATCH --time=00:30:00<br />#SBATCH -o HLR_<name_of_your_script>.out # save output under HLR_${SLURMJOBID}.out<br />#SBATCH -e HLR_<name_of_your_script>.err # save error messages under HLR_${SLURMJOBID}.err<br />
-    module purge # purge if you already have modules loaded<br />module load modenv/scs5 # load scs5 environment<br />module load Keras # load Keras module<br />module load TensorFlow # load TensorFlow module<br />
+```Bash
+#!/bin/bash
+#SBATCH --gres=gpu:1                      # 1 - using one gpu, 2 - for using 2 gpus
+#SBATCH --mem=8000
+#SBATCH -p gpu2                           # select the node type (options: haswell, smp, sandy, west, gpu, ml); K80 GPUs on Haswell nodes
+#SBATCH --time=00:30:00
+#SBATCH -o HLR_<name_of_your_script>.out  # save output under HLR_${SLURMJOBID}.out
+#SBATCH -e HLR_<name_of_your_script>.err  # save error messages under HLR_${SLURMJOBID}.err

-    # if you see 'broken pipe error's (might happen in interactive session after the second srun command) uncomment line below<br /># module load h5py<br /><br />python mnist_cnn.py
+module purge               # purge if you already have modules loaded
+module load modenv/scs5    # load scs5 environment
+module load Keras          # load Keras module
+module load TensorFlow     # load TensorFlow module

-Keep in mind that you need to put the bash script to the same folder as
-an executable file or specify the path.
+# if you see 'broken pipe' errors (might happen in an interactive session after the second srun
+# command), uncomment the line below
+# module load h5py
+
+python mnist_cnn.py
+```
+
+Keep in mind that you need to put the batch script in the same folder as the executable file or
+specify the path.

Example output:

-    x_train shape: (60000, 28, 28, 1)
-    60000 train samples
-    10000 test samples
-    Train on 60000 samples, validate on 10000 samples
-    Epoch 1/12
+```Bash
+x_train shape: (60000, 28, 28, 1)
+60000 train samples
+10000 test samples
+Train on 60000 samples, validate on 10000 samples
+Epoch 1/12

-    128/60000 [..............................] - ETA: 12:08 - loss: 2.3064 - acc: 0.0781
-    256/60000 [..............................] - ETA: 7:04 - loss: 2.2613 - acc: 0.1523
-    384/60000 [..............................] - ETA: 5:22 - loss: 2.2195 - acc: 0.2005
+128/60000 [..............................] - ETA: 12:08 - loss: 2.3064 - acc: 0.0781
+256/60000 [..............................] - ETA: 7:04 - loss: 2.2613 - acc: 0.1523
+384/60000 [..............................] - ETA: 5:22 - loss: 2.2195 - acc: 0.2005

-    ...
+...

-    60000/60000 [==============================] - 128s 2ms/step - loss: 0.0296 - acc: 0.9905 - val_loss: 0.0268 - val_acc: 0.9911
-    Test loss: 0.02677746053306255
-    Test accuracy: 0.9911
+60000/60000 [==============================] - 128s 2ms/step - loss: 0.0296 - acc: 0.9905 - val_loss: 0.0268 - val_acc: 0.9911
+Test loss: 0.02677746053306255
+Test accuracy: 0.9911
+```

## Datasets

-There are many different datasets designed for research purposes. If you
-would like to download some of them, first of all, keep in mind that
-many machine learning libraries have direct access to public datasets
-without downloading it (for example [TensorFlow
-Datasets](https://www.tensorflow.org/datasets)\<span style="font-size:
-1em; color: #444444;">). \</span>
-
-\<span style="font-size: 1em; color: #444444;">If you still need to
-download some datasets, first of all, be careful with the size of the
-datasets which you would like to download (some of them have a size of
-few Terabytes). Don't download what you really not need to use! Use
-login nodes only for downloading small files (hundreds of the
-megabytes). For downloading huge files use \</span>\<a href="DataMover"
-target="\_blank">Datamover\</a>\<span style="font-size: 1em; color:
-#444444;">. For example, you can use command **\<span
-class="WYSIWYG_TT">dtwget \</span>**(it is an analogue of the general
-wget command). This command submits a job to the data transfer machines.
-If you need to download or allocate massive files (more than one
-terabyte) please contact the support before.\</span>
+There are many different datasets designed for research purposes. If you would like to download some
+of them, first of all, keep in mind that many machine learning libraries have direct access to
+public datasets without downloading them (for example
+[TensorFlow Datasets](https://www.tensorflow.org/datasets)).
+
+If you still need to download some datasets, first of all, be careful with the size of the datasets
+you would like to download (some of them have a size of a few terabytes). Don't download what you
+do not really need to use! Use login nodes only for downloading small files (hundreds of
+megabytes). For downloading huge files use the [DataMover](../data_moving/DataMover.md).
+For example, you can use the command `dtwget` (it is an analogue of the general `wget`
+command). This command submits a job to the data transfer machines. If you need to download or
+allocate massive files (more than one terabyte), please contact the support before.
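+
+A hedged sketch of such a transfer (the URL and target directory are placeholders; check the exact
+`dtwget` behaviour on the [DataMover](../data_moving/DataMover.md) page):
+
+```Bash
+# run the download as a job on the data transfer machines instead of on a login node
+cd /scratch/<your_project_directory>
+dtwget https://example.org/path/to/dataset.tar.gz
+```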

### The ImageNet dataset

-\<span style="font-size: 1em;">The \</span> [ **ImageNet**
-](http://www.image-net.org/)\<span style="font-size: 1em;">project is a
-large visual database designed for use in visual object recognition
-software research. In order to save space in the file system by avoiding
-to have multiple duplicates of this lying around, we have put a copy of
-the ImageNet database (ILSVRC2012 and ILSVR2017) under\</span>**
-/scratch/imagenet**\<span style="font-size: 1em;"> which you can use
-without having to download it again. For the future, the Imagenet
-dataset will be available in **/warm_archive.**ILSVR2017 also includes a
-dataset for recognition objects from a video. Please respect the
-corresponding \</span> [Terms of
-Use](http://image-net.org/download-faq)\<span style="font-size:
-1em;">.\</span>
-
-## Jupyter notebook
-
-Jupyter notebooks are a great way for interactive computing in your web
-browser. Jupyter allows working with data cleaning and transformation,
-numerical simulation, statistical modelling, data visualization and of
-course with machine learning.
-
-There are two general options on how to work Jupyter notebooks using
-HPC: remote jupyter server and jupyterhub.
-
-These sections show how to run and set up a remote jupyter server within
-a sbatch GPU job and which modules and packages you need for that.
-
-\<span style="font-size: 1em; color: #444444;">%RED%Note:<span
-class="twiki-macro ENDCOLOR"></span> On Taurus, there is a \</span>\<a
-href="JupyterHub" target="\_self">jupyterhub\</a>\<span
-style="font-size: 1em; color: #444444;">, where you do not need the
-manual server setup described below and can simply run your Jupyter
-notebook on HPC nodes. Keep in mind that with Jupyterhub you can't work
-with some special instruments. However general data analytics tools are
-available.\</span>
-
-The remote Jupyter server is able to offer more freedom with settings
-and approaches.
+The [ImageNet](http://www.image-net.org/) project is a large visual database designed for use in
+visual object recognition software research. To save space in the file system and to avoid having
+multiple duplicates of it lying around, we have put a copy of the ImageNet database (ILSVRC2012 and
+ILSVR2017) under `/scratch/imagenet`, which you can use without having to download it again. In the
+future, the ImageNet dataset will be available in `/warm_archive`. ILSVR2017 also includes a dataset
+for object recognition from videos. Please respect the corresponding
+[Terms of Use](https://image-net.org/download.php).
+
+## Jupyter Notebook
+
+Jupyter notebooks are a great way for interactive computing in your web browser. Jupyter allows
+working with data cleaning and transformation, numerical simulation, statistical modelling, data
+visualization and of course with machine learning.
+
+There are two general options for working with Jupyter notebooks on HPC: a remote Jupyter server
+and JupyterHub.
+
+These sections show how to run and set up a remote Jupyter server within an sbatch GPU job and
+which modules and packages you need for that.
+
+**Note:** On Taurus, there is a [JupyterHub](JupyterHub.md), where you do not need the manual server
+setup described below and can simply run your Jupyter notebook on HPC nodes. Keep in mind that with
+JupyterHub you can't use some special tools; however, general data analytics tools are available.
+
+The remote Jupyter server is able to offer more freedom with settings and approaches.

-Note: Jupyterhub is could be under construction
+Note: JupyterHub could be under construction.

### Preparation phase (optional)

-\<span style="font-size: 1em;">On Taurus, start an interactive session
-for setting up the environment:\</span>
+On Taurus, start an interactive session for setting up the environment:

-    srun --pty -n 1 --cpus-per-task=2 --time=2:00:00 --mem-per-cpu=2500 --x11=first bash -l -i
+```Bash
+srun --pty -n 1 --cpus-per-task=2 --time=2:00:00 --mem-per-cpu=2500 --x11=first bash -l -i
+```

Create a new subdirectory in your home, e.g. Jupyter

-    mkdir Jupyter
-    cd Jupyter
+```Bash
+mkdir Jupyter
+cd Jupyter
+```

-There are two ways how to run Anaconda. The easiest way is to load the
-Anaconda module. The second one is to download Anaconda in your home
-directory.
+There are two ways to run Anaconda. The easiest way is to load the Anaconda module. The second one
+is to download Anaconda in your home directory.

-1\. Load Anaconda module (recommended):
+1. Load the Anaconda module (recommended):

-    module load modenv/scs5
-    module load Anaconda3
+```Bash
+module load modenv/scs5
+module load Anaconda3
+```

-2\. Download latest Anaconda release (see example below) and change the
-rights to make it an executable script and run the installation script:
+2. Download the latest Anaconda release (see the example below), change the rights to make it an
+executable script and run the installation script:

-    wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh
-    chmod 744 Anaconda3-2019.03-Linux-x86_64.sh
-    ./Anaconda3-2019.03-Linux-x86_64.sh
-
-    (during installation you have to confirm the licence agreement)
+```Bash
+wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh
+chmod 744 Anaconda3-2019.03-Linux-x86_64.sh
+./Anaconda3-2019.03-Linux-x86_64.sh
+
+# during the installation you have to confirm the licence agreement
+```

-\<span style="font-size: 1em;">Next step will install the anaconda
-environment into the home directory (/home/userxx/anaconda3). Create a
-new anaconda environment with the name "jnb".\</span>
+The next step will install the Anaconda environment into the home directory
+(`/home/userxx/anaconda3`). Create a new Anaconda environment with the name "jnb":

-    conda create --name jnb
+```Bash
+conda create --name jnb
+```

### Set environmental variables on Taurus

-\<span style="font-size: 1em;">In shell activate previously created
-python environment (you can deactivate it also manually) and Install
-jupyter packages for this python environment:\</span>
+In the shell, activate the previously created python environment (you can also deactivate it
+manually) and install the Jupyter packages for this python environment:

-    source activate jnb
-    conda install jupyter
+```Bash
+source activate jnb
+conda install jupyter
+```

-\<span style="font-size: 1em;">If you need to adjust the config, you
-should create the template. Generate config files for jupyter notebook
-server:\</span>
+If you need to adjust the config, you should create the template first. Generate the config files
+for the Jupyter notebook server:

-    jupyter notebook --generate-config
+```Bash
+jupyter notebook --generate-config
+```

-Find a path of the configuration file, usually in the home under
-.jupyter directory, e.g.\<br
-/>/home//.jupyter/jupyter_notebook_config.py
+Find the path of the configuration file, usually in your home under the `.jupyter` directory, e.g.
+`/home/<zih_user>/.jupyter/jupyter_notebook_config.py`

-\<br />Set a password (choose easy one for testing), which is needed
-later on to log into the server in browser session:
+Set a password (choose an easy one for testing), which is needed later on to log into the server
+in the browser session:

-    jupyter notebook password
-    Enter password:
-    Verify password:
+```Bash
+jupyter notebook password
+Enter password:
+Verify password:
+```

you will get a message like that:

-    [NotebookPasswordApp] Wrote *hashed password* to /home/<zih_user>/.jupyter/jupyter_notebook_config.json
+```Bash
+[NotebookPasswordApp] Wrote *hashed password* to
+/home/<zih_user>/.jupyter/jupyter_notebook_config.json
+```

-I order to create an SSL certificate for https connections, you can
-create a self-signed certificate:
+In order to create an SSL certificate for https connections, you can create a self-signed
+certificate:

-    openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem
+```Bash
+openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem
+```

-fill in the form with decent values
+Fill in the form with decent values.

-Possible entries for your jupyter config
-(\_.jupyter/jupyter_notebook*config.py*). Uncomment below lines:
+Possible entries for your Jupyter config (`.jupyter/jupyter_notebook_config.py`). Uncomment the
+lines below:

-    c.NotebookApp.certfile = u'<path-to-cert>/mycert.pem'
-    c.NotebookApp.keyfile = u'<path-to-cert>/mykey.key'
-
-    # set ip to '*' otherwise server is bound to localhost only
-    c.NotebookApp.ip = '*'
-    c.NotebookApp.open_browser = False
-
-    # copy hashed password from the jupyter_notebook_config.json
-    c.NotebookApp.password = u'<your hashed password here>'
-    c.NotebookApp.port = 9999
-    c.NotebookApp.allow_remote_access = True
+```Bash
+c.NotebookApp.certfile = u'<path-to-cert>/mycert.pem'
+c.NotebookApp.keyfile = u'<path-to-cert>/mykey.key'
+
+# set ip to '*' otherwise server is bound to localhost only
+c.NotebookApp.ip = '*'
+c.NotebookApp.open_browser = False
+
+# copy hashed password from the jupyter_notebook_config.json
+c.NotebookApp.password = u'<your hashed password here>'
+c.NotebookApp.port = 9999
+c.NotebookApp.allow_remote_access = True
+```

-Note: \<path-to-cert> - path to key and certificate files, for example:
-('/home/\<username>/mycert.pem')
+Note: `<path-to-cert>` is the path to the key and certificate files, for example:
+`/home/<username>/mycert.pem`

### SLURM job file to run the jupyter server on Taurus with GPU (1x K80) (also works on K20)

-    #!/bin/bash -l
-    #SBATCH --gres=gpu:1 # request GPU
-    #SBATCH --partition=gpu2 # use GPU partition
-    #SBATCH --output=notebok_output.txt
-    #SBATCH --nodes=1
-    #SBATCH --ntasks=1
-    #SBATCH --time=02:30:00
-    #SBATCH --mem=4000M
-    #SBATCH -J "jupyter-notebook" # job-name
-    #SBATCH -A <name_of_your_project>
-
-    unset XDG_RUNTIME_DIR # might be required when interactive instead of sbatch to avoid 'Permission denied error'
-    srun jupyter notebook
+```Bash
+#!/bin/bash -l
+#SBATCH --gres=gpu:1                 # request GPU
+#SBATCH --partition=gpu2             # use GPU partition
+#SBATCH --output=notebook_output.txt
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --time=02:30:00
+#SBATCH --mem=4000M
+#SBATCH -J "jupyter-notebook"        # job name
+#SBATCH -A <name_of_your_project>
+
+# unsetting XDG_RUNTIME_DIR might be required when run interactively instead of via sbatch,
+# to avoid a 'Permission denied' error
+unset XDG_RUNTIME_DIR
+srun jupyter notebook
+```

-Start the script above (e.g. with the name jnotebook) with sbatch
-command:
+Start the script above (e.g. with the name jnotebook) with the sbatch command:

-    sbatch jnotebook.slurm
+```Bash
+sbatch jnotebook.slurm
+```

-If you have a question about sbatch script see the article about \<a
-href="Slurm" target="\_blank">SLURM\</a>
+If you have a question about the sbatch script, see the article about [Slurm](../jobs/Slurm.md).

-Check by the command: '\<span>tail notebook_output.txt'\</span> the
-status and the **token** of the server. It should look like this:
+Check the status and the **token** of the server with the command `tail notebook_output.txt`. It
+should look like this:

-    https://(taurusi2092.taurus.hrsk.tu-dresden.de or 127.0.0.1):9999/
+```Bash
+https://(taurusi2092.taurus.hrsk.tu-dresden.de or 127.0.0.1):9999/
+```

-\<span style="font-size: 1em;">You can see the \</span>**server node's
-hostname**\<span style="font-size: 1em;">by the command:
-'\</span>\<span>squeue -u \<username>'\</span>\<span style="font-size:
-1em;">.\</span>
+You can see the **server node's hostname** with the command `squeue -u <username>`.

-\<span style="color: #222222; font-size: 1.231em;">Remote connect to the
-server\</span>
+### Remote connect to the server

There are two options on how to connect to the server:

-\<span style="font-size: 1em;">1. You can create an ssh tunnel if you
-have problems with the solution above.\</span> \<span style="font-size:
-1em;">Open the other terminal and configure ssh tunnel: \</span>\<span
-style="font-size: 1em;">(look up connection values in the output file of
-slurm job, e.g.)\</span> (recommended):
-
-    node=taurusi2092 #see the name of the node with squeue -u <your_login>
-    localport=8887 #local port on your computer
-    remoteport=9999 #pay attention on the value. It should be the same value as value in the notebook_output.txt
-    ssh -fNL ${localport}:${node}:${remoteport} <zih_user>@taurus.hrsk.tu-dresden.de #configure of the ssh tunnel for connection to your remote server
-    pgrep -f "ssh -fNL ${localport}" #verify that tunnel is alive
-
-\<span style="font-size: 1em;">2. On your client (local machine) you now
-can connect to the server. You need to know the\</span>** node's
-hostname**\<span style="font-size: 1em;">, the \</span> **port** \<span
-style="font-size: 1em;"> of the server and the \</span> **token** \<span
-style="font-size: 1em;"> to login (see paragraph above).\</span>
-
-You can connect directly if you know the IP address (just ping the
-node's hostname while logged on Taurus).
-
-    #comand on remote terminal
-    taurusi2092$> host taurusi2092
-    # copy IP address from output
-    # paste IP to your browser or call on local terminal e.g.
-    local$> firefox https://<IP>:<PORT> # https important to use SSL cert
-
-To login into the jupyter notebook site, you have to enter the
-**token**. (<https://localhost:8887>). Now you can create and execute
-notebooks on Taurus with GPU support.
-
-%RED%Note:<span class="twiki-macro ENDCOLOR"></span> If you would like
-to use \<a href="JupyterHub" target="\_self">jupyterhub\</a> after using
-a remote manually configurated jupyter server (example above) you need
-to change the name of the configuration file
-(/home//.jupyter/jupyter_notebook_config.py) to any other.
+1. You can create an SSH tunnel if you have problems with the solution above (recommended). Open
+another terminal and configure the SSH tunnel (look up the connection values in the output file of
+the Slurm job, e.g.):
+
+```Bash
+node=taurusi2092     # see the name of the node with: squeue -u <your_login>
+localport=8887       # local port on your computer
+remoteport=9999      # pay attention to the value; it should be the same as in notebook_output.txt
+ssh -fNL ${localport}:${node}:${remoteport} <zih_user>@taurus.hrsk.tu-dresden.de   # configure the ssh tunnel for the connection to your remote server
+pgrep -f "ssh -fNL ${localport}"   # verify that the tunnel is alive
+```
+
+2. On your client (local machine) you can now connect to the server. You need to know the **node's
+   hostname**, the **port** of the server and the **token** to log in (see the paragraph above).
+
+You can connect directly if you know the IP address (just ping the node's hostname while logged in
+on Taurus).
+
+```Bash
+# command on the remote terminal
+taurusi2092$> host taurusi2092
+# copy the IP address from the output
+# paste the IP to your browser or call on the local terminal e.g.
+local$> firefox https://<IP>:<PORT>   # https is important to use the SSL certificate
+```
+
+To log in to the Jupyter notebook site, you have to enter the **token**
+(e.g. at `https://localhost:8887`). Now you can create and execute notebooks on Taurus with GPU
+support.
+
+If you would like to use [JupyterHub](JupyterHub.md) after using a manually configured remote
+Jupyter server (example above), you need to change the name of the configuration file
+(`/home/<zih_user>/.jupyter/jupyter_notebook_config.py`) to something else.

### F.A.Q

-Q: - I have an error to connect to the Jupyter server (e.g. "open
-failed: administratively prohibited: open failed")
+**Q:** I get an error when connecting to the Jupyter server (e.g. "open failed: administratively
+prohibited: open failed")

-A: - Check the settings of your \<span style="font-size: 1em;">jupyter
-config file. Is it all necessary lines uncommented, the right path to
-cert and key files, right hashed password from .json file? Check is the
-used local port \<a
-href="<https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers>"
-target="\_blank">available\</a>? Check local settings e.g.
-(/etc/ssh/sshd_config, /etc/hosts)\</span>
+**A:** Check the settings of your Jupyter config file: are all necessary lines uncommented, is the
+path to the cert and key files correct, and is the hashed password from the `.json` file right?
+Check whether the used local port is
+[available](https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers).
+Check the local settings, e.g. `/etc/ssh/sshd_config` and `/etc/hosts`.

-Q: I have an error during the start of the interactive session (e.g.
-PMI2_Init failed to initialize. Return code: 1)
+**Q:** I have an error during the start of the interactive session (e.g. PMI2_Init failed to
+initialize. Return code: 1)

-A: Probably you need to provide --mpi=none to avoid ompi errors ().
-\<span style="font-size: 1em;">srun --mpi=none --reservation \<...> -A
-\<...> -t 90 --mem=4000 --gres=gpu:1 --partition=gpu2-interactive --pty
-bash -l\</span>
+**A:** Probably you need to provide `--mpi=none` to avoid OpenMPI errors, e.g.
+`srun --mpi=none --reservation <...> -A <...> -t 90 --mem=4000 --gres=gpu:1
+--partition=gpu2-interactive --pty bash -l`
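+
+For example, such an interactive session might be started like this (reservation and project names
+are placeholders; adjust them to your allocation):
+
+```Bash
+# interactive GPU session that avoids the PMI2/OpenMPI initialization error
+srun --mpi=none --reservation <reservation_name> -A <project_name> -t 90 --mem=4000 \
+     --gres=gpu:1 --partition=gpu2-interactive --pty bash -l
+```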