From 24ec0536af21dcd9dc762f63cc1ae31ffa86c040 Mon Sep 17 00:00:00 2001
From: Martin Schroschk <martin.schroschk@tu-dresden.de>
Date: Thu, 4 Nov 2021 08:13:07 +0100
Subject: [PATCH] Fix line length; minor issues

---
 .../docs/software/pytorch.md | 58 +++++++++++--------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/doc.zih.tu-dresden.de/docs/software/pytorch.md b/doc.zih.tu-dresden.de/docs/software/pytorch.md
index 12d90e43a..923ed7e73 100644
--- a/doc.zih.tu-dresden.de/docs/software/pytorch.md
+++ b/doc.zih.tu-dresden.de/docs/software/pytorch.md
@@ -1,6 +1,6 @@
 # PyTorch
 
-[PyTorch](https://pytorch.org/){:target="_blank"} is an open-source machine learning framework.
+[PyTorch](https://pytorch.org/) is an open-source machine learning framework.
 It is an optimized tensor library for deep learning using GPUs and CPUs.
 PyTorch is a machine learning tool developed by Facebook's AI division to process large-scale
 object detection, segmentation, classification, etc.
@@ -13,9 +13,9 @@ Please check the software modules list via
 marie@login$ module spider pytorch
 ```
 
-to find out, which PyTorch modules are available on your partition.
+to find out which PyTorch modules are available.
 
-We recommend using partitions alpha and/or ml when working with machine learning workflows
+We recommend using partitions `alpha` and/or `ml` when working with machine learning workflows
 and the PyTorch library. You can find detailed hardware specifications in our
 [hardware documentation](../jobs_and_resources/hardware_overview.md).
@@ -25,7 +25,8 @@ You can find detailed hardware specification in our
 On the partition `alpha`, load the module environment:
 
 ```console
-marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission on alpha nodes with 1 gpu on 1 node with 800 Mb per CPU
+# Job submission on alpha nodes with 1 GPU on 1 node with 800 MB per CPU
+marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash
 marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0
 The following modules were reloaded in a different version:
   1) modenv/scs5 => modenv/hiera
@@ -34,6 +35,7 @@ Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies
 ```
 
 ??? hint "Torchvision on partition `alpha`"
+
     On the partition `alpha`, the module torchvision is not yet available within the module
     system. (19.08.2021)
     Torchvision can be made available by using a virtual environment:
@@ -50,7 +52,8 @@ Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies
 On the partition `ml`:
 
 ```console
-marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission in ml nodes with 1 gpu on 1 node with 800 Mb per CPU
+# Job submission on ml nodes with 1 GPU on 1 node with 800 MB per CPU
+marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash
 ```
 
 After calling
@@ -75,19 +78,24 @@ marie@{ml,alpha}$ python -c "import torch; print(torch.__version__)"
 The following example shows how to create a Python virtual environment and import PyTorch.
 ```console
-marie@ml$ mkdir python-environments #create folder
-marie@ml$ which python #check which python are you using
+# Create folder
+marie@ml$ mkdir python-environments
+# Check which python you are using
+marie@ml$ which python
 /sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python
-marie@ml$ virtualenv --system-site-packages python-environments/env #create virtual environment "env" which inheriting with global site packages
+# Create the virtual environment "env", which inherits the global site packages
+marie@ml$ virtualenv --system-site-packages python-environments/env
 [...]
-marie@ml$ source python-environments/env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$
+# Activate the virtual environment "env". Example output: (env) bash-4.2$
+marie@ml$ source python-environments/env/bin/activate
 marie@ml$ python -c "import torch; print(torch.__version__)"
 ```
 
 ## PyTorch in JupyterHub
 
-In addition to using interactive and batch jobs, it is possible to work with PyTorch using JupyterHub.
-The production and test environments of JupyterHub contain Python kernels, that come with a PyTorch support.
+In addition to using interactive and batch jobs, it is possible to work with PyTorch using
+JupyterHub. The production and test environments of JupyterHub contain Python kernels that come
+with PyTorch support.
 
 ![PyTorch module in JupyterHub](misc/Pytorch_jupyter_module.png)
 {: align="center"}
@@ -99,8 +107,8 @@ For details on how to run PyTorch with multiple GPUs and/or multiple nodes, see
 ## Migrate PyTorch-script from CPU to GPU
 
-It is recommended to use GPUs when using large training data sets. While TensorFlow automatically uses GPUs if they are available, in
-PyTorch you have to move your tensors manually.
+It is recommended to use GPUs when using large training data sets. While TensorFlow automatically
+uses GPUs if they are available, in PyTorch you have to move your tensors manually.
 First, you need to import `torch.cuda`:
@@ -108,7 +116,8 @@ First, you need to import `torch.cuda`:
 
 ```python3
 import torch.cuda
 ```
 
-Then you define a `device`-variable, which is set to 'cuda' automatically when a GPU is available with this code:
+Then you define a `device` variable, which is set to 'cuda' automatically when a GPU is available,
+with this code:
 
 ```python3
 device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 ```
@@ -121,15 +130,16 @@ x_train = torch.FloatTensor(x_train).to(device)
 y_train = torch.FloatTensor(y_train).to(device)
 ```
 
-Remember that this does not break backward compatibility when you port the script back to a computer without GPU, because without GPU,
-`device` is set to 'cpu'.
+Remember that this does not break backward compatibility when you port the script back to a computer
+without a GPU, because without a GPU, `device` is set to 'cpu'.
 
 ### Caveats
 
 #### Moving Data Back to the CPU-Memory
 
-The CPU cannot directly access variables stored on the GPU. If you want to use the variables, e.g. in a `print`-statement or
-when editing with NumPy or anything that is not PyTorch, you have to move them back to the CPU-memory again. This then may look like this:
+The CPU cannot directly access variables stored on the GPU. If you want to use the variables, e.g.,
+in a `print` statement or when editing with NumPy or anything that is not PyTorch, you have to move
+them back to the CPU-memory again. This may look like this:
 
 ```python3
 cpu_x_train = x_train.cpu()
 print(cpu_x_train)
@@ -138,8 +148,8 @@ error_train = np.sqrt(metrics.mean_squared_error(y_train[:,1].cpu(), y_prediction_train[:,1]))
 ```
 
-Remember that, without `.detach()` before the CPU, if you change `cpu_x_train`, `x_train` will also be changed.
-If you want to treat them independently, use
+Remember that, without calling `.detach()` before `.cpu()`, if you change `cpu_x_train`, `x_train` will also
+be changed.
+If you want to treat them independently, use
 
 ```python3
 cpu_x_train = x_train.detach().cpu()
 ```
@@ -149,7 +159,7 @@ Now you can change `cpu_x_train` without `x_train` being affected.
 
 #### Speed Improvements and Batch Size
 
-When you have a lot of very small data points, the speed may actually decrease when you try to train them on the GPU.
-This is because moving data from the CPU-memory to the GPU-memory takes time. If this occurs, please try using
-a very large batch size. This way, copying back and forth only takes places a few times and the bottleneck may
-be reduced.
+When you have a lot of very small data points, the speed may actually decrease when you try to train
+them on the GPU. This is because moving data from the CPU-memory to the GPU-memory takes time. If
+this occurs, please try using a very large batch size. This way, copying back and forth only takes
+place a few times and the bottleneck may be reduced.
-- 
GitLab
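The batch-size advice in the last paragraph of the patched page can be sketched with `torch.utils.data.DataLoader`. This is an illustrative sketch, not part of the documented workflow: the tensor shapes, batch size, and variable names are hypothetical, and the device selection falls back to the CPU when no GPU is available, as in the examples above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data set: many very small samples (10,000 rows of 8 features)
x = torch.randn(10_000, 8)
y = torch.randn(10_000, 1)
dataset = TensorDataset(x, y)

# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# A large batch size means fewer, larger CPU-to-GPU copies:
# 10,000 samples / 2,000 per batch = 5 transfers per epoch
loader = DataLoader(dataset, batch_size=2_000, shuffle=True)

for batch_x, batch_y in loader:
    # Each batch is moved to the device in one large copy
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)
    # ... forward and backward pass would go here ...
```

With `batch_size=2_000` the loop transfers data only five times per epoch; with `batch_size=1` it would do so 10,000 times, which is where the bottleneck described above comes from.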