diff --git a/doc.zih.tu-dresden.de/docs/software/pytorch.md b/doc.zih.tu-dresden.de/docs/software/pytorch.md
index 3c2e88a6c9fc209c246ede0e50410771be541c3f..e84f3aac54a88e0984b0da17e3e3527fe37e7b46 100644
--- a/doc.zih.tu-dresden.de/docs/software/pytorch.md
+++ b/doc.zih.tu-dresden.de/docs/software/pytorch.md
@@ -1,11 +1,11 @@
 # PyTorch

-[PyTorch](https://pytorch.org/){:target="_blank"} is an open-source machine learning framework.
+[PyTorch](https://pytorch.org/) is an open-source machine learning framework.
 It is an optimized tensor library for deep learning using GPUs and CPUs.
-PyTorch is a machine learning tool developed by Facebooks AI division to process large-scale
+PyTorch is a machine learning tool developed by Facebook's AI division to process large-scale
 object detection, segmentation, classification, etc.
 PyTorch provides a core data structure, the tensor, a multi-dimensional array that shares many
-similarities with Numpy arrays.
+similarities with NumPy arrays.

 Please check the software modules list via

@@ -13,9 +13,9 @@ Please check the software modules list via
 marie@login$ module spider pytorch
 ```

-to find out, which PyTorch modules are available on your partition.
+to find out which PyTorch modules are available.

-We recommend using partitions alpha and/or ml when working with machine learning workflows
+We recommend using the partitions `alpha` and/or `ml` when working with machine learning workflows
 and the PyTorch library.
 You can find detailed hardware specification in our
 [hardware documentation](../jobs_and_resources/hardware_overview.md).
@@ -25,7 +25,8 @@ You can find detailed hardware specification in our
 On the partition `alpha`, load the module environment:

 ```console
-marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission on alpha nodes with 1 gpu on 1 node with 800 Mb per CPU
+# Job submission on alpha nodes with 1 GPU on 1 node with 800 MB per CPU
+marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash
 marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0
 Die folgenden Module wurden in einer anderen Version erneut geladen:
   1) modenv/scs5 => modenv/hiera
@@ -34,6 +35,7 @@ Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies
 ```

 ??? hint "Torchvision on partition `alpha`"
+
     On the partition `alpha`, the module torchvision is not yet available within the module
     system. (19.08.2021)
     Torchvision can be made available by using a virtual environment:
@@ -45,12 +47,13 @@ Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies
     ```

     Using the **--no-deps** option for "pip install" is necessary here as otherwise the PyTorch
-    version might be replaced and you will run into trouble with the cuda drivers.
+    version might be replaced and you will run into trouble with the CUDA drivers.
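+
+    A quick way to check that the PyTorch module and the torchvision installed into the virtual
+    environment work together is a small import test (a suggested check only; the versions and
+    the output depend on the modules you loaded):
+
+    ```python3
+    import torch
+    import torchvision
+
+    # Both versions should be printed; on a GPU node, CUDA should be reported as available
+    print(torch.__version__, torchvision.__version__)
+    print(torch.cuda.is_available())
+    ```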
 On the partition `ml`:

 ```console
-marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission in ml nodes with 1 gpu on 1 node with 800 Mb per CPU
+# Job submission on ml nodes with 1 GPU on 1 node with 800 MB per CPU
+marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash
 ```

 After calling
@@ -62,8 +65,8 @@ marie@login$ module spider pytorch
 we know that we can load PyTorch (including torchvision) with

 ```console
 marie@ml$ module load modenv/ml torchvision/0.7.0-fosscuda-2019b-Python-3.7.4-PyTorch-1.6.0
 Module torchvision/0.7.0-fosscuda-2019b-Python-3.7.4-PyTorch-1.6.0 and 55 dependencies loaded.
 ```

 Now, we check that we can access PyTorch:
@@ -75,19 +78,24 @@ marie@{ml,alpha}$ python -c "import torch; print(torch.__version__)"
 The following example shows how to create a python virtual environment and import PyTorch.

 ```console
-marie@ml$ mkdir python-environments #create folder
-marie@ml$ which python #check which python are you using
+# Create folder
+marie@ml$ mkdir python-environments
+# Check which Python you are using
+marie@ml$ which python
 /sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python
-marie@ml$ virtualenv --system-site-packages python-environments/env #create virtual environment "env" which inheriting with global site packages
+# Create virtual environment "env", which inherits the global site packages
+marie@ml$ virtualenv --system-site-packages python-environments/env
 [...]
-marie@ml$ source python-environments/env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$
+# Activate virtual environment "env". Example output: (env) bash-4.2$
+marie@ml$ source python-environments/env/bin/activate
 marie@ml$ python -c "import torch; print(torch.__version__)"
 ```

 ## PyTorch in JupyterHub

-In addition to using interactive and batch jobs, it is possible to work with PyTorch using JupyterHub.
-The production and test environments of JupyterHub contain Python kernels, that come with a PyTorch support.
+In addition to using interactive and batch jobs, it is possible to work with PyTorch using
+JupyterHub. The production and test environments of JupyterHub contain Python kernels that come
+with PyTorch support.

 {: align="center"}
@@ -96,3 +104,62 @@ The production and test environments of JupyterHub contain Python kernels, that

 For details on how to run PyTorch with multiple GPUs and/or multiple nodes, see
 [distributed training](distributed_training.md).
+
+## Migrate PyTorch Script from CPU to GPU
+
+It is recommended to use GPUs when working with large training data sets. While TensorFlow
+automatically uses GPUs if they are available, in PyTorch you have to move your tensors manually.
+
+First, you need to import `torch.cuda`:
+
+```python3
+import torch.cuda
+```
+
+Then you define a `device` variable, which is automatically set to 'cuda' when a GPU is available,
+with this code:
+
+```python3
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+```
+
+You then have to move all of your tensors to the selected device.
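+
+The same applies to a model built from `torch.nn` modules: its parameters stay in CPU memory until
+you move them. A minimal sketch, where the small network is only a placeholder for your own model:
+
+```python3
+import torch.nn as nn
+
+# Build a placeholder model and move all of its parameters to the selected device
+model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
+```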
+For the training data itself, this looks like this:
+
+```python3
+x_train = torch.FloatTensor(x_train).to(device)
+y_train = torch.FloatTensor(y_train).to(device)
+```
+
+Remember that this does not break backward compatibility when you port the script back to a
+computer without a GPU, because without a GPU, `device` is simply set to 'cpu'.
+
+### Caveats
+
+#### Moving Data Back to the CPU Memory
+
+The CPU cannot directly access variables stored on the GPU. If you want to use the variables, e.g.,
+in a `print` statement or when processing them with NumPy or anything else that is not PyTorch, you
+have to move them back to CPU memory first. This may look like this:
+
+```python3
+cpu_x_train = x_train.cpu()
+print(cpu_x_train)
+...
+error_train = np.sqrt(metrics.mean_squared_error(y_train[:,1].cpu(), y_prediction_train[:,1]))
+```
+
+Remember that, without `.detach()` before `.cpu()`, the tensors may stay connected, so changing
+`cpu_x_train` can also change `x_train`. If you want to treat them independently, use
+
+```python3
+cpu_x_train = x_train.detach().cpu()
+```
+
+Now you can change `cpu_x_train` without `x_train` being affected.
+
+#### Speed Improvements and Batch Size
+
+When you have a lot of very small data points, the speed may actually decrease when you try to
+train them on the GPU. This is because moving data from CPU memory to GPU memory takes time. If
+this occurs, please try using a very large batch size. This way, copying back and forth only takes
+place a few times and the bottleneck may be reduced.
diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell
index 1b550736a6387fa243f1d8334ee1313593938b63..dadfc9dd82834cd0bbdfb7e485ca21262a69cd8a 100644
--- a/doc.zih.tu-dresden.de/wordlist.aspell
+++ b/doc.zih.tu-dresden.de/wordlist.aspell
@@ -324,6 +324,8 @@ todo
 ToDo
 toolchain
 toolchains
+torchvision
+Torchvision
 tracefile
 tracefiles
 tracepoints
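For the batch size remark in `pytorch.md` above, a minimal sketch of how a larger batch size is
typically set when the training data is wrapped in a `DataLoader`; the data, sizes, and loop body
here are only placeholders:

```python3
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Placeholder training data; replace with your own tensors
x_train = torch.randn(100000, 16)
y_train = torch.randn(100000, 1)

# A large batch size reduces how often data is copied between CPU and GPU memory
train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=4096, shuffle=True)

for x_batch, y_batch in train_loader:
    # Move one large batch at a time to the selected device
    x_batch, y_batch = x_batch.to(device), y_batch.to(device)
    # ... forward pass, loss, backward pass ...
```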