Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 15 dependencies loaded.
```
Please allocate the job according to the
[hardware specification](../jobs_and_resources/hardware_taurus.md)! Note that the nodes in the `ml`
partition have 4-way SMT, so for every physical core allocated, you will always get
4\*1443 MB = 5772 MB of memory.
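For illustration, an interactive allocation on `ml` might look as follows (a minimal sketch; the
thread count of 8, i.e. two physical cores, is only an example):

```Bash
# 8 hardware threads = 2 physical cores on the 4-way SMT nodes;
# memory: 8 * 1443 MB = 11544 MB in total
srun --partition=ml --ntasks=1 --cpus-per-task=8 --mem-per-cpu=1443M --pty bash
```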
!!! warning
    Be aware that for compatibility reasons it is important to choose modules with
    the same toolchain version (in this case `fosscuda/2019b`). For reference, see the
    [modules page](modules.md).
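For example, picking R and TensorFlow from the same toolchain could look like this (the exact R
module version is an assumption; check `module avail R` on the system):

```Bash
# both modules come from fosscuda/2019b, so compiler, MPI and CUDA libraries match
module load R/3.6.2-fosscuda-2019b
module load TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4
```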
In order to interact with Python-based frameworks (like TensorFlow), the `reticulate` R library is
used. To configure it to point to the correct Python executable in your virtual environment, create
a file named `.Rprofile` in your project directory (e.g. `R-TensorFlow`) with the following
contents:
```R
Sys.setenv(RETICULATE_PYTHON="/sw/installed/Anaconda3/2019.03/bin/python")#assign the output of the 'which python' from above to RETICULATE_PYTHON
Sys.setenv(RETICULATE_PYTHON="/sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python")#assign the output of the 'which python' from above to RETICULATE_PYTHON
```
Let's start R, install the required libraries, and evaluate the result:
```R
install.packages(c("reticulate", "tensorflow"))
library(reticulate)
reticulate::py_config()
library(tensorflow)
tf$constant("Hello TensorFlow")  # in the output, 'Tesla V100-SXM2-32GB' should be mentioned
```
## Parallel Computing with R
Generally, R code is serial. However, many computations in R can be made faster by the use of
parallel computations. Large amounts of data and/or the use of complex models are indications
for parallelization. This section concentrates on the most general methods and examples.
The [parallel](https://www.rdocumentation.org/packages/parallel/versions/3.6.2) library
will be used below.
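As a first taste, here is a minimal shared-memory sketch for a single node (the fallback worker
count of 4 is an arbitrary assumption, and the squaring function is a stand-in for real work):

```R
library(parallel)

# use the number of CPUs granted by Slurm; fall back to 4 (arbitrary) otherwise
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "4"))

# fork-based parallel map: apply the (illustrative) function to 1..100
result <- mclapply(1:100, function(x) x^2, mc.cores = n_cores)
```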
...
...
This approach to R parallelism uses
[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (Message Passing Interface) as a
"back-end" for its parallel operations. Submitting an MPI-based job in R is very similar to
submitting an [MPI job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks), since
both run multicore jobs on multiple nodes. Below is an example of running an R script with Rmpi
on the ZIH system:
```Bash
#!/bin/bash
...
...
#SBATCH --ntasks=32        # determines how many processes will be spawned; please use >= 8
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --output=test_Rmpi.out
#SBATCH --error=test_Rmpi.err
module purge
module load modenv/scs5
...
...
However, in some specific cases, you can specify the number of nodes and the number of
tasks per node explicitly:
```Bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
module purge
module load modenv/scs5
module load R
...
...
Another example:

```R
...
#snow::stopCluster(cl)  # usually it hangs here with Open MPI > 2.0; in that case this command may be omitted, since Slurm will clean up after the job finishes
```
To use Rmpi and MPI, please use one of these partitions: `haswell`, `broadwell` or `rome`.
Use the `mpirun` command to start the R script. It is a wrapper that enables communication
between processes running on different nodes. It is important to use `-np 1` (the number of
processes spawned by `mpirun`), since Rmpi takes care of spawning the worker processes itself.
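For illustration, a minimal sketch of the R side is shown below; the file name `test_Rmpi.R` is a
hypothetical choice matching the log names above, and the script only demonstrates the
spawn/report/teardown cycle rather than a real workload:

```R
library(Rmpi)

# spawn one worker per remaining MPI slot (the master already occupies one)
mpi.spawn.Rslaves(nslaves = mpi.universe.size() - 1)

# each worker reports its rank, as a simple liveness check
mpi.remote.exec(paste("Worker", mpi.comm.rank(), "of", mpi.comm.size()))

# shut the workers down cleanly and exit
mpi.close.Rslaves()
mpi.quit()
```

It would then be started from the batch script with, e.g., `mpirun -np 1 R CMD BATCH test_Rmpi.R`.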