Commit af54e90c authored by Taras Lazariv

Update data_analytics_with_r.md according to new rules
R possesses an extensive catalog of statistical and graphical methods. It includes machine
learning algorithms, linear regression, time series, and statistical inference.
We recommend using the `haswell` and/or `rome` partitions to work with R. For more details
see [here](../jobs_and_resources/hardware_taurus.md).
## R Console
In the following example, the `srun` command is used to submit a real-time execution job
designed for interactive use with monitoring the output. Please check
[the Slurm page](../jobs_and_resources/slurm.md) for details.
```console
marie@login$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash
marie@compute$ module load modenv/scs5
marie@compute$ module available R/3.6
marie@compute$ module load R
marie@compute$ which R
marie@compute$ R
```
Using `srun` is recommended only for short test runs, while for larger runs batch jobs should be
used. The examples can be found [here](get_started_with_hpcda.md).
It is also possible to run the `Rscript` command directly (after loading the module):
```Bash
# Run Rscript directly. For instance: Rscript /scratch/ws/0/marie-study_project/my_r_script.R
Rscript /path/to/script/your_script.R <param1> <param2>
```
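Inside the script, the positional parameters can then be read with `commandArgs`. A minimal
sketch (the file name and the meaning of the parameters are hypothetical):

```R
# my_r_script.R (hypothetical name) - read the two positional parameters
args <- commandArgs(trailingOnly = TRUE)  # e.g. c("param1", "param2")
input_file <- args[1]                     # first parameter
n_repeats  <- as.integer(args[2])         # second parameter, interpreted as a number
cat("input:", input_file, "- repeats:", n_repeats, "\n")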
## R in JupyterHub
In addition to using interactive and batch jobs, it is possible to work with R using
[JupyterHub](../access/jupyterhub.md).
The production and test [environments](../access/jupyterhub.md#standard-environments) of
......@@ -60,16 +52,14 @@ For using R with RStudio please refer to [Data Analytics with RStudio](data_anal
## Install Packages in R
By default, user-installed packages are saved in the user's home directory in a folder depending on
the architecture (`x86` or `PowerPC`). Therefore, the packages should be installed using interactive
jobs on the compute node:
```console
marie@compute$ module load R
Module R/3.6.0-foss-2019a and 56 dependencies loaded.
marie@compute$ R -e 'install.packages("ggplot2")'
[...]
```
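To verify the installation and to see where user packages end up, the library path can be
inspected from within R (a sketch; the exact path depends on the architecture and R version):

```R
.libPaths()                # the first entry is usually the user library in the home directory
library(ggplot2)           # load the freshly installed package
packageVersion("ggplot2")  # confirm which version was installed
```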
## Deep Learning with R
The ["TensorFlow" R package](https://tensorflow.rstudio.com/) provides R users access to the
TensorFlow framework. [TensorFlow](https://www.tensorflow.org/) is an open-source software library
for numerical computation using data flow graphs.
The respective modules can be loaded with the following:
```console
marie@compute$ module load R/3.6.2-fosscuda-2019b
Module R/3.6.2-fosscuda-2019b and 63 dependencies loaded.
marie@compute$ module load TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4
Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 15 dependencies loaded.
```
Please allocate the job with respect to the
[hardware specification](../jobs_and_resources/hardware_taurus.md)! Note that the nodes in the `ml`
partition have 4-way SMT, so for every physical core allocated, you will always get
4\*1443 MB = 5772 MB of memory.
!!! warning

    Be aware that for compatibility reasons it is important to choose modules with
    the same toolchain version (in this case `fosscuda/2019b`). For reference, see
    [the modules page](modules.md).
In order to interact with Python-based frameworks (like TensorFlow), the `reticulate` R library is
used. To configure it to point to the correct Python executable in your virtual environment, create
a file named `.Rprofile` in your project directory (e.g. `R-TensorFlow`) with the following
contents:
```R
Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python") # assign the output of `which python` from above
```
Let's start R, install some libraries and evaluate the result:
```rconsole
> install.packages(c("reticulate", "tensorflow"))
Installing packages into ‘~/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
> reticulate::py_config()
python: /software/rome/Python/3.7.4-GCCcore-8.3.0/bin/python
libpython: /sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/libpython3.7m.so
pythonhome: /software/rome/Python/3.7.4-GCCcore-8.3.0:/software/rome/Python/3.7.4-GCCcore-8.3.0
version: 3.7.4 (default, Mar 25 2020, 13:46:43) [GCC 8.3.0]
numpy: /software/rome/SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/numpy
numpy_version: 1.17.3
NOTE: Python version was forced by RETICULATE_PYTHON
> library(tensorflow)
2021-08-26 16:11:47.110548: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
> tf$constant("Hello TensorFlow")
2021-08-26 16:14:00.269248: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-08-26 16:14:00.674878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:0b:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
[...]
tf.Tensor(b'Hello TensorFlow', shape=(), dtype=string)
```
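To double-check that TensorFlow actually sees the allocated GPU, the device list can be queried
from R as well (a sketch using the `tensorflow` R package API for TensorFlow 2.x):

```R
library(tensorflow)
tf$config$list_physical_devices("GPU")  # should list the allocated GPU device
```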
??? example
## Parallel Computing with R
Generally, the R code is serial. However, many computations in R can be made faster by the use of
parallel computations. This section concentrates on the most general methods and examples.
The [parallel](https://www.rdocumentation.org/packages/parallel/versions/3.6.2) library
will be used below.
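As a minimal sketch of shared-memory parallelism with the `parallel` library (assuming an
interactive allocation; `SLURM_CPUS_PER_TASK` is set by Slurm inside a job):

```R
library(parallel)

# Use the number of CPUs granted by Slurm; fall back to 1 outside a job
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

cl <- makeCluster(n_cores)                        # start a socket cluster
squares <- parSapply(cl, 1:100, function(x) x^2)  # distribute the work over the workers
stopCluster(cl)                                   # always release the workers
```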
This way of R parallelism uses
[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (Message Passing Interface) as a
"back-end" for its parallel operations. The MPI-based job in R is very similar to submitting an
[MPI job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks), since both run
multicore jobs on multiple nodes. Below is an example of running an R script with `Rmpi` on the
ZIH system:
```Bash
#!/bin/bash
#SBATCH --ntasks=32 # this parameter determines how many processes will be spawned, please use >=8
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --output=test_Rmpi.out
#SBATCH --error=test_Rmpi.err
module purge
module load modenv/scs5
```

However, in some specific cases, you can specify the number of nodes and the number of
tasks per node explicitly:
```Bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=1
module purge
module load modenv/scs5
module load R
#snow::stopCluster(cl) # with OpenMPI > 2.0 this call may hang; in that case it can be omitted, Slurm will clean up after the job finishes
```
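For orientation, a minimal `Rmpi` script might look as follows (a sketch using plain `Rmpi`
calls rather than the `snow` wrapper from the example above):

```R
library(Rmpi)

# The master is started with `mpirun -np 1`; it spawns the workers itself
mpi.spawn.Rslaves(nslaves = mpi.universe.size() - 1)

# Each worker reports its rank
mpi.remote.exec(paste("I am rank", mpi.comm.rank(), "of", mpi.comm.size()))

mpi.close.Rslaves()
mpi.quit()
```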
To use `Rmpi` and MPI, please use one of these partitions: `haswell`, `broadwell` or `rome`.
Use the `mpirun` command to start the R script. It is a wrapper that enables the communication
between processes running on different nodes. It is important to use `-np 1` (the number of spawned
processes).
......
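Inside the batch script, a typical invocation therefore looks like this (the script path is a
placeholder):

```Bash
mpirun -np 1 Rscript /path/to/script/your_script.R
```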