# R for Data Analytics

[R](https://www.r-project.org/about.html) is a programming language and environment for statistical
computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling,
classical statistical tests, time-series analysis, classification, etc) and graphical techniques. R
is an integrated suite of software facilities for data manipulation, calculation and
graphing.

R possesses an extensive catalogue of statistical and graphical methods.  It includes machine
learning algorithms, linear regression, time series, statistical inference.

We recommend using **Haswell** and/or **Romeo** partitions to work with R. For more details
see [here](../jobs_and_resources/hardware_taurus.md).

## R Console

This is a quickstart example. The `srun` command is used to submit a real-time execution job
designed for interactive use with monitoring the output. Please check
[the Slurm page](../jobs_and_resources/slurm.md) for details.

```Bash
# job submission on haswell nodes with allocating: 1 task, 1 node, 4 CPUs per task with 2541 mb per CPU(core) for 1 hour
tauruslogin$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash

# Ensure that you are using the scs5 environment
module load modenv/scs5
# Check all available modules for R with version 3.6
module available R/3.6
# Load default R module
module load R
# Checking the current R version
which R
# Start R console
R
```

Using `srun` is recommended only for short test runs, while for larger runs batch jobs should be
used. The examples can be found [here](get_started_with_hpcda.md) or
[here](../jobs_and_resources/slurm.md).

It is also possible to run `Rscript` command directly (after loading the module):

```Bash
# Run Rscript directly. For instance: Rscript /scratch/ws/0/marie-study_project/my_r_script.R
Rscript /path/to/script/your_script.R param1 param2
```

## R in JupyterHub

In addition to using interactive and batch jobs, it is possible to work with **R** using
[JupyterHub](../access/jupyterhub.md).

The production and test [environments](../access/jupyterhub.md#standard-environments) of
JupyterHub contain R kernel. It can be started either in the notebook or in the console.

## RStudio

For using R with RStudio please refer to [Data Analytics with RStudio](data_analytics_with_rstudio.md).

## Install Packages in R

By default, user-installed packages are saved in the users home in a subfolder depending on
the architecture (x86 or PowerPC). Therefore the packages should be installed using interactive
jobs on the compute node:

```Bash
srun -p haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash

module purge
module load modenv/scs5
module load R
R -e 'install.packages("package_name")'  #For instance: 'install.packages("ggplot2")'
```

## Deep Learning with R

The deep learning frameworks perform extremely fast when run on accelerators such as GPU.
Therefore, using nodes with built-in GPUs ([ml](../jobs_and_resources/power9.md) or
[alpha](../jobs_and_resources/alpha_centauri.md) partitions) is beneficial for the examples here.

### R Interface to TensorFlow

The ["TensorFlow" R package](https://tensorflow.rstudio.com/) provides R users access to the
Tensorflow toolset. [TensorFlow](https://www.tensorflow.org/) is an open-source software library
for numerical computation using data flow graphs.

```Bash
srun --partition=ml --ntasks=1 --nodes=1 --cpus-per-task=7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash

module purge
ml modenv/ml
ml TensorFlow
ml R

which python
mkdir python-virtual-environments  # Create a folder for virtual environments
cd python-virtual-environments
python3 -m venv --system-site-packages R-TensorFlow        #create python virtual environment
source R-TensorFlow/bin/activate                           #activate environment
module list
which R
```

Please allocate the job with respect to
[hardware specification](../jobs_and_resources/hardware_taurus.md)! Note that the nodes on `ml`
partition have 4way-SMT, so for every physical core allocated, you will always get 4\*1443Mb=5772mb.

In order to interact with Python-based frameworks (like TensorFlow) `reticulate` R library is used.
To configure it to point to the correct Python executable in your virtual environment, create
a file named `.Rprofile` in your project directory (e.g. R-TensorFlow) with the following
contents:

```R
Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Anaconda3/2019.03/bin/python")    #assign the output of the 'which python' from above to RETICULATE_PYTHON
```

Let's start R, install some libraries and evaluate the result:

```R
install.packages("reticulate")
library(reticulate)
reticulate::py_config()
install.packages("tensorflow")
library(tensorflow)
tf$constant("Hello Tensorflow")         #In the output 'Tesla V100-SXM2-32GB' should be mentioned
```

??? example
    The example shows the use of the TensorFlow package with the R for the classification problem
    related to the MNIST dataset.
    ```R
    library(tensorflow)
    library(keras)

    # Data preparation
    batch_size <- 128
    num_classes <- 10
    epochs <- 12

    # Input image dimensions
    img_rows <- 28
    img_cols <- 28

    # Shuffled and split the data between train and test sets
    mnist <- dataset_mnist()
    x_train <- mnist$train$x
    y_train <- mnist$train$y
    x_test <- mnist$test$x
    y_test <- mnist$test$y

    # Redefine dimension of train/test inputs
    x_train <- array_reshape(x_train, c(nrow(x_train), img_rows, img_cols, 1))
    x_test <- array_reshape(x_test, c(nrow(x_test), img_rows, img_cols, 1))
    input_shape <- c(img_rows, img_cols, 1)

    # Transform RGB values into [0,1] range
    x_train <- x_train / 255
    x_test <- x_test / 255

    cat('x_train_shape:', dim(x_train), '\n')
    cat(nrow(x_train), 'train samples\n')
    cat(nrow(x_test), 'test samples\n')

    # Convert class vectors to binary class matrices
    y_train <- to_categorical(y_train, num_classes)
    y_test <- to_categorical(y_test, num_classes)

    # Define Model
    model <- keras_model_sequential() %>%
      layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                    input_shape = input_shape) %>%
      layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>%
      layer_max_pooling_2d(pool_size = c(2, 2)) %>%
      layer_dropout(rate = 0.25) %>%
      layer_flatten() %>%
      layer_dense(units = 128, activation = 'relu') %>%
      layer_dropout(rate = 0.5) %>%
      layer_dense(units = num_classes, activation = 'softmax')

    # Compile model
    model %>% compile(
      loss = loss_categorical_crossentropy,
      optimizer = optimizer_adadelta(),
      metrics = c('accuracy')
    )

    # Train model
    model %>% fit(
      x_train, y_train,
      batch_size = batch_size,
      epochs = epochs,
      validation_split = 0.2
    )
    scores <- model %>% evaluate(
      x_test, y_test, verbose = 0
    )

    # Output metrics
    cat('Test loss:', scores[[1]], '\n')
    cat('Test accuracy:', scores[[2]], '\n')
    ```

## Parallel Computing with R

Generally, the R code is serial. However, many computations in R can be made faster by the use of
parallel computations. Taurus allows a vast number of options for parallel computations. Large
amounts of data and/or use of complex models are indications to use parallelization.

### General Information about the R Parallelism

There are various techniques and packages in R that allow parallelization. This section
concentrates on most general methods and examples. The Information here is Taurus-specific.
The [parallel](https://www.rdocumentation.org/packages/parallel/versions/3.6.2) library
will be used below.

**Warning:** Please do not install or update R packages related to parallelism as it could lead to
conflicts with other pre-installed packages.

### Basic Lapply-Based Parallelism

`lapply()` function is a part of base R. lapply is useful for performing operations on list-objects.
Roughly speaking, lapply is a vectorization of the source code and it is the first step before
explicit parallelization of the code.

### Shared-Memory Parallelism

The `parallel` library includes the `mclapply()` function which is a shared memory version of
lapply. The "mc" stands for "multicore". This function distributes the `lapply` tasks across
multiple CPU cores to be executed in parallel.

This is a simple option for parallelization. It doesn't require much effort to rewrite the serial
code to use `mclapply` function. Check out an example below.

??? example
    ```R
    library(parallel)

    # here some function that needs to be executed in parallel
    average <- function(size){
      norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
      return(mean(norm_vector))
    }

    # variable initialization
    mu <- 1.0
    sigma <- 1.0
    vector_length <- 10^7
    n_repeat <- 100
    sample_sizes <- rep(vector_length, times=n_repeat)


    # shared-memory version
    threads <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE"))
    # here the name of the variable depends on the correct sbatch configuration
    # unfortunately the built-in function gets the total number of physical cores without
    # taking into account allocated cores by Slurm

    list_of_averages <- mclapply(X=sample_sizes, FUN=average, mc.cores=threads)  # apply function "average" 100 times
    ```

The disadvantages of using shared-memory parallelism approach are, that the number of parallel
tasks is limited to the number of cores on a single node. The maximum number of cores on a single
node can be found [here](../jobs_and_resources/hardware_taurus.md).

Submitting a multicore R job to Slurm is very similar to submitting an
[OpenMP Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks),
since both are running multicore jobs on a **single** node. Below is an example:

```Bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00
#SBATCH -o test_Rmpi.out
#SBATCH -e test_Rmpi.err

module purge
module load modenv/scs5
module load R

R CMD BATCH Rcode.R
```

### Distributed-Memory Parallelism

In order to go beyond the limitation of the number of cores on a single node, a cluster of workers
shall be set up. There are three options for it: MPI, PSOCK and FORK clusters.
We use `makeCluster` function from `parallel` library to create a set of copies of R processes
running in parallel. The desired type of the cluster can be specified with a parameter `TYPE`.

#### MPI Cluster

This way of the R parallelism uses the
[Rmpi](http://cran.r-project.org/web/packages/Rmpi/index.html) package and the
[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (Message Passing Interface) as a
"backend" for its parallel operations. The MPI-based job in R is very similar to submitting an
[MPI Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks) since both are running
multicore jobs on multiple nodes. Below is an example of running R script with the Rmpi on Taurus:

```Bash
#!/bin/bash
#SBATCH --partition=haswell      # specify the partition
#SBATCH --ntasks=32              # this parameter determines how many processes will be spawned, please use >=8
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH -o test_Rmpi.out
#SBATCH -e test_Rmpi.err

module purge
module load modenv/scs5
module load R

mpirun -np 1 R CMD BATCH Rmpi.R   # specify the absolute path to the R script, like: /scratch/ws/marie-Work/R/Rmpi.R

# submit with sbatch <script_name>
```

Slurm option `--ntasks` controls the total number of parallel tasks. The number of
nodes required to complete this number of tasks will be automatically selected.
However, in some specific cases, you can specify the number of nodes and the number of necessary
tasks per node explicitly:

```Bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=1
module purge
module load modenv/scs5
module load R

mpirun -np 1 R CMD BATCH --no-save --no-restore Rmpi_c.R
```

Use an example below, where 32 global ranks are distributed over 2 nodes with 16 cores each.
Each MPI rank has 1 core assigned to it.

??? example
    ```R
    library(Rmpi)

    # initialize an Rmpi environment
    ns <- mpi.universe.size()-1
    mpi.spawn.Rslaves(nslaves=ns)

    # send these commands to the slaves
    mpi.bcast.cmd( id <- mpi.comm.rank() )
    mpi.bcast.cmd( ns <- mpi.comm.size() )
    mpi.bcast.cmd( host <- mpi.get.processor.name() )

    # all slaves execute this command
    mpi.remote.exec(paste("I am", id, "of", ns, "running on", host))

    # close down the Rmpi environment
    mpi.close.Rslaves(dellog = FALSE)
    mpi.exit()
    ```

Another example:

??? example
    ```R
    library(Rmpi)
    library(parallel)

    # here some function that needs to be executed in parallel
    average <- function(size){
      norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
      return(mean(norm_vector))
    }

    # variable initialization
    mu <- 1.0
    sigma <- 1.0
    vector_length <- 10^7
    n_repeat <- 100
    sample_sizes <- rep(vector_length, times=n_repeat)

    # cluster setup
    # get number of available MPI ranks
    threads = mpi.universe.size()-1
    print(paste("The cluster of size", threads, "will be setup..."))

    # initialize MPI cluster
    cl <- makeCluster(threads, type="MPI", outfile="")

    # distribute required variables for the execution over the cluster
    clusterExport(cl, list("mu","sigma"))

    list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)

    # shut down the cluster
    #snow::stopCluster(cl)  # usually it hangs over here with OpenMPI > 2.0. In this case this command may be avoided, Slurm will clean up after the job finishes
    ```

To use Rmpi and MPI please use one of these partitions: **haswell**, **broadwell** or **rome**.

Use `mpirun` command to start the R script. It is a wrapper that enables the communication
between processes running on different nodes. It is important to use `-np 1` (the number of spawned
processes by `mpirun`), since the R takes care of it with `makeCluster` function.

#### PSOCK cluster

The `type="PSOCK"` uses TCP sockets to transfer data between nodes. PSOCK is the default on *all*
systems. The advantage of this method is that it does not require external libraries such as MPI.
On the other hand, TCP sockets are relatively
[slow](http://glennklockwood.blogspot.com/2013/06/whats-killing-cloud-interconnect.html). Creating
a PSOCK cluster is similar to launching an MPI cluster, but instead of specifying the number of
parallel workers, you have to manually specify the number of nodes according to the
hardware specification and parameters of your job.

??? example
    ```R
    library(parallel)

    # a function that needs to be executed in parallel
    average <- function(size){
      norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
      return(mean(norm_vector))
    }

    # variable initialization
    mu <- 1.0
    sigma <- 1.0
    vector_length <- 10^7
    n_repeat <- 100
    sample_sizes <- rep(vector_length, times=n_repeat)

    # cluster setup

    # get number of available nodes (should be equal to "ntasks")
    mynodes = 8
    print(paste("The cluster of size", threads, "will be setup..."))

    # initialize cluster
    cl <- makeCluster(mynodes, type="PSOCK", outfile="")

    # distribute required variables for the execution over the cluster
    clusterExport(cl, list("mu","sigma"))

    list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)

    # shut down the cluster
    print(paste("Program finished"))
    ```

#### FORK cluster

The `type="FORK"` method behaves exactly like the `mclapply` function discussed in the previous
section. Like `mclapply`, it can only use the cores available on a single node. However this method
requires exporting the workspace data to other processes. The FORK method in a combination with
`parLapply` function might be used in situations, where different source code should run on each
parallel process.

### Other parallel options

- [foreach](https://cran.r-project.org/web/packages/foreach/index.html) library.
  It is functionally equivalent to the
  [lapply-based parallelism](https://www.glennklockwood.com/data-intensive/r/lapply-parallelism.html)
  discussed before but based on the for-loop
- [future](https://cran.r-project.org/web/packages/future/index.html)
  library. The purpose of this package is to provide a lightweight and
  unified Future API for sequential and parallel processing of R
  expression via futures
- [Poor-man's parallelism](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-1-poor-man-s-parallelism)
  (simple data parallelism). It is the simplest, but not an elegant way to parallelize R code.
  It runs several copies of the same R script where's each read different sectors of the input data
- [Hands-off (OpenMP)](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-2-hands-off-parallelism)
  method. R has [OpenMP](https://www.openmp.org/resources/) support. Thus using OpenMP is a simple
  method where you don't need to know much about the parallelism options in your code. Please be
  careful and don't mix this technique with other methods!