Skip to content
Snippets Groups Projects

R for Data Analytics

R is a programming language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, etc) and graphical techniques. R is an integrated suite of software facilities for data manipulation, calculation and graphing.

R possesses an extensive catalogue of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series, statistical inference.

We recommend using Haswell and/or Romeo partitions to work with R. For more details see here.

R Console

This is a quickstart example. The srun command is used to submit a real-time execution job designed for interactive use with monitoring the output. Please check the Slurm page for details.

# job submission on haswell nodes with allocating: 1 task, 1 node, 4 CPUs per task with 2541 mb per CPU(core) for 1 hour
tauruslogin$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash

# Ensure that you are using the scs5 environment
module load modenv/scs5
# Check all available modules for R with version 3.6
module available R/3.6
# Load default R module
module load R
# Checking the current R version
which R
# Start R console
R

Using srun is recommended only for short test runs, while for larger runs batch jobs should be used. The examples can be found here or here.

It is also possible to run Rscript command directly (after loading the module):

# Run Rscript directly. For instance: Rscript /scratch/ws/0/marie-study_project/my_r_script.R
Rscript /path/to/script/your_script.R param1 param2

R in JupyterHub

In addition to using interactive and batch jobs, it is possible to work with R using JupyterHub.

The production and test environments of JupyterHub contain R kernel. It can be started either in the notebook or in the console.

RStudio

For using R with RStudio please refer to Data Analytics with RStudio.

Install Packages in R

By default, user-installed packages are saved in the users home in a subfolder depending on the architecture (x86 or PowerPC). Therefore the packages should be installed using interactive jobs on the compute node:

srun -p haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash

module purge
module load modenv/scs5
module load R
R -e 'install.packages("package_name")'  #For instance: 'install.packages("ggplot2")'

Deep Learning with R

The deep learning frameworks perform extremely fast when run on accelerators such as GPU. Therefore, using nodes with built-in GPUs (ml or alpha partitions) is beneficial for the examples here.

R Interface to TensorFlow

The "TensorFlow" R package provides R users access to the Tensorflow toolset. TensorFlow is an open-source software library for numerical computation using data flow graphs.

srun --partition=ml --ntasks=1 --nodes=1 --cpus-per-task=7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash

module purge
ml modenv/ml
ml TensorFlow
ml R

which python
mkdir python-virtual-environments  # Create a folder for virtual environments
cd python-virtual-environments
python3 -m venv --system-site-packages R-TensorFlow        #create python virtual environment
source R-TensorFlow/bin/activate                           #activate environment
module list
which R

Please allocate the job with respect to hardware specification! Note that the nodes on ml partition have 4way-SMT, so for every physical core allocated, you will always get 4*1443Mb=5772mb.

In order to interact with Python-based frameworks (like TensorFlow) reticulate R library is used. To configure it to point to the correct Python executable in your virtual environment, create a file named .Rprofile in your project directory (e.g. R-TensorFlow) with the following contents:

Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Anaconda3/2019.03/bin/python")    #assign the output of the 'which python' from above to RETICULATE_PYTHON

Let's start R, install some libraries and evaluate the result:

install.packages("reticulate")
library(reticulate)
reticulate::py_config()
install.packages("tensorflow")
library(tensorflow)
tf$constant("Hello Tensorflow")         #In the output 'Tesla V100-SXM2-32GB' should be mentioned

??? example The example shows the use of the TensorFlow package with the R for the classification problem related to the MNIST dataset. ```R library(tensorflow) library(keras)

# Data preparation
batch_size <- 128
num_classes <- 10
epochs <- 12

# Input image dimensions
img_rows <- 28
img_cols <- 28

# Shuffled and split the data between train and test sets
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y

# Redefine dimension of train/test inputs
x_train <- array_reshape(x_train, c(nrow(x_train), img_rows, img_cols, 1))
x_test <- array_reshape(x_test, c(nrow(x_test), img_rows, img_cols, 1))
input_shape <- c(img_rows, img_cols, 1)

# Transform RGB values into [0,1] range
x_train <- x_train / 255
x_test <- x_test / 255

cat('x_train_shape:', dim(x_train), '\n')
cat(nrow(x_train), 'train samples\n')
cat(nrow(x_test), 'test samples\n')

# Convert class vectors to binary class matrices
y_train <- to_categorical(y_train, num_classes)
y_test <- to_categorical(y_test, num_classes)

# Define Model
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                input_shape = input_shape) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = num_classes, activation = 'softmax')

# Compile model
model %>% compile(
  loss = loss_categorical_crossentropy,
  optimizer = optimizer_adadelta(),
  metrics = c('accuracy')
)

# Train model
model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = 0.2
)
scores <- model %>% evaluate(
  x_test, y_test, verbose = 0
)

# Output metrics
cat('Test loss:', scores[[1]], '\n')
cat('Test accuracy:', scores[[2]], '\n')
```

Parallel Computing with R

Generally, the R code is serial. However, many computations in R can be made faster by the use of parallel computations. Taurus allows a vast number of options for parallel computations. Large amounts of data and/or use of complex models are indications to use parallelization.

General Information about the R Parallelism

There are various techniques and packages in R that allow parallelization. This section concentrates on most general methods and examples. The Information here is Taurus-specific. The parallel library will be used below.

Warning: Please do not install or update R packages related to parallelism as it could lead to conflicts with other pre-installed packages.

Basic Lapply-Based Parallelism

lapply() function is a part of base R. lapply is useful for performing operations on list-objects. Roughly speaking, lapply is a vectorization of the source code and it is the first step before explicit parallelization of the code.

Shared-Memory Parallelism

The parallel library includes the mclapply() function which is a shared memory version of lapply. The "mc" stands for "multicore". This function distributes the lapply tasks across multiple CPU cores to be executed in parallel.

This is a simple option for parallelization. It doesn't require much effort to rewrite the serial code to use mclapply function. Check out an example below.

??? example ```R library(parallel)

# here some function that needs to be executed in parallel
average <- function(size){
  norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
  return(mean(norm_vector))
}

# variable initialization
mu <- 1.0
sigma <- 1.0
vector_length <- 10^7
n_repeat <- 100
sample_sizes <- rep(vector_length, times=n_repeat)


# shared-memory version
threads <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE"))
# here the name of the variable depends on the correct sbatch configuration
# unfortunately the built-in function gets the total number of physical cores without
# taking into account allocated cores by Slurm

list_of_averages <- mclapply(X=sample_sizes, FUN=average, mc.cores=threads)  # apply function "average" 100 times
```

The disadvantages of using shared-memory parallelism approach are, that the number of parallel tasks is limited to the number of cores on a single node. The maximum number of cores on a single node can be found here.

Submitting a multicore R job to Slurm is very similar to submitting an OpenMP Job, since both are running multicore jobs on a single node. Below is an example:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00
#SBATCH -o test_Rmpi.out
#SBATCH -e test_Rmpi.err

module purge
module load modenv/scs5
module load R

R CMD BATCH Rcode.R

Distributed-Memory Parallelism

In order to go beyond the limitation of the number of cores on a single node, a cluster of workers shall be set up. There are three options for it: MPI, PSOCK and FORK clusters. We use makeCluster function from parallel library to create a set of copies of R processes running in parallel. The desired type of the cluster can be specified with a parameter TYPE.

MPI Cluster

This way of the R parallelism uses the Rmpi package and the MPI (Message Passing Interface) as a "backend" for its parallel operations. The MPI-based job in R is very similar to submitting an MPI Job since both are running multicore jobs on multiple nodes. Below is an example of running R script with the Rmpi on Taurus:

#!/bin/bash
#SBATCH --partition=haswell      # specify the partition
#SBATCH --ntasks=32              # this parameter determines how many processes will be spawned, please use >=8
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH -o test_Rmpi.out
#SBATCH -e test_Rmpi.err

module purge
module load modenv/scs5
module load R

mpirun -np 1 R CMD BATCH Rmpi.R   # specify the absolute path to the R script, like: /scratch/ws/marie-Work/R/Rmpi.R

# submit with sbatch <script_name>

Slurm option --ntasks controls the total number of parallel tasks. The number of nodes required to complete this number of tasks will be automatically selected. However, in some specific cases, you can specify the number of nodes and the number of necessary tasks per node explicitly:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=1
module purge
module load modenv/scs5
module load R

mpirun -np 1 R CMD BATCH --no-save --no-restore Rmpi_c.R

Use an example below, where 32 global ranks are distributed over 2 nodes with 16 cores each. Each MPI rank has 1 core assigned to it.

??? example ```R library(Rmpi)

# initialize an Rmpi environment
ns <- mpi.universe.size()-1
mpi.spawn.Rslaves(nslaves=ns)

# send these commands to the slaves
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( ns <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )

# all slaves execute this command
mpi.remote.exec(paste("I am", id, "of", ns, "running on", host))

# close down the Rmpi environment
mpi.close.Rslaves(dellog = FALSE)
mpi.exit()
```

Another example:

??? example ```R library(Rmpi) library(parallel)

# here some function that needs to be executed in parallel
average <- function(size){
  norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
  return(mean(norm_vector))
}

# variable initialization
mu <- 1.0
sigma <- 1.0
vector_length <- 10^7
n_repeat <- 100
sample_sizes <- rep(vector_length, times=n_repeat)

# cluster setup
# get number of available MPI ranks
threads = mpi.universe.size()-1
print(paste("The cluster of size", threads, "will be setup..."))

# initialize MPI cluster
cl <- makeCluster(threads, type="MPI", outfile="")

# distribute required variables for the execution over the cluster
clusterExport(cl, list("mu","sigma"))

list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)

# shut down the cluster
#snow::stopCluster(cl)  # usually it hangs over here with OpenMPI > 2.0. In this case this command may be avoided, Slurm will clean up after the job finishes
```

To use Rmpi and MPI please use one of these partitions: haswell, broadwell or rome.

Use mpirun command to start the R script. It is a wrapper that enables the communication between processes running on different nodes. It is important to use -np 1 (the number of spawned processes by mpirun), since the R takes care of it with makeCluster function.

PSOCK cluster

The type="PSOCK" uses TCP sockets to transfer data between nodes. PSOCK is the default on all systems. The advantage of this method is that it does not require external libraries such as MPI. On the other hand, TCP sockets are relatively slow. Creating a PSOCK cluster is similar to launching an MPI cluster, but instead of specifying the number of parallel workers, you have to manually specify the number of nodes according to the hardware specification and parameters of your job.

??? example ```R library(parallel)

# a function that needs to be executed in parallel
average <- function(size){
  norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
  return(mean(norm_vector))
}

# variable initialization
mu <- 1.0
sigma <- 1.0
vector_length <- 10^7
n_repeat <- 100
sample_sizes <- rep(vector_length, times=n_repeat)

# cluster setup

# get number of available nodes (should be equal to "ntasks")
mynodes = 8
print(paste("The cluster of size", threads, "will be setup..."))

# initialize cluster
cl <- makeCluster(mynodes, type="PSOCK", outfile="")

# distribute required variables for the execution over the cluster
clusterExport(cl, list("mu","sigma"))

list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)

# shut down the cluster
print(paste("Program finished"))
```

FORK cluster

The type="FORK" method behaves exactly like the mclapply function discussed in the previous section. Like mclapply, it can only use the cores available on a single node. However this method requires exporting the workspace data to other processes. The FORK method in a combination with parLapply function might be used in situations, where different source code should run on each parallel process.

Other parallel options

  • foreach library. It is functionally equivalent to the lapply-based parallelism discussed before but based on the for-loop
  • future library. The purpose of this package is to provide a lightweight and unified Future API for sequential and parallel processing of R expression via futures
  • Poor-man's parallelism (simple data parallelism). It is the simplest, but not an elegant way to parallelize R code. It runs several copies of the same R script where's each read different sectors of the input data
  • Hands-off (OpenMP) method. R has OpenMP support. Thus using OpenMP is a simple method where you don't need to know much about the parallelism options in your code. Please be careful and don't mix this technique with other methods!