Commit 9f501605 authored by Taras Lazariv's avatar Taras Lazariv

Ready: add image and examples to data_analytics_with_r.md

parent 11efb8f5
@@ -62,14 +62,8 @@ plotting, history, debugging and workspace management. RStudio is also available
The easiest option is to run RStudio in JupyterHub directly in the browser. It can be started
similarly to a new kernel from the [JupyterLab](../access/jupyterhub.md#jupyterlab) launcher.
![RStudio launcher in JupyterHub](misc/data_analytics_with_r_RStudio_launcher.png)
{: align="center"}
Please keep in mind that it is currently not recommended to use an interactive X11 job with the
desktop version of RStudio, as described, for example, in the introductory HPC-DA slides.
@@ -139,16 +133,84 @@ library(reticulate)
reticulate::py_config()
install.packages("tensorflow")
library(tensorflow)
tf$constant("Hello Tensorflow") # in the output 'Tesla V100-SXM2-32GB' should be mentioned
```
??? example
The example shows the use of the TensorFlow package with R for a classification problem
on the MNIST dataset.
```R
library(tensorflow)
library(keras)
# Data preparation
batch_size <- 128
num_classes <- 10
epochs <- 12
# Input image dimensions
img_rows <- 28
img_cols <- 28
# The data, shuffled and split between train and test sets
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
# Redefine dimension of train/test inputs
x_train <- array_reshape(x_train, c(nrow(x_train), img_rows, img_cols, 1))
x_test <- array_reshape(x_test, c(nrow(x_test), img_rows, img_cols, 1))
input_shape <- c(img_rows, img_cols, 1)
# Scale pixel values into the [0,1] range
x_train <- x_train / 255
x_test <- x_test / 255
cat('x_train_shape:', dim(x_train), '\n')
cat(nrow(x_train), 'train samples\n')
cat(nrow(x_test), 'test samples\n')
# Convert class vectors to binary class matrices
y_train <- to_categorical(y_train, num_classes)
y_test <- to_categorical(y_test, num_classes)
# Define Model
model <- keras_model_sequential() %>%
layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
input_shape = input_shape) %>%
layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>%
layer_max_pooling_2d(pool_size = c(2, 2)) %>%
layer_dropout(rate = 0.25) %>%
layer_flatten() %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = num_classes, activation = 'softmax')
# Compile model
model %>% compile(
loss = loss_categorical_crossentropy,
optimizer = optimizer_adadelta(),
metrics = c('accuracy')
)
# Train model
model %>% fit(
x_train, y_train,
batch_size = batch_size,
epochs = epochs,
validation_split = 0.2
)
scores <- model %>% evaluate(
x_test, y_test, verbose = 0
)
# Output metrics
cat('Test loss:', scores[[1]], '\n')
cat('Test accuracy:', scores[[2]], '\n')
```
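The code above expects a GPU (the `tf$constant` check mentions a Tesla V100). A minimal batch
script for running the script on a GPU node might look like the sketch below; the partition name,
resource values, and the script file name `TensorflowMNIST.R` are placeholders and have to be
adapted to your project.

```bash
#!/bin/bash
#SBATCH --partition=<your_gpu_partition>   # placeholder: a partition that provides GPUs
#SBATCH --gres=gpu:1                       # request one GPU
#SBATCH --cpus-per-task=4                  # placeholder values
#SBATCH --time=01:00:00

module load R                              # assumption: an R module with TensorFlow/Keras support

R CMD BATCH --no-save --no-restore TensorflowMNIST.R
```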
## Parallel Computing with R
Generally, R code is serial. However, many computations in R can be made faster by the use of
@@ -162,7 +224,7 @@ concentrates on most general methods and examples. The Information here is Tauru
The [parallel](https://www.rdocumentation.org/packages/parallel/versions/3.6.2) library
will be used below.
**Warning:** Please do not install or update R packages related to parallelism as it could lead to
conflicts with other pre-installed packages.
### Basic Lapply-Based Parallelism
@@ -178,10 +240,38 @@ lapply. The "mc" stands for "multicore". This function distributes the `lapply`
multiple CPU cores to be executed in parallel.
This is a simple option for parallelization. It doesn't require much effort to rewrite the serial
code to use the `mclapply` function. Check out an example below.
??? example
```R
library(parallel)
# define a function to be executed in parallel
average <- function(size){
norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
return(mean(norm_vector))
}
# variable initialization
mu <- 1.0
sigma <- 1.0
vector_length <- 10^7
n_repeat <- 100
sample_sizes <- rep(vector_length, times=n_repeat)
# shared-memory version
threads <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE"))
# the name of the environment variable depends on a correct sbatch configuration;
# unfortunately, the built-in core detection returns the total number of physical cores
# without taking the cores allocated by Slurm into account
list_of_averages <- mclapply(X=sample_sizes, FUN=average, mc.cores=threads) # apply function "average" 100 times
```
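For orientation, a batch script that provides the cores read from `SLURM_CPUS_ON_NODE` above could
look like the following sketch; the core count, time limit, and file name `multicore.R` are
placeholders.

```bash
#!/bin/bash
#SBATCH --nodes=1                 # shared-memory parallelism stays on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16        # placeholder: cores available to mclapply
#SBATCH --time=00:30:00

module load R

R CMD BATCH --no-save --no-restore multicore.R
```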
The disadvantage of the shared-memory parallelism approach is that the number of parallel
tasks is limited to the number of cores on a single node. The maximum number of cores on a single
node can be found [here](../jobs_and_resources/hardware_taurus.md).
Submitting a multicore R job to Slurm is very similar to submitting an
[OpenMP Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks),
@@ -254,9 +344,66 @@ module load R
mpirun -np 1 R CMD BATCH --no-save --no-restore Rmpi_c.R
```
Check out the example below, where 32 global ranks are distributed over 2 nodes with 16 cores
each. Each MPI rank has one core assigned to it.
??? example
```R
library(Rmpi)
# initialize an Rmpi environment
ns <- mpi.universe.size()-1
mpi.spawn.Rslaves(nslaves=ns)
# send these commands to the slaves
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( ns <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
# all slaves execute this command
mpi.remote.exec(paste("I am", id, "of", ns, "running on", host))
# close down the Rmpi environment
mpi.close.Rslaves(dellog = FALSE)
mpi.exit()
```
Another example, using `makeCluster` from the `parallel` library to set up an MPI cluster:
??? example
```R
library(Rmpi)
library(parallel)
# define a function to be executed in parallel
average <- function(size){
norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
return(mean(norm_vector))
}
# variable initialization
mu <- 1.0
sigma <- 1.0
vector_length <- 10^7
n_repeat <- 100
sample_sizes <- rep(vector_length, times=n_repeat)
# cluster setup
# get number of available MPI ranks
threads <- mpi.universe.size() - 1
print(paste("The cluster of size", threads, "will be setup..."))
# initialize MPI cluster
cl <- makeCluster(threads, type="MPI", outfile="")
# distribute required variables for the execution over the cluster
clusterExport(cl, list("mu","sigma"))
list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)
# shut down the cluster
# snow::stopCluster(cl) # usually hangs here with OpenMPI > 2.0; in this case the command
# may be omitted, Slurm will clean up after the job finishes
```
To use Rmpi and MPI please use one of these partitions: **haswell**, **broadwell** or **rome**.
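A batch script for such an Rmpi job could look roughly like the sketch below; the partition,
node/rank counts, and time limit are placeholders, and the `mpirun -np 1` launch matches the
example further above, since Rmpi spawns its workers itself.

```bash
#!/bin/bash
#SBATCH --partition=haswell       # placeholder: haswell, broadwell or rome
#SBATCH --nodes=2                 # 2 nodes with 16 cores each
#SBATCH --ntasks=32               # 32 global MPI ranks, one core per rank
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00

module load R

mpirun -np 1 R CMD BATCH --no-save --no-restore Rmpi.R
```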
@@ -272,8 +419,42 @@ On the other hand, TCP sockets are relatively
[slow](http://glennklockwood.blogspot.com/2013/06/whats-killing-cloud-interconnect.html). Creating
a PSOCK cluster is similar to launching an MPI cluster, but instead of specifying the number of
parallel workers, you have to manually specify the number of nodes according to the
hardware specification and parameters of your job. Check out the example below.
??? example
```R
library(parallel)
# a function that needs to be executed in parallel
average <- function(size){
norm_vector <- rnorm(n=size, mean=mu, sd=sigma)
return(mean(norm_vector))
}
# variable initialization
mu <- 1.0
sigma <- 1.0
vector_length <- 10^7
n_repeat <- 100
sample_sizes <- rep(vector_length, times=n_repeat)
# cluster setup
# get number of available nodes (should be equal to "ntasks")
mynodes <- 8
print(paste("The cluster of size", mynodes, "will be setup..."))
# initialize cluster
cl <- makeCluster(mynodes, type="PSOCK", outfile="")
# distribute required variables for the execution over the cluster
clusterExport(cl, list("mu","sigma"))
list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)
# shut down the cluster
stopCluster(cl)
print("Program finished")
```
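As a sketch, the matching job request could reserve one task per node so that `mynodes` in the
script above equals the number of allocated nodes; all values and the file name `RPSOCK.R` are
placeholders. Note that when `makeCluster` is given a plain worker count, the `parallel` package
starts all PSOCK workers on the local host; to distribute them across nodes, a vector of host
names can be passed instead.

```bash
#!/bin/bash
#SBATCH --nodes=8                 # placeholder: should match "mynodes" in the R script above
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00

module load R

R CMD BATCH --no-save --no-restore RPSOCK.R
```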
#### FORK cluster
@@ -285,8 +466,6 @@ parallel process.
### Other parallel options
There exist other methods of parallelization for R:
- [foreach](https://cran.r-project.org/web/packages/foreach/index.html) library.
It is functionally equivalent to the
[lapply-based parallelism](https://www.glennklockwood.com/data-intensive/r/lapply-parallelism.html)
@@ -302,17 +481,3 @@ There exist other methods of parallelization for R:
method. R has [OpenMP](https://www.openmp.org/resources/) support. Thus using OpenMP is a simple
method where you don't need to know much about the parallelism options in your code. Please be
careful and don't mix this technique with other methods!
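As a minimal sketch of the implicit approach, the number of threads used by R's OpenMP-enabled
routines can be bounded via the standard OpenMP environment variable before R is started; the
script name `my_script.R` is a placeholder.

```bash
# limit the implicit OpenMP parallelism to the cores allocated by Slurm
export OMP_NUM_THREADS=${SLURM_CPUS_ON_NODE}
R CMD BATCH --no-save --no-restore my_script.R
```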