diff --git a/doc.zih.tu-dresden.de/docs/DataAnalyticsWithR.md b/doc.zih.tu-dresden.de/docs/DataAnalyticsWithR.md
new file mode 100644
index 0000000000000000000000000000000000000000..0acc0ba2fb7bd674488521860ecddc7ff8dd6684
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/DataAnalyticsWithR.md
@@ -0,0 +1,423 @@
# R for data analytics

[R](https://www.r-project.org/about.html) is a programming language and environment for
statistical computing and graphics. R provides a wide variety of statistical techniques (linear
and nonlinear modelling, classical statistical tests, time-series analysis, classification, etc.)
and graphical techniques. It is an integrated suite of software facilities for data manipulation,
calculation and graphing.

R possesses an extensive catalogue of statistical and graphical methods, including machine
learning algorithms, linear regression, time series and statistical inference.

**Aim** of this page is to introduce users to working with the R language on Taurus in general as
well as on the HPC-DA system.

**Prerequisites:** To work with R on Taurus you need access to the Taurus system and basic
knowledge about programming and the [SLURM](Slurm) batch system.

For general information on using the HPC-DA system, see the
[Get started with HPC-DA system](GetStartedWithHPCDA) page. You can also find the information you
need in the HPC-Introduction and HPC-DA-Introduction presentation slides.

We recommend using the **Haswell** and/or **[Romeo](RomeNodes)** partitions to work with R.
Please use the ml partition only if you need GPUs!

## R console

This is a quickstart example. The `srun` command is used to submit a real-time execution job
designed for interactive use with output monitoring. Please check the [Slurm page](Slurm) for
details. The R language is available for both types of Taurus nodes/architectures: x86 (scs5
software environment) and Power9 (ml software environment).

Haswell partition:

    srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2583 --time=01:00:00 --pty bash   #job submission on haswell nodes: 1 task per node, 1 node, 4 CPUs per task, 2583 MB per CPU (core), for 1 hour

    module load modenv/scs5   #Ensure that you are using the scs5 software environment. Example output: The following have been reloaded with a version change: 1) modenv/ml => modenv/scs5
    module avail R/3.6        #Check all available modules with R version 3.6. You could also use "ml av R", but it gives a huge output.
    module load R             #Load the default R module. Example output: Module R/3.6.0-foss-2019a and 56 dependencies loaded.
    which R                   #Check the current version of R
    R                         #Start the R console

Here are the parameters of the job with all the details to show you the correct and optimal way
to do it. Please allocate the job with respect to the
[hardware specification](HardwareTaurus)! Besides, it should be noted that the value of the
`--mem-per-cpu` parameter differs between partitions; it is important to respect the
[memory limits](SystemTaurus#Memory_Limits). Please note that the default limit is 300 MB per
CPU.
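Once the R console from the example above is running, a short sanity check helps to verify the
session before starting real work. This is only a minimal sketch; note that `detectCores()`
reports the logical CPUs of the whole node, not the cores allocated to your job:

    # quick sanity check inside the interactive R console
    sessionInfo()                        # R version, platform and loaded base packages
    Sys.getenv("SLURM_CPUS_PER_TASK")    # cores requested with --cpus-per-task (set by Slurm)
    parallel::detectCores()              # logical CPUs of the node (not the Slurm allocation!)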
Note that running `srun` directly on the shell, as above, blocks your shell and launches an
interactive job. Apart from short test runs, it is **recommended to launch your jobs in the
background using batch jobs**. For that, you can conveniently put the parameters directly into a
job file, which can be submitted using `sbatch [options] <job_file>`. Examples can be found
[here](GetStartedWithHPCDA) or [here](Slurm). Furthermore, you can work with simple examples in
your home directory, but according to the [storage concept](HPCStorageConcept2019), **please use
[workspaces](WorkSpaces) for your study and work projects!**

It is also possible to run Rscript directly (after loading the module):

    Rscript /path/to/script/your_script.R param1 param2   #run Rscript directly. For instance: Rscript /scratch/ws/mastermann-study_project/da_script.r

## R with Jupyter notebook

In addition to using interactive srun jobs and batch jobs, there is another way to work with R on
Taurus. JupyterHub is a quick and easy way to work with Jupyter notebooks on Taurus. See the
[JupyterHub page](JupyterHub) for detailed instructions.

The [production environment](JupyterHub#Standard_environments) of JupyterHub contains R as a
module for all partitions. R can be run in the Notebook or Console of
[JupyterLab](JupyterHub#JupyterLab).

## RStudio

[RStudio](https://rstudio.com/) is an integrated development environment (IDE) for R. It includes
a console and a syntax-highlighting editor that supports direct code execution, as well as tools
for plotting, history, debugging and workspace management. RStudio is also available for both
Taurus x86 (scs5) and Power9 (ml) nodes/architectures.

The best option to run RStudio is to use JupyterHub; RStudio will then run in the browser. It is
currently available in the **test** environment on both the x86 (**scs5**) and Power9 (**ml**)
architectures/partitions. It can be started similarly to a new kernel from the
[JupyterLab](JupyterHub#JupyterLab) launcher. See the pictures below.

![environments.png](%ATTACHURL%/environments.png)

![Launcher.png](%ATTACHURL%/Launcher.png)

Please keep in mind that it is currently not recommended to use an interactive X11 job with the
desktop version of RStudio, as described, for example, [here](Slurm#Interactive_X11_47GUI_Jobs) or
in the introductory HPC-DA slides. This method is unstable.

## Install packages in R

By default, user-installed packages are stored in `$HOME/R/`, inside a subfolder that depends on
the architecture (on Taurus: x86 or PowerPC). Install packages using the shell:

    srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash   #job submission on haswell nodes: 1 task per node, 1 node, 4 CPUs per task, 2583 MB per CPU (core), for 1 hour

    module purge
    module load modenv/scs5            #Change the environment. Example output: The following have been reloaded with a version change: 1) modenv/ml => modenv/scs5
    module load R                      #Load the R module. Example output: Module R/3.6.0-foss-2019a and 56 dependencies loaded.
    which R                            #Check the current version of R
    R                                  #Start the R console
    install.packages("package_name")   #For instance: install.packages("ggplot2")

Note that to allocate the job, the Slurm parameters are given here in their short notation, but
with the same values as in the previous example.
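If you prefer to keep user-installed packages out of your home directory, you can point R to a
library inside a workspace. The following is only a minimal sketch with a hypothetical workspace
path; adapt it to your own workspace:

    # inside the R console: show the library search path; packages are installed
    # into the first writable entry by default
    .libPaths()

    # hypothetical example: use a library located in a workspace instead of $HOME/R
    lib_dir <- "/scratch/ws/mastermann-study_project/R_libs"
    dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)
    .libPaths(c(lib_dir, .libPaths()))   # prepend the workspace library
    install.packages("ggplot2")          # installs into the workspace library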
## Deep Learning with R

This chapter briefly describes working with the **ml partition** (Power9 architecture). It
focuses on working with GPUs and explains the main scenarios.

**Important:** Please use the ml partition only if you need GPUs! Otherwise, using the x86
partitions (e.g. Haswell) will most likely be more beneficial.

### R Interface to TensorFlow

The ["tensorflow" R package](https://tensorflow.rstudio.com/) provides R users access to the
TensorFlow toolset. [TensorFlow](https://www.tensorflow.org/) is an open-source software library
for numerical computation using data flow graphs.

    srun -p ml -N 1 -n 1 -c 7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash

    module purge                                          #clear modules
    ml modenv/ml                                          #load the ml environment
    ml TensorFlow
    ml R

    which python
    mkdir python-virtual-environments                     #Create a folder. Please use workspaces!
    cd python-virtual-environments                        #Go to the folder
    python3 -m venv --system-site-packages R-TensorFlow   #create a Python virtual environment
    source R-TensorFlow/bin/activate                      #activate the environment
    module list
    which R

Please allocate the job with respect to the [hardware specification](HardwareTaurus)! Note that
the ml nodes have 4-way SMT, so for every physical core allocated you will always get
4 x 1443 MB = 5772 MB.

To configure the "reticulate" R library to point to the Python executable in your virtual
environment, create a file named `.Rprofile` in your project directory (e.g. R-TensorFlow) with
the following contents:

    Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Anaconda3/2019.03/bin/python")   #assign the output of 'which python' to RETICULATE_PYTHON

Let's start R, install some libraries and evaluate the result:

    R
    install.packages("reticulate")
    library(reticulate)
    reticulate::py_config()
    install.packages("tensorflow")
    library(tensorflow)
    tf$constant("Hello TensorFlow")   #In the output, 'Tesla V100-SXM2-32GB' should be mentioned

Please find an example of the code in the
[attachment](%ATTACHURL%/TensorflowMNIST.R?t=1597837603). The example shows the use of the
tensorflow package with R for a classification problem on the MNIST dataset.

As an alternative to TensorFlow,
[rTorch](https://cran.r-project.org/web/packages/rTorch/index.html) can be used. rTorch is an R
implementation of and interface to the [PyTorch](https://pytorch.org/) machine learning
framework.

## Parallel computing with R

Generally, R code is serial. However, many computations in R can be made faster by the use of
parallel computations. Taurus offers a vast number of options for parallel computations. Large
amounts of data and/or complex models are indications for the use of parallelism.

### General information about the R parallelism

There are various techniques and packages in R that allow parallelization. This chapter
concentrates on the most general methods and examples. The information here is Taurus-specific.
The [parallel package](https://www.rdocumentation.org/packages/parallel/versions/3.6.2) will be
used throughout this chapter.

**Note:** Please do not install or update R packages related to parallelism; this could lead to
conflicts with other pre-installed packages.

### Basic lapply-based parallelism

The **`lapply()`** function is part of base R. `lapply()` is useful for performing operations on
list objects. Roughly speaking, `lapply()` is a vectorised form of a loop, and it serves as the
basis for parallelization. To use lapply-style parallelism across more than one node, some form
of networking is required so that the workers can communicate with each other and shuffle the
relevant data around. A simple example of the "pure" lapply parallelism can be found in the
[attachment](%ATTACHURL%/lapply.R).
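The following minimal sketch (not the attached `lapply.R`, just an illustration) shows the serial
lapply pattern that the parallel variants in the next sections build on; only the function that
distributes the iterations changes, the calling pattern stays the same:

    # serial baseline: apply an "expensive" function to every element of a list
    slow_square <- function(x) {
      Sys.sleep(0.01)   # stands in for real work
      x^2
    }
    result <- lapply(1:100, slow_square)

    # the parallel variants below (mclapply, parLapply) keep this calling pattern
    # and only change how the iterations are distributed over cores or nodes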
### Shared-memory parallelism

The `parallel` library includes the `mclapply()` function, which is a shared-memory version of
lapply. The "mc" stands for "multicore". This function distributes the `lapply` tasks across
multiple CPU cores to be executed in parallel.

This is a simple option for parallelization. It does not require much effort to rewrite serial
code to use `mclapply()`. Check out an [example](%ATTACHURL%/multicore.R). The downside of the
shared-memory approach is that it is limited by the number of cores (CPUs) of a single node.

**Important:** Please allocate the job with respect to the
[hardware specification](HardwareTaurus). The current maximum number of processors (read: cores)
for an SMP-parallel program on Taurus is 56 (smp2 partition); for the Haswell partition it is 24.
The large SMP system (Julia) with a total of 896 cores is coming soon.

Submitting a multicore R job to SLURM is very similar to
[submitting an OpenMP job](Slurm#Binding_and_Distribution_of_Tasks), since both run a multicore
job on a **single** node. Below is an example:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --tasks-per-node=1
    #SBATCH --cpus-per-task=16
    #SBATCH --time=00:10:00
    #SBATCH -o test_multicore.out
    #SBATCH -e test_multicore.err

    module purge
    module load modenv/scs5
    module load R

    R CMD BATCH Rcode.R

Examples of R scripts with shared-memory parallelism can be found as an
[attachment](%ATTACHURL%/multicore.R) at the bottom of the page.

### Distributed memory parallelism

To use this option, we need to start by setting up a cluster: a collection of workers that will
do the job in parallel. There are three main options: MPI cluster, PSOCK cluster and FORK
cluster. We use the `makeCluster()` function from the `parallel` package to create a set of
copies of **R** running in parallel and communicating over sockets; the cluster type is specified
by the `type` argument.

#### MPI cluster

This flavour of R parallelism uses the
[Rmpi](http://cran.r-project.org/web/packages/Rmpi/index.html) package and
[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (Message Passing Interface) as a
"backend" for its parallel operations.
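A minimal sketch of what the R side of an MPI cluster can look like (this is only an
illustration, not the attached `Rmpi.R`; it assumes the Rmpi and snow packages are available,
since `parallel::makeCluster()` hands the "MPI" type over to them):

    # launch with:  mpirun -np 1 R CMD BATCH this_script.R
    library(parallel)

    # e.g. with --ntasks=16: one task for the master process, the rest for workers
    n_workers <- 15

    # type = "MPI" is delegated to the snow package, which in turn uses Rmpi,
    # so both packages must be available in the environment
    cl <- makeCluster(n_workers, type = "MPI")

    # distribute a list of tasks over the MPI workers
    result <- parLapply(cl, 1:100, function(x) x^2)

    stopCluster(cl)
    Rmpi::mpi.quit()   # shut down MPI cleanly instead of a plain quit()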
Submitting a multinode MPI R job to SLURM is very similar to
[submitting an MPI job](Slurm#Binding_and_Distribution_of_Tasks), since both run multicore jobs
on multiple nodes. Below is an example of running an R script with Rmpi on Taurus:

    #!/bin/bash
    #SBATCH --partition=haswell   #specify the partition
    #SBATCH --ntasks=16           #This parameter determines how many processes will be spawned. Please use >= 8.
    #SBATCH --cpus-per-task=1
    #SBATCH --time=00:10:00
    #SBATCH -o test_Rmpi.out
    #SBATCH -e test_Rmpi.err

    module purge
    module load modenv/scs5
    module load R

    mpirun -n 1 R CMD BATCH Rmpi.R   #specify the absolute path to the R script, e.g. /scratch/ws/max1234-Work/R/Rmpi.R

    # when finished writing, submit with sbatch <script_name>

The **`--ntasks`** SLURM option is the best and simplest way to run your application with MPI.
The number of nodes required to complete this number of tasks will then be selected
automatically. Each MPI rank is assigned 1 core (CPU).

However, in some specific cases you can specify the number of nodes and the number of necessary
tasks per node explicitly:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --tasks-per-node=16
    #SBATCH --cpus-per-task=1

    module purge
    module load modenv/scs5
    module load R

    time mpirun -quiet -np 1 R CMD BATCH --no-save --no-restore Rmpi_c.R   #'time' reports how long your script took to complete

The job file above shows the binding of an MPI job in which 32 global ranks are distributed over
2 nodes with 16 cores (CPUs) each; each MPI rank has 1 core assigned to it. Use the
[example](%ATTACHURL%/Rmpi_c.R) from the attachments.

To use Rmpi and MPI, please use one of these partitions: **Haswell**, **Broadwell** or **Rome**.

**Important:** Please allocate the required number of nodes and cores according to the hardware
specification: 1 Haswell node: 2 x Intel Xeon (12 cores each); 1 Broadwell node: 2 x Intel Xeon
(14 cores each); 1 Rome node: 2 x AMD EPYC (64 cores each). Please also check the
[hardware specification](HardwareTaurus) (number of nodes etc.). The `sinfo` command gives you a
quick overview of the status of the partitions.

Please use the `mpirun` command to run the Rmpi script. It is a wrapper that enables the
communication between processes running on different machines. We recommend always using
`-np 1` (the number of MPI processes to launch), because otherwise Rmpi spawns additional
processes dynamically.

Examples of R scripts with Rmpi can be found as attachments at the bottom of the page.

#### PSOCK cluster

The `type="PSOCK"` cluster uses TCP sockets to transfer data between nodes. PSOCK is the default
on *all* systems, and it is the method to use if your parallel code must also run on Windows. The
advantage of this method is that it does not require external libraries such as Rmpi. On the
other hand, TCP sockets are relatively
[slow](http://glennklockwood.blogspot.com/2013/06/whats-killing-cloud-interconnect.html).
Creating a PSOCK cluster is similar to launching an MPI cluster, but instead of simply saying how
many parallel workers you want, you have to specify the list of workers manually, according to
the hardware specification and the parameters of your job. An example of the code can be found as
an [attachment](%ATTACHURL%/RPSOCK.R?t=1597043002).
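The sketch below illustrates the idea (it is not the attached `RPSOCK.R`). The host names are
hypothetical; on Taurus you would build the worker list from your job's allocation, e.g. by
expanding the `SLURM_JOB_NODELIST` environment variable with `scontrol show hostnames`:

    library(parallel)

    # hypothetical worker list: 16 workers on each of two allocated nodes;
    # on Taurus, derive this list from your job's allocation instead
    hosts <- rep(c("taurusi6001", "taurusi6002"), each = 16)

    cl <- makeCluster(hosts, type = "PSOCK")   # one R worker per entry; remote workers are started via ssh
    result <- parLapply(cl, 1:1000, function(x) sqrt(x))
    stopCluster(cl)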
#### FORK cluster

The `type="FORK"` cluster behaves exactly like the `mclapply()` function discussed in the
previous section. Like `mclapply()`, it can only use the cores available on a single node, but it
does not require exporting data to the workers, since all cores share the same memory. You may
find it more convenient to use a FORK cluster with `parLapply()` than `mclapply()` if you
anticipate using the same code across multicore *and* multinode systems.

### Other parallel options

There are numerous other parallel options for R. For general users we recommend the options
listed above; nevertheless, the alternatives should be mentioned:

- [foreach](https://cran.r-project.org/web/packages/foreach/index.html) package. It is
  functionally equivalent to the
  [lapply-based parallelism](https://www.glennklockwood.com/data-intensive/r/lapply-parallelism.html)
  discussed before, but based on a for-loop;
- [future](https://cran.r-project.org/web/packages/future/index.html) package. The purpose of
  this package is to provide a lightweight and unified Future API for sequential and parallel
  processing of R expressions via futures;
- [Poor-man's parallelism](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-1-poor-man-s-parallelism)
  (simple data parallelism). It is the simplest, but not an elegant, way to parallelize R code:
  several copies of the same R script are run, each reading a different part of the input data;
- [Hands-off (OpenMP) method](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-2-hands-off-parallelism).
  R has [OpenMP](https://www.openmp.org/resources/) support, so using OpenMP is a simple method
  where you do not need to know much about the parallelism options in your code. Please be
  careful and do not mix this technique with other methods!

-- Main.AndreiPolitov - 2020-05-18

Attachments:

- [TensorflowMNIST.R](%ATTACHURL%/TensorflowMNIST.R?t=1597837603)
- [lapply.R](%ATTACHURL%/lapply.R)
- [multicore.R](%ATTACHURL%/multicore.R)
- [Rmpi.R](%ATTACHURL%/Rmpi.R)
- [Rmpi_c.R](%ATTACHURL%/Rmpi_c.R)
- [RPSOCK.R](%ATTACHURL%/RPSOCK.R)
diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml
index 1a1ff9012a96e1a3d79acee857a4378f04ccdf00..95a3dce56adde700de14d88d7984c76ad0dfe3e0 100644
--- a/doc.zih.tu-dresden.de/mkdocs.yml
+++ b/doc.zih.tu-dresden.de/mkdocs.yml
@@ -17,6 +17,7 @@ nav:
    - Support: support.md
    - Running Jobs:
      - Overview: jobs/index.md
+      - HPDCA: DataAnalyticsWithR.md
#      - Queue Policy: jobs/policy.md
#      - Examples: jobs/examples/index.md
#      - Affinity: jobs/affinity/index.md