diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics.md b/doc.zih.tu-dresden.de/docs/software/data_analytics.md
index 31ce02047f2ad70209a8613bf179634dcc643893..1097376ae2d22a187979ad90763e9a236f0ba813 100644
--- a/doc.zih.tu-dresden.de/docs/software/data_analytics.md
+++ b/doc.zih.tu-dresden.de/docs/software/data_analytics.md
@@ -5,7 +5,7 @@ analytics. The boundaries between data analytics and machine learning are fluid.
 Therefore, it may be worthwhile to search for a specific issue within the data analytics and
 machine learning sections.

-The following tools are available in the ZIH system, among others:
+The following tools are available on ZIH systems, among others:

 1. [Python](data_analytics_with_python.md)
 1. [R](data_analytics_with_r.md)
@@ -18,17 +18,17 @@ or [PyTorch](pytorch.md), can be found in [machine learning](machine_learning.md

 Other software not listed here can be searched with

-```bash
+```Bash
 module spider <software_name>
 ```

 Additional software or special versions of individual modules can be installed individually by
 each user. If possible, the use of virtual environments is recommended (e.g. for Python).
-Likewise software can be used within [containers](containers.md).
+Likewise, software can be used within [containers](containers.md).

 For the transfer of larger amounts of data into and within the system, the
 [export nodes and data mover](../data_transfer/overview.md) should be used.
-The data storage takes place in the [workspaces](../data_lifecycle/workspaces.md).
+Data is stored in the [workspaces](../data_lifecycle/workspaces.md).
 Software modules or virtual environments can also be installed in workspaces to enable
 collaborative work even within larger groups. General recommendations for setting up workflows
 can be found in the [experiments](../data_lifecycle/experiments.md) section.
diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md
index a138448a010418cd1f15953bd6a0f6d378f525b7..745638ab316fd2a702aafb48d7f2c9452895e5f8
--- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md
+++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md
@@ -1,15 +1,15 @@
 # Python for Data Analytics

 Python is a high-level interpreted language widely used in research and science. Using ZIH system
-allows you to work with python quicker and more effective. Here the general introduction to working
-with python on ZIH system is given. For specific machine learning frameworks see respective
-documentation in [machine learning](machine_learning.md) section.
+allows you to work with Python more quickly and effectively. Here, a general introduction to
+working with Python on ZIH systems is given. Further documentation is available for specific
+[machine learning frameworks](machine_learning.md).

 ## Python Console and Virtual Environments

-Often it is useful to create an isolated development environment, which can be shared among
-a research group and/or teaching class. For this purpose python virtual environments can be used.
-For more details see [here](python_virtual_environments.md).
+Often, it is useful to create an isolated development environment, which can be shared among
+a research group and/or teaching class. For this purpose, [Python virtual environments](python_virtual_environments.md)
+can be used.

 The interactive Python interpreter can also be used on ZIH systems via an interactive job:

@@ -25,12 +25,14 @@ Type "help", "copyright", "credits" or "license" for more information.

 ## Jupyter Notebooks

-Jupyter notebooks are a great way for interactive computing in a web
-browser. They allow working with data cleaning and transformation,
-numerical simulation, statistical modeling, data visualization and machine learning.
+Jupyter notebooks allow you to analyze data interactively using your web browser. One advantage of
+Jupyter is that code, documentation and visualization can be included in a single notebook, so that
+it forms a unit. Jupyter notebooks can be used for many tasks, such as data cleaning and
+transformation, numerical simulation, statistical modeling, data visualization and also machine
+learning.

-On ZIH system a [JupyterHub](../access/jupyterhub.md) is available, which can be used to run
-a Jupyter notebook on a node, as well using a GPU when needed.
+On ZIH systems, a [JupyterHub](../access/jupyterhub.md) is available, which can be used to run a
+Jupyter notebook on a node, using a GPU when needed.

 ## Parallel Computing with Python

@@ -38,7 +40,7 @@ a Jupyter notebook on a node, as well using a GPU when needed.
 [Pandas](https://pandas.pydata.org/){:target="_blank"} is a widely used library for data analytics
 in Python.

-In many cases an existing source code using Pandas can be easily modified for parallel execution
+In many cases, an existing source code using Pandas can be easily modified for parallel execution
 by using the [pandarallel](https://github.com/nalepae/pandarallel/tree/v1.5.2){:target="_blank"}
 module. The number of threads that can be used in parallel depends on the number of cores
 (parameter `--cpus-per-task`) within the Slurm request, e.g.
@@ -47,11 +49,11 @@ module. The number of threads that can be used in parallel depends on the number
 marie@login$ srun --partition=haswell --cpus-per-task=4 --mem=2G --hint=nomultithread --pty --time=8:00:00 bash
 ```

-The request from above will allow to use 4 parallel threads.
+The above request allows you to use 4 parallel threads.

 The following example shows how to parallelize the apply method for pandas dataframes with the
-pandarallel module. If the pandarallel module is not installed already, check out the usage of
-[virtual environments](python_virtual_environments.md) for installing the module.
+pandarallel module. If the pandarallel module is not installed already, use a
+[virtual environment](python_virtual_environments.md) to install the module.

 ??? example
     ```python
@@ -83,10 +85,10 @@ For more examples of using pandarallel check out
 ### Dask

 [Dask](https://dask.org/) is a flexible and open-source library for parallel computing in Python.
-It scales Python and provides advanced parallelism for analytics, enabling performance at
-scale for some of the popular tools. For instance: Dask arrays scale NumPy workflows, Dask
-dataframes scale Pandas workflows, Dask-ML scales machine learning APIs like Scikit-Learn and
-XGBoost.
+It replaces some Python data structures with parallel versions in order to provide advanced
+parallelism for analytics, enabling performance at scale for some of the popular tools. For
+instance: Dask arrays replace NumPy arrays, Dask dataframes replace Pandas dataframes.
+Furthermore, Dask-ML scales machine learning APIs like Scikit-Learn and XGBoost.

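+A minimal sketch of this drop-in behavior with a Dask array (assuming Dask is available, see the
+Dask usage section below) could look like this:
+
+```python
+import dask.array as da
+
+# A chunked, lazily evaluated array that mimics the NumPy interface
+x = da.random.random((10000, 10000), chunks=(1000, 1000))
+
+# Operations build a task graph; compute() executes it in parallel
+column_means = x.mean(axis=0).compute()
+print(column_means.shape)  # (10000,)
+```
+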
 Dask is composed of two parts:

@@ -107,37 +109,21 @@ Dask supports several user interfaces:
 - Delayed: Parallel function evaluation
 - Futures: Real-time parallel function evaluation

-#### Dask Installation
-
-!!! hint
-    This step might be obsolete, since the library may be already available as a module.
-    Check it with
-
-    ```console
-    marie@compute$ module spider dask
-    ------------------------------------------------------------------------------------------
-     dask:
-    ----------------------------------------------------------------------------------------------
-        Versions:
-        dask/2.8.0-fosscuda-2019b-Python-3.7.4
-        dask/2.8.0-Python-3.7.4
-        dask/2.8.0 (E)
-    [...]
-    ```
+#### Dask Usage

-The installation of Dask is very easy and can be done by a user using a [virtual environment](python_virtual_environments.md)
+On ZIH systems, Dask is available as a module. Check available versions and load your preferred one:

 ```console
-marie@compute$ module load SciPy-bundle/2020.11-fosscuda-2020b Pillow/8.0.1-GCCcore-10.2.0
-marie@compute$ virtualenv --system-site-packages dask-test
-created virtual environment CPython3.8.6.final.0-64 in 10325ms
-  creator CPython3Posix(dest=~/dask-test, clear=False, global=True)
-  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=~/.local/share/virtualenv)
-  added seed packages: pip==21.1.3, setuptools==57.4.0, wheel==0.36.2
-  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
-marie@compute$ source dask-test/bin/activate
-(dask-test) marie@compute$ pip install dask dask-jobqueue
+marie@compute$ module spider dask
+------------------------------------------------------------------------------------------
+  dask:
+----------------------------------------------------------------------------------------------
+     Versions:
+        dask/2.8.0-fosscuda-2019b-Python-3.7.4
+        dask/2.8.0-Python-3.7.4
+        dask/2.8.0 (E) [...]
+marie@compute$ module load dask/2.8.0-fosscuda-2019b-Python-3.7.4
 marie@compute$ python -c "import dask; print(dask.__version__)"
 2021.08.1
 ```
@@ -155,24 +141,21 @@ client

 ### mpi4py - MPI for Python

-Message Passing Interface (MPI) is a standardized and portable
-message-passing standard designed to function on a wide variety of
-parallel computing architectures. The Message Passing Interface (MPI) is
-a library specification that allows HPC to pass information between its
-various nodes and clusters. MPI is designed to provide access to advanced
-parallel hardware for end-users, library writers and tool developers.
-
-mpi4py(MPI for Python) package provides bindings of the MPI standard for
-the python programming language, allowing any Python program to exploit
-multiple processors.
-
-mpi4py based on MPI-2 C++ bindings. It supports almost all MPI calls.
-This implementation is popular on Linux clusters and in the SciPy
-community. Operations are primarily methods of communicator objects. It
-supports communication of pickle-able Python objects. mpi4py provides
+The Message Passing Interface (MPI) is a standardized and portable message-passing standard
+designed to function on a wide variety of parallel computing architectures. It is a library
+specification that allows applications running on HPC systems to pass information between their
+various nodes and clusters. MPI is designed to provide access to advanced parallel hardware for
+end-users, library writers and tool developers.
+
+mpi4py (MPI for Python) provides bindings of the MPI standard for the Python programming
+language, allowing any Python program to exploit multiple processors.
+
+mpi4py is based on MPI-2 C++ bindings. It supports almost all MPI calls. This implementation is
+popular on Linux clusters and in the SciPy community. Operations are primarily methods of
+communicator objects. It supports communication of pickle-able Python objects. mpi4py provides
 optimized communication of NumPy arrays.

-mpi4py is included as an extension of the SciPy-bundle modules on a ZIH system
+mpi4py is included in the SciPy-bundle modules on the ZIH system.

 ```console
 marie@compute$ module load SciPy-bundle/2020.11-foss-2020b
diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
index a17f4974272edd8f521d91bc7dcc5b3612427de3..c36f7c83ace9725f970673fe8abcfd33d54ae433
--- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
+++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
@@ -6,17 +6,16 @@ classical statistical tests, time-series analysis, classification, etc) and grap
 is an integrated suite of software facilities for data manipulation, calculation and graphing.

-R possesses an extensive catalog of statistical and graphical methods. It includes machine
-learning algorithms, linear regression, time series, statistical inference.
+R possesses an extensive catalog of statistical and graphical methods. It includes machine learning
+algorithms, linear regression, time series, statistical inference.

-We recommend using `haswell` and/or `rome` partitions to work with R. For more details
-see [here](../jobs_and_resources/hardware_taurus.md).
+We recommend using the **Haswell** and/or **Romeo** partitions to work with R. For more details
+see our [hardware documentation](../jobs_and_resources/hardware_taurus.md).

 ## R Console

-In the following example the `srun` command is used to submit a real-time execution job
-designed for interactive use with monitoring the output. Please check
-[the Slurm page](../jobs_and_resources/slurm.md) for details.
+In the following example, the `srun` command is used to start an interactive job, so that the output
+is visible to the user. Please check the [Slurm page](../jobs_and_resources/slurm.md) for details.

 ```console
 marie@login$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash
@@ -28,7 +27,7 @@ marie@compute$ R
 ```

 Using `srun` is recommended only for short test runs, while for larger runs batch jobs should be
-used. The examples can be found [here](../jobs_and_resources/slurm.md).
+used. Examples can be found on the [Slurm page](../jobs_and_resources/slurm.md).

 It is also possible to run `Rscript` command directly (after loading the module):

@@ -83,8 +82,8 @@ Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 15 dependencies loaded.
 ```

 !!! warning
-    Be aware that for compatibility reasons it is important to choose modules with
-    the same toolchain version (in this case `fosscuda/2019b`). For reference see [here](modules.md)
+    Be aware that for compatibility reasons it is important to choose [modules](modules.md) with
+    the same toolchain version (in this case `fosscuda/2019b`).

 In order to interact with Python-based frameworks (like TensorFlow) `reticulate` R library is used.
 To configure it to point to the correct Python executable in your virtual environment, create
@@ -92,7 +91,7 @@ a file named `.Rprofile` in your project directory (e.g. R-TensorFlow) with the
 following contents:

 ```R
-Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python") #assign the output of the 'which python' from above to RETICULATE_PYTHON
+Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python") #point RETICULATE_PYTHON to the Python executable
 ```

 Let's start R, install some libraries and evaluate the result:
@@ -252,7 +251,7 @@ code to use `mclapply` function. Check out an example below.

 The disadvantages of using shared-memory parallelism approach are, that the number of parallel
 tasks is limited to the number of cores on a single node. The maximum number of cores on a single
-node can be found [here](../jobs_and_resources/hardware_taurus.md).
+node can be found in our [hardware documentation](../jobs_and_resources/hardware_taurus.md).

 Submitting a multicore R job to Slurm is very similar to submitting an
 [OpenMP Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks),
@@ -288,8 +287,8 @@ This way of the R parallelism uses the
 [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (Message Passing Interface) as a
 "back-end" for its parallel operations. The MPI-based job in R is very similar to submitting an
 [MPI Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks) since both are running
-multicore jobs on multiple nodes. Below is an example of running R script with the Rmpi on
-ZIH system:
+multicore jobs on multiple nodes. Below is an example of running an R script with Rmpi on the ZIH
+system:

 ```Bash
 #!/bin/bash
@@ -458,7 +457,8 @@ parallel process.
   expression via futures
 - [Poor-man's parallelism](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-1-poor-man-s-parallelism)
   (simple data parallelism). It is the simplest, but not an elegant way to parallelize R code.
-  It runs several copies of the same R script where's each read different sectors of the input data
+  It runs several copies of the same R script where each copy reads a different part of the input
+  data.
 - [Hands-off (OpenMP)](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-2-hands-off-parallelism)
   method. R has [OpenMP](https://www.openmp.org/resources/) support. Thus using OpenMP is a simple
   method where you don't need to know much about the parallelism options in your code. Please be
diff --git a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md
index 437f5c498d7250cb080497bfde1cae1bfa01a1fb..5bca0d4fa6bcd24aba066c765b6fa6031a673f7e
--- a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md
+++ b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md
@@ -43,7 +43,7 @@ There are three script preparation steps for OmniOpt:

     ??? note "Parsing arguments in Python"
         There are many ways for parsing arguments into Python scripts.
-        The most easiest approach is the `sys` module (see
+        The easiest approach is the `sys` module (see
         [https://www.geeksforgeeks.org/how-to-use-sys-argv-in-python/](https://www.geeksforgeeks.org/how-to-use-sys-argv-in-python/){:target="_blank"}),
         which would be fully sufficient for usage with OmniOpt. Nevertheless, this basic approach
         has no consistency checks or error handling etc.
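+
+        A minimal sketch of this approach (assuming two hypothetical hyperparameters are passed
+        as positional command line arguments) could look like this:
+
+        ```python
+        # Sketch only: read two assumed hyperparameters from the command line.
+        import sys
+
+        learning_rate = float(sys.argv[1])
+        batch_size = int(sys.argv[2])
+        print(f"Training with learning_rate={learning_rate} and batch_size={batch_size}")
+        ```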
@@ -51,7 +51,7 @@ There are three script preparation steps for OmniOpt:
 + Mark the output of the optimization target (chosen here: average loss) by prefixing it
   with the RESULT string. OmniOpt takes the **last appearing value** prefixed with the RESULT
   string.
-  In the example different epochs are performed and the average from the last epoch is caught
+  In the example, different epochs are performed and the average from the last epoch is caught
   by OmniOpt. Additionally, the RESULT output has to be a **single line**.

 After all these changes, the final script is as follows (with the lines containing relevant changes highlighted).