From 786bf032935c3b8dd9262719fb702599cd29e90c Mon Sep 17 00:00:00 2001
From: Christoph Lehmann <christoph.lehmann@tu-dresden.de>
Date: Thu, 5 Aug 2021 17:34:47 +0200
Subject: [PATCH] basic structuring, DA section finished

---
 .../docs/{software => archive}/power_ai.md    |   0
 doc.zih.tu-dresden.de/docs/software/dask.md   | 136 -----------------
 ...ython.md => data_analytics_with_python.md} | 137 ++++++++++++++++++
 ...udio.md => data_analytics_with_rstudio.md} |   0
 doc.zih.tu-dresden.de/mkdocs.yml              |   4 +-
 5 files changed, 139 insertions(+), 138 deletions(-)
 rename doc.zih.tu-dresden.de/docs/{software => archive}/power_ai.md (100%)
 delete mode 100644 doc.zih.tu-dresden.de/docs/software/dask.md
 rename doc.zih.tu-dresden.de/docs/software/{python.md => data_analytics_with_python.md} (77%)
 rename doc.zih.tu-dresden.de/docs/software/{rstudio.md => data_analytics_with_rstudio.md} (100%)

diff --git a/doc.zih.tu-dresden.de/docs/software/power_ai.md b/doc.zih.tu-dresden.de/docs/archive/power_ai.md
similarity index 100%
rename from doc.zih.tu-dresden.de/docs/software/power_ai.md
rename to doc.zih.tu-dresden.de/docs/archive/power_ai.md
diff --git a/doc.zih.tu-dresden.de/docs/software/dask.md b/doc.zih.tu-dresden.de/docs/software/dask.md
deleted file mode 100644
index d6f7d087e..000000000
--- a/doc.zih.tu-dresden.de/docs/software/dask.md
+++ /dev/null
@@ -1,136 +0,0 @@
-# Dask
-
-**Dask** is an open-source library for parallel computing. Dask is a flexible library for parallel
-computing in Python.
-
-Dask natively scales Python. It provides advanced parallelism for analytics, enabling performance at
-scale for some of the popular tools. For instance: Dask arrays scale Numpy workflows, Dask
-dataframes scale Pandas workflows, Dask-ML scales machine learning APIs like Scikit-Learn and
-XGBoost.
-
-Dask is composed of two parts:
-
-- Dynamic task scheduling optimized for computation and interactive
-  computational workloads.
-- Big Data collections like parallel arrays, data frames, and lists
-  that extend common interfaces like NumPy, Pandas, or Python
-  iterators to larger-than-memory or distributed environments. These
-  parallel collections run on top of dynamic task schedulers.
-
-Dask supports several user interfaces:
-
-High-Level:
-
-- Arrays: Parallel NumPy
-- Bags: Parallel lists
-- DataFrames: Parallel Pandas
-- Machine Learning : Parallel Scikit-Learn
-- Others from external projects, like XArray
-
-Low-Level:
-
-- Delayed: Parallel function evaluation
-- Futures: Real-time parallel function evaluation
-
-## Installation
-
-### Installation Using Conda
-
-Dask is installed by default in [Anaconda](https://www.anaconda.com/download/). To install/update
-Dask on a Taurus with using the [conda](https://www.anaconda.com/download/) follow the example:
-
-```Bash
-# Job submission in ml nodes with allocating: 1 node, 1 gpu per node, 4 hours
-srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash
-```
-
-Create a conda virtual environment. We would recommend using a workspace. See the example (use
-`--prefix` flag to specify the directory).
-
-**Note:** You could work with simple examples in your home directory (where you are loading by
-default). However, in accordance with the
-[HPC storage concept](../data_lifecycle/hpc_storage_concept2019.md) please use a
-[workspaces](../data_lifecycle/workspaces.md) for your study and work projects.
-
-```Bash
-conda create --prefix /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test python=3.6
-```
-
-By default, conda will locate the environment in your home directory:
-
-```Bash
-conda create -n dask-test python=3.6
-```
-
-Activate the virtual environment, install Dask and verify the installation:
-
-```Bash
-ml modenv/ml
-ml PythonAnaconda/3.6
-conda activate /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test python=3.6
-which python
-which conda
-conda install dask
-python
-
-from dask.distributed import Client, progress
-client = Client(n_workers=4, threads_per_worker=1)
-client
-```
-
-### Installation Using Pip
-
-You can install everything required for most common uses of Dask (arrays, dataframes, etc)
-
-```Bash
-srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash
-
-cd /scratch/ws/0/aabc1234-Workproject/python-virtual-environment/dask-test
-
-ml modenv/ml
-module load PythonAnaconda/3.6
-which python
-
-python3 -m venv --system-site-packages dask-test
-source dask-test/bin/activate
-python -m pip install "dask[complete]"
-
-python
-from dask.distributed import Client, progress
-client = Client(n_workers=4, threads_per_worker=1)
-client
-```
-
-Distributed scheduler
-
-?
-
-## Run Dask on Taurus
-
-The preferred and simplest way to run Dask on HPC systems today both for new, experienced users or
-administrator is to use [dask-jobqueue](https://jobqueue.dask.org/).
-
-You can install dask-jobqueue with `pip` or `conda`
-
-Installation with Pip
-
-```Bash
-srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash
-cd
-/scratch/ws/0/aabc1234-Workproject/python-virtual-environment/dask-test
-ml modenv/ml module load PythonAnaconda/3.6 which python
-
-source dask-test/bin/activate pip
-install dask-jobqueue --upgrade # Install everything from last released version
-```
-
-Installation with Conda
-
-```Bash
-srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash
-
-ml modenv/ml module load PythonAnaconda/3.6 source
-dask-test/bin/activate
-
-conda install dask-jobqueue -c conda-forge\</verbatim>
-```
diff --git a/doc.zih.tu-dresden.de/docs/software/python.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md
similarity index 77%
rename from doc.zih.tu-dresden.de/docs/software/python.md
rename to doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md
index 281d1fd99..f5121b33c 100644
--- a/doc.zih.tu-dresden.de/docs/software/python.md
+++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md
@@ -130,6 +130,143 @@ the Jupyterhub.
 **Keep in mind that the remote Jupyter server can offer more freedom with settings and
 approaches.**
 
+## Dask
+
+**Dask** is an open-source, flexible library for parallel computing in Python.
+
+Dask natively scales Python: it provides advanced parallelism for analytics, enabling performance
+at scale for many popular tools. For instance, Dask arrays scale NumPy workflows, Dask dataframes
+scale Pandas workflows, and Dask-ML scales machine learning APIs like Scikit-Learn and XGBoost.
+
+Dask is composed of two parts:
+
+- Dynamic task scheduling optimized for computation and interactive
+  computational workloads.
+- Big Data collections like parallel arrays, data frames, and lists
+  that extend common interfaces like NumPy, Pandas, or Python
+  iterators to larger-than-memory or distributed environments. These
+  parallel collections run on top of dynamic task schedulers, as
+  illustrated in the sketch below.
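+
+As a small illustration of these two parts (the array and chunk sizes are arbitrary example
+values), a Dask array offers the familiar NumPy interface while splitting the data into chunks
+that the task scheduler processes in parallel:
+
+```python
+import dask.array as da
+
+# A 10000x10000 random array, split into 1000x1000 chunks
+x = da.random.random((10000, 10000), chunks=(1000, 1000))
+
+# Operations only build a lazy task graph; compute() executes it in parallel
+print(x.mean().compute())
+```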
+
+Dask supports several user interfaces:
+
+High-Level:
+
+- Arrays: Parallel NumPy
+- Bags: Parallel lists
+- DataFrames: Parallel Pandas
+- Machine Learning: Parallel Scikit-Learn
+- Others from external projects, like XArray
+
+Low-Level:
+
+- Delayed: Parallel function evaluation
+- Futures: Real-time parallel function evaluation
+
+### Installation
+
+#### Installation Using Conda
+
+Dask is installed by default in [Anaconda](https://www.anaconda.com/download/). To install or
+update Dask on Taurus using [conda](https://www.anaconda.com/download/), follow this example:
+
+```Bash
+# Job submission on the ml partition, allocating: 1 node, 1 GPU per node, 4 hours
+srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash
+```
+
+Create a conda virtual environment. We recommend using a workspace; use the `--prefix` flag to
+specify its directory, as in the example below.
+
+**Note:** You could work with simple examples in your home directory (where you are placed by
+default). However, in accordance with the
+[HPC storage concept](../data_lifecycle/hpc_storage_concept2019.md), please use a
+[workspace](../data_lifecycle/workspaces.md) for your study and work projects.
+
+```Bash
+conda create --prefix /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test python=3.6
+```
+
+By default (without the `--prefix` flag), conda will locate the environment in your home
+directory:
+
+```Bash
+conda create -n dask-test python=3.6
+```
+
+Activate the virtual environment, install Dask and verify the installation:
+
+```Bash
+ml modenv/ml
+ml PythonAnaconda/3.6
+conda activate /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test
+which python
+which conda
+conda install dask
+python
+
+# inside the Python interpreter:
+from dask.distributed import Client, progress
+client = Client(n_workers=4, threads_per_worker=1)
+client
+```
+
+#### Installation Using Pip
+
+You can install everything required for most common uses of Dask (arrays, dataframes, etc.) with
+pip:
+
+```Bash
+srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash
+
+cd /scratch/ws/0/aabc1234-Workproject/python-virtual-environment
+
+ml modenv/ml
+module load PythonAnaconda/3.6
+which python
+
+python3 -m venv --system-site-packages dask-test
+source dask-test/bin/activate
+python -m pip install "dask[complete]"
+
+python
+# inside the Python interpreter:
+from dask.distributed import Client, progress
+client = Client(n_workers=4, threads_per_worker=1)
+client
+```
+
+Note that both examples above already use Dask's distributed scheduler: creating a `Client`
+starts a local scheduler and worker processes, and subsequent Dask computations run on them.
+
+### Run Dask on Taurus
+
+The preferred and simplest way to run Dask on HPC systems today, for new and experienced users as
+well as administrators, is to use [dask-jobqueue](https://jobqueue.dask.org/).
+
+You can install dask-jobqueue with `pip` or `conda`.
+
+Installation with Pip:
+
+```Bash
+srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash
+
+cd /scratch/ws/0/aabc1234-Workproject/python-virtual-environment
+
+ml modenv/ml
+module load PythonAnaconda/3.6
+which python
+
+source dask-test/bin/activate
+pip install dask-jobqueue --upgrade   # install everything from the last released version
+```
+
+Installation with Conda:
+
+```Bash
+srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash
+
+ml modenv/ml
+module load PythonAnaconda/3.6
+source dask-test/bin/activate
+
+conda install dask-jobqueue -c conda-forge
+```
+
 ## MPI for Python
 
 Message Passing Interface (MPI) is a standardized and portable
diff --git a/doc.zih.tu-dresden.de/docs/software/rstudio.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_rstudio.md
similarity index 100%
rename from doc.zih.tu-dresden.de/docs/software/rstudio.md
rename to doc.zih.tu-dresden.de/docs/software/data_analytics_with_rstudio.md
diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml
index e200c93f7..ca42ea609 100644
--- a/doc.zih.tu-dresden.de/mkdocs.yml
+++ b/doc.zih.tu-dresden.de/mkdocs.yml
@@ -49,8 +49,8 @@ nav:
     - Data Analytics:
       - Overview: software/data_analytics.md
       - Data Analytics with R: software/data_analytics_with_r.md
-      - Data Analytics with RStudio: software/rstudio.md
-      - Data Analytics with Python: software/python.md
+      - Data Analytics with RStudio: software/data_analytics_with_rstudio.md
+      - Data Analytics with Python: software/data_analytics_with_python.md
       - Dask: software/dask.md
       - Power AI: software/power_ai.md
       - Apache Spark, Apache Flink, Apache Hadoop: software/big_data_frameworks.md
--
GitLab