Commit d026e1af authored by Martin Schroschk

Merge branch 'dask' into 'preview'

Dask: Transfer content to new wiki and fix checks

See merge request zih/hpc-compendium/hpc-compendium!160
parents 59bd7bee 7fb9cef2
3 merge requests: !322 Merge preview into main, !319 Merge preview into main, !160 Dask: Transfer content to new wiki and fix checks
# Dask
**Dask** is a flexible, open-source library for parallel computing in Python.
Dask scales Python natively. It provides advanced parallelism for analytics, enabling performance at
scale for popular tools. For instance, Dask arrays scale NumPy workflows, Dask dataframes scale
Pandas workflows, and Dask-ML scales machine learning APIs like Scikit-Learn and XGBoost.
Dask is composed of two parts:
- Dynamic task scheduling optimized for computation and interactive
computational workloads.
- Big Data collections like parallel arrays, data frames, and lists
that extend common interfaces like NumPy, Pandas, or Python
iterators to larger-than-memory or distributed environments. These
parallel collections run on top of dynamic task schedulers.
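
The relationship between collections and the task scheduler can be illustrated with a small Dask
array. This is a minimal sketch, not part of the original page; the array shape and chunk size are
arbitrary values chosen for illustration, and it assumes NumPy is installed alongside Dask.

```python
import dask.array as da

# A 4000 x 4000 array of ones, split into sixteen 1000 x 1000 chunks.
# Nothing is computed yet; Dask only records a task graph.
x = da.ones((4000, 4000), chunks=(1000, 1000))

# Operations look like NumPy but stay lazy until compute() is called.
total = (x + x.T).sum()

# compute() hands the task graph to the scheduler, which evaluates it chunk by chunk.
result = total.compute()
print(x.numblocks, result)
```

Each chunk is an ordinary NumPy array, which is why the same approach can extend to data that does
not fit into memory as a whole.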
Dask supports several user interfaces:
High-Level:
- Arrays: Parallel NumPy
- Bags: Parallel lists
- DataFrames: Parallel Pandas
- Machine Learning: Parallel Scikit-Learn
- Others from external projects, like XArray
Low-Level:
- Delayed: Parallel function evaluation
- Futures: Real-time parallel function evaluation
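
As an illustration of the low-level interface, the following sketch uses `dask.delayed` to build a
small task graph from plain Python functions. The functions `inc` and `add` are hypothetical
helpers introduced only for this example.

```python
from dask import delayed


@delayed
def inc(x):
    return x + 1


@delayed
def add(a, b):
    return a + b


# Calling delayed functions does not run them; it builds a task graph.
a = inc(1)
b = inc(2)
total = add(a, b)

# The two inc() calls are independent, so the scheduler may run them in parallel.
result = total.compute()
print(result)  # prints 5
```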
## Installation
### Installation Using Conda
Dask is installed by default in [Anaconda](https://www.anaconda.com/download/). To install or update
Dask on Taurus using [conda](https://www.anaconda.com/download/), first start an interactive job:
```Bash
# Job submission in ml nodes with allocating: 1 node, 1 gpu per node, 4 hours
srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash
```
Create a conda virtual environment. We recommend creating it in a workspace; use the `--prefix`
flag to specify the directory, as in the example below.
**Note:** You can work with simple examples in your home directory (your default location).
However, in accordance with the
[HPC storage concept](../data_management/HPCStorageConcept2019.md), please use a
[workspace](../data_management/Workspaces.md) for your study and work projects.
```Bash
conda create --prefix /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test python=3.6
```
Without `--prefix`, conda places the environment in your home directory by default:
```Bash
conda create -n dask-test python=3.6
```
Activate the virtual environment, install Dask and verify the installation:
```Bash
ml modenv/ml
ml PythonAnaconda/3.6
conda activate /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test
which python
which conda
conda install dask
# Start the Python interpreter; the remaining lines are entered at its prompt
python
from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1)
client
```
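
Beyond creating a `Client`, a slightly more complete check is to run a small computation through
it. The following is a minimal sketch using the futures interface; `square` and `sum_of_squares`
are hypothetical helper names, and the worker counts are illustrative only.

```python
from dask.distributed import Client


def square(x):
    return x * x


def sum_of_squares(n, **client_kwargs):
    """Start a local client, compute sum(i * i for i in range(n)), then shut down."""
    with Client(**client_kwargs) as client:
        # map() submits one task per input and returns futures immediately.
        futures = client.map(square, range(n))
        # Futures can be passed to further tasks; Dask resolves them on the workers.
        total = client.submit(sum, futures)
        return total.result()


if __name__ == "__main__":
    # The main guard matters because the default local workers are separate processes.
    print(sum_of_squares(10, n_workers=4, threads_per_worker=1))  # prints 285
```

If this prints the expected sum, the scheduler and workers started correctly inside the
environment.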
### Installation Using Pip
You can install everything required for most common uses of Dask (arrays, dataframes, etc.):
```Bash
srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash
cd /scratch/ws/0/aabc1234-Workproject/python-virtual-environment/dask-test
ml modenv/ml
module load PythonAnaconda/3.6
which python
python3 -m venv --system-site-packages dask-test
source dask-test/bin/activate
python -m pip install "dask[complete]"
# Start the Python interpreter; the remaining lines are entered at its prompt
python
from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1)
client
```
Distributed scheduler
?
## Run Dask on Taurus
The preferred and simplest way to run Dask on HPC systems today, for new users, experienced users,
and administrators alike, is to use [dask-jobqueue](https://jobqueue.dask.org/).
You can install dask-jobqueue with `pip` or `conda`.
Installation with Pip
```Bash
srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash
cd /scratch/ws/0/aabc1234-Workproject/python-virtual-environment/dask-test
ml modenv/ml
module load PythonAnaconda/3.6
which python
source dask-test/bin/activate
pip install dask-jobqueue --upgrade # Install everything from last released version
```
Installation with Conda
```Bash
srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash
ml modenv/ml
module load PythonAnaconda/3.6
source dask-test/bin/activate
conda install dask-jobqueue -c conda-forge
```
@@ -24,20 +24,22 @@ nav:
     - Singularity Recicpe Hints: software/SingularityRecipeHints.md
     - Singularity Example Definitions: software/SingularityExampleDefinitions.md
     - Custom Easy Build Modules: software/CustomEasyBuildEnvironment.md
-    - Get started with HPC-DA: software/GetStartedWithHPCDA.md
     - Mathematics: software/Mathematics.md
-    - Machine Learning: software/MachineLearning.md
-    - Deep Learning: software/DeepLearning.md
     - Visualization: software/Visualization.md
-    - Data Analytics with R: software/DataAnalyticsWithR.md
-    - Data Analytics with Python: software/Python.md
-    - Tensorflow:
-      - Tensorflow Overview: software/TensorFlow.md
-      - Tensorflow in Container: software/TensorFlowContainerOnHPCDA.md
-      - Tensorflow in JupyterHub: software/TensorFlowOnJupyterNotebook.md
-    - Keras: software/Keras.md
-    - Power AI: software/PowerAI.md
-    - PyTorch: software/PyTorch.md
+    - HPC-DA:
+      - Get started with HPC-DA: software/GetStartedWithHPCDA.md
+      - Machine Learning: software/MachineLearning.md
+      - Deep Learning: software/DeepLearning.md
+      - Data Analytics with R: software/DataAnalyticsWithR.md
+      - Data Analytics with Python: software/Python.md
+      - Tensorflow:
+        - Tensorflow Overview: software/TensorFlow.md
+        - Tensorflow in Container: software/TensorFlowContainerOnHPCDA.md
+        - Tensorflow in JupyterHub: software/TensorFlowOnJupyterNotebook.md
+      - Keras: software/Keras.md
+      - Dask: software/Dask.md
+      - Power AI: software/PowerAI.md
+      - PyTorch: software/PyTorch.md
     - Computational Fluid Dynamics (CFD): software/CFD.md
     - FAQs: software/modules-faq.md
     - Bio Informatics: software/Bioinformatics.md
-- Main.AndreiPolitov - 2020-08-26