diff --git a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md
index 913cf2283ea25bab464b1e4514ce549c1afb3a25..472637e3a89702b4423b644a5bb74cb3e6c2a5c4 100644
--- a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md
+++ b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md
@@ -1,4 +1,4 @@
-# BIG DATA FRAMEWORKS: APACHE SPARK, APACHE FLINK, APACHE HADOOP
+# Big Data Frameworks: Apache Spark, Apache Flink, Apache Hadoop
 
 !!! note
 
@@ -6,197 +6,218 @@
 [Apache Spark](https://spark.apache.org/), [Apache
 Flink](https://flink.apache.org/) and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating
-Big Data. These frameworks are also offered as software [modules](modules.md) on Taurus
-for both ml and scs5 partitions. You could check module availability with the command:
+Big Data. These frameworks are also offered as software [modules](modules.md) on both `ml` and
+`scs5` partitions. You can check module versions and availability with the command:
 
-```bash
-ml av Spark
+```console
+marie@login$ module av Spark
 ```
 
-**Aim** of this page is to introduce users on how to start working with
-the frameworks on Taurus in general as well as on the [HPC-DA](../jobs_and_resources/hpcda.md) system.
+The **aim** of this page is to show users how to start working with
+these frameworks on ZIH systems, e.g., on the [HPC-DA](../jobs_and_resources/hpcda.md) system.
 
-**Prerequisites:** To work with the frameworks, you need [access](../access/login.md) to the Taurus
-system and basic knowledge about data analysis and [Slurm](../jobs_and_resources/slurm.md).
+**Prerequisites:** To work with the frameworks, you need [access](../access/login.md) to ZIH
+systems and basic knowledge about data analysis and the batch system
+[Slurm](../jobs_and_resources/slurm.md).
 
-The usage of big data frameworks is
+The usage of Big Data frameworks is
 different from other modules due to their master-worker approach. That
-means, before an application can be started, one has to do additional
-steps. In the following, we assume that a Spark application should be
+means that additional steps are required before an application can be started.
+In the following, we assume that a Spark application should be
 started.
 
-The steps are: 1 Load the Spark software module 1 Configure the Spark
-cluster 1 Start a Spark cluster 1 Start the Spark application
+The steps are:
 
-## Interactive jobs with Apache Spark with the default configuration
+1. Load the Spark software module
+1. Configure the Spark cluster
+1. Start a Spark cluster
+1. Start the Spark application
 
-The Spark module is available for both **scs5** and **ml** partitions.
-Thus, it could be used for different CPU architectures: Haswell, Power9
-(ml partition) etc.
+Apache Spark can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as well
+as via [Jupyter notebook](#jupyter-notebook). All three ways are outlined in the following.
 
-Let us assume that 2 nodes should be used for the computation. Use a
+!!! note
+
+    It is recommended to use SSH keys to avoid having to enter the password
+    every time you log in to the nodes. For details, please check the
+    [external documentation](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-ssh-configuration-keypairs).
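+
+    A minimal sketch of the key setup (assuming the usual shared home directory,
+    so that a key pair created on the login node is visible on all nodes) could
+    look like this:
+
+    ```console
+    marie@login$ ssh-keygen -t ed25519
+    marie@login$ cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
+    marie@login$ chmod 600 ~/.ssh/authorized_keys
+    ```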
+
+## Interactive Jobs
+
+### Default Configuration
+
+The Spark module is available for both `scs5` and `ml` partitions.
+Thus, Spark can be executed using different CPU architectures, e.g., Haswell and Power9.
+
+Let us assume that two nodes should be used for the computation. Use a
 `srun` command similar to the following to start an interactive session
-using the Haswell partition:
+using the Haswell partition. The following code snippet shows a job submission
+to Haswell nodes with an allocation of two nodes with 60 GB main memory
+exclusively for one hour:
 
-```bash
-srun --partition=haswell -N2 --mem=60g --exclusive --time=01:00:00 --pty bash -l #Job submission to haswell nodes with an allocation of 2 nodes with 60 GB main memory exclusively for 1 hour
+```console
+marie@login$ srun --partition=haswell -N 2 --mem=60g --exclusive --time=01:00:00 --pty bash -l
 ```
 
-The command for different resource allocation on the **ml** partition is
-similar:
+The command for different resource allocation on the `ml` partition is
+similar, e.g., for a job submission to `ml` nodes with an allocation of one
+node, one task per node, two CPUs per task, one GPU per node, and 10000 MB memory per CPU for one hour:
 
-```bash
-srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=10000 bash #job submission to ml nodes with an allocation of 1 node, 1 task per node, 2 CPUs per task, 1 gpu per node, with 10000 MB for 1 hour.
+```console
+marie@login$ srun --partition=ml -N 1 -n 1 -c 2 --gres=gpu:1 --mem-per-cpu=10000 --time=01:00:00 --pty bash
 ```
 
-Once you have the shell, load Spark using the following command:
+Once you have the shell, load Spark using the command:
 
-```bash
-ml Spark
+```console
+marie@compute$ module load Spark
 ```
 
 Before the application can be started, the Spark cluster needs to be set
 up. To do this, configure Spark first using configuration template at
 `$SPARK_HOME/conf`:
 
-```bash
-source framework-configure.sh spark $SPARK_HOME/conf
+```console
+marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
 ```
 
 This places the configuration in a directory called
-`cluster-conf-<JOB_ID>` in your home directory, where `<JOB_ID>` stands
-for the job id of the SLURM job. After that, you can start Spark in the
+`cluster-conf-<JOB_ID>` in your `home` directory, where `<JOB_ID>` stands
+for the id of the Slurm job. After that, you can start Spark in the
 usual way:
 
-```bash
-start-all.sh
+```console
+marie@compute$ start-all.sh
 ```
 
 The Spark processes should now be set up and you can start your
 application, e. g.:
 
-```bash
-spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000
+```console
+marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000
+```
+
+!!! warning
+
+    Do not delete the directory `cluster-conf-<JOB_ID>` while the job is still
+    running. This leads to errors.
+
+### Custom Configuration
+
+The script `framework-configure.sh` is used to derive a configuration from
+a template. It takes two parameters:
+
+- The framework to set up (Spark, Flink, Hadoop)
+- A configuration template
+
+Thus, you can modify the configuration by replacing the default
+configuration template with a customized one. This way, your custom
+configuration template is reusable for different jobs. You can start
+with a copy of the default configuration ahead of your interactive
+session:
+
+```console
+marie@login$ cp -r $SPARK_HOME/conf my-config-template
+```
+
+After you have changed `my-config-template`, you can use your new template
+in an interactive job with:
+
+```console
+marie@compute$ source framework-configure.sh spark my-config-template
 ```
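+
+What you change in the copy depends on your use case. As an illustration only
+(the files shipped in the template can differ between module versions, and the
+values below are arbitrary examples), you could raise the executor and driver
+memory by appending the corresponding standard Spark properties to
+`spark-defaults.conf` in your copy:
+
+```console
+marie@compute$ echo "spark.executor.memory 8g" >> my-config-template/spark-defaults.conf
+marie@compute$ echo "spark.driver.memory 4g" >> my-config-template/spark-defaults.conf
+```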
 
-%RED%Note<span class="twiki-macro ENDCOLOR"></span>: Please do not
-delete the directory `cluster-conf-<JOB_ID>` while the job is still
-running. This may lead to errors.
+### Using Hadoop Distributed File System (HDFS)
 
-## Batch jobs
+If you want to use Spark and HDFS together (or, more generally, more than one
+framework), a scheme similar to the following can be used:
+
+```console
+marie@compute$ module load Hadoop
+marie@compute$ module load Spark
+marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
+marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
+marie@compute$ start-dfs.sh
+marie@compute$ start-all.sh
+```
+
+## Batch Jobs
 
 Using `srun` directly on the shell blocks the shell and launches an
 interactive job. Apart from short test runs, it is **recommended to
 launch your jobs in the background using batch jobs**. For that, you can
-conveniently put the parameters directly into the job file which you can
-submit using `sbatch \[options\] \<job file>`.
+conveniently put the parameters directly into the job file and submit it via
+`sbatch [options] <job file>`.
 
-Please use a [batch job](../jobs_and_resources/slurm.md) similar to the one attached:
+Please use a [batch job](../jobs_and_resources/slurm.md) similar to [example-spark.sbatch](misc/example-spark.sbatch).
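+
+If you prefer to write such a job file from scratch, the following sketch shows
+the general structure. It simply chains the interactive steps from above; the
+requested resources and the example application are placeholders that need to be
+adapted to your job:
+
+```bash
+#!/bin/bash
+#SBATCH --partition=haswell
+#SBATCH --nodes=2
+#SBATCH --exclusive
+#SBATCH --mem=60g
+#SBATCH --time=01:00:00
+#SBATCH --job-name=spark-example
+
+# load and configure Spark as in the interactive session
+module load Spark
+source framework-configure.sh spark $SPARK_HOME/conf
+
+# start the Spark cluster and run the example application
+start-all.sh
+spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000
+
+# shut the Spark cluster down at the end of the job
+stop-all.sh
+```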
 
-## Apache Spark with Jupyter Notebook
+## Jupyter Notebook
 
-There are two general options on how to work with Jupyter notebooks on
-Taurus:\<br />There is [jupyterhub](../access/jupyterhub.md), where you can simply
+There are two general options for working with Jupyter notebooks:
+The first is [JupyterHub](../access/jupyterhub.md), where you can simply
 run your Jupyter notebook on HPC nodes (the preferable way). Also, you
-can run a remote jupyter server manually within a sbatch GPU job and
-with the modules and packages you need. You can find the manual server
+can run a remote Jupyter server manually within a GPU job using
+the modules and packages you need. You can find the manual server
 setup [here](deep_learning.md).
 
 ### Preparation
 
 If you want to run Spark in Jupyter notebooks, you have to prepare it first. This is comparable
-to the [description for custom environments](../access/jupyterhub.md#Conda_environment).
+to the [description for custom environments](../access/jupyterhub.md#conda-environment).
 You start with an allocation:
 
-```bash
-srun --pty -n 1 -c 2 --mem-per-cpu 2583 -t 01:00:00 bash -l
+```console
+marie@login$ srun --pty -n 1 -c 2 --mem-per-cpu=2500 -t 01:00:00 bash -l
 ```
 
 When a node is allocated, install the required package with Anaconda:
 
-```bash
-module load Anaconda3
-cd
-mkdir user-kernel
-conda create --prefix $HOME/user-kernel/haswell-py3.6-spark python=3.6 #Example output: Collecting package metadata:done Solving environment: done [...]
+```console
+marie@compute$ module load Anaconda3
+marie@compute$ cd
+marie@compute$ mkdir user-kernel
+marie@compute$ conda create --prefix $HOME/user-kernel/haswell-py3.6-spark python=3.6
+Collecting package metadata: done
+Solving environment: done [...]
 
-conda activate $HOME/user-kernel/haswell-py3.6-spark
-conda install ipykernel #Example output: Collecting package metadata: done Solving environment: done[...]
+marie@compute$ conda activate $HOME/user-kernel/haswell-py3.6-spark
+marie@compute$ conda install ipykernel
+Collecting package metadata: done
+Solving environment: done [...]
 
-python -m ipykernel install --user --name haswell-py3.6-spark --display-name="haswell-py3.6-spark" #Example output: Installed kernelspec haswell-py3.6-spark in [...]
+marie@compute$ python -m ipykernel install --user --name haswell-py3.6-spark --display-name="haswell-py3.6-spark"
+Installed kernelspec haswell-py3.6-spark in [...]
 
-conda install -c conda-forge findspark
-conda install pyspark
-conda install keras
+marie@compute$ conda install -c conda-forge findspark
+marie@compute$ conda install pyspark
 
-conda deactivate
+marie@compute$ conda deactivate
 ```
 
 You are now ready to spawn a notebook with Spark.
 
-### Spawning a notebook
+### Spawning a Notebook
 
 Assuming that you have prepared everything as described above, you can go
 to [https://taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter).
 In the tab "Advanced", go to the field "Preload modules" and select one of
 the Spark modules.
-When your jupyter instance is started, check whether the kernel that
+When your Jupyter instance is started, check whether the kernel that
 you created in the preparation phase (see above) is shown in the top
 right corner of the notebook. If it is not already selected, select the
-kernel haswell-py3.6-spark. Then, you can set up Spark. Since the setup
+kernel `haswell-py3.6-spark`. Then, you can set up Spark. Since the setup
 in the notebook requires more steps than in an interactive session, we
 have created an example notebook that you can use as a starting point
 for convenience: [SparkExample.ipynb](misc/SparkExample.ipynb)
 
-%RED%Note<span class="twiki-macro ENDCOLOR"></span>: You could work with
-simple examples in your home directory but according to the [storage concept](../data_lifecycle/hpc_storage_concept2019.md)
-**please use [workspaces](../data_lifecycle/workspaces.md) for
-your study and work projects**. For this reason, you have to use
-advanced options of Jupyterhub and put "/" in "Workspace scope" field.
-
-## Interactive jobs using a custom configuration
-
-The script `framework-configure.sh` is used to derive a configuration from
-a template. It takes 2 parameters:
-
-- The framework to set up (Spark, Flink, Hadoop)
-- A configuration template
-
-Thus, you can modify the configuration by replacing the default
-configuration template with a customized one. This way, your custom
-configuration template is reusable for different jobs. You can start
-with a copy of the default configuration ahead of your interactive
-session:
-
-```bash
-cp -r $SPARK_HOME/conf my-config-template
-```
-
-After you have changed `my-config-template`, you can use your new template
-in an interactive job with:
-
-```bash
-source framework-configure.sh spark my-config-template
-```
-
-## Interactive jobs with Spark and Hadoop Distributed File System (HDFS)
-
-If you want to use Spark and HDFS together (or in general more than one
-framework), a scheme similar to the following can be used:
-
-```bash
-ml Hadoop
-ml Spark
-source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
-source framework-configure.sh spark $SPARK_HOME/conf
-start-dfs.sh
-start-all.sh
-```
+!!! note
 
-Note: It is recommended to use ssh keys to avoid entering the password
-every time to log in to nodes. For the details, please check the
-[documentation](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-ssh-configuration-keypairs).
+    You can work with simple examples in your home directory, but, according to the
+    [storage concept](../data_lifecycle/hpc_storage_concept2019.md),
+    **please use [workspaces](../data_lifecycle/workspaces.md) for
+    your study and work projects**. To this end, use the advanced options of
+    JupyterHub and put "/" in the "Workspace scope" field.
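+
+    If you do not have a workspace yet, you can allocate one beforehand on the
+    login node. This is only a sketch: the workspace name and duration are
+    arbitrary examples, and further options (e.g., the filesystem) are described
+    on the linked workspaces page.
+
+    ```console
+    marie@login$ ws_allocate SparkExample 30
+    ```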
 
 ## FAQ
 
@@ -205,7 +226,7 @@
 $HADOOP_ROOT_DIR/etc/hadoop` gives the output:
 `bash: framework-configure.sh: No such file or directory`. How can this
 be resolved?
 
 A: Please try to re-submit or re-run the job and if that doesn't help
-re-login to Taurus.
+re-login to the ZIH system.
 
 Q: There are a lot of errors and warnings during the set up of the
 session
@@ -213,5 +234,7 @@ session
 
 A: Please check the work capability on a simple example. The source of
 warnings could be ssh etc, and it could be not affecting the frameworks
 
-Note: If you have questions or need advice, please see
-[https://www.scads.ai/services](https://www.scads.ai/services) or contact the HPC support.
+!!! help
+
+    If you have questions or need advice, please see
+    [https://www.scads.de/transfer-2/beratung-und-support-en/](https://www.scads.de/transfer-2/beratung-und-support-en/) or contact the HPC support.