# Big Data Frameworks: Apache Spark, Apache Flink, Apache Hadoop
!!! note
    This page is under construction.
Apache Spark, Apache Flink, and Apache Hadoop are frameworks for processing and integrating Big Data. These frameworks are also offered as software modules on Taurus for both the `ml` and `scs5` partitions. You can check the module availability with the command:

```console
marie@login$ module av Spark
```
The aim of this page is to introduce users to working with these frameworks on ZIH systems, e.g. on the HPC-DA system.

**Prerequisites:** To work with the frameworks, you need access to the ZIH system and basic knowledge about data analysis and the batch system Slurm.

The usage of Big Data frameworks differs from the usage of other modules due to their master-worker approach. That means that before an application can be started, additional steps are necessary. In the following, we assume that a Spark application should be started.
The steps are:
- Load the Spark software module
- Configure the Spark cluster
- Start a Spark cluster
- Start the Spark application
## Interactive jobs with Apache Spark using the default configuration
The Spark module is available for both the `scs5` and `ml` partitions. Thus, it can be used on different CPU architectures: Haswell, Power9 (`ml` partition), etc.
Let us assume that 2 nodes should be used for the computation. Use an `srun` command similar to the following to start an interactive session on the Haswell partition. The following code snippet shows a job submission to Haswell nodes with an allocation of 2 nodes with 60 GB main memory exclusively for 1 hour:

```console
marie@login$ srun --partition=haswell -N2 --mem=60g --exclusive --time=01:00:00 --pty bash -l
```
The command for a different resource allocation on the `ml` partition is similar, e.g. for a job submission to `ml` nodes with an allocation of 1 node, 1 task per node, 2 CPUs per task, 1 GPU per node, and 10000 MB memory per CPU for 1 hour:

```console
marie@login$ srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=10000 bash
```
Once you have the shell, load Spark using the following command:

```console
marie@compute$ module load Spark
```
Before the application can be started, the Spark cluster needs to be set up. To do this, configure Spark first using the configuration template at `$SPARK_HOME/conf`:

```console
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
```
This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your home directory, where `<JOB_ID>` stands for the job id of the Slurm job. After that, you can start Spark in the usual way:

```console
marie@compute$ start-all.sh
```
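If you would like to verify that the cluster processes came up, you can optionally list the running Java processes. This is only a quick sanity check and assumes that the `jps` tool from the loaded Java installation is on your `PATH`:

```console
marie@compute$ jps
```

The listed JVM processes should include a `Master` and, depending on the configuration, a `Worker` entry on this node.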
The Spark processes should now be set up and you can start your application, e.g.:

```console
marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000
```
!!! warning
Please do not delete the directory `cluster-conf-<JOB_ID>` while the job is still
running. This leads to errors.
## Batch jobs
Using `srun` directly on the shell blocks the shell and launches an interactive job. Apart from short test runs, it is recommended to launch your jobs in the background using batch jobs. For that, you can conveniently put the parameters directly into the job file, which you can submit using `sbatch [options] <job file>`.
Please use a batch job similar to `example-spark.sbatch`.
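The following is a minimal sketch of what such a job file could look like; it simply wraps the interactive steps shown above (loading the module, deriving the configuration, starting the cluster, and submitting the application). The `#SBATCH` resource values are placeholders taken from the interactive example and should be adapted to your project, and the final `stop-all.sh` call assumes you want to shut the cluster down before the job ends:

```bash
#!/bin/bash
#SBATCH --partition=haswell        # adapt to the partition you want to use
#SBATCH --nodes=2
#SBATCH --mem=60g
#SBATCH --exclusive
#SBATCH --time=01:00:00
#SBATCH --job-name=example-spark

# Load the Spark software module
module load Spark

# Derive the cluster configuration from the default template
source framework-configure.sh spark $SPARK_HOME/conf

# Start the Spark cluster on the allocated nodes
start-all.sh

# Run the actual Spark application
spark-submit --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000

# Stop the Spark cluster at the end of the job
stop-all.sh

exit 0
```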
## Apache Spark with Jupyter Notebook
There are two general options on how to work with Jupyter notebooks: There is JupyterHub, where you can simply run your Jupyter notebook on HPC nodes (the preferable way). Alternatively, you can run a remote Jupyter server manually within an `sbatch` GPU job with the modules and packages you need. You can find the manual server setup here.
### Preparation
If you want to run Spark in Jupyter notebooks, you have to prepare it first. This is comparable to the description for custom environments. You start with an allocation:

```console
marie@login$ srun --pty -n 1 -c 2 --mem-per-cpu 2500 -t 01:00:00 bash -l
```
When a node is allocated, install the required packages with Anaconda:

```console
marie@compute$ module load Anaconda3
marie@compute$ cd
marie@compute$ mkdir user-kernel
marie@compute$ conda create --prefix $HOME/user-kernel/haswell-py3.6-spark python=3.6
Collecting package metadata: done
Solving environment: done [...]

marie@compute$ conda activate $HOME/user-kernel/haswell-py3.6-spark
marie@compute$ conda install ipykernel
Collecting package metadata: done
Solving environment: done [...]

marie@compute$ python -m ipykernel install --user --name haswell-py3.6-spark --display-name="haswell-py3.6-spark"
Installed kernelspec haswell-py3.6-spark in [...]

marie@compute$ conda install -c conda-forge findspark
marie@compute$ conda install pyspark

marie@compute$ conda deactivate
```
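If you want to double-check that the new kernel has been registered before moving on, you can list the available kernel specifications. This is an optional check and assumes the `jupyter` command is provided by the loaded Anaconda3 module:

```console
marie@compute$ jupyter kernelspec list
```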
You are now ready to spawn a notebook with Spark.
### Spawning a notebook
Assuming that you have prepared everything as described above, you can go to [https://taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter). In the tab "Advanced", go to the field "Preload modules" and select one of the Spark modules. When your Jupyter instance is started, check whether the kernel that you created in the preparation phase (see above) is shown in the top right corner of the notebook. If it is not already selected, select the kernel `haswell-py3.6-spark`. Then, you can set up Spark. Since the setup in the notebook requires more steps than in an interactive session, we have created an example notebook that you can use as a starting point: `SparkExample.ipynb`
!!! note
    You could work with simple examples in your home directory, but according to the
    [storage concept](../data_lifecycle/hpc_storage_concept2019.md),
    **please use [workspaces](../data_lifecycle/workspaces.md) for
    your study and work projects**. For this reason, you have to use
    advanced options of JupyterHub and put "/" in the "Workspace scope" field.
## Interactive jobs using a custom configuration
The script `framework-configure.sh` is used to derive a configuration from a template. It takes two parameters:
- The framework to set up (Spark, Flink, Hadoop)
- A configuration template
Thus, you can modify the configuration by replacing the default configuration template with a customized one. This way, your custom configuration template is reusable for different jobs. You can start with a copy of the default configuration ahead of your interactive session:
```console
marie@login$ cp -r $SPARK_HOME/conf my-config-template
```
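As an illustration of such a change, you could raise the executor memory in the copied Spark defaults file. The property `spark.executor.memory` and the value `50g` are only an example, and this assumes the copied directory contains a `spark-defaults.conf` (otherwise, create it from the corresponding `.template` file first):

```console
marie@login$ echo "spark.executor.memory 50g" >> my-config-template/spark-defaults.conf
```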
After you have changed `my-config-template`, you can use your new template in an interactive job with:

```console
marie@compute$ source framework-configure.sh spark my-config-template
```
## Interactive jobs with Spark and Hadoop Distributed File System (HDFS)
If you want to use Spark and HDFS together (or, more generally, more than one framework), a scheme similar to the following can be used:
```console
marie@compute$ module load Hadoop
marie@compute$ module load Spark
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
marie@compute$ start-dfs.sh
marie@compute$ start-all.sh
```
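To check that HDFS is usable afterwards, a minimal smoke test could copy a file into the distributed file system and list it again. The directory name `/user/marie` and the test file are only examples:

```console
marie@compute$ echo "hello hdfs" > test.txt
marie@compute$ hdfs dfs -mkdir -p /user/marie
marie@compute$ hdfs dfs -put test.txt /user/marie/
marie@compute$ hdfs dfs -ls /user/marie
```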
!!! note
    It is recommended to use SSH keys to avoid entering the password every time you log in to the nodes. For details, please check the external documentation.
## FAQ
Q: The command `source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop` gives the output `bash: framework-configure.sh: No such file or directory`. How can this be resolved?

A: Please try to re-submit or re-run the job. If that does not help, re-login to the ZIH system.
Q: There are a lot of errors and warnings during the setup of the session.

A: Please check whether the setup works with a simple example. The warnings may originate from SSH or similar services and do not necessarily affect the frameworks.
!!! help
If you have questions or need advice, please see
[https://www.scads.de/transfer-2/beratung-und-support-en/](https://www.scads.de/transfer-2/beratung-und-support-en/) or contact the HPC support.