Commit e98e423c authored by Taras Lazariv

Merge branch '218-apache-flink' into 'preview'

Added short description about how to use Flink. Resolves #218.

Closes #218

See merge request !409
parents 0fc35ff0 a644af4c
# Big Data Frameworks

[Apache Spark](https://spark.apache.org/), [Apache Flink](https://flink.apache.org/)
and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating
Big Data. These frameworks are also offered as software [modules](modules.md) in both `ml` and
`scs5` software environments. You can check module versions and availability with the command

=== "Spark"

    ```console
    marie@login$ module avail Spark
    ```

=== "Flink"

    ```console
    marie@login$ module avail Flink
    ```

**Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH
systems and basic knowledge about data analysis and the batch system
...@@ -15,7 +20,8 @@ systems and basic knowledge about data analysis and the batch system
The usage of Big Data frameworks is different from other modules due to their master-worker
approach. That means, before an application can be started, one has to perform additional steps.
In the following, we assume that a Spark application should be started and give alternative
commands for Flink where applicable.

The steps are:
...@@ -26,6 +32,7 @@ The steps are:
Apache Spark can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as well
as via [Jupyter notebooks](#jupyter-notebook). All three ways are outlined in the following.
The usage of Flink with Jupyter notebooks is currently under examination.

## Interactive Jobs

...@@ -36,39 +43,60 @@ Thus, Spark can be executed using different CPU architectures, e.g., Haswell and
Let us assume that two nodes should be used for the computation. Use a `srun` command similar to
the following to start an interactive session using the partition `haswell`. The following code
snippet shows a job submission to haswell nodes with an allocation of two nodes with 60000 MB main
memory exclusively for one hour:

```console
marie@login$ srun --partition=haswell --nodes=2 --mem=60000M --exclusive --time=01:00:00 --pty bash -l
```

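If needed, you can verify which nodes were allocated to the session, for example via the standard Slurm job environment variables:

```console
marie@compute$ echo $SLURM_JOB_NODELIST
```
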
Once you have the shell, load the desired Big Data framework using the command

=== "Spark"

    ```console
    marie@compute$ module load Spark
    ```

=== "Flink"

    ```console
    marie@compute$ module load Flink
    ```

Before the application can be started, the Spark cluster needs to be set up. To do this, configure
Spark first using the configuration template at `$SPARK_HOME/conf`:

=== "Spark"

    ```console
    marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
    ```

=== "Flink"

    ```console
    marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
    ```

This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your `home`
directory, where `<JOB_ID>` stands for the ID of the Slurm job. After that, you can start Spark in
the usual way:

=== "Spark"

    ```console
    marie@compute$ start-all.sh
    ```

=== "Flink"

    ```console
    marie@compute$ start-cluster.sh
    ```

The Spark processes should now be set up and you can start your application, e.g.:

=== "Spark"

    ```console
    marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi \
        $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
    ```

=== "Flink"

    ```console
    marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
    ```

!!! warning
...@@ -80,37 +108,57 @@ marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOM
The script `framework-configure.sh` is used to derive a configuration from a template. It takes two
parameters:

- The framework to set up (parameter `spark` for Spark, `flink` for Flink, and `hadoop` for Hadoop)
- A configuration template

Thus, you can modify the configuration by replacing the default configuration template with a
customized one. This way, your custom configuration template is reusable for different jobs. You
can start with a copy of the default configuration ahead of your interactive session:

=== "Spark"

    ```console
    marie@login$ cp -r $SPARK_HOME/conf my-config-template
    ```

=== "Flink"

    ```console
    marie@login$ cp -r $FLINK_ROOT_DIR/conf my-config-template
    ```

After you have changed `my-config-template`, you can use your new template in an interactive job
with:

=== "Spark"

    ```console
    marie@compute$ source framework-configure.sh spark my-config-template
    ```

=== "Flink"

    ```console
    marie@compute$ source framework-configure.sh flink my-config-template
    ```

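If you are unsure which files of the template can be adjusted, it may help to list the copied directory first; for Spark this typically contains files such as `spark-defaults.conf` and `spark-env.sh`, for Flink a `flink-conf.yaml` (the exact set depends on the module version):

```console
marie@login$ ls my-config-template
```
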

### Using Hadoop Distributed Filesystem (HDFS)

If you want to use Spark and HDFS together (or in general more than one framework), a scheme
similar to the following can be used:

=== "Spark"

    ```console
    marie@compute$ module load Hadoop
    marie@compute$ module load Spark
    marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
    marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
    marie@compute$ start-dfs.sh
    marie@compute$ start-all.sh
    ```

=== "Flink"

    ```console
    marie@compute$ module load Hadoop
    marie@compute$ module load Flink
    marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
    marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
    marie@compute$ start-dfs.sh
    marie@compute$ start-cluster.sh
    ```

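Once `start-dfs.sh` has brought up HDFS, data can be staged into it with the standard Hadoop file system commands; the directory and file names below are only placeholders:

```console
marie@compute$ hdfs dfs -mkdir -p /user/marie
marie@compute$ hdfs dfs -put my-input.csv /user/marie/
marie@compute$ hdfs dfs -ls /user/marie
```
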

## Batch Jobs

...@@ -122,41 +170,76 @@ that, you can conveniently put the parameters directly into the job file and sub
Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration similar to the
example below:

??? example "example-starting-script.sbatch"

    === "Spark"

        ```bash
        #!/bin/bash -l
        #SBATCH --time=01:00:00
        #SBATCH --partition=haswell
        #SBATCH --nodes=2
        #SBATCH --exclusive
        #SBATCH --mem=60000M
        #SBATCH --job-name="example-spark"

        module load Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0

        function myExitHandler () {
            stop-all.sh
        }

        #configuration
        . framework-configure.sh spark $SPARK_HOME/conf

        #register cleanup hook in case something goes wrong
        trap myExitHandler EXIT

        start-all.sh

        spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000

        stop-all.sh

        exit 0
        ```

    === "Flink"

        ```bash
        #!/bin/bash -l
        #SBATCH --time=01:00:00
        #SBATCH --partition=haswell
        #SBATCH --nodes=2
        #SBATCH --exclusive
        #SBATCH --mem=60000M
        #SBATCH --job-name="example-flink"

        module load Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0

        function myExitHandler () {
            stop-cluster.sh
        }

        #configuration
        . framework-configure.sh flink $FLINK_ROOT_DIR/conf

        #register cleanup hook in case something goes wrong
        trap myExitHandler EXIT

        #start the cluster
        start-cluster.sh

        #run your application
        flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar

        #stop the cluster
        stop-cluster.sh

        exit 0
        ```

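The job file is then submitted from a login node in the usual way, e.g.:

```console
marie@login$ sbatch example-starting-script.sbatch
```
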

## Jupyter Notebook

You can run Jupyter notebooks with Spark on the ZIH systems in a similar way as described on the
[JupyterHub](../access/jupyterhub.md) page. The interaction of Flink with JupyterHub is currently
being examined and will be documented here once available.

### Preparation

......
...@@ -10,7 +10,7 @@ The following tools are available on ZIH systems, among others:
* [Python](data_analytics_with_python.md)
* [R](data_analytics_with_r.md)
* [RStudio](data_analytics_with_rstudio.md)
* [Big Data framework Spark](big_data_frameworks.md)
* [MATLAB and Mathematica](mathematics.md)

Detailed information about frameworks for machine learning, such as [TensorFlow](tensorflow.md)
......
...@@ -45,7 +45,7 @@ nav:
- Data Analytics with R: software/data_analytics_with_r.md
- Data Analytics with RStudio: software/data_analytics_with_rstudio.md
- Data Analytics with Python: software/data_analytics_with_python.md
- Big Data Analytics: software/big_data_frameworks.md
- Machine Learning:
  - Overview: software/machine_learning.md
  - TensorFlow: software/tensorflow.md
...@@ -187,6 +187,8 @@ markdown_extensions:
permalink: True
- attr_list
- footnotes
- pymdownx.tabbed:
    alternate_style: true
extra:
  homepage: https://tu-dresden.de
......
...@@ -79,6 +79,7 @@ FFT
FFTW
filesystem
filesystems
flink
Flink
FMA
foreach
......