# Big Data Frameworks
[Apache Spark](https://spark.apache.org/), [Apache Flink](https://flink.apache.org/)
and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating
Big Data. These frameworks are also offered as software [modules](modules.md) in both `ml` and
`scs5` software environments. You can check module versions and availability with the command
=== "Spark"
```console
marie@login$ module avail Spark
```
=== "Flink"
```console
marie@login$ module avail Flink
```
**Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH
systems and basic knowledge about data analysis and the batch system [Slurm](../jobs_and_resources/slurm.md).
The usage of Big Data frameworks is different from other modules due to their master-worker
approach. That means that before an application can be started, additional steps have to be taken.
In the following, we assume that a Spark application should be started and give alternative
commands for Flink where applicable.
The steps are:

1. Load the framework's software module
1. Configure the cluster
1. Start the cluster
1. Start the application
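As a condensed preview, for Spark these steps boil down to the following sequence inside an
allocated job; each step, including the Flink equivalents, is explained in detail in the remainder
of this page:

```console
marie@compute$ module load Spark
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
marie@compute$ start-all.sh
marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
```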
Apache Spark can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as well
as via [Jupyter notebooks](#jupyter-notebook). All three ways are outlined in the following.
The usage of Flink with Jupyter notebooks is currently under examination.
## Interactive Jobs
Let us assume that two nodes should be used for the computation. Use a `srun` command similar to
the following to start an interactive session using the partition `haswell`. The following code
snippet shows a job submission to haswell nodes with an allocation of two nodes with 60000 MB main
memory exclusively for one hour:
```console
marie@login$ srun --partition=haswell --nodes=2 --mem=60000M --exclusive --time=01:00:00 --pty bash -l
```
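Optionally, you can verify the allocation from within the interactive shell before proceeding.
This small check only uses standard Slurm environment variables that are set inside a job
(`SLURM_MEM_PER_NODE` should be present whenever `--mem` was used):

```console
marie@compute$ echo $SLURM_JOB_NODELIST $SLURM_MEM_PER_NODE
```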
Once you have the shell, load the desired Big Data framework using the command:
=== "Spark"
```console
marie@compute$ module load Spark
```
=== "Flink"
```console
marie@compute$ module load Flink
```
Before the application can be started, the cluster needs to be set up. To do this, configure it
first using the configuration template at `$SPARK_HOME/conf` (Spark) or `$FLINK_ROOT_DIR/conf` (Flink):
=== "Spark"
```console
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
```
=== "Flink"
```console
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
```
This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your `home`
directory, where `<JOB_ID>` stands for the ID of the Slurm job. After that, you can start the
cluster in the usual way:
=== "Spark"
```console
marie@compute$ start-all.sh
```
=== "Flink"
```console
marie@compute$ start-cluster.sh
```
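If you want to check that the cluster daemons actually came up, the JDK's `jps` tool can help
(an optional check; it assumes the loaded module provides a full JDK on your `PATH`). For Spark,
the output should include `Master` and `Worker` processes; the process names for Flink differ:

```console
marie@compute$ jps
```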
The framework's processes should now be set up and you can start your application, e.g.:
=== "Spark"
```console
marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
```
=== "Flink"
```console
marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
```
!!! warning
The script `framework-configure.sh` is used to derive a configuration from a template. It takes two
parameters:
- The framework to set up (parameter `spark` for Spark, `flink` for Flink, and `hadoop` for Hadoop)
- A configuration template
Thus, you can modify the configuration by replacing the default configuration template with a
customized one. This way, your custom configuration template is reusable for different jobs. You
can start with a copy of the default configuration ahead of your interactive session:
=== "Spark"
```console
marie@login$ cp -r $SPARK_HOME/conf my-config-template
```
=== "Flink"
```console
marie@login$ cp -r $FLINK_ROOT_DIR/conf my-config-template
```
After you have changed `my-config-template`, you can use your new template in an interactive job
with:
=== "Spark"
```console
marie@compute$ source framework-configure.sh spark my-config-template
```
=== "Flink"
```console
marie@compute$ source framework-configure.sh flink my-config-template
```
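As a sketch of such a customization for the Spark case, you could, for example, cap the memory of
each Spark worker by appending to the template's `spark-env.sh` before starting the job.
`SPARK_WORKER_MEMORY` is a standard Spark setting; the value `50g` is only an illustration, and
the file is created if the copied template does not already contain it:

```console
marie@login$ echo "export SPARK_WORKER_MEMORY=50g" >> my-config-template/spark-env.sh
```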
### Using Hadoop Distributed Filesystem (HDFS)
If you want to use Spark and HDFS together (or in general more than one framework), a scheme
similar to the following can be used:
=== "Spark"
```console
marie@compute$ module load Hadoop
marie@compute$ module load Spark
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
marie@compute$ start-dfs.sh
marie@compute$ start-all.sh
```
=== "Flink"
```console
marie@compute$ module load Hadoop
marie@compute$ module load Flink
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
marie@compute$ start-dfs.sh
marie@compute$ start-cluster.sh
```
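Once HDFS is up, you can interact with it using the usual `hdfs dfs` commands, independent of the
framework. A minimal sketch (the file name `input.csv` and the HDFS path are only illustrations):

```console
marie@compute$ echo "1,2,3" > input.csv
marie@compute$ hdfs dfs -mkdir -p /user/marie
marie@compute$ hdfs dfs -put input.csv /user/marie/input.csv
marie@compute$ hdfs dfs -ls /user/marie
```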
## Batch Jobs
Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration similar to the
example below:
??? example "spark.sbatch"
```bash
#!/bin/bash -l
#SBATCH --time=00:05:00
#SBATCH --partition=haswell
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --mem=50G
#SBATCH --job-name="example-spark"
??? example "example-starting-script.sbatch"
=== "Spark"
```bash
#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --partition=haswell
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --mem=60000M
#SBATCH --job-name="example-spark"
module load Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
function myExitHandler () {
stop-all.sh
}
#configuration
. framework-configure.sh spark $SPARK_HOME/conf
#register cleanup hook in case something goes wrong
trap myExitHandler EXIT
ml Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
start-all.sh
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
function myExitHandler () {
stop-all.sh
}
#configuration
. framework-configure.sh spark $SPARK_HOME/conf
exit 0
```
=== "Flink"
```bash
#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --partition=haswell
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --mem=60000M
#SBATCH --job-name="example-flink"
#register cleanup hook in case something goes wrong
trap myExitHandler EXIT
module load Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0
start-all.sh
function myExitHandler () {
stop-cluster.sh
}
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
#configuration
. framework-configure.sh flink $FLINK_ROOT_DIR/conf
stop-all.sh
#register cleanup hook in case something goes wrong
trap myExitHandler EXIT
exit 0
```
#start the cluster
start-cluster.sh
#run your application
flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
#stop the cluster
stop-cluster.sh
exit 0
```
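Assuming you saved the script as `example-starting-script.sbatch`, it is submitted like any other
batch job:

```console
marie@login$ sbatch example-starting-script.sbatch
```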
## Jupyter Notebook
You can run Jupyter notebooks with Spark on the ZIH systems in a similar way as described on the
[JupyterHub](../access/jupyterhub.md) page. Interaction of Flink with JupyterHub is currently
under examination and will be posted here upon availability.
### Preparation
Loading