ZIH / hpcsupport / hpc-compendium
Commit 0eae110f
authored 3 years ago by Taras Lazariv
Fuse flink and spark documentation to one Big Data Frameworks page
parent 420b2c65
2 merge requests:
- !415 Added a specific file list containing all files to skip for each
- !409 Added short description about how to use Flink. Resolves #218.
Changes: 2 changed files with 139 additions and 54 deletions
- doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md (+137, -54)
- doc.zih.tu-dresden.de/mkdocs.yml (+2, -0)
doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md (+137, -54)
@@ -5,9 +5,14 @@ and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing an
 Big Data. These frameworks are also offered as software [modules](modules.md) in both `ml` and
 `scs5` software environments. You can check module versions and availability with the command

-```console
-marie@login$ module avail Spark
-```
+=== "Spark"
+    ```console
+    marie@login$ module avail Spark
+    ```
+=== "Flink"
+    ```console
+    marie@login$ module avail Flink
+    ```

 **Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH
 systems and basic knowledge about data analysis and the batch system
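As an illustrative aside to the hunk above (not part of this commit): after checking availability, a specific version is loaded by its full module name. A minimal sketch, reusing the Spark version string that appears in the batch example later in this diff; the versions actually installed on the ZIH systems may differ:

```console
marie@login$ module avail Spark
marie@compute$ module load Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
```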
@@ -15,7 +20,8 @@ systems and basic knowledge about data analysis and the batch system
 The usage of Big Data frameworks is different from other modules due to their master-worker
 approach. That means, before an application can be started, one has to do additional steps.
-In the following, we assume that a Spark application should be started.
+In the following, we assume that a Spark application should be started and give alternative
+commands for Flink where applicable.

 The steps are:
@@ -26,6 +32,7 @@ The steps are:
 Apache Spark can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as well
 as via [Jupyter notebooks](#jupyter-notebook). All three ways are outlined in the following.
+The usage of Flink with Jupyter notebooks is currently under examination.

 ## Interactive Jobs
@@ -43,32 +50,53 @@ memory exclusively for one hour:
 marie@login$ srun --partition=haswell --nodes=2 --mem=60g --exclusive --time=01:00:00 --pty bash -l
 ```

-Once you have the shell, load Spark using the command
+Once you have the shell, load desired Big Data framework using the command

-```console
-marie@compute$ module load Spark
-```
+=== "Spark"
+    ```console
+    marie@compute$ module load Spark
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ module load Flink
+    ```

 Before the application can be started, the Spark cluster needs to be set up. To do this, configure
 Spark first using configuration template at `$SPARK_HOME/conf`:

-```console
-marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
-```
+=== "Spark"
+    ```console
+    marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
+    ```

 This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your `home`
 directory, where `<JOB_ID>` stands for the id of the Slurm job. After that, you can start Spark in
 the usual way:

-```console
-marie@compute$ start-all.sh
-```
+=== "Spark"
+    ```console
+    marie@compute$ start-all.sh
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ start-cluster.sh
+    ```

 The Spark processes should now be set up and you can start your application, e. g.:

-```console
-marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
-```
+=== "Spark"
+    ```console
+    marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi \
+    $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
+    ```

 !!! warning
@@ -80,37 +108,57 @@ marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOM
 The script `framework-configure.sh` is used to derive a configuration from a template. It takes two
 parameters:

-- The framework to set up (Spark, Flink, Hadoop)
+- The framework to set up (parameter `spark` for Spark, `flink` for Flink, and `hadoop` for Hadoop)
 - A configuration template

 Thus, you can modify the configuration by replacing the default configuration template with a
 customized one. This way, your custom configuration template is reusable for different jobs. You
 can start with a copy of the default configuration ahead of your interactive session:

-```console
-marie@login$ cp -r $SPARK_HOME/conf my-config-template
-```
+=== "Spark"
+    ```console
+    marie@login$ cp -r $SPARK_HOME/conf my-config-template
+    ```
+=== "Flink"
+    ```console
+    marie@login$ cp -r $FLINK_ROOT_DIR/conf my-config-template
+    ```

 After you have changed `my-config-template`, you can use your new template in an interactive job
 with:

-```console
-marie@compute$ source framework-configure.sh spark my-config-template
-```
+=== "Spark"
+    ```console
+    marie@compute$ source framework-configure.sh spark my-config-template
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ source framework-configure.sh flink my-config-template
+    ```

 ### Using Hadoop Distributed Filesystem (HDFS)

 If you want to use Spark and HDFS together (or in general more than one framework), a scheme
 similar to the following can be used:

-```console
-marie@compute$ module load Hadoop
-marie@compute$ module load Spark
-marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
-marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
-marie@compute$ start-dfs.sh
-marie@compute$ start-all.sh
-```
+=== "Spark"
+    ```console
+    marie@compute$ module load Hadoop
+    marie@compute$ module load Spark
+    marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
+    marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
+    marie@compute$ start-dfs.sh
+    marie@compute$ start-all.sh
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ module load Hadoop
+    marie@compute$ module load Flink
+    marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
+    marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
+    marie@compute$ start-dfs.sh
+    marie@compute$ start-cluster.sh
+    ```

 ## Batch Jobs
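As an illustrative aside to the HDFS scheme in the hunk above (not part of this commit): once HDFS and the framework are running, data can be staged into HDFS with the standard `hdfs dfs` client before submitting a job. A minimal sketch; the directory and file names are hypothetical placeholders:

```console
marie@compute$ hdfs dfs -mkdir -p /user/marie
marie@compute$ hdfs dfs -put mydata.csv /user/marie/mydata.csv
marie@compute$ hdfs dfs -ls /user/marie
```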
@@ -122,41 +170,76 @@ that, you can conveniently put the parameters directly into the job file and sub
 Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration, similar to the
 example below:

-??? example "spark.sbatch"
-    ```bash
-    #!/bin/bash -l
-    #SBATCH --time=00:05:00
-    #SBATCH --partition=haswell
-    #SBATCH --nodes=2
-    #SBATCH --exclusive
-    #SBATCH --mem=60G
-    #SBATCH --job-name="example-spark"
-
-    ml Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
-
-    function myExitHandler () {
-    stop-all.sh
-    }
-
-    #configuration
-    . framework-configure.sh spark $SPARK_HOME/conf
-
-    #register cleanup hook in case something goes wrong
-    trap myExitHandler EXIT
-
-    start-all.sh
-
-    spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
-
-    stop-all.sh
-
-    exit 0
-    ```
+??? example "example-starting-script.sbatch"
+    === "Spark"
+        ```bash
+        #!/bin/bash -l
+        #SBATCH --time=00:05:00
+        #SBATCH --partition=haswell
+        #SBATCH --nodes=2
+        #SBATCH --exclusive
+        #SBATCH --mem=60G
+        #SBATCH --job-name="example-spark"
+
+        ml Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
+
+        function myExitHandler () {
+        stop-all.sh
+        }
+
+        #configuration
+        . framework-configure.sh spark $SPARK_HOME/conf
+
+        #register cleanup hook in case something goes wrong
+        trap myExitHandler EXIT
+
+        start-all.sh
+
+        spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
+
+        stop-all.sh
+
+        exit 0
+        ```
+
+    === "Flink"
+        ```bash
+        #!/bin/bash -l
+        #SBATCH --time=00:05:00
+        #SBATCH --partition=haswell
+        #SBATCH --nodes=2
+        #SBATCH --exclusive
+        #SBATCH --mem=50G
+        #SBATCH --job-name="example-flink"
+
+        ml Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0
+
+        function myExitHandler () {
+        stop-cluster.sh
+        }
+
+        #configuration
+        . framework-configure.sh flink $FLINK_ROOT_DIR/conf
+
+        #register cleanup hook in case something goes wrong
+        trap myExitHandler EXIT
+
+        #start the cluster
+        start-cluster.sh
+
+        #run your application
+        flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
+
+        #stop the cluster
+        stop-cluster.sh
+
+        exit 0
+        ```

 ## Jupyter Notebook

 You can run Jupyter notebooks with Spark on the ZIH systems in a similar way as described on the
-[JupyterHub](../access/jupyterhub.md) page.
+[JupyterHub](../access/jupyterhub.md) page. Interaction of Flink with JupyterHub is currently
+under examination and will be posted here upon availability.

 ### Preparation
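As an illustrative aside to the batch example in the hunk above (not part of this commit): the job file is submitted and monitored with the usual Slurm commands. A minimal sketch, assuming the script is saved as `example-starting-script.sbatch` and that Slurm writes its default `slurm-<JOB_ID>.out` output file:

```console
marie@login$ sbatch example-starting-script.sbatch
marie@login$ squeue -u marie
marie@login$ less slurm-<JOB_ID>.out
```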
doc.zih.tu-dresden.de/mkdocs.yml (+2, -0)
@@ -187,6 +187,8 @@ markdown_extensions:
       permalink: True
   - attr_list
   - footnotes
+  - pymdownx.tabbed:
+      alternate_style: true

 extra:
   homepage: https://tu-dresden.de
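As an illustrative aside (not part of this commit): the new `pymdownx.tabbed` entry is what renders the `=== "Spark"` / `=== "Flink"` blocks in the documentation diff above as tabs. A minimal sketch of previewing the change locally, assuming `mkdocs` and `pymdown-extensions` are installed and that the project needs no further plugins than shown here (the real setup may require more):

```console
marie@local$ pip install mkdocs mkdocs-material pymdown-extensions
marie@local$ mkdocs serve --config-file doc.zih.tu-dresden.de/mkdocs.yml
```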