From 741f9e14c81361a1e3d1782cf1b71ed2f50243c6 Mon Sep 17 00:00:00 2001
From: lazariv <taras.lazariv@tu-dresden.de>
Date: Thu, 18 Nov 2021 14:47:57 +0000
Subject: [PATCH] Remove flink file and update big_data_frameworks.md

---
 ...eworks_spark.md => big_data_frameworks.md} |   2 +-
 doc.zih.tu-dresden.de/docs/software/flink.md  | 178 ------------------
 2 files changed, 1 insertion(+), 179 deletions(-)
 rename doc.zih.tu-dresden.de/docs/software/{big_data_frameworks_spark.md => big_data_frameworks.md} (99%)
 delete mode 100644 doc.zih.tu-dresden.de/docs/software/flink.md

diff --git a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md
similarity index 99%
rename from doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md
rename to doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md
index f3e751566..247d35c54 100644
--- a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md
+++ b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md
@@ -1,4 +1,4 @@
-# Big Data Frameworks: Apache Spark
+# Big Data Frameworks

 [Apache Spark](https://spark.apache.org/), [Apache Flink](https://flink.apache.org/) and
 [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating

diff --git a/doc.zih.tu-dresden.de/docs/software/flink.md b/doc.zih.tu-dresden.de/docs/software/flink.md
deleted file mode 100644
index 4cc72a422..000000000
--- a/doc.zih.tu-dresden.de/docs/software/flink.md
+++ /dev/null
@@ -1,178 +0,0 @@

# Apache Flink

[Apache Flink](https://flink.apache.org/) is a framework for processing and integrating Big Data.
It offers a similar API to [Apache Spark](big_data_frameworks_spark.md), but is more appropriate
for data stream processing. You can check module versions and availability with the command:

```console
marie@login$ module avail Flink
```

**Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH
systems and basic knowledge about data analysis and the batch system
[Slurm](../jobs_and_resources/slurm.md).

The usage of Big Data frameworks differs from that of other modules due to their master-worker
approach. That means, before an application can be started, some additional steps are required:

1. Load the Flink software module
1. Configure the Flink cluster
1. Start a Flink cluster
1. Start the Flink application

Apache Flink can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as
described below.

## Interactive Jobs

### Default Configuration

Let us assume that two nodes should be used for the computation. Use an `srun` command similar to
the following to start an interactive session using the partition `haswell`. The following code
snippet shows a job submission to haswell nodes with an allocation of two nodes with 50 GB main
memory exclusively for one hour:

```console
marie@login$ srun --partition=haswell --nodes=2 --mem=50g --exclusive --time=01:00:00 --pty bash -l
```

Once you have the shell, load Flink using the command

```console
marie@compute$ module load Flink
```
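
If you want to verify that the module environment is set up as expected before continuing, you
can, for example, check that the `flink` command and the `$FLINK_ROOT_DIR` variable used in the
following steps are available:

```console
marie@compute$ which flink
marie@compute$ echo $FLINK_ROOT_DIR
```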

Before the application can be started, the Flink cluster needs to be set up. To do this, first
configure Flink using the configuration template at `$FLINK_ROOT_DIR/conf`:

```console
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
```

This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your home
directory, where `<JOB_ID>` stands for the ID of the Slurm job. After that, you can start Flink in
the usual way:

```console
marie@compute$ start-cluster.sh
```

The Flink processes should now be set up and you can start your application, e.g.:

```console
marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
```

!!! warning

    Do not delete the directory `cluster-conf-<JOB_ID>` while the job is still running. This
    leads to errors.

### Custom Configuration

The script `framework-configure.sh` is used to derive a configuration from a template. It takes
two parameters:

- The framework to set up (Spark, Flink, Hadoop)
- A configuration template

Thus, you can modify the configuration by replacing the default configuration template with a
customized one. This way, your custom configuration template is reusable for different jobs. You
can start with a copy of the default configuration ahead of your interactive session:

```console
marie@login$ cp -r $FLINK_ROOT_DIR/conf my-config-template
```

After you have changed `my-config-template`, you can use your new template in an interactive job
with:

```console
marie@compute$ source framework-configure.sh flink my-config-template
```

### Using Hadoop Distributed Filesystem (HDFS)

If you want to use Flink and HDFS together (or, in general, more than one framework), a scheme
similar to the following can be used:

```console
marie@compute$ module load Hadoop
marie@compute$ module load Flink
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
marie@compute$ start-dfs.sh
marie@compute$ start-cluster.sh
```

## Batch Jobs

Using `srun` directly on the shell blocks the shell and launches an interactive job. Apart from
short test runs, it is **recommended to launch your jobs in the background using batch jobs**. For
that, you can conveniently put the parameters directly into the job file and submit it via
`sbatch [options] <job file>`.

Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration similar to the
example below:

??? example "flink.sbatch"
    ```bash
    #!/bin/bash -l
    #SBATCH --time=00:05:00
    #SBATCH --partition=haswell
    #SBATCH --nodes=2
    #SBATCH --exclusive
    #SBATCH --mem=50G
    #SBATCH --job-name="example-flink"

    ml Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0

    function myExitHandler () {
        stop-cluster.sh
    }

    # configuration
    . framework-configure.sh flink $FLINK_ROOT_DIR/conf

    # register cleanup hook in case something goes wrong
    trap myExitHandler EXIT

    # start the cluster
    start-cluster.sh

    # run your application
    flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar

    # stop the cluster
    stop-cluster.sh

    exit 0
    ```

!!! note

    You could work with simple examples in your home directory, but, according to the
    [storage concept](../data_lifecycle/overview.md), **please use
    [workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. For this
    reason, you have to use the advanced options of JupyterHub and put "/" in the
    "Workspace scope" field.
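
For example, once the job file from the example above has been saved as `flink.sbatch` in a
suitable workspace (the path below is only a placeholder), the batch job can be submitted and
monitored as usual:

```console
marie@login$ cd /path/to/your/workspace    # placeholder: use your own workspace directory
marie@login$ sbatch flink.sbatch
marie@login$ squeue --user=marie
```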

## FAQ

Q: The command `source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop` gives the
output `bash: framework-configure.sh: No such file or directory`. How can this be resolved?

A: Please try to re-submit or re-run the job; if that does not help, re-login to the ZIH system.

Q: There are a lot of errors and warnings during the setup of the session. What can I do?

A: Please check that the setup works with a simple example as shown in this documentation.

!!! help

    If you have questions or need advice, please use the contact form on
    [https://scads.ai/contact/](https://scads.ai/contact/) or contact the HPC support.

--
GitLab