Commit 6b296d87 authored by Elias Werner

remove separate flink.md file and move jupyter flink content to big_data_frameworks.md

@@ -32,7 +32,6 @@ The steps are:
Apache Spark can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as well
as via [Jupyter notebooks](#jupyter-notebook). All three ways are outlined in the following.
## Interactive Jobs
@@ -238,27 +237,36 @@ example below:
## Jupyter Notebook
You can run Jupyter notebooks with Spark and Flink on the ZIH systems in a similar way as described
on the [JupyterHub](../access/jupyterhub.md) page.
### Spawning a Notebook
Go to [https://taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter).
In the tab "Advanced", go to the field "Preload modules" and select the following Spark or Flink
module:
=== "Spark"
```
Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
```
=== "Flink"
```
Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0
```
When your Jupyter instance has started, you can set up Spark or Flink. Since the setup in the
notebook requires more steps than in an interactive session, we provide example notebooks that you
can use as a starting point:
[SparkExample.ipynb](misc/SparkExample.ipynb),
[FlinkExample.ipynb](misc/FlinkExample.ipynb)
!!! warning
    The notebooks only work with the Spark or Flink module mentioned above. When using other
    Spark/Flink modules, you may have to perform additional or different steps to get Spark or
    Flink running.
!!! note
    ...

---

**flink.md** (removed by this commit):
# Apache Flink
[Apache Flink](https://flink.apache.org/) is a framework for processing and integrating Big Data.
It offers an API similar to [Apache Spark](big_data_frameworks.md), but is more appropriate
for data stream processing. You can check module versions and availability with the command:
```console
marie@login$ module avail Flink
```
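If several versions are installed, you can load a specific one explicitly; for example, the version
used throughout this page:

```console
marie@login$ module load Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0
```

A plain `module load Flink` loads the module's default version.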
**Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH
systems and basic knowledge about data analysis and the batch system
[Slurm](../jobs_and_resources/slurm.md).
The usage of Big Data frameworks differs from that of other modules due to their master-worker
approach. This means that, before an application can be started, some additional steps have to be
performed:
1. Load the Flink software module
1. Configure the Flink cluster
1. Start a Flink cluster
1. Start the Flink application
Apache Flink can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as
described below.
## Interactive Jobs
### Default Configuration
Let us assume that two nodes should be used for the computation. The following `srun` command
starts an interactive session on the partition `haswell` and allocates two nodes with 60 GB main
memory exclusively for one hour:
```console
marie@login$ srun --partition=haswell --nodes=2 --mem=60g --exclusive --time=01:00:00 --pty bash -l
```
Once you have the shell, load Flink using the command:
```console
marie@compute$ module load Flink
```
Before the application can be started, the Flink cluster needs to be set up. To do this, first
configure Flink using the configuration template at `$FLINK_ROOT_DIR/conf`:
```console
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
```
This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your `home`
directory, where `<JOB_ID>` stands for the ID of the Slurm job. After that, you can start Flink in
the usual way:
```console
marie@compute$ start-cluster.sh
```
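As a quick sanity check, not part of the original setup steps, you can list the local Java
processes with the JDK tool `jps`. With a standalone cluster of this Flink version, you would
typically expect a `StandaloneSessionClusterEntrypoint` (the master) on the node where you started
the cluster and `TaskManagerRunner` processes on the worker nodes:

```console
marie@compute$ jps
```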
The Flink processes should now be running, and you can start your application, e.g.:
```console
marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
```
!!! warning

    Do not delete the directory `cluster-conf-<JOB_ID>` while the job is still running. This leads
    to errors.
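When you are done, shut the cluster down before the allocation ends, just as the batch job example
below does:

```console
marie@compute$ stop-cluster.sh
```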
### Custom Configuration
The script `framework-configure.sh` is used to derive a configuration from a template. It takes two
parameters:
- The framework to set up (Spark, Flink, Hadoop)
- A configuration template
Thus, you can modify the configuration by replacing the default configuration template with a
customized one. This way, your custom configuration template is reusable for different jobs. You
can start with a copy of the default configuration ahead of your interactive session:
```console
marie@login$ cp -r $FLINK_ROOT_DIR/conf my-config-template
```
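What you change in the template depends on your use case. As an illustration, a minimal sketch
that sets the number of task slots per TaskManager to four; `taskmanager.numberOfTaskSlots` is a
standard Flink setting in `flink-conf.yaml`, but please check the defaults shipped with the module
before editing:

```console
marie@login$ sed -i 's/^taskmanager.numberOfTaskSlots: 1$/taskmanager.numberOfTaskSlots: 4/' my-config-template/flink-conf.yaml
```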
After you have changed `my-config-template`, you can use your new template in an interactive job
with:
```console
marie@compute$ source framework-configure.sh flink my-config-template
```
### Using Hadoop Distributed Filesystem (HDFS)
If you want to use Flink and HDFS together (or in general more than one framework), a scheme
similar to the following can be used:
```console
marie@compute$ module load Hadoop
marie@compute$ module load Flink
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
marie@compute$ start-dfs.sh
marie@compute$ start-cluster.sh
```
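With both clusters running, Flink jobs can address HDFS paths via `hdfs://` URLs. A minimal
sketch, assuming the generated Hadoop configuration provides the default filesystem, using a
hypothetical input file `mydata.txt` and the WordCount example shipped with Flink:

```console
marie@compute$ hdfs dfs -mkdir -p /user/marie/input
marie@compute$ hdfs dfs -put mydata.txt /user/marie/input/
marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/WordCount.jar \
              --input hdfs:///user/marie/input/mydata.txt --output hdfs:///user/marie/output/wordcount
```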
## Batch Jobs
Using `srun` directly on the shell blocks the shell and launches an interactive job. Apart from
short test runs, it is **recommended to launch your jobs in the background using batch jobs**. For
that, you can conveniently put the parameters directly into the job file and submit it via
`sbatch [options] <job file>`.
Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration similar to the
example below:
??? example "flink.sbatch"

    ```bash
    #!/bin/bash -l
    #SBATCH --time=00:05:00
    #SBATCH --partition=haswell
    #SBATCH --nodes=2
    #SBATCH --exclusive
    #SBATCH --mem=60G
    #SBATCH --job-name="example-flink"

    ml Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0

    function myExitHandler () {
        stop-cluster.sh
    }

    # configuration
    . framework-configure.sh flink $FLINK_ROOT_DIR/conf

    # register cleanup hook in case something goes wrong
    trap myExitHandler EXIT

    # start the cluster
    start-cluster.sh

    # run your application
    flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar

    # stop the cluster
    stop-cluster.sh

    exit 0
    ```
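Submit the job file from a login node in the usual way:

```console
marie@login$ sbatch flink.sbatch
```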
!!! note

    You could work with simple examples in your home directory, but, according to the
    [storage concept](../data_lifecycle/overview.md), **please use
    [workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. For this
    reason, you have to use the advanced options of JupyterHub and put `/` in the "Workspace
    scope" field.
## Jupyter Notebook
You can run Jupyter notebooks with Flink on the ZIH systems in a similar way as described on the
[JupyterHub](../access/jupyterhub.md) page.
### Spawning a Notebook
Go to [https://taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter).
In the tab "Advanced", go to the field "Preload modules" and select the following Flink module:
```
Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0
```
When your Jupyter instance has started, you can set up Flink. Since the setup in the notebook
requires more steps than in an interactive session, we provide an example notebook that you can
use as a starting point: [FlinkExample.ipynb](misc/FlinkExample.ipynb)
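If you prefer to do the setup yourself instead of using the notebook, the cluster start inside
Jupyter follows the same steps as in an interactive session. A minimal sketch of a notebook cell,
assuming the Flink module from above is preloaded (the actual example notebook may differ):

```bash
%%bash
# configure Flink from the default template and start the cluster
source framework-configure.sh flink $FLINK_ROOT_DIR/conf
start-cluster.sh
```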
!!! warning

    This notebook only works with the Flink module mentioned above. When using other Flink
    modules, you may have to perform additional or different steps to get Flink running.
!!! note

    You could work with simple examples in your home directory, but, according to the
    [storage concept](../data_lifecycle/overview.md), **please use
    [workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. For this
    reason, you have to use the advanced options of JupyterHub and put `/` in the "Workspace
    scope" field.
## FAQ
**Q:** The command `source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop` gives the
output `bash: framework-configure.sh: No such file or directory`. How can this be resolved?

**A:** Please try to re-submit or re-run the job; if that does not help, re-login to the ZIH
system.

**Q:** There are a lot of errors and warnings during the setup of the session.

**A:** Please verify basic functionality with a simple example as shown in this documentation.
!!! help

    If you have questions or need advice, please use the contact form on
    [https://scads.ai/contact/](https://scads.ai/contact/) or contact the HPC support.