Commit 741f9e14, authored 3 years ago by Taras Lazariv
Remove flink file and update big_data_frameworks.md
parent bc2d47d7
2 merge requests:

- !415 Added a specific file list containing all files to skip for each
- !409 Added short description about how to use Flink. Resolves #218.
Showing 2 changed files with 1 addition and 179 deletions:

- doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md (1 addition, 1 deletion)
- doc.zih.tu-dresden.de/docs/software/flink.md (0 additions, 178 deletions)
doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md → doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md (+1, −1)
-# Big Data Frameworks: Apache Spark
+# Big Data Frameworks

[Apache Spark](https://spark.apache.org/), [Apache Flink](https://flink.apache.org/)
and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating
...
doc.zih.tu-dresden.de/docs/software/flink.md deleted (100644 → 0, +0, −178)
# Apache Flink

[Apache Flink](https://flink.apache.org/) is a framework for processing and integrating Big Data.
It offers a similar API to [Apache Spark](big_data_frameworks_spark.md), but is more appropriate
for data stream processing. You can check module versions and availability with the command:

```console
marie@login$ module avail Flink
```
**Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH
systems and basic knowledge about data analysis and the batch system
[Slurm](../jobs_and_resources/slurm.md).

The usage of Big Data frameworks is different from other modules due to their master-worker
approach. That means that additional steps have to be performed before an application can be
started. The steps are listed below, followed by a condensed command sketch:
1. Load the Flink software module
1. Configure the Flink cluster
1. Start a Flink cluster
1. Start the Flink application
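Put together, a minimal sketch of these four steps inside an already allocated job could look like
this (using the default configuration template and the bundled KMeans example, both described in
the sections below):

```console
marie@compute$ module load Flink
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
marie@compute$ start-cluster.sh
marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
```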
Apache Flink can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as
described below.
## Interactive Jobs
### Default Configuration
Let us assume that two nodes should be used for the computation. Use an `srun` command similar to
the following to start an interactive session using the partition `haswell`. The following code
snippet shows a job submission to haswell nodes with an allocation of two nodes with 50 GB main
memory exclusively for one hour:

```console
marie@login$ srun --partition=haswell --nodes=2 --mem=50g --exclusive --time=01:00:00 --pty bash -l
```
Once you have the shell, load Flink using the command:

```console
marie@compute$ module load Flink
```
Before the application can be started, the Flink cluster needs to be set up. To do this, configure
Flink first using the configuration template at `$FLINK_ROOT_DIR/conf`:

```console
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
```
This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your `home`
directory, where `<JOB_ID>` stands for the ID of the Slurm job. After that, you can start Flink in
the usual way:

```console
marie@compute$ start-cluster.sh
```
The Flink processes should now be set up and you can start your application, e.g.:

```console
marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
```
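To run your own program, pass its JAR file to `flink run` in the same way. This is a hypothetical
sketch: `my-application.jar` and the trailing arguments are placeholders that are forwarded to your
program, not options of the module:

```console
marie@compute$ flink run ./my-application.jar --input input.csv --output results/
```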
!!! warning

    Do not delete the directory `cluster-conf-<JOB_ID>` while the job is still
    running. This leads to errors.
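If you want to double-check which configuration the job is using, you can list the generated
directory (a minimal sketch, assuming the directory name matches the job ID stored in the standard
Slurm variable `SLURM_JOB_ID`):

```console
marie@compute$ ls ~/cluster-conf-${SLURM_JOB_ID}
```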
### Custom Configuration
The script `framework-configure.sh` is used to derive a configuration from a template. It takes two
parameters:

- The framework to set up (Spark, Flink, Hadoop)
- A configuration template

Thus, you can modify the configuration by replacing the default configuration template with a
customized one. This way, your custom configuration template is reusable for different jobs. You
can start with a copy of the default configuration ahead of your interactive session:

```console
marie@login$ cp -r $FLINK_ROOT_DIR/conf my-config-template
```
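For instance, you could raise the task manager memory in the copied template before using it. This
is a hypothetical edit: it assumes the template contains Flink's standard `flink-conf.yaml` with a
`taskmanager.memory.process.size` entry, which may differ on ZIH systems:

```console
marie@login$ sed -i 's/^#\?taskmanager\.memory\.process\.size:.*/taskmanager.memory.process.size: 4096m/' my-config-template/flink-conf.yaml
```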
After you have changed `my-config-template`, you can use your new template in an interactive job
with:

```console
marie@compute$ source framework-configure.sh flink my-config-template
```
### Using Hadoop Distributed Filesystem (HDFS)
If you want to use Flink and HDFS together (or, more generally, more than one framework), a scheme
similar to the following can be used:

```console
marie@compute$ module load Hadoop
marie@compute$ module load Flink
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
marie@compute$ start-dfs.sh
marie@compute$ start-cluster.sh
```
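Once both HDFS and the Flink cluster are running, you can stage data into HDFS and point your Flink
job at it (a minimal sketch; `mydata.csv` and the target path are placeholders):

```console
marie@compute$ hdfs dfs -mkdir -p /user/marie
marie@compute$ hdfs dfs -put mydata.csv /user/marie/mydata.csv
marie@compute$ hdfs dfs -ls /user/marie
```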
## Batch Jobs
Using `srun` directly on the shell blocks the shell and launches an interactive job. Apart from
short test runs, it is **recommended to launch your jobs in the background using batch jobs**. For
that, you can conveniently put the parameters directly into the job file and submit it via
`sbatch [options] <job file>`.

Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration similar to the
example below:
??? example "flink.sbatch"
    ```bash
    #!/bin/bash -l
    #SBATCH --time=00:05:00
    #SBATCH --partition=haswell
    #SBATCH --nodes=2
    #SBATCH --exclusive
    #SBATCH --mem=50G
    #SBATCH --job-name="example-flink"

    ml Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0

    function myExitHandler () {
        stop-cluster.sh
    }

    # configuration
    . framework-configure.sh flink $FLINK_ROOT_DIR/conf

    # register cleanup hook in case something goes wrong
    trap myExitHandler EXIT

    # start the cluster
    start-cluster.sh

    # run your application
    flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar

    # stop the cluster
    stop-cluster.sh

    exit 0
    ```
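Assuming you saved the example above as `flink.sbatch`, the job can be submitted from a login node
with:

```console
marie@login$ sbatch flink.sbatch
```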
!!! note

    You could work with simple examples in your home directory, but, according to the
    [storage concept](../data_lifecycle/overview.md), **please use
    [workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. For this
    reason, you have to use the advanced options of JupyterHub and put "/" in the "Workspace scope"
    field.
## FAQ
Q: The command `source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop` gives the output
`bash: framework-configure.sh: No such file or directory`. How can this be resolved?

A: Please try to re-submit or re-run the job, and if that does not help, re-login to the ZIH
system.

Q: There are a lot of errors and warnings during the setup of the session.

A: Please verify that the setup works with a simple example, as shown in this documentation.
!!! help
If you have questions or need advice, please use the contact form on
[https://scads.ai/contact/](https://scads.ai/contact/) or contact the HPC support.