ZIH / hpcsupport / hpc-compendium
Commit 0eae110f
authored 3 years ago by Taras Lazariv
Fuse flink and spark documentation to one Big Data Frameworks page
parent 420b2c65
2 merge requests:
- !415 Added a specific file list containing all files to skip for each
- !409 Added short description about how to use Flink. Resolves #218.
Changes: 2 changed files with 139 additions and 54 deletions
- doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md (+137, -54)
- doc.zih.tu-dresden.de/mkdocs.yml (+2, -0)
doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md (+137, -54)
@@ -5,9 +5,14 @@ and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing an
 Big Data. These frameworks are also offered as software [modules](modules.md) in both `ml` and
 `scs5` software environments. You can check module versions and availability with the command

-```console
-marie@login$ module avail Spark
-```
+=== "Spark"
+    ```console
+    marie@login$ module avail Spark
+    ```
+=== "Flink"
+    ```console
+    marie@login$ module avail Flink
+    ```

 **Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH
 systems and basic knowledge about data analysis and the batch system
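As an illustrative aside to the hunk above (not part of this commit): after checking availability, a specific version is loaded by its full module name. A minimal sketch, reusing the Spark version string that appears in the batch example later in this diff; the versions actually installed on the ZIH systems may differ:

```console
marie@login$ module avail Spark
marie@compute$ module load Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
```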
@@ -15,7 +20,8 @@ systems and basic knowledge about data analysis and the batch system
 The usage of Big Data frameworks is different from other modules due to their master-worker
 approach. That means, before an application can be started, one has to do additional steps.
-In the following, we assume that a Spark application should be started.
+In the following, we assume that a Spark application should be started and give alternative
+commands for Flink where applicable.

 The steps are:
@@ -26,6 +32,7 @@ The steps are:
 Apache Spark can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as well
 as via [Jupyter notebooks](#jupyter-notebook). All three ways are outlined in the following.
+The usage of Flink with Jupyter notebooks is currently under examination.

 ## Interactive Jobs
@@ -43,32 +50,53 @@ memory exclusively for one hour:
 marie@login$ srun --partition=haswell --nodes=2 --mem=60g --exclusive --time=01:00:00 --pty bash -l
 ```

-Once you have the shell, load Spark using the command
+Once you have the shell, load desired Big Data framework using the command

-```console
-marie@compute$ module load Spark
-```
+=== "Spark"
+    ```console
+    marie@compute$ module load Spark
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ module load Flink
+    ```

 Before the application can be started, the Spark cluster needs to be set up. To do this, configure
 Spark first using configuration template at `$SPARK_HOME/conf`:

-```console
-marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
-```
+=== "Spark"
+    ```console
+    marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
+    ```

 This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your `home`
 directory, where `<JOB_ID>` stands for the id of the Slurm job. After that, you can start Spark in
 the usual way:

-```console
-marie@compute$ start-all.sh
-```
+=== "Spark"
+    ```console
+    marie@compute$ start-all.sh
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ start-cluster.sh
+    ```

 The Spark processes should now be set up and you can start your application, e. g.:

-```console
-marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
-```
+=== "Spark"
+    ```console
+    marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi \
+    $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
+    ```

 !!! warning
@@ -80,37 +108,57 @@ marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOM
 The script `framework-configure.sh` is used to derive a configuration from a template. It takes two
 parameters:

-- The framework to set up (Spark, Flink, Hadoop)
+- The framework to set up (parameter `spark` for Spark, `flink` for Flink, and `hadoop` for Hadoop)
 - A configuration template

 Thus, you can modify the configuration by replacing the default configuration template with a
 customized one. This way, your custom configuration template is reusable for different jobs. You
 can start with a copy of the default configuration ahead of your interactive session:

-```console
-marie@login$ cp -r $SPARK_HOME/conf my-config-template
-```
+=== "Spark"
+    ```console
+    marie@login$ cp -r $SPARK_HOME/conf my-config-template
+    ```
+=== "Flink"
+    ```console
+    marie@login$ cp -r $FLINK_ROOT_DIR/conf my-config-template
+    ```

 After you have changed `my-config-template`, you can use your new template in an interactive job
 with:

-```console
-marie@compute$ source framework-configure.sh spark my-config-template
-```
+=== "Spark"
+    ```console
+    marie@compute$ source framework-configure.sh spark my-config-template
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ source framework-configure.sh flink my-config-template
+    ```

 ### Using Hadoop Distributed Filesystem (HDFS)

 If you want to use Spark and HDFS together (or in general more than one framework), a scheme
 similar to the following can be used:

-```console
-marie@compute$ module load Hadoop
-marie@compute$ module load Spark
-marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
-marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
-marie@compute$ start-dfs.sh
-marie@compute$ start-all.sh
-```
+=== "Spark"
+    ```console
+    marie@compute$ module load Hadoop
+    marie@compute$ module load Spark
+    marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
+    marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
+    marie@compute$ start-dfs.sh
+    marie@compute$ start-all.sh
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ module load Hadoop
+    marie@compute$ module load Flink
+    marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
+    marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
+    marie@compute$ start-dfs.sh
+    marie@compute$ start-cluster.sh
+    ```

 ## Batch Jobs
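As an illustrative aside to the HDFS scheme in the hunk above (not part of this commit): once HDFS and the framework are running, data can be staged into HDFS with the standard `hdfs dfs` client before submitting a job. A minimal sketch; the directory and file names are hypothetical placeholders:

```console
marie@compute$ hdfs dfs -mkdir -p /user/marie
marie@compute$ hdfs dfs -put mydata.csv /user/marie/mydata.csv
marie@compute$ hdfs dfs -ls /user/marie
```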
@@ -122,41 +170,76 @@ that, you can conveniently put the parameters directly into the job file and sub
 Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration, similar to the
 example below:

-??? example "spark.sbatch"
-    ```bash
-    #!/bin/bash -l
-    #SBATCH --time=00:05:00
-    #SBATCH --partition=haswell
-    #SBATCH --nodes=2
-    #SBATCH --exclusive
-    #SBATCH --mem=60G
-    #SBATCH --job-name="example-spark"
-
-    ml Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
-
-    function myExitHandler () {
-    stop-all.sh
-    }
-
-    #configuration
-    . framework-configure.sh spark $SPARK_HOME/conf
-
-    #register cleanup hook in case something goes wrong
-    trap myExitHandler EXIT
-
-    start-all.sh
-
-    spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
-
-    stop-all.sh
-
-    exit 0
-    ```
+??? example "example-starting-script.sbatch"
+    === "Spark"
+        ```bash
+        #!/bin/bash -l
+        #SBATCH --time=00:05:00
+        #SBATCH --partition=haswell
+        #SBATCH --nodes=2
+        #SBATCH --exclusive
+        #SBATCH --mem=60G
+        #SBATCH --job-name="example-spark"
+
+        ml Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
+
+        function myExitHandler () {
+        stop-all.sh
+        }
+
+        #configuration
+        . framework-configure.sh spark $SPARK_HOME/conf
+
+        #register cleanup hook in case something goes wrong
+        trap myExitHandler EXIT
+
+        start-all.sh
+
+        spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
+
+        stop-all.sh
+
+        exit 0
+        ```
+
+    === "Flink"
+        ```bash
+        #!/bin/bash -l
+        #SBATCH --time=00:05:00
+        #SBATCH --partition=haswell
+        #SBATCH --nodes=2
+        #SBATCH --exclusive
+        #SBATCH --mem=50G
+        #SBATCH --job-name="example-flink"
+
+        ml Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0
+
+        function myExitHandler () {
+        stop-cluster.sh
+        }
+
+        #configuration
+        . framework-configure.sh flink $FLINK_ROOT_DIR/conf
+
+        #register cleanup hook in case something goes wrong
+        trap myExitHandler EXIT
+
+        #start the cluster
+        start-cluster.sh
+
+        #run your application
+        flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
+
+        #stop the cluster
+        stop-cluster.sh
+
+        exit 0
+        ```

 ## Jupyter Notebook

 You can run Jupyter notebooks with Spark on the ZIH systems in a similar way as described on the
-[JupyterHub](../access/jupyterhub.md) page.
+[JupyterHub](../access/jupyterhub.md) page. Interaction of Flink with JupyterHub is currently
+under examination and will be posted here upon availability.

 ### Preparation
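As an illustrative aside to the batch example in the hunk above (not part of this commit): the job file is submitted and monitored with the usual Slurm commands. A minimal sketch, assuming the script is saved as `example-starting-script.sbatch` and that Slurm writes its default `slurm-<JOB_ID>.out` output file:

```console
marie@login$ sbatch example-starting-script.sbatch
marie@login$ squeue -u marie
marie@login$ less slurm-<JOB_ID>.out
```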
doc.zih.tu-dresden.de/mkdocs.yml (+2, -0)
@@ -187,6 +187,8 @@ markdown_extensions:
       permalink: True
   - attr_list
   - footnotes
+  - pymdownx.tabbed:
+      alternate_style: true

 extra:
   homepage: https://tu-dresden.de
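As an illustrative aside (not part of this commit): the new `pymdownx.tabbed` entry is what renders the `=== "Spark"` / `=== "Flink"` blocks in the documentation diff above as tabs. A minimal sketch of previewing the change locally, assuming `mkdocs` and `pymdown-extensions` are installed and that the project needs no further plugins than shown here (the real setup may require more):

```console
marie@local$ pip install mkdocs mkdocs-material pymdown-extensions
marie@local$ mkdocs serve --config-file doc.zih.tu-dresden.de/mkdocs.yml
```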