
# Big Data Frameworks: Apache Spark, Apache Flink, Apache Hadoop

!!! note

    This page is under construction.

Apache Spark, Apache Flink and Apache Hadoop are frameworks for processing and integrating Big Data. These frameworks are also offered as software modules on Taurus for both the ml and scs5 partitions. You can check the module availability with the command:

marie@login$ module av Spark

The aim of this page is to introduce users to working with these frameworks on ZIH systems, e.g. on the HPC-DA system.

Prerequisites: To work with the frameworks, you need access to the ZIH system and basic knowledge about data analysis and the batch system Slurm.

The usage of Big Data frameworks is different from the usage of other modules due to their master-worker approach. That means that additional steps have to be taken before an application can be started. In the following, we assume that a Spark application should be started; a condensed overview follows the list of steps below.

The steps are:

  1. Load the Spark software module
  2. Configure the Spark cluster
  3. Start a Spark cluster
  4. Start the Spark application
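
In condensed form, once you have an interactive allocation and a shell on a compute node (as described in the next section), the whole procedure consists of the following commands, which are explained in detail in the remainder of this page:

marie@compute$ module load Spark
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
marie@compute$ start-all.sh
marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000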

## Interactive jobs with Apache Spark using the default configuration

The Spark module is available for both the scs5 and ml partitions. Thus, it can be used on different CPU architectures: Haswell, Power9 (ml partition), etc.

Let us assume that two nodes should be used for the computation. Use an `srun` command similar to the following to start an interactive session on the haswell partition. The following code snippet shows a job submission to haswell nodes with an allocation of two nodes with 60 GB main memory exclusively for one hour:

marie@login$ srun --partition=haswell -N2 --mem=60g --exclusive --time=01:00:00 --pty bash -l

The command for a different resource allocation on the ml partition is similar, e.g. for a job submission to ml nodes with an allocation of one node, one task per node, two CPUs per task, one GPU per node and 10000 MB memory per CPU for one hour:

marie@login$ srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=10000 bash

Once you have the shell, load Spark using the following command:

marie@compute$ module load Spark

Before the application can be started, the Spark cluster needs to be set up. To do this, configure Spark first using the configuration template at `$SPARK_HOME/conf`:

marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf

This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your home directory, where `<JOB_ID>` stands for the id of the Slurm job. After that, you can start Spark in the usual way:

marie@compute$ start-all.sh
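
To check that the master and the worker processes have actually come up, you can, for example, list the running Java processes; this assumes that the jps tool of the JDK is available in your environment after loading the Spark module:

marie@compute$ jps

The output should contain a Master and at least one Worker process.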

The Spark processes should now be set up and you can start your application, e.g.:

marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000

!!! warning

    Please do not delete the directory `cluster-conf-<JOB_ID>` while the job is still
    running. This leads to errors.

## Batch jobs

Using `srun` directly on the shell blocks the shell and launches an interactive job. Apart from short test runs, it is recommended to launch your jobs in the background using batch jobs. For that, you can conveniently put the parameters directly into the job file, which you can submit using `sbatch [options] <job file>`.

Please use a batch job similar to `example-spark.sbatch`.
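
In case you do not have that file at hand, the following is a minimal sketch of such a job file, assembled from the interactive steps above; the partition, the resources and the SparkPi example are only placeholders that you have to adapt to your application:

```bash
#!/bin/bash -l
#SBATCH --partition=haswell
#SBATCH --nodes=2
#SBATCH --mem=60g
#SBATCH --exclusive
#SBATCH --time=01:00:00
#SBATCH --job-name=spark-example

# Load and configure Spark as in the interactive session
module load Spark
source framework-configure.sh spark $SPARK_HOME/conf

# Start the Spark cluster inside the Slurm allocation
start-all.sh

# Run the application (here: the SparkPi example that ships with Spark)
spark-submit --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000

# Stop the Spark cluster at the end of the job (stop-all.sh is the counterpart to start-all.sh)
stop-all.sh
```

You can then submit the job file with `sbatch example-spark.sbatch`.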

## Apache Spark with Jupyter Notebook

There are two general options for working with Jupyter notebooks: You can use JupyterHub, where you simply run your Jupyter notebook on HPC nodes (the preferable way), or you can run a remote Jupyter server manually within an sbatch GPU job, with the modules and packages you need. You can find the manual server setup here.

### Preparation

If you want to run Spark in Jupyter notebooks, you have to prepare it first. This is comparable to the description for custom environments. You start with an allocation:

marie@login$ srun --pty -n 1 -c 2 --mem-per-cpu 2500 -t 01:00:00 bash -l

When a node is allocated, install the required packages with Anaconda:

marie@compute$ module load Anaconda3
marie@compute$ cd
marie@compute$ mkdir user-kernel
marie@compute$ conda create --prefix $HOME/user-kernel/haswell-py3.6-spark python=3.6
Collecting package metadata: done
Solving environment: done [...]

marie@compute$ conda activate $HOME/user-kernel/haswell-py3.6-spark
marie@compute$ conda install ipykernel
Collecting package metadata: done
Solving environment: done [...]

marie@compute$ python -m ipykernel install --user --name haswell-py3.6-spark --display-name="haswell-py3.6-spark"
Installed kernelspec haswell-py3.6-spark in [...]

marie@compute$ conda install -c conda-forge findspark
marie@compute$ conda install pyspark

marie@compute$ conda deactivate

You are now ready to spawn a notebook with Spark.

### Spawning a notebook

Assuming that you have prepared everything as described above, you can go to https://taurus.hrsk.tu-dresden.de/jupyter. In the tab "Advanced", go to the field "Preload modules" and select one of the Spark modules. When your Jupyter instance has started, check whether the kernel that you created in the preparation phase (see above) is shown in the top right corner of the notebook. If it is not already selected, select the kernel haswell-py3.6-spark. Then, you can set up Spark. Since the setup in the notebook requires more steps than in an interactive session, we have created an example notebook that you can use as a starting point for convenience: SparkExample.ipynb

!!! note

    You can work with simple examples in your home directory, but, according to the
    [storage concept](../data_lifecycle/hpc_storage_concept2019.md),
    **please use [workspaces](../data_lifecycle/workspaces.md) for
    your study and work projects**. To this end, you have to use the
    advanced options of JupyterHub and put "/" in the "Workspace scope" field.

## Interactive jobs using a custom configuration

The script `framework-configure.sh` is used to derive a configuration from a template. It takes two parameters:

- The framework to set up (Spark, Flink, Hadoop)
- A configuration template

Thus, you can modify the configuration by replacing the default configuration template with a customized one. This way, your custom configuration template is reusable for different jobs. You can start with a copy of the default configuration ahead of your interactive session:

marie@login$ cp -r $SPARK_HOME/conf my-config-template
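
For example, you could append additional Spark properties to the copied configuration. The property used below, spark.executor.memory, is a standard Spark setting; the file name assumes that your copy contains a spark-defaults.conf (if only the .template file is present, create the file from it first):

marie@login$ echo "spark.executor.memory 10g" >> my-config-template/spark-defaults.conf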

After you have changed my-config-template, you can use your new template in an interactive job with:

marie@compute$ source framework-configure.sh spark my-config-template 

## Interactive jobs with Spark and Hadoop Distributed File System (HDFS)

If you want to use Spark and HDFS together (or in general more than one framework), a scheme similar to the following can be used:

marie@compute$ module load Hadoop
marie@compute$ module load Spark
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
marie@compute$ start-dfs.sh
marie@compute$ start-all.sh
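
Once HDFS and Spark are running, you can, for instance, use the standard Hadoop command line tools to copy data into HDFS and reference it from your Spark application; the directory and file names below are only placeholders:

marie@compute$ hdfs dfs -mkdir -p /user/marie
marie@compute$ hdfs dfs -put my-input.csv /user/marie/
marie@compute$ hdfs dfs -ls /user/marie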

Note: It is recommended to use SSH keys to avoid having to enter a password every time the framework scripts log in to the nodes. For details, please check the external documentation.
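
A minimal sketch of such a key setup, assuming that your home directory (and therefore ~/.ssh) is shared across the nodes and that you do not have a key pair yet (accept the default file name when prompted):

marie@login$ ssh-keygen -t ed25519 -N ""
marie@login$ cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
marie@login$ chmod 600 ~/.ssh/authorized_keys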

## FAQ

Q: The command `source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop` gives the output `bash: framework-configure.sh: No such file or directory`. How can this be resolved?

A: Please try to re-submit or re-run the job, and if that does not help, log in to the ZIH system again.

Q: There are a lot of errors and warnings during the setup of the session.

A: Please check that the framework works by running a simple example. The source of the warnings could be ssh and similar components, and they might not affect the frameworks at all.

!!! help

    If you have questions or need advice, please see
    [https://www.scads.de/transfer-2/beratung-und-support-en/](https://www.scads.de/transfer-2/beratung-und-support-en/) or contact the HPC support.