diff --git a/Dockerfile b/Dockerfile index 731e831c9b2fc1ff1068ae2b2a80c04bbf0039c7..b272bf553212534167e23e083d4a0c088700a025 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,21 +1,34 @@ FROM python:3.8-bullseye +SHELL ["/bin/bash", "-c"] + ######## # Base # ######## -COPY ./ /src/ - -RUN pip install -r /src/doc.zih.tu-dresden.de/requirements.txt +RUN pip install mkdocs>=1.1.2 mkdocs-material>=7.1.0 ########## # Linter # ########## -RUN apt update && apt install -y nodejs npm aspell +RUN apt update && apt install -y nodejs npm aspell git RUN npm install -g markdownlint-cli markdown-link-check -WORKDIR /src/doc.zih.tu-dresden.de +########################################### +# prepare git for automatic merging in CI # +########################################### +RUN git config --global user.name 'Gitlab Bot' +RUN git config --global user.email 'hpcsupport@zih.tu-dresden.de' + +RUN mkdir -p ~/.ssh + +#see output of `ssh-keyscan gitlab.hrz.tu-chemnitz.de` +RUN echo $'# gitlab.hrz.tu-chemnitz.de:22 SSH-2.0-OpenSSH_7.4\n\ +gitlab.hrz.tu-chemnitz.de ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDNixJ1syD506jOtiLPxGhAXsNnVfweFfzseh9/WrNxbTgIhi09fLb5aZI2CfOOWIi4fQz07S+qGugChBs4lJenLYAu4b0IAnEv/n/Xnf7wITf/Wlba2VSKiXdDqbSmNbOQtbdBLNu1NSt+inFgrreaUxnIqvWX4pBDEEGBAgG9e2cteXjT/dHp4+vPExKEjM6Nsxw516Cqv5H1ZU7XUTHFUYQr0DoulykDoXU1i3odJqZFZQzcJQv/RrEzya/2bwaatzKfbgoZLlb18T2LjkP74b71DeFIQWV2e6e3vsNwl1NsvlInEcsSZB1TZP+mKke7JWiI6HW2IrlSaGqM8n4h\n\ +gitlab.hrz.tu-chemnitz.de ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJ/cSNsKRPrfXCMjl+HsKrnrI3HgbCyKWiRa715S99BR\n' > ~/.ssh/known_hosts + +WORKDIR /docs CMD ["mkdocs", "build", "--verbose", "--strict"] diff --git a/README.md b/README.md index 05825be788b1d0e0d6436454e6aa0849d28d93c3..d3482f3ae680798e81cdd2ea7814eeadb4abe57d 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ within the CI/CD pipeline help to ensure a high quality documentation. ## Reporting Issues Issues concerning this documentation can reported via the GitLab -[issue tracking system](https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium/-/issues). +[issue tracking system](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/issues). Please check for any already existing issue before submitting your issue in order to avoid duplicate issues. diff --git a/.markdownlintrc b/doc.zih.tu-dresden.de/.markdownlintrc similarity index 100% rename from .markdownlintrc rename to doc.zih.tu-dresden.de/.markdownlintrc diff --git a/doc.zih.tu-dresden.de/README.md b/doc.zih.tu-dresden.de/README.md index 31344cece97859451158faa45a172ebcacea1752..bf1b82f52a145f959068fa063d9dbdf31fb2eae3 100644 --- a/doc.zih.tu-dresden.de/README.md +++ b/doc.zih.tu-dresden.de/README.md @@ -9,9 +9,9 @@ long describing complex steps, contributing is quite easy - trust us. ## Contribute via Issue Users can contribute to the documentation via the -[issue tracking system](https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium/-/issues). +[issue tracking system](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/issues). For that, open an issue to report typos and missing documentation or request for more precise -wording etc. ZIH staff will get in touch with you to resolve the issue and improve the +wording etc. ZIH staff will get in touch with you to resolve the issue and improve the documentation. 
**Reminder:** Non-documentation issues and requests need to be send as ticket to @@ -40,15 +40,7 @@ Now, create a local clone of your fork #### Install Dependencies -**TODO:** Description - -```Shell Session -~ cd hpc-compendium/doc.zih.tu-dresden.de -~ pip install -r requirements.txt -``` - -**TODO:** virtual environment -**TODO:** What we need for markdownlinter and checks? +See [Installation with Docker](#preview-using-mkdocs-with-dockerfile). <!--- All branches are protected, i.e., only ZIH staff can create branches and push to them ---> @@ -113,21 +105,27 @@ Open `http://127.0.0.1:8000` with a web browser to preview the local copy of the You can also use `docker` to build a container from the `Dockerfile`, if you are familiar with it. This may take a while, as mkdocs and other necessary software needs to be downloaded. -Building a container with the documentation inside could be done with the following steps: +Building a container could be done with the following steps: ```Bash cd /PATH/TO/hpc-compendium docker build -t hpc-compendium . ``` +To avoid a lot of retyping, use the following in your shell: + +```bash +alias wiki="docker run --name=hpc-compendium --rm -it -w /docs --mount src=$PWD/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium bash -c" +``` + If you want to see how it looks in your browser, you can use shell commands to serve the documentation: ```Bash -docker run --name=hpc-compendium -p 8000:8000 --rm -it -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium bash -c "mkdocs build --verbose && mkdocs serve -a 0.0.0.0:8000" +wiki "mkdocs build --verbose && mkdocs serve -a 0.0.0.0:8000" ``` -You can view the documentation via [http://localhost:8000](http://localhost:8000) in your browser, now. +You can view the documentation via `http://localhost:8000` in your browser, now. If that does not work, check if you can get the URL for your browser's address bar from a different terminal window: @@ -137,36 +135,36 @@ echo http://$(docker inspect -f "{{.NetworkSettings.IPAddress}}" $(docker ps -qf ``` The running container automatically takes care of file changes and rebuilds the -documentation. If you want to check whether the markdown files are formatted +documentation. If you want to check whether the markdown files are formatted properly, use the following command: ```Bash -docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium markdownlint docs +wiki 'markdownlint docs' ``` To check whether there are links that point to a wrong target, use (this may take a while and gives a lot of output because it runs over all files): ```Bash -docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium bash -c "find docs -type f -name '*.md' | xargs -L1 markdown-link-check" +wiki "find docs -type f -name '*.md' | xargs -L1 markdown-link-check" ``` -To check a single file, e. g. `doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md`, use: +To check a single file, e. g. 
`doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md`, use: ```Bash -docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium markdown-link-check docs/software/big_data_frameworks.md +wiki 'markdown-link-check docs/software/big_data_frameworks_spark.md' ``` For spell-checking a single file, use: ```Bash -docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium ./util/check-spelling.sh <file> +wiki 'util/check-spelling.sh <file>' ``` For spell-checking all files, use: ```Bash -docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium ./util/check-spelling.sh +docker run --name=hpc-compendium --rm -it -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/check-spelling.sh ``` This outputs all words of all files that are unknown to the spell checker. @@ -194,7 +192,7 @@ locally on the documentation. At first, you should add a remote pointing to the documentation. ```Shell Session -~ git remote add upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpc-compendium/hpc-compendium.git +~ git remote add upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git ``` Now, you have two remotes, namely *origin* and *upstream-zih*. The remote *origin* points to your fork, @@ -204,8 +202,8 @@ whereas *upstream-zih* points to the original documentation repository at GitLab $ git remote -v origin git@gitlab.hrz.tu-chemnitz.de:LOGIN/hpc-compendium.git (fetch) origin git@gitlab.hrz.tu-chemnitz.de:LOGIN/hpc-compendium.git (push) -upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpc-compendium/hpc-compendium.git (fetch) -upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpc-compendium/hpc-compendium.git (push) +upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git (fetch) +upstream-zih git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git (push) ``` Next, you should synchronize your `main` branch with the upstream. @@ -237,7 +235,7 @@ new branch (a so-called feature branch) basing on the `main` branch and commit y The last command pushes the changes to your remote at branch `FEATUREBRANCH`. Now, it is time to incorporate the changes and improvements into the HPC Compendium. For this, create a -[merge request](https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium/-/merge_requests/new) +[merge request](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/merge_requests/new) to the `main` branch. 
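Taken together, a typical round trip for a contribution might look like the following sketch. This is only an illustration: the branch name `FEATUREBRANCH` and the file name `CHANGED_FILE.md` are placeholders, and the exact synchronization commands may differ from your local setup.

```Shell Session
~ git fetch upstream-zih
~ git checkout main
~ git merge upstream-zih/main
~ git checkout -b FEATUREBRANCH
# edit files, then stage and commit your changes
~ git add doc.zih.tu-dresden.de/docs/CHANGED_FILE.md
~ git commit -m "Describe your change"
~ git push origin FEATUREBRANCH
```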
### Important Branches @@ -247,9 +245,10 @@ There are two important branches in this repository: - Preview: - Branch containing recent changes which will be soon merged to main branch (protected branch) - - Served at [todo url](todo url) from TUD VPN -- Main: Branch which is deployed at [doc.zih.tu-dresden.de](doc.zih.tu-dresden.de) holding the - current documentation (protected branch) + - Served at [https://doc.zih.tu-dresden.de/preview](https://doc.zih.tu-dresden.de/preview) from + TUD-ZIH VPN +- Main: Branch which is deployed at [https://doc.zih.tu-dresden.de](https://doc.zih.tu-dresden.de) + holding the current documentation (protected branch) If you are totally sure about your commit (e.g., fix a typo), it is only the following steps: @@ -387,260 +386,3 @@ BigDataFrameworksApacheSparkApacheFlinkApacheHadoop.md is not included in nav pika.md is not included in nav specific_software.md is not included in nav ``` - -## Content Rules - -**Remark:** Avoid using tabs both in markdown files and in `mkdocs.yaml`. Type spaces instead. - -### New Page and Pages Structure - -The pages structure is defined in the configuration file [mkdocs.yaml](doc.zih.tu-dresden.de/mkdocs.yml). - -```Shell Session -docs/ - - Home: index.md - - Application for HPC Login: application.md - - Request for Resources: req_resources.md - - Access to the Cluster: access.md - - Available Software and Usage: - - Overview: software/overview.md - ... -``` - -To add a new page to the documentation follow these two steps: - -1. Create a new markdown file under `docs/subdir/file_name.md` and put the documentation inside. The - sub directory and file name should follow the pattern `fancy_title_and_more.md`. -1. Add `subdir/file_name.md` to the configuration file `mkdocs.yml` by updating the navigation - section. - -Make sure that the new page **is not floating**, i.e., it can be reached directly from the documentation -structure. - -### Markdown - -1. Please keep things simple, i.e., avoid using fancy markdown dialects. - * [Cheat Sheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) - * [Style Guide](https://github.com/google/styleguide/blob/gh-pages/docguide/style.md) - -1. Do not add large binary files or high resolution images to the repository. See this valuable - document for [image optimization](https://web.dev/fast/#optimize-your-images). - -1. [Admonitions](https://squidfunk.github.io/mkdocs-material/reference/admonitions/) may be -actively used, especially for longer code examples, warnings, tips, important information that -should be highlighted, etc. Code examples, longer than half screen height should collapsed -(and indented): - -??? example - ```Bash - [...] - # very long example here - [...] - ``` - -### Writing Style - -**TODO** Guide [Issue #14](#14) - -* Capitalize headings, e.g. *Exclusive Reservation of Hardware* - -### Spelling and Technical Wording - -To provide a consistent and high quality documentation, and help users to find the right pages, -there is a list of conventions w.r.t. spelling and technical wording. - -* Language settings: en_us -* `I/O` not `IO` -* `Slurm` not `SLURM` -* `Filesystem` not `file system` -* `ZIH system` and `ZIH systems` not `Taurus`, `HRSKII`, `our HPC systems` etc. -* `Workspace` not `work space` -* avoid term `HPC-DA` - -### Code Blocks and Command Prompts - -Showing commands and sample output is an important part of all technical documentation. 
To make -things as clear for readers as possible and provide a consistent documentation, some rules have to -be followed. - -1. Use ticks to mark code blocks and commands, not italic font. -1. Specify language for code blocks ([see below](#code-blocks-and-syntax-highlighting)). -1. All code blocks and commands should be runnable from a login node or a node within a specific - partition (e.g., `ml`). -1. It should be clear from the prompt, where the command is run (e.g. local machine, login node or - specific partition). - -#### Prompts - -We follow this rules regarding prompts: - -| Host/Partition | Prompt | -|------------------------|------------------| -| Login nodes | `marie@login$` | -| Arbitrary compute node | `marie@compute$` | -| `haswell` partition | `marie@haswell$` | -| `ml` partition | `marie@ml$` | -| `alpha` partition | `marie@alpha$` | -| `alpha` partition | `marie@alpha$` | -| `romeo` partition | `marie@romeo$` | -| `julia` partition | `marie@julia$` | -| Localhost | `marie@local$` | - -*Remarks:* - -* **Always use a prompt**, even there is no output provided for the shown command. -* All code blocks should use long parameter names (e.g. Slurm parameters), if available. -* All code blocks which specify some general command templates, e.g. containing `<` and `>` - (see [Placeholders](#mark-placeholders)), should use `bash` for the code block. Additionally, - an example invocation, perhaps with output, should be given with the normal `console` code block. - See also [Code Block description below](#code-blocks-and-syntax-highlighting). -* Using some magic, the prompt as well as the output is identified and will not be copied! -* Stick to the [generic user name](#data-privacy-and-generic-user-name) `marie`. - -#### Code Blocks and Syntax Highlighting - -This project makes use of the extension -[pymdownx.highlight](https://squidfunk.github.io/mkdocs-material/reference/code-blocks/) for syntax -highlighting. There is a complete list of supported -[language short codes](https://pygments.org/docs/lexers/). - -For consistency, use the following short codes within this project: - -With the exception of command templates, use `console` for shell session and console: - -```` markdown -```console -marie@login$ ls -foo -bar -``` -```` - -Make sure that shell session and console code blocks are executable on the login nodes of HPC system. - -Command templates use [Placeholders](#mark-placeholders) to mark replaceable code parts. Command -templates should give a general idea of invocation and thus, do not contain any output. Use a -`bash` code block followed by an invocation example (with `console`): - -```` markdown -```bash -marie@local$ ssh -NL <local port>:<compute node>:<remote port> <zih login>@tauruslogin.hrsk.tu-dresden.de -``` - -```console -marie@local$ ssh -NL 5901:172.24.146.46:5901 marie@tauruslogin.hrsk.tu-dresden.de -``` -```` - -Also use `bash` for shell scripts such as jobfiles: - -```` markdown -```bash -#!/bin/bash -#SBATCH --nodes=1 -#SBATCH --time=01:00:00 -#SBATCH --output=slurm-%j.out - -module load foss - -srun a.out -``` -```` - -!!! important - - Use long parameter names where possible to ease understanding. 
- -`python` for Python source code: - -```` markdown -```python -from time import gmtime, strftime -print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) -``` -```` - -`pycon` for Python console: - -```` markdown -```pycon ->>> from time import gmtime, strftime ->>> print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) -2021-08-03 07:20:33 -``` -```` - -Line numbers can be added via - -```` markdown -```bash linenums="1" -#!/bin/bash - -#SBATCH -N 1 -#SBATCH -n 23 -#SBATCH -t 02:10:00 - -srun a.out -``` -```` - -_Result_: - - - -Specific Lines can be highlighted by using - -```` markdown -```bash hl_lines="2 3" -#!/bin/bash - -#SBATCH -N 1 -#SBATCH -n 23 -#SBATCH -t 02:10:00 - -srun a.out -``` -```` - -_Result_: - - - -### Data Privacy and Generic User Name - -Where possible, replace login, project name and other private data with clearly arbitrary placeholders. -E.g., use the generic login `marie` and the corresponding project name `p_marie`. - -```console -marie@login$ ls -l -drwxr-xr-x 3 marie p_marie 4096 Jan 24 2020 code -drwxr-xr-x 3 marie p_marie 4096 Feb 12 2020 data --rw-rw---- 1 marie p_marie 4096 Jan 24 2020 readme.md -``` - -### Mark Omissions - -If showing only a snippet of a long output, omissions are marked with `[...]`. - -### Mark Placeholders - -Stick to the Unix rules on optional and required arguments, and selection of item sets: - -* `<required argument or value>` -* `[optional argument or value]` -* `{choice1|choice2|choice3}` - -## Graphics and Attachments - -All graphics and attachments are saved within `misc` directory of the respective sub directory in -`docs`. - -The syntax to insert a graphic or attachment into a page is - -```Bash - -{: align="center"} -``` - -The attribute `align` is optional. By default, graphics are left aligned. **Note:** It is crucial to -have `{: align="center"}` on a new line. diff --git a/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md b/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md index 6b40f3bad658df5a171d8b46e5e34f8ae7a1ee95..7395aad287f5c197ae8ba639491c493e87f2ffe9 100644 --- a/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md +++ b/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md @@ -4,14 +4,14 @@ NICE DCV enables remote accessing OpenGL 3D applications running on ZIH systems server's GPUs. If you don't need OpenGL acceleration, you might also want to try our [WebVNC](graphical_applications_with_webvnc.md) solution. -Look [here](https://docs.aws.amazon.com/dcv/latest/userguide/client-web.html) if you want to know -if your browser is supported by DCV. +See [the official DCV documentation](https://docs.aws.amazon.com/dcv/latest/userguide/client-web.html) +if you want to know whether your browser is supported by DCV. ## Access with JupyterHub **Check out our new documentation about** [Virtual Desktops](../software/virtual_desktops.md). -To start a JupyterHub session on the dcv partition (taurusi210\[4-8\]) with one GPU, six CPU cores +To start a JupyterHub session on the partition `dcv` (`taurusi210[4-8]`) with one GPU, six CPU cores and 2583 MB memory per core, click on: [https://taurus.hrsk.tu-dresden.de/jupyter/hub/spawn#/~(partition~'dcv~cpuspertask~'6~gres~'gpu*3a1~mempercpu~'2583~environment~'production)](https://taurus.hrsk.tu-dresden.de/jupyter/hub/spawn#/~(partition~'dcv~cpuspertask~'6~gres~'gpu*3a1~mempercpu~'2583~environment~'production)) Optionally, you can modify many different Slurm parameters. 
For this diff --git a/doc.zih.tu-dresden.de/docs/access/graphical_applications_with_webvnc.md b/doc.zih.tu-dresden.de/docs/access/graphical_applications_with_webvnc.md index 6837ace6473f9532e608778ec96049394b4c4494..c652738dc859beecf3dc9669fdde684dc49d04f3 100644 --- a/doc.zih.tu-dresden.de/docs/access/graphical_applications_with_webvnc.md +++ b/doc.zih.tu-dresden.de/docs/access/graphical_applications_with_webvnc.md @@ -38,7 +38,7 @@ marie@login$ srun --pty --partition=interactive --mem-per-cpu=2500 --cpus-per-ta [...] ``` -Of course, you can adjust the batch job parameters to your liking. Note that the default timelimit +Of course, you can adjust the batch job parameters to your liking. Note that the default time limit in partition `interactive` is only 30 minutes, so you should specify a longer one with `--time` (or `-t`). The script will automatically generate a self-signed SSL certificate and place it in your home diff --git a/doc.zih.tu-dresden.de/docs/access/jupyterhub.md b/doc.zih.tu-dresden.de/docs/access/jupyterhub.md index d3cdc8f582c663a2b5d27dcd4f59a6c2e7dc659b..f9a916195ecbf814cf426beb4d26885500b3b3de 100644 --- a/doc.zih.tu-dresden.de/docs/access/jupyterhub.md +++ b/doc.zih.tu-dresden.de/docs/access/jupyterhub.md @@ -1,7 +1,7 @@ # JupyterHub With our JupyterHub service we offer you a quick and easy way to work with Jupyter notebooks on ZIH -systems. This page covers starting and stopping JuperterHub sessions, error handling and customizing +systems. This page covers starting and stopping JupyterHub sessions, error handling and customizing the environment. We also provide a comprehensive documentation on how to use @@ -21,7 +21,8 @@ cannot give extensive support in every case. !!! note This service is only available for users with an active HPC project. - See [here](../access/overview.md) how to apply for an HPC project. + See [Application for Login and Resources](../application/overview.md), if you need to apply for + an HPC project. JupyterHub is available at [https://taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter). @@ -41,7 +42,7 @@ settings. You can: - modify batch system parameters to your needs ([more about batch system Slurm](../jobs_and_resources/slurm.md)) - assign your session to a project or reservation -- load modules from the [module system](../software/runtime_environment.md) +- load modules from the [module system](../software/modules.md) - choose a different standard environment (in preparation for future software updates or testing additional features) @@ -100,7 +101,7 @@ running the code. We currently offer one for Python, C++, MATLAB and R. ## Stop a Session -It is good practise to stop your session once your work is done. This releases resources for other +It is good practice to stop your session once your work is done. This releases resources for other users and your quota is less charged. If you just log out or close the window, your server continues running and **will not stop** until the Slurm job runtime hits the limit (usually 8 hours). @@ -137,8 +138,8 @@ This message appears instantly if your batch system parameters are not valid. Please check those settings against the available hardware. 
Useful pages for valid batch system parameters: -- [Slurm batch system (Taurus)](../jobs_and_resources/system_taurus.md#batch-system) - [General information how to use Slurm](../jobs_and_resources/slurm.md) +- [Partitions and limits](../jobs_and_resources/partitions_and_limits.md) ### Error Message in JupyterLab @@ -147,8 +148,8 @@ Useful pages for valid batch system parameters: If the connection to your notebook server unexpectedly breaks, you will get this error message. Sometimes your notebook server might hit a batch system or hardware limit and gets killed. Then -usually the logfile of the corresponding batch job might contain useful information. These logfiles -are located in your `home` directory and have the name `jupyter-session-<jobid>.log`. +usually the log file of the corresponding batch job might contain useful information. These log +files are located in your `home` directory and have the name `jupyter-session-<jobid>.log`. ## Advanced Tips @@ -189,7 +190,7 @@ Here is a short list of some included software: \* generic = all partitions except ml -\*\* R is loaded from the [module system](../software/runtime_environment.md) +\*\* R is loaded from the [module system](../software/modules.md) ### Creating and Using a Custom Environment @@ -309,4 +310,4 @@ You can switch kernels of existing notebooks in the kernel menu: You have now the option to preload modules from the [module system](../software/modules.md). Select multiple modules that will be preloaded before your notebook server starts. The list of available modules depends on the module environment you want to start the session in (`scs5` or -`ml`). The right module environment will be chosen by your selected partition. +`ml`). The right module environment will be chosen by your selected partition. diff --git a/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md b/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md index 970a11898a6f2e93110d8b4f211ae9df9d883eed..797d9fc8e455b14e40a5ec7f3737874b2ac500ae 100644 --- a/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md +++ b/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md @@ -1,7 +1,7 @@ # JupyterHub for Teaching -On this page we want to introduce to you some useful features if you -want to use JupyterHub for teaching. +On this page, we want to introduce to you some useful features if you want to use JupyterHub for +teaching. !!! note @@ -9,24 +9,21 @@ want to use JupyterHub for teaching. Please be aware of the following notes: -- ZIH systems operate at a lower availability level than your usual Enterprise Cloud VM. There - can always be downtimes, e.g. of the filesystems or the batch system. +- ZIH systems operate at a lower availability level than your usual Enterprise Cloud VM. There can + always be downtimes, e.g. of the filesystems or the batch system. - Scheduled downtimes are announced by email. Please plan your courses accordingly. - Access to HPC resources is handled through projects. See your course as a project. Projects need to be registered beforehand (more info on the page [Access](../application/overview.md)). -- Don't forget to **TODO ANCHOR**(add your users) - (ProjectManagement#manage_project_members_40dis_45_47enable_41) (eg. students or tutors) to -your project. -- It might be a good idea to **TODO ANCHOR**(request a - reservation)(Slurm#Reservations) of part of the compute resources for your project/course to - avoid unnecessary waiting times in the batch system queue. 
+- Don't forget to [add your users](../application/project_management.md#manage-project-members-dis-enable) + (e.g. students or tutors) to your project. +- It might be a good idea to [request a reservation](../jobs_and_resources/overview.md#exclusive-reservation-of-hardware) + of part of the compute resources for your project/course to avoid unnecessary waiting times in + the batch system queue. ## Clone a Repository With a Link -This feature bases on -[nbgitpuller](https://github.com/jupyterhub/nbgitpuller). -Documentation can be found at -[this page](https://jupyterhub.github.io/nbgitpuller/). +This feature bases on [nbgitpuller](https://github.com/jupyterhub/nbgitpuller). Further information +can be found in the [external documentation about nbgitpuller](https://jupyterhub.github.io/nbgitpuller/). This extension for Jupyter notebooks can clone every public git repository into the users work directory. It's offering a quick way to distribute notebooks and other material to your students. @@ -51,14 +48,14 @@ The following parameters are available: |---|---| |`repo` | path to git repository| |`branch` | branch in the repository to pull from default: `master`| -|`urlpath` | URL to redirect the user to a certain file [more info](https://jupyterhub.github.io/nbgitpuller/topic/url-options.html#urlpath)| +|`urlpath` | URL to redirect the user to a certain file, [more info about parameter urlpath](https://jupyterhub.github.io/nbgitpuller/topic/url-options.html#urlpath)| |`depth` | clone only a certain amount of latest commits not recommended| This [link generator](https://jupyterhub.github.io/nbgitpuller/link?hub=https://taurus.hrsk.tu-dresden.de/jupyter/) might help creating those links -## Spawner Options Passthrough with URL Parameters +## Spawn Options Pass-through with URL Parameters The spawn form now offers a quick start mode by passing URL parameters. diff --git a/doc.zih.tu-dresden.de/docs/access/security_restrictions.md b/doc.zih.tu-dresden.de/docs/access/security_restrictions.md index 25f6270410c4e35cee150019298fac6dd33cd01e..b43d631c07fc47bf55da932dbb0d11aca4cf2ecf 100644 --- a/doc.zih.tu-dresden.de/docs/access/security_restrictions.md +++ b/doc.zih.tu-dresden.de/docs/access/security_restrictions.md @@ -1,27 +1,27 @@ -# Security Restrictions on Taurus +# Security Restrictions -As a result of the security incident the German HPC sites in Gau Alliance are now adjusting their -measurements to prevent infection and spreading of the malware. +As a result of a security incident the German HPC sites in Gauß Alliance have adjusted their +measurements to prevent infection and spreading of malware. -The most important items for HPC systems at ZIH are: +The most important items for ZIH systems are: -- All users (who haven't done so recently) have to +* All users (who haven't done so recently) have to [change their ZIH password](https://selfservice.zih.tu-dresden.de/l/index.php/pswd/change_zih_password). - **Login to Taurus is denied with an old password.** -- All old (private and public) keys have been moved away. -- All public ssh keys for Taurus have to - - be re-generated using only the ED25519 algorithm (`ssh-keygen -t ed25519`) - - **passphrase for the private key must not be empty** -- Ideally, there should be no private key on Taurus except for local use. -- Keys to other systems must be passphrase-protected! -- **ssh to Taurus** is only possible from inside TU Dresden Campus - (login\[1,2\].zih.tu-dresden.de will be blacklisted). 
Users from outside can use VPN (see - [here](https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/zugang_datennetz/vpn)). -- **ssh from Taurus** is only possible inside TU Dresden Campus. - (Direct ssh access to other computing centers was the spreading vector of the recent incident.) + * **Login to ZIH systems is denied with an old password.** +* All old (private and public) keys have been moved away. +* All public ssh keys for ZIH systems have to + * be re-generated using only the ED25519 algorithm (`ssh-keygen -t ed25519`) + * **passphrase for the private key must not be empty** +* Ideally, there should be no private key on ZIH system except for local use. +* Keys to other systems must be passphrase-protected! +* **ssh to ZIH systems** is only possible from inside TU Dresden campus + (`login[1,2].zih.tu-dresden.de` will be blacklisted). Users from outside can use + [VPN](https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/zugang_datennetz/vpn). +* **ssh from ZIH system** is only possible inside TU Dresden campus. + (Direct SSH access to other computing centers was the spreading vector of the recent incident.) -Data transfer is possible via the taurusexport nodes. We are working on a bandwidth-friendly -solution. +Data transfer is possible via the [export nodes](../data_transfer/export_nodes.md). We are working +on a bandwidth-friendly solution. We understand that all this will change convenient workflows. If the measurements would render your -work on Taurus completely impossible, please contact the HPC support. +work on ZIH systems completely impossible, please [contact the HPC support](../support/support.md). diff --git a/doc.zih.tu-dresden.de/docs/access/ssh_login.md b/doc.zih.tu-dresden.de/docs/access/ssh_login.md index 5e67c5279f701405224078d7234517c642d3e726..a0fef440151984abbe662fe8f096de166eae6dad 100644 --- a/doc.zih.tu-dresden.de/docs/access/ssh_login.md +++ b/doc.zih.tu-dresden.de/docs/access/ssh_login.md @@ -9,11 +9,12 @@ connection to enter the campus network. While active, it allows the user to conn HPC login nodes. For more information on our VPN and how to set it up, please visit the corresponding -[ZIH service catalogue page](https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/zugang_datennetz/vpn). +[ZIH service catalog page](https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/zugang_datennetz/vpn). ## Connecting from Linux -Please use an up-to-date SSH client. The login nodes accept the following encryption algorithms: +SSH establishes secure connections using authentication and encryption. Thus, please use an +up-to-date SSH client. The login nodes accept the following encryption algorithms: * `aes128-ctr` * `aes192-ctr` @@ -23,65 +24,135 @@ Please use an up-to-date SSH client. The login nodes accept the following encryp * `chacha20-poly1305@openssh.com` * `chacha20-poly1305@openssh.com` -### SSH Session +### Before Your First Connection -If your workstation is within the campus network, you can connect to the HPC login nodes directly. +We suggest to create an SSH key pair before you work with the ZIH systems. This ensures high +connection security. ```console -marie@local$ ssh <zih-login>@taurus.hrsk.tu-dresden.de +marie@local$ mkdir -p ~/.ssh +marie@local$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 +Generating public/private ed25519 key pair. +Enter passphrase (empty for no passphrase): +Enter same passphrase again: +[...] 
``` -If you connect for the fist time, the client will ask you to verify the host by its fingerprint: +Type in a passphrase for the protection of your key. The passphrase should be **non-empty**. +Copy the public key to the ZIH system (Replace placeholder `marie` with your ZIH login): ```console -marie@local$: ssh <zih-login>@taurus.hrsk.tu-dresden.de +marie@local$ ssh-copy-id -i ~/.ssh/id_ed25519.pub marie@taurus.hrsk.tu-dresden.de The authenticity of host 'taurus.hrsk.tu-dresden.de (141.30.73.104)' can't be established. RSA key fingerprint is SHA256:HjpVeymTpk0rqoc8Yvyc8d9KXQ/p2K0R8TJ27aFnIL8. Are you sure you want to continue connecting (yes/no)? ``` Compare the shown fingerprint with the [documented fingerprints](key_fingerprints.md). Make sure -they match. Than you can accept by typing `y` or `yes`. +they match. Then you can accept by typing `yes`. -### X11-Forwarding +!!! info + If `ssh-copy-id` is not available, you need to do additional steps: -If you plan to use an application with graphical user interface (GUI), you need to enable -X11-forwarding for the connection. Add the option `-X` or `-XC` to your SSH command. The `-C` enables -compression which usually improves usability in this case). + ```console + marie@local$ scp ~/.ssh/id_ed25519.pub marie@taurus.hrsk.tu-dresden.de: + The authenticity of host 'taurus.hrsk.tu-dresden.de (141.30.73.104)' can't be established. + RSA key fingerprint is SHA256:HjpVeymTpk0rqoc8Yvyc8d9KXQ/p2K0R8TJ27aFnIL8. + Are you sure you want to continue connecting (yes/no)? + ``` + + After that, you need to manually copy the key to the right place: + + ```console + marie@local$ ssh marie@taurus.hrsk.tu-dresden.de + [...] + marie@login$ mkdir -p ~/.ssh + marie@login$ touch ~/.ssh/authorized_keys + marie@login$ cat id_ed25519.pub >> ~/.ssh/authorized_keys + ``` + +#### Configuring Default Parameters for SSH + +After you have copied your key to the ZIH system, you should be able to connect using: ```console -marie@local$ ssh -XC <zih-login>@taurus.hrsk.tu-dresden.de +marie@local$ ssh marie@taurus.hrsk.tu-dresden.de +[...] +marie@login$ exit ``` -!!! info +However, you can make this more comfortable if you prepare an SSH configuration on your local +workstation. Navigate to the subdirectory `.ssh` in your home directory and open the file `config` +(`~/.ssh/config`) in your favorite editor. If it does not exist, create it. Put the following lines +in it (you can omit lines starting with `#`): + +```bash +Host taurus + #For login (shell access) + HostName taurus.hrsk.tu-dresden.de + #Put your ZIH-Login after keyword "User": + User marie + #Path to private key: + IdentityFile ~/.ssh/id_ed25519 + #Don't try other keys if you have more: + IdentitiesOnly yes + #Enable X11 forwarding for graphical applications and compression. You don't need parameter -X and -C when invoking ssh then. + ForwardX11 yes + Compression yes +Host taurusexport + #For copying data without shell access + HostName taurusexport.hrsk.tu-dresden.de + #Put your ZIH-Login after keyword "User": + User marie + #Path to private key: + IdentityFile ~/.ssh/id_ed25519 + #Don't try other keys if you have more: + IdentitiesOnly yes +``` - Also consider to use a [DCV session](desktop_cloud_visualization.md) for remote desktop - visualization at ZIH systems. +Afterwards, you can connect to the ZIH system using: -### Password-Less SSH +```console +marie@local$ ssh taurus +``` -Of course, password-less SSH connecting is supported at ZIH. 
All public SSH keys for ZIH systems -have to be generated following these rules: +If you want to copy data from/to ZIH systems, please refer to [Export Nodes: Transfer Data to/from +ZIH's Filesystems](../data_transfer/export_nodes.md) for more information on export nodes. - * The **ED25519** algorithm has to be used, e.g., `ssh-keygen -t ed25519` - * A **non-empty** passphrase for the private key must be set. +### X11-Forwarding + +If you plan to use an application with graphical user interface (GUI), you need to enable +X11-forwarding for the connection. If you use the SSH configuration described above, everything is +already prepared and you can simply use: + +```console +marie@local$ ssh taurus +``` -The generated public key is usually saved at `~/.ssh/id_ed25519` at your local system. To allow for -password-less SSH connection to ZIH systems, it has to be added to the file `.ssh/authorized_keys` within -your home directory `/home/<zih-login>/` at ZIH systems. +If you have omitted the last two lines in the default configuration above, you need to add the +option `-X` or `-XC` to your SSH command. The `-C` enables compression which usually improves +usability in this case: ```console -marie@local$ ssh -i id-ed25519 <zih-login>@taurus.hrsk.tu-dresden.de -Enter passphrase for key 'id-ed25519': +marie@local$ ssh -XC taurus ``` +!!! info + + Also consider to use a [DCV session](desktop_cloud_visualization.md) for remote desktop + visualization at ZIH systems. + ## Connecting from Windows We recommend one of the following applications: * [MobaXTerm](https://mobaxterm.mobatek.net): [ZIH documentation](misc/basic_usage_of_MobaXterm.pdf) * [PuTTY](https://www.putty.org): [ZIH documentation](misc/basic_usage_of_PuTTY.pdf) - * OpenSSH Server: [docs](https://docs.microsoft.com/de-de/windows-server/administration/openssh/openssh_install_firstuse) + * For Windows 10 (1809 and higher): + * [Windows Terminal](https://www.microsoft.com/store/productId/9N0DX20HK701) + * Together with the built-in [OpenSSH Client](https://docs.microsoft.com/de-de/windows-server/administration/openssh/openssh_overview) + +## SSH Key Fingerprints The page [key fingerprints](key_fingerprints.md) holds the up-to-date fingerprints for the login nodes. Make sure they match. 
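As a quick way to compare what the server actually presents against that page, you can print the fingerprint of the offered host key on your local machine. This is only a sketch (it assumes a shell with process substitution, e.g. `bash`), and the fingerprint shown in the output is a placeholder; the real value must match the documented one.

```console
marie@local$ ssh-keygen -lf <(ssh-keyscan -t ed25519 taurus.hrsk.tu-dresden.de 2>/dev/null)
256 SHA256:<fingerprint> taurus.hrsk.tu-dresden.de (ED25519)
```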
diff --git a/doc.zih.tu-dresden.de/docs/accessibility.md b/doc.zih.tu-dresden.de/docs/accessibility.md index 418d8a11c98be59a121a47f0d497dfce1a79aa05..ba40340fe0d9995c27b4013d06a01400dc279e87 100644 --- a/doc.zih.tu-dresden.de/docs/accessibility.md +++ b/doc.zih.tu-dresden.de/docs/accessibility.md @@ -39,4 +39,4 @@ Postanschrift: Archivstraße 1, 01097 Dresden E-Mail: <info.behindertenbeauftragter@sk.sachsen.de> Telefon: +49 351 564-12161 Fax: +49 351 564-12169 -Webseite: [https://www.inklusion.sachsen.de](https://www.inklusion.sachsen.de) +Webseite: [https://www.inklusion.sachsen.de/](https://www.inklusion.sachsen.de/) diff --git a/Compendium_attachments/ProjectManagement/add_member.png b/doc.zih.tu-dresden.de/docs/application/misc/add_member.png similarity index 100% rename from Compendium_attachments/ProjectManagement/add_member.png rename to doc.zih.tu-dresden.de/docs/application/misc/add_member.png diff --git a/Compendium_attachments/ProjectManagement/external_login.png b/doc.zih.tu-dresden.de/docs/application/misc/external_login.png similarity index 100% rename from Compendium_attachments/ProjectManagement/external_login.png rename to doc.zih.tu-dresden.de/docs/application/misc/external_login.png diff --git a/Compendium_attachments/ProjectManagement/members.png b/doc.zih.tu-dresden.de/docs/application/misc/members.png similarity index 100% rename from Compendium_attachments/ProjectManagement/members.png rename to doc.zih.tu-dresden.de/docs/application/misc/members.png diff --git a/Compendium_attachments/ProjectManagement/overview.png b/doc.zih.tu-dresden.de/docs/application/misc/overview.png similarity index 100% rename from Compendium_attachments/ProjectManagement/overview.png rename to doc.zih.tu-dresden.de/docs/application/misc/overview.png diff --git a/Compendium_attachments/ProjectManagement/password.png b/doc.zih.tu-dresden.de/docs/application/misc/password.png similarity index 100% rename from Compendium_attachments/ProjectManagement/password.png rename to doc.zih.tu-dresden.de/docs/application/misc/password.png diff --git a/Compendium_attachments/ProjectManagement/project_details.png b/doc.zih.tu-dresden.de/docs/application/misc/project_details.png similarity index 100% rename from Compendium_attachments/ProjectManagement/project_details.png rename to doc.zih.tu-dresden.de/docs/application/misc/project_details.png diff --git a/Compendium_attachments/ProjectRequestForm/request_step1_b.png b/doc.zih.tu-dresden.de/docs/application/misc/request_step1_b.png similarity index 100% rename from Compendium_attachments/ProjectRequestForm/request_step1_b.png rename to doc.zih.tu-dresden.de/docs/application/misc/request_step1_b.png diff --git a/Compendium_attachments/ProjectRequestForm/request_step2_details.png b/doc.zih.tu-dresden.de/docs/application/misc/request_step2_details.png similarity index 100% rename from Compendium_attachments/ProjectRequestForm/request_step2_details.png rename to doc.zih.tu-dresden.de/docs/application/misc/request_step2_details.png diff --git a/Compendium_attachments/ProjectRequestForm/request_step3_machines.png b/doc.zih.tu-dresden.de/docs/application/misc/request_step3_machines.png similarity index 100% rename from Compendium_attachments/ProjectRequestForm/request_step3_machines.png rename to doc.zih.tu-dresden.de/docs/application/misc/request_step3_machines.png diff --git a/Compendium_attachments/ProjectRequestForm/request_step4_software.png b/doc.zih.tu-dresden.de/docs/application/misc/request_step4_software.png similarity index 100% rename from 
Compendium_attachments/ProjectRequestForm/request_step4_software.png rename to doc.zih.tu-dresden.de/docs/application/misc/request_step4_software.png diff --git a/Compendium_attachments/ProjectRequestForm/request_step5_description.png b/doc.zih.tu-dresden.de/docs/application/misc/request_step5_description.png similarity index 100% rename from Compendium_attachments/ProjectRequestForm/request_step5_description.png rename to doc.zih.tu-dresden.de/docs/application/misc/request_step5_description.png diff --git a/Compendium_attachments/ProjectRequestForm/request_step6.png b/doc.zih.tu-dresden.de/docs/application/misc/request_step6.png similarity index 100% rename from Compendium_attachments/ProjectRequestForm/request_step6.png rename to doc.zih.tu-dresden.de/docs/application/misc/request_step6.png diff --git a/Compendium_attachments/ProjectManagement/stats.png b/doc.zih.tu-dresden.de/docs/application/misc/stats.png similarity index 100% rename from Compendium_attachments/ProjectManagement/stats.png rename to doc.zih.tu-dresden.de/docs/application/misc/stats.png diff --git a/doc.zih.tu-dresden.de/docs/application/overview.md b/doc.zih.tu-dresden.de/docs/application/overview.md index 6ab0da135480e6a9621b492a2d9b4fe956f7e2cb..59e6e6e78833b63dd358ecaeda361135aba7ef30 100644 --- a/doc.zih.tu-dresden.de/docs/application/overview.md +++ b/doc.zih.tu-dresden.de/docs/application/overview.md @@ -5,7 +5,7 @@ The HPC project manager should hold a professorship (university) or head a resea also apply for a "Schnupperaccount" (trial account) for one year to find out if the machine is useful for your application. -An other able use case is to request resources for a courses. +An other able use case is to request resources for a courses. To learn more about applying for a project or a course, check the following page: [https://tu-dresden.de/zih/hochleistungsrechnen/zugang][1] diff --git a/doc.zih.tu-dresden.de/docs/application/project_management.md b/doc.zih.tu-dresden.de/docs/application/project_management.md index a69ef756d4b74fc35e7c5be014fc2b060ea0af5e..79e457cb2590d4109a160a8296b676c3384490d5 100644 --- a/doc.zih.tu-dresden.de/docs/application/project_management.md +++ b/doc.zih.tu-dresden.de/docs/application/project_management.md @@ -1,113 +1,104 @@ -# Project management +# Project Management -The HPC project leader has overall responsibility for the project and -for all activities within his project on ZIH's HPC systems. In -particular he shall: +The HPC project leader has overall responsibility for the project and for all activities within the +corresponding project on ZIH systems. In particular the project leader shall: -- add and remove users from the project, -- update contact details of th eproject members, -- monitor the resources his project, -- inspect and store data of retiring users. +* add and remove users from the project, +* update contact details of the project members, +* monitor the resources of the project, +* inspect and store data of retiring users. -For this he can appoint a *project administrator* with an HPC account to -manage technical details. +The project leader can appoint a *project administrator* with an HPC account to manage these +technical details. 
-The front-end to the HPC project database enables the project leader and -the project administrator to +The front-end to the HPC project database enables the project leader and the project administrator +to -- add and remove users from the project, -- define a technical administrator, -- view statistics (resource consumption), -- file a new HPC proposal, -- file results of the HPC project. +* add and remove users from the project, +* define a technical administrator, +* view statistics (resource consumption), +* file a new HPC proposal, +* file results of the HPC project. ## Access -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="password" width="100">%ATTACHURLPATH%/external_login.png</span> - +{loading=lazy width=300 style="float:right"} [Entry point to the project management system](https://hpcprojekte.zih.tu-dresden.de/managers) - The project leaders of an ongoing project and their accredited admins are allowed to login to the system. In general each of these persons should possess a ZIH login at the Technical University of Dresden, with which it is possible to log on the homepage. In some cases, it may happen that a project leader of a foreign organization do not have a ZIH login. For this purpose, it is possible to set a local password: -"[Passwort vergessen](https://hpcprojekte.zih.tu-dresden.de/managers/members/missingPassword)". +"[Missing Password](https://hpcprojekte.zih.tu-dresden.de/managers/members/missingPassword)". -<span class="twiki-macro IMAGE" type="frame" align="right" caption="password reset" -width="100">%ATTACHURLPATH%/password.png</span> + +{: style="clear:right;"} -On the 'Passwort vergessen' page, it is possible to reset the -passwords of a 'non-ZIH-login'. For this you write your login, which -usually corresponds to your email address, in the field and click on -'zurcksetzen'. Within 10 minutes the system sends a signed e-mail from -<hpcprojekte@zih.tu-dresden.de> to the registered e-mail address. this -e-mail contains a link to reset the password. +{loading=lazy width=300 style="float:right"} +On the 'Missing Password' page, it is possible to reset the passwords of a 'non-ZIH-login'. For this +you write your login, which usually corresponds to your email address, in the field and click on +'reset. Within 10 minutes the system sends a signed e-mail from <hpcprojekte@zih.tu-dresden.de> to +the registered e-mail address. this e-mail contains a link to reset the password. + + +{: style="clear:right;"} ## Projects -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="projects overview" -width="100">%ATTACHURLPATH%/overview.png</span> - -\<div style="text-align: justify;"> After login you reach an overview -that displays all available projects. In each of these projects are -listed, you are either project leader or an assigned project -administrator. From this list, you have the option to view the details -of a project or make a following project request. The latter is only -possible if a project has been approved and is active or was. In the -upper right area you will find a red button to log out from the system. -\</div> \<br style="clear: both;" /> \<br /> <span -class="twiki-macro IMAGE" type="frame" align="right" -caption="project details" -width="100">%ATTACHURLPATH%/project_details.png</span> \<div -style="text-align: justify;"> The project details provide information -about the requested and allocated resources. The other tabs show the -employee and the statistics about the project. 
\</div> \<br -style="clear: both;" /> - -### manage project members (dis-/enable) - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="project members" width="100">%ATTACHURLPATH%/members.png</span> -\<div style="text-align: justify;"> The project members can be managed -under the tab 'employee' in the project details. This page gives an -overview of all ZIH logins that are a member of a project and its -status. If a project member marked in green, it can work on all -authorized HPC machines when the project has been approved. If an -employee is marked in red, this can have several causes: - -- he was manually disabled by project managers, project administrator - or an employee of the ZIH -- he was disabled by the system because his ZIH login expired -- his confirmation of the current hpc-terms is missing - -You can specify a user as an administrator. This user can then access -the project managment system. Next, you can disable individual project -members. This disabling is only a "request of disabling" and has a time -delay of 5 minutes. An user can add or reactivate himself, with his -zih-login, to a project via the link on the end of the page. To prevent -misuse this link is valid for 2 weeks and will then be renewed -automatically. \</div> \<br style="clear: both;" /> - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="add member" width="100">%ATTACHURLPATH%/add_member.png</span> - -\<div style="text-align: justify;"> The link leads to a page where you -can sign in to a Project by accepting the term of use. You need also an -valid ZIH-Login. After this step it can take 1-1,5 h to transfer the -login to all cluster nodes. \</div> \<br style="clear: both;" /> - -### statistic - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="project statistic" width="100">%ATTACHURLPATH%/stats.png</span> - -\<div style="text-align: justify;"> The statistic is located under the -tab 'Statistik' in the project details. The data will updated once a day -an shows used CPU-time and used disk space of an project. Following -projects shows also the data of the predecessor. \</div> - -\<br style="clear: both;" /> +{loading=lazy width=300 style="float:right"} +After login you reach an overview that displays all available projects. In each of these projects +are listed, you are either project leader or an assigned project administrator. From this list, you +have the option to view the details of a project or make a following project request. The latter is +only possible if a project has been approved and is active or was. In the upper right area you will +find a red button to log out from the system. + + +{: style="clear:right;"} + +{loading=lazy width=300 style="float:right"} +The project details provide information about the requested and allocated resources. The other tabs +show the employee and the statistics about the project. + + +{: style="clear:right;"} + +### Manage Project Members (dis-/enable) + +{loading=lazy width=300 style="float:right"} +The project members can be managed under the tab 'employee' in the project details. This page gives +an overview of all ZIH logins that are a member of a project and its status. If a project member +marked in green, it can work on all authorized HPC machines when the project has been approved. 
If +an employee is marked in red, this can have several causes: + +* the employee was manually disabled by project managers, project administrator + or ZIH staff +* the employee was disabled by the system because its ZIH login expired +* confirmation of the current HPC-terms is missing + +You can specify a user as an administrator. This user can then access the project management system. +Next, you can disable individual project members. This disabling is only a "request of disabling" +and has a time delay of 5 minutes. An user can add or reactivate itself, with its ZIH-login, to a +project via the link on the end of the page. To prevent misuse this link is valid for 2 weeks and +will then be renewed automatically. + + +{: style="clear:right;"} + +{loading=lazy width=300 style="float:right"} +The link leads to a page where you can sign in to a project by accepting the term of use. You need +also an valid ZIH-Login. After this step it can take 1-1,5 h to transfer the login to all cluster +nodes. + + +{: style="clear:right;"} + +### Statistic + +{loading=lazy width=300 style="float:right"} +The statistic is located under the tab 'Statistic' in the project details. The data will updated +once a day an shows used CPU-time and used disk space of an project. Following projects shows also +the data of the predecessor. + + +{: style="clear:right;"} diff --git a/doc.zih.tu-dresden.de/docs/application/project_request_form.md b/doc.zih.tu-dresden.de/docs/application/project_request_form.md index 7a50b2274b2167e5d2efd89c7a4b1725074e8990..e829f316cb26f11b9b9048a889c8b5e918b2e870 100644 --- a/doc.zih.tu-dresden.de/docs/application/project_request_form.md +++ b/doc.zih.tu-dresden.de/docs/application/project_request_form.md @@ -1,78 +1,83 @@ # Project Request Form -## first step (requester) - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="picture 2: personal information" width="170" zoom="on -">%ATTACHURL%/request_step1_b.png</span> <span class="twiki-macro IMAGE" -type="frame" align="right" caption="picture 1: login screen" width="170" -zoom="on -">%ATTACHURL%/request_step1_b.png</span> +## First Step: Requester +{loading=lazy width=300 style="float:right"} The first step is asking for the personal information of the requester. -**That's you**, not the leader of this project! \<br />If you have an -ZIH-Login, you can use it \<sup>\[Pic 1\]\</sup>. If not, you have to -fill in the whole information \<sup>\[Pic.:2\]\</sup>. <span -class="twiki-macro IMAGE">clear</span> - -## second step (project details) - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="picture 3: project details" width="170" zoom="on -">%ATTACHURL%/request_step2_details.png</span> This Step is asking for -general project Details.\<br />Any project have: - -- a title, at least 20 characters long -- a valid duration - - Projects starts at the first of a month and ends on the last day - of a month. So you are not able to send on the second of a month - a project request which start in this month. - - The approval is for a maximum of one year. Be careful: a - duration from "May, 2013" till "May 2014" has 13 month. -- a selected science, according to the DFG: - <http://www.dfg.de/dfg_profil/gremien/fachkollegien/faecher/index.jsp> -- a sponsorship -- a kind of request -- a project leader/manager - - The leader of this project should hold a professorship - (university) or is the head of the research group. - - If you are this Person, leave this fields free. 
- -<span class="twiki-macro IMAGE">clear</span> - -## third step (hardware) - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="picture 4: hardware" width="170" zoom="on -">%ATTACHURL%/request_step3_machines.png</span> This step inquire the -required hardware. You can find the specifications [here]**todo fix link** -\<br />For your guidance: - -- gpu => taurus -- many main memory => venus -- other machines => you know it and don't need this guidance - -<span class="twiki-macro IMAGE">clear</span> - -## fourth step (software) - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="picture 5: software" width="170" zoom="on -">%ATTACHURL%/request_step4_software.png</span> Any information you will -give us in this step, helps us to make a rough estimate, if you are able -to realize your project. For Example: you need matlab. Matlab is only -available on Taurus. <span class="twiki-macro IMAGE">clear</span> - -## fifth step (project description) - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="picture 6: project description" width="170" zoom="on -">%ATTACHURL%/request_step5_description.png</span> <span -class="twiki-macro IMAGE">clear</span> - -## sixth step (summary) - -<span class="twiki-macro IMAGE" type="frame" align="right" -caption="picture 8: summary" width="170" zoom="on -">%ATTACHURL%/request_step6.png</span> <span -class="twiki-macro IMAGE">clear</span> +**That's you**, not the leader of this project! +If you have an ZIH-Login, you can use it. +If not, you have to fill in the whole information. + + +{: style="clear:right;"} + +## Second Step: Project Details + +![picture 3: Project Details >][1]{loading=lazy width=300 style="float:right"} +This Step is asking for general project Details. + +Any project have: + +* a title, at least 20 characters long +* a valid duration + * Projects starts at the first of a month and ends on the last day of a month. So you are not + able to send on the second of a month a project request which start in this month. + * The approval is for a maximum of one year. Be careful: a duration from "May, 2013" till + "May 2014" has 13 month. +* a selected science, according to the DFG: + http://www.dfg.de/dfg_profil/gremien/fachkollegien/faecher/index.jsp +* a sponsorship a kind of request a project leader/manager The leader of this project should hold a + professorship (university) or is the head of the research group. + * If you are this person, leave this fields free. + + +{: style="clear:right;"} + +## Third step: Hardware + +{loading=lazy width=300 style="float:right"} +This step inquire the required hardware. The +[hardware specifications](../jobs_and_resources/hardware_overview.md) might help you to estimate, +e. g. the compute time. + +Please fill in the total computing time you expect in the project runtime. The compute time is +given in cores per hour (CPU/h), this refers to the 'virtual' cores for nodes with hyperthreading. +If they require GPUs, then this is given as GPU units per hour (GPU/h). Please add 6 CPU hours per +GPU hour in your application. + +The project home is a shared storage in your project. Here you exchange data or install software +for your project group in userspace. The directory is not intended for active calculations, for this +the scratch is available. 
+
+
+{: style="clear:right;"}
+
+## Fourth Step: Software
+
+{loading=lazy width=300 style="float:right"}
+Any information you give us in this step helps us to make a rough estimate of whether you will be
+able to realize your project. For example, some software requires its own licenses.
+
+
+{: style="clear:right;"}
+
+## Fifth Step: Project Description
+
+![picture 6: Project Description >][2]{loading=lazy width=300 style="float:right"} Please enter a
+short project description here. This is especially important for trial accounts and courses. For
+normal HPC projects, a detailed project description is additionally required, which you can upload
+here.
+
+
+{: style="clear:right;"}
+
+## Sixth Step: Summary
+
+{loading=lazy width=300 style="float:right"}
+Check your entries and confirm the terms of use.
+
+
+{: style="clear:right;"}
+
+[1]: misc/request_step2_details.png "Project Details"
+[2]: misc/request_step5_description.png "Project Description"
diff --git a/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md b/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md
index ce009ace4bdcfc58fc20009eafbc6faf6c4fd553..e221188dcd1c33ef66815d38bffd4a8c5866f48e 100644
--- a/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md
+++ b/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md
@@ -3,7 +3,7 @@
 !!! warning
     This documentation page is outdated.
-    The up-to date documentation on BeeGFS can be found [here](../data_lifecycle/beegfs.md).
+    Please see the [new BeeGFS page](../data_lifecycle/beegfs.md).

 **Prerequisites:** To work with TensorFlow you obviously need a [login](../application/overview.md)
 to the ZIH systems and basic knowledge about Linux, mounting, and batch system Slurm.
@@ -61,8 +61,8 @@ Check the status of the job with `squeue -u \<username>`.

 ## Mount BeeGFS Filesystem

-You can mount BeeGFS filesystem on the ML partition (PowerPC architecture) or on the Haswell
-[partition](../jobs_and_resources/system_taurus.md) (x86_64 architecture)
+You can mount the BeeGFS filesystem on the partition `ml` (PowerPC architecture) or on the
+partition `haswell` (x86_64 architecture). For more information, see [partitions](../jobs_and_resources/partitions_and_limits.md).

 ### Mount BeeGFS Filesystem on the Partition `ml`
diff --git a/doc.zih.tu-dresden.de/docs/archive/cxfs_end_of_support.md b/doc.zih.tu-dresden.de/docs/archive/cxfs_end_of_support.md
index 84e018b655f958ecb2d0a8d35982aad47a66adb2..2854bb2aeccb7d016e91dda4d9de6d717521bf46 100644
--- a/doc.zih.tu-dresden.de/docs/archive/cxfs_end_of_support.md
+++ b/doc.zih.tu-dresden.de/docs/archive/cxfs_end_of_support.md
@@ -1,44 +1,45 @@
-# Changes in the CXFS File System
+# Changes in the CXFS Filesystem

-With the ending support from SGI, the CXFS file system will be seperated
-from its tape library by the end of March, 2013.
+!!! warning

-This file system is currently mounted at
+    This page is outdated!

-- SGI Altix: `/fastfs/`
-- Atlas: `/hpc_fastfs/`
+With the support from SGI ending, the CXFS filesystem will be separated from its tape library by
+the end of March, 2013.

-We kindly ask our users to remove their large data from the file system.
+This filesystem is currently mounted at
+
+* SGI Altix: `/fastfs/`
+* Atlas: `/hpc_fastfs/`
+
+We kindly ask our users to remove their large data from the filesystem.
Files worth keeping can be moved

-- to the new [Intermediate Archive](../data_lifecycle/intermediate_archive.md) (max storage
+* to the new [Intermediate Archive](../data_lifecycle/intermediate_archive.md) (max storage
   duration: 3 years) - see [MigrationHints](#migration-from-cxfs-to-the-intermediate-archive)
   below,
-- or to the [Log-term Archive](../data_lifecycle/preservation_research_data.md) (tagged with
+* or to the [Long-term Archive](../data_lifecycle/preservation_research_data.md) (tagged with
   metadata).

-To run the file system without support comes with the risk of losing
-data. So, please store away your results into the Intermediate Archive.
-`/fastfs` might on only be used for really temporary data, since we are
-not sure if we can fully guarantee the availability and the integrity of
-this file system, from then on.
+Running the filesystem without support comes with the risk of losing data. So, please store away
+your results into the Intermediate Archive. `/fastfs` may only be used for really temporary
+data, since we are not sure if we can fully guarantee the availability and the integrity of this
+filesystem, from then on.

-With the new HRSK-II system comes a large scratch file system with appr.
-800 TB disk space. It will be made available for all running HPC systems
-in due time.
+With the new HRSK-II system comes a large scratch filesystem with approximately 800 TB disk space.
+It will be made available for all running HPC systems in due time.

 ## Migration from CXFS to the Intermediate Archive

 Data worth keeping shall be moved by the users to the directory
 `archive_migration`, which can be found in your project's and your
-personal `/fastfs` directories. (`/fastfs/my_login/archive_migration`,
-`/fastfs/my_project/archive_migration` )
+personal `/fastfs` directories:

-\<u>Attention:\</u> Exclusively use the command `mv`. Do **not** use
-`cp` or `rsync`, for they will store a second version of your files in
-the system.
+* `/fastfs/my_login/archive_migration`
+* `/fastfs/my_project/archive_migration`

-Please finish this by the end of January. Starting on Feb/18/2013, we
-will step by step transfer these directories to the new hardware.
+**Attention:** Exclusively use the command `mv`. Do **not** use `cp` or `rsync`, for they will store
+a second version of your files in the system.

-- Set DENYTOPICVIEW = WikiGuest
+Please finish this by the end of January. Starting on Feb/18/2013, we will transfer these
+directories step by step to the new hardware.
diff --git a/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md b/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md
new file mode 100644
index 0000000000000000000000000000000000000000..3d59d1cc7cf9e93e9a7f3ca78d22100978a72b8f
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md
@@ -0,0 +1,204 @@
+# Jupyter Installation
+
+!!! warning
+
+    This page is outdated!
+
+Jupyter notebooks allow you to analyze data interactively using your web browser. One advantage of
+Jupyter is that code, documentation and visualization can be included in a single notebook, so that
+they form a unit. Jupyter notebooks can be used for many tasks, such as data cleaning and
+transformation, numerical simulation, statistical modeling, data visualization and also machine
+learning.
+
+There are two general options on how to work with Jupyter notebooks on ZIH systems: a remote Jupyter
+server and JupyterHub.
+
+These sections show how to set up and run a remote Jupyter server with GPUs within a Slurm job.
+Furthermore, the following sections explain which modules and packages you need for that. + +!!! note + On ZIH systems, there is a [JupyterHub](../access/jupyterhub.md), where you do not need the + manual server setup described below and can simply run your Jupyter notebook on HPC nodes. Keep + in mind, that, with JupyterHub, you can't work with some special instruments. However, general + data analytics tools are available. + +The remote Jupyter server is able to offer more freedom with settings and approaches. + +## Preparation phase (optional) + +On ZIH system, start an interactive session for setting up the environment: + +```console +marie@login$ srun --pty -n 1 --cpus-per-task=2 --time=2:00:00 --mem-per-cpu=2500 --x11=first bash -l -i +``` + +Create a new directory in your home, e.g. Jupyter + +```console +marie@compute$ mkdir Jupyter +marie@compute$ cd Jupyter +``` + +There are two ways how to run Anaconda. The easiest way is to load the Anaconda module. The second +one is to download Anaconda in your home directory. + +1. Load Anaconda module (recommended): + +```console +marie@compute$ module load modenv/scs5 +marie@compute$ module load Anaconda3 +``` + +1. Download latest Anaconda release (see example below) and change the rights to make it an +executable script and run the installation script: + +```console +marie@compute$ wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh +marie@compute$ chmod u+x Anaconda3-2019.03-Linux-x86_64.sh +marie@compute$ ./Anaconda3-2019.03-Linux-x86_64.sh +``` + +(during installation you have to confirm the license agreement) + +Next step will install the anaconda environment into the home +directory (`/home/userxx/anaconda3`). Create a new anaconda environment with the name `jnb`. + +```console +marie@compute$ conda create --name jnb +``` + +## Set environmental variables + +In the shell, activate previously created python environment (you can +deactivate it also manually) and install Jupyter packages for this python environment: + +```console +marie@compute$ source activate jnb +marie@compute$ conda install jupyter +``` + +If you need to adjust the configuration, you should create the template. Generate configuration +files for Jupyter notebook server: + +```console +marie@compute$ jupyter notebook --generate-config +``` + +Find a path of the configuration file, usually in the home under `.jupyter` directory, e.g. +`/home//.jupyter/jupyter_notebook_config.py` + +Set a password (choose easy one for testing), which is needed later on to log into the server +in browser session: + +```console +marie@compute$ jupyter notebook password +Enter password: +Verify password: +``` + +You get a message like that: + +```bash +[NotebookPasswordApp] Wrote *hashed password* to +/home/marie/.jupyter/jupyter_notebook_config.json +``` + +I order to create a certificate for secure connections, you can create a self-signed +certificate: + +```console +marie@compute$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem +``` + +Fill in the form with decent values. + +Possible entries for your Jupyter configuration (`.jupyter/jupyter_notebook*config.py*`). 
+ +```bash +c.NotebookApp.certfile = u'<path-to-cert>/mycert.pem' +c.NotebookApp.keyfile = u'<path-to-cert>/mykey.key' + +# set ip to '*' otherwise server is bound to localhost only +c.NotebookApp.ip = '*' +c.NotebookApp.open_browser = False + +# copy hashed password from the jupyter_notebook_config.json +c.NotebookApp.password = u'<your hashed password here>' +c.NotebookApp.port = 9999 +c.NotebookApp.allow_remote_access = True +``` + +!!! note + `<path-to-cert>` - path to key and certificate files, for example: + (`/home/marie/mycert.pem`) + +## Slurm job file to run the Jupyter server on ZIH system with GPU (1x K80) (also works on K20) + +```bash +#!/bin/bash -l +#SBATCH --gres=gpu:1 # request GPU +#SBATCH --partition=gpu2 # use partition GPU 2 +#SBATCH --output=notebook_output.txt +#SBATCH --nodes=1 +#SBATCH --ntasks=1 +#SBATCH --time=02:30:00 +#SBATCH --mem=4000M +#SBATCH -J "jupyter-notebook" # job-name +#SBATCH -A p_marie + +unset XDG_RUNTIME_DIR # might be required when interactive instead of sbatch to avoid 'Permission denied error' +srun jupyter notebook +``` + +Start the script above (e.g. with the name `jnotebook`) with sbatch command: + +```bash +sbatch jnotebook.slurm +``` + +If you have a question about sbatch script see the article about [Slurm](../jobs_and_resources/slurm.md). + +Check by the command: `tail notebook_output.txt` the status and the **token** of the server. It +should look like this: + +`https://(taurusi2092.taurus.hrsk.tu-dresden.de or 127.0.0.1):9999/` + +You can see the **server node's hostname** by the command: `squeue --me`. + +### Remote connect to the server + +There are two options on how to connect to the server: + +1. You can create an ssh tunnel if you have problems with the +solution above. Open the other terminal and configure ssh +tunnel: (look up connection values in the output file of Slurm job, e.g.) (recommended): + +```bash +node=taurusi2092 #see the name of the node with squeue -u <your_login> +localport=8887 #local port on your computer +remoteport=9999 #pay attention on the value. It should be the same value as value in the notebook_output.txt +ssh -fNL ${localport}:${node}:${remoteport} <zih_user>@taurus.hrsk.tu-dresden.de #configure the ssh tunnel for connection to your remote server +pgrep -f "ssh -fNL ${localport}" #verify that tunnel is alive +``` + +2. On your client (local machine) you now can connect to the server. You need to know the **node's + hostname**, the **port** of the server and the **token** to login (see paragraph above). + +You can connect directly if you know the IP address (just ping the node's hostname while logged on +ZIH system). + +```bash +#command on remote terminal +marie@taurusi2092$ host taurusi2092 +# copy IP address from output +# paste IP to your browser or call on local terminal e.g.: +marie@local$ firefox https://<IP>:<PORT> # https important to use SSL cert +``` + +To login into the Jupyter notebook site, you have to enter the **token**. +(`https://localhost:8887`). Now you can create and execute notebooks on ZIH system with GPU support. + +!!! important + If you would like to use [JupyterHub](../access/jupyterhub.md) after using a remote manually + configured Jupyter server (example above) you need to change the name of the configuration file + (`/home//.jupyter/jupyter_notebook_config.py`) to any other. 
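    A minimal sketch of such a rename (the new file name is arbitrary):

    ```console
    marie@login$ mv ~/.jupyter/jupyter_notebook_config.py ~/.jupyter/jupyter_notebook_config.py.bak
    ```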
diff --git a/doc.zih.tu-dresden.de/docs/archive/no_ib_jobs.md b/doc.zih.tu-dresden.de/docs/archive/no_ib_jobs.md index 9ccce6361bcaa0bc024644f348708354d269a04f..49007a12354190a0fdde97a14a1a6bda922ea38d 100644 --- a/doc.zih.tu-dresden.de/docs/archive/no_ib_jobs.md +++ b/doc.zih.tu-dresden.de/docs/archive/no_ib_jobs.md @@ -25,8 +25,8 @@ Infiniband access if (and only if) they have set the `--tmp`-option as well: >units can be specified using the suffix \[K\|M\|G\|T\]. This option >applies to job allocations. -Keep in mind: Since the scratch file system are not available and the -project file system is read-only mounted at the compute nodes you have +Keep in mind: Since the scratch filesystem are not available and the +project filesystem is read-only mounted at the compute nodes you have to work in /tmp. A simple job script should do this: @@ -34,7 +34,7 @@ A simple job script should do this: - create a temporary directory on the compute node in `/tmp` and go there - start the application (under /sw/ or /projects/)using input data - from somewhere in the project file system + from somewhere in the project filesystem - archive and transfer the results to some global location ```Bash diff --git a/doc.zih.tu-dresden.de/docs/archive/system_altix.md b/doc.zih.tu-dresden.de/docs/archive/system_altix.md index 951b06137a599fc95239e5d50144fd2fa205e096..aa61353f4bec0c143b7c86892d8f3cb0a3c41d00 100644 --- a/doc.zih.tu-dresden.de/docs/archive/system_altix.md +++ b/doc.zih.tu-dresden.de/docs/archive/system_altix.md @@ -22,9 +22,9 @@ The jobs for these partitions (except Neptun) are scheduled by the [Platform LSF batch system running on `mars.hrsk.tu-dresden.de`. The actual placement of a submitted job may depend on factors like memory size, number of processors, time limit. -### File Systems +### Filesystems -All partitions share the same CXFS file systems `/work` and `/fastfs`. +All partitions share the same CXFS filesystems `/work` and `/fastfs`. ### ccNUMA Architecture @@ -123,8 +123,8 @@ nodes with dedicated resources for the user's job. Normally a job can be submitt #### LSF -The batch system on Atlas is LSF. For general information on LSF, please follow -[this link](platform_lsf.md). +The batch system on Atlas is LSF, see also the +[general information on LSF](platform_lsf.md). #### Submission of Parallel Jobs diff --git a/doc.zih.tu-dresden.de/docs/archive/system_atlas.md b/doc.zih.tu-dresden.de/docs/archive/system_atlas.md index 0e744c4ab702afac9d3ac413ccfb5abd58fef817..2bebd5511e69f98370aea0c721cee272f940fbc6 100644 --- a/doc.zih.tu-dresden.de/docs/archive/system_atlas.md +++ b/doc.zih.tu-dresden.de/docs/archive/system_atlas.md @@ -22,7 +22,7 @@ kernel. Currently, the following hardware is installed: Mars and Deimos users: Please read the [migration hints](migrate_to_atlas.md). -All nodes share the `/home` and `/fastfs` file system with our other HPC systems. Each +All nodes share the `/home` and `/fastfs` filesystem with our other HPC systems. Each node has 180 GB local disk space for scratch mounted on `/tmp`. The jobs for the compute nodes are scheduled by the [Platform LSF](platform_lsf.md) batch system from the login nodes `atlas.hrsk.tu-dresden.de` . @@ -86,8 +86,8 @@ user's job. Normally a job can be submitted with these data: #### LSF -The batch system on Atlas is LSF. For general information on LSF, please follow -[this link](platform_lsf.md). +The batch system on Atlas is LSF, see also the +[general information on LSF](platform_lsf.md). 
#### Submission of Parallel Jobs diff --git a/doc.zih.tu-dresden.de/docs/archive/system_venus.md b/doc.zih.tu-dresden.de/docs/archive/system_venus.md index 2c0a1fe2b83b1c4e7d09f5e2f6495db8658cb7f9..56acf9b47081726c9662150f638ff430e099020c 100644 --- a/doc.zih.tu-dresden.de/docs/archive/system_venus.md +++ b/doc.zih.tu-dresden.de/docs/archive/system_venus.md @@ -19,9 +19,9 @@ the Linux operating system SLES 11 SP 3 with a kernel version 3.x. From our experience, most parallel applications benefit from using the additional hardware hyperthreads. -### File Systems +### Filesystems -Venus uses the same `home` file system as all our other HPC installations. +Venus uses the same `home` filesystem as all our other HPC installations. For computations, please use `/scratch`. ## Usage @@ -77,8 +77,8 @@ nodes with dedicated resources for the user's job. Normally a job can be submitt - files for redirection of output and error messages, - executable and command line parameters. -The batch system on Venus is Slurm. For general information on Slurm, please follow -[this link](../jobs_and_resources/slurm.md). +The batch system on Venus is Slurm. Please see +[general information on Slurm](../jobs_and_resources/slurm.md). #### Submission of Parallel Jobs @@ -92,10 +92,10 @@ On Venus, you can only submit jobs with a core number which is a multiple of 8 ( srun -n 16 a.out ``` -**Please note:** There are different MPI libraries on Taurus and Venus, +**Please note:** There are different MPI libraries on Venus than on other ZIH systems, so you have to compile the binaries specifically for their target. -#### File Systems +#### Filesystems - The large main memory on the system allows users to create RAM disks within their own jobs. diff --git a/doc.zih.tu-dresden.de/docs/archive/unicore_rest_api.md b/doc.zih.tu-dresden.de/docs/archive/unicore_rest_api.md index 3cc59e7beb48a69a2b939542b14fef28cf4047fc..839028f327e069e912f59ffb688ccd1f54b58a40 100644 --- a/doc.zih.tu-dresden.de/docs/archive/unicore_rest_api.md +++ b/doc.zih.tu-dresden.de/docs/archive/unicore_rest_api.md @@ -1,18 +1,15 @@ # UNICORE access via REST API -**%RED%The UNICORE support has been abandoned and so this way of access -is no longer available.%ENDCOLOR%** +!!! warning -Most of the UNICORE features are also available using its REST API. - -This API is documented here: - -<https://sourceforge.net/p/unicore/wiki/REST_API/> + This page is outdated! The UNICORE support has been abandoned and so this way of access is no + longer available. -Some useful examples of job submission via REST are available at: - -<https://sourceforge.net/p/unicore/wiki/REST_API_Examples/> - -The base address for the Taurus system at the ZIH is: +Most of the UNICORE features are also available using its REST API. 
-unicore.zih.tu-dresden.de:8080/TAURUS/rest/core +* This API is documented here: + * [https://sourceforge.net/p/unicore/wiki/REST_API/](https://sourceforge.net/p/unicore/wiki/REST_API/) +* Some useful examples of job submission via REST are available at: + * [https://sourceforge.net/p/unicore/wiki/REST_API_Examples/](https://sourceforge.net/p/unicore/wiki/REST_API_Examples/) +* The base address for the system at the ZIH is: + * `unicore.zih.tu-dresden.de:8080/TAURUS/rest/core` diff --git a/doc.zih.tu-dresden.de/docs/contrib/content_rules.md b/doc.zih.tu-dresden.de/docs/contrib/content_rules.md new file mode 100644 index 0000000000000000000000000000000000000000..2be83c1f78668abb764586741a7de764b5baa112 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/contrib/content_rules.md @@ -0,0 +1,267 @@ +# Content Rules + +**Remark:** Avoid using tabs both in markdown files and in `mkdocs.yaml`. Type spaces instead. + +## New Page and Pages Structure + +The pages structure is defined in the configuration file `mkdocs.yaml`: + +```Bash +docs/ + - Home: index.md + - Application for HPC Login: application.md + - Request for Resources: req_resources.md + - Access to the Cluster: access.md + - Available Software and Usage: + - Overview: software/overview.md + [...] +``` + +To add a new page to the documentation follow these two steps: + +1. Create a new markdown file under `docs/subdir/file_name.md` and put the documentation inside. +The sub-directory and file name should follow the pattern `fancy_title_and_more.md`. +1. Add `subdir/file_name.md` to the configuration file `mkdocs.yml` by updating the navigation + section. + +Make sure that the new page **is not floating**, i.e., it can be reached directly from +the documentation structure. + +## Markdown + +1. Please keep things simple, i.e., avoid using fancy markdown dialects. + * [Cheat Sheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) + * [Style Guide](https://github.com/google/styleguide/blob/gh-pages/docguide/style.md) + +1. Do not add large binary files or high resolution images to the repository. See this valuable + document for [image optimization](https://web.dev/fast/#optimize-your-images). + +1. [Admonitions](https://squidfunk.github.io/mkdocs-material/reference/admonitions/) may be +actively used, especially for longer code examples, warnings, tips, important information that +should be highlighted, etc. Code examples, longer than half screen height should collapsed +(and indented): + +??? example + ```Bash + [...] + # very long example here + [...] + ``` + +## Writing Style + +* Capitalize headings, e.g. *Exclusive Reservation of Hardware* +* Give keywords in link texts, e.g. [Code Blocks](#code-blocks-and-syntax-highlighting) is more + descriptive than [this subsection](#code-blocks-and-syntax-highlighting) +* Use active over passive voice + * Write with confidence. This confidence should be reflected in the documentation, so that + the readers trust and follow it. + * Example: `We recommend something` instead of `Something is recommended.` + +## Spelling and Technical Wording + +To provide a consistent and high quality documentation, and help users to find the right pages, +there is a list of conventions w.r.t. spelling and technical wording. + +* Language settings: en_us +* `I/O` not `IO` +* `Slurm` not `SLURM` +* `Filesystem` not `file system` +* `ZIH system` and `ZIH systems` not `Taurus`, `HRSKII`, `our HPC systems`, etc. 
* `Workspace` not `work space`
+* avoid term `HPC-DA`
+* Partition names after the keyword *partition*: *partition `ml`* not *ML partition*, *ml
+  partition*, *`ml` partition*, *"ml" partition*, etc.
+
+### Long Options
+
+* Use long over short options, e.g. `srun --nodes=2 --ntasks-per-node=4 ...` is preferred over
+  `srun -N 2 -n 4 ...`
+* Use `module` over the short front-end `ml` in documentation and examples
+
+## Code Blocks and Command Prompts
+
+Showing commands and sample output is an important part of all technical documentation. To make
+things as clear as possible for readers and to provide consistent documentation, some rules have to
+be followed.
+
+1. Use ticks to mark code blocks and commands, not italic font.
+1. Specify language for code blocks ([see below](#code-blocks-and-syntax-highlighting)).
+1. All code blocks and commands should be runnable from a login node or a node within a specific
+   partition (e.g., `ml`).
+1. It should be clear from the prompt where the command is run (e.g. local machine, login node or
+   specific partition).
+
+### Prompts
+
+We follow these rules regarding prompts:
+
+| Host/Partition         | Prompt           |
+|------------------------|------------------|
+| Login nodes            | `marie@login$`   |
+| Arbitrary compute node | `marie@compute$` |
+| `haswell` partition    | `marie@haswell$` |
+| `ml` partition         | `marie@ml$`      |
+| `alpha` partition      | `marie@alpha$`   |
+| `romeo` partition      | `marie@romeo$`   |
+| `julia` partition      | `marie@julia$`   |
+| Localhost              | `marie@local$`   |
+
+*Remarks:*
+
+* **Always use a prompt**, even if there is no output provided for the shown command.
+* All code blocks should use long parameter names (e.g. Slurm parameters), if available.
+* All code blocks which specify some general command templates, e.g. containing `<` and `>`
+  (see [Placeholders](#mark-placeholders)), should use `bash` for the code block. Additionally,
+  an example invocation, perhaps with output, should be given with the normal `console` code block.
+  See also [Code Block description below](#code-blocks-and-syntax-highlighting).
+* Using some magic, the prompt as well as the output is identified and will not be copied!
+* Stick to the [generic user name](#data-privacy-and-generic-user-name) `marie`.
+
+### Code Blocks and Syntax Highlighting
+
+This project makes use of the extension
+[pymdownx.highlight](https://squidfunk.github.io/mkdocs-material/reference/code-blocks/) for syntax
+highlighting. There is a complete list of supported
+[language short codes](https://pygments.org/docs/lexers/).
+
+For consistency, use the following short codes within this project:
+
+With the exception of command templates, use `console` for shell sessions and console:
+
+````markdown
+```console
+marie@login$ ls
+foo
+bar
+```
+````
+
+Make sure that shell session and console code blocks are executable on the login nodes of the HPC system.
+
+Command templates use [Placeholders](#mark-placeholders) to mark replaceable code parts. Command
+templates should give a general idea of invocation and thus do not contain any output.
Use a +`bash` code block followed by an invocation example (with `console`): + +````markdown +```bash +marie@local$ ssh -NL <local port>:<compute node>:<remote port> <zih login>@tauruslogin.hrsk.tu-dresden.de +``` + +```console +marie@local$ ssh -NL 5901:172.24.146.46:5901 marie@tauruslogin.hrsk.tu-dresden.de +``` +```` + +Also use `bash` for shell scripts such as job files: + +````markdown +```bash +#!/bin/bash +#SBATCH --nodes=1 +#SBATCH --time=01:00:00 +#SBATCH --output=slurm-%j.out + +module load foss + +srun a.out +``` +```` + +!!! important + + Use long parameter names where possible to ease understanding. + +`python` for Python source code: + +````markdown +```python +from time import gmtime, strftime +print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) +``` +```` + +`pycon` for Python console: + +````markdown +```pycon +>>> from time import gmtime, strftime +>>> print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) +2021-08-03 07:20:33 +``` +```` + +Line numbers can be added via + +````markdown +```bash linenums="1" +#!/bin/bash + +#SBATCH -N 1 +#SBATCH -n 23 +#SBATCH -t 02:10:00 + +srun a.out +``` +```` + +_Result_: + + + +Specific Lines can be highlighted by using + +```` markdown +```bash hl_lines="2 3" +#!/bin/bash + +#SBATCH -N 1 +#SBATCH -n 23 +#SBATCH -t 02:10:00 + +srun a.out +``` +```` + +_Result_: + + + +### Data Privacy and Generic User Name + +Where possible, replace login, project name and other private data with clearly arbitrary placeholders. +E.g., use the generic login `marie` and the corresponding project name `p_marie`. + +```console +marie@login$ ls -l +drwxr-xr-x 3 marie p_marie 4096 Jan 24 2020 code +drwxr-xr-x 3 marie p_marie 4096 Feb 12 2020 data +-rw-rw---- 1 marie p_marie 4096 Jan 24 2020 readme.md +``` + +## Mark Omissions + +If showing only a snippet of a long output, omissions are marked with `[...]`. + +## Unix Rules + +Stick to the Unix rules on optional and required arguments, and selection of item sets: + +* `<required argument or value>` +* `[optional argument or value]` +* `{choice1|choice2|choice3}` + +## Graphics and Attachments + +All graphics and attachments are saved within `misc` directory of the respective sub directory in +`docs`. + +The syntax to insert a graphic or attachment into a page is + +```Bash + +{: align="center"} +``` + +The attribute `align` is optional. By default, graphics are left aligned. **Note:** It is crucial to +have `{: align="center"}` on a new line. diff --git a/doc.zih.tu-dresden.de/docs/contrib/contribute_browser.md b/doc.zih.tu-dresden.de/docs/contrib/contribute_browser.md new file mode 100644 index 0000000000000000000000000000000000000000..45e8018d263300c03101f1374b6350ce58a131dd --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/contrib/contribute_browser.md @@ -0,0 +1,105 @@ +# Contribution Guide for Browser-based Editing + +In the following, it is outlined how to contribute to the +[HPC documentation](https://doc.zih.tu-dresden.de/) of +[TU Dresden/ZIH](https://tu-dresden.de/zih/) by means of GitLab's web interface using a standard web +browser only. + +## Preparation + +First of all, you need an account on [gitlab.hrz.tu-chemnitz.de](https://gitlab.hrz.tu-chemnitz.de). +Secondly, you need access to the project +[ZIH/hpcsupport/hpc-compendium](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium). + +The project is publicly visible, i.e., it is open to the world and any signed-in user has the +[Guest role](https://gitlab.hrz.tu-chemnitz.de/help/user/permissions.md) on this repository. 
Guests +have only very +[limited permissions](https://gitlab.hrz.tu-chemnitz.de/help/user/permissions.md#project-members-permissions). +In particular, as guest, you can contribute to the documentation by +[creating issues](howto_contribute.md#contribute-via-issue), but you cannot edit files and create +new branches. + +To be granted the role **Developer**, please request access by clicking the corresponding button. + + + +Once you are granted the developer role, choose "ZIH/hpcsupport/hpc-compendium" in your project list. + +!!! hint "Git basics" + + If you are not familiar with the basics of git-based document revision control yet, please have + a look at [Gitlab tutorials](https://gitlab.hrz.tu-chemnitz.de/help/gitlab-basics/index.md). + +## Create a Branch + +Your contribution starts by creating your own branch of the repository that will hold your edits and +additions. Create your branch by clicking on "+" near "preview->hpc-compendium/" as depicted in +the figure and click "New branch". + + + +By default, the new branch should be created from the `preview` branch, as pre-selected. + +Define a branch name that briefly describes what you plan to change, e.g., `edits-in-document-xyz`. +Then, click on "Create branch" as depicted in this figure: + + + +As a result, you should now see your branch's name on top of your list of repository files as +depicted here: + + + +## Editing Existing Articles + +Navigate the depicted document hierarchy under `doc.zih.tu-dresden.de/docs` until you find the +article to be edited. A click on the article's name opens a textual representation of the article. +In the top right corner of it, you find the button "Edit" to be clicked in order to make changes. +Once you completed your changes, click on "Commit changes". Please add meaningful comment about the +changes you made under "Commit message". Feel free to do as many changes and commits as you wish in +your branch of the repository. + +## Adding New Article + +Navigate the depicted document hierarchy under `doc.zih.tu-dresden.de/docs` to find a topic that +fits best to your article. To start a completely new article, click on "+ New file" as depicted +here: + + + +Set a file name that corresponds well to your article like `application_xyz.md`. +(The file name should follow the pattern `fancy_title_and_more.md`.) +Once you completed your initial edits, click on "commit". + + + +Finally, the new article needs to be added to the navigation section of the configuration file +`doc.zih.tu-dresden.de/mkdocs.yaml`. + +## Submitting Articles for Publication + +Once you are satisfied with your edits, you are ready for publication. +Therefore, your edits need to undergo an internal review process and pass the CI/CD pipeline tests. +This process is triggered by creating a "merge request", which serves the purpose of merging your edits +into the `preview` branch of the repository. + +* Click on "Merge requests" (in the menu to the left) as depicted below. +* Then, click on the button "New merge request". +* Select your source branch (for example `edits-in-document-xyz`) and click on "Compare branches and + continue". (The target branch is always `preview`. This is pre-selected - do not change!) +* The next screen will give you an overview of your changes. Please provide a meaningful + description of the contributions. Once you checked them, click on "Create merge request". + + + +## Revision of Articles + +As stated earlier, all changes undergo a review process. 
+This covers automated checks contained in the CI/CD pipeline and the review by a maintainer.
+You can follow this process under
+[Merge requests](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/merge_requests)
+(where you initiated your merge request).
+If you are asked to make corrections or changes, follow the directions as indicated.
+Once your merge request has been accepted, the merge request will be closed and the branch will be deleted.
+At this point, there is nothing else for you to do, except perhaps waiting a little while until
+your changes become visible on the official website.
diff --git a/doc.zih.tu-dresden.de/docs/contrib/contribute_container.md b/doc.zih.tu-dresden.de/docs/contrib/contribute_container.md
new file mode 100644
index 0000000000000000000000000000000000000000..dd44fafa136d63ae80267226f70dc00563507ba3
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/contrib/contribute_container.md
@@ -0,0 +1,150 @@
+# Contributing Using a Local Clone and a Docker Container
+
+## Git Procedure
+
+Please follow this standard Git procedure for working with a local clone:
+
+1. Change to a local (unencrypted) filesystem. (We have seen problems running the container on an
+ecryptfs filesystem. So you might want to use e.g. `/tmp` as the start directory.)
+1. Create a new directory, e.g. with `mkdir hpc-wiki`
+1. Change into the new directory, e.g. `cd hpc-wiki`
+1. Clone the Git repository:
+`git clone git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git .` (don't forget the
+dot)
+1. Create a new feature branch for you to work in. Ideally, name it like the file you want to
+modify or the issue you want to work on, e.g.: `git checkout -b issue-174`. (If you are uncertain
+about the name of a file, please look into `mkdocs.yaml`.)
+1. Improve the documentation with your preferred editor, i.e., add new files and correct mistakes.
+Formal errors will later be detected automatically by our CI pipeline.
+1. Use `git add <FILE>` to select your improvements for the next commit.
+1. Commit the changes with `git commit -m "<DESCRIPTION>"`. The description should be a meaningful
+description of your changes. If you work on an issue, please also add "Closes 174" (for issue 174).
+1. Push the local changes to the GitLab server, e.g. with `git push origin issue-174`.
+1. As an output you get a link to create a merge request against the preview branch.
+1. When the merge request is created, a continuous integration (CI) pipeline automatically checks
+your contributions.
+
+You can find the details and commands to preview your changes and apply checks in the next section.
+
+## Preparation
+
+Assuming you have cloned the repository as mentioned above, a few more prerequisites are necessary:
+
+* a working Docker installation
+* all necessary access/execution rights
+* a local clone of the repository in the directory `./hpc-wiki`
+
+Build the docker image. This might take a bit longer, but you have to
+run this step only once in a while.
+
+```bash
+cd hpc-wiki
+docker build -t hpc-compendium .
+```
+
+## Working with the Docker Container
+
+Here is a suggested workflow which might be suitable for you.
+
+### Start the Local Web Server
+
+The command to start the dockerized web server is:
+
+```bash
+docker run --name=hpc-compendium -p 8000:8000 --rm -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium bash -c "mkdocs build && mkdocs serve -a 0.0.0.0:8000"
+```
+
+You can now view the documentation via `http://localhost:8000` in your browser.
+
+!!! note
+
+    You can keep the local web server running in this shell to always have the opportunity to see
+    the result of your changes in the browser. Simply open another terminal window for other
+    commands.
+
+You can now update the contents in your preferred editor. The running container automatically takes
+care of file changes and rebuilds the documentation whenever you save a file.
+
+With the details described below, it will then be easy to follow the guidelines for local
+correctness checks before submitting your changes and requesting the merge.
+
+### Run the Proposed Checks Inside Container
+
+In our continuous integration (CI) pipeline, a merge request triggers the automated check of
+
+* correct links,
+* correct spelling,
+* correct text format.
+
+If one of them fails, the merge request will not be accepted. To prevent this, you can run these
+checks locally and adapt your files accordingly.
+
+To avoid a lot of retyping, use the following in your shell:
+
+```bash
+alias wiki="docker run --name=hpc-compendium --rm -it -w /docs --mount src=$PWD/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium bash -c"
+```
+
+You are now ready to use the different checks; however, we suggest trying the pre-commit hook first.
+
+#### Pre-commit Git Hook
+
+We recommend automatically running checks whenever you try to commit a change. In this case, failing
+checks prevent commits (unless you use option `--no-verify`). This can be accomplished by adding a
+pre-commit hook to your local clone of the repository. The following code snippet shows how to do
+that:
+
+```bash
+cp doc.zih.tu-dresden.de/util/pre-commit .git/hooks/
+```
+
+!!! note
+    The pre-commit hook only works if you can use docker without using `sudo`. If this is not
+    already the case, use the command `adduser $USER docker` to enable docker commands without
+    `sudo` for the current user. Restart the docker daemons afterwards.
+
+Read on if you want to run a specific check.
+
+#### Linter
+
+If you want to check whether the markdown files are formatted properly, use the following command:
+
+```bash
+wiki 'markdownlint docs'
+```
+
+#### Spell Checker
+
+For spell-checking a single file, e.g.
+`doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md`, use:
+
+```bash
+wiki 'util/check-spelling.sh docs/software/big_data_frameworks_spark.md'
+```
+
+For spell-checking all files, use:
+
+```bash
+wiki 'find docs -type f -name "*.md" | xargs -L1 util/check-spelling.sh'
+```
+
+This outputs all words of all files that are unknown to the spell checker.
+To let the spell checker "know" a word, append it to
+`doc.zih.tu-dresden.de/wordlist.aspell`.
+
+#### Link Checker
+
+To check a single file, e.g.
+`doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md`, use:
+
+```bash
+wiki 'markdown-link-check docs/software/big_data_frameworks_spark.md'
+```
+
+To check whether there are links that point to a wrong target, use
+(this may take a while and gives a lot of output because it runs over all files):
+
+```bash
+wiki 'find docs -type f -name "*.md" | xargs -L1 markdown-link-check'
+```
diff --git a/doc.zih.tu-dresden.de/docs/contrib/howto_contribute.md b/doc.zih.tu-dresden.de/docs/contrib/howto_contribute.md
new file mode 100644
index 0000000000000000000000000000000000000000..e0d91cccc3f534e0d7057b72f1d6479f8932b6aa
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/contrib/howto_contribute.md
@@ -0,0 +1,51 @@
+# How-To Contribute
+
+!!! cite "Chinese proverb"
+
+    Ink is better than the best memory.
+
+In principle, there are three possible ways to contribute to this documentation.
+
+## Contribute via Issue
+
+Users can contribute to the documentation via the
+[GitLab issue tracking system](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/issues).
+For that, open an issue to report typos and missing documentation or request for more precise
+wording etc. ZIH staff will get in touch with you to resolve the issue and improve the
+documentation.
+
+??? tip "Create an issue in GitLab"
+
+    
+    {: align=center}
+
+!!! warning "HPC support"
+
+    Non-documentation issues and requests need to be sent as a ticket to
+    [hpcsupport@zih.tu-dresden.de](mailto:hpcsupport@zih.tu-dresden.de).
+
+## Contribute via Web IDE
+
+GitLab offers a rich and versatile web interface to work with repositories. To fix typos and edit
+source files, follow these steps:
+
+1. Navigate to the repository at
+[https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium)
+and log in.
+1. Select the right branch.
+1. Select the file of interest in `doc.zih.tu-dresden.de/docs/...` and click the `Edit` button.
+1. A text and commit editor are invoked: Make your changes, add a meaningful commit message and commit
+   the changes.
+
+The more sophisticated integrated Web IDE is reached from the top level menu of the repository or
+by selecting any source file.
+
+Other git services might have an equivalent web interface to interact with the repository. Please
+refer to the corresponding documentation for further information.
+
+## Contribute Using Git Locally
+
+For experienced Git users, we provide a Docker container that includes all checks of the CI engine
+used in the back-end. Using these checks should ensure that merge requests will not be blocked
+due to the automatic checking.
+For details, refer to the page [Work Locally Using Containers](contribute_container.md).
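
For orientation, here is a minimal sketch of that local workflow under the assumptions described on
the linked page (the branch name `issue-174` is just an example):

```bash
# clone the repository and create a feature branch
git clone git@gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium.git hpc-wiki
cd hpc-wiki
git checkout -b issue-174

# build the Docker image once; it provides the CI checks locally
docker build -t hpc-compendium .

# edit files, then stage, commit and push them
git add <FILE>
git commit -m "<DESCRIPTION>"
git push origin issue-174
```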
diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_branch_indicator.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_branch_indicator.png new file mode 100644 index 0000000000000000000000000000000000000000..1c024c55142a12d390d4eaf8306632ed80e0eb9a Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_branch_indicator.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_commit_file.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_commit_file.png new file mode 100644 index 0000000000000000000000000000000000000000..3df543cb2940c808a24bc7be023691aba40ff9c7 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_commit_file.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_branch.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_branch.png new file mode 100644 index 0000000000000000000000000000000000000000..8e9bca4e7fcc8014f725c1c1d024037e23a64204 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_branch.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_file.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_file.png new file mode 100644 index 0000000000000000000000000000000000000000..30fed32f3c5a12b91dc0c7cd2250978653ea84f6 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_file.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_new_merge_request.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_new_merge_request.png new file mode 100644 index 0000000000000000000000000000000000000000..e74b1ec4d43c6017fa7d1e6326996c30795c71a6 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_new_merge_request.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_set_branch_name.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_set_branch_name.png new file mode 100644 index 0000000000000000000000000000000000000000..4da02249faeea31495c792bc045d593d9b989a04 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_set_branch_name.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/create_gitlab_issue.gif b/doc.zih.tu-dresden.de/docs/contrib/misc/create_gitlab_issue.gif new file mode 100644 index 0000000000000000000000000000000000000000..cb4910897903283e43b21feadcdb2acf2f42c15e Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/create_gitlab_issue.gif differ diff --git a/doc.zih.tu-dresden.de/misc/highlight_lines.png b/doc.zih.tu-dresden.de/docs/contrib/misc/highlight_lines.png similarity index 100% rename from doc.zih.tu-dresden.de/misc/highlight_lines.png rename to doc.zih.tu-dresden.de/docs/contrib/misc/highlight_lines.png diff --git a/doc.zih.tu-dresden.de/misc/lines.png b/doc.zih.tu-dresden.de/docs/contrib/misc/lines.png similarity index 100% rename from doc.zih.tu-dresden.de/misc/lines.png rename to doc.zih.tu-dresden.de/docs/contrib/misc/lines.png diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/request_access.png b/doc.zih.tu-dresden.de/docs/contrib/misc/request_access.png new file mode 100644 index 0000000000000000000000000000000000000000..c051e93b6a149ed69e95e5d9b653a80110836266 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/request_access.png differ diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/intermediate_archive.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/intermediate_archive.md index e63f3f2876f98aeaa8c6a08e41fd21cc8eab7869..bcfc86b6b35f01bc0a5a1eebffdf65ee6319d171 100644 --- 
a/doc.zih.tu-dresden.de/docs/data_lifecycle/intermediate_archive.md +++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/intermediate_archive.md @@ -1,18 +1,18 @@ # Intermediate Archive With the "Intermediate Archive", ZIH is closing the gap between a normal disk-based filesystem and -[Longterm Archive](preservation_research_data.md). The Intermediate Archive is a hierarchical +[Long-term Archive](preservation_research_data.md). The Intermediate Archive is a hierarchical filesystem with disks for buffering and tapes for storing research data. Its intended use is the storage of research data for a maximal duration of 3 years. For storing the data after exceeding this time, the user has to supply essential metadata and migrate the files to -the [Longterm Archive](preservation_research_data.md). Until then, she/he has to keep track of her/his +the [Long-term Archive](preservation_research_data.md). Until then, she/he has to keep track of her/his files. Some more information: - Maximum file size in the archive is 500 GB (split up your files, see - [Datamover](../data_transfer/data_mover.md)) + [Datamover](../data_transfer/datamover.md)) - Data will be stored in two copies on tape. - The bandwidth to this data is very limited. Hence, this filesystem must not be used directly as input or output for HPC jobs. @@ -20,7 +20,7 @@ Some more information: ## Access the Intermediate Archive For storing and restoring your data in/from the "Intermediate Archive" you can use the tool -[Datamover](../data_transfer/data_mover.md). To use the DataMover you have to login to ZIH systems. +[Datamover](../data_transfer/datamover.md). To use the DataMover you have to login to ZIH systems. ### Store Data diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md index e20e2ace134dad1c4fbbb94b2fc3d0a0f1401df1..a0152832601f63ef68ecb2191414265896237434 100644 --- a/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md +++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md @@ -10,7 +10,7 @@ uniformity of the project can be achieved by taking into account and setting up The used set of software within an HPC project can be management with environments on different levels either defined by [modules](../software/modules.md), [containers](../software/containers.md) -or by [Python virtual environments](../software/python.md). +or by [Python virtual environments](../software/python_virtual_environments.md). In the following, a brief overview on relevant topics w.r.t. data life cycle management is provided. ## Data Storage and Management @@ -18,11 +18,11 @@ In the following, a brief overview on relevant topics w.r.t. data life cycle man The main concept of working with data on ZIH systems bases on [Workspaces](workspaces.md). Use it properly: - * use a `/home` directory for the limited amount of personal data, simple examples and the results - of calculations. The home directory is not a working directory! However, `/home` filesystem is - [backed up](#backup) using snapshots; - * use `workspaces` as a place for working data (i.e. datasets); Recommendations of choosing the - correct storage system for workspace presented below. +* use a `/home` directory for the limited amount of personal data, simple examples and the results + of calculations. The home directory is not a working directory! However, `/home` filesystem is + [backed up](#backup) using snapshots; +* use `workspaces` as a place for working data (i.e. 
data sets); Recommendations of choosing the + correct storage system for workspace presented below. ### Taxonomy of Filesystems @@ -30,37 +30,29 @@ It is important to design your data workflow according to characteristics, like (bandwidth/IOPS) of the application, size of the data, (number of files,) and duration of the storage to efficiently use the provided storage and filesystems. The page [filesystems](file_systems.md) holds a comprehensive documentation on the different -filesystems. <!--In general, the mechanisms of so-called--> <!--[Workspaces](workspaces.md) are -compulsory for all HPC users to store data for a defined duration ---> <!--depending on the -requirements and the storage system this time span might range from days to a few--> <!--years.--> -<!--- [HPC filesystems](file_systems.md)--> <!--- [Intermediate -Archive](intermediate_archive.md)--> <!--- [Special data containers] **todo** Special data -containers (was no valid link in old compendium)--> <!--- [Move data between filesystems] -(../data_transfer/data_mover.md)--> <!--- [Move data to/from ZIH's filesystems] -(../data_transfer/export_nodes.md)--> <!--- [Longterm Preservation for -ResearchData](preservation_research_data.md)--> +filesystems. !!! hint "Recommendations to choose of storage system" * For data that seldom changes but consumes a lot of space, the [warm_archive](file_systems.md#warm_archive) can be used. (Note that this is mounted **read-only** on the compute nodes). - * For a series of calculations that works on the same data please use a `scratch` based [workspace](workspaces.md). + * For a series of calculations that works on the same data please use a `scratch` based + [workspace](workspaces.md). * **SSD**, in its turn, is the fastest available filesystem made only for large parallel applications running with millions of small I/O (input, output operations). * If the batch job needs a directory for temporary data then **SSD** is a good choice as well. The data can be deleted afterwards. Keep in mind that every workspace has a storage duration. Thus, be careful with the expire date -otherwise it could vanish. The core data of your project should be [backed up](#backup) and -[archived]**todo link** (for the most [important]**todo link** data). +otherwise it could vanish. The core data of your project should be [backed up](#backup) and the most +important data should be [archived](preservation_research_data.md). ### Backup The backup is a crucial part of any project. Organize it at the beginning of the project. The backup mechanism on ZIH systems covers **only** the `/home` and `/projects` filesystems. Backed up -files can be restored directly by the users. Details can be found -[here](file_systems.md#backup-and-snapshots-of-the-file-system). +files can be restored directly by users, see [Snapshots](permanent.md#snapshots). !!! warning @@ -68,18 +60,18 @@ files can be restored directly by the users. Details can be found ### Folder Structure and Organizing Data -Organizing of living data using the filesystem helps for consistency and structuredness of the +Organizing of living data using the filesystem helps for consistency of the project. We recommend following the rules for your work regarding: - * Organizing the data: Never change the original data; Automatize the organizing the data; Clearly - separate intermediate and final output in the filenames; Carry identifier and original name - along in your analysis pipeline; Make outputs clearly identifiable; Document your analysis - steps. 
- * Naming Data: Keep short, but meaningful names; Keep standard file endings; File names - don’t replace documentation and metadata; Use standards of your discipline; Make rules for your - project, document and keep them (See the [README recommendations]**todo link** below) +* Organizing the data: Never change the original data; Automatize the organizing the data; Clearly + separate intermediate and final output in the filenames; Carry identifier and original name + along in your analysis pipeline; Make outputs clearly identifiable; Document your analysis + steps. +* Naming Data: Keep short, but meaningful names; Keep standard file endings; File names + don’t replace documentation and metadata; Use standards of your discipline; Make rules for your + project, document and keep them (See the [README recommendations](#readme-recommendation) below) -This is the example of an organisation (hierarchical) for the folder structure. Use it as a visual +This is the example of an organization (hierarchical) for the folder structure. Use it as a visual illustration of the above:  @@ -126,50 +118,10 @@ Don't forget about data hygiene: Classify your current data into critical (need its life cycle (from creation, storage and use to sharing, archiving and destruction); Erase the data you don’t need throughout its life cycle. -<!--## Software Packages--> - -<!--As was written before the module concept is the basic concept for using software on ZIH systems.--> -<!--Uniformity of the project has to be achieved by using the same set of software on different levels.--> -<!--It could be done by using environments. There are two types of environments should be distinguished:--> -<!--runtime environment (the project level, use scripts to load [modules]**todo link**), Python virtual--> -<!--environment. The concept of the environment will give an opportunity to use the same version of the--> -<!--software on every level of the project for every project member.--> - -<!--### Private Individual and Project Modules Files--> - -<!--[Private individual and project module files]**todo link** will be discussed in [chapter 7]**todo--> -<!--link**. Project modules list is a powerful instrument for effective teamwork.--> - -<!--### Python Virtual Environment--> - -<!--If you are working with the Python then it is crucial to use the virtual environment on ZIH Systems. The--> -<!--main purpose of Python virtual environments (don't mess with the software environment for modules)--> -<!--is to create an isolated environment for Python projects (self-contained directory tree that--> -<!--contains a Python installation for a particular version of Python, plus a number of additional--> -<!--packages).--> - -<!--**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. We--> -<!--recommend using venv to work with Tensorflow and Pytorch on ZIH systems. It has been integrated into the--> -<!--standard library under the [venv module]**todo link**. **Conda** is the second way to use a virtual--> -<!--environment on the ZIH systems. 
Conda is an open-source package management system and environment--> -<!--management system from the Anaconda.--> - -<!--[Detailed information]**todo link** about using the virtual environment.--> - -<!--## Application Software Availability--> - -<!--Software created for the purpose of the project should be available for all members of the group.--> -<!--The instruction of how to use the software: installation of packages, compilation etc should be--> -<!--documented and gives the opportunity to comfort efficient and safe work.--> - -## Access rights +## Access Rights The concept of **permissions** and **ownership** is crucial in Linux. See the -[HPC-introduction]**todo link** slides for the understanding of the main concept. Standard Linux -changing permission command (i.e `chmod`) valid for ZIH systems as well. The **group** access level -contains members of your project group. Be careful with 'write' permission and never allow to change -the original data. - -Useful links: [Data Management]**todo link**, [Filesystems]**todo link**, [Get Started with -HPC Data Analytics]**todo link**, [Project Management]**todo link**, [Preservation research -data[**todo link** +[slides of HPC introduction](../misc/HPC-Introduction.pdf) for understanding of the main concept. +Standard Linux changing permission command (i.e `chmod`) valid for ZIH systems as well. The +**group** access level contains members of your project group. Be careful with 'write' permission +and never allow to change the original data. diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/preservation_research_data.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/preservation_research_data.md index 29399cf9f323337bacb34e76a5da8412d599119d..79ae1cf00b45f8bf46bc054e1502fc9404417b75 100644 --- a/doc.zih.tu-dresden.de/docs/data_lifecycle/preservation_research_data.md +++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/preservation_research_data.md @@ -1,4 +1,4 @@ -# Longterm Preservation for Research Data +# Long-term Preservation for Research Data ## Why should research data be preserved? @@ -47,42 +47,38 @@ stored in XML-format but free text is also possible. There are some meta-data st Below are some examples: - possible meta-data for a book would be: - - Title - - Author - - Publisher - - Publication - - year - - ISBN + - Title + - Author + - Publisher + - Publication + - year + - ISBN - possible meta-data for an electronically saved image would be: - - resolution of the image - - information about the colour depth of the picture - - file format (jpg or tiff or ...) - - file size how was this image created (digital camera, scanner, ...) - - description of what the image shows - - creation date of the picture - - name of the person who made the picture + - resolution of the image + - information about the color depth of the picture + - file format (jpg or tiff or ...) + - file size how was this image created (digital camera, scanner, ...) 
+ - description of what the image shows + - creation date of the picture + - name of the person who made the picture - meta-data for the result of a calculation/simulation could be: - - file format - - file size - - input data - - which software in which version was used to calculate the result/to do the simulation - - configuration of the software - - date of the calculation/simulation (start/end or start/duration) - - computer on which the calculation/simulation was done - - name of the person who submitted the calculation/simulation - - description of what was calculated/simulated + - file format + - file size + - input data + - which software in which version was used to calculate the result/to do the simulation + - configuration of the software + - date of the calculation/simulation (start/end or start/duration) + - computer on which the calculation/simulation was done + - name of the person who submitted the calculation/simulation + - description of what was calculated/simulated ## Where can I get more information about management of research data? -Got to - -- <http://www.forschungsdaten.org/> (german version) or <http://www.forschungsdaten.org/en/> -- (english version) - -to find more information about managing research data. +Go to [http://www.forschungsdaten.org/en/](http://www.forschungsdaten.org/en/) to find more +information about managing research data. ## I want to store my research data at ZIH. How can I do that? -Longterm preservation of research data is under construction at ZIH and in a testing phase. +Long-term preservation of research data is under construction at ZIH and in a testing phase. Nevertheless you can already use the archiving service. If you would like to become a test -user, please write an E-Mail to Dr. Klaus Köhler (klaus.koehler \[at\] tu-dresden.de). +user, please write an E-Mail to [Dr. Klaus Köhler](mailto:klaus.koehler@tu-dresden.de). diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/quotas.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/quotas.md deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/workspaces.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/workspaces.md index f5e217de6b34e861004b54de3fb4d6cb5004a2ce..cad27c4df4070206644612d85d7fadc7658e15f4 100644 --- a/doc.zih.tu-dresden.de/docs/data_lifecycle/workspaces.md +++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/workspaces.md @@ -2,15 +2,15 @@ Storage systems differ in terms of capacity, streaming bandwidth, IOPS rate, etc. Price and efficiency don't allow to have it all in one. That is why fast parallel filesystems at ZIH have -restrictions with regards to **age of files** and [quota](quotas.md). The mechanism of workspaces +restrictions with regards to **age of files** and [quota](permanent.md#quotas). The mechanism of workspaces enables users to better manage their HPC data. -<!--Workspaces are primarily login-related.--> -The concept of "workspaces" is common and used at a large number of HPC centers. + +The concept of workspaces is common and used at a large number of HPC centers. !!! note - A workspace is a directory, with an associated expiration date, created on behalf of a user in a - certain storage system. + A **workspace** is a directory, with an associated expiration date, created on behalf of a user + in a certain filesystem. Once the workspace has reached its expiration date, it gets moved to a hidden directory and enters a grace period. 
Once the grace period ends, the workspace is deleted permanently. The maximum lifetime @@ -30,8 +30,8 @@ times. To list all available filesystems for using workspaces use: -```bash -zih$ ws_find -l +```console +marie@login$ ws_find -l Available filesystems: scratch warm_archive @@ -43,8 +43,8 @@ beegfs_global0 To list all workspaces you currently own, use: -```bash -zih$ ws_list +```console +marie@login$ ws_list id: test-workspace workspace directory : /scratch/ws/0/marie-test-workspace remaining time : 89 days 23 hours @@ -59,8 +59,8 @@ id: test-workspace To create a workspace in one of the listed filesystems use `ws_allocate`. It is necessary to specify a unique name and the duration of the workspace. -```bash -ws_allocate: [options] workspace_name duration +```console +marie@login$ ws_allocate: [options] workspace_name duration Options: -h [ --help] produce help message @@ -74,13 +74,12 @@ Options: -u [ --username ] arg username -g [ --group ] group workspace -c [ --comment ] arg comment - ``` !!! example - ```bash - zih$ ws_allocate -F scratch -r 7 -m marie.testuser@tu-dresden.de test-workspace 90 + ```console + marie@login$ ws_allocate -F scratch -r 7 -m marie.testuser@tu-dresden.de test-workspace 90 Info: creating workspace. /scratch/ws/marie-test-workspace remaining extensions : 10 @@ -95,22 +94,23 @@ days with an email reminder for 7 days before the expiration. Setting the reminder to `7` means you will get a reminder email on every day starting `7` prior to expiration date. -### Extention of a Workspace +### Extension of a Workspace The lifetime of a workspace is finite. Different filesystems (storage systems) have different maximum durations. A workspace can be extended multiple times, depending on the filesystem. -| Storage system (use with parameter -F ) | Duration, days | Extensions | Remarks | -|:------------------------------------------:|:----------:|:-------:|:---------------------------------------------------------------------------------------:| -| `ssd` | 30 | 10 | High-IOPS filesystem (`/lustre/ssd`) on SSDs. | -| `beegfs` | 30 | 2 | High-IOPS filesystem (`/lustre/ssd`) onNVMes. | -| `scratch` | 100 | 2 | Scratch filesystem (/scratch) with high streaming bandwidth, based on spinning disks | -| `warm_archive` | 365 | 2 | Capacity filesystem based on spinning disks | +| Filesystem (use with parameter `-F`) | Duration, days | Extensions | Remarks | +|:------------------------------------:|:----------:|:-------:|:-----------------------------------:| +| `ssd` | 30 | 2 | High-IOPS filesystem (`/lustre/ssd`, symbolic link: `/ssd`) on SSDs. | +| `beegfs_global0` (deprecated) | 30 | 2 | High-IOPS filesystem (`/beegfs/global0`) on NVMes. | +| `beegfs` | 30 | 2 | High-IOPS filesystem (`/beegfs`) on NVMes. | +| `scratch` | 100 | 10 | Scratch filesystem (`/lustre/ssd`, symbolic link: `/scratch`) with high streaming bandwidth, based on spinning disks | +| `warm_archive` | 365 | 2 | Capacity filesystem based on spinning disks | -To extend your workspace use the following command: +To extent your workspace use the following command: -``` -zih$ ws_extend -F scratch test-workspace 100 #extend the workspace for 100 days +```console +marie@login$ ws_extend -F scratch test-workspace 100 Info: extending workspace. /scratch/ws/marie-test-workspace remaining extensions : 1 @@ -122,39 +122,46 @@ remaining time in days: 100 With the `ws_extend` command, a new duration for the workspace is set. The new duration is not added! 
-This means when you extend a workspace that expires in 90 days with the `ws_extend -F scratch -my-workspace 40`, it will now expire in 40 days **not** 130 days. +This means when you extend a workspace that expires in 90 days with the command + +```console +marie@login$ ws_extend -F scratch my-workspace 40 +``` + +it will now expire in 40 days **not** 130 days. ### Deletion of a Workspace To delete a workspace use the `ws_release` command. It is mandatory to specify the name of the workspace and the filesystem in which it is located: -`ws_release -F <filesystem> <workspace name>` +```console +marie@login$ ws_release -F <filesystem> <workspace name> +``` ### Restoring Expired Workspaces At expiration time your workspace will be moved to a special, hidden directory. For a month (in warm_archive: 2 months), you can still restore your data into an existing workspace. -!!!Warning +!!! warning When you release a workspace **by hand**, it will not receive a grace period and be **permanently deleted** the **next day**. The advantage of this design is that you can create and release workspaces inside jobs and not swamp the filesystem with data no one needs anymore in the hidden directories (when workspaces are in the grace period). -Use: +Use -``` -ws_restore -l -F scratch +```console +marie@login$ ws_restore -l -F scratch ``` to get a list of your expired workspaces, and then restore them like that into an existing, active workspace 'new_ws': -``` -ws_restore -F scratch marie-test-workspace-1234567 new_ws +```console +marie@login$ ws_restore -F scratch marie-test-workspace-1234567 new_ws ``` The expired workspace has to be specified by its full name as listed by `ws_restore -l`, including @@ -174,8 +181,8 @@ workspaces within in the directory `DIR`. Calling this command will do the follo - The directory `DIR` will be created if necessary. - Links to all personal workspaces will be managed: - - Create links to all available workspaces if not already present. - - Remove links to released workspaces. + - Create links to all available workspaces if not already present. + - Remove links to released workspaces. **Remark**: An automatic update of the workspace links can be invoked by putting the command `ws_register DIR` in your personal `shell` configuration file (e.g., `.bashrc`). @@ -198,8 +205,9 @@ A batch job needs a directory for temporary data. This can be deleted afterwards #SBATCH --ntasks=1 #SBATCH --cpus-per-task=24 - module load modenv/classic - module load gaussian + module purge + module load modenv/hiera + module load Gaussian COMPUTE_DIR=gaussian_$SLURM_JOB_ID export GAUSS_SCRDIR=$(ws_allocate -F ssd $COMPUTE_DIR 7) @@ -218,8 +226,8 @@ Likewise, other jobs can use temporary workspaces. For a series of jobs or calculations that work on the same data, you should allocate a workspace once, e.g., in `scratch` for 100 days: -``` -zih$ ws_allocate -F scratch my_scratchdata 100 +```console +marie@login$ ws_allocate -F scratch my_scratchdata 100 Info: creating workspace. /scratch/ws/marie-my_scratchdata remaining extensions : 2 @@ -234,8 +242,8 @@ chmod g+wrx /scratch/ws/marie-my_scratchdata And verify it with: -``` -zih $ ls -la /scratch/ws/marie-my_scratchdata +```console +marie@login$ ls -la /scratch/ws/marie-my_scratchdata total 8 drwxrwx--- 2 marie hpcsupport 4096 Jul 10 09:03 . drwxr-xr-x 5 operator adm 4096 Jul 10 09:01 .. 
@@ -247,8 +255,8 @@ For data that seldom changes but consumes a lot of space, the warm archive can b this is mounted read-only on the compute nodes, so you cannot use it as a work directory for your jobs! -``` -zih$ ws_allocate -F warm_archive my_inputdata 365 +```console +marie@login$ ws_allocate -F warm_archive my_inputdata 365 /warm_archive/ws/marie-my_inputdata remaining extensions : 2 remaining time in days: 365 @@ -259,10 +267,10 @@ remaining time in days: 365 The warm archive is not built for billions of files. There is a quota for 100.000 files per group. Please archive data. -To see your active quota use: +To see your active quota use -``` -qinfo quota /warm_archive/ws/ +```console +marie@login$ qinfo quota /warm_archive/ws/ ``` Note that the workspaces reside under the mountpoint `/warm_archive/ws/` and not `/warm_archive` diff --git a/doc.zih.tu-dresden.de/docs/data_transfer/data_mover.md b/doc.zih.tu-dresden.de/docs/data_transfer/data_mover.md deleted file mode 100644 index 856af9f3080969f29ac71c7bc8bf6b8c79c45a60..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/data_transfer/data_mover.md +++ /dev/null @@ -1,85 +0,0 @@ -# Transferring files between HPC systems - -We provide a special data transfer machine providing the global file -systems of each ZIH HPC system. This machine is not accessible through -SSH as it is dedicated to data transfers. To move or copy files from one -file system to another file system you have to use the following -commands: - -- **dtcp**, **dtls, dtmv**, **dtrm, dtrsync**, **dttar** - -These commands submit a job to the data transfer machines performing the -selected command. Except the following options their syntax is the same -than the shell command without **dt** prefix (cp, ls, mv, rm, rsync, -tar). - -Additional options: - -| | | -|-------------------|-------------------------------------------------------------------------------| -| --account=ACCOUNT | Assign data transfer job to specified account. | -| --blocking | Do not return until the data transfer job is complete. (default for **dtls**) | -| --time=TIME | Job time limit (default 18h). | - -- **dtinfo**, **dtqueue**, **dtq**, **dtcancel** - -**dtinfo** shows information about the nodes of the data transfer -machine (like sinfo). **dtqueue** and **dtq** shows all the data -transfer jobs that belong to you (like squeue -u $USER). **dtcancel** -signals data transfer jobs (like scancel). - -To identify the mount points of the different HPC file systems on the -data transfer machine, please use **dtinfo**. It shows an output like -this (attention, the mount points can change without an update on this -web page) : - -| HPC system | Local directory | Directory on data transfer machine | -|:-------------------|:-----------------|:-----------------------------------| -| Taurus, Venus | /scratch/ws | /scratch/ws | -| | /ssd/ws | /ssd/ws | -| | /warm_archive/ws | /warm_archive/ws | -| | /home | /home | -| | /projects | /projects | -| **Archive** | | /archiv | -| **Group Storages** | | /grp/\<group storage> | - -## How to copy your data from an old scratch (Atlas, Triton, Venus) to our new scratch (Taurus) - -You can use our tool called Datamover to copy your data from A to B. 
- - dtcp -r /scratch/<project or user>/<directory> /projects/<project or user>/<directory> # or - dtrsync -a /scratch/<project or user>/<directory> /lustre/ssd/<project or user>/<directory> - -Options for dtrsync: - - -a, --archive archive mode; equals -rlptgoD (no -H,-A,-X) - - -r, --recursive recurse into directories - -l, --links copy symlinks as symlinks - -p, --perms preserve permissions - -t, --times preserve modification times - -g, --group preserve group - -o, --owner preserve owner (super-user only) - -D same as --devices --specials - -Example: - - dtcp -r /scratch/rotscher/results /luste/ssd/rotscher/ # or - new: dtrsync -a /scratch/rotscher/results /home/rotscher/results - -## Examples on how to use data transfer commands: - -Copying data from Taurus' /scratch to Taurus' /projects - - % dtcp -r /scratch/jurenz/results/ /home/jurenz/ - -Moving data from Venus' /sratch to Taurus' /luste/ssd - - % dtmv /scratch/jurenz/results/ /lustre/ssd/jurenz/results - -TGZ data from Taurus' /scratch to the Archive - - % dttar -czf /archiv/jurenz/taurus_results_20140523.tgz /scratch/jurenz/results - -**%RED%Note:<span class="twiki-macro ENDCOLOR"></span>**Please do not -generate files in the archive much larger that 500 GB. diff --git a/doc.zih.tu-dresden.de/docs/data_transfer/datamover.md b/doc.zih.tu-dresden.de/docs/data_transfer/datamover.md new file mode 100644 index 0000000000000000000000000000000000000000..41333949cb352630294ccb3a2ffac7ea65d980e6 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/data_transfer/datamover.md @@ -0,0 +1,69 @@ +# Transferring Files Between ZIH Systems + +With the **datamover**, we provide a special data transfer machine for transferring data with best +transfer speed between the filesystems of ZIH systems. The datamover machine is not accessible +through SSH as it is dedicated to data transfers. To move or copy files from one filesystem to +another filesystem, you have to use the following commands: + +- `dtcp`, `dtls`, `dtmv`, `dtrm`, `dtrsync`, `dttar`, and `dtwget` + +These commands submit a [batch job](../jobs_and_resources/slurm.md) to the data transfer machines +performing the selected command. Except the following options their syntax is the very same as the +well-known shell commands without the prefix *dt*. + +| Additional Option | Description | +|---------------------|-------------------------------------------------------------------------------| +| `--account=ACCOUNT` | Assign data transfer job to specified account. | +| `--blocking ` | Do not return until the data transfer job is complete. (default for `dtls`) | +| `--time=TIME ` | Job time limit (default: 18 h). | + +## Managing Transfer Jobs + +There are the commands `dtinfo`, `dtqueue`, `dtq`, and `dtcancel` to manage your transfer commands +and jobs. + +* `dtinfo` shows information about the nodes of the data transfer machine (like `sinfo`). +* `dtqueue` and `dtq` show all your data transfer jobs (like `squeue -u $USER`). +* `dtcancel` signals data transfer jobs (like `scancel`). + +To identify the mount points of the different filesystems on the data transfer machine, use +`dtinfo`. 
It shows an output like this: + +| ZIH system | Local directory | Directory on data transfer machine | +|:-------------------|:---------------------|:-----------------------------------| +| Taurus | `/scratch/ws` | `/scratch/ws` | +| | `/ssd/ws` | `/ssd/ws` | +| | `/beegfs/global0/ws` | `/beegfs/global0/ws` | +| | `/warm_archive/ws` | `/warm_archive/ws` | +| | `/home` | `/home` | +| | `/projects` | `/projects` | +| **Archive** | | `/archive` | +| **Group storage** | | `/grp/<group storage>` | + +## Usage of Datamover + +!!! example "Copying data from `/beegfs/global0` to `/projects` filesystem." + + ``` console + marie@login$ dtcp -r /beegfs/global0/ws/marie-workdata/results /projects/p_marie/. + ``` + +!!! example "Moving data from `/beegfs/global0` to `/warm_archive` filesystem." + + ``` console + marie@login$ dtmv /beegfs/global0/ws/marie-workdata/results /warm_archive/ws/marie-archive/. + ``` + +!!! example "Archive data from `/beegfs/global0` to `/archiv` filesystem." + + ``` console + marie@login$ dttar -czf /archiv/p_marie/results.tgz /beegfs/global0/ws/marie-workdata/results + ``` + +!!! warning + Do not generate files in the `/archiv` filesystem much larger that 500 GB! + +!!! note + The [warm archive](../data_lifecycle/warm_archive.md) and the `projects` filesystem are not + writable from within batch jobs. + However, you can store the data in the `warm_archive` using the datamover. diff --git a/doc.zih.tu-dresden.de/docs/data_transfer/export_nodes.md b/doc.zih.tu-dresden.de/docs/data_transfer/export_nodes.md index ccd1e87e7c47ad3beae89caef3620d9f32d108ba..d492594b85f4e9d033d273749983e75458068dc1 100644 --- a/doc.zih.tu-dresden.de/docs/data_transfer/export_nodes.md +++ b/doc.zih.tu-dresden.de/docs/data_transfer/export_nodes.md @@ -1,146 +1,165 @@ -# Move data to/from ZIH's File Systems +# Export Nodes: Transfer Data to/from ZIH's Filesystems -## Export Nodes - -To copy large data to/from the HPC machines, the Taurus export nodes should be used. While it is +To copy large data to/from ZIH systems, the so-called **export nodes** should be used. While it is possible to transfer small files directly via the login nodes, they are not intended to be used that -way and there exists a CPU time limit on the login nodes, killing each process that takes up too -much CPU time, which also affects file-copy processes if the copied files are very large. The export -nodes have a better uplink (10GBit/s) and are generally the preferred way to transfer your data. -Note that you cannot log in via ssh to the export nodes, but only use scp, rsync or sftp on them. +way. Furthermore, longer transfers will hit the CPU time limit on the login nodes, i.e. the process +get killed. The **export nodes** have a better uplink (10 GBit/s) allowing for higher bandwidth. Note +that you cannot log in via SSH to the export nodes, but only use `scp`, `rsync` or `sftp` on them. + +The export nodes are reachable under the hostname `taurusexport.hrsk.tu-dresden.de` (or +`taurusexport3.hrsk.tu-dresden.de` and `taurusexport4.hrsk.tu-dresden.de`). + +Please keep in mind that there are different +[filesystems](../data_lifecycle/file_systems.md#recommendations-for-filesystem-usage). Choose the +one that matches your needs. -They are reachable under the hostname: **taurusexport.hrsk.tu-dresden.de** (or -taurusexport3.hrsk.tu-dresden.de, taurusexport4.hrsk.tu-dresden.de). +## Access From Linux -## Access from Linux Machine +There are at least three tools to exchange data between your local workstation and ZIH systems. 
They +are explained in the following section in more detail. -There are three possibilities to exchange data between your local machine (lm) and the hpc machines -(hm), which are explained in the following abstract in more detail. +!!! important + The following explanations require that you have already set up your [SSH configuration + ](../access/ssh_login.md#configuring-default-parameters-for-ssh). ### SCP -Type following commands in the terminal when you are in the directory of -the local machine. +The tool [`scp`](https://www.man7.org/linux/man-pages/man1/scp.1.html) +(OpenSSH secure file copy) copies files between hosts on a network. To copy all files +in a directory, the option `-r` has to be specified. -#### Copy data from lm to hm +??? example "Example: Copy a file from your workstation to ZIH systems" -```Bash -# Copy file -scp <file> <zih-user>@<machine>:<target-location> -# Copy directory -scp -r <directory> <zih-user>@<machine>:<target-location> -``` + ```bash + marie@local$ scp <file> taurusexport:<target-location> -#### Copy data from hm to lm + # Add -r to copy whole directory + marie@local$ scp -r <directory> taurusexport:<target-location> + ``` -```Bash -# Copy file -scp <zih-user>@<machine>:<file> <target-location> -# Copy directory -scp -r <zih-user>@<machine>:<directory> <target-location> -``` + For example, if you want to copy your data file `mydata.csv` to the directory `input` in your + home directory, you would use the following: -Example: + ```console + marie@local$ scp mydata.csv taurusexport:input/ + ``` -```Bash -scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:~/. -``` +??? example "Example: Copy a file from ZIH systems to your workstation" + + ```bash + marie@local$ scp taurusexport:<file> <target-location> + + # Add -r to copy whole directory + marie@local$ scp -r taurusexport:<directory> <target-location> + ``` -Additional information: <http://www.computerhope.com/unix/scp.htm> + For example, if you have a directory named `output` in your home directory on ZIH systems and + you want to copy it to the directory `/tmp` on your workstation, you would use the following: + + ```console + marie@local$ scp -r taurusexport:output /tmp + ``` ### SFTP -Is a virtual command line, which you could access with the following -line: +The tool [`sftp`](https://man7.org/linux/man-pages/man1/sftp.1.html) (OpenSSH secure file transfer) +is a file transfer program, which performs all operations over an encrypted SSH transport. It may +use compression to increase performance. + +`sftp` is basically a virtual command line, which you could access and exit as follows. -```Bash +```console # Enter virtual command line -sftp <zih-user>@<machine> +marie@local$ sftp taurusexport # Exit virtual command line -sftp> exit +sftp> exit # or sftp> <Ctrl+D> ``` -After that you have access to the filesystem on the hpc machine and you -can use the same commands as on your local machine, e.g. ls, cd, pwd and -many more. If you would access to your local machine from this virtual -command line, then you have to put the letter l (local machine) before -the command, e.g. lls, lcd or lpwd. - -#### Copy data from lm to hm - -```Bash -# Copy file -sftp> put <file> -# Copy directory -sftp> put -r <directory> -``` +After that you have access to the filesystem on ZIH systems, you can use the same commands as on +your local workstation, e.g., `ls`, `cd`, `pwd` etc. 
If you would access to your local workstation +from this virtual command line, then you have to prefix the command with the letter `l` +(`l`ocal),e.g., `lls`, `lcd` or `lpwd`. -#### Copy data from hm to lm +??? example "Example: Copy a file from your workstation to ZIH systems" -```Bash -# Copy file -sftp> get <file> -# Copy directory -sftp> get -r <directory> -``` + ```console + marie@local$ sftp taurusexport + # Copy file + sftp> put <file> + # Copy directory + sftp> put -r <directory> + ``` -Example: +??? example "Example: Copy a file from ZIH systems to your local workstation" -```Bash -sftp> get helloworld.txt -``` + ```console + marie@local$ sftp taurusexport + # Copy file + sftp> get <file> + # Copy directory + sftp> get -r <directory> + ``` -Additional information: http://www.computerhope.com/unix/sftp.htm +### Rsync -### RSYNC +[`Rsync`](https://man7.org/linux/man-pages/man1/rsync.1.html), is a fast and extraordinarily +versatile file copying tool. It can copy locally, to/from another host over any remote shell, or +to/from a remote `rsync` daemon. It is famous for its delta-transfer algorithm, which reduces the +amount of data sent over the network by sending only the differences between the source files and +the existing files in the destination. Type following commands in the terminal when you are in the directory of the local machine. -#### Copy data from lm to hm +??? example "Example: Copy a file from your workstation to ZIH systems" -```Bash -# Copy file -rsync <file> <zih-user>@<machine>:<target-location> -# Copy directory -rsync -r <directory> <zih-user>@<machine>:<target-location> -``` + ```console + # Copy file + marie@local$ rsync <file> taurusexport:<target-location> + # Copy directory + marie@local$ rsync -r <directory> taurusexport:<target-location> + ``` -#### Copy data from hm to lm +??? example "Example: Copy a file from ZIH systems to your local workstation" -```Bash -# Copy file -rsync <zih-user>@<machine>:<file> <target-location> -# Copy directory -rsync -r <zih-user>@<machine>:<directory> <target-location> -``` + ```console + # Copy file + marie@local$ rsync taurusexport:<file> <target-location> + # Copy directory + marie@local$ rsync -r taurusexport:<directory> <target-location> + ``` -Example: +## Access From Windows -```Bash -rsync helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:~/. -``` +### Command Line -Additional information: http://www.computerhope.com/unix/rsync.htm +Windows 10 (1809 and higher) comes with a +[built-in OpenSSH support](https://docs.microsoft.com/en-us/windows-server/administration/openssh/openssh_overview) +including the above described [SCP](#SCP) and [SFTP](#SFTP). -## Access from Windows machine +### GUI - Using WinSCP First you have to install [WinSCP](http://winscp.net/eng/download.php). Then you have to execute the WinSCP application and configure some option as described below. -<span class="twiki-macro IMAGE" size="600">WinSCP_001_new.PNG</span> + +{: align="center"} -<span class="twiki-macro IMAGE" size="600">WinSCP_002_new.PNG</span> + +{: align="center"} -<span class="twiki-macro IMAGE" size="600">WinSCP_003_new.PNG</span> + +{: align="center"} -<span class="twiki-macro IMAGE" size="600">WinSCP_004_new.PNG</span> + +{: align="center"} -After your connection succeeded, you can copy files from your local -machine to the hpc machine and the other way around. +After your connection succeeded, you can copy files from your local workstation to ZIH systems and +the other way around. 
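For quick transfers without a GUI, the built-in OpenSSH client mentioned above can also be used directly from PowerShell. A minimal sketch, assuming the same SSH configuration and `taurusexport` host alias as in the Linux examples:

```console
# copy a local file to the input directory below your home on ZIH systems
PS C:\Users\marie> scp .\mydata.csv taurusexport:input/
```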
-<span class="twiki-macro IMAGE" size="600">WinSCP_005_new.PNG</span> + +{: align="center"} diff --git a/Compendium_attachments/ExportNodes/WinSCP_001_new.PNG b/doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_001_new.PNG similarity index 100% rename from Compendium_attachments/ExportNodes/WinSCP_001_new.PNG rename to doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_001_new.PNG diff --git a/Compendium_attachments/ExportNodes/WinSCP_002_new.PNG b/doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_002_new.PNG similarity index 100% rename from Compendium_attachments/ExportNodes/WinSCP_002_new.PNG rename to doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_002_new.PNG diff --git a/Compendium_attachments/ExportNodes/WinSCP_003_new.PNG b/doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_003_new.PNG similarity index 100% rename from Compendium_attachments/ExportNodes/WinSCP_003_new.PNG rename to doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_003_new.PNG diff --git a/Compendium_attachments/ExportNodes/WinSCP_004_new.PNG b/doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_004_new.PNG similarity index 100% rename from Compendium_attachments/ExportNodes/WinSCP_004_new.PNG rename to doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_004_new.PNG diff --git a/Compendium_attachments/ExportNodes/WinSCP_005_new.PNG b/doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_005_new.PNG similarity index 100% rename from Compendium_attachments/ExportNodes/WinSCP_005_new.PNG rename to doc.zih.tu-dresden.de/docs/data_transfer/misc/WinSCP_005_new.PNG diff --git a/doc.zih.tu-dresden.de/docs/data_transfer/overview.md b/doc.zih.tu-dresden.de/docs/data_transfer/overview.md index 3f92972f39b320aef5b824e5a7146a2d25e5a503..c2f4fe1e669b17b4f0cdf21c39e4072d4b80fa5d 100644 --- a/doc.zih.tu-dresden.de/docs/data_transfer/overview.md +++ b/doc.zih.tu-dresden.de/docs/data_transfer/overview.md @@ -1,37 +1,22 @@ # Transfer of Data -## Moving data to/from the HPC Machines - -To copy data to/from the HPC machines, the Taurus export nodes should be used as a preferred way. -There are three possibilities to exchanging data between your local machine (lm) and the HPC -machines (hm): SCP, RSYNC, SFTP. Type following commands in the terminal of the local machine. The -SCP command was used for the following example. Copy data from lm to hm - -```Bash -# Copy file from your local machine. For example: scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mastermann-Macine_learning_project/ -scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> - -scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> #Copy directory from your local machine. -``` - -Copy data from hm to lm - -```Bash -# Copy file. For example: scp mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mastermann-Macine_learning_project/helloworld.txt /home/mustermann/Downloads -scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location> - -scp -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location> #Copy directory -``` - -## Moving data inside the HPC machines: Datamover - -The best way to transfer data inside the Taurus is the datamover. It is the special data transfer -machine provides the best data speed. To load, move, copy etc. files from one file system to another -file system, you have to use commands with dt prefix, such as: dtcp, dtwget, dtmv, dtrm, dtrsync, -dttar, dtls. 
These commands submit a job to the data transfer machines that execute the selected -command. Except for the 'dt' prefix, their syntax is the same as the shell command without the 'dt'. - -Keep in mind: The warm_archive is not writable for jobs. However, you can store the data in the warm -archive with the datamover. - -Useful links: [Data Mover]**todo link**, [Export Nodes]**todo link** +## Moving Data to/from ZIH Systems + +There are at least three tools for exchanging data between your local workstation and ZIH systems: +`scp`, `rsync`, and `sftp`. Please refer to the offline or online man pages of +[scp](https://www.man7.org/linux/man-pages/man1/scp.1.html), +[rsync](https://man7.org/linux/man-pages/man1/rsync.1.html), and +[sftp](https://man7.org/linux/man-pages/man1/sftp.1.html) for detailed information. + +No matter what tool you prefer, it is crucial that the **export nodes** are used as preferred way to +copy data to/from ZIH systems. Please follow the link to the documentation on +[export nodes](export_nodes.md) for further reference and examples. + +## Moving Data Inside ZIH Systems: Datamover + +The recommended way for data transfer inside ZIH Systems is the **datamover**. It is a special +data transfer machine that provides the best transfer speed. To load, move, copy etc. files from one +filesystem to another filesystem, you have to use commands prefixed with `dt`: `dtcp`, `dtwget`, +`dtmv`, `dtrm`, `dtrsync`, `dttar`, `dtls`. These commands submit a job to the data transfer +machines that execute the selected command. Please refer to the detailed documentation regarding the +[datamover](datamover.md). diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md index cc174e052a72bf6258ce4844749690ae28d7a46c..60f6f081cf4a1c2ea76663bccd65e9ff866597fb 100644 --- a/doc.zih.tu-dresden.de/docs/index.md +++ b/doc.zih.tu-dresden.de/docs/index.md @@ -1,48 +1,29 @@ -# ZIH HPC Compendium +# ZIH HPC Documentation -Dear HPC users, +This is the documentation of the HPC systems and services provided at +[TU Dresden/ZIH](https://tu-dresden.de/zih/). This documentation is work in progress, since we try +to incorporate more information with increasing experience and with every question you ask us. The +HPC team invites you to take part in the improvement of these pages by correcting or adding useful +information. -due to restrictions coming from data security and software incompatibilities the old -"HPC Compendium" is now reachable only from inside TU Dresden campus (or via VPN). +## Contribution -Internal users should be redirected automatically. +Issues concerning this documentation can reported via the GitLab +[issue tracking system](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/issues). +Please check for any already existing issue before submitting your issue in order to avoid duplicate +issues. -We apologize for this severe action, but we are in the middle of the preparation for a wiki -relaunch, so we do not want to redirect resources to fix technical/security issues for a system -that will last only a few weeks. +Contributions from user-side are highly welcome. Please find out more in our [guidelines how to contribute](contrib/howto_contribute.md). -Thank you for your understanding, +**Reminder:** Non-documentation issues and requests need to be send as ticket to +[hpcsupport@zih.tu-dresden.de](mailto:hpcsupport@zih.tu-dresden.de). -your HPC Support Team ZIH +--- -## What is new? 
+--- -The desire for a new technical documentation is driven by two major aspects: +## News -1. Clear and user-oriented structure of the content -1. Usage of modern tools for technical documentation +**2021-10-05** Offline-maintenance (black building test) -The HPC Compendium provided knowledge and help for many years. It grew with every new hardware -installation and ZIH stuff tried its best to keep it up to date. But, to be honest, it has become -quite messy, and housekeeping it was a nightmare. - -The new structure is designed with the schedule for an HPC project in mind. This will ease the start -for new HPC users, as well speedup searching information w.r.t. a specific topic for advanced users. - -We decided against a classical wiki software. Instead, we write the documentation in markdown and -make use of the static site generator [mkdocs](https://www.mkdocs.org/) to create static html files -from this markdown files. All configuration, layout and content files are managed within a git -repository. The generated static html files, i.e, the documentation you are now reading, is deployed -to a web server. - -The workflow is flexible, allows a high level of automation, and is quite easy to maintain. - -From a technical point, our new documentation system is highly inspired by -[OLFC User Documentation](https://docs.olcf.ornl.gov/) as well as -[NERSC Technical Documentation](https://nersc.gitlab.io/). - -## Contribute - -Contributions are highly welcome. Please refere to -[README.md](https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium/-/blob/main/doc.zih.tu-dresden.de/README.md) -file of this project. +**2021-09-29** Introduction to HPC at ZIH ([HPC introduction slides](misc/HPC-Introduction.pdf)) diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md index 5324f550e30e66b6ec6830cf7fddbb921b0dbdbf..c2e1bac98c1aeaad9910a98d5b6282df4f3160d7 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md @@ -1,13 +1,14 @@ -# Alpha Centauri - Multi-GPU sub-cluster +# Alpha Centauri - Multi-GPU Sub-Cluster -The sub-cluster "AlphaCentauri" had been installed for AI-related computations (ScaDS.AI). +The sub-cluster "Alpha Centauri" had been installed for AI-related computations (ScaDS.AI). It has 34 nodes, each with: -- 8 x NVIDIA A100-SXM4 (40 GB RAM) -- 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz with multithreading enabled -- 1 TB RAM 3.5 TB `/tmp` local NVMe device -- Hostnames: `taurusi[8001-8034]` -- Slurm partition `alpha` for batch jobs and `alpha-interactive` for interactive jobs +* 8 x NVIDIA A100-SXM4 (40 GB RAM) +* 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz with multi-threading enabled +* 1 TB RAM +* 3.5 TB `/tmp` local NVMe device +* Hostnames: `taurusi[8001-8034]` +* Slurm partition `alpha` for batch jobs and `alpha-interactive` for interactive jobs !!! note @@ -19,12 +20,12 @@ It has 34 nodes, each with: ### Modules The easiest way is using the [module system](../software/modules.md). -The software for the `alpha` partition is available in `modenv/hiera` module environment. +The software for the partition alpha is available in `modenv/hiera` module environment. 
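If your shell session does not already use the hierarchical module environment, you can switch to it first, analogous to the Gaussian batch example shown earlier. A minimal sketch:

```console
marie@alpha$ module purge                 # start from a clean module set
marie@alpha$ module load modenv/hiera     # activate the hierarchical module environment
```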
To check the available modules for `modenv/hiera`, use the command -```bash -module spider <module_name> +```console +marie@alpha$ module spider <module_name> ``` For example, to check whether PyTorch is available in version 1.7.1: @@ -64,7 +65,8 @@ True ### Python Virtual Environments -Virtual environments allow users to install additional python packages and create an isolated +[Virtual environments](../software/python_virtual_environments.md) allow users to install +additional python packages and create an isolated runtime environment. We recommend using `virtualenv` for this purpose. ```console @@ -95,20 +97,20 @@ Successfully installed torchvision-0.10.0 ### JupyterHub -[JupyterHub](../access/jupyterhub.md) can be used to run Jupyter notebooks on AlphaCentauri +[JupyterHub](../access/jupyterhub.md) can be used to run Jupyter notebooks on Alpha Centauri sub-cluster. As a starting configuration, a "GPU (NVIDIA Ampere A100)" preset can be used in the advanced form. In order to use latest software, it is recommended to choose `fosscuda-2020b` as a standard environment. Already installed modules from `modenv/hiera` -can be pre-loaded in "Preload modules (modules load):" field. +can be preloaded in "Preload modules (modules load):" field. ### Containers Singularity containers enable users to have full control of their software environment. -Detailed information about containers can be found [here](../software/containers.md). +For more information, see the [Singularity container details](../software/containers.md). Nvidia [NGC](https://developer.nvidia.com/blog/how-to-run-ngc-deep-learning-containers-with-singularity/) containers can be used as an effective solution for machine learning related tasks. (Downloading -containers requires registration). Nvidia-prepared containers with software solutions for specific +containers requires registration). Nvidia-prepared containers with software solutions for specific scientific problems can simplify the deployment of deep learning workloads on HPC. NGC containers have shown consistent performance compared to directly run code. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/batch_systems.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/batch_systems.md deleted file mode 100644 index 06e9be7e7a8ab5efa0ae1272ba6159ac50310e0b..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/batch_systems.md +++ /dev/null @@ -1,56 +0,0 @@ -# Batch Systems - -Applications on an HPC system can not be run on the login node. They have to be submitted to compute -nodes with dedicated resources for user jobs. Normally a job can be submitted with these data: - -- number of CPU cores, -- requested CPU cores have to belong on one node (OpenMP programs) or - can distributed (MPI), -- memory per process, -- maximum wall clock time (after reaching this limit the process is - killed automatically), -- files for redirection of output and error messages, -- executable and command line parameters. - -Depending on the batch system the syntax differs slightly: - -- [Slurm](../jobs_and_resources/slurm.md) (taurus, venus) - -If you are confused by the different batch systems, you may want to enjoy this [batch system -commands translation table](http://slurm.schedmd.com/rosetta.pdf). - -**Comment:** Please keep in mind that for a large runtime a computation may not reach its end. Try -to create shorter runs (4...8 hours) and use checkpointing. 
Here is an extreme example from -literature for the waste of large computing resources due to missing checkpoints: - -*Earth was a supercomputer constructed to find the question to the answer to the Life, the Universe, -and Everything by a race of hyper-intelligent pan-dimensional beings. Unfortunately 10 million years -later, and five minutes before the program had run to completion, the Earth was destroyed by -Vogons.* (Adams, D. The Hitchhikers Guide Through the Galaxy) - -## Exclusive Reservation of Hardware - -If you need for some special reasons, e.g., for benchmarking, a project or paper deadline, parts of -our machines exclusively, we offer the opportunity to request and reserve these parts for your -project. - -Please send your request **7 working days** before the reservation should start (as that's our -maximum time limit for jobs and it is therefore not guaranteed that resources are available on -shorter notice) with the following information to the [HPC -support](mailto:hpcsupport@zih.tu-dresden.de?subject=Request%20for%20a%20exclusive%20reservation%20of%20hardware&body=Dear%20HPC%20support%2C%0A%0AI%20have%20the%20following%20request%20for%20a%20exclusive%20reservation%20of%20hardware%3A%0A%0AProject%3A%0AReservation%20owner%3A%0ASystem%3A%0AHardware%20requirements%3A%0ATime%20window%3A%20%3C%5Byear%5D%3Amonth%3Aday%3Ahour%3Aminute%20-%20%5Byear%5D%3Amonth%3Aday%3Ahour%3Aminute%3E%0AReason%3A): - -- `Project:` *\<Which project will be credited for the reservation?>* -- `Reservation owner:` *\<Who should be able to run jobs on the - reservation? I.e., name of an individual user or a group of users - within the specified project.>* -- `System:` *\<Which machine should be used?>* -- `Hardware requirements:` *\<How many nodes and cores do you need? Do - you have special requirements, e.g., minimum on main memory, - equipped with a graphic card, special placement within the network - topology?>* -- `Time window:` *\<Begin and end of the reservation in the form - year:month:dayThour:minute:second e.g.: 2020-05-21T09:00:00>* -- `Reason:` *\<Reason for the reservation.>* - -**Please note** that your project CPU hour budget will be credited for the reserved hardware even if -you don't use it. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/binding_and_distribution_of_tasks.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/binding_and_distribution_of_tasks.md index 4e8bde8c6e43ab765135f3199525a09820abf8d1..ad411b78cff4c8c4fcd06c4028cb14b6c6438c4f 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/binding_and_distribution_of_tasks.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/binding_and_distribution_of_tasks.md @@ -1,67 +1,103 @@ # Binding and Distribution of Tasks +Slurm provides several binding strategies to place and bind the tasks and/or threads of your job +to cores, sockets and nodes. + +!!! note + + Keep in mind that the distribution method might have a direct impact on the execution time of + your application. The manipulation of the distribution can either speed up or slow down your + application. + ## General -To specify a pattern the commands `--cpu_bind=<cores|sockets>` and -`--distribution=<block | cyclic>` are needed. cpu_bind defines the resolution in which the tasks -will be allocated. While --distribution determinates the order in which the tasks will be allocated -to the cpus. Keep in mind that the allocation pattern also depends on your specification. 
+To specify a pattern the commands `--cpu_bind=<cores|sockets>` and `--distribution=<block|cyclic>` +are needed. The option `cpu_bind` defines the resolution in which the tasks will be allocated. While +`--distribution` determinate the order in which the tasks will be allocated to the CPUs. Keep in +mind that the allocation pattern also depends on your specification. -```Bash -#!/bin/bash -#SBATCH --nodes=2 # request 2 nodes -#SBATCH --cpus-per-task=4 # use 4 cores per task -#SBATCH --tasks-per-node=4 # allocate 4 tasks per node - 2 per socket +!!! example "Explicitly specify binding and distribution" -srun --ntasks 8 --cpus-per-task 4 --cpu_bind=cores --distribution=block:block ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 # request 2 nodes + #SBATCH --cpus-per-task=4 # use 4 cores per task + #SBATCH --tasks-per-node=4 # allocate 4 tasks per node - 2 per socket + + srun --ntasks 8 --cpus-per-task 4 --cpu_bind=cores --distribution=block:block ./application + ``` In the following sections there are some selected examples of the combinations between `--cpu_bind` and `--distribution` for different job types. +## OpenMP Strategies + +The illustration below shows the default binding of a pure OpenMP job on a single node with 16 CPUs +on which 16 threads are allocated. + + +{: align=center} + +!!! example "Default binding and default distribution" + + ```bash + #!/bin/bash + #SBATCH --nodes=1 + #SBATCH --tasks-per-node=1 + #SBATCH --cpus-per-task=16 + + export OMP_NUM_THREADS=16 + + srun --ntasks 1 --cpus-per-task $OMP_NUM_THREADS ./application + ``` + ## MPI Strategies -### Default Binding and Dsitribution Pattern +### Default Binding and Distribution Pattern -The default binding uses --cpu_bind=cores in combination with --distribution=block:cyclic. The -default (as well as block:cyclic) allocation method will fill up one node after another, while +The default binding uses `--cpu_bind=cores` in combination with `--distribution=block:cyclic`. The +default (as well as `block:cyclic`) allocation method will fill up one node after another, while filling socket one and two in alternation. Resulting in only even ranks on the first socket of each node and odd on each second socket of each node. 
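If you want to verify where your tasks actually end up, the chosen binding can be reported at launch time. A minimal sketch, combining the standard `verbose` flag of `--cpu_bind` with the options introduced above:

```console
marie@login$ srun --ntasks 8 --cpus-per-task 4 --cpu_bind=verbose,cores --distribution=block:block ./application
```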
-\<img alt=""
-src="data:;base64,..."
+++io1NbW1tbWhoUF7qbq6OjIyUntTtq2trba21j18165dCQkJ9913X1tbm6IoaWlpBoPhzJkz1f9z7dq17OxsRVHCwkZQVoeHIJkbmzdv3rdv39GjR6dMmTIIV4lQEiRzEkEoSObGiKpXw3yR1NXVPf3006dPn25tbW1qanrxxRe7urpmz56dmZk5d+7cwsJCu91usViKiopWr14dFhaWkZGxePHiRx55pLa2VlXVs2fPOqdaZGTkgQMHYmJi7r777paWFm3PtWvXajs0NDTs379f2zM5OfncuXP9jqenp8fhcHR1dSmK4nA4Ojo6ApIG9CPY5kZBQcGBAwcOHTpkMpkcDsew/8e3cBdsc5J6FTyCbW6MuHo1tA+1BpvNZlu3bt306dPHjh0bGxs7f/78jz76SHvp8uXLubm5JpNp4sSJ+fn5drtd297U1LRu3bqUlJTo6Ohbbrnl3Llzqsu/Guju7v7tb3972223NTU1Wa3WgoKCqVOnGo1Gs9n85JNPakc4cuTI9OnTY2Nj8/Ly+ozn9ddfd02+64PTIKTn/jI3fJob165d67MwMzIyApcL3wX//XUX/GMOqjmpUq+CSVDNjRFYrwzOo/jBYDBop/f7CAhmeu4vc2N4C8X7G4pjhjzqFUT0399h/gYcAACAHrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQuH6D2EwGPQfBMMScwPBhjkJEeYGRHiqBAAAIGRQVXWoxwAAABCkeKoEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgpOvbuvlu05HAv2/eYm6MBKH1rWzMyZGAegURPfWKp0oAAABCA/B/wIXWb5aQp/83LebGcBW6v4UzJ4cr6hVE9M8NnioBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAIDdtW6cSJE/fee++ECROioqJuvPHGoqKi1tbWAJy3u7u7oKBgwoQJMTExK1eubG5u7nc3o9FocBEZGdnR0RGA4Y1YQzUfLBbLsmXLTCZTbGzsXXfdde7cuX53Ky4uzsnJiYqKSk5Odt2+Zs0a13lSWloagDEj8KhXcEW9CjbDs1X65z//uWjRoptvvvmzzz6rr6/ft29ffX39mTNnZGJVVe3q6vL71M8///yhQ4dOnjz5/fffX7x4cf369f3uZrFYWv4nNzf3gQceiIyM9Puk8GwI50N+fr7Vaj1//vyVK1cmTpy4dOnSfnczmUwbNmzYtm2b+0uFhYXOqbJkyRK/R4KgRb2CK+pVMFJ10H+EwdDT05OamlpYWNhne29vr6qqV69eXbJkSUJCQkpKymOPPdba2qq9mpWVVVRUdPvtt2dmZpaXl9tstvXr16empppMpuXLlzc0NGi7vfrqq1OmTBk/fvzEiRO3b9/ufvbExMS33npL+3N5eXl4ePi1a9c8jLahoSEyMvLw4cM6r3ow6Lm/wTM3hnY+ZGRk7N27V/tzeXl5WFhYd3e3aKglJSVJSUmuW1avXv3MM8/4e+mDKHjur7zgHDP1aqBQr6hXIgPQ7Qzt6QeD1n2fPn2631fnzZu3YsWK5ubm2traefPmPfroo9r2rKysG264obGxUfvxV7/61QMPPNDQ0NDW1vbII4/ce++9qqqeO3fOaDReuHBBVVWr1frf//63z8Fra2tdT609zT5x4oSH0e7cuXP69Ok6LncQDY/SM4TzQVXVTZs2LVq0yGKx2Gy2hx56KDc318NQ+y09EydOTE1NnTVr1ssvv9zZ2el7AgZF8NxfecE5ZurVQKFeUa9EaJX68cknnyiKUl9f7/5SRUWF60tlZWVjxozp6elRVTUrK+tPf/qTtr2qqspgMDh3s9lsBoPBarVWVlaOHTv2vffea25u7vfU58+fVxSlqqrKuSUsLOzjjz/2MNrMzMydO3f6fpWBMDxKzxDOB23nhQsXatm47rrrampqPAzVvfQcOnTo008/vXDhwv79+1NSUtx/1xwqwXN/5QXnmKlXA4V6pW2nXrnTf3+H4WeVEhISFEW5cuWK+0uXL1+OiorSdlAUxWw2OxyOxsZG7cdJkyZpf6iurjYYDLNnz546derUqVNvuumm8ePHX7lyxWw2FxcX//nPf05OTv7pT3969OjRPsePjo5WFMVms2k/trS09Pb2xsTEvP32285PurnuX15eXl1dvWbNmoG6drgbwvmgquqdd95pNpubmprsdvuyZctuv/321tZW0Xxwt3jx4nnz5k2bNi0vL+/ll1/et2+fnlQgCFGv4Ip6FaSGtlMbDNp7vU899VSf7b29vX268vLy8sjISGdXfvDgQW37999/P2rUKKvVKjpFW1vbH//4x7i4OO39Y1eJiYl/+9vftD8fOXLE83v/y5cvf/DBB327vADSc3+DZ24M4XxoaGhQ3N7g+Pzzz0XHcf8tzdV77703YcIET5caQMFzf+UF55ipVwOFeqVtp165G4BuZ2hPP0j+8Y9/jBkz5rnnnqusrHQ4HGfPns3Pzz9x4kRvb+/cuXMfeuihlpaWurq6+fPnP/LII1qI61RTVfXuu+9esmTJ1atXVVWtr69///33VVX97rvvysrKHA6HqqpvvPFGYmKie+kpKirKysqqqqqyWCwLFixYsWKFaJD19fURERHB+QFJzfAoPeqQzocpU6asW7fOZrO1t7e/8MILRqOxqanJfYTd3d3t7e3FxcVJSUnt7e3aMXt6evbu3VtdXW21Wo8cOZKRkeH8aMKQC6r7Kylox0y9GhDUK+cRqFd90CoJHT9+/O67746NjR03btyNN9744osvav9Y4PLly7m5uSaTaeLEifn5+Xa7Xdu/z1SzWq0FBQVTp041Go1ms/nJJ59UVfXUqVO33XZbTExMXFzcnDlzjh075n7ezs7OJ554IjY21mg0rlixwmaziUa4Y8eOoP2ApGbYlB516ObDmTNnFi9eHBcXFxMTM2/ePNHfNK+//rrrs96oqChVVXt6eu688874+PiIiAiz2fzss8+2tbUNeGb8E2z3V0Ywj5l6pR/1yhlOvepD//01OI/iB+2dSz1HQDDTc3+ZG8NbKN7fUBwz5FGvIKL//g7Dj3UDAAAMFFolAAAAIVolAAAAIVolAAAAIVolAAAAIVolAAAAIVolAAAAIVolAAAAIVolAAAAoXD9h/D6vw1jxGJuINgwJyHC3IAIT5UAAACEdP0fcAAAAMMbT5UAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEdH1bN99tOhL4981bzI2RILS+lY05ORJQryCip17xVAkAAEBoAP4POD1dPLHBH6tHKF4vsfKxoSgU80ysfKweoXi9xMrH6sFTJQAAACFaJQAAACFaJQAAAKFBaZW6u7sLCgomTJgQExOzcuXK5uZm+diNGzdmZ2ePGzcuLS1t06ZNnZ2dfpx95syZB
oOhrq7Op8B///vfc+bMGTNmTEJCwqZNm+QDLRbLsmXLTCZTbGzsXXfdde7cOc/7FxcX5+TkREVFJScn9xm517yJYmXyJop1nt2/vPnE8xg8KyoqSk9Pj4yMjI+Pv++++77//nv52DVr1hhclJaWyscajUbX2MjIyI6ODsnYy5cv5+XlxcfHT5gw4fe//73XQFF+ZPIm2kcmb6JYPXkLZh7y6bUOiGJl6oBoncqsfVGszNr3vI/nte8h1muuRLEyuRLNWz1/v8gQ3V+ZOiCKlakDolzJrH1RrMzaF8XKrH1RrEyuRLEyuRJdl56/X7xQdRAdoaioKDMzs7Ky0mKxzJ8/f8WKFfKxa9euPXbsWGNj44kTJyZPnrx582b5WM327dsXLVqkKEptba18bFlZmdFo/Otf/1pXV1dTU3Ps2DH52AceeOAXv/jFjz/+aLfbV69efeONN3qO/eijj959990dO3YkJSW57iPKm0ysKG8ysRr3vOmZIaJYz2PwHPv5559XVlY2NzdXVVXdf//9OTk58rGrV68uLCxs+Z+uri75WLvd7gzMzc1dvny5fOxtt9324IMP2my2q1evzp0798knn/QcK8qPaLtMrChvMrGivOmvHoEnc72iOiATK6oDrrGidSqz9kWxMmvfc131vPZFsTK5EsXK5Eo0b2Vy5SuZ+yuqAzKxojogkyuZtS+KlVn7oliZtS+KlcmVKFYmV6LrksmVfwalVUpMTHzrrbe0P5eXl4eHh1+7dk0y1tWWLVsWLFggf15VVb/55puMjIwvvvhC8bFVysnJeeaZZzyPRxSbkZGxd+9e7c/l5eVhYWHd3d1eY0tKSvrcTlHeZGJdueZNMrbfvA1U6XHnefxez9vZ2Zmfn3/PPffIx65evdrv++vU0NAQGRl5+PBhydgrV64oilJRUaH9ePDgQaPR2NHR4TVWlB/37T7NjT55k4kV5U1/6Qk8mesV1QGZWFEdEOXKdZ3Kr333WNF2yVif1r5rrHyu3GN9ylWfeetrrmT4tI761AGvsR7qgPz9lVn7olhVYu27x/q69vs9r9dc9Yn1NVf9/l0gnyt5A/8GXF1dXX19/cyZM7UfZ82a1d3d/e233/pxqOPHj8+aNUt+/56ent/97nevvfZadHS0TydyOByff/55T0/PddddFxcXt2jRoq+++ko+PC8vr6SkpL6+vrm5+c033/zNb34zatQonwaghGbeAq+4uDg5OTk6Ovrrr7/++9//7mvs5MmTb7311h07dnR1dflx9rfffjstLe3nP/+55P7OJepkt9t9et9woAxt3kJFgOuAc536sfZFa1xm7bvu4+vad8b6kSvX80rmyn3eDmCd9FsA6oCvNdxDrE9r3z1Wfu33O2bJXDlj5XOlp6b5Q0+f1e8Rzp8/ryhKVVXV/2/HwsI+/vhjmVhXW7ZsSU9Pb2xslDyvqqo7d+5cunSpqqrfffed4stTpdraWkVR0tPTz549a7fbN2zYkJKSYrfbJc9rs9kWLlyovXrdddfV1NTInLdP5+shb15jXfXJm0ysKG96ZojnWL+fKrW1tV29evXYsWMzZ85cu3atfOyhQ4c+/fTTCxcu7N+/PyUlpbCw0Ncxq6qamZm5c+dOn8Z86623Oh8mz5s3T1GUzz77zGvsgD9V6jdvMrGivOmvHoHn9Xo91AGZXInqQL+5cl2nPq19VVwbva599318WvuusT7lyv28krlyn7e+5kqSTzW2Tx2QiRXVAfn7K/mkxD1Wcu27x/q09kVz0muu3GMlc+Xh74LBeKo08K2StoROnz6t/ah95u7EiRMysU7PP/+82Wyurq6WP++FCxcmTZpUV1en+t4qtbS0KIqyY8cO7cf29vZRo0YdPXpUJra3t3f27NkPP/xwU1OT3W7funVrWlqaTJvVb5nuN2/yy9g9b15jPeRtYEuPzPjlz3vs2DGDwdDa2upH7L59+xITE3097+HDhyMiIhoaGnwa88WLF3Nzc5OSktLT07du3aooyvnz573GDtIbcOr/zZuvsa550196As/r9XqoA15jPdQB99g+69SntS+qjTJrv88+Pq39PrE+5apPrE+50jjnrU+5kie/FtzrgEysqA7I31+Zte/5703Pa99zrOe1L4qVyZV7rHyu3K9LExpvwCUnJycmJn755Zfaj6dOnQoPD8/OzpY/wubNm/ft23f06NEpU6bIRx0/fryxsfH66683mUxaK3r99de/+eabMrFGo3HatGnOL/T06Zs9f/zxx//85z8FBQVxcXFRUVFPPfVUTU3N2bNn5Y+gCcW8Da1Ro0b58UanoigRERHd3d2+Ru3Zsyc3N9dkMvkUlZaW9sEHH9TV1VVVVaWmpqakpEybNs3XUw+sAOcthASmDrivU/m1L1rjMmvffR/5te8eK58r91j/aqY2b/XXSZ0GtQ74V8PlY0Vr32ush7XvIdZrrvqN9aNm+l3TfKCnzxIdoaioKCsrq6qqymKxLFiwwKd/AffEE09Mnz69qqqqvb29vb3d/TOwotjW1tZL/3PkyBFFUU6dOiX/Jtqrr75qNpvPnTvX3t7+hz/8YfLkyfJPLKZMmbJu3Tqbzdbe3v7CCy8YjcampiYPsd3d3e3t7cXFxUlJSe3t7Q6HQ9suyptMrChvXmM95E3PDBHFisbvNbazs/PFF1+sqKiwWq1ffPHFrbfempeXJxnb09Ozd+/e6upqq9V65MiRjIyMRx99VH7MqqrW19dHRET0+4Fuz7EnT5784YcfGhsbDxw4kJCQ8Pbbb3uOFeVHtN1rrIe8eY31kDf91SPwZPIsqgMysaI64BorWqcya18UK7P2+91Hcu2Lji+TK1Gs11x5mLcyuRqMuaEK6oBMrKgOyORKZu33Gyu59vuNlVz7Hv6+9porUazXXHm4Lplc+WdQWqXOzs4nnngiNjbWaDSuWLHCZrNJxl67dk35vzIyMuTP6+TrG3Cqqvb29m7ZsiUpKSkmJuaOO+74+uuv5WPPnDmzePHiuLi4mJiYefPmef0XUq+//rrrNUZFRWnbRXnzGushbzLnFeVNz/QSxXodgyi2q6vrvvvuS0pKioiImDp16saNG+XnVU9Pz5133hkfHx8REWE2m5999tm2tjb5MauqumPHjunTp/f7kufYXbt2JSYmjh49Ojs7u7i42GusKD+i7V5jPeTNa6yHvOmZG0NFJs+iOiATK6oDzlgP69Tr2hfFyqx9mboqWvseYr3mykOs11x5mLcyddJXMvdXFdQBmVhRHZDJlde1L4qVWfuiWJm173leec6Vh1ivufJwXTJ10j8G51H8ELr/bR6xxBI7VLFDJRRzRSyxxA5trIb/2AQAAECIVgkAAECIVgkAAECIVgkAAEBoAD7WjeFNz8foMLyF4se6MbxRryDCx7oBAAAGha6nSgAAAMMbT5UAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUA
-/>
+
+{: align="center"}
 
-```Bash
-#!/bin/bash
-#SBATCH --nodes=2
-#SBATCH --tasks-per-node=16
-#SBATCH --cpus-per-task=1
+!!! example "Default binding and default distribution"
 
-srun --ntasks 32 ./application
-```
+    ```bash
+    #!/bin/bash
+    #SBATCH --nodes=2
+    #SBATCH --tasks-per-node=16
+    #SBATCH --cpus-per-task=1
+
+    srun --ntasks 32 ./application
+    ```
 
 ### Core Bound
 
-Note: With this command the tasks will be bound to a core for the entire runtime of your
-application.
+!!! note
+
+    With this command the tasks will be bound to a core for the entire runtime of your
+    application.
 
 #### Distribution: block:block
 
 This method allocates the tasks linearly to the cores.
 
-\<img alt=""
-src="data:;base64,...
167l5eWlpKRERkbOmDGjvLxcVdXMzMyDBw9qjy5YsGDVqlUdHR02m239+vXJyckmk2n58uX19fWqqj722GOjR482mUypqamrVq3qc1bFxcUJCQkDdc79J5DrS274lxuaLVu2zJ8/v//Puf/o//q60/+c9ZmT1Cs90GduaEZCvRrmd5UmTZqUkZGxfv36d955p7q62vWh3Nzc0aNHV1RUnDx58tSpUwUFBdr2FStWXLx48dNPP7VarW+//XZkZKQz5OLFi/Pmzbv99tvffvvt0aNHr1y50mKxnD59urq6evz48WvWrFEUZdeuXVlZWbt27aqqqnr77bcH8VzhGz3nxvHjx2fOnNn/5wx903NOYmjpOTdGRL0a2k5tEFgsls2bN99yyy2hoaFTp04tLi5WVbW8vFxRlLq6Om2f0tLSMWPGdHd3V1RUKIpy5cqVXgfJzMx89tlnk5OTd+/erW2prKw0GAzOI9hsNoPBYLVaVVW9+eabtVFE+C1NJ3SYG6qqbtmyJS0traGhoR/PtN8FxfXtJSjmrMOcpF7phA5zQx0x9Wr4t0pOzc3Nr7zySkhIyJdffvnxxx9HREQ4H/r+++8VRbFYLKWlpePGjXOPzczMTEhImD17tsPh0LYcPnw4JCQk1UV0dPTXX3+tUnoCjh18+smN5557zmw2V1VV9ev59b/gur6a4JqzfnKSeqU3+smNkVOvhvkLcK6MRmNBQcGYMWO+/PLL5OTklpaW+vp67aGqqqrw8HDtRdnW1taamhr38J07d8bFxd17772tra2KoqSkpBgMhjNnzlT96Nq1a1lZWYqihISMoGd1eNBJbmzevHnfvn1Hjx5NTU0dgLNEMNFJTkKHdJIbI6peDfNFUltb+9RTT50+fbqlpaWxsfHFF1/s7OycNWtWRkbGnDlzCgoK7Ha7xWIpLCxcvXp1SEhIenr6okWLHn744ZqaGlVVz54960y18PDwAwcOREVF3XXXXc3Nzdqea9eu1Xaor6/fv3+/tmdiYuK5c+f6nE93d7fD4ejs7FQUxeFwtLe3D8rTgD7oLTfy8/MPHDhw6NAhk8nkcDiG/R/fwp3ecpJ6pR96y40RV6+G9qbWQLPZbOvWrZs2bdrYsWOjo6PnzZv34Ycfag9dvnw5JyfHZDJNnDgxLy/Pbrdr2xsbG9etW5eUlBQZGXnLLbecO3dOdfmrga6urt/+9re33XZbY2Oj1WrNz8+fMmWK0Wg0m81PPPGEdoQjR45MmzYtOjo6Nze313xef/111yff9capDgVyfckNn3Lj2rVrvRZmenr64D0XvtP/9XWn/znrKidV6pWe6Co3RmC9MjiP4geDwaAN7/cRoGeBXF9yY3gLxusbjHOGPOoVRAK/vsP8BTgAAIBA0CoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAIhQZ+CIPBEPhBMCyRG9AbchIi5AZEuKsEAAAgZFBVdajnAAAAoFPcVQIAABCiVQIAABCiVQIAABCiVQIAABCiVQIAABCiVQIAABCiVQIAABAK6NO6+WzTkcC/T94iN0aC4PpUNnJyJKBeQSSQesVdJQAAAKF++B9wwfWbJeQF/psWuTFcBe9v4eTkcEW9gkjgucFdJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAAKFh2yqdOHHinnvumTBhQkRExI033lhYWNjS0jII43Z1deXn50+YMCEqKmrlypVNTU197mY0Gg0uwsPD29vbB2F6I9ZQ5YPFYlm2bJnJZIqOjl68ePG5c+f63K2oqCg7OzsiIiIxMdF1+5o1a1zzpKSkZBDmjMFHvYIr6pXeDM9W6Z///OfChQtvvvnmTz/9tK6ubt++fXV1dWfOnJGJVVW1s7PT76Gfe+65Q4cOnTx58rvvvrt48eL69ev73M1isTT/KCcn5/777w8PD/d7UHg2hPmQl5dntVrPnz9/5cqViRMnLl26tM/dTCbThg0btm3b5v5QQUGBM1WWLFni90ygW9QruKJe6ZEagMCPMBC6u7uTk5MLCgp6be/p6VFV9erVq0uWLImLi0tKSnr00UdbWlq0RzMzMwsLC2+//faMjIyysjKbzbZ+/frk5GSTybR8+fL6+nptt1dffTU1NXX8+PETJ0584YUX3EePj49/8803ta/LyspCQ0OvXbvmYbb19fXh4eGHDx8O8KwHQiDXVz+5MbT5kJ6evnfvXu3rsrKykJCQrq4u0VSLi4sTEhJct6xevfrpp5/299QHkH6urzx9zpl61V+oV9QrkX7odoZ2+IGgdd+nT5/u89G5c+euWLGiqamppqZm7ty5jzzyiLY9MzPzhhtuaGho0L791a9+df/999fX17e2tj788MP33HOPqqrnzp0zGo0XLlxQVdVqtf73v//tdfCamhrXobW72SdOnPAw2x07dkybNi2A0x1Aw6P0DGE+qKq6adOmhQsXWiwWm8324IMP5uTkeJhqn6Vn4sSJycnJM2fOfPnllzs6Onx/AgaEfq6vPH3OmXrVX6hX1CsRWqU+fPzxx4qi1NXVuT9UXl7u+lBpaemYMWO6u7tVVc3MzPzTn/6kba+srDQYDM7dbDabwWCwWq0VFRVjx4599913m5qa+hz6/PnziqJUVlY6t4SEhHz00UceZpuRkbFjxw7fz3IwDI/SM4T5oO28YMEC7dm47rrrqqurPUzVvfQcOnTok08+uXDhwv79+5OSktx/1xwq+rm+8vQ5Z+pVf6FeadupV+4Cv77D8L1KcXFxiqJcuXLF/aHLly9HRERoOyiKYjabHQ5HQ0OD9u2kSZO0L6qqqgwGw6xZs6ZMmTJlypSbbrpp/PjxV65cMZvNRUVFf/7znxMTE3/6058ePXq01/EjIyMVRbHZbNq3zc3NPT09UVFRb731lvOdbq77l5WVVVVVrVmzpr/OHe6GMB9UVb3zzjvNZnNjY6Pdbl+2bNntt9/e0tIiygd3ixYtmjt37tSpU3Nzc19++eV9+/YF8lRAh6hXcEW90qmh7dQGgvZa75NPPtlre09PT6+uvKysLDw83NmVHzx4UNv+3XffjRo1ymq1ioZobW394x//GBMTo71+7Co+Pv5vf/ub9vWRI0c8v/a/fPnyBx54wLfTG0SBXF/95MYQ5kN9fb3i9gLHZ599JjqO+29prt59990JEyZ4OtVBpJ/rK0+fc6Ze9RfqlbadeuWuH7qdoR1+gPzjH/8YM2bMs88+W1FR4XA4zp49m5eXd+LEiZ6enjlz5jz44IPNzc21tbXz5s17+OGHtRDXVFNV9a677lqyZMnVq1dVVa2rq3vvvfdUVf32229LS0sdDoeqqnv27ImPj3cvPYWFhZmZmZWVlRaLZf78+StWrBBNsq6uLiwsTJ9vkNQMj9KjDmk+pKamrlu3zmaztbW1Pf/880ajsbGx0X2GXV1dbW1tRUVFCQkJbW1t2jG7u7v37t1bVVV
ltVqPHDmSnp7ufGvCkNPV9ZWk2zlTr/oF9cp5BOpVL7RKQsePH7/rrruio6PHjRt34403vvjii9ofC1y+fDknJ8dkMk2cODEvL89ut2v790o1q9Wan58/ZcoUo9FoNpufeOIJVVVPnTp12223RUVFxcTEzJ49+9ixY+7jdnR0PP7449HR0UajccWKFTabTTTD7du36/YNkpphU3rUocuHM2fOLFq0KCYmJioqau7cuaKfNK+//rrrvd6IiAhVVbu7u++8887Y2NiwsDCz2fzMM8+0trb2+zPjH71dXxl6njP1KnDUK2c49aqXwK+vwXkUP2ivXAZyBOhZINeX3BjegvH6BuOcIY96BZHAr+8wfFs3AABAf6FVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEAoN/BBe/9swRixyA3pDTkKE3IAId5UAAACEAvofcAAAAMMbd5UAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEAvq0bj7bdCTw75O3yI2RILg+lY2cHAmoVxAJpF5xVwkAAECoH/4HXCBdPLH6jw1EMJ4vsfKxwSgYn2di5WMDEYznS6x8bCC4qwQAACBEqwQAACBEqwQAACA0IK1SV1dXfn7+hAkToqKiVq5c2dTUJB9bVFSUnZ0dERGRmJjo67gbN27MysoaN25cSkrKpk2bOjo65GMLCwvT0tLCw8NjY2Pvvffe7777ztfRu7q6ZsyYYTAYamtr5aPWrFljcFFSUuLToP/+979nz549ZsyYuLi4TZs2yQcajUbXccPDw9vb230a2j8Wi2XZsmUmkyk6Onrx4sXnzp2Tj718+XJubm5sbOyECRN+//vfe52wKJdk8lMUK5Ofon1k8lMUK5OfnufmOT9FsQHmp255eK68rilRrMyaEsXKrAtRrMy6EOWezFoQxcqsBVGszFoQ7RN4rfbM89w8ryNRrMw68jCu15wUxcrkpChWJidFsTI5KbqOMjkpig2kf/BCDYDoCIWFhRkZGRUVFRaLZd68eStWrJCP/fDDD995553t27cnJCT4Ou7atWuPHTvW0NBw4sSJyZMnb968WT72s88+q6ioaGpqqqysvO+++7Kzs+VjNS+88MLChQsVRampqZGPXb16dUFBQfOPOjs75WNLS0uNRuNf//rX2tra6urqY8eOycfa7XbnoDk5OcuXL5ePlSGKvf/++3/xi1/88MMPdrt99erVN954o3zsbbfd9sADD9hstqtXr86ZM+eJJ57wHCvKJVF+ysSKtsvEivJTJlaUnzKxGvf8lIkV5Wfg1WPwyZyvaE3JxIrWlEysaF3IxIrWhWusKPdk1oIoVmYtiGJl1oJoH5m14CuZcTWe15EoVmYdiWJlclIUK5OToliZnBTFyuSk6DrK5KQoViYn/TMgrVJ8fPybb76pfV1WVhYaGnrt2jXJWE1xcbEfrZKrLVu2zJ8/34/Yjo6OvLy8u+++26fYr7/+Oj09/fPPP1d8b5WefvppD/PxEJudne13rFN9fX14ePjhw4f9iPVj3PT09L1792pfl5WVhYSEdHV1ycReuXJFUZTy8nLt24MHDxqNxvb2dq+x7rkkyk+ZWNF2+ViNa376FNsrPyVj+8xPmVhRfgZeegafzPmK1pRP16jXmpKJFa0Lr7Ee1oXoGrnmnvxacI8VnYt8rPt2n2K9rgV5kuNKriP3WF/XkWusfE72OWeN15x0j5XPyV6xvuZkr+voU072+fNaPifl9f8LcLW1tXV1dTNmzNC+nTlzZldX1zfffNPvA3l2/PjxmTNn+hRSVFSUmJgYGRn51Vdf/f3vf5cP7O7u/t3vfvfaa69FRkb6OM3/HXfy5Mm33nrr9u3bOzs7JaMcDsdnn33W3d193XXXxcTELFy48Msvv/Rj9LfeeislJeXnP/+5H7F+yM3NLS4urqura2pqeuONN37zm9+MGjVKJtCZ7k52u92Pe+/kp6/8y89gNIRrajDXhTP3/FgLfuSt11iZY/bax++14CvXcX1dR+5zll9Hzlg/crLP51MyJ11jfc1JZ6x8TrpfR/mcHLQc+F+B9Fl9HuH8+fOKolRWVv5fOxYS8tFHH8nEOgV4V2nLli1paWkNDQ0+xba2tl69evXYsWMzZsxYu3atfOyOHTuWLl2qquq3336r+HhX6dChQ5988smFCxf279+flJRUUFAgGVtTU6MoSlpa2tmzZ+12+4YNG5KSkux2u/z5ajIyMnbs2NHnQ4FkiCjWZrMtWLBAe/S6666rrq6Wj7311ludN3Xnzp2rKMqnn37qNbZXLnnIT6+xHrbLx6pu+SkZ22d+ysSK8lMmVpSfgVePwef1fD2sKZ+ub681JRMrWhcysaJ10ec1cs09n9aCKqirkr/Bi2qy17XQZ6zkWpAnM678OnKP9Wkducb6lJPu4zp5zUn3WPmcdI+VzEn36yifkx5+Xg/EXaX+b5W0S3v69GntW+09WSdOnJCJdQqkVXruuefMZnNVVZUfsZpjx44ZDIaWlhaZ2AsXLkyaNKm2tlb1q1VytW/fvvj4eMnY5uZmRVG2b9+ufdvW1jZq1KijR4/6NO7hw4fDwsLq6+v7fLTfS09PT8+sWbMeeuihxsZGu92+devWlJQU+fbu4sWLOTk5CQkJaWlpW7duVRTl/PnzXmP7/HHYZ37K/FgSbZePdc9P+ViNa356jfWQn76O65qfgZeewef1fD2sKfnnyn1NeY31sC5kxhWtC/fYXrnn01oQ1VWZtSCKlVkLnuu557Ugz+u4Pq0jz3P2vI56xfqUk6JxZXKyV6xPOek+rnxOapzX0aec7BXr3BIcL8AlJibGx8d/8cUX2renTp0KDQ3Nysrq94H6tHnz5n379h09ejQ1NTWQ44waNUryBvjx48cbGhquv/56k8mktc/XX3/9G2+84cegYWFhXV1dkjsbjcapU6c6P4TUv08j3b17d05Ojslk8iPWDz/88MN//vOf/Pz8mJiYiIiIJ598srq6+uzZs5LhKSkp77//fm1tbWVlZXJyclJS0tSpU32dA/k5OPkZjIZqTQ3OunDPPfm1EEjeimJljimzj/xakOc+rvw68jpnD+vIPVY+Jz2M6zUn3WPlc7LPcf2o1dp19K8+D0QO9BZInyU6QmFhYWZmZmVlpcVimT9/vk9/AdfV1dXW1lZUVJSQkNDW1uZwOORjH3/88WnTplVWVra1tbW1tbm/51cU29HR8eKLL5aXl1ut1s8///zWW2/Nzc2VjG1pabn0oyNHjiiKcurUKck7Jd3d3Xv37q2qqrJarUeOHElPT3/kkUfkz/fVV181m83nzp1ra2v7wx/+MHnyZMk7YZq6urqwsLA+39DtNdYrUWxqauq6detsNltbW9vzzz9vNBobGxslY0+ePPn99983NDQcOHAgLi7urbfe8jyuKJdE+SkTK9ouEyvKT6+xHvLTa6yH/PQa6yE/A68eg0/mGonWlEysKlhTMrGidSETK1oXrrGi3JNZC6JYmbUgip
VZC33uI7kWfOV1XMl11Ges5DoSPScyOenhZ5/XnBTFyuSkKNZrTnq4jl5z0kOsTE76Z0BapY6Ojscffzw6OtpoNK5YscJms8nHvv7664qLiIgIydhr164p/196erpkbGdn57333puQkBAWFjZlypSNGzf6NGcnX1+A6+7uvvPOO2NjY8PCwsxm8zPPPNPa2io/bk9Pz5YtWxISEqKiou64446vvvrKpzlv37592rRpHk6nv0qPqzNnzixatCgmJiYqKmru3Lk+/eXdzp074+PjR48enZWVVVRU5HVcUS6J8lMmVrTda6yH/PQa6yE/Zebs5OGFgz5jPeRnILkxVGSeK9Gaknye+1xTMrGidSETK1oXzlgPued1LXiI9boWRLEya0G0j+Ra8JXM+TqJ1pEoVmYdeRjXa056nrPnnPQQ6zUnPcR6zUkP19FrTnqIlanP/jE4j+KH4P23ecQSS+xQxQ6VYHyuiCWW2KGN1fCPTQAAAIRolQAAAIRolQAAAIRolQAAAIT64W3dGN4CeRsdhrdgfFs3hjfqFUR4WzcAAMCACOiuEgAAwPDGXSUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAAAhWiUAAACh/wGLggH7ga71+AAAAABJRU5ErkJggg==" -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +!!! example "Binding to cores and block:block distribution" -srun --ntasks 32 --cpu_bind=cores --distribution=block:block ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 + + srun --ntasks 32 --cpu_bind=cores --distribution=block:block ./application + ``` #### Distribution: cyclic:cyclic @@ -71,18 +107,19 @@ then the first socket of the second node until one task is placed on every first socket of every node. After that it will place a task on every second socket of every node and so on. -\<img alt="" -src="<data:;base64,iVBORw0KGgoAAAANSUhEUgAAAw4AAADeCAIAAAAb9sCoAAAABmJLR0QA/wD/AP+gvaeTAAAfCElEQVR4nO3de1BU5/348bOIoLIgyHJREGRRMORWY4yKWtuYapO0SQNe4piq6aiRJqKSxugMURNnmkQnyTh2am1MmjBOIYnRtDNpxk4QdTTJpFZjNAYvEMQLLBDchQWW6/n+cab748fuczi7y8JZeL/+kt3zec5zPs+Hxw9nl8Ugy7IEAAAAd4IGegIAAAD6RasEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgFOxLsMFg6Kt5AAg4siwP9BQ8wH4FDGW+7FfcVQIAABDy6a6SIrB+sgTgu8C9Q8N+BQw1vu9X3FUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUaun72s58ZDIYvvvjC+Uh8fPwnn3yifYRvvvnGaDRqP76goCAzMzMsLCw+Pt6DiQIY8vp/v9q4cWNGRsaoUaOSkpI2bdrU1tbmwXQxuNAqDWnR0dEvvPBCv53OZDJt2LBh+/bt/XZGAINGP+9Xdrt93759169fLyoqKioq2rZtW7+dGnpDqzSkrVq1qqys7OOPP3Z9qqqqatGiRbGxsYmJic8991xzc7Py+PXr1xcsWBAZGXnXXXedOnXKeXxDQ0NOTs748eNjYmKefPLJuro61zEfeeSRxYsXjx8/3k+XA2AQ6+f96u23354zZ050dHRmZubTTz/dPRxDDa3SkGY0Grdv375ly5b29vYeT2VnZw8fPrysrOz06dNnzpzJy8tTHl+0aFFiYmJ1dfW//vWvv/zlL87jly1bZrFYzp49W1lZOXr06JUrV/bbVQAYCgZwvzp58uTUqVP79GoQUGQf+D4CBtDcuXN37NjR3t4+efLkPXv2yLIcFxd3+PBhWZZLS0slSaqpqVGOLC4uHjFiRGdnZ2lpqcFgqK+vVx4vKCgICwuTZbm8vNxgMDiPt9lsBoPBarW6PW9hYWFcXJy/rw5+FYjf+4E4ZzgN1H4ly/LWrVtTUlLq6ur8eoHwH9+/94P7uzWDzgQHB7/22murV69evny588EbN26EhYXFxMQoX5rNZofDUVdXd+PGjejo6KioKOXxSZMmKf+oqKgwGAzTpk1zjjB69OibN2+OHj26v64DwODX//vVK6+8cuDAgZKSkujoaH9dFXSPVgnS448//sYbb7z22mvORxITE5uammpra5Xdp6KiIjQ01GQyJSQkWK3W1tbW0NBQSZKqq6uV45OSkgwGw7lz5+iNAPhVf+5XmzdvPnTo0PHjxxMTE/12QQgAvFcJkiRJu3bt2r17d2Njo/JlWlrajBkz8vLy7Ha7xWLJz89fsWJFUFDQ5MmTp0yZ8tZbb0mS1Nraunv3buX41NTU+fPnr1q1qqqqSpKk2tragwcPup6ls7PT4XAo7zNwOBytra39dHkABpH+2a9yc3MPHTp05MgRk8nkcDj4sIChjFYJkiRJ06dPf/TRR52/NmIwGA4ePNjc3JySkjJlypR77rnnzTffVJ766KOPiouL77vvvgcffPDBBx90jlBYWDhu3LjMzMzw8PAZM2acPHnS9Sxvv/32yJEjly9fbrFYRo4cyQ1tAF7oh/3KarXu2bPnypUrZrN55MiRI0eOzMjI6J+rgw4ZnO948ibYYJAkyZcRAASiQPzeD8Q5A/Cd79/73FUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUC
AAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQCh7oCQBA/ykvLx/oKQAIMAZZlr0PNhgkSfJlBACBKBC/95U5AxiafNmv+uCuEhsQAP0zm80DPQUAAakP7ioBGJoC664SAHjHp1YJAABgcOM34AAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIR8+ghKPldpKPDu4ySojaEgsD5qhJocCtivIOLLfsVdJQAAAKE++MMmgfWTJbTz/SctamOwCtyfwqnJwYr9CiK+1wZ3lQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIRolQAAAIQGf6t08eLFX//61yaTadSoUZMnT37xxRe9GGTy5MmffPKJxoN/8pOfFBUVuX2qoKAgMzMzLCwsPj7ei2mgb+mqNjZu3JiRkTFq1KikpKRNmza1tbV5MRkEOl3VJPuVruiqNobafjXIW6Wurq5f/vKX48aNO3/+fF1dXVFRkdlsHsD5mEymDRs2bN++fQDnAIXeasNut+/bt+/69etFRUVFRUXbtm0bwMlgQOitJtmv9ENvtTHk9ivZB76P4G/Xr1+XJOnixYuuT926dWvhwoUxMTEJCQnPPvtsU1OT8vjt27dzcnKSkpLCw8OnTJlSWloqy3J6evrhw4eVZ+fOnbt8+fK2tjabzbZ27drExESTybRkyZLa2lpZlp977rnhw4ebTKbk5OTly5e7nVVhYWFcXJy/rrnv+LK+1IZ3taHYunXrnDlz+v6a+47+19eV/uesz5pkv9IDfdaGYijsV4P8rtK4cePS0tLWrl37wQcfVFZWdn8qOzt7+PDhZWVlp0+fPnPmTF5envL40qVLr1279uWXX1qt1vfffz88PNwZcu3atVmzZs2ePfv9998fPnz4smXLLBbL2bNnKysrR48evXLlSkmS9uzZk5GRsWfPnoqKivfff78frxWe0XNtnDx5curUqX1/zdA3PdckBpaea2NI7FcD26n1A4vFsnnz5vvuuy84OHjixImFhYWyLJeWlkqSVFNToxxTXFw8YsSIzs7OsrIySZJu3rzZY5D09PSXXnopMTFx3759yiPl5eUGg8E5gs1mMxgMVqtVluV7771XOYsIP6XphA5rQ5blrVu3pqSk1NXV9eGV9rmAWN8eAmLOOqxJ9iud0GFtyENmvxr8rZJTY2PjG2+8ERQU9O23337++edhYWHOp3744QdJkiwWS3Fx8ahRo1xj09PT4+Lipk+f7nA4lEeOHj0aFBSU3E1kZOR3330ns/X4HNv/9FMbL7/8stlsrqio6NPr63uBtb6KwJqzfmqS/Upv9FMbQ2e/GuQvwHVnNBrz8vJGjBjx7bffJiYmNjU11dbWKk9VVFSEhoYqL8o2NzdXVVW5hu/evTsmJuaxxx5rbm6WJCkpKclgMJw7d67if27fvp2RkSFJUlDQEMrq4KCT2ti8efOBAweOHz+enJzsh6tEINFJTUKHdFIbQ2q/GuTfJNXV1S+88MLZs2ebmprq6+tfffXV9vb2adOmpaWlzZgxIy8vz263WyyW/Pz8FStWBAUFpaamzp8/f82aNVVVVbIsX7hwwVlqoaGhhw4dioiIePjhhxsbG5UjV61apRxQW1t78OBB5cj4+PhLly65nU9nZ6fD4Whvb5ckyeFwtLa29ksa4IbeaiM3N/fQoUNHjhwxmUwOh2PQ//ItXOmtJtmv9ENvtTHk9quBvanlbzabbfXq1ZMmTRo5cmRkZOSsWbM+/fRT5akbN25kZWWZTKaxY8fm5OTY7Xbl8fr6+tWrVyckJISHh993332XLl2Su/3WQEdHx29/+9sHHnigvr7earXm5uZOmDDBaDSazeb169crIxw7dmzSpEmRkZHZ2dk95rN3797uye9+41SHfFlfasOj2rh9+3aPb8zU1NT+y4Xn9L++rvQ/Z13VpMx+pSe6qo0huF8ZnKN4wWAwKKf3egTomS/rS20MboG4voE4Z2jHfgUR39d3kL8ABwAA4AtaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAACFaJQAAAKFg34cwGAy+D4JBidqA3lCTEKE2IMJdJQAAACGDLMsDPQcAAACd4q4SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAkE+f1s1nmw4F3n3yFrUxFATWp7JRk0MB+xVEfNmvuKsEAAAg1Ad/Ay6wfrKEdr7/pEVtDFaB+1M4NTlYsV9BxPfa4K4SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACA0KBtlU6dOvXoo4+OGTMmLCzs7rvvzs/Pb2pq6ofzdnR05ObmjhkzJiIiYtmyZQ0NDW4PMxqNhm5CQ0NbW1v7YXpD1kDVg8ViWbx4sclkioyMXLBgwaVLl9weVlBQkJmZGRYWFh8f3/3xlStXdq+ToqKifpgz+h/7Fbpjv9Kbwdkq/fOf/5w3b96999775Zdf1tTUHDhwoKam5ty5c1piZVlub2/3+tQvv/zykSNHTp8+ffXq1WvXrq1du9btYRaLpfF/srKynnjiidDQUK9PCnUDWA85OTlWq/Xy5cs3b94cO3bsokWL3B5mMpk2bNiwfft216fy8vKcpbJw4UKvZwLdYr9Cd+xXeiT7wPcR/KGzszMxMTEvL6/H411dXbIs37p1a+HChTExMQkJCc8++2xTU5PybHp6en5+/uzZs9PS0kpKSmw229q1axMTE00m05IlS2pra5XD3nzzzeTk5NGjR48dO3bHjh2uZ4+NjX333XeVf5eUlAQHB9++fVtltrW1taGhoUePHvXxqv3Bl/XVT20MbD2kpqbu379f+XdJSUlQUFBHR4doqoWFhXFxcd0fWbFixYsvvujtpfuRftZXO33Omf2qr7BfsV+J9EG3M7Cn9wel+z579qzbZ2fOnLl06dKGhoaqqqqZM2c+88wzyuPp6el33XVXXV2d8uWvfvWrJ554ora2trm5ec2aNY8++qgsy5cuXTIajVeuXJFl2Wq1/ve//+0xeFVVVfdTK3ezT506pTLbXbt2TZo0yYfL9aPBsfUMYD3Isrxp06Z58+ZZLBabzfbUU09lZWWpTNXt1jN27NjExMSpU6e+/vrrbW1tnifAL/Szvtrpc87sV32F/Yr9SoRWyY3PP/9ckqSamhrXp0pLS7s/VVxcPGLEiM7OTlmW09PT//SnPymPl5eXGwwG52E2m81gMFit1rKyspEjR3744YcNDQ1uT3358mVJksrLy52PBAUFffbZZyqzTUtL27Vrl+dX2R8Gx9YzgPWgHDx37lwlG3fccUdlZaXKVF23niNHjnzxxRdXrlw5ePBgQkKC68+aA0U/66udPufMftVX2K+Ux9mvXPm
+voPwvUoxMTGSJN28edP1qRs3boSFhSkHSJJkNpsdDkddXZ3y5bhx45R/VFRUGAyGadOmTZgwYcKECffcc8/o0aNv3rxpNpsLCgr+/Oc/x8fH//SnPz1+/HiP8cPDwyVJstlsypeNjY1dXV0RERHvvfee851u3Y8vKSmpqKhYuXJlX107XA1gPciy/NBDD5nN5vr6ervdvnjx4tmzZzc1NYnqwdX8+fNnzpw5ceLE7Ozs119//cCBA76kAjrEfoXu2K90amA7NX9QXut9/vnnezze1dXVoysvKSkJDQ11duWHDx9WHr969eqwYcOsVqvoFM3NzX/84x+joqKU14+7i42N/dvf/qb8+9ixY+qv/S9ZsuTJJ5/07PL6kS/rq5/aGMB6qK2tlVxe4Pjqq69E47j+lNbdhx9+OGbMGLVL7Uf6WV/t9Dln9qu+wn6lPM5+5aoPup2BPb2f/OMf/xgxYsRLL71UVlbmcDguXLiQk5Nz6tSprq6uGTNmPPXUU42NjdXV1bNmzVqzZo0S0r3UZFl++OGHFy5ceOvWLVmWa2pqPvroI1mWv//+++LiYofDIcvy22+/HRsb67r15Ofnp6enl5eXWyyWOXPmLF26VDTJmpqakJAQfb5BUjE4th55QOshOTl59erVNputpaXllVdeMRqN9fX1rjPs6OhoaWkpKCiIi4traWlRxuzs7Ny/f39FRYXVaj127FhqaqrzrQkDTlfrq5Fu58x+1SfYr5wjsF/1QKskdPLkyYcffjgyMnLUqFF33333q6++qvyywI0bN7Kyskwm09ixY3Nycux2u3J8j1KzWq25ubkTJkwwGo1ms3n9+vWyLJ85c+aBBx6IiIiIioqaPn36iRMnXM/b1ta2bt26yMhIo9G4dOlSm80mmuHOnTt1+wZJxaDZeuSBq4dz587Nnz8/KioqIiJi5syZov9p9u7d2/1eb1hYmCzLnZ2dDz30UHR0dEhIiNls3rJlS3Nzc59nxjt6W18t9Dxn9ivfsV85w9mvevB9fQ3OUbygvHLpywjQM1/Wl9oY3AJxfQNxztCO/Qoivq/vIHxbNwAAQF+hVQIAABCiVQIAABCiVQIAABCiVQIAABCiVQIAABCiVQIAABCiVQIAABCiVQIAABAK9n2IXv/aMIYsagN6Q01ChNqACHeVAAAAhHz6G3AAAACDG3eVAAAAhGiVAAAAhGiVAAAAhGiVAAAAhGiVAAAAhGiVAAAAhGiVAAAAhHz6tG4+23Qo8O6Tt6iNoSCwPpWNmhwK2K8g4st+xV0lAAAAoT74G3C+dPHE6j/WF4F4vcRqjw1EgZhnYrXH+iIQr5dY7bG+4K4SAACAEK0SAACAEK0SAACAkF9apY6Ojtzc3DFjxkRERCxbtqyhocGLEaZMmWIwGKqrq7VHWSyWxYsXm0ymyMjIBQsWXLp0Sf34goKCzMzMsLCw+Pj47o9v3LgxIyNj1KhRSUlJmzZtamtr0x4rSdK///3v6dOnjxgxIiYmZtOmTa6xovG15E19bup5E8V6mjdfaMmtil5z251ojbTkWWV9pd7yLIrVkmdRfrTkTeWYXvOWn5+fkpISGhoaHR392GOPXb16VXuuAp36WqtbuXKloZuioiLtsTdu3MjOzo6Ojh4zZszvf//71tZW7+YpWjstsUajsfv8Q0NDXachqisteRPFasmbKNbTvPlCS25FtOS2O1E+teRZdIyWPItiteRZtEZa8iaK1ZI30fi+fC/3QvaBaIT8/Py0tLSysjKLxTJr1qylS5dqj1Xs2LFj3rx5kiRVVVVpj33iiSd+8Ytf/Pjjj3a7fcWKFXfffbd67KeffvrBBx/s3LkzLi6u+zGrVq06ceJEXV3dqVOnxo8fv3nzZu2xxcXFRqPxr3/9a3V1dWVl5YkTJ1xjReOL8qYlVpQ3LbGivPlSIaJY9fmrx4pyK4oVrZGWPItiFep5FsVqybMoP1pqUnSMlpr86quvysrKGhoaysvLH3/88czMTO25ChSiOauvtXrsihUr8vLyGv+nvb1de+wDDzzw5JNP2my2W7duzZgxY/369eqxonmK1k5LrN1ud04+KytryZIlrrGiuhKNqSVWlDctsaK8+WO/EuVWS6wot6JYUT615Fl0jJY8i2K15Fm0RlpqUhSrpSZF42vJlXf80irFxsa+++67yr9LSkqCg4Nv376tMVaW5e+++y41NfXrr7+WPGyVUlNT9+/f7zxvUFBQR0dHr7GFhYUqW+TWrVvnzJmjPTYzM/PFF1/UPufu44vypiVWFuRNS6wob/7YelTm32usKLfqsa5rpD3PbmtDY55dYz3Nsyg/6jXpeoxHNdnW1paTk/PII48oX3pak3qmPmf1fUAUu2LFCi9qUpblmzdvSpJUWlqqfHn48GGj0dja2tprrMo8e6ydR7G1tbWhoaFHjx5VmbPsriZdx9QSK8pbr7EqefPrftUjtx7F9siteqxojbTk2fUY7XnuEetFnt3uV73WpEqslpp0uy7aa1K7vn8Brrq6uqamZsqUKcqXU6dO7ejouHjxosbwzs7O3/3ud2+99VZ4eLinp87Ozi4sLKypqWloaHjnnXd+85vfDBs2zNNBejh58uTUqVM1HuxwOL766qvOzs477rgjKipq3rx53377rcbxvchb97l5mrfusf7Im6dz6JUXuXUrgOpTlB8teXMeoz1vBQUF8fHx4eHh58+f//vf/y75nKshoqCgYPz48ffff//OnTvb29s1Rjm3bye73e7R6zs95tBj7Tz13nvvJSUl/fznP1c/zKPvWfVYj/LmjO3bvGnRb7n1k36rT9f11Z43t3Wlnjff18UzvvRZbke4fPmyJEnl5eX/rx0LCvrss8+0xMqyvGvXrkWLFsmy/P3330se3lWy2Wxz585Vnr3jjjsqKyu1xKr8pLV169aUlJS6ujqNsVVVVZIkpaSkXLhwwW63b9iwISEhwW63i+bcfXyVvPUaK4vzpiVWlDdfKqTX2B5z6DVWJbfqsT3WyKM8u9aG9jy7xnqUZ1F+eq3JHsdor8nm5uZbt26dOHFiypQpq1at8jRX+qc+Z+/uKh05cuSLL764cuXKwYMHExIS8vLytMfef//9zhc4Zs6cKUnSl19+2Wus23m6rp32WEVaWtquXbvU5+y2JjX+BN8jVpQ3LbGivPlpv3KbW42xih65VY/t27tK2vPsGutRnl1rQ2NNuo1VqNekyrr4465S37dKytZ89uxZ5UvlfaCnTp3SEnvlypVx48ZVV1fLnrdKXV1d06ZNe/rpp+vr6+12+7Zt25KSkrz4r9Tp5ZdfNpvNFRUV2mMbGxslSdq5c6fyZUtLy7Bhw44fP+42tsf4KnnrNVYlb73GquTNT1uP6xy0xKrkVj3WbTurMc89Yj3Kc49Yj/Isyo+WmuxxjEc1qThx4oTBYGhqavIoV/qnPmfvWqXuDhw4EBsbqz322rVrWVlZcXFxKSkp27ZtkyTp8uXLvcaqz9O5dh7FHj16NCQkpLa2VuW8oprU8t+S+vd797xpiRXlzX/7laJ7brXHuu
ZWPbZvW6Xu1PPsGqs9z+rrq16TolgtNek6vuhafN+v+v4FuPj4+NjY2G+++Ub58syZM8HBwRkZGVpiT548WVdXd+edd5pMJqWNvfPOO9955x0tsT/++ON//vOf3NzcqKiosLCw559/vrKy8sKFC95dxebNmw8cOHD8+PHk5GTtUUajceLEic4PBlX5hFDX8bXnzTVWe95cY/s2b1r4O7fq9F+fovxoyZvrMd7lbdiwYcOGDfMlV0NQSEhIR0eH9uOTkpI+/vjj6urq8vLyxMTEhISEiRMn+j4NZe08Ctm3b19WVpbJZBId4N33rMZYlby5jfVT3rTwR277jZ/qU0ttiPKmEutR3rxYF4/50meJRsjPz09PTy8vL7dYLHPmzNH+G3BNTU3X/+fYsWOSJJ05c0bLnSFFcnLy6tWrbTZbS0vLK6+8YjQa6+vrVWI7OjpaWloKCgri4uJaWlocDofy+Lp16yZNmlReXt7S0tLS0uJ8r6WW2DfffNNsNl+6dKmlpeUPf/jD+PHjXbtp0fiivPUaq5I3LecV5c2XChHFiuagJVaUW1GsaI205NltrMY8i86rJc+i/GipSdExvdZkW1vbq6++WlpaarVav/766/vvvz87O1t7rgKFaM6i9eo1trOzc//+/RUVFVar9dixY6mpqc8884z2854+ffqHH36oq6s7dOhQTEzMe++9px7rdp4qa6elJmVZrqmpCQkJ6fGmYy11JRqz11iVvGk5ryhvfb5fqeS211iF29yKYkX51JJnt8dozLNofC15drtGGmtS5f8C9ZpUGV9Lrrzjl1apra1t3bp1kZGRRqNx6dKlNptNe6yTF+9VOnfu3Pz586OioiIiImbOnNnrbxzs3btX6iYsLEyW5du3b0v/v9TUVI2xsix3dXVt3bo1Li4uIiLiwQcfPH/+fI9YlfFFedMSK8qbllhR3nwpL7exWuavcl5RbkWxojXqNc8qsU4qL8CJYnvNsyg/WmpS5Zhea7K9vf2xxx6Li4sLCQmZMGHCxo0bnTnRkqtAIZpzr2stiu3s7HzooYeio6NDQkLMZvOWLVuam5u1n3f37t2xsbHDhw/PyMgoKCjodc5u56mydhrreefOnZMmTRKdV6WuRGP2GquSNy3nFeXNl5p0G6uS215jFW5zK4oV5bPXPIuO0ZJnlfF7zbNojbTUpPr/Beo1qTK+llx5x+AcxQuB+2fziCWW2IGKHSiBmCtiiSV2YGMV/GETAAAAIVolAAAAIVolAAAAIVolAAAAoT54WzcGN1/eRofBLRDf1o3Bjf0KIrytGwAAwC98uqsEAAAwuHFXCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQOj/AItyAftZS8fsAAAAAElFTkSuQmCC>" -/> + +{: align="center"} + +!!! example "Binding to cores and cyclic:cyclic distribution" -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 -srun --ntasks 32 --cpu_bind=cores --distribution=cyclic:cyclic -``` + srun --ntasks 32 --cpu_bind=cores --distribution=cyclic:cyclic + ``` #### Distribution: cyclic:block @@ -90,104 +127,108 @@ The cyclic:block distribution will allocate the tasks of your job in alternation on node level, starting with first node filling the sockets linearly. 
-\<img alt=""
-src="<data:;base64,...
hQUjBkz5quvvkpOTm5paamvr9ceqqqqCg8P196UbW1trampcS/ftWtXXFzc/fff39raqihKSkqKwWA4e/Zs1f9cv349KytLUZSQkBGU6vCgk7mxefPm/fv3Hzt2LDU1dQCuEsFEJ3MSOqSTuTGi1qth/iKpra195plnzpw509LS0tjY+NJLL3V2ds6aNSsjI2POnDkFBQV2u91isRQWFq5evTokJCQ9PX3RokWPPvpoTU2Nqqrnzp1zTrXw8PCDBw9GRUXdc889zc3N2p5r167Vdqivrz9w4IC2Z2Ji4vnz5/scT3d3t8Ph6OzsVBTF4XC0t7cPSgzog97mRn5+/sGDBw8fPmwymRwOx7D/x7dwp7c5yXqlH3qbGyNuvRram1oDzWazrVu3btq0aWPHjo2Ojp43b95HH32kPXTlypWcnByTyTRx4sS8vDy73a5tb2xsXLduXVJSUmRk5G233Xb+/HnV5V8NdHV1/fa3v73jjjsaGxutVmt+fv6UKVOMRqPZbH7qqae0Ixw9enTatGnR0dG5ubm9xrNnzx7X8F1vnOpQIM8vc8OnuXH9+vVeL8z09PTBy8J3+n9+3el/zLqakyrrlZ7oam6MwPXK4DyKHwwGg3Z6v48APQvk+WVuDG/B+PwG45ghj/UKIoE/v8P8DTgAAIBA0CoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAIhQZ+CIPBEPhBMCwxN6A3zEmIMDcgwl0lAAAAIYOqqkM9BgAAAJ3irhIAAIAQrRIAAIAQrRIAAIAQrRIAAIAQrRIAAIAQrRIAAIAQrRIAAIBQQN/WzXebjgT+ffMWc2MkCK5vZWNOjgSsVxAJZL3irhIAAIBQP/wfcMH1myXkBf6bFnNjuAre38KZk8MV6xVEAp8b3FUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQGrat0smTJ++7774JEyZERETcfPPNhYWFLS0tg3Derq6u/Pz8CRMmREVFrVy5sqmpqc/djEajwUV4eHh7e/sgDG/EGqr5YLFYli1bZjKZoqOjFy9efP78+T53Kyoqys7OjoiISExMdN2+Zs0a13lSUlIyCGPG4GO9givWK70Znq3SP//5z4ULF956662fffZZXV3d/v376+rqzp49K1OrqmpnZ6ffp37hhRcOHz586tSp77///tKlS+vXr+9zN4vF0vw/OTk5Dz74YHh4uN8nhWdDOB/y8vKsVuuFCxeuXr06ceLEpUuX9rmbyWTasGHDtm3b3B8qKChwTpUlS5b4PRLoFusVXLFe6ZEagMCPMBC6u7uTk5MLCgp6be/p6VFV9dq1a0uWLImLi0tKSnr88cdbWlq0RzMzMwsLC++8886MjIyysjKbzbZ+/frk5GSTybR8+fL6+nptt9deey01NXX8+PETJ07cvn27+9nj4+Pfeust7c9lZWWhoaHXr1/3MNr6+vrw8PAjR44EeNUDIZDnVz9zY2jnQ3p6+r59+7Q/l5WVhYSEdHV1iYZaXFyckJDgumX16tXPPvusv5c+gPTz/MrT55hZr/oL6xXrlUg/dDtDe/qBoHXfZ86c6fPRuXPnrlixoqmpqaamZu7cuY899pi2PTMz86abbmpoaNB+/NWvfvXggw/W19e3trY++uij9913n6qq58+fNxqNFy9eVFXVarX+97//7XXwmpoa11Nrd7NPnjzpYbQ7d+6cNm1aAJc7gIbH0jOE80FV1U2bNi1cuNBisdhstocffjgnJ8fDUPtceiZOnJicnDxz5sxXXnmlo6PD9wAGhH6eX3n6HDPrVX9hvWK9EqFV6sMnn3yiKEpdXZ37Q+Xl5a4PlZaWjhkzpru7W1XVzMzMP/3pT9r2yspKg8Hg3M1msxkMBqvVWlFRMXbs2Pfee6+pqanPU1+4cEFRlMrKSueWkJCQjz/+2MNoMzIydu7c6ftVDobhsfQM4XzQdl6wYIGWxg033FBdXe1hqO5Lz+HDhz/99NOLFy8eOHAgKSnJ/XfNoaKf51eePsfMetVfWK+07axX7gJ/fofhZ5Xi4uIURbl69ar7Q1euXImIiNB2UBTFbDY7HI6Ghgbtx0mTJml/qKqqMhgMs2bNmjJlypQpU2655Zbx48dfvXrVbDYXFRX9+c9/TkxM/OlPf3rs2LFex4+MjFQUxWazaT82Nzf39PRERUW9/fbbzk+6ue5fVlZWVVW1Zs2a/rp2uBvC+aCq6t133202mxsbG+12+7Jly+68886WlhbRfHC3aNGiuXPnTp06NTc395VXXtm/f38gUUCHWK/givVKp4a2UxsI2nu9Tz/9dK/tPT09vbrysrKy8PBwZ1d+6NAhbfv3338/atQoq9UqOkVra+sf//jHmJgY7f1jV/Hx8X/729+0Px89etTze//Lly9/6KGHfLu8QRTI86ufuTGE86G+vl5xe4Pj888/Fx3H/bc0V++9996ECRM8Xeog0s/zK0+fY2a96i+sV9p21it3/dDtDO3pB8g//vGPMWPGPP/88xUVFQ6H49y5c3l5eSdPnuzp6ZkzZ87DDz/c3NxcW1s7b968Rx99VCtxnWqqqt5zzz1Lliy5du2aqqp1dXXvv/++qqrfffddaWmpw+FQVfWNN96Ij493X3oKCwszMzMrKystFsv8+fNXrFghGmRdXV1YWJg+PyCpGR5Ljzqk8yE1NXXdunU2m62tre3FF180Go2NjY3uI+zq6mpraysqKkpISGhra9OO2d3dvW/fvqqqKqvVevTo0fT0dOdHE4acrp5fSbodM+tVv2C9ch6B9aoXWiWhEydO3HPPPdHR0ePGjbv55ptfeukl7R8LXLlyJScnx2QyTZw4MS8vz263a/v3mmpWqzU/P3/KlClGo9FsNj/11FOqqp4+ffqOO+6IioqKiYmZPXv28ePH3c/b0dHx5JNPRkdHG43GFStW2Gw20Qh37Nih2w9IaobN0qMO3Xw4e/bsokWLYmJioqKi5s6dK/qbZs+ePa73eiMiIlRV7e7uvvvuu2NjY8PCwsxm83PPPdfa2trvyfhHb8+vDD2PmfUqcKxXznLWq14Cf34NzqP4QXvnMpAjQM8CeX6ZG8NbMD6/wThmyGO9gkjgz+8w/Fg3AABAf6FVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEKJVAgAAEAoN/BBe/7dhjFjMDegNcxIizA2IcFcJAABAKKD/Aw4AAGB4464SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAUEDf1s13m44E/n3zFnNjJAiub2VjTo4ErFcQCWS94q4SAACAUD/8H3CBdPHU6r82EMF4vdTK1wajYMyZWvnaQATj9VIrXxsI7ioBAAAI0SoBAAAI0SoBAAAIDUir1NXVlZ+fP2HChKioqJUrVzY1NcnXbty4MSsra9y4cSkpKZs2
bero6PDj7DNmzDAYDLW1tT4V/vvf/549e/aYMWPi4uI2bdokX2ixWJYtW2YymaKjoxcvXnz+/HnP+xcVFWVnZ0dERCQmJvYaudfcRLUyuYlqnWf3LzevPJzXa+aiWpnMRZnI5CyqlcnZ8z6ec/ZQ6zUrUa1MVoWFhWlpaeHh4bGxsffff//3338vn1Ww8/y68EyUm4w1a9YYXJSUlMjXGo1G19rw8PD29nbJ2itXruTm5sbGxk6YMOH3v/+910JRPjK5ifaRyU1UG0huMkTnlclcVCuTuej1K5OzqFYmZ1GtTM6iWpmsRLUyWYmuK5DXshdqAERHKCwszMjIqKiosFgs8+bNW7FihXzt2rVrjx8/3tDQcPLkycmTJ2/evFm+VrN9+/aFCxcqilJTUyNfW1paajQa//rXv9bW1lZXVx8/fly+9sEHH/zFL37x448/2u321atX33zzzZ5rP/roo3fffXfHjh0JCQmu+4hyk6kV5SZTq3HPLZAZInNeUeYytaLMXWtFmcjkLKqVydnzHPacs6hWJitRrUxWn3/+eUVFRVNTU2Vl5QMPPJCdnS2fVbAQjdnz68JzrSg3mdrVq1cXFBQ0/09nZ6d8rd1udxbm5OQsX75cvvaOO+546KGHbDbbtWvX5syZ89RTT3muFeUj2i5TK8pNplaU20CvV6LMZWpFmcu8fmVyFtXK5CyqlclZVCuTlahWJivRdclk5Z8BaZXi4+Pfeust7c9lZWWhoaHXr1+XrHW1ZcuW+fPny59XVdVvvvkmPT39iy++UHxslbKzs5999lnP4xHVpqen79u3T/tzWVlZSEhIV1eX19ri4uJeT6coN5laV665Sdb2mVt/LT2i84oyl6kVZS4as2sm8jm714q2S9b6lLNrrXxW7rU+ZdXR0ZGXl3fvvfdqP/qalZ55HrPn15TX6+2Vm0zt6tWr/V5znOrr68PDw48cOSJZe/XqVUVRysvLtR8PHTpkNBrb29u91orycd/u03rVKzeZWlFuA71eOfXK3Guth8zl1xyZnEW1qkTO7rW+5tzneb1m1avW16z6fN3JZyWv/9+Aq62traurmzFjhvbjzJkzu7q6vv32Wz8OdeLEiZkzZ8rv393d/bvf/e7111+PjIz06UQOh+Pzzz/v7u6+4YYbYmJiFi5c+NVXX8mX5+bmFhcX19XVNTU1vfnmm7/5zW9GjRrl0wCU4MwtEIOcuTMTP3IW5SmTs+s+vubsrPUjK9fzSmZVVFSUmJgYGRn59ddf//3vf1f6dU4OY+65+VQ7efLk22+/fceOHZ2dnX6c/e23305JSfn5z38uub/zrw0nu93u0/uG/WVocwvEIGTu6xruodannN1r5XPuc8ySWTlr5bMKZP74I5A+q88jXLhwQVGUysrK/9+OhYR8/PHHMrWutmzZkpaW1tDQIHleVVV37ty5dOlSVVW/++47xZe7SjU1NYqipKWlnTt3zm63b9iwISkpyW63S57XZrMtWLBAe/SGG26orq6WOW+vztdDbl5rXfXKTaZWlFsgM8TreT1kLjNmUeZ9jtk1E59yVsXz0GvO7vv4lLNrrU9ZuZ9XMqvW1tZr164dP358xowZa9eu9SMrnfM8Zr/vKrnnJll7+PDhTz/99OLFiwcOHEhKSiooKPB1zKqqZmRk7Ny506cx33777c43OObOnasoymeffea1tt/vKvWZm0ytKLcBXa9c9cpcplaUufyaI3mnxL1WMmf3Wp9yFq2TXrNyr5XMysPrbiDuKvV/q6Qt62fOnNF+1D4HevLkSZlapxdeeMFsNldVVcmf9+LFi5MmTaqtrVV9b5Wam5sVRdmxY4f2Y1tb26hRo44dOyZT29PTM2vWrEceeaSxsdFut2/dujUlJUWmzeqzdegzN/mXsXtuXms95DagS4+HzL3WesjcvbZXJj7lLJqHMjn32sennHvV+pRVr1qfstIcP37cYDC0tLT4lJX+eR5zgG/AqS65+VG7f//++Ph4X8975MiRsLCw+vp6n8Z86dKlnJychISEtLS0rVu3Kopy4cIFr7UD9Aac+n9z87XWNbcBXa+c3DOXqRVlLr/myOTs+e9Nzzl7rvWcs6hWJiv3Wvms3K9LExxvwCUmJsbHx3/55Zfaj6dPnw4NDc3KypI/wubNm/fv33/s2LHU1FT5qhMnTjQ0NNx4440mk0lrRW+88cY333xTptZoNE6dOtX5hZ4+fbPnjz/++J///Cc/Pz8mJiYiIuLpp5+urq4+d+6c/BE0wZhbIAYnc/dM5HMW5SmTs/s+8jm718pn5V7r3/wcNWrUqFGjAp+TI42Wmx+FYWFhXV1dvlbt3bs3JyfHZDL5VJWSkvLBBx/U1tZWVlYmJycnJSVNnTrV11P3r0HOLRADmrl/a7h8rShnr7UecvZQ6zWrPmv9mJ9+zx8fBNJniY5QWFiYmZlZWVlpsVjmz5/v07+Ae/LJJ6dNm1ZZWdnW1tbW1ub+eUNRbUtLy+X/OXr0qKIop0+fln8T7bXXXjObzefPn29ra/vDH/4wefJk+d8OU1NT161bZ7PZ2traXnzxRaPR2NjY6KG2q6urra2tqKgoISGhra3N4XBo20W5ydSKcvNa6yG3QGaIzJhFmcvUijJ3rRVlIpOzqFYm5z73kcxZdHyZrES1XrPq6Oh46aWXysvLrVbrF198cfvtt+fm5spnFSxEYxbNMa+1HnLzWtvd3b1v376qqiqr1Xr06NH09PTHHntMfsyqqtbV1YWFhfX5gW7PtadOnfrhhx8aGhoOHjwYFxf39ttve64V5SPa7rXWQ25eaz3kNtDrlSrIXKZWlLnM61cm5z5rJXPus1YyZw9/X3vNSlTrNSsP1yWTlX8GpFXq6Oh48skno6OjjUbjihUrbDabZO3169eV/ys9PV3+vE6+vgGnqmpPT8+WLVsSEhKioqLuuuuur7/+Wr727NmzixYtiomJiYqKmjt3rtd/jbJnzx7Xa4yIiNC2i3LzWushN5nzinILZHrJnFeUuUytKHNnrYdMvOYsqpXJWWYOi3L2UOs1Kw+1XrPq7Oy8//77ExISwsLCpkyZsnHjRmcmMnMyWIjG7PV1Iar1kJvX2u7u7rvvvjs2NjYsLMxsNj/33HOtra3yY1ZVdceOHdOmTevzIc+1u3btio+PHz16dFZWVlFRkddaUT6i7V5rPeTmtdZDboHMSZnrVQWZy9SKMnfWenj9es1ZVCuTs6hWJmfPa53nrDzUes3Kw3XJzEn/GJxH8UPw/rd51FJL7VDVDpVgzIpaaqkd2loN/7EJAACAEK0SAACAEK0SAACAEK0SAACAUD98rBvDWyAfo8PwFowf68bwxnoFET7WDQAAMCACuqsEAAAwvHFXCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQOj/AVpnAfsg0n+oAAAAAElFTkS
-/>
+
+{: align="center"}
 
+!!! example "Binding to cores and cyclic:block distribution"
+
+    ```bash
 #!/bin/bash
 #SBATCH --nodes=2
 #SBATCH --tasks-per-node=16
 #SBATCH --cpus-per-task=1
 
 srun --ntasks 32 --cpu_bind=cores --distribution=cyclic:block ./application
+    ```
 
 ### Socket Bound
 
-Note: The general distribution onto the nodes and sockets stays the
-same. The mayor difference between socket and cpu bound lies within the
-ability of the tasks to "jump" from one core to another inside a socket
-while executing the application. These jumps can slow down the execution
-time of your application.
+The general distribution onto the nodes and sockets stays the same. The major difference between
+socket- and CPU-bound lies within the ability of the OS to move tasks from one core to another
+inside a socket while executing the application. These jumps can slow down the execution time of
+your application.
 
 #### Default Distribution
 
-The default distribution uses --cpu_bind=sockets with
---distribution=block:cyclic. The default allocation method (as well as
-block:cyclic) will fill up one node after another, while filling socket
-one and two in alternation. Resulting in only even ranks on the first
-socket of each node and odd on each second socket of each node.
+The default distribution uses `--cpu_bind=sockets` with `--distribution=block:cyclic`. The default
+allocation method (as well as `block:cyclic`) will fill up one node after another, while filling
+socket one and two in alternation. This results in only even ranks on the first socket of each node
+and odd ranks on the second socket of each node.
 
-\<img alt=""
-src="data:;base64,...
QbrO4Aa09lgkCRJzxkA3I68tffJOUDTpH/v83QHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIroX+UxgMBv0nAQCNyDkAGoqnOwAAQHAGWZa9PQYAAAA34ukOAAAQHOUOAAAQHOUOAAAQHOUOAAAQHOUOAAAQHOUOAAAQnK6PGeTDvpqCxn1UAbHRFHj+YyyIq6aAnAM1enIOT3cAAIDgXPBHJPigQlHp/2mJ2BCVd3+SJq5ERc6BGv2xwdMdAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOPHLnezs7IcffthoNAYGBvbo0WPJkiWNOEmPHj0+//xzjY3vuuuuzMxMuy+lpaUNGTIkKCgoKiqqEcOAa/lUbDz77LOJiYmBgYGdO3devHhxVVVVIwYDX+BTcUXO8Sk+FRtNLecIXu7U1tY++OCDHTp0OHHiRFFRUWZmZlxcnBfHYzQaFy5cuGLFCi+OAQpfi43S0tKNGzdeunQpMzMzMzPz5Zdf9uJg0Gi+FlfkHN/ha7HR5HKOrIP+M7jbpUuXJEnKzs62fenKlStJSUlt27bt2LHjU089VVZWphy/efNmampq586dQ0JC+vbte/r0aVmWExISduzYobw6bNiw5OTkqqoqs9k8b9686Ohoo9E4ZcqUwsJCWZbnz5/fsmVLo9EYExOTnJxsd1QZGRmRkZHumrPr6Lm/xEbjYkOxfPny++67z/Vzdh1v3V/iipzjjr6e4ZuxoWgKOUfwpzsdOnTo3r37vHnz/vGPf1y8eLHuSxMnTmzZsmVOTs5PP/10+PDhRYsWKcenTZt24cKFb7/91mQybd68OSQkxNrlwoUL995779ChQzdv3tyyZcvp06cXFBQcOXLk4sWLYWFhM2fOlCRp/fr1iYmJ69evz8vL27x5swfniobx5dg4cOBA//79XT9nuJ8vxxW8y5djo0nkHO9WWx5QUFCwdOnSfv36tWjRomvXrhkZGbIsnz59WpKk69evK22ysrL8/f1rampycnIkScrPz693koSEhJdeeik6Onrjxo3KkdzcXIPBYD2D2Ww2GAwmk0mW5T59+ihXUcNPWj7CB2NDluXly5d36dKlqKjIhTN1OW/dX+KKnOOOvh7jg7EhN5mcI365Y1VSUvLOO+80a9bs+PHju3fvDgoKsr50/vx5SZIKCgqysrICAwNt+yYkJERGRg4cONBisShH9uzZ06xZs5g6wsPDT506JZN6dPf1PN+JjZUrV8bFxeXl5bl0fq5HuaOF78QVOcfX+E5sNJ2cI/gvs+oKDg5etGiRv7//8ePHo6Ojy8rKCgsLlZfy8vL8/PyUX3CWl5dfvXrVtvu6devatm07bty48vJySZI6d+5sMBiOHTuW9183b95MTEyUJKlZsya0qmLwkdhYunRpenr6/v37Y2Ji3DBLeJqPxBV8kI/ERpPKOYJvkmvXrj3//PNHjhwpKyu7cePGqlWrqqurBwwY0L1790GDBi1atKi0tLSgoGDZsmUpKSnNmjWLj48fOXLknDlzrl69KsvyyZMnraHm5+e3ffv20NDQhx56qKSkRGk5a9YspUFhYeHWrVuVllFRUWfOnLE7npqaGovFUl1dLUmSxWKprKz0yDLADl+LjQULFmzfvn3Xrl1Go9FisQj/n0JF5WtxRc7xHb4WG00u53j34ZK7mc3m2bNnd+vWLSAgIDw8/N577/3qq6+Uly5fvjxhwgSj0di+ffvU1NTS0lLl+I0bN2bPnt2xY8eQkJB+/fqdOXNGrvNO+Fu3bj322GP33HPPjRs3TCbTggULYmNjg4OD4+LinnnmGeUM+/bt69atW3h4+MSJE+uN5/3336+7+HUfYPogPfeX2GhQbNy8ebPexoyPj/fcWjSct+4vcUXOcUdfz/Cp2GiCOcdgPUsjGAwG5fKNPgN8mZ77S2yIzVv3l7gSGzkHavTfX8F/mQUAAEC5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABNdC/ymUP8sO2CI24A7EFdQQG1DD0x0AACA4gyzL3h4DAACAG/F0BwAACI5yBwAACI5yBwAACI5yBwAACI5yBwAACI5yBwAACI5yBwAACE7Xpyrz+ZVNQeM+mYnYaAo8/6ldxFVTQM6BGj05h6c7AABAcC74m1l8LrOo9P+0RGyIyrs/SRNXoiLnQI3+2ODpDgAAEBzlDgAAEBzlDgAAEBzlDgAAEBzlDgAAEBzlDgAAEBzlDgAAEBzlDgAAEJyw5c7BgwfHjBnTpk2boKCgO++8c9myZWVlZR647q1btxYsWNCmTZvQ0NDp06cXFxfbbRYcHGyow8/Pr7Ky0gPDa7K8FQ8FBQWTJ082Go3h4eGjRo06c+aM3WZpaWlDhgwJCgqKioqqe3zmzJl14yQzM9MDY0bjkHNQFznH14hZ7nzxxRcPPPBAnz59vv322+vXr6enp1+/fv3YsWNa+sqyXF1d3ehLr1y5cteuXT/99NO5c+cuXLgwb948u80KCgpK/mvChAnjx4/38/Nr9EXhmBfjITU11WQy/frrr/n5+e3bt580aZLdZkajceHChStWrLB9adGiRdZQSUpKavRI4FbkHNRFzvFFsg76z+AONTU10dHRixYtqne8trZWluUrV64kJSW1bdu2Y8eOTz31VFlZmfJqQkLCsmXLhg4d2r17971795rN5nnz5kVHRxuNxilTphQWFirN1q5dGxMTExYW1r59+1dffdX26u3atfv444+Vr/fu3duiRYubN286GG1hYaGfn9+ePXt0ztod9Nxf34kN78ZDfHz8pk2blK/37t3brFmzW7duqQ01IyMjMjKy7pGUlJQlS5Y0dupu5K376ztxVRc5x1XIOeQcNS6oWLx7eXdQKugjR47YfXXw4MHTpk0rLi6+evXq4MGD586dqxxPSEi44447ioqKlG/Hjh07fvz4wsLC8vLyOXPmjBkzRpblM2fOBAcHnz17VpZlk8n0888/1zv51atX615aeap88OBBB6Nds2ZNt27ddEzXjcRIPV6MB1mWFy9e/MADDxQUFJjN5hkzZkyYMMHBUO2mnvbt20dHR/fv3//NN9+sqqpq+AK4BeVOXeQcVyHnkHPUUO7YsXv3bkmSrl+/bvvS6dOn676UlZXl7+9fU1Mjy3JCQsKGDRuU47m5uQaDwdrMbDYbDAaTyZSTkxMQEPDZZ58VFxfbvfSvv/4qSVJubq71SLNmzb755hsHo+3evfuaNWsaPktPECP1eDEelMbD
hg1TVqNnz54XL150MFTb1LNr165Dhw6dPXt269atHTt2tP150Vsod+oi57gKOUc5Ts6xpf/+CvjenbZt20qSlJ+fb/vS5cuXg4KClAaSJMXFxVkslqKiIuXbDh06KF/k5eUZDIYBAwbExsbGxsb27t07LCwsPz8/Li4uLS3tL3/5S1RU1P/93//t37+/3vlDQkIkSTKbzcq3JSUltbW1oaGhn3zyifWdX3Xb7927Ny8vb+bMma6aO2x5MR5kWR4xYkRcXNyNGzdKS0snT548dOjQsrIytXiwNXLkyMGDB3ft2nXixIlvvvlmenq6nqWAm5BzUBc5x0d5t9pyB+X3ps8991y947W1tfUq67179/r5+Vkr6x07dijHz50717x5c5PJpHaJ8vLyN954o3Xr1srvYutq167d3/72N+Xrffv2Of49+pQpU6ZOndqw6XmQnvvrO7HhxXgoLCyUbH7R8N1336mdx/Ynrbo+++yzNm3aOJqqB3nr/vpOXNVFznEVco5ynJxjywUVi3cv7yY7d+709/d/6aWXcnJyLBbLyZMnU1NTDx48WFtbO2jQoBkzZpSUlFy7du3ee++dM2eO0qVuqMmy/NBDDyUlJV25ckWW5evXr2/ZskWW5V9++SUrK8tisciy/OGHH7Zr18429SxbtiwhISE3N7egoOC+++6bNm2a2iCvX7/eqlUr33zDoEKM1CN7NR5iYmJmz55tNpsrKipeeeWV4ODgGzdu2I7w1q1bFRUVaWlpkZGRFRUVyjlramo2bdqUl5dnMpn27dsXHx9v/TW/11Hu1EPOcQlyjvUM5Jx6KHdUHThw4KGHHgoPDw8MDLzzzjtXrVqlvAH+8uXLEyZMMBqN7du3T01NLS0tVdrXCzWTybRgwYLY2Njg4OC4uLhnnnlGluXDhw/fc889oaGhrVu3Hjhw4L/+9S/b61ZVVT399NPh4eHBwcHTpk0zm81qI3zrrbd89g2DCmFSj+y9eDh27NjIkSNbt24dGho6ePBgtX9p3n///brPXIOCgmRZrqmpGTFiRERERKtWreLi4l544YXy8nKXr0zjUO7YIufoR86xdifn1KP//hqsZ2kE5beAes4AX6bn/hIbYvPW/SWuxEbOgRr991fAtyoDAADURbkDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAE10L/KZz+hVU0WcQG3IG4ghpiA2p4ugMAAASn629mAQAA+D6e7gAAAMFR7gAAAMFR7gAAAMFR7gAAAMFR7gAAAMFR7gAAAMFR7gAAAMHp+lRlPr+yKWjcJzMRG02B5z+1i7hqCsg5UKMn5/B0BwAACM4FfzOLz2UWlf6flogNUXn3J2niSlTkHKjRHxs83QEAAIKj3AEAAIKj3AEAAIKj3AEAAIKj3AEAAIKj3AEAAIKj3JEkSerRo4fBYDAYDJGRkSkpKaWlpY04idFoPHfunMvHBu8iNuAOxBXUEBtuQrnzH1u2bJFl+dChQz/++OPq1au9PRz4EGID7kBcQQ2x4Q6UO/8jPj5+7Nixx48fV7598cUXO3fuHBoaOmjQoMOHDysHjUbj22+/PXDgwK5duz799NO2J9m3b19MTMz333/vuXHD/YgNuANxBTXEhmtR7vwPs9mclZXVq1cv5ds777zz559/vnHjxqRJk6ZOnWr9vM6jR48eOnToxIkTu3fvzsrKqnuGr7/+Ojk5eefOnQMHDvT06OFOxAbcgbiCGmLDxWQd9J/BRyQkJISHh0dGRrZo0WL06NHl5eW2bcLDwy9fvizLckRExPfff68cnDt37po1a5SvIyIiXn/99ZiYmFOnTnls5G6l5/4SG8SGSNd1OeLKLnKOTGyo0H9/ebrzH2vWrDl8+HBaWtrBgwdzcnKUg5988kn//v07deoUGxtbUlJSVFSkHG/Tpo3yhb+/f933kf3pT3+aPHlyYmKihwcPtyI24A7EFdQQG+5AufMf4eHhHTp0mDZt2pNPPrl06VJJks6ePfvcc8+lp6dfunQpLy8vNDRUdvbH57Zs2bJt27YNGzZ4ZMjwEGID7kBcQQ2x4Q4u+Ivognn++ee7dOly7NixW7duBQUFxcfHS5KUnp5+8+ZNp307duyYlZU1bNiwwMDAJ554wv2DhUcRG3AH4gpqiA0XotypLyoqKiUlZfXq1RkZGRMnTuzTp0/btm0HDx7cqVMnLd1jY2OzsrLuv/9+f3//Rx991N2jhScRG3AH4gpqiA0XMjh9IOaos8EgSZKeM8CX6bm/xIbYvHV/iSuxkXOgRv/95b07AABAcJQ7AABAcJQ7AABAcJQ7AABAcJQ7AABAcD5R7pjN5qSkpKCgoJiYmE8//dS2gcVimTt3bpcuXYKCgu6+++49e/Y4ONvp06f9/PxmzJih1mDjxo3x8fF+fn49e/Y8ffq0bYNffvll+PDhoaGh0dHRr7zyit2TrF27tk+fPi1atFA+A0r7XNT6apyj2nU1zv2243i+irlz53bo0MHPz69Lly5qfz34wQcfNPxX165dbRtYLBbD/7J7qpMnTw4bNiwkJKR79+5ff/11g8asZS5qbbTMUa2v07nfphysp9M9rtbX6d53sE8bnce07H2nbRzsfcd9Ha+Vg75a8qRa3GrJk3qo5Vgte1wtNpzufQdr5XTvO+jrdO876Ksl76nFpNO1cnBdLXlSbV5a8mTj+ES5s3DhQovFkp+f//HHH8+ZMyc7O7teg8rKysDAwG3btl28eHHatGnjxo0rLCxUO9v8+fPvuecetVe3bNny2muv/fWvf7127VpaWlp4eLhtm+Tk5N69excVFWVlZb333ns7d+60bdOpU6fXX3993LhxDZ2LWl+Nc1S7rpa5344cz1eRnJz8ww8/FBUVbd68edWqVbt27bLbLC0traKioqKiwu5N8ff3r/iv/Pz8li1bjh8/vl6b6urqRx55ZMSIEb/99tv69eunTJly6dIl7WPWMhe1Nlrm6OD8jud+m1Kbr5Y97mCdHe99B/u00XlMy9532sbB3nfQ1+laOeirJU+qxa2WPKmH3furZY+r9dWy9x2sldO973idHe99x7HheO+r9dWyVmp9NeZJtXlpyZONpOcPbuk/gyzLFoslICDgxx9/VL6dMGHCiy++qHz95JNPPvnkk7ZdWrduvW/fPrttPv300xkzZixZsmT69OnWg3Xb9OrVKzMz0/acddsEBgZa/+ja+PHj33jjDbXxpKSkLFmypHFzqddX+xzV+tqdux567q9LYsPKdr52Y+PKlStRUVHffvutbZtRo0ZlZGTYntnueTZs2DBw4EDbNidOnGjZsmV1dbVy/He/+93q1avVzqN2f7XMxUFsOJijWl+1uevh2vur57q289Wyx9X6at/7Cus+1ZnH1I5r7Os076n11b5Wtn0btFZ149bBWrk25zjYR2p7XK1vg/a+wvb+asxjdvvKGva+bd8G5T216zpdq3p9G7pW9ealsF0r/TnH+5+qnJubW1FR0bt3b+Xb3r17HzlyRPl61KhRtu3PnTt
XWlras2dP2zbFxcUrVqzYv3//unXr6naxtikrKzt16lR2dnb79u2bN2/+2GOPvfbaa82bN693nocffvjvf/977969z58//+OPPy5btszBePTMRY2DOapRm7uo6q3Jc889l5aWVlxcvG7dukGDBtlts2TJksWLFycmJq5cuXLgwIF22yg2b948c+ZMtWtZ1dbWnjx50nEbLTT21TJHNXbnLiSNe1xNg/Z+3X2qM4+pHdfS12neU+vb0LWqd12Na2Ubtw7WymM07nE1Tve+2v2tR2Nf7Xvftq/2vKc2Zi1r5WC+DtbK7rzcSE+tpP8Msiz/8MMPfn5+1m/Xrl37wAMPqDUuKysbMGDAihUr7L66YMGCt956S5ZltSccZ86ckSRpxIgRRUVFZ8+ejY+P//Of/2zbLC8vT/nTJJIkvfTSSw4GX68CbdBc1H7ycDxHtb5O594Ieu6vS2LDyvGTMFmWzWbzhQsXPvroo/Dw8FOnTtk2+PLLLw8fPpydnf3iiy+GhIRcuHBB7VTZ2dmtWrX67bffbF+qqqqKjY19+eWXy8vLv/zyy+bNm48fP76hY3Y6F7U2Tueo1lf73LVz7f3Vc91689W4x+32lRuy9+vtU5fkMS1737aN9r1fr2+D1sr2uhrXyjZuHayVa3OO2l5zsMfV+jZo76vdRy17325fjXvftq/2va82Zi1rVa+v9rVyMC93PN3x/nt3goODKysrq6qqlG+Li4uDg4PttqysrHzkkUd69eq1fPly21ePHTu2e/fuhQsXOrhWQECAJEl/+MMfIiIiunbtOnv2bNt3UVVWVg4fPnzu3LkWiyUnJ+eLL754//33XT4XNY7nqEbL3MUWGhrauXPnJ554YsSIEenp6bYNxowZ07dv3549e77++us9e/b85ptv1E61efPmsWPHtmnTxvalli1b7tixY/fu3ZGRkatWrRo3blx0dLQrp+GQ0zmq0T53AWjZ42q0733bfao/j2nZ+7ZttO99277a18q2r/a1so1b/XlSJwd7XI32vd+4HO64r5a9b7evxr3vYMxO18q2r/a1anROaxzv/zIrLi7O39//+PHjd999tyRJJ06c6NWrl22z6urqpKSk8PDwTZs2KX87o55///vfubm57du3lySpvLy8trY2Ozv78OHDddtER0eHh4dbu9s9T25ubm5u7tNPP+3n5xcXF5eUlJSVlZWamurCuahxOkc1WubedLRq1cppg5qaGrsv1dbWpqenv/fee2p977rrrgMHDihf9+vXb8KECY0epx5O5+igo9rcxaBlj6vRuPft7lOdeUzL3rfbRuPet9tX41rZ7du4PKnErc48qZPTPa5Gy95vdA7X3tfu3tfSV23vO+jrdK3U+jYiTzY6p2nn/ac7fn5+U6ZMWblypdls3rt37z//+c/p06crL82aNWvWrFmSJNXU1EyfPr26uvqjjz6qrq62WCy1tbX12jz++ONnz549evTo0aNHH3/88dGjR1t/UrG2MRgMycnJb7/9tslkunDhwqZNm8aOHVuvTWxsbFhY2AcffFBdXX3p0qVt27b16dOnXhtJkm7dumWxWGpqampqapQvNM5Fra+WOar1dTD3253d+Up11sRkMm3YsCEvL++3335LT0//6quvHn744XptSkpKMjMzr169WlhY+O677/78888jR46s10axe/fuysrK0aNH1x1D3TbffffdtWvX8vPzlyxZUlZWNnXqVNs2amN2Ohe1NlrmqNbXwdxvd3bnq2WPq/XVsvfV9qmePKZl76u10ZL31PpqWSu1vlrWSi1uHayVW2ND4XSPq/V1uvcd3Eene1+tr5a9r9ZXS95zMGana+Wgr9O1cjAvB/dOLz2/CdN/BoXJZJowYUJAQECnTp3S09Otx0eOHPnRRx/Jsnz+/Pl6w7a+29zapq56v8Ou26a8vDwlJSUkJKRDhw5//OMfa2pqbNvs2bNnwIABQUFBkZGR8+bNq6iosHfzt+cAAAHsSURBVG2zZMmSuuN59913Nc5Fra/GOapdV23ueui5v66KDbX5WtekuLh41KhRrVu3DgoK6t+//86dO619rW3MZvPQoUNDQ0ODg4MHDRq0e/du2zaKRx99dP78+fXGULfNCy+8EBYWFhAQMGbMmPPnz9ttozZmp3NRa6Nljmp9HcxdD1fdXz3XVVtPLXtcra/Tve9gnzY6j2nZ+w7aWKnlPQd9na6Vg75O18pB3KqtlZ640hIbsoY9rtbX6d53sFZO975aXy17X62vlrznOK4cr5WDvk7XysG81NZKT2z85wy6Ouu+vAPV1dW9evWqqqq6jdr4Wl+dXJV6XM7X7jux4fvXvR3vUVPrK3sp59yOa9XU+squyDkG61kaQfldnZ4zwJfpub/Ehti8dX+JK7GRc6BG//31/nt3AAAA3IpyBwAACI5yBwAACI5yBwAACI5yBwAACM4Fn6rc0M+ORNNBbMAdiCuoITaghqc7AABAcLo+dwcAAMD38XQHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAI7v8BE+cBiPwLm7cAAAAASUVORK5CYII=" -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +!!! example "Binding to sockets and block:cyclic distribution" -srun --ntasks 32 -cpu_bind=sockets ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 + + srun --ntasks 32 -cpu_bind=sockets ./application + ``` #### Distribution: block:block This method allocates the tasks linearly to the cores. 
-\<img alt=""
-src="data:;base64,...
GhoXY4Sjiai9QVXJCL1EaL6jmC/5Hcv39/2bJlFy9eLC0tLSwsfO+996qqqvr379+1a9eBAwfGxcWVlJTk5uauWrUqNjbWzc0tIiJixIgRc+fOvXfvnizLP/zwQ22pqdXqQ4cO+fj4vPzyy8XFxcrIV199VRmQn59/4MABZWRgYOD169cbnI/ZbDaZTFVVVZIkmUymiooKh/wa0ABXq43FixcfOnTo+PHjOp3OZDIJ/59CReVqdUXPcR2uVhstruc49+KSvRmNxjlz5nTp0sXT01Or1Q4ZMuTLL79UVt25c2fChAk6nS4oKGjBggUlJSXK8sLCwjlz5nTo0KFNmzZ9+/a9fv26XOeT8NXV1b/61a+effbZwsJCg8GwePHisLAwb2/v8PDwJUuWKHs4depUly5dtFrtxIkT681n586ddX/5dS9guiBbzi+18Vi18eDBg3p/mBEREY77XTw+Z51f6oqeY49tHcOlaqMF9hxV7V6aQKVSKU/f5D3AldlyfqkNsTnr/FJXYqPnwBrbz6/gb2YBAAAQdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIzt32XShfyw5YojZgD9QVrKE2YA1XdwAAgOBUsiw7ew4AAAB2xNUdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgbLqrMvevbAmadmcmaqMlcPxdu6irloCeA2ts6Tlc3QEAAIJrhu/M4r7MorL91RK1ISrnvpKmrkRFz4E1ttcGV3cAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMEJG3fOnDkzevTodu3aaTSap556atWqVaWlpQ543urq6sWLF7dr187Hx2fGjBlFRUUNDvP29lbVoVarKyoqHDC9FstZ9ZCbmzt58mSdTqfVakeOHHn9+vUGhyUkJAwePFij0QQGBtZdPmvWrLp1kpyc7IA5o2noOaiLnuNqxIw7n3/++QsvvNC7d+9vvvkmLy8vMTExLy8vLS3tUbaVZbmqqqrJT71+/frjx4+fP3/+5s2bt27dmj9/foPDcnNzi/9rwoQJ48ePV6vVTX5SNM6J9bBgwQKDwfDjjz/m5OQEBQVNmjSpwWE6nW7p0qXr1q2zXBUXF1dbKtHR0U2eCeyKnoO66DmuSLaB7XuwB7PZrNfr4+Li6i2vqamRZfnu3bvR0dH+/v4dOnR47bXXSktLlbWRkZGrVq0aOnRo165dU1NTjUbj/Pnz9Xq9TqebMmVKfn6+Mmzr1q2hoaG+vr5BQUHvvPOO5bO3b9/+k08+UX5OTU11d3d/8OBBI7PNz89Xq9UnT5608ajtwZbz6zq14dx6iIiI2LNnj/Jzamqqm5tbdXW1takmJSUFBATUXRIbGxsfH9/UQ7cjZ51f16mruug5zYWeQ8+xphkSi3Of3h6UBH3x4sUG1w4aNGjatGlFRUX37t0bNGjQvHnzlOWRkZE9e/YsKChQHo4ZM2b8+PH5+fllZWVz584dPXq0LMvXr1/39va+ceOGLMsGg+G7776rt/N79+7VfWrlqvKZM2came2WLVu6dOliw+HakRitx4n1IMvy8uXLX3jhhdzcXKPROHPmzAkTJjQy1QZbT1BQkF6v79ev3+bNmysrKx//F2AXxJ266DnNhZ5Dz7GGuNOAEydOSJKUl5dnuSo9Pb3uqpSUFA8PD7PZLMtyZGTkjh07lOWZmZkqlap2mNFoVKlUBoMhIyPD09Nz3759RUVFDT71jz/+KElSZmZm7RI3N7evvvqqkdl27dp1y5Ytj3+UjiBG63FiPSiDhw0bpvw2unfvnp2d3chULVvP8ePHz549e+PGjQMHDnTo0MHy9aKzEHfqouc0F3qOspyeY8n28yvgZ3f8/f0lScrJybFcdefOHY1GowyQJCk8PNxkMhUUFCgPg4ODlR+ysrJUKlX//v3DwsLCwsJ69erl6+ubk5MTHh6ekJDw8ccfBwYG/t///d/XX39db/9t2rSRJMloNCoPi4uLa2pqfHx8Pv3009pPftUdn5qampWVNWvWrOY6dlhyYj3Isvziiy+Gh4cXFhaWlJRMnjx56NChpaWl1urB0ogRIwYNGtS5c+eJEydu3rw5MTHRll8F7ISeg7roOS7KuWnLHpT3Td944416y2tqauol69TUVLVaXZusDx8+rCy/efPmz372M4PBYO0pysrKfvOb37Rt21Z5L7au9u3b/+lPf1J+PnXqVOPvo0+ZMmXq1KmPd3gOZMv5dZ3acGI95OfnSxZvNPzzn/+0th/LV1p17du3r127do0dqgM56/y6Tl3VRc9pLvQcZTk9x1IzJBbnPr2dHD161MPDY/Xq1RkZGSaT6YcffliwYMGZM2dqamoGDhw4c+bM4uLi+/fvDxkyZO7cucomdUtNluWXX345Ojr67t27sizn5eXt379fluVr166lpKSYTCZZlnfv3t2+fXvL1rNq1arIyMjMzMzc3Nznnntu2rRp1iaZl5fXunVr1/zAoEKM1iM7tR5CQ0PnzJljNBrLy8s3bNjg7e1dWFhoOcPq6ury8vKEhISAgIDy8nJln2azec+ePVlZWQaD4dSpUxEREbVv8zsdcaceek6zoOfU7oGeUw9xx6rTp0+//PLLWq3Wy8vrqaeeeu+995QPwN+5c2fChAk6nS4oKGjBggUlJSXK+HqlZjAYFi9eHBYW5u3tHR4evmTJElmWL1y48Oyzz/r4+LRt23bAgAF///vfLZ+3srJy0aJFWq3W29t72rRpRqPR2gzff/99l/3AoEKY1iM7rx7S0tJGjBjRtm1bHx+fQYMGWfuXZufOnXWvuWo0GlmWzWbziy++6Ofn17p16/Dw8JUrV5aVlTX7b6ZpiDuW6Dm2o+fUbk7Pqcf286uq3UsTKO8C2rIHuDJbzi+1ITZnnV/qSmz0HFhj+/kV8KPKAAAAdRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBudu+i4d+wypaLGoD9kBdwRpqA9ZwdQcAAAjOpu/MAgAAcH1c3QEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACM6muypz/8qWoGl3ZqI2WgLH37WLumoJ6Dmwxpaew9UdAAAguGb4zizuyywq218tURuicu4raepKVPQcWGN7bXB1BwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdSZKkbt26qVQqlUoVEBAQGxtbUlLShJ3odLqbN282+9zgXNQG7IG6gjXUhp0Qd/5j//79siyfPXv23LlzmzZtcvZ04EKoDdgDdQVrqA17IO78j4iIiDFjxly+fFl5+NZbb4WEhPj4+AwcOPDChQvKQp1O98EHHwwYMKBz586LFi2y3MmpU6dCQ0O//fZbx80b9kdtwB6oK1hDbTQv4s7/MBqNKSkpPXr0U
B4+9dRT3333XWFh4aRJk6ZOnVp7v85Lly6dPXv2+++/P3HiREpKSt09HDt2LCYm5ujRowMGDHD07GFP1AbsgbqCNdRGM5NtYPseXERkZKRWqw0ICHB3dx81alRZWZnlGK1We+fOHVmW/fz8vv32W2XhvHnztmzZovzs5+e3cePG0NDQK1euOGzmdmXL+aU2qA2RnrfZUVcNoufI1IYVtp9fru78x5YtWy5cuJCQkHDmzJmMjAxl4aefftqvX7+OHTuGhYUVFxcXFBQoy9u1a6f84OHhUfdzZL/97W8nT54cFRXl4MnDrqgN2AN1BWuoDXsg7vyHVqsNDg6eNm3ar3/96xUrVkiSdOPGjTfeeCMxMfH27dtZWVk+Pj7yw758bv/+/QcPHtyxY4dDpgwHoTZgD9QVrKE27KEZvhFdMMuWLevUqVNaWlp1dbVGo4mIiJAkKTEx8cGDBw/dtkOHDikpKcOGDfPy8nrllVfsP1k4FLUBe6CuYA210YyIO/UFBgbGxsZu2rQpKSlp4sSJvXv39vf3HzRoUMeOHR9l87CwsJSUlOeff97Dw2P69On2ni0cidqAPVBXsIbaaEaqh14Qa2xjlUqSJFv2AFdmy/mlNsTmrPNLXYmNngNrbD+/fHYHAAAIjrgDAAAER9wBAACCI+4AAADBEXcaYDQao6OjNRpNaGjoZ599ZjnAZDKp/hff4gbgobZu3dq7d293d3flZip17dq1KyIiQq1Wd+/ePT09vd5ak8k0b968Tp06aTSaZ5555uTJk7Wr5s2bFxwcrFarO3XqRCN6QjVyfhXp6elqtXrmzJkNbm6tBhqptxaIuNOApUuXmkymnJycTz75ZO7cuVevXq03wMPDo/y/cnJyWrVqNX78eKdMFQ5w7dq14cOH+/j46PX6DRs2NDjGWlt56aWXajNx586dHTJfuK6OHTtu3Lhx3Lhx9Zbv37//3Xff/cMf/nD//v2EhAStVltvQEVFhZeX18GDB7Ozs6dNmzZu3Lj8/HxlVUxMzL/+9a+CgoK9e/e+9957x48fd8SRoFk1cn4VCxcufPbZZ61tbq0GrNVbC2XLN1DYvgcXZDKZPD09z507pzycMGHCW2+91cj4HTt2DBgwwCFTczRbzq9ItfHMM88sWbKkoqIiPT29ffv2R44csRyzb9++v/71r+PHj4+Pj6+7fOTIkQkJCUoyrqiocNSU7c5Z51eMuoqNja1XJz169EhOTn70PbRt2/bUqVP1Ft69ezcwMPCbb75phik6CT1HUe/8fvbZZzNnzoyPj58xY0bjGzZYA5b19iSy/fxydae+zMzM8vLyXr16KQ979ep15cqVRsbv3bs3JibGIVODc1y9enX69OmtW7eOjIwcMmSI5dU+SZImTZo0ZswYHx8fy1WtWrXy8PDw8PBo3bq1/SeLJ09paemVK1euXr0aFBSk1+tXrlxpNpsbGX/z5s2SkpLu3bvXLnnjjTf8/f3DwsLWrl07cOBA+08ZdlTv/BYVFa1bt+79999vfCtq4KGIO/WVlJSo1eraf5l8fHzqfulaPdeuXUtLS5s6daqjZgcnGDt27J///GeTyXTt2rVz586NHDnysTaPj48PCQl56aWXvv32WzvNEE+0nJwcSZLOnj37ww8/nDp1av/+/R9//LG1wWVlZdOnT3/77bfbt29fu3Dt2rXffffdzp07V65c2WAcx5PC8vyuXr16zpw5QUFBjW9IDTwUcac+b2/vioqKyspK5WFRUZG3t7ckSTt27FA+gTFmzJjawXv37h0zZkztF9JCSJs3b/7iiy88PT2joqJmz57dt2/fR9920aJFR44cOX78eL9+/X7+859nZ2fbb554Qnl6ekqS9Oabb/r5+XXu3HnOnDnHjh2TGuo5FRUVv/zlL3v06LFmzZq6e/Dx8QkJCXnllVdefPHFxMRExx8CmoXl+U1LSztx4sTSpUvrjbSsDWrgoYg79YWHh3t4eFy+fFl5+P333/fo0UOSpIULFyrv/33xxRfKqpqamsTERN7JEltFRcXw4cPnzZtnMpkyMjI+//zznTt3Slbir6XRo0f36dOne/fuGzdu7N69+1dffeWoieOJodfrtVqtco986b83y5csek5VVVV0dLRWq92zZ0/tGEu8Z/qEavD8/uMf/8jMzAwKCtLpdL/73e8OHDigvNyy/PeoLmqgQcSd+tRq9ZQpU9avX280GlNTU//2t7/NmDGjwZEnTpyoqKgYNWqUg2cIR8rMzMzMzFy0aJFarQ4PD4+Ojk5JSZEe1m4a1Lp168Y/kwHhVVdXm0wms9lsNpuVHyRJUqlUMTExH3zwgcFguHXr1p49eywztNlsnjFjRlVV1R//+MeqqiqTyVRTUyNJksFg2LFjR1ZW1r///e/ExMQvv/xy7NixTjgw2Mba+Z09e/aNGzcuXbp06dKl2bNnjxo1SrnyV1cjNdBgvbVctnzO2fY9uCaDwTBhwgRPT8+OHTsmJiZaGzZ9+vTaf/OEZMv5FaY2ysrKfH19t23bVllZmZ2d/fTTT2/YsMFyWFVVVXl5+cyZM998883y8vLq6mpZlouKipKSku7evZuXl7d161ZPT88bN244/Ajswlnn90mvq/j4+Lrtd9u2bcrysrKy2NjYNm3aBAcHv/3222azud6GP/30U73WnZSUJMtyUVHRyJEj27Ztq9Fo+vXrd/ToUUcfUrNqsT3H2vmty9r/zGqkBqzV25PI9vPLN6LDKr6dWJGamhofH3/16lVvb+/x48dv27bNw8Oj3pgVK1Zs3ry59uG2bduWLl1aVFQ0evToy5cv19TU9OzZ8913333hhRccO3d74RvRYQ/0HFhj+/kl7sAqWg+sIe7AHug5sMb288tndwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABOdu+y4auZ05WjhqA/ZAXcEaagPWcHUHAAAIzqbbDAIAALg+ru4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcP8PtzynrHMtHtYAAAAASUVORK5CYII=" -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +!!! 
example "Binding to sockets and block:block distribution" + + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 -srun --ntasks 32 --cpu_bind=sockets --distribution=block:block ./application -``` + srun --ntasks 32 --cpu_bind=sockets --distribution=block:block ./application + ``` #### Distribution: block:cyclic -The block:cyclic distribution will allocate the tasks of your job in +The `block:cyclic` distribution will allocate the tasks of your job in alternation between the first node and the second node while filling the sockets linearly. -\<img alt="" -src="data:;base64,iVBORw0KGgoAAAANSUhEUgAAAvoAAADyCAIAAACzsfbGAAAABmJLR0QA/wD/AP+gvaeTAAAgAElEQVR4nO3daXQUVdrA8Wq27AukIQECCQkQCAoCIov44iAHFJARCJtggsqWIyLiCOggghsoisOAo4zoSE6cZGQTj8twDmGZAdzZiSAkhCVASITurJ2EpN4PNdMnk+7qrqR64+b/+5RU31t1763nPjypNB2DLMsSAACAuJp5ewAAAADuRbkDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAE10JPZ4PB4KpxALjtyLLs4SuSc4CmTE/O4ekOAAAQnK6nOwrP/4QHwLu8+5SFnAM0NfpzDk93AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3AACA4Ch3mq7777/fYDAcOnTIeiQqKurzzz/XfoajR48GBwdrb5+WljZkyJCgoKCoqKgGDBSAEDyfc5599tnExMTAwMDOnTsvXry4qqqqAcOFWCh3mrSIiIjnn3/eY5czGo0LFy5csWKFx64IwKd4OOeUlpZu3Ljx0qVLmZmZmZmZL7/8sscuDV9DudOkzZo1KycnZ9u2bbYvXb16ddKkSe3atYuOjp4/f355ebly/NKlS6NGjQoPD7/jjjsOHjxobV9cXJyamtqpU6e2bdtOnTq1qKjI9pyjR4+ePHlyp06d3DQdAD7Owznnww8/vO+++yIiIoYMGfL444/X7Y6mhnKnSQsODl6xYsULL7xQXV1d76WJEye2bNkyJyfnp59+Onz48KJFi5TjkyZNio6Ovnbt2tdff/3BBx9Y20+fPr2goODIkSMXL14MCwubOXOmx2YB4HbhxZxz4MCB/v37u3Q2uK3IOug/A7xo2LBhr776anV1dY8ePdavXy/LcmRk5I4dO2RZPn36tCRJ169fV1pmZWX5+/vX1NScPn3aYDDcuHFDOZ6WlhYUFCTLcm5ursFgsLY3m80Gg8FkMtm9bkZGRmRkpLtnB7fy1t4n59zWvJVzZFlevnx5ly5dioqK3DpBuI/+vd/C0+UVfEyLFi1Wr149e/bs5ORk68HLly8HBQW1bdtW+TYuLs5isRQVFV2+fDkiIqJ169bK8W7duilf5OXlGQyGAQMGWM8QFhaWn58fFhbmqXkAuD14Pue88sor6enpe/fujYiIcNes4PModyD9/ve/f+edd1avXm09Eh0dXVZWVlhYqGSfvLw8Pz8/o9HYsWNHk8lUWVnp5+cnSdK1a9eU9p07dzYYDMeOHaO+AeCUJ3PO0qVLt2/fvn///ujoaLdNCLcB3rsDSZKkNWvWrFu3rqSkRPm2e/fugwYNWrRoUWlpaUFBwbJly1JSUpo1a9ajR4++ffu+++67kiRVVlauW7dOaR8fHz9y5MhZs2ZdvXpVkqTCwsKtW7faXqWmpsZisSi/s7dYLJWVlR6aHgAf45mcs2DBgu3bt+/atctoNFosFv4jelNGuQNJkqSBAweOGTPG+l8hDAbD1q1by8vLu3Tp0rdv3969e69du1Z5acuWLVlZWf369Rs+fPjw4cOtZ8jIyOjQocOQIUNCQkIGDRp04MAB26t8+OGHAQEBycnJBQUFAQEBPFgGmiwP5ByTybR+/fqzZ8/GxcUFBAQEBAQkJiZ6ZnbwQQbrO4Aa09lgkCRJzxkA3I68tffJOUDTpH/v83QHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIroX+UxgMBv0nAQCNyDkAGoqnOwAAQHAGWZa9PQYAAAA34ukOAAAQHOUOAAAQHOUOAAAQHOUOAAAQHOUOAAAQHOUOAAAQnK6PGeTDvpqCxn1UAbHRFHj+YyyIq6aAnAM1enIOT3cAAIDgXPBHJPigQlHp/2mJ2BCVd3+SJq5ERc6BGv2xwdMdAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOModAAAgOPHLnezs7IcffthoNAYGBvbo0WPJkiWNOEmPHj0+//xzjY3vuuuuzMxMuy+lpaUNGTIkKCgoKiqqEcOAa/lUbDz77LOJiYmBgYGdO3devHhxVVVVIwYDX+BTcUXO8Sk+FRtNLecIXu7U1tY++OCDHTp0OHHiRFFRUWZmZlxcnBfHYzQaFy5cuGLFCi+OAQpfi43S0tKNGzdeunQpMzMzMzPz5Zdf9uJg0Gi+FlfkHN/ha7HR5HKOrIP+M7jbpUuXJEnKzs62fenKlStJSUlt27bt2LHjU089VVZWphy/efNmampq586dQ0JC+vbte/r0aVmWExISduzYobw6bNiw5OTkqqoqs9k8b9686Ohoo9E4ZcqUwsJCWZbnz5/fsmVLo9EYExOTnJxsd1QZGRmRkZHumrPr6Lm/xEbjYkOxfPny++67z/Vzdh1v3V/iipzjjr6e4ZuxoWgKOUfwpzsdOnTo3r37vHnz/vGPf1y8eLHuSxMnTmzZsmVOTs5PP/1
0+PDhRYsWKcenTZt24cKFb7/91mQybd68OSQkxNrlwoUL995779ChQzdv3tyyZcvp06cXFBQcOXLk4sWLYWFhM2fOlCRp/fr1iYmJ69evz8vL27x5swfniobx5dg4cOBA//79XT9nuJ8vxxW8y5djo0nkHO9WWx5QUFCwdOnSfv36tWjRomvXrhkZGbIsnz59WpKk69evK22ysrL8/f1rampycnIkScrPz693koSEhJdeeik6Onrjxo3KkdzcXIPBYD2D2Ww2GAwmk0mW5T59+ihXUcNPWj7CB2NDluXly5d36dKlqKjIhTN1OW/dX+KKnOOOvh7jg7EhN5mcI365Y1VSUvLOO+80a9bs+PHju3fvDgoKsr50/vx5SZIKCgqysrICAwNt+yYkJERGRg4cONBisShH9uzZ06xZs5g6wsPDT506JZN6dPf1PN+JjZUrV8bFxeXl5bl0fq5HuaOF78QVOcfX+E5sNJ2cI/gvs+oKDg5etGiRv7//8ePHo6Ojy8rKCgsLlZfy8vL8/PyUX3CWl5dfvXrVtvu6devatm07bty48vJySZI6d+5sMBiOHTuW9183b95MTEyUJKlZsya0qmLwkdhYunRpenr6/v37Y2Ji3DBLeJqPxBV8kI/ERpPKOYJvkmvXrj3//PNHjhwpKyu7cePGqlWrqqurBwwY0L1790GDBi1atKi0tLSgoGDZsmUpKSnNmjWLj48fOXLknDlzrl69KsvyyZMnraHm5+e3ffv20NDQhx56qKSkRGk5a9YspUFhYeHWrVuVllFRUWfOnLE7npqaGovFUl1dLUmSxWKprKz0yDLADl+LjQULFmzfvn3Xrl1Go9FisQj/n0JF5WtxRc7xHb4WG00u53j34ZK7mc3m2bNnd+vWLSAgIDw8/N577/3qq6+Uly5fvjxhwgSj0di+ffvU1NTS0lLl+I0bN2bPnt2xY8eQkJB+/fqdOXNGrvNO+Fu3bj322GP33HPPjRs3TCbTggULYmNjg4OD4+LinnnmGeUM+/bt69atW3h4+MSJE+uN5/3336+7+HUfYPogPfeX2GhQbNy8ebPexoyPj/fcWjSct+4vcUXOcUdfz/Cp2GiCOcdgPUsjGAwG5fKNPgN8mZ77S2yIzVv3l7gSGzkHavTfX8F/mQUAAEC5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABEe5AwAABNdC/ymUP8sO2CI24A7EFdQQG1DD0x0AACA4gyzL3h4DAACAG/F0BwAACI5yBwAACI5yBwAACI5yBwAACI5yBwAACI5yBwAACI5yBwAACE7Xpyrz+ZVNQeM+mYnYaAo8/6ldxFVTQM6BGj05h6c7AABAcC74m1l8LrOo9P+0RGyIyrs/SRNXoiLnQI3+2ODpDgAAEBzlDgAAEBzlDgAAEBzlDgAAEBzlDgAAEBzlDgAAEBzlDgAAEBzlDgAAEJyw5c7BgwfHjBnTpk2boKCgO++8c9myZWVlZR647q1btxYsWNCmTZvQ0NDp06cXFxfbbRYcHGyow8/Pr7Ky0gPDa7K8FQ8FBQWTJ082Go3h4eGjRo06c+aM3WZpaWlDhgwJCgqKioqqe3zmzJl14yQzM9MDY0bjkHNQFznH14hZ7nzxxRcPPPBAnz59vv322+vXr6enp1+/fv3YsWNa+sqyXF1d3ehLr1y5cteuXT/99NO5c+cuXLgwb948u80KCgpK/mvChAnjx4/38/Nr9EXhmBfjITU11WQy/frrr/n5+e3bt580aZLdZkajceHChStWrLB9adGiRdZQSUpKavRI4FbkHNRFzvFFsg76z+AONTU10dHRixYtqne8trZWluUrV64kJSW1bdu2Y8eOTz31VFlZmfJqQkLCsmXLhg4d2r17971795rN5nnz5kVHRxuNxilTphQWFirN1q5dGxMTExYW1r59+1dffdX26u3atfv444+Vr/fu3duiRYubN286GG1hYaGfn9+ePXt0ztod9Nxf34kN78ZDfHz8pk2blK/37t3brFmzW7duqQ01IyMjMjKy7pGUlJQlS5Y0dupu5K376ztxVRc5x1XIOeQcNS6oWLx7eXdQKugjR47YfXXw4MHTpk0rLi6+evXq4MGD586dqxxPSEi44447ioqKlG/Hjh07fvz4wsLC8vLyOXPmjBkzRpblM2fOBAcHnz17VpZlk8n0888/1zv51atX615aeap88OBBB6Nds2ZNt27ddEzXjcRIPV6MB1mWFy9e/MADDxQUFJjN5hkzZkyYMMHBUO2mnvbt20dHR/fv3//NN9+sqqpq+AK4BeVOXeQcVyHnkHPUUO7YsXv3bkmSrl+/bvvS6dOn676UlZXl7+9fU1Mjy3JCQsKGDRuU47m5uQaDwdrMbDYbDAaTyZSTkxMQEPDZZ58VFxfbvfSvv/4qSVJubq71SLNmzb755hsHo+3evfuaNWsaPktPECP1eDEelMbDhg1TVqNnz54XL150MFTb1LNr165Dhw6dPXt269atHTt2tP150Vsod+oi57gKOUc5Ts6xpf/+CvjenbZt20qSlJ+fb/vS5cuXg4KClAaSJMXFxVkslqKiIuXbDh06KF/k5eUZDIYBAwbExsbGxsb27t07LCwsPz8/Li4uLS3tL3/5S1RU1P/93//t37+/3vlDQkIkSTKbzcq3JSUltbW1oaGhn3zyifWdX3Xb7927Ny8vb+bMma6aO2x5MR5kWR4xYkRcXNyNGzdKS0snT548dOjQsrIytXiwNXLkyMGDB3ft2nXixIlvvvlmenq6nqWAm5BzUBc5x0d5t9pyB+X3ps8991y947W1tfUq67179/r5+Vkr6x07dijHz50717x5c5PJpHaJ8vLyN954o3Xr1srvYutq167d3/72N+Xrffv2Of49+pQpU6ZOndqw6XmQnvvrO7HhxXgoLCyUbH7R8N1336mdx/Ynrbo+++yzNm3aOJqqB3nr/vpOXNVFznEVco5ynJxjywUVi3cv7yY7d+709/d/6aWXcnJyLBbLyZMnU1NTDx48WFtbO2jQoBkzZpSUlFy7du3ee++dM2eO0qVuqMmy/NBDDyUlJV25ckWW5evXr2/ZskWW5V9++SUrK8tisciy/OGHH7Zr18429SxbtiwhISE3N7egoOC+++6bNm2a2iCvX7/eqlUr33zDoEKM1CN7NR5iYmJmz55tNpsrKipeeeWV4ODgGzdu2I7w1q1bFRUVaWlpkZGRFRUVyjlramo2bdqUl5dnMpn27dsXHx9v/TW/11Hu1EPOcQlyjvUM5Jx6KHdUHThw4KGHHgoPDw8MDLzzzjtXrVqlvAH+8uXLEyZMMBqN7du3T01NLS0tVdrXCzWTybRgwYLY2Njg4OC4uLhnnnlGluXDhw/fc889oaGhrVu3Hjhw4L/+9S/b61ZVVT399NPh4eHBwcHTpk0zm81qI3zrrbd89g2DCmFSj+y9eDh27NjIkSNbt24dGho6ePBgtX9p3n///b
rPXIOCgmRZrqmpGTFiRERERKtWreLi4l544YXy8nKXr0zjUO7YIufoR86xdifn1KP//hqsZ2kE5beAes4AX6bn/hIbYvPW/SWuxEbOgRr991fAtyoDAADURbkDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAER7kDAAAE10L/KZz+hVU0WcQG3IG4ghpiA2p4ugMAAASn629mAQAA+D6e7gAAAMFR7gAAAMFR7gAAAMFR7gAAAMFR7gAAAMFR7gAAAMFR7gAAAMHp+lRlPr+yKWjcJzMRG02B5z+1i7hqCsg5UKMn5/B0BwAACM4FfzOLz2UWlf6flogNUXn3J2niSlTkHKjRHxs83QEAAIKj3AEAAIKj3AEAAIKj3AEAAIKj3AEAAIKj3AEAAIKj3JEkSerRo4fBYDAYDJGRkSkpKaWlpY04idFoPHfunMvHBu8iNuAOxBXUEBtuQrnzH1u2bJFl+dChQz/++OPq1au9PRz4EGID7kBcQQ2x4Q6UO/8jPj5+7Nixx48fV7598cUXO3fuHBoaOmjQoMOHDysHjUbj22+/PXDgwK5duz799NO2J9m3b19MTMz333/vuXHD/YgNuANxBTXEhmtR7vwPs9mclZXVq1cv5ds777zz559/vnHjxqRJk6ZOnWr9vM6jR48eOnToxIkTu3fvzsrKqnuGr7/+Ojk5eefOnQMHDvT06OFOxAbcgbiCGmLDxWQd9J/BRyQkJISHh0dGRrZo0WL06NHl5eW2bcLDwy9fvizLckRExPfff68cnDt37po1a5SvIyIiXn/99ZiYmFOnTnls5G6l5/4SG8SGSNd1OeLKLnKOTGyo0H9/ebrzH2vWrDl8+HBaWtrBgwdzcnKUg5988kn//v07deoUGxtbUlJSVFSkHG/Tpo3yhb+/f933kf3pT3+aPHlyYmKihwcPtyI24A7EFdQQG+5AufMf4eHhHTp0mDZt2pNPPrl06VJJks6ePfvcc8+lp6dfunQpLy8vNDRUdvbH57Zs2bJt27YNGzZ4ZMjwEGID7kBcQQ2x4Q4u+Ivognn++ee7dOly7NixW7duBQUFxcfHS5KUnp5+8+ZNp307duyYlZU1bNiwwMDAJ554wv2DhUcRG3AH4gpqiA0XotypLyoqKiUlZfXq1RkZGRMnTuzTp0/btm0HDx7cqVMnLd1jY2OzsrLuv/9+f3//Rx991N2jhScRG3AH4gpqiA0XMjh9IOaos8EgSZKeM8CX6bm/xIbYvHV/iSuxkXOgRv/95b07AABAcJQ7AABAcJQ7AABAcJQ7AABAcJQ7AABAcD5R7pjN5qSkpKCgoJiYmE8//dS2gcVimTt3bpcuXYKCgu6+++49e/Y4ONvp06f9/PxmzJih1mDjxo3x8fF+fn49e/Y8ffq0bYNffvll+PDhoaGh0dHRr7zyit2TrF27tk+fPi1atFA+A0r7XNT6apyj2nU1zv2243i+irlz53bo0MHPz69Lly5qfz34wQcfNPxX165dbRtYLBbD/7J7qpMnTw4bNiwkJKR79+5ff/11g8asZS5qbbTMUa2v07nfphysp9M9rtbX6d53sE8bnce07H2nbRzsfcd9Ha+Vg75a8qRa3GrJk3qo5Vgte1wtNpzufQdr5XTvO+jrdO876Ksl76nFpNO1cnBdLXlSbV5a8mTj+ES5s3DhQovFkp+f//HHH8+ZMyc7O7teg8rKysDAwG3btl28eHHatGnjxo0rLCxUO9v8+fPvuecetVe3bNny2muv/fWvf7127VpaWlp4eLhtm+Tk5N69excVFWVlZb333ns7d+60bdOpU6fXX3993LhxDZ2LWl+Nc1S7rpa5344cz1eRnJz8ww8/FBUVbd68edWqVbt27bLbLC0traKioqKiwu5N8ff3r/iv/Pz8li1bjh8/vl6b6urqRx55ZMSIEb/99tv69eunTJly6dIl7WPWMhe1Nlrm6OD8jud+m1Kbr5Y97mCdHe99B/u00XlMy9532sbB3nfQ1+laOeirJU+qxa2WPKmH3furZY+r9dWy9x2sldO973idHe99x7HheO+r9dWyVmp9NeZJtXlpyZONpOcPbuk/gyzLFoslICDgxx9/VL6dMGHCiy++qHz95JNPPvnkk7ZdWrduvW/fPrttPv300xkzZixZsmT69OnWg3Xb9OrVKzMz0/acddsEBgZa/+ja+PHj33jjDbXxpKSkLFmypHFzqddX+xzV+tqdux567q9LYsPKdr52Y+PKlStRUVHffvutbZtRo0ZlZGTYntnueTZs2DBw4EDbNidOnGjZsmV1dbVy/He/+93q1avVzqN2f7XMxUFsOJijWl+1uevh2vur57q289Wyx9X6at/7Cus+1ZnH1I5r7Os076n11b5Wtn0btFZ149bBWrk25zjYR2p7XK1vg/a+wvb+asxjdvvKGva+bd8G5T216zpdq3p9G7pW9ealsF0r/TnH+5+qnJubW1FR0bt3b+Xb3r17HzlyRPl61KhRtu3PnTtXWlras2dP2zbFxcUrVqzYv3//unXr6naxtikrKzt16lR2dnb79u2bN2/+2GOPvfbaa82bN693nocffvjvf/977969z58//+OPPy5btszBePTMRY2DOapRm7uo6q3Jc889l5aWVlxcvG7dukGDBtlts2TJksWLFycmJq5cuXLgwIF22yg2b948c+ZMtWtZ1dbWnjx50nEbLTT21TJHNXbnLiSNe1xNg/Z+3X2qM4+pHdfS12neU+vb0LWqd12Na2Ubtw7WymM07nE1Tve+2v2tR2Nf7Xvftq/2vKc2Zi1r5WC+DtbK7rzcSE+tpP8Msiz/8MMPfn5+1m/Xrl37wAMPqDUuKysbMGDAihUr7L66YMGCt956S5ZltSccZ86ckSRpxIgRRUVFZ8+ejY+P//Of/2zbLC8vT/nTJJIkvfTSSw4GX68CbdBc1H7ycDxHtb5O594Ieu6vS2LDyvGTMFmWzWbzhQsXPvroo/Dw8FOnTtk2+PLLLw8fPpydnf3iiy+GhIRcuHBB7VTZ2dmtWrX67bffbF+qqqqKjY19+eWXy8vLv/zyy+bNm48fP76hY3Y6F7U2Tueo1lf73LVz7f3Vc91689W4x+32lRuy9+vtU5fkMS1737aN9r1fr2+D1sr2uhrXyjZuHayVa3OO2l5zsMfV+jZo76vdRy17325fjXvftq/2va82Zi1rVa+v9rVyMC93PN3x/nt3goODKysrq6qqlG+Li4uDg4PttqysrHzkkUd69eq1fPly21ePHTu2e/fuhQsXOrhWQECAJEl/+MMfIiIiunbtOnv2bNt3UVVWVg4fPnzu3LkWiyUnJ+eLL754//33XT4XNY7nqEbL3MUWGhrauXPnJ554YsSIEenp6bYNxowZ07dv3549e77++us9e/b85ptv1E61efPmsWPHtmnTxvalli1b7tixY/fu3ZGRkatWrRo3blx0dLQrp+GQ0zmq0T53AWjZ42q0733bfao/j2nZ+7ZttO99277a18q2r/a1so1b/XlSJwd7X
I32vd+4HO64r5a9b7evxr3vYMxO18q2r/a1anROaxzv/zIrLi7O39//+PHjd999tyRJJ06c6NWrl22z6urqpKSk8PDwTZs2KX87o55///vfubm57du3lySpvLy8trY2Ozv78OHDddtER0eHh4dbu9s9T25ubm5u7tNPP+3n5xcXF5eUlJSVlZWamurCuahxOkc1WubedLRq1cppg5qaGrsv1dbWpqenv/fee2p977rrrgMHDihf9+vXb8KECY0epx5O5+igo9rcxaBlj6vRuPft7lOdeUzL3rfbRuPet9tX41rZ7du4PKnErc48qZPTPa5Gy95vdA7X3tfu3tfSV23vO+jrdK3U+jYiTzY6p2nn/ac7fn5+U6ZMWblypdls3rt37z//+c/p06crL82aNWvWrFmSJNXU1EyfPr26uvqjjz6qrq62WCy1tbX12jz++ONnz549evTo0aNHH3/88dGjR1t/UrG2MRgMycnJb7/9tslkunDhwqZNm8aOHVuvTWxsbFhY2AcffFBdXX3p0qVt27b16dOnXhtJkm7dumWxWGpqampqapQvNM5Fra+WOar1dTD3253d+Up11sRkMm3YsCEvL++3335LT0//6quvHn744XptSkpKMjMzr169WlhY+O677/78888jR46s10axe/fuysrK0aNH1x1D3TbffffdtWvX8vPzlyxZUlZWNnXqVNs2amN2Ohe1NlrmqNbXwdxvd3bnq2WPq/XVsvfV9qmePKZl76u10ZL31PpqWSu1vlrWSi1uHayVW2ND4XSPq/V1uvcd3Eene1+tr5a9r9ZXS95zMGana+Wgr9O1cjAvB/dOLz2/CdN/BoXJZJowYUJAQECnTp3S09Otx0eOHPnRRx/Jsnz+/Pl6w7a+29zapq56v8Ou26a8vDwlJSUkJKRDhw5//OMfa2pqbNvs2bNnwIABQUFBkZGR8+bNq6iosHfzt+cAAAHsSURBVG2zZMmSuuN59913Nc5Fra/GOapdV23ueui5v66KDbX5WtekuLh41KhRrVu3DgoK6t+//86dO619rW3MZvPQoUNDQ0ODg4MHDRq0e/du2zaKRx99dP78+fXGULfNCy+8EBYWFhAQMGbMmPPnz9ttozZmp3NRa6Nljmp9HcxdD1fdXz3XVVtPLXtcra/Tve9gnzY6j2nZ+w7aWKnlPQd9na6Vg75O18pB3KqtlZ640hIbsoY9rtbX6d53sFZO975aXy17X62vlrznOK4cr5WDvk7XysG81NZKT2z85wy6Ouu+vAPV1dW9evWqqqq6jdr4Wl+dXJV6XM7X7jux4fvXvR3vUVPrK3sp59yOa9XU+squyDkG61kaQfldnZ4zwJfpub/Ehti8dX+JK7GRc6BG//31/nt3AAAA3IpyBwAACI5yBwAACI5yBwAACI5yBwAACM4Fn6rc0M+ORNNBbMAdiCuoITaghqc7AABAcLo+dwcAAMD38XQHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAIjnIHAAAI7v8BE+cBiPwLm7cAAAAASUVORK5CYII=" -/> + +{: align="center"} +!!! example "Binding to sockets and block:cyclic distribution" + + ```bash #!/bin/bash #SBATCH --nodes=2 #SBATCh --tasks-per-node=16 #SBATCH --cpus-per-task=1 srun --ntasks 32 --cpu_bind=sockets --distribution=block:cyclic ./application + ``` ## Hybrid Strategies ### Default Binding and Distribution Pattern -The default binding pattern of hybrid jobs will split the cores -allocated to a rank between the sockets of a node. The example shows -that Rank 0 has 4 cores at its disposal. Two of them on first socket -inside the first node and two on the second socket inside the first -node. +The default binding pattern of hybrid jobs will split the cores allocated to a rank between the +sockets of a node. The example shows that Rank 0 has 4 cores at its disposal. Two of them on first +socket inside the first node and two on the second socket inside the first node. 
-\<img alt="" -src="data:;base64,
LJJ3ft2nXo0CG9Xu+Eo4Ta3KSv4IbcpDc8auYI/kdy/vz5xx9/vKCgoLm5+cKFCy+88EJ7e/uECRMSEhImTZqUkZHR1NRkMpk2btx4//33e3l5xcbGpqenP/jgg1VVVbIsnzp1ytJqPj4+2dnZQUFBd9xxR2Njo7LmihUrlBVqamr27t2rrBkREVFUVHTF/ens7GxtbW1vb5ckqbW1ta2tTZWHAVfgbr2xZs2a7OzsAwcOaLXa1tZW4f9TqKjcra+YOe7D3XrD42aOa08uOVtDQ8PKlSvj4+P9/PxCQkKmTJny0UcfKTdVVFQsWLBAq9WOGDFi9erVTU1NyvILFy6sXLkyKioqMDAwJSWlqKhItvokfEdHx29/+9uJEydeuHDBbDavWbMmJiYmICDAYDCsXbtW2cLBgwfj4+NDQkIWLlzYbX/eeOMN6wff+gSmG7Ln+aU3+tUb9fX13f4wY2Nj1Xss+s9Vzy99xcxxxn3V4Va94YEzR2PZygBoNBql/IC3AHdmz/NLb4jNVc8vfSU2Zg56Yv/zK/ibWQAAAMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAILztn8TymXZAVv0BpyBvkJP6A30hLM7AABAcBpZll29DwAAAE7E2R0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgODs+lZlvr/SEwzsm5noDU+g/rd20VeegJmDntgzczi7AwAABOeAa2bxvcyisv/VEr0hKte+kqavRMXMQU/s7w3O7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAghM27uTl5c2ZM2fYsGH+/v5jxozZuHFjc3OzCnU7OjrWrFkzbNiwoKCge++99+LFi1dcLSAgQGPFx8enra1Nhd3zWK7qB5PJtGTJEq1WGxIScttttxUVFV1xtczMzLS0NH9//4iICOvly5cvt+6TrKwsFfYZA8PMgTVmjrsRM+785z//mTVr1o033nj06NHq6updu3ZVV1cfP368L/eVZbm9vX3Apbdu3XrgwIFvv/32zJkzZWVlq1atuuJqJpOp8RcLFiyYP3++j4/PgIuidy7sh9WrV5vN5tOnT1dWVo4YMWLx4sVXXE2r1a5bt27Lli22N2VkZFhaZdGiRQPeEzgVMwfWmDnuSLaD/Vtwhs7OTp1Ol5GR0W15V1eXLMvnzp1btGhRaGhoVFTUww8/3NzcrNyamJi4cePGW265JSEhITc3t6GhYdWqVTqdTqvV3n333TU1Ncpqr776ql6vDw4OHjFixHPPPWdbPSws7O2331Z+zs3N9fb2rq+v72Vva2pqfHx8vvjiCzuP2hnseX7dpzdc2w+xsbFvvfWW8nNubq6Xl1dHR0dPu7p79+7w8HDrJffff/8TTzwx0EN3Ilc9v+7TV9aYOY7CzGHm9MQBicW15Z1BSdAFBQVXvHXy5MlLly69ePFiVVXV5MmTH3roIWV5YmLiDTfcUFtbq/w6d+7c+fPn19TUtLS0PPjgg3PmzJFluaioKCAg4Oeff5Zl2Ww2f/fdd902XlVVZV1aOaucl5fXy96+/PLL8fHxdhyuE4kxelzYD7Isb9iwYdasWSaTqaGh4b777luwYEEvu3rF0TNixAidTjd+/PiXXnrp8uXL/X8AnIK4Y42Z4yjMHGZOT4g7V/D5559LklRdXW17U2FhofVNOTk5vr6+nZ2dsiwnJia+/vrryvKSkhKNRmNZraGhQaPRmM3m4uJiPz+/PXv2XLx48YqlT58+LUlSSUmJZYmXl9fHH3/cy94mJCS8/PLL/T9KNYgxelzYD8rK06dPVx6NpKSk8vLyXnbVdvQcOHDgyy+//Pnnn/fu3RsVFWX7etFViDvWmDmOwsxRljNzbNn//Ar42Z3Q0FBJkiorK21vqqio8Pf3V1aQJMlgMLS2ttbW1iq/RkZGKj8YjUaNRjNhwoSYmJiYmJixY8cGBwdXVlYaDIbMzMy///3vERER06ZNO3ToULftBwYGSpLU0NCg/NrY2NjV1RUUFPTPf/7T8skv6/Vzc3ONRuPy5csddeyw5cJ+kGV59uzZBoPhwoULTU1NS5YsueWWW5qbm3vqB1vp6emTJ0+Oi4tbuHDhSy+9tGvXLnseCjgJMwfWmDluyrVpyxmU903/+Mc/dlve1dXVLVnn5ub6+PhYkvW+ffuU5WfOnBk0aJDZbO6pREtLy/PPPz906FDlvVhrYWFhO3fuVH4+ePBg7++j33333ffcc0//Dk9F9jy/7tMbLuyHmpoayeaNhmPHjvW0HdtXWtb27NkzbNiw3g5VRa56ft2nr6wxcxyFmaMsZ+bYckBicW15J/nggw98fX2feeaZ4uLi1tbWU6dOrV69Oi8vr6ura9KkSffdd19jY+P58+enTJny4IMPKnexbjVZlu+4445FixadO3dOluXq6up3331XluWffvopJyentbVVluUdO3aEhYXZjp6NGzcmJiaWlJSYTKapU6cuXbq0p52srq6+7rrr3PMDgwoxRo/s0n7Q6/UrV65saGi4dOnSs88+GxAQcOHCBds97OjouHTpUmZmZnh4+KVLl5RtdnZ2vvXWW0aj0Ww2Hzx4MDY21vI2v8sRd7ph5jgEM8eyBWZON8SdHh05cuSOO+4ICQkZMmTImDFjXnjhBeUD8BUVFQsWLNBqtSNGjFi9enVTU5OyfrdWM5vNa9asiYmJCQgIMBgMa9eulWU5Pz9/4sSJQUFBQ4cOTU1NPXz4sG3dy5cvP/rooyEhIQEBAUuXLm1oaOhpD//617+67QcGFcKMHtl1/XD8+PH09PShQ4cGBQVNnjy5p39p3njjDetzrv7+/rIsd3Z2zp49e/jw4dddd53BYHjqqadaWloc/sgMDHHHFjPHfswcy92ZOd3Y//xqLFsZAOVdQHu2AHdmz/NLb4jNVc8vfSU2Zg56Yv/zK+BHlQEAAKwRdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwXnbv4mrXmEVHovegDPQV+gJvYGecHYHAAAIzq5rZgEAALg/zu4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARn17cq8/2VnmBg38xEb3gC9b+1i77yBMwc9MSemcPZHQAAIDgHXDPLVa/wqKtOXXt42mPlaXVdxdMeZ0+raw9Pe6w8ra49OLsDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABCcw+KO2Wz29vaOiYnR6/V/+MMf+v6f8o1G4+zZs3u69cMPPzQYDDExMZmZmWrWnT9/fkhIyKJFi3pawRl1S0tLZ86cGRUVlZSU9Mknn6hWt6WlJSUlRafT6fX6bdu29XGDfUdv2F9X1N6wh5OeX0mSWlpa9Hr9unXr1Kzr7++v0+l0Ot3ixYvVrHv27NmZM
2eGhYUlJSW1traqU7egoED3C29v77y8vD5us4/oDYfUFa03ZDtYb6G+vj4qKkqW5dbW1gkTJnz88cd93EhpaemsWbOueFN7e7vBYDAajTU1NdHR0Q0NDerUlWU5Nzc3Ozt74cKF1gudXbe4uPjo0aOyLJ86dSo8PLyzs1Oduh0dHefPn5dlua6uLjIyUvm5W93+ojccW1ek3rCHCs+vLMsbN25cvHjx2rVr1ayr1+ttF6pQd/bs2Tt27JBluby8vL29XbW6ipqamhEjRnR0dNjW7S96w+F1hekNhePfzPLx8Zk4ceKZM2ckSWpra5s1a1ZKSsq4ceMOHTokSZLRaExNTX3ooYduvfXWRx991PqOeXl5kydPrqmpsSz5+uuvExIS9Hq9VqudMWNGTk6OOnUlSZoxY0ZgYKDKx2swGCZNmiRJ0vXXXy9JUnNzszp1Bw0aFB4eLklSR0dHQECAn59fXw58AOgNesMZHPv8lpSU/Pjjj3feeafKdV1yvKWlpUajccWKFZIkjRw50tu7t+/Zd8bxvvfee3fdddegQYMG9lBcFb1Bb/x/9mQl6y1YUt7FixfHjh2bm5sry3JnZ2d9fb0sy1VVVWlpabIsl5aWBgcH19TUyLI8bdq0kpISJeXl5eWlpqaaTCbr7b/77ru///3vlZ//9Kc/bd++XZ26is8++6wvr+AdXleW5U8//XTKlClq1m1oaIiOjh40aNAbb7xxxbr9RW84o64sRG/YQ4XjXbhwYWFh4c6dO6/6Ct6xdQMCAgwGw/jx4z/55BPV6n766aczZsyYP3/+TTfdtHnzZjWPVzFz5sycnJwr1u0vesOxdUXqjf+3Bbvu/L+HPWjQIL1ef9111y1btkxZ2NXV9fTTT6elpU2fPj04OFiW5dLS0mnTpim3rly5Mjc3t7S0VK/XjxkzxnKe3KKP/6Q5vK7iqv+kOaluWVlZUlLSTz/9pHJd5V6jRo0qLy+3rdtf9Aa94QzOPt5PPvlk/fr1siz3/k+aMx5no9Eoy3J+fn5kZGRdXZ06dT/++GNfX9/CwsJLly5NnTrV8maEOn1lMpkiIyMt71bI7j1z6A3Vjld2dG8oHPlmVkREhNFoLCsr++qrr3744QdJkvbv319cXHzo0KGDBw/6+voqqw0ePFj5wcvLq6OjQ5KksLAwPz+/EydOdNtgZGTkuXPnlJ8rKysjIyPVqeuq45UkyWw233XXXdu3bx89erSadRUxMTGpqamnTp3q/4NxFfQGveEMDj/eY8eO7dmzJyYm5rHHHnv77befffZZdepKkqTX6yVJGjduXHJy8unTp9WpGxUVlZiYmJiY6Ovre+utt548eVK145Uk6b333ps3b56T3smiN+iNbhz/2Z2IiIgtW7Zs3bpVkqT6+nqDweDt7f3111+bTKae7hIUFPTBBx889thj33zzjfXyiRMnFhUVlZeX19XV5ebm9v6BeQfW7RcH1r18+fKCBQvWr18/a9YsNetWVVUp0aGiouLYsWPJyclXrT4w9Aa94QwOPN7NmzdXVFQYjca//e1vv/vd7zZt2qRO3bq6ugsXLkiSVFRUdOrUqdjYWHXq3nDDDV1dXRUVFZ2dnf/973+TkpLUqavYs2fPkiVLeqloP3qD3rBwyvfuLF68+MSJE4WFhfPmzfv666+XLl36r3/9Kzo6upe7REREZGdnP/DAA0VFRZaF3t7er7322owZM1JSUrZu3RoUFKROXUmSbrvttqVLl+7fv1+n0xUUFKhT9/PPPz98+PDTTz+t/B88o9GoTt26urrZs2dHRUXNmjXrz3/+s/JKwknoDXrDGRz4/Lqk7tmzZ1NTU6Oion7zm9+8/vrroaGh6tTVaDTbtm1LT09PSkq6/vrr586dq05dSZJMJtPp06enTZvWe0X70Rv0hkJjeUtsIHfWaCRJsmcL1BW17rW4z9SlLnWv3brX4j5TV826fKsyAAAQHHEHAAAIjrgDAAAEp17c6eUKR1e9CNGAlfZ8pSEVLgbU09VVrnoBFHv0dJUTZ1+kxh5/+ctf4uPj4+Li1q9fb3tTQkJCQkLCvn377Kxi22b96skBd2m3O/bSk7Yr29OlV9zhnnrSdmWndqk6mDkWzJxumDk9rSzyzLHnS3v6vgXbKxw1NDR0dXUpt17xIkQOqWt7pSFL3Z4uBuSQugrrq6tYH+8VL4DiqLrdrnJiXVfR7UIkjqo74PuePXtWr9e3tLS0t7enpKR88803ln3+7rvvbrrppkuXLtXV1SmT1J663dqsvz3Ze5f2vW4vPWm78lW7tO91FT31pO3KvXep/dNjYJg5vWPm9GVNZo5nzhyVzu7YXuFo7NixlZWVyq19vwhRf9leachS19kXA+p2dRXr43WeUpurnNjWdfZFavorICDA19e3ra1NuQTd8OHDLftcWFiYmprq6+s7bNiwkSNHHj582J5C3dqsvz054C7tdsdeetJ2ZXu61HaHe+lJ5/0Nugozh5nTE2aOZ84cleLOuXPnoqKilJ91Ol1lZWVWVtZVvz/AgT777LO4uLjAwEDruhcvXtTr9ZGRkevXr7/qF7f014YNG55//nnLr9Z16+rqYmNjb7755gMHDji26JkzZ3Q63YIFC8aNG7dly5ZudRUqfLVXv4SEhGRkZERHR0dGRs6bN2/UqFGWfR4zZsyRI0caGxvPnz+fn5/v2Nntnj1py4Fd2ktP2nJel6rDPZ9fZo47YOZ45szp7RqnTqWETXWUl5evXbs2Ozu7W92goKCysjKj0Thz5sw5c+aMHDnSURUPHDgQHR2dmJh49OhRZYl13VOnTun1+oKCgrlz5548eXLYsGGOqtvZ2Xns2LHvv/9er9enp6dPmjTp9ttvt16hurq6sLBw+vTpjqpov/Ly8ldffbWkpMTX1/dXv/rV3LlzLY/VmDFjVq1aNX369IiIiLS0tN4vyWs/d+hJW47q0t570pbzutRV3OH5Zea4A2aOZ84clc7u9PEKR85w1SsNOeNiQL1fXaUvF0AZmKte5cSpF6kZmIKCgptvvlmr1QYEBMycOfOrr76yvvWRRx7Jz8/fv39/fX19XFycA+u6c0/asr9L+3jFHwvndak63Pn5Zea4FjOnL8SbOSrFHdsrHG3evNlsNju7ru2Vhix1nXoxINurq1jq9usCKP1le5WTbo+zu51VliQpPj7+m2++aWpqamtrO3z4cEJCgvU+l5WVSZL04Ycfms3m1NRUB9Z1w5605cAu7aUnbTm1S9Xhhs8vM8dNMHM8dObY8znnfm3hgw8+GDVqVHR09M6dO2VZHjlyZGNjo3JTenq6Vqv18/OLiorKz893YN2PPvpo0KBBUb8oLS211D158mRSUlJkZGRCQsKuXbv6srUBPGI7d+5UPpFuqVtQUBAXFxcZGTl69Oi9e/c6vO4XX3yRlJQUHx+/bt06+X8f5/Pnz0dGRnZ2dvZxU/Z0SL/u+/zzz8fFxcXGxmZkZMj/u88TJ04MCwu7+eabT506ZWdd2zbrV0/23qV9r9tLT9qufNUu7dfxKmx70nblq3ap/dNj
YJg5V8XM6QtmjgfOHPXijrWioqJHH32UuqLWtee+nvZYeVpdO11zx0tdderac19Pe6w8ra4FlwilrlPqXov7TF3qUvfarXst7jN11azLRSQAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAME54GsGITZ7vvILYnPVV41BbMwc9ISvGQQAAOiRXWd3AAAA3B9ndwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACC4/wNeW27o5DoAAAACSURBVCEI/r8gawAAAABJRU5ErkJggg==" -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=4 +!!! example "Binding to sockets and block:block distribution" -export OMP_NUM_THREADS=4 + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=4 + #SBATCH --cpus-per-task=4 -srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application -``` + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application + ``` ### Core Bound @@ -195,36 +236,37 @@ srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application This method allocates the tasks linearly to the cores. -\<img alt="" -src="<data:;base64,iVBORw0KGgoAAAANSUhEUgAAAvoAAADyCAIAAACzsfbGAAAABmJLR0QA/wD/AP+gvaeTAAAgAElEQVR4nO3df1RUdf7H8TuIIgy/lOGXICgoGPZjEUtByX54aK3VVhDStVXyqMlapLSZfsOf7UnUU2buycgs5XBytkzdPadadkWyg9nZTDHN/AEK/uLn6gy/HPl1v3/csxyOMIQMd2b4zPPxF3Pnzr2f952Pb19zZ+aORpZlCQAAQFxOth4AAACAuog7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDhnSx6s0Wj6ahwA+h1Zlq28R3oO4Mgs6Tmc3QEAAIKz6OyOwvqv8ADYlm3PstBzAEdjec/h7A4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB3H9dhjj2k0mu+++659SUBAwMGDB3u+haKiInd3956vn5OTExcXp9VqAwIC7mGgAIRg/Z6zfPnyqKgoNze3kJCQFStWNDU13cNwIRbijkPz8fF57bXXrLY7nU63bNmydevWWW2PAOyKlXtOfX19dnb21atX9Xq9Xq9fu3at1XYNe0PccWgLFy4sKSn54osvOt9VXl6enJzs5+cXHBz80ksvNTY2KsuvXr361FNPeXt733///UePHm1fv7a2Ni0tbfjw4b6+vrNnz66pqem8zaeffjolJWX48OEqlQPAzlm55+zcuTM+Pt7HxycuLu6FF17o+HA4GuKOQ3N3d1+3bt2qVauam5vvuispKWngwIElJSXHjx8/ceJERkaGsjw5OTk4OLiiouKrr7764IMP2tefO3duZWXlyZMnr1y54uXllZqaarUqAPQXNuw5hYWFMTExfVoN+hXZApZvATY0ZcqUN998s7m5ecyYMdu3b5dl2d/f/8CBA7Isnzt3TpKkqqoqZc38/PzBgwe3traeO3dOo9HcvHlTWZ6Tk6PVamVZvnTpkkajaV/faDRqNBqDwdDlfvfu3evv7692dVCVrf7t03P6NVv1HFmW16xZM3LkyJqaGlULhHos/7fvbO14BTvj7OyclZW1aNGiefPmtS+8du2aVqv19fVVboaFhZlMppqammvXrvn4+AwZMkRZPnr0aOWP0tJSjUbz8MMPt2/By8vr+vXrXl5e1qoDQP9g/Z6zYcOG3NzcgoICHx8ftaqC3SPuQHr22WfffvvtrKys9iXBwcENDQ3V1dVK9yktLXVxcdHpdEFBQQaD4c6dOy4uLpIkVVRUKOuHhIRoNJpTp06RbwD8Kmv2nJUrV+7fv//IkSPBwcGqFYR+gM/uQJIkacuWLdu2baurq1NuRkRETJw4MSMjo76+vrKyMjMzc/78+U5OTmPGjImOjt66daskSXfu3Nm2bZuyfnh4eEJCwsKFC8vLyyVJqq6u3rdvX+e9tLa2mkwm5T17k8l0584dK5UHwM5Yp+ekp6fv378/Ly9Pp9OZTCa+iO7IiDuQJEmaMGHCM8880/5VCI1Gs2/fvsbGxpEjR0ZHRz/44IPvvPOOctfnn3+en58/bty4J5544oknnmjfwt69e4cNGxYXF+fh4TFx4sTCwsLOe9m5c6erq+u8efMqKytdXV05sQw4LCv0HIPBsH379osXL4aFhbm6urq6ukZFRVmnOtghTfsngHrzYI1GkiRLtgCgP7LVv316DuCYLP+3z9kdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOGfLN6HRaCzfCAD0ED0HwL3i7A4AABCcRpZlW48BAABARZzdAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQnEWXGeRiX46gd5cqYG44AutfxoJ55QjoOTDHkp7D2R0AACC4PvgRCS5UKCrLXy0xN0Rl21fSzCtR0XNgjuVzg7M7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAAQnftw5e/bs9OnTdTqdm5vbmDFjXn/99V5sZMyYMQcPHuzhyr/5zW/0en
2Xd+Xk5MTFxWm12oCAgF4MA33LrubG8uXLo6Ki3NzcQkJCVqxY0dTU1IvBwB7Y1byi59gVu5objtZzBI87bW1tv/3tb4cNG3b69Omamhq9Xh8WFmbD8eh0umXLlq1bt86GY4DC3uZGfX19dnb21atX9Xq9Xq9fu3atDQeDXrO3eUXPsR/2NjccrufIFrB8C2q7evWqJElnz57tfNeNGzdmzZrl6+sbFBS0dOnShoYGZfmtW7fS0tJCQkI8PDyio6PPnTsny3JkZOSBAweUe6dMmTJv3rympiaj0bhkyZLg4GCdTvfcc89VV1fLsvzSSy8NHDhQp9OFhobOmzevy1Ht3bvX399frZr7jiXPL3Ojd3NDsWbNmvj4+L6vue/Y6vllXtFz1Hisddjn3FA4Qs8R/OzOsGHDIiIilixZ8re//e3KlSsd70pKSho4cGBJScnx48dPnDiRkZGhLJ8zZ05ZWdmxY8cMBsOePXs8PDzaH1JWVjZp0qTJkyfv2bNn4MCBc+fOraysPHny5JUrV7y8vFJTUyVJ2r59e1RU1Pbt20tLS/fs2WPFWnFv7HluFBYWxsTE9H3NUJ89zyvYlj3PDYfoObZNW1ZQWVm5cuXKcePGOTs7jxo1au/evbIsnzt3TpKkqqoqZZ38/PzBgwe3traWlJRIknT9+vW7NhIZGbl69erg4ODs7GxlyaVLlzQaTfsWjEajRqMxGAyyLD/00EPKXszhlZadsMO5IcvymjVrRo4cWVNT04eV9jlbPb/MK3qOGo+1GjucG7LD9Bzx4067urq6t99+28nJ6aeffjp06JBWq22/6/Lly5IkVVZW5ufnu7m5dX5sZGSkv7//hAkTTCaTsuTw4cNOTk6hHXh7e//8888yrcfix1qf/cyN9evXh4WFlZaW9ml9fY+40xP2M6/oOfbGfuaG4/Qcwd/M6sjd3T0jI2Pw4ME//fRTcHBwQ0NDdXW1cldpaamLi4vyBmdjY2N5eXnnh2/bts3X13fGjBmNjY2SJIWEhGg0mlOnTpX+z61bt6KioiRJcnJyoKMqBjuZGytXrszNzT1y5EhoaKgKVcLa7GRewQ7ZydxwqJ4j+D+SioqK11577eTJkw0NDTdv3ty4cWNzc/PDDz8cERExceLEjIyM+vr6ysrKzMzM+fPnOzk5hYeHJyQkLF68uLy8XJblM2fOtE81FxeX/fv3e3p6Tps2ra6uTllz4cKFygrV1dX79u1T1gwICDh//nyX42ltbTWZTM3NzZIkmUymO3fuWOUwoAv2NjfS09P379+fl5en0+lMJpPwXwoVlb3NK3qO/bC3ueFwPce2J5fUZjQaFy1aNHr0aFdXV29v70mTJn355ZfKXdeuXUtMTNTpdIGBgWlpafX19crymzdvLlq0KCgoyMPDY9y4cefPn5c7fBK+paXlj3/84yOPPHLz5k2DwZCenj5ixAh3d/ewsLBXXnlF2cI333wzevRob2/vpKSku8azY8eOjge/4wlMO2TJ88vcuKe5cevWrbv+YYaHh1vvWNw7Wz2/zCt6jhqPtQ67mhsO2HM07VvpBY1Go+y+11uAPbPk+WVuiM1Wzy/zSmz0HJhj+fMr+JtZAAAAxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgnO2fBPKz7IDnTE3oAbmFcxhbsAczu4AAADBaWRZtvUYAAAAVMTZHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Cy6qjLXr3QEvbsyE3PDEVj/ql3MK0dAz4E5lvQczu4AAADB9cFvZnFdZlFZ/mqJuSEq276SZl6Jip4DcyyfG5zdAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAEJ2zcOXr06DPPPDN06FCtVvvAAw9kZmY2NDRYYb8tLS3p6elDhw719PScO3dubW1tl6u5u7trOnBxcblz544VhuewbDUfKisrU1JSdDqdt7f3U089df78+S5Xy8nJiYuL02q1AQEBHZenpqZ2nCd6vd4KY0bv0HPQET3H3ogZd/7xj388+eSTDz300LFjx6qqqnJzc6uqqk6dOtWTx8qy3Nzc3Otdr1+/Pi8v7/jx48XFxWVlZUuWLOlytcrKyrr/SUxMnDlzpouLS693iu7ZcD6kpaUZDIYLFy5cv349MDAwOTm5y9V0Ot2yZcvWrVvX+a6MjIz2qTJr1qxejwSqouegI3qOPZItYPkW1NDa2hocHJyRkXHX8ra2NlmWb9y4MWvWLF9f36CgoKVLlzY0NCj3RkZGZmZmTp48OSIioqCgwGg0LlmyJDg4WKfTPffcc9XV1cpq77zzTmhoqJeXV2Bg4Jtvvtl5735+fh9//LHyd0FBgbOz861bt7oZbXV1tYuLy+HDhy2sWg2WPL/2MzdsOx/Cw8M/+ugj5e+CggInJ6eWlhZzQ927d6+/v3/HJfPnz3/99dd7W7qKbPX82s+86oie01foOfQcc/ogsdh292pQEvTJkye7vDc2NnbOnDm1tbXl5eWxsbEvvviisjwyMvL++++vqalRbv7ud7+bOXNmdXV1Y2Pj4sWLn3nmGVmWz58/7+7ufvHiRVmWDQbDjz/+eNfGy8vLO+5aOat89OjRbka7ZcuW0aNHW1CuisRoPTacD7Isr1ix4sknn6ysrDQajc8//3xiYmI3Q+2y9QQGBgYHB8fExGzatKmpqeneD4AqiDsd0XP6Cj2HnmMOcacLhw4dkiSpqqqq813nzp3reFd+fv7gwYNbW1tlWY6MjPzrX/+qLL906ZJGo2lfzWg0ajQag8FQUlLi6ur62Wef1dbWdrnrCxcuSJJ06dKl9iVOTk5ff/11N6ONiIjYsmXLvVdpDWK0HhvOB2XlKVOmKEfjvvvuu3LlSjdD7dx68vLyvvvuu4sXL+7bty8oKKjz60VbIe50RM/pK/QcZTk9pzPLn18BP7vj6+srSdL169c733Xt2jWtVqusIElSWFiYyWSqqalRbg4bNkz5o7S0VKPRPPzwwyNGjBgxYsSDDz7o5eV1/fr1sLCwnJyc999/PyAg4NFHHz1y5Mhd2/fw8JAkyWg0Kjfr6ura2to8PT13797d/smvjusXFBSUlpampqb2Ve3ozIbzQZblqVOnhoWF3bx5s76+PiUlZfLkyQ0NDebmQ2cJCQmxsbGjRo1KSkratGlTbm6uJYcCKqHnoCN6jp2ybdpSg/K+6auvvnrX8ra2truSdUFBgYuLS3uyPnDggLK8uLh4wIABBoPB3C4aGxvfeuutIUOGKO/FduTn5/fJJ58of3/zzTfdv4/+3HPPzZ49+97KsyJLnl/7mRs2nA/V1dVSpzcavv/+e3Pb6fxKq6PPPvts6NCh3ZVqRbZ6fu1nXnVEz+kr9BxlOT2nsz5ILLbdvUr+/ve/Dx48ePXq1SUlJSaT6cyZM2lpaUePHm1ra5s4ceLzzz9fV1dXU
VExadKkxYsXKw/pONVkWZ42bdqsWbNu3Lghy3JVVdXnn38uy/Ivv/ySn59vMplkWd65c6efn1/n1pOZmRkZGXnp0qXKysr4+Pg5c+aYG2RVVdWgQYPs8wODCjFaj2zT+RAaGrpo0SKj0Xj79u0NGza4u7vfvHmz8whbWlpu376dk5Pj7+9/+/ZtZZutra0fffRRaWmpwWD45ptvwsPD29/mtznizl3oOX2CntO+BXrOXYg7ZhUWFk6bNs3b29vNze2BBx7YuHGj8gH4a9euJSYm6nS6wMDAtLS0+vp6Zf27pprBYEhPTx8xYoS7u3tYWNgrr7wiy/KJEyceeeQRT0/PIUOGTJgw4dtvv+2836amppdfftnb29vd3X3OnDlGo9HcCDdv3my3HxhUCNN6ZNvNh1OnTiUkJAwZMsTT0zM2Ntbc/zQ7duzoeM5Vq9XKstza2jp16lQfH59BgwaFhYWtWrWqsbGxz49M7xB3OqPnWI6e0/5wes5dLH9+Ne1b6QXlXUBLtgB7Zsnzy9wQm62eX+aV2Og5MMfy51fAjyoDAAB0RNwBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAATnbPkmfvUXVuGwmBtQA/MK5jA3YA5ndwAAgOAs+s0sAAAA+8fZHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Cy6qjLXr3QEvbsyE3PDEVj/ql3MK0dAz4E5lvQczu4AAADB9cFvZjnOdZmVVw+OVq8lHO1YOVq9tuJox9nR6rWEox0rR6vXEpzdAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOJvFnccee0yj0Wg0Gg8Pj0ceeSQvL6/XmxozZszBgwe7WaGlpSU9PX3o0KGenp5z586tra3t9b56zZr15uTkxMXFabXagICAXu/Fhqx5rJYvXx4VFeXm5hYSErJixYqmpqZe76vXrFlvZmbmyJEjXVxcfHx8ZsyYUVxc3Ot99TvWPM6KlpaW6OhojUZTUVHR6331mjXrTU1N1XSg1+t7vS+bsPLc+Ne//jVhwoTBgwf7+vquWLGi1/vqNWvW6+7u3nFuuLi43Llzp9e7s4Qtz+783//9X3Nz89WrV6dOnTpz5syamhqVdrR+/fq8vLzjx48XFxeXlZUtWbJEpR11z2r16nS6ZcuWrVu3TqXtW4HVjlV9fX12dvbVq1f1er1er1+7dq1KO+qe1eqdPn16fn5+TU3N8ePHnZyc5s+fr9KO7JPVjrMiKyvLx8dH1V10z5r1ZmRk1P3PrFmz1NuRSqx2rA4fPpyUlLRw4cKysrITJ05Mnz5dpR11z2r1VlZWtk+MxMTEmTNnuri4qLSv7tky7mg0GmdnZ29v71deeeX27du//PKLsvydd96JjIz08PAYMWLEW2+91b7+mDFj1q1b9/jjj99///3jx48/ffr0XRs0GAyPPfbY/Pnzm5ubOy7/8MMPV65cGRYW5ufn95e//OXzzz83GAxqV9eZ1ep9+umnU1JShg8frnZF6rHasdq5c2d8fLyPj09cXNwLL7xw9OhRtUvrktXqnTBhQlhYmIeHR3Bw8LBhw7y9vdUuza5Y7ThLknT27Nndu3dv3LhR1Yq6Z816Bw4c6P4/zs59cEU3K7PasVq9evXSpUsXLVrk7+8/fPjw+Ph4tUvrktXq1Wq1yqwwmUxffvnliy++qHZp5tjFZ3f0er2rq2tkZKRyMzg4+J///Gdtbe2BAwfee++9L774on3NL7/88sCBA2fOnElOTl66dGnHjZSVlU2aNGny5Ml79uwZOHBg+/KKioqqqqro6GjlZkxMTEtLy9mzZ9UvyyxV6xWMNY9VYWFhTEyMSoX0kBXqzcnJCQgI8PDwOH369Keffqp2RfZJ7ePc2tq6YMGCrVu3enh4WKGcX2WdeTV8+PDx48dv3ry5cxjqR1Q9ViaT6fvvv29tbb3vvvuGDBny5JNP/vTTT9apyxyr9djdu3eHhIQ8/vjj6tXyK2QLWLKFKVOmaLVaf39/JfodPny4y9VWrFiRlpam/B0ZGblz507l77Nnz7q6urYvX716dXBwcHZ2ductXLhwQZKkS5cutS9xcnL6+uuvezHmflFvu7179/r7+/dutApL6u1fx0qW5TVr1owcObKmpqZ3Y+5H9TY2Nt64cePbb7+Njo5euHBh78Zsefew/n6teZy3bNmSnJwsy7Lyorm8vLx3Y+4v9ebl5X333XcXL17ct29fUFBQRkZG78YsfM8pLy+XJGnkyJFnzpypr69ftmxZUFBQfX19L8bcL+rtKCIiYsuWLb0bsNwXPceWZ3cWLVpUVFT07bffRkVFffLJJ+3LDx48+Oijj4aEhISGhn744YfV1dXtd+l0OuUPV1fX27dvt7S0KDc//PDDoKCgLj+IoLy6MhqNys26urq2tjZPT0+ViuqGdeoVg5WP1YYNG3JzcwsKCmz1SQtr1uvq6hoYGBgfH79t27Zdu3Y1NjaqU5M9ss5xLi4u3rp16/bt29UspUesNq8SEhJiY2NHjRqVlJS0adOm3Nxc1WpSi3WOlbu7uyRJaWlpY8eO1Wq1GzdurKio+PHHH1UszAwr99iCgoLS0tLU1NS+r6THbBl3lK8OjRs37tNPP9Xr9YWFhZIklZeXp6SkrF27tqysTPlYsdyD3wTZtm2br6/vjBkzOvfugIAAPz+/oqIi5eaJEyecnZ2joqL6vJxfZZ16xWDNY7Vy5crc3NwjR46Ehob2cRk9Zqu5MWDAgAEDBvRBAf2EdY5zYWFhTU3N2LFjdTpdbGysJEljx47dtWuXGhV1zybzatCgQe3/EfYj1jlW7u7uo0aNav/5Jxv+9pyV50Z2dnZiYmJ7YLIJu/jsTnh4+IIFC1avXi1JUl1dnSRJDz74oEajuXHjRg8/W+Di4rJ//35PT89p06YpW+ho8eLFWVlZly9frqqqWr16dXJysm0/oal2va2trSaTSXn73GQy2epbf31C7WOVnp6+f//+vLw8nU5nMpls8kX0jlStt7m5OSsr6/z580aj8YcffsjIyHj22Wdt9S0J21L1OKekpJSUlBQVFRUVFSnf0T106NDs2bNVqKOnVK23ra1t165dZWVlRqPxyJEjq1atSk5OVqMK61C75/zpT3/asWPHhQsXTCZTZmbmsGHDxo8f3+dV9Jza9UqSVF1dfeDAgcWLF/ftyO+VXcQdSZLeeOONY8eOHT58OCIiYu3atZMmTZo0adKSJUsSEhJ6uIWBAwfq9frQ0NCpU6feunWr411r1qxJSEgYN25ceHh4cHDwBx98oEIF90bVenfu3Onq6jpv3rzKykpXV1fbfhXWcuodK4PBsH379osXL4aFhbm6urq6utrktN9d1KtXo9EcO3ZsypQpfn5+KSkp8fHxH3/8sTpF9APqHWc3N7fg//H395ckKTAwUKvVqlJGj6nac/R6fUxMjJ+f34IFC1JSUrZu3apCBdaj6rFatmzZ888//+ijj/r7+584ceKrr75yc3NT
oYh7oGq9kiTt3r07NDTUlh9SliRJkjQ9OVVl9sEO+QP01Kv2Y/sj6hV7v7ZCvdZ5bH9EvffKXs7uAAAAqIS4AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcPYbd1paWtLT04cOHerp6Tl37tza2lpza2ZmZo4cOdLFxcXHx2fGjBnFxcXWHGffamlpiY6O1mg0FRUV5tZxd3fXdODi4tKvLyTYQ5WVlSkpKTqdztvb+6mnnjp//nyXq+Xk5MTFxSkXDO35XXbC3AiXL18eFRXl5uYWEhKyYsWKbq6FaG4LqampHeeMXq9Xq4b+jJ5jbh16Dj3nXrdghz3HfuPO+vXr8/Lyjh8/XlxcrFzN2tya06dPz8/Pr6mpOX78uJOTU7/+JamsrKxfvSpgZWVl3f8kJibOnDnTES6Mm5aWZjAYLly4cP369cDAQHOXbdXpdMuWLVu3bt093WUnzI2wvr4+Ozv76tWrer1er9evXbv2XrcgSVJGRkb7tJk1a1afDlwQ9Bxz6Dn0nHvdgmSHPceS3xe1fAvd8PPz+/jjj5W/CwoKnJ2db9261f1Dmpqa0tLSnn76aZWGpGq9siz//PPP4eHh//nPf6Se/YRydXW1i4uLuR+ztZwl9fb5sQoPD//oo4+UvwsKCpycnFpaWsyt3M2vwVv+Q/Fd6sN6ux/hmjVr4uPj73UL8+fPf/311/tkeAq1/y3YZL/0nF9dn55jbmV6jv33HDs9u1NRUVFVVRUdHa3cjImJaWlpOXv2rLn1c3JyAgICPDw8Tp8+3cOf+bA3ra2tCxYs2Lp1q/IT7j2xe/fukJAQm1+Z2zqSkpL27t1bVVVVW1u7a9eu3//+9w7125btCgsLY2JievHAnJyc4cOHjx8/fvPmzcrvqaEjek5P0HNsPSgbEKbn2GncUX5mzMvLS7np4eHh5OTUzVvpycnJJ0+e/Pe//93Q0PDnP//ZSqPsU1u3bg0JCZk+fXrPH7Jz506b/+ia1bzxxhstLS3+/v5eXl4//vjju+++a+sR2cDatWsvX76cmZl5rw/8wx/+8MUXXxQUFKxateq9995buXKlGsPr1+g5PUHPcTQi9Rw7jTvKqw2j0ajcrKura2tr8/T0lCRp9+7d7Z9+al/f1dU1MDAwPj5+27Ztu3bt6uZn6O1TcXHx1q1bt2/f3vmuLuuVJKmgoKC0tDQ1NdVKQ7QpWZanTp0aFhZ28+bN+vr6lJSUyZMnNzQ0mDs4QtqwYUNubm5BQUH7Jy16Xn5CQkJsbOyoUaOSkpI2bdqUm5ur/nj7GXpOO3qORM+RJEm4nmOncScgIMDPz6+oqEi5eeLECWdnZ+XXqlNTU+96M+8uAwYM6HenHAsLC2tqasaOHavT6WJjYyVJGjt27K5duyTz9WZnZycmJup0OtuM2Lr++9///vDDD+np6UOGDDMTiKgAAAKYSURBVNFqta+++uqVK1fOnDnzq5NBGCtXrszNzT1y5EhoaGj7wt6VP2jQoJaWFhXG2L/Rc+g5HdFzxOs5dhp3JElavHhxVlbW5cuXq6qqVq9enZyc7O3t3Xm15ubmrKys8+fPG43GH374ISMj49lnn+133xpISUkpKSkpKioqKio6ePCgJEmHDh2aPXu2ufWrq6sPHDjgOGeVdTpdaGjo+++/X1tbazKZ3n33XXd394iIiM5rtra2mkwm5X1ik8nU8euy3dxlJ8yNMD09ff/+/Xl5eTqdzmQydfOl0C630NbWtmvXrrKyMqPReOTIkVWrVpn7jomDo+fQc9rRcwTsOZZ8ztnyLXSjqanp5Zdf9vb2dnd3nzNnjtFo7HK15ubmGTNm+Pv7Dxo0aMSIEcuXLze3puVUrbfdL7/8Iv3atyQ2b948evRotUdiSb19fqxOnTqVkJAwZMgQT0/P2NhYc98N2bFjR8fprdVqe3KX5fqk3i5HeOvWrbv+zYaHh9/TFlpbW6dOnerj4zNo0KCwsLBVq1Y1NjZaOFTr/Fuw8n7pOd2sQ8+h5/R8C/bZczSyBWfklHfvLNlC/0K91nlsf0S9Yu/XVqjXOo/tj6j3Xtnvm1kAAAB9grgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwzpZvwhGupd2Ro9VrCUc7Vo5Wr6042nF2tHot4WjHytHqtQRndwAAgOAsuswgAACA/ePsDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAAT3/8Z/zKE559m+AAAAAElFTkSuQmCC>" -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=4 +!!! example "Binding to cores and block:block distribution" -export OMP_NUM_THREADS=4 + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=4 + #SBATCH --cpus-per-task=4 -srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS --cpu_bind=cores --distribution=block:block ./application -``` + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS --cpu_bind=cores --distribution=block:block ./application + ``` #### Distribution: cyclic:block -The cyclic:block distribution will allocate the tasks of your job in -alternation between the first node and the second node while filling the -sockets linearly. +The `cyclic:block` distribution will allocate the tasks of your job in alternation between the first +node and the second node while filling the sockets linearly. 
-\<img alt="" -src="data:;base64,
OHD4eFhalQJWzNQeYVHJCDzA2n6jmCv0gqKipee+21EydONDQ03LhxY8OGDc3NzQ8//HBkZOT48eMzMjLq6+srKyszMzPnzZvn4uISERGRkJCwaNGi8vJyWZZPnz5tmWpubm779u3z8vKaOnVqXV2dsuaCBQuUFaqrq/fu3ausGRgYeO7cubuOp7W11Ww2Nzc3S5JkNptv375tk8OAu3C0uZGenr5v3768vDy9Xm82m4X/UqioHG1e0XMch6PNDafrOfY9uaQ2k8m0cOHCkSNHuru7+/j4TJw48csvv1Tuunr1amJiol6vDwoKSktLq6+vV5bfuHFj4cKFwcHBnp6eY8aMOXfunNzuk/AtLS1//OMfH3nkkRs3bhiNxvT09GHDhul0uvDw8FdeeUXZwjfffDNy5EgfH5+kpKQ7xrN9+/b2B7/9CUwHZM3zy9y4p7lx8+bNO16YERERtjsW985ezy/zip6jxmNtw6HmhhP2HI1lKz2g0WiU3fd4C3Bk1jy/zA2x2ev5ZV6JjZ6Dzlj//Ar+ZhYAAABxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgXK3fhPKz7EBHzA2ogXmFzjA30BnO7gAAAMFpZFm29xgAAABUxNkdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgrLqqMtevdAY9uzITc8MZ2P6qXcwrZ0DPQWes6Tmc3QEAAILrhd/M4rrMorL+ryXmhqjs+5c080pU9Bx0xvq5wdkdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwwsadI0eOPPPMM4MHD9ZqtQ888EBmZmZDQ4MN9tvS0pKenj548GAvL685c+bU1tbedTWdTqdpx83N7fbt2zYYntOy13yorKxMSUnR6/U+Pj5PPfXUuXPn7rpaTk7OhAkTtFptYGBg++Wpqant54nBYLDBmNEz9By0R89xNGLGnX/84x9PPvnkQw89dPTo0aqqqtzc3KqqqpMnT3bnsbIsNzc393jX69aty8vLO3bsWHFxcVlZ2eLFi++6WmVlZd3/JCYmzpgxw83Nrcc7RdfsOB/S0tKMRuP58+evXbsWFBSUnJx819X0ev3SpUvXrl3b8a6MjAzLVJk5c2aPRwJV0XPQHj3HEclWsH4LamhtbQ0JCcnIyLhjeVtbmyzL169fnzlzpp+fX3Bw8JIlSxoaGpR7o6KiMjMzJ02aFBkZWVBQYDKZFi9eHBISotfrn3vuuerqamW1d955JywszNvbOygo6M033+y4d39//48//lj5d0FBgaur682bN7sYbXV1tZub26FDh6ysWg3WPL+OMzfsOx8iIiI++ugj5d8FBQUuLi4tLS2dDXXPnj0BAQHtl8ybN+/111/vaekqstfz6zjzqj16Tm+h59BzOtMLicW+u1eDkqBPnDhx13vj4uJmz55dW1tbXl4eFxf34osvKsujoqLuv//+mpoa5ebvfve7GTNmVFdXNzY2Llq06JlnnpFl+dy5czqd7sKFC7IsG43GH3/88Y6Nl5eXt9+1clb5yJEjXYx28+bNI0eOtKJcFYnReuw4H2RZXr58+ZNPPllZWWkymZ5//vnExMQuhnrX1hMUFBQSEhIbG7tx48ampqZ7PwCqIO60R8/pLfQcek5niDt3cfDgQUmSqqqqOt519uzZ9nfl5+cPHDiwtbVVluWoqKi//vWvyvKLFy9qNBrLaiaTSaPRGI3GkpISd3f3zz77rLa29q67Pn/+vCRJFy9etCxxcXH5+uuvuxhtZGTk5s2b771KWxCj9dhxPigrT548WTka99133+XLl7sYasfWk5eX99133124cGHv3r3BwcEd/160F+JOe/Sc3kLPUZbTczqy/vkV8LM7fn5+kiRdu3at411Xr17VarXKCpIkhYeHm83mmpoa5eaQIUOUf5SWlmo0mocffnjYsGHDhg178MEHvb29r127Fh4enpOT8/777wcGBj766KOHDx++Y/uenp6SJJlMJuVmXV1dW1ubl5fXrl27LJ/8ar9+QUFBaWlpampqb9WOjuw4H2RZnjJlSnh4+I0bN+rr61NSUiZNmtTQ0NDZfOgoISEhLi5uxIgRSUlJGzduzM3NteZQQCX0HLRHz3FQ9k1balDeN3311VfvWN7W1nZHsi4oKHBzc7Mk6/379yvLi4uL+/XrZzQaO9tFY2PjW2+9NWjQIOW92Pb8/f0/+eQT5d/ffPNN1++jP/fcc7Nmzbq38mzImufXceaGHedDdXW11OGNhu+//76z7XT8S6u9zz77bPDgwV2VakP2en4dZ161R8/pLfQcZTk9p6NeSCz23b1K/v73vw8cOHDVqlUlJSVms/n06dNpaWlHjhxpa2sbP378888/X1dXV1FRMXHixEWLFikPaT/VZFmeOnXqzJkzr1+/LstyVVXV559/LsvyL7/8kp+fbzabZVnesWOHv79/x9aTmZkZFRV18eLFysrK+Pj42bNndzbIqqqqAQMGOOYHBhVitB7ZrvMhLCxs4cKFJpPp1q1b69ev1+l0N27c6DjClpaWW7du5eTkBAQE3Lp1S9lma2vrRx99VFpaajQav/nmm4iICMvb/HZH3LkDPadX0HMsW6Dn3IG406nCwsKpU6f6+Ph4eHg88MADGzZsUD4Af/Xq1cTERL1eHxQUlJaWVl9fr6x/x1QzGo3p6enDhg3T6XTh4eGvvPKKLMvHjx9/5JFHvLy8Bg0aNG7cuG+//bbjfpuaml5++WUfHx+dTjd79myTydTZCDdt2uSwHxhUCNN6ZPvNh5MnTyYkJAwaNMjLyysuLq6z/2m2b9/e/pyrVquVZbm1tXXKlCm+vr4DBgwIDw9fuXJlY2Njrx+ZniHudETPsR49x/Jwes4drH9+NZat9IDyLqA1W4Ajs+b5ZW6IzV7PL/NKbPQcdMb651fAjyoDAAC0R9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAATnav0mfvUXVuG0mBtQA/MKnWFuoDOc3QEAAIKz6jezAAAAHB9ndwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgrPqqspcv9IZ9OzKTMwNZ2D7q3Yxr5wBPQedsabncHYHAAAIrhd+M8t5rsus/PXgbPVaw9mOlbPVay/OdpydrV5rONuxcrZ6rcHZHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgrNb3Hnsscc0Go1Go/H09HzkkUfy8vJ6vKlRo0YdOHCgixVaWlrS09MHDx7s5eU1Z86c2traHu+rx2xZ77Jly6Kjoz08PEJDQ5cvX97U1NTjfdmFLY+VoqWlJSYmRqPRVFRU9HhfPWbjev/1r3+NGzdu4MCBfn5+y5cv7/G++hxbHuecnJwJEyZotdrAwMAe78VKtqw3MzNz+PDhb
m5uvr6+06dPLy4u7vG+7MKWxyo1NVXTjsFg6PG+esyW9ep0uvb1urm53b59u8e7s4Y9z+783//9X3Nz85UrV6ZMmTJjxoyamhqVdrRu3bq8vLxjx44VFxeXlZUtXrxYpR11zWb11tfXZ2dnX7lyxWAwGAyGNWvWqLQj9djsWCmysrJ8fX1V3UXXbFbvoUOHkpKSFixYUFZWdvz48WnTpqm0I8dks+Os1+uXLl26du1albbfTTard9q0afn5+TU1NceOHXNxcZk3b55KO1KPLXtORkZG3f/MnDlTvR11wWb1VlZWWopNTEycMWOGm5ubSvvqmj3jjkajcXV19fHxeeWVV27duvXLL78oy995552oqChPT89hw4a99dZblvVHjRq1du3axx9//P777x87duypU6fu2KDRaHzsscfmzZvX3NzcfvmHH364YsWK8PBwf3//v/zlL59//rnRaFS7uo5sVu+OHTvi4+N9fX0nTJjwwgsvHDlyRO3Sep3NjpUkSWfOnNm1a9eGDRtUrahrNqt31apVS5YsWbhwYUBAwNChQ+Pj49UuzaHY7Dg//fTTKSkpQ4cOVbuirtms3nHjxoWHh3t6eoaEhAwZMsTHx0ft0nqdLXtO//79df/j6toLV7/rAZvVq9VqlUrNZvOXX3754osvql1aZxziszsGg8Hd3T0qKkq5GRIS8s9//rO2tnb//v3vvffeF198YVnzyy+/3L9//+nTp5OTk5csWdJ+I2VlZRMnTpw0adLu3bv79+9vWV5RUVFVVRUTE6PcjI2NbWlpOXPmjPpldUrVeu9QWFgYGxurUiE2oPaxam1tnT9//pYtWzw9PW1Qzq9StV6z2fz999+3trbed999gwYNevLJJ3/66Sfb1OVobPkadAQ2qDcnJycwMNDT0/PUqVOffvqp2hWpxzbHaujQoWPHjt20aVPHMGRjNnst7Nq1KzQ09PHHH1evll8hW8GaLUyePFmr1QYEBCjR79ChQ3ddbfny5Wlpacq/o6KiduzYofz7zJkz7u7uluWrVq0KCQnJzs7uuIXz589LknTx4kXLEhcXl6+//roHY+4T9ba3evXq4cOH19TU9GzM1tTbV47V5s2bk5OTZVlW/rgpLy/v2Zj7RL3l5eWSJA0fPvz06dP19fVLly4NDg6ur6/vwZit7x490yeOs8WePXsCAgJ6NlpFH6q3sbHx+vXr3377bUxMzIIFC3o2ZmfoOXl5ed99992FCxf27t0bHByckZHRszH3lXotIiMjN2/e3LMBy73Rc+x5dmfhwoVFRUXffvttdHT0J598Yll+4MCBRx99NDQ0NCws7MMPP6yurrbcpdfrlX+4u7vfunWrpaVFufnhhx8GBwff9Q1j5a92k8mk3Kyrq2tra/Py8lKpqC7Ypl6L9evX5+bmFhQU2PdTKT1jm2NVXFy8ZcuWbdu2qVlKt9imXp1OJ0lSWlra6NGjtVrthg0bKioqfvzxRxULczA2fg3anS3rdXd3DwoKio+P37p1686dOxsbG9WpSS02O1YJCQlxcXEjRoxISkrauHFjbm6uajV1xcavhYKCgtLS0tTU1N6vpNvsGXeUry2MGTPm008/NRgMhYWFkiSVl5enpKSsWbOmrKxM+Vix3I3fBNm6daufn9/06dM7vsYCAwP9/f2LioqUm8ePH3d1dY2Oju71cn6VbepVrFixIjc39/Dhw2FhYb1chk3Y5lgVFhbW1NSMHj1ar9fHxcVJkjR69OidO3eqUVHXbFOvTqcbMWKE5adnnPAXpG35GnQE9qq3X79+/fr164UCbMgux2rAgAGW0GBjNq43Ozs7MTHREpjswiE+uxMRETF//vxVq1ZJklRXVydJ0oMPPqjRaK5fv97N94Dd3Nz27dvn5eU1depUZQvtLVq0KCsr69KlS1VVVatWrUpOTrbvJ+nUrjc9PX3fvn15eXl6vd5sNve5L6K3p+qxSklJKSkpKSoqKioqUr5LefDgwVmzZqlQR3epPTf+9Kc/bd++/fz582azOTMzc8iQIWPHju31Khyf2se5tbXVbDYrH8swm832+uathar1Njc3Z2VlnTt3zmQy/fDDDxkZGc8++6y9vn1jPVWPVVtb286dO8vKykwm0+HDh1euXJmcnKxGFd2n9mtBkqTq6ur9+/cvWrSod0d+rxwi7kiS9MYbbxw9evTQoUORkZFr1qyZOHHixIkTFy9enJCQ0M0t9O/f32AwhIWFTZky5ebNm+3vWr16dUJCwpgxYyIiIkJCQj744AMVKrg36tVrNBq3bdt24cKF8PBwd3d3d3d3u5zK6kXqHSsPD4+Q/wkICJAkKSgoSKvVqlJGt6n6Wli6dOnzzz//6KOPBgQEHD9+/KuvvvLw8FChiD5A1eO8Y8cOd3f3uXPnVlZWuru7O8IbyurVq9Fojh49OnnyZH9//5SUlPj4+I8//lidImxE1blhMBhiY2P9/f3nz5+fkpKyZcsWFSq4N6rWK0nSrl27wsLC7PkhZUmSJEnTnVNVnT7YKX+AnnrVfmxfRL1i79deqNc2j+2LqPdeOcrZHQAAAJUQdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACM5x405LS0t6evrgwYO9vLzmzJlTW1vb2ZqZmZnDhw93c3Pz9fWdPn16cXGxLcfZu1paWmJiYjQaTUVFRWfr6HQ6TTtubm52v4iZDVRWVqakpOj1eh8fn6eeeurcuXN3XS0nJ2fChAnKBUO7f5eD6GyEy5Yti46O9vDwCA0NXb58eRfXjexsC6mpqe3njMFgUKuGvoye09k69Bx6zr1uwQF7juPGnXXr1uXl5R07dqy4uFi5mnVna06bNi0/P7+mpubYsWMuLi4O/is2XcvKyvrVK5JVVlbW/U9iYuKMGTP67gVMuy8tLc1oNJ4/f/7atWtBQUGdXYpUr9cvXbp07dq193SXg+hshPX19dnZ2VeuXDEYDAaDYc2aNfe6BUmSMjIyLNNm5syZvTpwQdBzOkPPoefc6xYkB+w51vy+qPVb6IK/v//HH3+s/LugoMDV1fXmzZtdP6SpqSktLe3pp59WaUiq1ivL8s8//xwREfGf//xH6t5Pc1dXV7u5uXX2Y7bWs6beXj9WERERH330kfLvgoICFxeXlpaWzlbu4peorf+R6rvqxXq7HuHq1avj4+PvdQvz5s17/fXXe2V4CrVfC3bZLz3nV9en53S2Mj3H8XuOg57dqaioqKqqiomJUW7Gxsa2tLScOXOms/VzcnICAwM9PT1PnTrVzZ/5cDStra3z58/fsmWL8hPu3bFr167Q0FC7X5nbNpKSkvbs2VNVVVVbW7tz587f//73fe43CHtFYWFhbGxsDx6Yk5MzdOjQsWPHbtq0SfktJ7RHz+kOeo69B2UHwvQcB407ys+MeXt7Kzc9PT1dXFy6eCs9OTn5xIkT//73vxsaGv785z/baJS9asuWLaGhodOmTev+Q3bs2GH3H12zmTfeeKOlpSUgIMDb2/vHH39899137T0iO1izZs2lS5cyMzPv9YF/+MMfvvji
i4KCgpUrV7733nsrVqxQY3h9Gj2nO+g5zkaknuOgcUf5a8NkMik36+rq2travLy8JEnatWuX5dNPlvXd3d2DgoLi4+O3bt26c+fOLn6G3jEVFxdv2bJl27ZtHe+6a72SJBUUFJSWlqamptpoiHYly/KUKVPCw8Nv3LhRX1+fkpIyadKkhoaGzg6OkNavX5+bm1tQUGD5pEX3y09ISIiLixsxYkRSUtLGjRtzc3PVH28fQ8+xoOdI9BxJkoTrOQ4adwIDA/39/YuKipSbx48fd3V1VX7ZOzU19Y438+7Qr1+/PnfKsbCwsKamZvTo0Xq9Pi4uTpKk0aNH79y5U+q83uzs7MTERL1eb58R29Z///vfH374IWpri2kAAAKdSURBVD09fdCgQVqt9tVXX718+fLp06d/dTIIY8WKFbm5uYcPHw4LC7Ms7Fn5AwYMaGlpUWGMfRs9h57THj1HvJ7joHFHkqRFixZlZWVdunSpqqpq1apVycnJPj4+HVdrbm7Oyso6d+6cyWT64YcfMjIynn322T73rYGUlJSSkpKioqKioqIDBw5IknTw4MFZs2Z1tn51dfX+/fud56yyXq8PCwt7//33a2trzWbzu+++q9PpIiMjO67Z2tpqNpuV94nNZnP7r8t2cZeD6GyE6enp+/bty8vL0+v1ZrO5iy+F3nULbW1tO3fuLCsrM5lMhw8fXrlyZWffMXFy9Bx6jgU9R8CeY83nnK3fQheamppefvllHx8fnU43e/Zsk8l019Wam5unT58eEBAwYMCAYcOGLVu2rLM1radqvRa//PKL9Gvfkti0adPIkSPVHok19fb6sTp58mRCQsKgQYO8vLzi4uI6+27I9u3b209vrVbbnbus1yv13nWEN2/evOM1GxERcU9baG1tnTJliq+v74ABA8LDw1euXNnY2GjlUG3zWrDxfuk5XaxDz6HndH8LjtlzNLIVZ+SUd++s2ULfQr22eWxfRL1i79deqNc2j+2LqPdeOe6bWQAAAL2CuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHCu1m/CGa6l3Z6z1WsNZztWzlavvTjbcXa2eq3hbMfK2eq1Bmd3AACA4Ky6zCAAAIDj4+wOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABPf/Kg/MoRR2M6oAAAAASUVORK5CYII=" -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=4 +!!! example "Binding to cores and cyclic:block distribution" + + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=4 + #SBATCH --cpus-per-task=4 -export OMP_NUM_THREADS=4<br /><br />srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS --cpu_bind=cores --distribution=cyclic:block ./application -``` + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS --cpu_bind=cores --distribution=cyclic:block ./application + ``` diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md index ea3343fe1a5d21a296207fc374aa181e3ccc0855..38d6686d7a655c1c5d7161d6607be9d6f55d8b5c 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md @@ -12,6 +12,15 @@ from the very beginning, you should be familiar with the concept of checkpointin Another motivation is to use checkpoint/restart to split long running jobs into several shorter ones. This might improve the overall job throughput, since shorter jobs can "fill holes" in the job queue. +Here is an extreme example from literature for the waste of large computing resources due to missing +checkpoints: + +!!! cite "Adams, D. The Hitchhikers Guide Through the Galaxy" + + Earth was a supercomputer constructed to find the question to the answer to the Life, the Universe, + and Everything by a race of hyper-intelligent pan-dimensional beings. Unfortunately 10 million years + later, and five minutes before the program had run to completion, the Earth was destroyed by + Vogons. If you wish to do checkpointing, your first step should always be to check if your application already has such capabilities built-in, as that is the most stable and safe way of doing it. @@ -21,7 +30,7 @@ Abaqus, Amber, Gaussian, GROMACS, LAMMPS, NAMD, NWChem, Quantum Espresso, STAR-C In case your program does not natively support checkpointing, there are attempts at creating generic checkpoint/restart solutions that should work application-agnostic. 
One such project which we -recommend is [Distributed MultiThreaded CheckPointing](http://dmtcp.sourceforge.net) (DMTCP). +recommend is [Distributed Multi-Threaded Check-Pointing](http://dmtcp.sourceforge.net) (DMTCP). DMTCP is available on ZIH systems after having loaded the `dmtcp` module @@ -47,8 +56,8 @@ checkpoint/restart bits transparently to your batch script. You just have to spe total runtime of your calculation and the interval in which you wish to do checkpoints. The latter (plus the time it takes to write the checkpoint) will then be the runtime of the individual jobs. This should be targeted at below 24 hours in order to be able to run on all -[haswell64 partitions](../jobs_and_resources/system_taurus.md#run-time-limits). For increased -fault-tolerance, it can be chosen even shorter. +[partitions haswell64](../jobs_and_resources/partitions_and_limits.md#runtime-limits). For +increased fault-tolerance, it can be chosen even shorter. To use it, first add a `dmtcp_launch` before your application call in your batch script. In the case of MPI applications, you have to add the parameters `--ib --rm` and put it between `srun` and your @@ -85,7 +94,7 @@ about 2 days in total. !!! Hints - - If you see your first job running into the timelimit, that probably + - If you see your first job running into the time limit, that probably means the timeout for writing out checkpoint files does not suffice and should be increased. Our tests have shown that it takes approximately 5 minutes to write out the memory content of a fully @@ -95,7 +104,7 @@ about 2 days in total. content is rather incompressible, it might be a good idea to disable the checkpoint file compression by setting: `export DMTCP_GZIP=0` - Note that all jobs the script deems necessary for your chosen - timelimit/interval values are submitted right when first calling the + time limit/interval values are submitted right when first calling the script. If your applications take considerably less time than what you specified, some of the individual jobs will be unnecessary. As soon as one job does not find a checkpoint to resume from, it will @@ -115,7 +124,7 @@ What happens in your work directory? If you wish to restart manually from one of your checkpoints (e.g., if something went wrong in your later jobs or the jobs vanished from the queue for some reason), you have to call `dmtcp_sbatch` -with the `-r, --resume` parameter, specifying a cpkt\_\* directory to resume from. Then it will use +with the `-r, --resume` parameter, specifying a `cpkt_` directory to resume from. Then it will use the same parameters as in the initial run of this job chain. If you wish to adjust the time limit, for instance, because you realized that your original limit was too short, just use the `-t, --time` parameter again on resume. @@ -126,7 +135,7 @@ If for some reason our automatic chain job script is not suitable for your use c just use DMTCP on its own. In the following we will give you step-by-step instructions on how to checkpoint your job manually: -* Load the dmtcp module: `module load dmtcp` +* Load the DMTCP module: `module load dmtcp` * DMTCP usually runs an additional process that manages the creation of checkpoints and such, the so-called `coordinator`. It must be started in your batch script before the actual start of your application. To help you with this process, we @@ -138,9 +147,9 @@ first checkpoint has been created, which can be useful if you wish to implement chaining on your own. 
* In front of your program call, you have to add the wrapper script `dmtcp_launch`. This will create a checkpoint automatically after 40 seconds and then -terminate your application and with it the job. If the job runs into its timelimit (here: 60 +terminate your application and with it the job. If the job runs into its time limit (here: 60 seconds), the time to write out the checkpoint was probably not long enough. If all went well, you -should find cpkt\* files in your work directory together with a script called +should find `cpkt` files in your work directory together with a script called `./dmtcp_restart_script.sh` that can be used to resume from the checkpoint. ???+ example diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md new file mode 100644 index 0000000000000000000000000000000000000000..218bd3d4b186efcd583c3fb6c092b4e0dbad3180 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md @@ -0,0 +1,127 @@ +# ZIH Systems + +ZIH systems comprises the *High Performance Computing and Storage Complex* and its +extension *High Performance Computing – Data Analytics*. In total it offers scientists +about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating point +operations per second. The architecture specifically tailored to data-intensive computing, Big Data +analytics, and artificial intelligence methods with extensive capabilities for energy measurement +and performance monitoring provides ideal conditions to achieve the ambitious research goals of the +users and the ZIH. + +## Login Nodes + +- Login-Nodes (`tauruslogin[3-6].hrsk.tu-dresden.de`) + - each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 each with 12 cores + @ 2.50GHz, Multithreading Disabled, 64 GB RAM, 128 GB SSD local disk + - IPs: 141.30.73.\[102-105\] +- Transfer-Nodes (`taurusexport3/4.hrsk.tu-dresden.de`, DNS Alias + `taurusexport.hrsk.tu-dresden.de`) + - 2 Servers without interactive login, only available via file transfer protocols (`rsync`, `ftp`) + - IPs: 141.30.73.82/83 +- Direct access to these nodes is granted via IP whitelisting (contact + hpcsupport@zih.tu-dresden.de) - otherwise use TU Dresden VPN. 
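+
+The login and transfer nodes listed above are reached with standard SSH tooling. As a minimal
+sketch (the username `marie`, the local directory and the remote target path are placeholders
+and need to be adapted), a typical session could look like this:
+
+```bash
+# interactive login to one of the login nodes listed above
+ssh marie@tauruslogin3.hrsk.tu-dresden.de
+
+# copy input data to the ZIH systems through the transfer nodes (rsync over SSH)
+rsync -avP ./input_data/ marie@taurusexport.hrsk.tu-dresden.de:/path/to/target/
+```
+
+Bulk data transfers are best routed through the transfer nodes rather than the login nodes.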
+ +## AMD Rome CPUs + NVIDIA A100 + +- 32 nodes, each with + - 8 x NVIDIA A100-SXM4 + - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading disabled + - 1 TB RAM + - 3.5 TB local memory at NVMe device at `/tmp` +- Hostnames: `taurusi[8001-8034]` +- Slurm partition `alpha` +- Dedicated mostly for ScaDS-AI + +## Island 7 - AMD Rome CPUs + +- 192 nodes, each with + - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, Multithreading + enabled, + - 512 GB RAM + - 200 GB /tmp on local SSD local disk +- Hostnames: `taurusi[7001-7192]` +- Slurm partition `romeo` +- More information under [Rome Nodes](rome_nodes.md) + +## Large SMP System HPE Superdome Flex + +- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) +- 47 TB RAM +- Currently configured as one single node + - Hostname: `taurussmp8` +- Slurm partition `julia` +- More information under [HPE SD Flex](sd_flex.md) + +## IBM Power9 Nodes for Machine Learning + +For machine learning, we have 32 IBM AC922 nodes installed with this configuration: + +- 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores) +- 256 GB RAM DDR4 2666MHz +- 6x NVIDIA VOLTA V100 with 32GB HBM2 +- NVLINK bandwidth 150 GB/s between GPUs and host +- Slurm partition `ml` +- Hostnames: `taurusml[1-32]` + +## Island 4 to 6 - Intel Haswell CPUs + +- 1456 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) + @ 2.50GHz, Multithreading disabled, 128 GB SSD local disk +- Hostname: `taurusi4[001-232]`, `taurusi5[001-612]`, + `taurusi6[001-612]` +- Varying amounts of main memory (selected automatically by the batch + system for you according to your job requirements) + - 1328 nodes with 2.67 GB RAM per core (64 GB total): + `taurusi[4001-4104,5001-5612,6001-6612]` + - 84 nodes with 5.34 GB RAM per core (128 GB total): + `taurusi[4105-4188]` + - 44 nodes with 10.67 GB RAM per core (256 GB total): + `taurusi[4189-4232]` +- Slurm Partition `haswell` + +??? hint "Node topology" + +  + {: align=center} + +### Extension of Island 4 with Broadwell CPUs + +* 32 nodes, each witch 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz + (**14 cores**), Multithreading disabled, 64 GB RAM, 256 GB SSD local disk +* from the users' perspective: Broadwell is like Haswell +* Hostname: `taurusi[4233-4264]` +* Slurm partition `broadwell` + +## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs + +* 64 nodes, each with 2x Intel(R) Xeon(R) CPU E5-E5-2680 v3 (12 cores) + @ 2.50GHz, Multithreading Disabled, 64 GB RAM (2.67 GB per core), + 128 GB SSD local disk, 4x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs +* Hostname: `taurusi2[045-108]` +* Slurm Partition `gpu` +* Node topology, same as [island 4 - 6](#island-4-to-6-intel-haswell-cpus) + +## SMP Nodes - up to 2 TB RAM + +- 5 Nodes each with 4x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ + 2.20GHz, Multithreading Disabled, 2 TB RAM + - Hostname: `taurussmp[3-7]` + - Slurm partition `smp2` + +??? hint "Node topology" + +  + {: align=center} + +## Island 2 Phase 1 - Intel Sandybridge CPUs + NVIDIA K20x GPUs + +- 44 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2450 (8 cores) @ + 2.10GHz, Multithreading Disabled, 48 GB RAM (3 GB per core), 128 GB + SSD local disk, 2x NVIDIA Tesla K20x (6 GB GDDR RAM) GPUs +- Hostname: `taurusi2[001-044]` +- Slurm partition `gpu1` + +??? 
hint "Node topology" + +  + {: align=center} diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_taurus.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_taurus.md deleted file mode 100644 index ff28e9b69d95496f299b80b45179f3787ad996cb..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_taurus.md +++ /dev/null @@ -1,110 +0,0 @@ -# Central Components - -- Login-Nodes (`tauruslogin[3-6].hrsk.tu-dresden.de`) - - each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 each with 12 cores - @ 2.50GHz, MultiThreading Disabled, 64 GB RAM, 128 GB SSD local - disk - - IPs: 141.30.73.\[102-105\] -- Transfer-Nodes (`taurusexport3/4.hrsk.tu-dresden.de`, DNS Alias - `taurusexport.hrsk.tu-dresden.de`) - - 2 Servers without interactive login, only available via file - transfer protocols (rsync, ftp) - - IPs: 141.30.73.82/83 -- Direct access to these nodes is granted via IP whitelisting (contact - <hpcsupport@zih.tu-dresden.de>) - otherwise use TU Dresden VPN. - -## AMD Rome CPUs + NVIDIA A100 - -- 32 nodes, each with - - 8 x NVIDIA A100-SXM4 - - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, MultiThreading - disabled - - 1 TB RAM - - 3.5 TB /tmp local NVMe device -- Hostnames: taurusi\[8001-8034\] -- SLURM partition `alpha` -- dedicated mostly for ScaDS-AI - -## Island 7 - AMD Rome CPUs - -- 192 nodes, each with - - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, MultiThreading - enabled, - - 512 GB RAM - - 200 GB /tmp on local SSD local disk -- Hostnames: taurusi\[7001-7192\] -- SLURM partition `romeo` -- more information under [RomeNodes](rome_nodes.md) - -## Large SMP System HPE Superdome Flex - -- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) -- 47 TB RAM -- currently configured as one single node - - Hostname: taurussmp8 -- SLURM partition `julia` -- more information under [HPE SD Flex](sd_flex.md) - -## IBM Power9 Nodes for Machine Learning - -For machine learning, we have 32 IBM AC922 nodes installed with this -configuration: - -- 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores) -- 256 GB RAM DDR4 2666MHz -- 6x NVIDIA VOLTA V100 with 32GB HBM2 -- NVLINK bandwidth 150 GB/s between GPUs and host -- SLURM partition `ml` -- Hostnames: taurusml\[1-32\] - -## Island 4 to 6 - Intel Haswell CPUs - -- 1456 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) - @ 2.50GHz, MultiThreading disabled, 128 GB SSD local disk -- Hostname: taurusi4\[001-232\], taurusi5\[001-612\], - taurusi6\[001-612\] -- varying amounts of main memory (selected automatically by the batch - system for you according to your job requirements) - - 1328 nodes with 2.67 GB RAM per core (64 GB total): - taurusi\[4001-4104,5001-5612,6001-6612\] - - 84 nodes with 5.34 GB RAM per core (128 GB total): - taurusi\[4105-4188\] - - 44 nodes with 10.67 GB RAM per core (256 GB total): - taurusi\[4189-4232\] -- SLURM Partition `haswell` -- [Node topology] **todo** %ATTACHURL%/i4000.png - -### Extension of Island 4 with Broadwell CPUs - -- 32 nodes, eachs witch 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz - (**14 cores**) , MultiThreading disabled, 64 GB RAM, 256 GB SSD - local disk -- from the users' perspective: Broadwell is like Haswell -- Hostname: taurusi\[4233-4264\] -- SLURM partition `broadwell` - -## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs - -- 64 nodes, each with 2x Intel(R) Xeon(R) CPU E5-E5-2680 v3 (12 cores) - @ 2.50GHz, MultiThreading Disabled, 64 GB RAM (2.67 GB per core), - 128 GB SSD local disk, 4x NVIDIA Tesla K80 (12 GB GDDR 
RAM) GPUs -- Hostname: taurusi2\[045-108\] -- SLURM Partition `gpu` -- [Node topology] **todo %ATTACHURL%/i4000.png** (without GPUs) - -## SMP Nodes - up to 2 TB RAM - -- 5 Nodes each with 4x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ - 2.20GHz, MultiThreading Disabled, 2 TB RAM - - Hostname: `taurussmp[3-7]` - - SLURM Partition `smp2` - - [Node topology] **todo** %ATTACHURL%/smp2.png - -## Island 2 Phase 1 - Intel Sandybridge CPUs + NVIDIA K20x GPUs - -- 44 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2450 (8 cores) @ - 2.10GHz, MultiThreading Disabled, 48 GB RAM (3 GB per core), 128 GB - SSD local disk, 2x NVIDIA Tesla K20x (6 GB GDDR RAM) GPUs -- Hostname: `taurusi2[001-044]` -- SLURM Partition `gpu1` -- [Node topology] **todo** %ATTACHURL%/i2000.png (without GPUs) diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md deleted file mode 100644 index d7bdec9afe83de27488e712b07e5fd5bdbcfcd17..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md +++ /dev/null @@ -1,67 +0,0 @@ -# HPC for Data Analytics - -With the HPC-DA system, the TU Dresden provides infrastructure for High-Performance Computing and -Data Analytics (HPC-DA) for German researchers for computing projects with focus in one of the -following areas: - -- machine learning scenarios for large systems -- evaluation of various hardware settings for large machine learning - problems, including accelerator and compute node configuration and - memory technologies -- processing of large amounts of data on highly parallel machine - learning infrastructure. - -Currently we offer 25 Mio core hours compute time per year for external computing projects. -Computing projects have a duration of up to one year with the possibility of extensions, thus -enabling projects to continue seamlessly. Applications for regular projects on HPC-DA can be -submitted at any time via the -[online web-based submission](https://tu-dresden.de/zih/hochleistungsrechnen/zugang/hpc-da) -and review system. The reviews of the applications are carried out by experts in their respective -scientific fields. Applications are evaluated only according to their scientific excellence. - -ZIH provides a portfolio of preinstalled applications and offers support for software -installation/configuration of project-specific applications. In particular, we provide consulting -services for all our users, and advise researchers on using the resources in an efficient way. 
- -\<img align="right" alt="HPC-DA Overview" -src="%ATTACHURL%/bandwidth.png" title="bandwidth.png" width="250" /> - -## Access - -- Application for access using this - [Online Web Form](https://tu-dresden.de/zih/hochleistungsrechnen/zugang/hpc-da) - -## Hardware Overview - -- [Nodes for machine learning (Power9)](../jobs_and_resources/power9.md) -- [NVMe Storage](../jobs_and_resources/nvme_storage.md) (2 PB) -- [Warm archive](../data_lifecycle/file_systems.md#warm-archive) (10 PB) -- HPC nodes (x86) for DA (island 6) -- Compute nodes with high memory bandwidth: - [AMD Rome Nodes](../jobs_and_resources/rome_nodes.md) (island 7) - -Additional hardware: - -- [Multi-GPU-Cluster](../jobs_and_resources/alpha_centauri.md) for projects of SCADS.AI - -## File Systems and Object Storage - -- Lustre -- BeeGFS -- Quobyte -- S3 - -## HOWTOS - -- [Get started with HPC-DA](../software/get_started_with_hpcda.md) -- [IBM Power AI](../software/power_ai.md) -- [Work with Singularity Containers on Power9]**todo** Cloud -- [TensorFlow on HPC-DA (native)](../software/tensorflow.md) -- [Tensorflow on Jupyter notebook](../software/tensorflow_on_jupyter_notebook.md) -- Create and run your own TensorFlow container for HPC-DA (Power9) (todo: no link at all in old compendium) -- [TensorFlow on x86](../software/deep_learning.md) -- [PyTorch on HPC-DA (Power9)](../software/pytorch.md) -- [Python on HPC-DA (Power9)](../software/python.md) -- [JupyterHub](../access/jupyterhub.md) -- [R on HPC-DA (Power9)](../software/data_analytics_with_r.md) -- [Big Data frameworks: Apache Spark, Apache Flink, Apache Hadoop](../software/big_data_frameworks.md) diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/index.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/index.md deleted file mode 100644 index 911449758f01a2fce79f5179b5d81f51c79abe84..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/index.md +++ /dev/null @@ -1,65 +0,0 @@ -# Batch System - -Applications on an HPC system can not be run on the login node. They have to be submitted to compute -nodes with dedicated resources for user jobs. Normally a job can be submitted with these data: - -* number of CPU cores, -* requested CPU cores have to belong on one node (OpenMP programs) or can distributed (MPI), -* memory per process, -* maximum wall clock time (after reaching this limit the process is killed automatically), -* files for redirection of output and error messages, -* executable and command line parameters. - -*Comment:* Please keep in mind that for a large runtime a computation may not reach its end. Try to -create shorter runs (4...8 hours) and use checkpointing. Here is an extreme example from literature -for the waste of large computing resources due to missing checkpoints: - ->Earth was a supercomputer constructed to find the question to the answer to the Life, the Universe, ->and Everything by a race of hyper-intelligent pan-dimensional beings. Unfortunately 10 million years ->later, and five minutes before the program had run to completion, the Earth was destroyed by ->Vogons. - -(Adams, D. The Hitchhikers Guide Through the Galaxy) - -## Slurm - -The HRSK-II systems are operated with the batch system [Slurm](https://slurm.schedmd.com). Just -specify the resources you need in terms of cores, memory, and time and your job will be placed on -the system. 
- -### Job Submission - -Job submission can be done with the command: `srun [options] <command>` - -However, using `srun` directly on the shell will be blocking and launch an interactive job. Apart -from short test runs, it is recommended to launch your jobs into the background by using batch jobs. -For that, you can conveniently put the parameters directly in a job file which you can submit using -`sbatch [options] <job file>` - -Some options of srun/sbatch are: - -| Slurm Option | Description | -|------------|-------| -| `-n <N>` or `--ntasks <N>` | set a number of tasks to N(default=1). This determines how many processes will be spawned by srun (for MPI jobs). | -| `-N <N>` or `--nodes <N>` | set number of nodes that will be part of a job, on each node there will be --ntasks-per-node processes started, if the option --ntasks-per-node is not given, 1 process per node will be started | -| `--ntasks-per-node <N>` | how many tasks per allocated node to start, as stated in the line before | -| `-c <N>` or `--cpus-per-task <N>` | this option is needed for multithreaded (e.g. OpenMP) jobs, it tells SLURM to allocate N cores per task allocated; typically N should be equal to the number of threads you program spawns, e.g. it should be set to the same number as OMP_NUM_THREADS | -| `-p <name>` or `--partition <name>`| select the type of nodes where you want to execute your job, on Taurus we currently have haswell, smp, sandy, west, ml and gpu available | -| `--mem-per-cpu <name>` | specify the memory need per allocated CPU in MB | -| `--time <HH:MM:SS>` | specify the maximum runtime of your job, if you just put a single number in, it will be interpreted as minutes | -| `--mail-user <your email>` | tell the batch system your email address to get updates about the status of the jobs | -| `--mail-type ALL` | specify for what type of events you want to get a mail; valid options beside ALL are: BEGIN, END, FAIL, REQUEUE | -| `-J <name> or --job-name <name>` | give your job a name which is shown in the queue, the name will also be included in job emails (but cut after 24 chars within emails) | -| `--exclusive` | tell SLURM that only your job is allowed on the nodes allocated to this job; please be aware that you will be charged for all CPUs/cores on the node | -| `-A <project>` | Charge resources used by this job to the specified project, useful if a user belongs to multiple projects. | -| `-o <filename>` or `--output <filename>` | specify a file name that will be used to store all normal output (stdout), you can use %j (job id) and %N (name of first node) to automatically adopt the file name to the job, per default stdout goes to "slurm-%j.out" | - -<!--NOTE: the target path of this parameter must be writeable on the compute nodes, i.e. it may not point to a read-only mounted file system like /projects.--> -<!---e <filename> or --error <filename>--> - -<!--specify a file name that will be used to store all error output (stderr), you can use %j (job id) and %N (name of first node) to automatically adopt the file name to the job, per default stderr goes to "slurm-%j.out" as well--> - -<!--NOTE: the target path of this parameter must be writeable on the compute nodes, i.e. it may not point to a read-only mounted file system like /projects.--> -<!---a or --array submit an array job, see the extra section below--> -<!---w <node1>,<node2>,... restrict job to run on specific nodes only--> -<!---x <node1>,<node2>,... 
exclude specific nodes from job--> diff --git a/Compendium_attachments/Slurm/hdfview_memory.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hdfview_memory.png similarity index 100% rename from Compendium_attachments/Slurm/hdfview_memory.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hdfview_memory.png diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid.png new file mode 100644 index 0000000000000000000000000000000000000000..116e03dd0785492be3f896cda69959a025f5ac49 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_block_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_block_block.png new file mode 100644 index 0000000000000000000000000000000000000000..4c196df91b2fe410609a8e76505eca95f283ce29 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_block_block.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_cyclic_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_cyclic_block.png new file mode 100644 index 0000000000000000000000000000000000000000..dfccaf451553c710fcddd648ae9721866668f9e8 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_cyclic_block.png differ diff --git a/Compendium_attachments/HardwareTaurus/i2000.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/i2000.png similarity index 100% rename from Compendium_attachments/HardwareTaurus/i2000.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/i2000.png diff --git a/Compendium_attachments/HardwareTaurus/i4000.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/i4000.png similarity index 100% rename from Compendium_attachments/HardwareTaurus/i4000.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/i4000.png diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi.png new file mode 100644 index 0000000000000000000000000000000000000000..82087209059e535401724c493fff74d743da58e4 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_block_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_block_block.png new file mode 100644 index 0000000000000000000000000000000000000000..0c6e9bbfa0e7f0614ede7e89f292e2d5f1a74316 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_block_block.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_block.png new file mode 100644 index 0000000000000000000000000000000000000000..dab17e83ed4930b253818e15bc42ef1b1b2c9918 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_block.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_cyclic.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_cyclic.png new file mode 100644 index 0000000000000000000000000000000000000000..8b9361dd1f0a2b76b063ad64652844c425aacbdf Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_cyclic.png differ diff --git 
a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_default.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_default.png new file mode 100644 index 0000000000000000000000000000000000000000..82087209059e535401724c493fff74d743da58e4 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_default.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_block.png new file mode 100644 index 0000000000000000000000000000000000000000..be12c78d1a85297cd60161a1808462941def94fb Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_block.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_cyclic.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_cyclic.png new file mode 100644 index 0000000000000000000000000000000000000000..08f2a90100ed88175f7ef6fa3d867a70ad0880d7 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_cyclic.png differ diff --git a/Compendium_attachments/NvmeStorage/nvme.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/nvme.png similarity index 100% rename from Compendium_attachments/NvmeStorage/nvme.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/nvme.png diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/openmp.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/openmp.png new file mode 100644 index 0000000000000000000000000000000000000000..0cf284368f10bdd8c4a3b4c97530151e0142aad6 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/openmp.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/part.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/part.png new file mode 100644 index 0000000000000000000000000000000000000000..e2b5418f622d3fa32ba2c6ce44889e84e4d1cddd Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/part.png differ diff --git a/Compendium_attachments/HardwareTaurus/smp2.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/smp2.png similarity index 100% rename from Compendium_attachments/HardwareTaurus/smp2.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/smp2.png diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/nvme_storage.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/nvme_storage.md index 40a0d6af3e6f62fe69a76fc01e806b63fa8dc9df..78b8175ccbba3fb0eee8be7b946ebe2bee31219b 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/nvme_storage.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/nvme_storage.md @@ -1,6 +1,5 @@ # NVMe Storage -**TODO image nvme.png** 90 NVMe storage nodes, each with - 8x Intel NVMe Datacenter SSD P4610, 3.2 TB @@ -11,3 +10,6 @@ - 64 GB RAM NVMe cards can saturate the HCAs + + +{: align=center} diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md index 67cc21a6cd4a4c68cbaec377151106bf63428b75..5240db14cb506d8719b9e46fe3feb89aede4a95f 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md @@ -1,57 +1,53 @@ # HPC Resources and Jobs -When log in to ZIH systems, you are placed on a *login node* **TODO** link to login nodes section -where you can [manage data life cycle](../data_lifecycle/overview.md), -[setup 
experiments](../data_lifecycle/experiments.md), execute short tests and compile moderate -projects. The login nodes cannot be used for real experiments and computations. Long and extensive -computational work and experiments have to be encapsulated into so called **jobs** and scheduled to -the compute nodes. - -<!--Login nodes which are using for login can not be used for your computations.--> -<!--To run software, do calculations and experiments, or compile your code compute nodes have to be used.--> - -ZIH uses the batch system Slurm for resource management and job scheduling. -<!--[HPC Introduction]**todo link** is a good resource to get started with it.--> - -??? note "Batch Job" - - In order to allow the batch scheduler an efficient job placement it needs these - specifications: - - * **requirements:** cores, memory per core, (nodes), additional resources (GPU), - * maximum run-time, - * HPC project (normally use primary group which gives id), - * who gets an email on which occasion, - - The runtime environment (see [here](../software/overview.md)) as well as the executable and - certain command-line arguments have to be specified to run the computational work. - -??? note "Batch System" - - The batch system is the central organ of every HPC system users interact with its compute - resources. The batch system finds an adequate compute system (partition/island) for your compute - jobs. It organizes the queueing and messaging, if all resources are in use. If resources are - available for your job, the batch system allocates and connects to these resources, transfers - run-time environment, and starts the job. +ZIH operates a high performance computing (HPC) system with more than 60.000 cores, 720 GPUs, and a +flexible storage hierarchy with about 16 PB total capacity. The HPC system provides an optimal +research environment especially in the area of data analytics and machine learning as well as for +processing extremely large data sets. Moreover it is also a perfect platform for highly scalable, +data-intensive and compute-intensive applications. + +With shared [login nodes](#login-nodes) and [filesystems](../data_lifecycle/file_systems.md) our +HPC system enables users to easily switch between [the components](hardware_overview.md), each +specialized for different application scenarios. + +When log in to ZIH systems, you are placed on a login node where you can +[manage data life cycle](../data_lifecycle/overview.md), +[setup experiments](../data_lifecycle/experiments.md), +execute short tests and compile moderate projects. The login nodes cannot be used for real +experiments and computations. Long and extensive computational work and experiments have to be +encapsulated into so called **jobs** and scheduled to the compute nodes. Follow the page [Slurm](slurm.md) for comprehensive documentation using the batch system at ZIH systems. There is also a page with extensive set of [Slurm examples](slurm_examples.md). ## Selection of Suitable Hardware -### What do I need a CPU or GPU? +### What do I need, a CPU or GPU? + +If an application is designed to run on GPUs this is normally announced unmistakable since the +efforts of adapting an existing software to make use of a GPU can be overwhelming. +And even if the software was listed in [NVIDIA's list of GPU-Accelerated Applications](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/gpu-applications-catalog.pdf) +only certain parts of the computations may run on the GPU. 
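+
+One pragmatic way to answer this, following the comparison described in the next paragraph, is to
+time the very same small test case once on a CPU partition and once on a GPU partition. The
+following is only a sketch: `./my_benchmark.sh` stands for your own test case, and the partition
+names as well as the resource requests are examples that have to be adapted (see the pages on the
+batch system and on partitions and limits):
+
+```console
+marie@login$ # same test case on a plain CPU node ...
+marie@login$ sbatch --partition=haswell --ntasks=1 --cpus-per-task=4 --time=00:30:00 --wrap="./my_benchmark.sh"
+marie@login$ # ... and on a GPU node, with one GPU requested
+marie@login$ sbatch --partition=gpu2 --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --time=00:30:00 --wrap="./my_benchmark.sh"
+marie@login$ # after both jobs have completed, compare the elapsed times (job ids are placeholders)
+marie@login$ sacct --jobs=<jobid1>,<jobid2> --format=JobID,JobName,Partition,Elapsed
+```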
+ +To answer the question: The easiest way is to compare a typical computation +on a normal node and on a GPU node. (Make sure to eliminate the influence of different +CPU types and different number of cores.) If the execution time with GPU is better +by a significant factor then this might be the obvious choice. + +??? note "Difference in Architecture" -The main difference between CPU and GPU architecture is that a CPU is designed to handle a wide -range of tasks quickly, but are limited in the concurrency of tasks that can be running. While GPUs -can process data much faster than a CPU due to massive parallelism (but the amount of data which -a single GPU's core can handle is small), GPUs are not as versatile as CPUs. + The main difference between CPU and GPU architecture is that a CPU is designed to handle a wide + range of tasks quickly, but are limited in the concurrency of tasks that can be running. + While GPUs can process data much faster than a CPU due to massive parallelism + (but the amount of data which + a single GPU's core can handle is small), GPUs are not as versatile as CPUs. ### Available Hardware ZIH provides a broad variety of compute resources ranging from normal server CPUs of different -manufactures, to large shared memory nodes, GPU-assisted nodes up to highly specialized resources for +manufactures, large shared memory nodes, GPU-assisted nodes up to highly specialized resources for [Machine Learning](../software/machine_learning.md) and AI. -The page [Hardware Taurus](hardware_taurus.md) holds a comprehensive overview. +The page [ZIH Systems](hardware_overview.md) holds a comprehensive overview. The desired hardware can be specified by the partition `-p, --partition` flag in Slurm. The majority of the basic tasks can be executed on the conventional nodes like a Haswell. Slurm will @@ -60,19 +56,19 @@ automatically select a suitable partition depending on your memory and GPU requi ### Parallel Jobs **MPI jobs:** For MPI jobs typically allocates one core per task. Several nodes could be allocated -if it is necessary. Slurm will automatically find suitable hardware. Normal compute nodes are -perfect for this task. +if it is necessary. The batch system [Slurm](slurm.md) will automatically find suitable hardware. +Normal compute nodes are perfect for this task. **OpenMP jobs:** SMP-parallel applications can only run **within a node**, so it is necessary to -include the options `-N 1` and `-n 1`. Using `--cpus-per-task N` Slurm will start one task and you -will have N CPUs. The maximum number of processors for an SMP-parallel program is 896 on Taurus -([SMP]**todo link** island). +include the [batch system](slurm.md) options `-N 1` and `-n 1`. Using `--cpus-per-task N` Slurm will +start one task and you will have `N` CPUs. The maximum number of processors for an SMP-parallel +program is 896 on partition `julia`, see [partitions](partitions_and_limits.md). -**GPUs** partitions are best suited for **repetitive** and **highly-parallel** computing tasks. If -you have a task with potential [data parallelism]**todo link** most likely that you need the GPUs. -Beyond video rendering, GPUs excel in tasks such as machine learning, financial simulations and risk -modeling. Use the gpu2 and ml partition only if you need GPUs! Otherwise using the x86 partitions -(e.g Haswell) most likely would be more beneficial. +Partitions with GPUs are best suited for **repetitive** and **highly-parallel** computing tasks. 
If +you have a task with potential [data parallelism](../software/gpu_programming.md) most likely that +you need the GPUs. Beyond video rendering, GPUs excel in tasks such as machine learning, financial +simulations and risk modeling. Use the partitions `gpu2` and `ml` only if you need GPUs! Otherwise +using the x86-based partitions most likely would be more beneficial. **Interactive jobs:** Slurm can forward your X11 credentials to the first node (or even all) for a job with the `--x11` option. To use an interactive job you have to specify `-X` flag for the ssh login. @@ -91,5 +87,31 @@ projects. The quality of this work influence on the computations. However, pre- in many cases can be done completely or partially on a local system and then transferred to ZIH systems. Please use ZIH systems primarily for the computation-intensive tasks. -<!--Useful links: [Batch Systems]**todo link**, [Hardware Taurus]**todo link**, [HPC-DA]**todo link**,--> -<!--[Slurm]**todo link**--> +## Exclusive Reservation of Hardware + +If you need for some special reasons, e.g., for benchmarking, a project or paper deadline, parts of +our machines exclusively, we offer the opportunity to request and reserve these parts for your +project. + +Please send your request **7 working days** before the reservation should start (as that's our +maximum time limit for jobs and it is therefore not guaranteed that resources are available on +shorter notice) with the following information to the +[HPC support](mailto:hpcsupport@zih.tu-dresden.de?subject=Request%20for%20a%20exclusive%20reservation%20of%20hardware&body=Dear%20HPC%20support%2C%0A%0AI%20have%20the%20following%20request%20for%20a%20exclusive%20reservation%20of%20hardware%3A%0A%0AProject%3A%0AReservation%20owner%3A%0ASystem%3A%0AHardware%20requirements%3A%0ATime%20window%3A%20%3C%5Byear%5D%3Amonth%3Aday%3Ahour%3Aminute%20-%20%5Byear%5D%3Amonth%3Aday%3Ahour%3Aminute%3E%0AReason%3A): + +- `Project:` *Which project will be credited for the reservation?* +- `Reservation owner:` *Who should be able to run jobs on the + reservation? I.e., name of an individual user or a group of users + within the specified project.* +- `System:` *Which machine should be used?* +- `Hardware requirements:` *How many nodes and cores do you need? Do + you have special requirements, e.g., minimum on main memory, + equipped with a graphic card, special placement within the network + topology?* +- `Time window:` *Begin and end of the reservation in the form + `year:month:dayThour:minute:second` e.g.: 2020-05-21T09:00:00* +- `Reason:` *Reason for the reservation.* + +!!! hint + + Please note that your project CPU hour budget will be credited for the reserved hardware even if + you don't use it. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md new file mode 100644 index 0000000000000000000000000000000000000000..1b0b7e4343c271fca4782e1de6b9038c9e771895 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md @@ -0,0 +1,78 @@ +# Partitions, Memory and Run Time Limits + +There is no such thing as free lunch at ZIH systems. Since, compute nodes are operated in multi-user +node by default, jobs of several users can run at the same time at the very same node sharing +resources, like memory (but not CPU). On the other hand, a higher throughput can be achieved by +smaller jobs. Thus, restrictions w.r.t. 
[memory](#memory-limits) and +[runtime limits](#runtime-limits) have to be respected when submitting jobs. + +## Runtime Limits + +!!! note "Runtime limits are enforced." + + This means, a job will be canceled as soon as it exceeds its requested limit. Currently, the + maximum run time is 7 days. + +Shorter jobs come with multiple advantages: + +- lower risk of loss of computing time, +- shorter waiting time for scheduling, +- higher job fluctuation; thus, jobs with high priorities may start faster. + +To bring down the percentage of long running jobs we restrict the number of cores with jobs longer +than 2 days to approximately 50% and with jobs longer than 24 to 75% of the total number of cores. +(These numbers are subject to changes.) As best practice we advise a run time of about 8h. + +!!! hint "Please always try to make a good estimation of your needed time limit." + + For this, you can use a command line like this to compare the requested timelimit with the + elapsed time for your completed jobs that started after a given date: + + ```console + marie@login$ sacct -X -S 2021-01-01 -E now --format=start,JobID,jobname,elapsed,timelimit -s COMPLETED + ``` + +Instead of running one long job, you should split it up into a chain job. Even applications that are +not capable of checkpoint/restart can be adapted. Please refer to the section +[Checkpoint/Restart](../jobs_and_resources/checkpoint_restart.md) for further documentation. + + +{: align="center"} + +## Memory Limits + +!!! note "Memory limits are enforced." + + This means that jobs which exceed their per-node memory limit will be killed automatically by + the batch system. + +Memory requirements for your job can be specified via the `sbatch/srun` parameters: + +`--mem-per-cpu=<MB>` or `--mem=<MB>` (which is "memory per node"). The **default limit** is quite +low at **300 MB** per CPU. + +ZIH systems comprises different sets of nodes with different amount of installed memory which affect +where your job may be run. To achieve the shortest possible waiting time for your jobs, you should +be aware of the limits shown in the following table. + +???+ hint "Partitions and memory limits" + + | Partition | Nodes | # Nodes | Cores per Node | MB per Core | MB per Node | GPUs per Node | + |:-------------------|:-----------------------------------------|:--------|:----------------|:------------|:------------|:------------------| + | `haswell64` | `taurusi[4001-4104,5001-5612,6001-6612]` | `1328` | `24` | `2541` | `61000` | `-` | + | `haswell128` | `taurusi[4105-4188]` | `84` | `24` | `5250` | `126000` | `-` | + | `haswell256` | `taurusi[4189-4232]` | `44` | `24` | `10583` | `254000` | `-` | + | `broadwell` | `taurusi[4233-4264]` | `32` | `28` | `2214` | `62000` | `-` | + | `smp2` | `taurussmp[3-7]` | `5` | `56` | `36500` | `2044000` | `-` | + | `gpu2` | `taurusi[2045-2106]` | `62` | `24` | `2583` | `62000` | `4 (2 dual GPUs)` | + | `gpu2-interactive` | `taurusi[2045-2108]` | `64` | `24` | `2583` | `62000` | `4 (2 dual GPUs)` | + | `hpdlf` | `taurusa[3-16]` | `14` | `12` | `7916` | `95000` | `3` | + | `ml` | `taurusml[1-32]` | `32` | `44 (HT: 176)` | `1443*` | `254000` | `6` | + | `romeo` | `taurusi[7001-7192]` | `192` | `128 (HT: 256)` | `1972*` | `505000` | `-` | + | `julia` | `taurussmp8` | `1` | `896` | `27343*` | `49000000` | `-` | + +!!! 
note + + The ML nodes have 4way-SMT, so for every physical core allocated (,e.g., with + `SLURM_HINT=nomultithread`), you will always get 4*1443 MB because the memory of the other + threads is allocated implicitly, too. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md index a6cdfba8bd47659bc3a14473cad74c10b73089d0..57ab511938f3eb515b9e38ca831e91cede692418 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md @@ -2,50 +2,48 @@ ## Hardware -- Slurm partiton: romeo -- Module architecture: rome -- 192 nodes taurusi[7001-7192], each: - - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, MultiThreading +- Slurm partition: `romeo` +- Module architecture: `rome` +- 192 nodes `taurusi[7001-7192]`, each: + - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, Simultaneous Multithreading (SMT) - 512 GB RAM - - 200 GB SSD disk mounted on /tmp + - 200 GB SSD disk mounted on `/tmp` ## Usage -There is a total of 128 physical cores in each -node. SMT is also active, so in total, 256 logical cores are available -per node. +There is a total of 128 physical cores in each node. SMT is also active, so in total, 256 logical +cores are available per node. !!! note - Multithreading is disabled per default in a job. To make use of it - include the Slurm parameter `--hint=multithread` in your job script - or command line, or set - the environment variable `SLURM_HINT=multithread` before job submission. -Each node brings 512 GB of main memory, so you can request roughly -1972MB per logical core (using --mem-per-cpu). Note that you will always -get the memory for the logical core sibling too, even if you do not -intend to use SMT. + Multithreading is disabled per default in a job. To make use of it include the Slurm parameter + `--hint=multithread` in your job script or command line, or set the environment variable + `SLURM_HINT=multithread` before job submission. + +Each node brings 512 GB of main memory, so you can request roughly 1972 MB per logical core (using +`--mem-per-cpu`). Note that you will always get the memory for the logical core sibling too, even if +you do not intend to use SMT. !!! note - If you are running a job here with only ONE process (maybe - multiple cores), please explicitly set the option `-n 1` ! -Be aware that software built with Intel compilers and `-x*` optimization -flags will not run on those AMD processors! That's why most older -modules built with intel toolchains are not available on **romeo**. + If you are running a job here with only ONE process (maybe multiple cores), please explicitly + set the option `-n 1`! + +Be aware that software built with Intel compilers and `-x*` optimization flags will not run on those +AMD processors! That's why most older modules built with Intel toolchains are not available on +partition `romeo`. -We provide the script: `ml_arch_avail` that you can use to check if a -certain module is available on rome architecture. +We provide the script `ml_arch_avail` that can be used to check if a certain module is available on +`rome` architecture. ## Example, running CP2K on Rome First, check what CP2K modules are available in general: `module load spider CP2K` or `module avail CP2K`. -You will see that there are several different CP2K versions avail, built -with different toolchains. 
Now let's assume you have to decided you want -to run CP2K version 6 at least, so to check if those modules are built -for rome, use: +You will see that there are several different CP2K versions avail, built with different toolchains. +Now let's assume you have to decided you want to run CP2K version 6 at least, so to check if those +modules are built for rome, use: ```console marie@login$ ml_arch_avail CP2K/6 @@ -55,13 +53,11 @@ CP2K/6.1-intel-2018a: sandy, haswell CP2K/6.1-intel-2018a-spglib: haswell ``` -There you will see that only the modules built with **foss** toolchain -are available on architecture "rome", not the ones built with **intel**. -So you can load e.g. `ml CP2K/6.1-foss-2019a`. +There you will see that only the modules built with toolchain `foss` are available on architecture +`rome`, not the ones built with `intel`. So you can load, e.g. `ml CP2K/6.1-foss-2019a`. -Then, when writing your batch script, you have to specify the **romeo** -partition. Also, if e.g. you wanted to use an entire ROME node (no SMT) -and fill it with MPI ranks, it could look like this: +Then, when writing your batch script, you have to specify the partition `romeo`. Also, if e.g. you +wanted to use an entire ROME node (no SMT) and fill it with MPI ranks, it could look like this: ```bash #!/bin/bash @@ -73,27 +69,26 @@ and fill it with MPI ranks, it could look like this: srun cp2k.popt input.inp ``` -## Using the Intel toolchain on Rome +## Using the Intel Toolchain on Rome -Currently, we have only newer toolchains starting at `intel/2019b` -installed for the Rome nodes. Even though they have AMD CPUs, you can -still use the Intel compilers on there and they don't even create -bad-performing code. When using the MKL up to version 2019, though, -you should set the following environment variable to make sure that AVX2 -is used: +Currently, we have only newer toolchains starting at `intel/2019b` installed for the Rome nodes. +Even though they have AMD CPUs, you can still use the Intel compilers on there and they don't even +create bad-performing code. When using the Intel Math Kernel Library (MKL) up to version 2019, +though, you should set the following environment variable to make sure that AVX2 is used: ```bash export MKL_DEBUG_CPU_TYPE=5 ``` -Without it, the MKL does a CPUID check and disables AVX2/FMA on -non-Intel CPUs, leading to much worse performance. +Without it, the MKL does a CPUID check and disables AVX2/FMA on non-Intel CPUs, leading to much +worse performance. + !!! note - In version 2020, Intel has removed this environment variable and added separate Zen - codepaths to the library. However, they are still incomplete and do not - cover every BLAS function. Also, the Intel AVX2 codepaths still seem to - provide somewhat better performance, so a new workaround would be to - overwrite the `mkl_serv_intel_cpu_true` symbol with a custom function: + + In version 2020, Intel has removed this environment variable and added separate Zen codepaths to + the library. However, they are still incomplete and do not cover every BLAS function. 
Also, the + Intel AVX2 codepaths still seem to provide somewhat better performance, so a new workaround + would be to overwrite the `mkl_serv_intel_cpu_true` symbol with a custom function: ```c int mkl_serv_intel_cpu_true() { @@ -108,13 +103,11 @@ marie@login$ gcc -shared -fPIC -o libfakeintel.so fakeintel.c marie@login$ export LD_PRELOAD=libfakeintel.so ``` -As for compiler optimization flags, `-xHOST` does not seem to produce -best-performing code in every case on Rome. You might want to try -`-mavx2 -fma` instead. +As for compiler optimization flags, `-xHOST` does not seem to produce best-performing code in every +case on Rome. You might want to try `-mavx2 -fma` instead. ### Intel MPI -We have seen only half the theoretical peak bandwidth via Infiniband -between two nodes, whereas OpenMPI got close to the peak bandwidth, so -you might want to avoid using Intel MPI on romeo if your application -heavily relies on MPI communication until this issue is resolved. +We have seen only half the theoretical peak bandwidth via Infiniband between two nodes, whereas +OpenMPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition +`rome` if your application heavily relies on MPI communication until this issue is resolved. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md index 04624da4e55fe3a32e3d41842622b38b3e176315..c09260cf8d814a6a6835f981a25d1e8700c71df2 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md @@ -1,24 +1,23 @@ -# Large shared-memory node - HPE Superdome Flex +# Large Shared-Memory Node - HPE Superdome Flex -- Hostname: taurussmp8 -- Access to all shared file systems -- Slurm partition `julia` -- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) -- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence - protocols) -- 370 TB of fast NVME storage available at `/nvme/<projectname>` +- Hostname: `taurussmp8` +- Access to all shared filesystems +- Slurm partition `julia` +- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) +- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols) +- 370 TB of fast NVME storage available at `/nvme/<projectname>` -## Local temporary NVMe storage +## Local Temporary NVMe Storage There are 370 TB of NVMe devices installed. For immediate access for all projects, a volume of 87 TB -of fast NVMe storage is available at `/nvme/1/<projectname>`. For testing, we have set a quota of 100 -GB per project on this NVMe storage.This is +of fast NVMe storage is available at `/nvme/1/<projectname>`. For testing, we have set a quota of +100 GB per project on this NVMe storage. With a more detailed proposal on how this unique system (large shared memory + NVMe storage) can speed up their computations, a project's quota can be increased or dedicated volumes of up to the full capacity can be set up. -## Hints for usage +## Hints for Usage - granularity should be a socket (28 cores) - can be used for OpenMP applications with large memory demands @@ -35,5 +34,5 @@ full capacity can be set up. this unique system (large shared memory + NVMe storage) can speed up their computations, we will gladly increase this limit, for selected projects. -- Test users might have to clean-up their /nvme storage within 4 weeks +- Test users might have to clean-up their `/nvme` storage within 4 weeks to make room for large projects. 
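+
+Building on the hints above, the following job file is only a sketch of how a socket-granular
+OpenMP job on this machine could look; the memory request, run time, and the binary
+`./my_openmp_app` are placeholders that have to be adapted to the actual demands of your
+application:
+
+```bash
+#!/bin/bash
+#SBATCH --partition=julia           # the HPE Superdome Flex (taurussmp8)
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=28          # one socket, the recommended granularity
+#SBATCH --mem=500G                  # placeholder, adapt to your memory demand
+#SBATCH --time=08:00:00
+
+export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
+srun ./my_openmp_app
+```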
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md index 0c4d3d92a25de40aa7ec887feeb08086081a5af3..a5bb1980e342b8f1c19ecb6b610a5d481cd98268 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md @@ -1,589 +1,413 @@ -# Slurm +# Batch System Slurm -The HRSK-II systems are operated with the batch system Slurm. Just specify the resources you need -in terms of cores, memory, and time and your job will be placed on the system. +When logging in to ZIH systems, you are placed on a login node. There, you can manage your +[data life cycle](../data_lifecycle/overview.md), +[setup experiments](../data_lifecycle/experiments.md), and +edit and prepare jobs. The login nodes are not suited for computational work! From the login nodes, +you can interact with the batch system, e.g., submit and monitor your jobs. -## Job Submission +??? note "Batch System" -Job submission can be done with the command: `srun [options] <command>` - -However, using srun directly on the shell will be blocking and launch an interactive job. Apart from -short test runs, it is recommended to launch your jobs into the background by using batch jobs. For -that, you can conveniently put the parameters directly in a job file which you can submit using -`sbatch [options] <job file>` - -Some options of `srun/sbatch` are: - -| slurm option | Description | -|:---------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| -n \<N> or --ntasks \<N> | set a number of tasks to N(default=1). This determines how many processes will be spawned by srun (for MPI jobs). | -| -N \<N> or --nodes \<N> | set number of nodes that will be part of a job, on each node there will be --ntasks-per-node processes started, if the option --ntasks-per-node is not given, 1 process per node will be started | -| --ntasks-per-node \<N> | how many tasks per allocated node to start, as stated in the line before | -| -c \<N> or --cpus-per-task \<N> | this option is needed for multithreaded (e.g. OpenMP) jobs, it tells Slurm to allocate N cores per task allocated; typically N should be equal to the number of threads you program spawns, e.g. 
it should be set to the same number as OMP_NUM_THREADS | -| -p \<name> or --partition \<name> | select the type of nodes where you want to execute your job, on Taurus we currently have haswell, `smp`, `sandy`, `west`, ml and `gpu` available | -| --mem-per-cpu \<name> | specify the memory need per allocated CPU in MB | -| --time \<HH:MM:SS> | specify the maximum runtime of your job, if you just put a single number in, it will be interpreted as minutes | -| --mail-user \<your email> | tell the batch system your email address to get updates about the status of the jobs | -| --mail-type ALL | specify for what type of events you want to get a mail; valid options beside ALL are: BEGIN, END, FAIL, REQUEUE | -| -J \<name> or --job-name \<name> | give your job a name which is shown in the queue, the name will also be included in job emails (but cut after 24 chars within emails) | -| --no-requeue | At node failure, jobs are requeued automatically per default. Use this flag to disable requeueing. | -| --exclusive | tell Slurm that only your job is allowed on the nodes allocated to this job; please be aware that you will be charged for all CPUs/cores on the node | -| -A \<project> | Charge resources used by this job to the specified project, useful if a user belongs to multiple projects. | -| -o \<filename> or --output \<filename> | \<p>specify a file name that will be used to store all normal output (stdout), you can use %j (job id) and %N (name of first node) to automatically adopt the file name to the job, per default stdout goes to "slurm-%j.out"\</p> \<p>%RED%NOTE:<span class="twiki-macro ENDCOLOR"></span> the target path of this parameter must be writeable on the compute nodes, i.e. it may not point to a read-only mounted file system like /projects.\</p> | -| -e \<filename> or --error \<filename> | \<p>specify a file name that will be used to store all error output (stderr), you can use %j (job id) and %N (name of first node) to automatically adopt the file name to the job, per default stderr goes to "slurm-%j.out" as well\</p> \<p>%RED%NOTE:<span class="twiki-macro ENDCOLOR"></span> the target path of this parameter must be writeable on the compute nodes, i.e. it may not point to a read-only mounted file system like /projects.\</p> | -| -a or --array | submit an array job, see the extra section below | -| -w \<node1>,\<node2>,... | restrict job to run on specific nodes only | -| -x \<node1>,\<node2>,... | exclude specific nodes from job | - -The following example job file shows how you can make use of sbatch - -```Bash -#!/bin/bash -#SBATCH --time=01:00:00 -#SBATCH --output=simulation-m-%j.out -#SBATCH --error=simulation-m-%j.err -#SBATCH --ntasks=512 -#SBATCH -A myproject - -echo Starting Program -``` + The batch system is the central organ of every HPC system users interact with its compute + resources. The batch system finds an adequate compute system (partition) for your compute jobs. + It organizes the queueing and messaging, if all resources are in use. If resources are available + for your job, the batch system allocates and connects to these resources, transfers runtime + environment, and starts the job. -During runtime, the environment variable SLURM_JOB_ID will be set to the id of your job. +??? note "Batch Job" -You can also use our [Slurm Batch File Generator]**todo** Slurmgenerator, which could help you create -basic Slurm job scripts. + At HPC systems, computational work and resource requirements are encapsulated into so-called + jobs. 
In order to allow the batch system an efficient job placement it needs these + specifications: -Detailed information on [memory limits on Taurus]**todo** + * requirements: number of nodes and cores, memory per core, additional resources (GPU) + * maximum run-time + * HPC project for accounting + * who gets an email on which occasion -### Interactive Jobs + Moreover, the [runtime environment](../software/overview.md) as well as the executable and + certain command-line arguments have to be specified to run the computational work. -Interactive activities like editing, compiling etc. are normally limited to the login nodes. For -longer interactive sessions you can allocate cores on the compute node with the command "salloc". It -takes the same options like `sbatch` to specify the required resources. +ZIH uses the batch system Slurm for resource management and job scheduling. +Just specify the resources you need in terms +of cores, memory, and time and your Slurm will place your job on the system. -The difference to LSF is, that `salloc` returns a new shell on the node, where you submitted the -job. You need to use the command `srun` in front of the following commands to have these commands -executed on the allocated resources. If you allocate more than one task, please be aware that srun -will run the command on each allocated task! +This page provides a brief overview on -An example of an interactive session looks like: - -```Shell Session -tauruslogin3 /home/mark; srun --pty -n 1 -c 4 --time=1:00:00 --mem-per-cpu=1700 bash<br />srun: job 13598400 queued and waiting for resources<br />srun: job 13598400 has been allocated resources -taurusi1262 /home/mark; # start interactive work with e.g. 4 cores. -``` +* [Slurm options](#options) to specify resource requirements, +* how to submit [interactive](#interactive-jobs) and [batch jobs](#batch-jobs), +* how to [write job files](#job-files), +* how to [manage and control your jobs](#manage-and-control-jobs). -**Note:** A dedicated partition `interactive` is reserved for short jobs (< 8h) with not more than -one job per user. Please check the availability of nodes there with `sinfo -p interactive` . +If you are are already familiar with Slurm, you might be more interested in our collection of +[job examples](slurm_examples.md). +There is also a ton of external resources regarding Slurm. We recommend these links for detailed +information: -### Interactive X11/GUI Jobs - -Slurm will forward your X11 credentials to the first (or even all) node -for a job with the (undocumented) --x11 option. For example, an -interactive session for 1 hour with Matlab using eight cores can be -started with: - -```Shell Session -module load matlab -srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first matlab -``` +- [slurm.schedmd.com](https://slurm.schedmd.com/) provides the official documentation comprising + manual pages, tutorials, examples, etc. +- [Comparison with other batch systems](https://www.schedmd.com/slurmdocs/rosetta.html) -**Note:** If you are getting the error: +## Job Submission -```Bash -srun: error: x11: unable to connect node taurusiXXXX -``` +There are three basic Slurm commands for job submission and execution: -that probably means you still have an old host key for the target node in your `\~/.ssh/known_hosts` -file (e.g. from pre-SCS5). This can be solved either by removing the entry from your known_hosts or -by simply deleting the known_hosts file altogether if you don't have important other entries in it. 
- -### Requesting an Nvidia K20X / K80 / A100 - -Slurm will allocate one or many GPUs for your job if requested. Please note that GPUs are only -available in certain partitions, like `gpu2`, `gpu3` or `gpu2-interactive`. The option -for sbatch/srun in this case is `--gres=gpu:[NUM_PER_NODE]` (where `NUM_PER_NODE` can be `1`, 2 or -4, meaning that one, two or four of the GPUs per node will be used for the job). A sample job file -could look like this - -```Bash -#!/bin/bash -#SBATCH -A Project1 # account CPU time to Project1 -#SBATCH --nodes=2 # request 2 nodes<br />#SBATCH --mincpus=1 # allocate one task per node...<br />#SBATCH --ntasks=2 # ...which means 2 tasks in total (see note below) -#SBATCH --cpus-per-task=6 # use 6 threads per task -#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task) -#SBATCH --time=01:00:00 # run for 1 hour -srun ./your/cuda/application # start you application (probably requires MPI to use both nodes) -``` +1. `srun`: Submit a job for execution or initiate job steps in real time. +1. `sbatch`: Submit a batch script to Slurm for later execution. +1. `salloc`: Obtain a Slurm job allocation (a set of nodes), execute a command, and then release the + allocation when the command is finished. -Please be aware that the partitions `gpu`, `gpu1` and `gpu2` can only be used for non-interactive -jobs which are submitted by `sbatch`. Interactive jobs (`salloc`, `srun`) will have to use the -partition `gpu-interactive`. Slurm will automatically select the right partition if the partition -parameter (-p) is omitted. +Using `srun` directly on the shell will be blocking and launch an +[interactive job](#interactive-jobs). Apart from short test runs, it is recommended to submit your +jobs to Slurm for later execution by using [batch jobs](#batch-jobs). For that, you can conveniently +put the parameters directly in a [job file](#job-files), which you can submit using `sbatch +[options] <job file>`. -**Note:** Due to an unresolved issue concerning the Slurm job scheduling behavior, it is currently -not practical to use `--ntasks-per-node` together with GPU jobs. If you want to use multiple nodes, -please use the parameters `--ntasks` and `--mincpus` instead. The values of mincpus \* nodes has to -equal ntasks in this case. +At runtime, the environment variable `SLURM_JOB_ID` is set to the id of your job. The job +id is unique. The id allows you to [manage and control](#manage-and-control-jobs) your jobs. -### Limitations of GPU job allocations +## Options -The number of cores per node that are currently allowed to be allocated for GPU jobs is limited -depending on how many GPUs are being requested. On the K80 nodes, you may only request up to 6 -cores per requested GPU (8 per on the K20 nodes). This is because we do not wish that GPUs remain -unusable due to all cores on a node being used by a single job which does not, at the same time, -request all GPUs. +The following table contains the most important options for `srun/sbatch/salloc` to specify resource +requirements and control communication. -E.g., if you specify `--gres=gpu:2`, your total number of cores per node (meaning: ntasks \* -cpus-per-task) may not exceed 12 (on the K80 nodes) +??? tip "Options Table" -Note that this also has implications for the use of the --exclusive parameter. Since this sets the -number of allocated cores to 24 (or 16 on the K20X nodes), you also **must** request all four GPUs -by specifying --gres=gpu:4, otherwise your job will not start. 
In the case of --exclusive, it won't -be denied on submission, because this is evaluated in a later scheduling step. Jobs that directly -request too many cores per GPU will be denied with the error message: + | Slurm Option | Description | + |:---------------------------|:------------| + | `-n, --ntasks=<N>` | Number of (MPI) tasks (default: 1) | + | `-N, --nodes=<N>` | Number of nodes; there will be `--ntasks-per-node` processes started on each node | + | `--ntasks-per-node=<N>` | Number of tasks per allocated node to start (default: 1) | + | `-c, --cpus-per-task=<N>` | Number of CPUs per task; needed for multithreaded (e.g. OpenMP) jobs; typically `N` should be equal to `OMP_NUM_THREADS` | + | `-p, --partition=<name>` | Type of nodes where you want to execute your job (refer to [partitions](partitions_and_limits.md)) | + | `--mem-per-cpu=<size>` | Memory need per allocated CPU in MB | + | `-t, --time=<HH:MM:SS>` | Maximum runtime of the job | + | `--mail-user=<your email>` | Get updates about the status of the jobs | + | `--mail-type=ALL` | For what type of events you want to get a mail; valid options: `ALL`, `BEGIN`, `END`, `FAIL`, `REQUEUE` | + | `-J, --job-name=<name>` | Name of the job shown in the queue and in mails (cut after 24 chars) | + | `--no-requeue` | Disable requeueing of the job in case of node failure (default: enabled) | + | `--exclusive` | Exclusive usage of compute nodes; you will be charged for all CPUs/cores on the node | + | `-A, --account=<project>` | Charge resources used by this job to the specified project | + | `-o, --output=<filename>` | File to save all normal output (stdout) (default: `slurm-%j.out`) | + | `-e, --error=<filename>` | File to save all error output (stderr) (default: `slurm-%j.out`) | + | `-a, --array=<arg>` | Submit an array job ([examples](slurm_examples.md#array-jobs)) | + | `-w <node1>,<node2>,...` | Restrict job to run on specific nodes only | + | `-x <node1>,<node2>,...` | Exclude specific nodes from job | -```Shell Session -Batch job submission failed: Requested node configuration is not available -``` +!!! note "Output and Error Files" -### Parallel Jobs + When redirecting stderr and stderr into a file using `--output=<filename>` and + `--stderr=<filename>`, make sure the target path is writeable on the + compute nodes, i.e., it may not point to a read-only mounted + [filesystem](../data_lifecycle/overview.md) like `/projects.` -For submitting parallel jobs, a few rules have to be understood and followed. In general, they -depend on the type of parallelization and architecture. +!!! note "No free lunch" -#### OpenMP Jobs + Runtime and memory limits are enforced. Please refer to the section on [partitions and + limits](partitions_and_limits.md) for a detailed overview. -An SMP-parallel job can only run within a node, so it is necessary to include the options `-N 1` and -`-n 1`. The maximum number of processors for an SMP-parallel program is 488 on Venus and 56 on -taurus (smp island). Using --cpus-per-task N Slurm will start one task and you will have N CPUs -available for your job. An example job file would look like: +### Host List -```Bash -#!/bin/bash -#SBATCH -J Science1 -#SBATCH --nodes=1 -#SBATCH --tasks-per-node=1 -#SBATCH --cpus-per-task=8 -#SBATCH --mail-type=end -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=08:00:00 +If you want to place your job onto specific nodes, there are two options for doing this. Either use +`-p, --partition=<name>` to specify a host group aka. 
[partition](partitions_and_limits.md) that fits +your needs. Or, use `-w, --nodelist=<host1,host2,..>` with a list of hosts that will work for you. -export OMP_NUM_THREADS=8 -./path/to/binary -``` +## Interactive Jobs -#### MPI Jobs +Interactive activities like editing, compiling, preparing experiments etc. are normally limited to +the login nodes. For longer interactive sessions, you can allocate cores on the compute node with +the command `salloc`. It takes the same options as `sbatch` to specify the required resources. -For MPI jobs one typically allocates one core per task that has to be started. **Please note:** -There are different MPI libraries on Taurus and Venus, so you have to compile the binaries -specifically for their target. +`salloc` returns a new shell on the node, where you submitted the job. You need to use the command +`srun` in front of the following commands to have these commands executed on the allocated +resources. If you allocate more than one task, please be aware that `srun` will run the command on +each allocated task by default! -```Bash -#!/bin/bash -#SBATCH -J Science1 -#SBATCH --ntasks=864 -#SBATCH --mail-type=end -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=08:00:00 +The syntax for submitting a job is -srun ./path/to/binary +``` +marie@login$ srun [options] <command> ``` -#### Multiple Programs Running Simultaneously in a Job +An example of an interactive session looks like: -In this short example, our goal is to run four instances of a program concurrently in a **single** -batch script. Of course we could also start a batch script four times with sbatch but this is not -what we want to do here. Please have a look at [Running Multiple GPU Applications Simultaneously in -a Batch Job] todo Compendium.RunningNxGpuAppsInOneJob in case you intend to run GPU programs -simultaneously in a **single** job. +```console +marie@login$ srun --pty --ntasks=1 --cpus-per-task=4 --time=1:00:00 --mem-per-cpu=1700 bash -l +srun: job 13598400 queued and waiting for resources +srun: job 13598400 has been allocated resources +marie@compute$ # Now, you can start interactive work with e.g. 4 cores +``` -```Bash -#!/bin/bash -#SBATCH -J PseudoParallelJobs -#SBATCH --ntasks=4 -#SBATCH --cpus-per-task=1 -#SBATCH --mail-type=end -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=01:00:00 +!!! note "Using `module` commands" -# The following sleep command was reported to fix warnings/errors with srun by users (feel free to uncomment). -#sleep 5 -srun --exclusive --ntasks=1 ./path/to/binary & + The [module commands](../software/modules.md) are made available by sourcing the files + `/etc/profile` and `~/.bashrc`. This is done automatically by passing the parameter `-l` to your + shell, as shown in the example above. If you missed adding `-l` at submitting the interactive + session, no worry, you can source this files also later on manually. -#sleep 5 -srun --exclusive --ntasks=1 ./path/to/binary & +!!! note "Partition `interactive`" -#sleep 5 -srun --exclusive --ntasks=1 ./path/to/binary & + A dedicated partition `interactive` is reserved for short jobs (< 8h) with not more than one job + per user. Please check the availability of nodes there with `sinfo --partition=interactive`. -#sleep 5 -srun --exclusive --ntasks=1 ./path/to/binary & +### Interactive X11/GUI Jobs -echo "Waiting for parallel job steps to complete..." -wait -echo "All parallel job steps completed!" 
-``` +Slurm will forward your X11 credentials to the first (or even all) node for a job with the +(undocumented) `--x11` option. For example, an interactive session for one hour with Matlab using +eight cores can be started with: -### Exclusive Jobs for Benchmarking - -Jobs on taurus run, by default, in shared-mode, meaning that multiple jobs can run on the same -compute nodes. Sometimes, this behaviour is not desired (e.g. for benchmarking purposes), in which -case it can be turned off by specifying the Slurm parameter: `--exclusive` . - -Setting `--exclusive` **only** makes sure that there will be **no other jobs running on your nodes**. -It does not, however, mean that you automatically get access to all the resources which the node -might provide without explicitly requesting them, e.g. you still have to request a GPU via the -generic resources parameter (gres) to run on the GPU partitions, or you still have to request all -cores of a node if you need them. CPU cores can either to be used for a task (`--ntasks`) or for -multi-threading within the same task (--cpus-per-task). Since those two options are semantically -different (e.g., the former will influence how many MPI processes will be spawned by 'srun' whereas -the latter does not), Slurm cannot determine automatically which of the two you might want to use. -Since we use cgroups for separation of jobs, your job is not allowed to use more resources than -requested.* - -If you just want to use all available cores in a node, you have to -specify how Slurm should organize them, like with \<span>"-p haswell -c -24\</span>" or "\<span>-p haswell --ntasks-per-node=24". \</span> - -Here is a short example to ensure that a benchmark is not spoiled by -other jobs, even if it doesn't use up all resources in the nodes: - -```Bash -#!/bin/bash -#SBATCH -J Benchmark -#SBATCH -p haswell -#SBATCH --nodes=2 -#SBATCH --ntasks-per-node=2 -#SBATCH --cpus-per-task=8 -#SBATCH --exclusive # ensure that nobody spoils my measurement on 2 x 2 x 8 cores -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=00:10:00 - -srun ./my_benchmark +```console +marie@login$ module load matlab +marie@login$ srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first matlab ``` -### Array Jobs - -Array jobs can be used to create a sequence of jobs that share the same executable and resource -requirements, but have different input files, to be submitted, controlled, and monitored as a single -unit. The arguments `-a` or `--array` take an additional parameter that specify the array indices. -Within the job you can read the environment variables `SLURM_ARRAY_JOB_ID`, which will be set to the -first job ID of the array, and `SLURM_ARRAY_TASK_ID`, which will be set individually for each step. - -Within an array job, you can use %a and %A in addition to %j and %N -(described above) to make the output file name specific to the job. %A -will be replaced by the value of SLURM_ARRAY_JOB_ID and %a will be -replaced by the value of SLURM_ARRAY_TASK_ID. - -Here is an example of how an array job can look like: - -```Bash -#!/bin/bash -#SBATCH -J Science1 -#SBATCH --array 0-9 -#SBATCH -o arraytest-%A_%a.out -#SBATCH -e arraytest-%A_%a.err -#SBATCH --ntasks=864 -#SBATCH --mail-type=end -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=08:00:00 - -echo "Hi, I am step $SLURM_ARRAY_TASK_ID in this array job $SLURM_ARRAY_JOB_ID" -``` +!!! 
hint "X11 error" -**Note:** If you submit a large number of jobs doing heavy I/O in the Lustre file systems you should -limit the number of your simultaneously running job with a second parameter like: + If you are getting the error: -```Bash -#SBATCH --array=1-100000%100 -``` + ```Bash + srun: error: x11: unable to connect node taurusiXXXX + ``` -For further details please read the Slurm documentation at -(https://slurm.schedmd.com/sbatch.html) - -### Chain Jobs - -You can use chain jobs to create dependencies between jobs. This is often the case if a job relies -on the result of one or more preceding jobs. Chain jobs can also be used if the runtime limit of the -batch queues is not sufficient for your job. Slurm has an option `-d` or "--dependency" that allows -to specify that a job is only allowed to start if another job finished. - -Here is an example of how a chain job can look like, the example submits 4 jobs (described in a job -file) that will be executed one after each other with different CPU numbers: - -```Bash -#!/bin/bash -TASK_NUMBERS="1 2 4 8" -DEPENDENCY="" -JOB_FILE="myjob.slurm" - -for TASKS in $TASK_NUMBERS ; do - JOB_CMD="sbatch --ntasks=$TASKS" - if [ -n "$DEPENDENCY" ] ; then - JOB_CMD="$JOB_CMD --dependency afterany:$DEPENDENCY" - fi - JOB_CMD="$JOB_CMD $JOB_FILE" - echo -n "Running command: $JOB_CMD " - OUT=`$JOB_CMD` - echo "Result: $OUT" - DEPENDENCY=`echo $OUT | awk '{print $4}'` -done -``` + that probably means you still have an old host key for the target node in your + `~.ssh/known_hosts` file (e.g. from pre-SCS5). This can be solved either by removing the entry + from your known_hosts or by simply deleting the `known_hosts` file altogether if you don't have + important other entries in it. -### Binding and Distribution of Tasks +## Batch Jobs -The Slurm provides several binding strategies to place and bind the tasks and/or threads of your job -to cores, sockets and nodes. Note: Keep in mind that the distribution method has a direct impact on -the execution time of your application. The manipulation of the distribution can either speed up or -slow down your application. More detailed information about the binding can be found -[here](binding_and_distribution_of_tasks.md). +Working interactively using `srun` and `salloc` is a good starting point for testing and compiling. +But, as soon as you leave the testing stage, we highly recommend you to use batch jobs. +Batch jobs are encapsulated within [job files](#job-files) and submitted to the batch system using +`sbatch` for later execution. A job file is basically a script holding the resource requirements, +environment settings and the commands for executing the application. Using batch jobs and job files +has multiple advantages: -The default allocation of the tasks/threads for OpenMP, MPI and Hybrid (MPI and OpenMP) are as -follows. +* You can reproduce your experiments and work, because all steps are saved in a file. +* You can easily share your settings and experimental setup with colleagues. +* You can submit your job file to the scheduling system for later execution. In the meanwhile, you can + grab a coffee and proceed with other work (e.g., start writing a paper). -#### OpenMP +!!! hint "The syntax for submitting a job file to Slurm is" -The illustration below shows the default binding of a pure OpenMP-job on 1 node with 16 cpus on -which 16 threads are allocated. 
+ ```console + marie@login$ sbatch [options] <job_file> + ``` -```Bash -#!/bin/bash -#SBATCH --nodes=1 -#SBATCH --tasks-per-node=1 -#SBATCH --cpus-per-task=16 +### Job Files -export OMP_NUM_THREADS=16 +Job files have to be written with the following structure. -srun --ntasks 1 --cpus-per-task $OMP_NUM_THREADS ./application -``` +```bash +#!/bin/bash # Batch script starts with shebang line -\<img alt="" -src="data:;base64,iVBORw0KGgoAAAANSUhEUgAAAX4AAADeCAIAAAC10/zxAAAABmJLR0QA/wD/AP+gvaeTAAASvElEQVR4nO3de1BU5ePH8XMIBN0FVllusoouCuZ0UzMV7WtTDqV2GRU0spRm1GAqtG28zaBhNmU62jg2WWkXGWegNLVmqnFGQhsv/WEaXQxLaFEQdpfBXW4ul+X8/jgTQ1z8KQd4luX9+mv3Oc8+5zl7nv3wnLNnObKiKBIA9C8/0R0AMBgRPQAEIHoACED0ABCA6AEgANEDQACiB4AARA8AAYgeAAIQPQAEIHoACED0ABCA6AEgANEDQACiB4AARA8AAYgeAAIQPQAEIHoACOCv5cWyLPdWPwAMOFr+szuzHgACaJr1qLinBTDYaD/iYdYDQACiB4AARA8AAYgeAAIQPQAEIHoACED0ABCA6AEgANEDQACiB4AARA8AAYgeAAIQPQAEIHoACED0ABCA6AEgANEDQACiB4AARA8AAYgeAAIQPQAEIHoACED0ABCA6Bm8HnnkEVmWz54921YSFRV17Nix22/hl19+0ev1t18/JycnMTFRp9NFRUXdQUfhi4ieQS0sLGzt2rX9tjqj0bhmzZrs7Ox+WyO8FtEzqK1YsaK4uPirr77qvKiioiIlJSUiIsJkMr3yyisNDQ1q+bVr1x5//HGDwXDPPfecOXOmrX5NTU1GRsaoUaPCw8OfffbZqqqqzm3Omzdv8eLFo0aN6qPNwQBC9Axqer0+Ozt748aNzc3NHRYtWrQoICCguLj4/PnzFy5csFgsanlKSorJZKqsrPzuu+8+/PDDtvpLly612WwXL168evVqaGhoWlpav20FBiRFA+0tQKDZs2dv3bq1ubl5woQJe/bsURQlMjLy6NGjiqIUFRVJkmS329Wa+fn5QUFBHo+nqKhIluXq6mq1PCcnR6fTKYpSUlIiy3JbfZfLJcuy0+nscr25ubmRkZF9vXXoU9o/+/7CMg/ewd/ff9u2bStXrly2bFlbYVlZmU6nCw8PV5+azWa3211VVVVWVhYWFjZ8+HC1fPz48eoDq9Uqy/LUqVPbWggNDS0vLw8NDe2v7cAAQ/RAeuaZZ3bu3Llt27a2EpPJVF9f73A41PSxWq2BgYFGozEmJsbpdDY2NgYGBkqSVFlZqdYfPXq0LMuFhYVkDW4T53ogSZK0Y8eO3bt319bWqk/j4+OnT59usVjq6upsNltWVtby5cv9/PwmTJgwadKk9957T5KkxsbG3bt3q/Xj4uKSkpJWrFhRUVEhSZLD4Th8+HDntXg8HrfbrZ5XcrvdjY2N/bR58D5EDyRJkqZNmzZ//vy2r7FkWT58+HBDQ8PYsWMnTZp033337dq1S1106NCh/Pz8yZMnP/roo48++mhbC7m5uSNHjkxMTAwODp4+ffrp06c7r2Xfvn1Dhw5dtmyZzWYbOnRoWFhYP2wavJPcdsaoJy+WZUmStLQAYCDS/tln1gNAAKIHgABEDwABiB4AAhA9AAQgegAIQPQAEIDoASAA0QNAAKIHgABEDwABiB4AAhA9AAQgegAIQPQAEIDoASAA0QNAAKIHgABEDwABiB4AAhA9AAQgegAIQPQAEIDoASAA0QNAAH/tTaj3IQSA28esB4AAmu65DgA9w6wHgABEDwABiB4AAhA9AAQgegAIQPQAEEDTJYVcTDgY9OzyC8bGYKDl0hxmPQAE6IUfUnBRoq/SPnNhbPgq7WODWQ8AAYgeAAIQPQAEIHoACED0ABCA6AEgANEDQACiB4AARA8AAYgeAAIQPQAEIHoACED0ABDA96Pn0qVLTz31lNFoHDZs2IQJE9avX9+DRiZMmHDs2LHbrPzAAw/k5eV1uSgnJycxMVGn00VFRfWgG+hdXjU2XnvttYkTJw4bNmz06NHr1q1ramrqQWcGEB+PntbW1ieeeGLkyJG//fZbVVVVXl6e2WwW2B+j0bhmzZrs7GyBfYDK28ZGXV3dRx99dO3atby8vLy8vDfeeENgZ/qDooH2FvratWvXJEm6dOlS50XXr19PTk4ODw+PiYl5+eWX6+vr1fIbN25kZGSMHj06ODh40qRJRUVFiqIkJCQcPXpUXTp79uxly5Y1NTW5XK709HSTyWQ0GpcsWeJwOBRFeeWVVwICAoxGY2xs7LJly7rsVW5ubmRkZF9tc+/Rsn8ZGz0bG6rNmzc//PDDvb/NvUf7/vXxWc/IkSPj4+PT09O/+OKLq1evtl+0aNGigICA4uLi8+fPX7hwwWKxqOWpqamlpaXnzp1zOp0HDhwIDg5ue0lpaenMmTNnzZp14MCBgICApUuX2my2ixcvXr16NTQ0NC0tTZKkPXv2TJw4cc+ePVar9cCBA/24rbgz3jw2Tp8+PWXKlN7fZq8iNvn6gc1m27Bhw+TJk/39/ceNG5ebm6soSlFRkSRJdrtdrZOfnx8UFOTxeIqLiyVJKi8v79BIQkLCpk2bTCbTRx99pJaUlJTIstzWgsvlkmXZ6XQqinL//fera+kOsx4v4YVjQ1GUzZs3jx07tqqqqhe3tNf1QnqIXX1/qq2t3blzp5+f36+//nrixAmdTte26J9//pEkyWaz5efnDxs2rPNrExISIiMjp02b5na71ZIffvjBz88vth2DwfDHH38oRI/m1/Y/7xkbW7ZsMZvNVqu1V7ev92nfvz5+wNWeXq+3WCxBQUG//vqryWSqr693OBzqIqvVGhgYqB6ENzQ0VFRUdH757t27w8PDn3766YaGBkmSRo8eLctyYWGh9V83btyYOHGiJEl+foPoXfUNXjI2NmzYcPDgwVOnTsXGxvbBVnoXH/+QVFZWrl279uLFi/X19dXV1e+8805zc/PUqVPj4+OnT59usVjq6upsNltWVtby5cv9/Pzi4uKSkpJWrVpVUVGhKMrvv//eNtQCAwOPHDkSEhIyd+7c2tpateaKFSvUCg6H4/Dhw2rNqKioy5cvd9kfj8fjdrubm5slSXK73Y2Njf3yNqAL3jY2MjMzjxw5cvz4caPR6Ha7ff7LdR8/4HK5XCtXrhw/fvzQoUMNBsPMmTO//fZbdVFZWdnChQuNRmN0dHRGRkZdXZ1aXl1dvXLlypiYmODg4MmTJ1++fFlp9y
1GS0vLCy+88NBDD1VXVzudzszMzDFjxuj1erPZvHr1arWFkydPjh8/3mAwLFq0qEN/9u7d2/7Nbz+x90Ja9i9j447Gxo0bNzp8MOPi4vrvvbhz2vevrGi4XYl6QwwtLcCbadm/jA3fpn3/+vgBFwDvRPQAEIDoASAA0QNAAKIHgABEDwABiB4AAhA9AAQgegAIQPQAEIDoASAA0QNAAKIHgABEDwAB/LU3of58HuiMsYHuMOsBIICmfxUGAD3DrAeAAEQPAAGIHgACED0ABCB6AAhA9AAQgOgBIICmq5m5VnUw0HILQPg2bgEIYIDphd9wcT20r9I+c2Fs+CrtY4NZDwABiB4AAhA9AAQgegAIQPQAEIDoASAA0QNAAKIHgAA+Gz1nzpyZP3/+iBEjdDrdvffem5WVVV9f3w/rbWlpyczMHDFiREhIyNKlS2tqarqsptfr5XYCAwMbGxv7oXuDlqjxYLPZFi9ebDQaDQbD448/fvny5S6r5eTkJCYm6nS6qKio9uVpaWntx0leXl4/9Ll/+Gb0fPPNN4899tj9999/7tw5u91+8OBBu91eWFh4O69VFKW5ubnHq96yZcvx48fPnz9/5cqV0tLS9PT0LqvZbLbafy1cuHDBggWBgYE9XiluTeB4yMjIcDqdf/31V3l5eXR0dEpKSpfVjEbjmjVrsrOzOy+yWCxtQyU5ObnHPfE6igbaW+gLHo/HZDJZLJYO5a2trYqiXL9+PTk5OTw8PCYm5uWXX66vr1eXJiQkZGVlzZo1Kz4+vqCgwOVypaenm0wmo9G4ZMkSh8OhVtu1a1dsbGxoaGh0dPTWrVs7rz0iIuLTTz9VHxcUFPj7+9+4ceMWvXU4HIGBgT/88IPGre4LWvav94wNseMhLi5u//796uOCggI/P7+WlpbuupqbmxsZGdm+ZPny5evXr+/ppvehXkgPsavvC+pfs4sXL3a5dMaMGampqTU1NRUVFTNmzHjppZfU8oSEhHvuuaeqqkp9+uSTTy5YsMDhcDQ0NKxatWr+/PmKoly+fFmv1//999+Kojidzp9//rlD4xUVFe1XrR5tnTlz5ha93bFjx/jx4zVsbh/yjegROB4URVm3bt1jjz1ms9lcLtfzzz+/cOHCW3S1y+iJjo42mUxTpkx59913m5qa7vwN6BNETxdOnDghSZLdbu+8qKioqP2i/Pz8oKAgj8ejKEpCQsL777+vlpeUlMiy3FbN5XLJsux0OouLi4cOHfrll1/W1NR0ueq//vpLkqSSkpK2Ej8/v++///4WvY2Pj9+xY8edb2V/8I3oETge1MqzZ89W342777776tWrt+hq5+g5fvz42bNn//7778OHD8fExHSeu4miff/64Lme8PBwSZLKy8s7LyorK9PpdGoFSZLMZrPb7a6qqlKfjhw5Un1gtVplWZ46deqYMWPGjBlz3333hYaGlpeXm83mnJycDz74ICoq6n//+9+pU6c6tB8cHCxJksvlUp/W1ta2traGhIR8/vnnbWcK29cvKCiwWq1paWm9te3oTOB4UBRlzpw5ZrO5urq6rq5u8eLFs2bNqq+v7248dJaUlDRjxoxx48YtWrTo3XffPXjwoJa3wruITb6+oB7bv/766x3KW1tbO/yVKygoCAwMbPsrd/ToUbX8ypUrd911l9Pp7G4VDQ0Nb7/99vDhw9XzBe1FRER89tln6uOTJ0/e+lzPkiVLnn322TvbvH6kZf96z9gQOB4cDofU6QD8p59+6q6dzrOe9r788ssRI0bcalP7US+kh9jV95Gvv/46KCho06ZNxcXFbrf7999/z8jIOHPmTGtr6/Tp059//vna2trKysqZM2euWrVKfUn7oaYoyty5c5OTk69fv64oit1uP3TokKIof/75Z35+vtvtVhRl3759ERERnaMnKysrISGhpKTEZrM9/PDDqamp3XXSbrcPGTLEO08wq3wjehSh4yE2NnblypUul+vmzZtvvvmmXq+vrq7u3MOWlpabN2/m5ORERkbevHlTbdPj8ezfv99qtTqdzpMnT8bFxbWdihKO6OnW6dOn586dazAYhg0bdu+9977zzjvqlxdlZWULFy40Go3R0dEZGRl1dXVq/Q5Dzel0ZmZmjhkzRq/Xm83m1atXK4py4cKFhx56KCQkZPjw4dOmTfvxxx87r7epqenVV181GAx6vT41NdXlcnXXw+3bt3vtCWaVz0SPIm48FBYWJiUlDR8+PCQkZMaMGd39pdm7d2/7YxGdTqcoisfjmTNnTlhY2JAhQ8xm88aNGxsaGnr9nekZ7ftX0z3X1SNVLS3Am2nZv4wN36Z9//rgaWYA3o/oASAA0QNAAKIHgABEDwABiB4AAhA9AAQgegAIQPQAEKAX7rmu/e7L8FWMDXSHWQ8AATT9hgsAeoZZDwABiB4AAhA9AAQgegAIQPQAEIDoASAA0QNAAE1XM3OtKjCY8b+ZAQwwvfAbLq6HBgYb7Uc8zHoACED0ABCA6AEgANGDjlpaWjIzM0eMGBESErJ06dKampouq+Xk5CQmJup0uqioqA6L0tLS5Hby8vL6vtcYYIgedLRly5bjx4+fP3/+ypUrpaWl6enpXVYzGo1r1qzJzs7ucqnFYqn9V3Jych92FwMT0YOOPv744w0bNpjN5oiIiLfeeuvQoUNOp7NztXnz5i1evHjUqFFdNhIQEKD/l79/L3yRCh9D9OA/Kisr7Xb7pEmT1KdTpkxpaWm5dOnSnbaTk5MzatSoBx98cPv27c3Nzb3dTQx4/DnCf9TW1kqSFBoaqj4NDg728/Pr7nRPd5577rmXXnopPDy8sLBw9erVNptt586dvd9XDGRED/4jODhYkiSXy6U+ra2tbW1tDQkJ+fzzz1988UW18P+9iDQpKUl9MG7cOLfbbbFYiB50wAEX/iMqKioiIuKXX35Rn164cMHf33/ixIlpaWnKv+6owSFDhrS0tPRBTzGwET3oaNWqVdu2bfvnn3/sdvumTZtSUlIMBkPnah6Px+12q+dx3G53Y2OjWt7a2vrJJ5+Ulpa6XK5Tp05t3LgxJSWlXzcAA4KigfYW4IWamppeffVVg8Gg1+tTU1NdLleX1fbu3dt+IOl0OrXc4/HMmTMnLCxsyJAhZrN548aNDQ0N/dh99Aftn31NN8NRf0KmpQUAA5H2zz4HXAAEIHoACED0ABCA6AEgQC9cUsh/aAZwp5j1ABBA05frANAzzHoACED0ABCA6AEgANEDQACiB4AARA8AAYgeAAIQPQAEIHoACED0ABCA6AEgANEDQACiB4AARA8AAYgeAAIQPQAEIHoACED0ABCA6AEgwP8BhqBe/aVBoe8AAAAASUVORK5CYII=" -/> +#SBATCH --ntasks=24 # All #SBATCH lines have to follow uninterrupted +#SBATCH --time=01:00:00 # after the shebang line +#SBATCH --account=<KTR> # Comments start with # and do not 
count as interruptions +#SBATCH --job-name=fancyExp +#SBATCH --output=simulation-%j.out +#SBATCH --error=simulation-%j.err -#### MPI +module purge # Set up environment, e.g., clean modules environment +module load <modules> # and load necessary modules -The illustration below shows the default binding of a pure MPI-job. In -which 32 global ranks are distributed onto 2 nodes with 16 cores each. -Each rank has 1 core assigned to it. +srun ./application [options] # Execute parallel application with srun +``` -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +The following two examples show the basic resource specifications for a pure OpenMP application and +a pure MPI application, respectively. Within the section [Job Examples](slurm_examples.md), we +provide a comprehensive collection of job examples. -srun --ntasks 32 ./application -``` +??? example "Job file OpenMP" -\<img alt="" -src="data:;base64,iVBORw0KGgoAAAANSUhEUgAAAw4AAADeCAIAAAAb9sCoAAAABmJLR0QA/wD/AP+gvaeTAAAfBklEQVR4nO3dfXBU1f348bshJEA2ISGbB0gIZAMJxqciIhCktGKxaqs14UEGC9gBJVUjxIo4EwFlpiqMOgydWipazTBNVATbGevQMQQYUMdSEEUNYGIID8kmMewmm2TzeH9/3On+9pvN2T27N9nsJu/XX+Tu/dx77uee8+GTu8tiUFVVAQAAQH/ChnoAAAAAwYtWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQIhWCQAAQChcT7DBYBiocQAIOaqqDvUQfEC9AkYyPfWKp0oAAABCup4qaULrN0sA+oXuExrqFTDS6K9XPFUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUCAAAQolUauX72s58ZDIZPP/3UuSU5OfnDDz+UP8KXX35pNBrl9y8uLs7JyYmKikpOTvZhoABGvMDXq40bN2ZnZ48bNy4tLW3Tpk2dnZ0+DBfDC63SiBYfH//0008H7HQmk2nDhg3btm0L2BkBDBsBrld2u33Pnj2XLl0qLS0tLS3dunVrwE6NYEOrNKKtXbu2srLygw8+cH+ptrZ26dKliYmJqampjz/+eFtbm7b90qVLd911V2xs7A033HDixAnn/s3Nzfn5+ZMnT05ISHjwwQcbGxvdj3nPPfcsW7Zs8uTJg3Q5AIaxANerN954Y8GCBfHx8Tk5OQ8//LBrOEYaWqURzWg0btu27dlnn+3q6urzUl5e3ujRoysrK0+ePHnq1KnCwkJt+9KlS1NTU+vq6v71r3/95S9/ce6/cuVKi8Vy+vTpmpqa8ePHr1mzJmBXAWAkGMJ6dfz48VmzZg3o1SCkqDroPwKG0MKFC7dv397V1TVjxozdu3erqpqUlHTw4EFVVSsqKhRFqa+v1/YsKysbM2ZMT09PRUWFwWBoamrSthcXF0dFRamqWlVVZTAYnPvbbDaDwWC1Wvs9b0lJSVJS0mBfHQZVKK79UBwznIaqXqmqumXLlvT09MbGxkG9QAwe/Ws/PNCtGYJMeHj4Sy+9tG7dulWrVjk3Xr58OSoqKiEhQfvRbDY7HI7GxsbLly/Hx8fHxcVp26dPn679obq62mAwzJ4923mE8ePHX7lyZfz48YG6DgDDX+Dr1QsvvLBv377y8vL4+PjBuioEPVolKPfff/8rr7zy0ksvObekpqa2trY2NDRo1ae6ujoyMtJkMqWkpFit1o6OjsjISEVR6urqtP3T0tIMBsOZM2fojQAMqkDWq82bNx84cODo0aOpqamDdkEIAXxWCYqiKDt37ty1a1dLS4v2Y2Zm5ty5cwsLC+12u8ViKSoqWr16dVhY2IwZM2bOnPnaa68pitLR0bFr1y5t/4yMjMWLF69du7a2tlZRlIaGhv3797ufpaenx+FwaJ8zcDgcHR0dAbo8AMNIYOpVQUHBgQMHDh06ZDKZHA4HXxYwktEqQVEUZc6cOffee6/zn40YDIb9+/e3tbWlp6fPnDnzpptuevXVV7WX3n///bKysltuueWOO+644447nEcoKSmZNGlSTk5OdHT03Llzjx8/7n6WN954Y+zYsatWrbJYLGPHjuWBNgA/BKBeWa3W3bt3X7hwwWw2jx07duzYsdnZ2YG5OgQhg/MTT/4EGwyKoug5AoBQFIprPxTHDEA//Wufp0oAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABCtEoAAABC4UM9AAAInKqqqqEeAoAQY1BV1f9gg0FRFD1HABCKQnHta2MGMDLpqVcD8FSJAgQg+JnN5qEeAoCQNABPlQCMTKH1VAkA/KOrVQIAABje+BdwAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrq+gpLvVRoJ/Ps6CebGSBBaXzXCnBwJqFcQ0VOveKoEAAAgNAD/sUlo/WYJefp/02JuDFeh+1s4c3K4ol5BRP/c4KkSAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACAEK0SAACA0PBvlb799ttf//rXJpNp3LhxM2bMeOaZZ/w4yIwZMz788EPJnX/yk5+Ulpb2+1JxcXFOTk5UVFRycrIfw8DACqq5sXHjxuzs7HHjxqWlpW3atKmz
s9OPwSDUBdWcpF4FlaCaGyOtXg3zVqm3t/eXv/zlpEmTvv7668bGxtLSUrPZPITjMZlMGzZs2LZt2xCOAZpgmxt2u33Pnj2XLl0qLS0tLS3dunXrEA4GQyLY5iT1KngE29wYcfVK1UH/EQbbpUuXFEX59ttv3V+6evXqkiVLEhISUlJSHnvssdbWVm37tWvX8vPz09LSoqOjZ86cWVFRoapqVlbWwYMHtVcXLly4atWqzs5Om822fv361NRUk8m0fPnyhoYGVVUff/zx0aNHm0ymKVOmrFq1qt9RlZSUJCUlDdY1Dxw995e54d/c0GzZsmXBggUDf80DJ/jvr7vgH3NwzknqVTAIzrmhGQn1apg/VZo0aVJmZub69evffffdmpoa15fy8vJGjx5dWVl58uTJU6dOFRYWattXrFhx8eLFzz77zGq1vvPOO9HR0c6Qixcvzp8///bbb3/nnXdGjx69cuVKi8Vy+vTpmpqa8ePHr1mzRlGU3bt3Z2dn7969u7q6+p133gngtcI3wTw3jh8/PmvWrIG/ZgS3YJ6TGFrBPDdGRL0a2k4tACwWy+bNm2+55Zbw8PBp06aVlJSoqlpRUaEoSn19vbZPWVnZmDFjenp6KisrFUW5cuVKn4NkZWU999xzqampe/bs0bZUVVUZDAbnEWw2m8FgsFqtqqrefPPN2llE+C0tSATh3FBVdcuWLenp6Y2NjQN4pQMuJO5vHyEx5iCck9SrIBGEc0MdMfVq+LdKTi0tLa+88kpYWNhXX331ySefREVFOV/64YcfFEWxWCxlZWXjxo1zj83KykpKSpozZ47D4dC2HD58OCwsbIqL2NjYb775RqX06I4NvOCZG88//7zZbK6urh7Q6xt4oXV/NaE15uCZk9SrYBM8c2Pk1Kth/gacK6PRWFhYOGbMmK+++io1NbW1tbWhoUF7qbq6OjIyUntTtq2trba21j18165dCQkJ9913X1tbm6IoaWlpBoPhzJkz1f9z7dq17OxsRVHCwkZQVoeHIJkbmzdv3rdv39GjR6dMmTIIV4lQEiRzEkEoSObGiKpXw3yR1NXVPf3006dPn25tbW1qanrxxRe7urpmz56dmZk5d+7cwsJCu91usViKiopWr14dFhaWkZGxePHiRx55pLa2VlXVs2fPOqdaZGTkgQMHYmJi7r777paWFm3PtWvXajs0NDTs379f2zM5OfncuXP9jqenp8fhcHR1dSmK4nA4Ojo6ApIG9CPY5kZBQcGBAwcOHTpkMpkcDsew/8e3cBdsc5J6FTyCbW6MuHo1tA+1BpvNZlu3bt306dPHjh0bGxs7f/78jz76SHvp8uXLubm5JpNp4sSJ+fn5drtd297U1LRu3bqUlJTo6Ohbbrnl3Llzqsu/Guju7v7tb3972223NTU1Wa3WgoKCqVOnGo1Gs9n85JNPakc4cuTI9OnTY2Nj8/Ly+ozn9ddfd02+64PTIKTn/jI3fJob165d67MwMzIyApcL3wX//XUX/GMOqjmpUq+CSVDNjRFYrwzOo/jBYDBop/f7CAhmeu4vc2N4C8X7G4pjhjzqFUT0399h/gYcAACAHrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQrRKAAAAQuH6D2EwGPQfBMMScwPBhjkJEeYGRHiqBAAAIGRQVXWoxwAAABCkeKoEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgRKsEAAAgpOvbuvlu05HAv2/eYm6MBKH1rWzMyZGAegURPfWKp0oAAABCA/B/wIXWb5aQp/83LebGcBW6v4UzJ4cr6hVE9M8NnioBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAI0SoBAAAIDdtW6cSJE/fee++ECROioqJuvPHGoqKi1tbWAJy3u7u7oKBgwoQJMTExK1eubG5u7nc3o9FocBEZGdnR0RGA4Y1YQzUfLBbLsmXLTCZTbGzsXXfdde7cuX53Ky4uzsnJiYqKSk5Odt2+Zs0a13lSWloagDEj8KhXcEW9CjbDs1X65z//uWjRoptvvvmzzz6rr6/ft29ffX39mTNnZGJVVe3q6vL71M8///yhQ4dOnjz5/fffX7x4cf369f3uZrFYWv4nNzf3gQceiIyM9Puk8GwI50N+fr7Vaj1//vyVK1cmTpy4dOnSfnczmUwbNmzYtm2b+0uFhYXOqbJkyRK/R4KgRb2CK+pVMFJ10H+EwdDT05OamlpYWNhne29vr6qqV69eXbJkSUJCQkpKymOPPdba2qq9mpWVVVRUdPvtt2dmZpaXl9tstvXr16empppMpuXLlzc0NGi7vfrqq1OmTBk/fvzEiRO3b9/ufvbExMS33npL+3N5eXl4ePi1a9c8jLahoSEyMvLw4cM6r3ow6Lm/wTM3hnY+ZGRk7N27V/tzeXl5WFhYd3e3aKglJSVJSUmuW1avXv3MM8/4e+mDKHjur7zgHDP1aqBQr6hXIgPQ7Qzt6QeD1n2fPn2631fnzZu3YsWK5ubm2traefPmPfroo9r2rKysG264obGxUfvxV7/61QMPPNDQ0NDW1vbII4/ce++9qqqeO3fOaDReuHBBVVWr1frf//63z8Fra2tdT609zT5x4oSH0e7cuXP69Ok6LncQDY/SM4TzQVXVTZs2LVq0yGKx2Gy2hx56KDc318NQ+y09EydOTE1NnTVr1ssvv9zZ2el7AgZF8NxfecE5ZurVQKFeUa9EaJX68cknnyiKUl9f7/5SRUWF60tlZWVjxozp6elRVTUrK+tPf/qTtr2qqspgMDh3s9lsBoPBarVWVlaOHTv2vffea25u7vfU58+fVxSlqqrKuSUsLOzjjz/2MNrMzMydO3f6fpWBMDxKzxDOB23nhQsXatm47rrrampqPAzVvfQcOnTo008/vXDhwv79+1NSUtx/1xwqwXN/5QXnmKlXA4V6pW2nXrnTf3+H4WeVEhISFEW5cuWK+0uXL1+OiorSdlAUxWw2OxyOxsZG7cdJkyZpf6iurjYYDLNnz546derUqVNvuumm8ePHX7lyxWw2FxcX//nPf05OTv7pT3969OjRPsePjo5WFMVms2k/trS09Pb2xsTEvP32285PurnuX15eXl1dvWbNmoG6drgbwvmgquqdd95pNpubmprsdvuyZctuv/321tZW0Xxwt3jx4nnz5k2bNi0vL+/ll1/et2+fnlQgCFGv4Ip6FaSGtlMbDNp7vU899VSf7b29vX268vLy8sjISGdXfvDgQW37999/P2rUKKvVKjpFW1vbH//4x7i4OO39Y1eJiYl/+9vftD8fOXLE83v/y5cvf/DBB327vADSc3+DZ24M4XxoaGhQ3N7g+Pzzz0XHcf8tzdV77703YcIET5caQMFzf+UF55ipVwOFeqVtp165G4BuZ2hPP0j+8Y9/jBkz5rnnnqusrHQ4HGfPns3Pzz9x4kRvb+/cuXMfeui
hlpaWurq6+fPnP/LII1qI61RTVfXuu+9esmTJ1atXVVWtr69///33VVX97rvvysrKHA6HqqpvvPFGYmKie+kpKirKysqqqqqyWCwLFixYsWKFaJD19fURERHB+QFJzfAoPeqQzocpU6asW7fOZrO1t7e/8MILRqOxqanJfYTd3d3t7e3FxcVJSUnt7e3aMXt6evbu3VtdXW21Wo8cOZKRkeH8aMKQC6r7Kylox0y9GhDUK+cRqFd90CoJHT9+/O67746NjR03btyNN9744osvav9Y4PLly7m5uSaTaeLEifn5+Xa7Xdu/z1SzWq0FBQVTp041Go1ms/nJJ59UVfXUqVO33XZbTExMXFzcnDlzjh075n7ezs7OJ554IjY21mg0rlixwmaziUa4Y8eOoP2ApGbYlB516ObDmTNnFi9eHBcXFxMTM2/ePNHfNK+//rrrs96oqChVVXt6eu688874+PiIiAiz2fzss8+2tbUNeGb8E2z3V0Ywj5l6pR/1yhlOvepD//01OI/iB+2dSz1HQDDTc3+ZG8NbKN7fUBwz5FGvIKL//g7Dj3UDAAAMFFolAAAAIVolAAAAIVolAAAAIVolAAAAIVolAAAAIVolAAAAIVolAAAAIVolAAAAoXD9h/D6vw1jxGJuINgwJyHC3IAIT5UAAACEdP0fcAAAAMMbT5UAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEdH1bN99tOhL4981bzI2RILS+lY05ORJQryCip17xVAkAAEBoAP4POD1dPLHBH6tHKF4vsfKxoSgU80ysfKweoXi9xMrH6sFTJQAAACFaJQAAACFaJQAAAKFBaZW6u7sLCgomTJgQExOzcuXK5uZm+diNGzdmZ2ePGzcuLS1t06ZNnZ2dfpx95syZBoOhrq7Op8B///vfc+bMGTNmTEJCwqZNm+QDLRbLsmXLTCZTbGzsXXfdde7cOc/7FxcX5+TkREVFJScn9xm517yJYmXyJop1nt2/vPnE8xg8KyoqSk9Pj4yMjI+Pv++++77//nv52DVr1hhclJaWyscajUbX2MjIyI6ODsnYy5cv5+XlxcfHT5gw4fe//73XQFF+ZPIm2kcmb6JYPXkLZh7y6bUOiGJl6oBoncqsfVGszNr3vI/nte8h1muuRLEyuRLNWz1/v8gQ3V+ZOiCKlakDolzJrH1RrMzaF8XKrH1RrEyuRLEyuRJdl56/X7xQdRAdoaioKDMzs7Ky0mKxzJ8/f8WKFfKxa9euPXbsWGNj44kTJyZPnrx582b5WM327dsXLVqkKEptba18bFlZmdFo/Otf/1pXV1dTU3Ps2DH52AceeOAXv/jFjz/+aLfbV69efeONN3qO/eijj959990dO3YkJSW57iPKm0ysKG8ysRr3vOmZIaJYz2PwHPv5559XVlY2NzdXVVXdf//9OTk58rGrV68uLCxs+Z+uri75WLvd7gzMzc1dvny5fOxtt9324IMP2my2q1evzp0798knn/QcK8qPaLtMrChvMrGivOmvHoEnc72iOiATK6oDrrGidSqz9kWxMmvfc131vPZFsTK5EsXK5Eo0b2Vy5SuZ+yuqAzKxojogkyuZtS+KlVn7oliZtS+KlcmVKFYmV6LrksmVfwalVUpMTHzrrbe0P5eXl4eHh1+7dk0y1tWWLVsWLFggf15VVb/55puMjIwvvvhC8bFVysnJeeaZZzyPRxSbkZGxd+9e7c/l5eVhYWHd3d1eY0tKSvrcTlHeZGJdueZNMrbfvA1U6XHnefxez9vZ2Zmfn3/PPffIx65evdrv++vU0NAQGRl5+PBhydgrV64oilJRUaH9ePDgQaPR2NHR4TVWlB/37T7NjT55k4kV5U1/6Qk8mesV1QGZWFEdEOXKdZ3Kr333WNF2yVif1r5rrHyu3GN9ylWfeetrrmT4tI761AGvsR7qgPz9lVn7olhVYu27x/q69vs9r9dc9Yn1NVf9/l0gnyt5A/8GXF1dXX19/cyZM7UfZ82a1d3d/e233/pxqOPHj8+aNUt+/56ent/97nevvfZadHS0TydyOByff/55T0/PddddFxcXt2jRoq+++ko+PC8vr6SkpL6+vrm5+c033/zNb34zatQonwaghGbeAq+4uDg5OTk6Ovrrr7/++9//7mvs5MmTb7311h07dnR1dflx9rfffjstLe3nP/+55P7OJepkt9t9et9woAxt3kJFgOuAc536sfZFa1xm7bvu4+vad8b6kSvX80rmyn3eDmCd9FsA6oCvNdxDrE9r3z1Wfu33O2bJXDlj5XOlp6b5Q0+f1e8Rzp8/ryhKVVXV/2/HwsI+/vhjmVhXW7ZsSU9Pb2xslDyvqqo7d+5cunSpqqrfffed4stTpdraWkVR0tPTz549a7fbN2zYkJKSYrfbJc9rs9kWLlyovXrdddfV1NTInLdP5+shb15jXfXJm0ysKG96ZojnWL+fKrW1tV29evXYsWMzZ85cu3atfOyhQ4c+/fTTCxcu7N+/PyUlpbCw0Ncxq6qamZm5c+dOn8Z86623Oh8mz5s3T1GUzz77zGvsgD9V6jdvMrGivOmvHoHn9Xo91AGZXInqQL+5cl2nPq19VVwbva599318WvuusT7lyv28krlyn7e+5kqSTzW2Tx2QiRXVAfn7K/mkxD1Wcu27x/q09kVz0muu3GMlc+Xh74LBeKo08K2StoROnz6t/ah95u7EiRMysU7PP/+82Wyurq6WP++FCxcmTZpUV1en+t4qtbS0KIqyY8cO7cf29vZRo0YdPXpUJra3t3f27NkPP/xwU1OT3W7funVrWlqaTJvVb5nuN2/yy9g9b15jPeRtYEuPzPjlz3vs2DGDwdDa2upH7L59+xITE3097+HDhyMiIhoaGnwa88WLF3Nzc5OSktLT07du3aooyvnz573GDtIbcOr/zZuvsa550196As/r9XqoA15jPdQB99g+69SntS+qjTJrv88+Pq39PrE+5apPrE+50jjnrU+5kie/FtzrgEysqA7I31+Zte/5703Pa99zrOe1L4qVyZV7rHyu3K9LExpvwCUnJycmJn755Zfaj6dOnQoPD8/OzpY/wubNm/ft23f06NEpU6bIRx0/fryxsfH66683mUxaK3r99de/+eabMrFGo3HatGnOL/T06Zs9f/zxx//85z8FBQVxcXFRUVFPPfVUTU3N2bNn5Y+gCcW8Da1Ro0b58UanoigRERHd3d2+Ru3Zsyc3N9dkMvkUlZaW9sEHH9TV1VVVVaWmpqakpEybNs3XUw+sAOcthASmDrivU/m1L1rjMmvffR/5te8eK58r91j/aqY2b/XXSZ0GtQ74V8PlY0Vr32ush7XvIdZrrvqN9aNm+l3TfKCnzxIdoaioKCsrq6qqymKxLFiwwKd/AffEE09Mnz69qqqqvb29vb3d/TOwotjW1tZL/3PkyBFFUU6dOiX/Jtqrr75qNpvPnTvX3t7+hz/8YfLkyfJPLKZMmbJu3Tqbzdbe3v7CCy8YjcampiYPsd3d3e3t7c
XFxUlJSe3t7Q6HQ9suyptMrChvXmM95E3PDBHFisbvNbazs/PFF1+sqKiwWq1ffPHFrbfempeXJxnb09Ozd+/e6upqq9V65MiRjIyMRx99VH7MqqrW19dHRET0+4Fuz7EnT5784YcfGhsbDxw4kJCQ8Pbbb3uOFeVHtN1rrIe8eY31kDf91SPwZPIsqgMysaI64BorWqcya18UK7P2+91Hcu2Lji+TK1Gs11x5mLcyuRqMuaEK6oBMrKgOyORKZu33Gyu59vuNlVz7Hv6+9porUazXXHm4Lplc+WdQWqXOzs4nnngiNjbWaDSuWLHCZrNJxl67dk35vzIyMuTP6+TrG3Cqqvb29m7ZsiUpKSkmJuaOO+74+uuv5WPPnDmzePHiuLi4mJiYefPmef0XUq+//rrrNUZFRWnbRXnzGushbzLnFeVNz/QSxXodgyi2q6vrvvvuS0pKioiImDp16saNG+XnVU9Pz5133hkfHx8REWE2m5999tm2tjb5MauqumPHjunTp/f7kufYXbt2JSYmjh49Ojs7u7i42GusKD+i7V5jPeTNa6yHvOmZG0NFJs+iOiATK6oDzlgP69Tr2hfFyqx9mboqWvseYr3mykOs11x5mLcyddJXMvdXFdQBmVhRHZDJlde1L4qVWfuiWJm173leec6Vh1ivufJwXTJ10j8G51H8ELr/bR6xxBI7VLFDJRRzRSyxxA5trIb/2AQAAECIVgkAAECIVgkAAECIVgkAAEBoAD7WjeFNz8foMLyF4se6MbxRryDCx7oBAAAGha6nSgAAAMMbT5UAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACEaJUAAACE/h82xQH7rLtt0wAAAABJRU5ErkJggg==" -/> + ```bash + #!/bin/bash -#### Hybrid (MPI and OpenMP) + #SBATCH --nodes=1 + #SBATCH --tasks-per-node=1 + #SBATCH --cpus-per-task=64 + #SBATCH --time=01:00:00 + #SBATCH --account=<account> -In the illustration below the default binding of a Hybrid-job is shown. -In which 8 global ranks are distributed onto 2 nodes with 16 cores each. -Each rank has 4 cores assigned to it. + module purge + module load <modules> -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=4 + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + srun ./path/to/openmpi_application + ``` -export OMP_NUM_THREADS=4 + * Submisson: `marie@login$ sbatch batch_script.sh` + * Run with fewer CPUs: `marie@login$ sbatch --cpus-per-task=14 batch_script.sh` -srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application -``` +??? 
example "Job file MPI" -\<img alt="" -src="data:;base64,iVBORw0KGgoAAAANSUhEUgAAAvoAAADyCAIAAACzsfbGAAAABmJLR0QA/wD/AP+gvaeTAAAgAElEQVR4nO3de1iUdf7/8XsQA+SoDgdhZHA4CaUlpijmYdXooJvrsbxqzXa1dCsPbFlt5qF2O2xbXV52bdtlV25c7iVrhrVXWVaEupJ2gjxUYAIDgjgcZJCDIIf7+8f9a36zjCAwM/eMn3k+/oJ77rnf9z3z5u1r7hnn1siyLAEAAIjLy9U7AAAA4FzEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABCctz131mg0jtoPANccWZZVrsjMATyZPTOHszsAAEBwdp3dUaj/Cg+Aa7n2LAszB/A09s8czu4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9zxXDNmzNBoNF9++aVlSURExPvvv9/3LXz//fcBAQF9Xz8zMzMtLc3f3z8iIqIfOwpACOrPnPXr1ycnJw8ZMiQ6OnrDhg2XL1/ux+5CLMQdjzZ8+PDHH39ctXJarXbdunVbtmxRrSIAt6LyzGlqanrzzTfPnj2blZWVlZW1efNm1UrD3RB3PNqKFSuKi4vfe+8925uqqqoWL14cFham0+keeeSRlpYWZfnZs2dvu+22kJCQG264IS8vz7L+xYsXV69ePXLkyNDQ0Hvuuae2ttZ2m3feeeeSJUtGjhzppMMB4OZUnjk7duyYOnXq8OHD09LSHnjgAeu7w9MQdzxaQEDAli1bnnrqqfb29m43LVy4cPDgwcXFxd9++21+fn5GRoayfPHixTqd7vz58/v37//HP/5hWf/ee+81mUwFBQXl5eXBwcHLly9X7SgAXCtcOHOOHDkyfvx4hx4NrimyHezfAlxo+vTpzz33XHt7++jRo7dv3y7Lcnh4+L59+2RZLiwslCSpurpaWTMnJ8fX17ezs7OwsFCj0Vy4cEFZnpmZ6e/vL8tySUmJRqOxrN/Q0KDRaMxm8xXr7t69Ozw83NlHB6dy1d8+M+ea5qqZI8vypk2bRo0aVVtb69QDhPPY/7fvrXa8gpvx9vZ+8cUXV65cuWzZMsvCiooKf3//0NBQ5VeDwdDa2lpbW1tRUTF8+PChQ4cqy+Pj45UfjEajRqOZMGGCZQvBwcGVlZXBwcFqHQeAa4P6M+fZZ5/dtWtXbm7u8OHDnXVUcHvEHUjz5s175ZVXXnzxRcsSnU7X3NxcU1OjTB+j0ejj46PVaqOiosxmc1tbm4+PjyRJ58+fV9aPjo7WaDTHjx8n3wC4KjVnzpNPPpmdnX3o0CGdTue0A8I1gM/uQJIk6eWXX962bVtjY6Pya0JCwqRJkzIyMpqamkwm08aNG++//34vL6/Ro0ePGzfutddekySpra1t27ZtyvqxsbHp6ekrVqyoqqqSJKmmpmbv3r22VTo7O1tbW5X37FtbW9va2lQ6PABuRp2Zs2bNmuzs7AMHDmi12tbWVv4juicj7kCSJCk1NXXOnDmW/wqh0Wj27t3b0tIyatSocePGjR079tVXX1Vuevfdd3NyclJSUmbOnDlz5kzLFnbv3h0ZGZmWlhYYGDhp0qQjR47YVtmxY4efn9+yZctMJpOfnx8nlgGPpcLMMZvN27dv//nnnw0Gg5+fn5+fX3JysjpHBzeksXwCaCB31mgkSbJnCwCuRa7622fmAJ7J/r99zu4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBedu/CY1GY/9GAKCPmDkA+ouzOwAAQHAaWZZdvQ8AAABOxNkdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADB2fU1g3zZlycY2FcV0BueQP2vsaCvPAEzBz2xZ+ZwdgcAAAjOAReR4IsKRWX/qyV6Q1SufSVNX4mKmYOe2N8bnN0BAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjx486PP/7461//WqvVDhkyZPTo0U888cQANjJ69Oj333+/jyvfdNNNWVlZV7wpMzMzLS3N398/IiJiALsBx3Kr3li/fn1ycvKQIUOio6M3bNhw+fLlAewM3IFb9RUzx624VW942swRPO50dXXdfvvtkZGRJ0+erK2tzcrKMhgMLtwfrVa7bt26LVu2uHAfoHC33mhqanrzzTfPnj2blZWVlZW1efNmF+4MBszd+oqZ4z7crTc8bubIdrB/C8529uxZSZJ+/PFH25vOnTu3aNGi0NDQqKiohx9+uLm5WVleX1+/evXq6OjowMDAcePGFRYWyrKcmJi4b98+5dbp06cvW7bs8uXLDQ0Nq1at0ul0Wq327rvvrqmpkWX5kUceGTx4sFar1ev1y5Ytu+Je7d69Ozw83FnH7Dj2PL/0xsB6Q7Fp06apU6c6/pgdx1XPL33FzHHGfdXhnr2h8ISZI/jZncjIyISEhFWrVv373/8uLy+3vmnhwoWDBw8uLi7+9ttv8/PzMzIylOVLly4tKys7evSo2Wx+5513AgMDLXcpKyubMmXKLbfc8s477wwePPjee+81mUwFBQXl5eXBwcHLly+XJGn79u3Jycnbt283Go3vvPOOiseK/nHn3jhy5Mj48eMdf8xwPnfuK7iWO/eGR8wc16YtFZhMpieffDIlJcXb2zsuLm737t2yLBcWFkqSVF1drayTk5Pj6+vb2dlZXFwsSVJlZWW3jSQmJj7zzDM6ne7NN99UlpSUlGg0GssWGhoaNBqN2WyWZfnGG29UqvSEV1puwg17Q5blTZs2jRo1qra21oFH6nCuen7pK2aOM+6rGjfsDdljZo74cceisbHxlVde8fLyOnHixOeff+7v72+5qbS0VJIkk8mUk5MzZMgQ2/smJiaGh4enpqa2trYqS7744gsvLy+9lZCQkB9++EFm9Nh9X/W5T29s3brVYDAYjUaHHp/jEXf6wn36ipnjbtynNzxn5gj+Zpa1gICAjIwMX1/fEydO6HS65ubmmpoa5Saj0ejj46O8wdnS0lJVVWV7923btoWGht51110tLS2SJEVHR2s0muPHjxt/UV9
fn5ycLEmSl5cHPapicJPeePLJJ3ft2nXo0CG9Xu+Eo4Ta3KSv4IbcpDc8auYI/kdy/vz5xx9/vKCgoLm5+cKFCy+88EJ7e/uECRMSEhImTZqUkZHR1NRkMpk2btx4//33e3l5xcbGpqenP/jgg1VVVbIsnzp1ytJqPj4+2dnZQUFBd9xxR2Njo7LmihUrlBVqamr27t2rrBkREVFUVHTF/ens7GxtbW1vb5ckqbW1ta2tTZWHAVfgbr2xZs2a7OzsAwcOaLXa1tZW4f9TqKjcra+YOe7D3XrD42aOa08uOVtDQ8PKlSvj4+P9/PxCQkKmTJny0UcfKTdVVFQsWLBAq9WOGDFi9erVTU1NyvILFy6sXLkyKioqMDAwJSWlqKhItvokfEdHx29/+9uJEydeuHDBbDavWbMmJiYmICDAYDCsXbtW2cLBgwfj4+NDQkIWLlzYbX/eeOMN6wff+gSmG7Ln+aU3+tUb9fX13f4wY2Nj1Xss+s9Vzy99xcxxxn3V4Va94YEzR2PZygBoNBql/IC3AHdmz/NLb4jNVc8vfSU2Zg56Yv/zK/ibWQAAAMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAILztn8TymXZAVv0BpyBvkJP6A30hLM7AABAcBpZll29DwAAAE7E2R0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAMERdwAAgODs+lZlvr/SEwzsm5noDU+g/rd20VeegJmDntgzczi7AwAABOeAa2bxvcyisv/VEr0hKte+kqavRMXMQU/s7w3O7gAAAMERdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAghM27uTl5c2ZM2fYsGH+/v5jxozZuHFjc3OzCnU7OjrWrFkzbNiwoKCge++99+LFi1dcLSAgQGPFx8enra1Nhd3zWK7qB5PJtGTJEq1WGxIScttttxUVFV1xtczMzLS0NH9//4iICOvly5cvt+6TrKwsFfYZA8PMgTVmjrsRM+785z//mTVr1o033nj06NHq6updu3ZVV1cfP368L/eVZbm9vX3Apbdu3XrgwIFvv/32zJkzZWVlq1atuuJqJpOp8RcLFiyYP3++j4/PgIuidy7sh9WrV5vN5tOnT1dWVo4YMWLx4sVXXE2r1a5bt27Lli22N2VkZFhaZdGiRQPeEzgVMwfWmDnuSLaD/Vtwhs7OTp1Ol5GR0W15V1eXLMvnzp1btGhRaGhoVFTUww8/3NzcrNyamJi4cePGW265JSEhITc3t6GhYdWqVTqdTqvV3n333TU1Ncpqr776ql6vDw4OHjFixHPPPWdbPSws7O2331Z+zs3N9fb2rq+v72Vva2pqfHx8vvjiCzuP2hnseX7dpzdc2w+xsbFvvfWW8nNubq6Xl1dHR0dPu7p79+7w8HDrJffff/8TTzwx0EN3Ilc9v+7TV9aYOY7CzGHm9MQBicW15Z1BSdAFBQVXvHXy5MlLly69ePFiVVXV5MmTH3roIWV5YmLiDTfcUFtbq/w6d+7c+fPn19TUtLS0PPjgg3PmzJFluaioKCAg4Oeff5Zl2Ww2f/fdd902XlVVZV1aOaucl5fXy96+/PLL8fHxdhyuE4kxelzYD7Isb9iwYdasWSaTqaGh4b777luwYEEvu3rF0TNixAidTjd+/PiXXnrp8uXL/X8AnIK4Y42Z4yjMHGZOT4g7V/D5559LklRdXW17U2FhofVNOTk5vr6+nZ2dsiwnJia+/vrryvKSkhKNRmNZraGhQaPRmM3m4uJiPz+/PXv2XLx48YqlT58+LUlSSUmJZYmXl9fHH3/cy94mJCS8/PLL/T9KNYgxelzYD8rK06dPVx6NpKSk8vLyXnbVdvQcOHDgyy+//Pnnn/fu3RsVFWX7etFViDvWmDmOwsxRljNzbNn//Ar42Z3Q0FBJkiorK21vqqio8Pf3V1aQJMlgMLS2ttbW1iq/RkZGKj8YjUaNRjNhwoSYmJiYmJixY8cGBwdXVlYaDIbMzMy///3vERER06ZNO3ToULftBwYGSpLU0NCg/NrY2NjV1RUUFPTPf/7T8skv6/Vzc3ONRuPy5csddeyw5cJ+kGV59uzZBoPhwoULTU1NS5YsueWWW5qbm3vqB1vp6emTJ0+Oi4tbuHDhSy+9tGvXLnseCjgJMwfWmDluyrVpyxmU903/+Mc/dlve1dXVLVnn5ub6+PhYkvW+ffuU5WfOnBk0aJDZbO6pREtLy/PPPz906FDlvVhrYWFhO3fuVH4+ePBg7++j33333ffcc0//Dk9F9jy/7tMbLuyHmpoayeaNhmPHjvW0HdtXWtb27NkzbNiw3g5VRa56ft2nr6wxcxyFmaMsZ+bYckBicW15J/nggw98fX2feeaZ4uLi1tbWU6dOrV69Oi8vr6ura9KkSffdd19jY+P58+enTJny4IMPKnexbjVZlu+4445FixadO3dOluXq6up3331XluWffvopJyentbVVluUdO3aEhYXZjp6NGzcmJiaWlJSYTKapU6cuXbq0p52srq6+7rrr3PMDgwoxRo/s0n7Q6/UrV65saGi4dOnSs88+GxAQcOHCBds97OjouHTpUmZmZnh4+KVLl5RtdnZ2vvXWW0aj0Ww2Hzx4MDY21vI2v8sRd7ph5jgEM8eyBWZON8SdHh05cuSOO+4ICQkZMmTImDFjXnjhBeUD8BUVFQsWLNBqtSNGjFi9enVTU5OyfrdWM5vNa9asiYmJCQgIMBgMa9eulWU5Pz9/4sSJQUFBQ4cOTU1NPXz4sG3dy5cvP/rooyEhIQEBAUuXLm1oaOhpD//617+67QcGFcKMHtl1/XD8+PH09PShQ4cGBQVNnjy5p39p3njjDetzrv7+/rIsd3Z2zp49e/jw4dddd53BYHjqqadaWloc/sgMDHHHFjPHfswcy92ZOd3Y//xqLFsZAOVdQHu2AHdmz/NLb4jNVc8vfSU2Zg56Yv/zK+BHlQEAAKwRdwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwXnbv4mrXmEVHovegDPQV+gJvYGecHYHAAAIzq5rZgEAALg/zu4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARn17cq8/2VnmBg38xEb3gC9b+1i77yBMwc9MSemcPZHQAAIDgHXDPLVa/wqKtOXXt42mPlaXVdxdMeZ0+raw9Pe6w8ra49OLsDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACA44g4AABCcw+KO2Wz29vaOiYnR6/V/+MMf+v6f8o1G4+zZs3u69cMPPzQYDDExMZmZmWrWnT9/fkhIyKJFi3pawRl1S0tLZ86cGRUVlZSU9Mknn6hWt6WlJSUlRafT6fX6bdu29XGDfUdv2F9X1N6wh5OeX0mSWlpa9Hr9unXr1Kzr7+
+v0+l0Ot3ixYvVrHv27NmZM2eGhYUlJSW1traqU7egoED3C29v77y8vD5us4/oDYfUFa03ZDtYb6G+vj4qKkqW5dbW1gkTJnz88cd93EhpaemsWbOueFN7e7vBYDAajTU1NdHR0Q0NDerUlWU5Nzc3Ozt74cKF1gudXbe4uPjo0aOyLJ86dSo8PLyzs1Oduh0dHefPn5dlua6uLjIyUvm5W93+ojccW1ek3rCHCs+vLMsbN25cvHjx2rVr1ayr1+ttF6pQd/bs2Tt27JBluby8vL29XbW6ipqamhEjRnR0dNjW7S96w+F1hekNhePfzPLx8Zk4ceKZM2ckSWpra5s1a1ZKSsq4ceMOHTokSZLRaExNTX3ooYduvfXWRx991PqOeXl5kydPrqmpsSz5+uuvExIS9Hq9VqudMWNGTk6OOnUlSZoxY0ZgYKDKx2swGCZNmiRJ0vXXXy9JUnNzszp1Bw0aFB4eLklSR0dHQECAn59fXw58AOgNesMZHPv8lpSU/Pjjj3feeafKdV1yvKWlpUajccWKFZIkjRw50tu7t+/Zd8bxvvfee3fdddegQYMG9lBcFb1Bb/x/9mQl6y1YUt7FixfHjh2bm5sry3JnZ2d9fb0sy1VVVWlpabIsl5aWBgcH19TUyLI8bdq0kpISJeXl5eWlpqaaTCbr7b/77ru///3vlZ//9Kc/bd++XZ26is8++6wvr+AdXleW5U8//XTKlClq1m1oaIiOjh40aNAbb7xxxbr9RW84o64sRG/YQ4XjXbhwYWFh4c6dO6/6Ct6xdQMCAgwGw/jx4z/55BPV6n766aczZsyYP3/+TTfdtHnzZjWPVzFz5sycnJwr1u0vesOxdUXqjf+3Bbvu/L+HPWjQIL1ef9111y1btkxZ2NXV9fTTT6elpU2fPj04OFiW5dLS0mnTpim3rly5Mjc3t7S0VK/XjxkzxnKe3KKP/6Q5vK7iqv+kOaluWVlZUlLSTz/9pHJd5V6jRo0qLy+3rdtf9Aa94QzOPt5PPvlk/fr1siz3/k+aMx5no9Eoy3J+fn5kZGRdXZ06dT/++GNfX9/CwsJLly5NnTrV8maEOn1lMpkiIyMt71bI7j1z6A3Vjld2dG8oHPlmVkREhNFoLCsr++qrr3744QdJkvbv319cXHzo0KGDBw/6+voqqw0ePFj5wcvLq6OjQ5KksLAwPz+/EydOdNtgZGTkuXPnlJ8rKysjIyPVqeuq45UkyWw233XXXdu3bx89erSadRUxMTGpqamnTp3q/4NxFfQGveEMDj/eY8eO7dmzJyYm5rHHHnv77befffZZdepKkqTX6yVJGjduXHJy8unTp9WpGxUVlZiYmJiY6Ovre+utt548eVK145Uk6b333ps3b56T3smiN+iNbhz/2Z2IiIgtW7Zs3bpVkqT6+nqDweDt7f3111+bTKae7hIUFPTBBx889thj33zzjfXyiRMnFhUVlZeX19XV5ebm9v6BeQfW7RcH1r18+fKCBQvWr18/a9YsNetWVVUp0aGiouLYsWPJyclXrT4w9Aa94QwOPN7NmzdXVFQYjca//e1vv/vd7zZt2qRO3bq6ugsXLkiSVFRUdOrUqdjYWHXq3nDDDV1dXRUVFZ2dnf/973+TkpLUqavYs2fPkiVLeqloP3qD3rBwyvfuLF68+MSJE4WFhfPmzfv666+XLl36r3/9Kzo6upe7REREZGdnP/DAA0VFRZaF3t7er7322owZM1JSUrZu3RoUFKROXUmSbrvttqVLl+7fv1+n0xUUFKhT9/PPPz98+PDTTz+t/B88o9GoTt26urrZs2dHRUXNmjXrz3/+s/JKwknoDXrDGRz4/Lqk7tmzZ1NTU6Oion7zm9+8/vrroaGh6tTVaDTbtm1LT09PSkq6/vrr586dq05dSZJMJtPp06enTZvWe0X70Rv0hkJjeUtsIHfWaCRJsmcL1BW17rW4z9SlLnWv3brX4j5TV826fKsyAAAQHHEHAAAIjrgDAAAEp17c6eUKR1e9CNGAlfZ8pSEVLgbU09VVrnoBFHv0dJUTZ1+kxh5/+ctf4uPj4+Li1q9fb3tTQkJCQkLCvn377Kxi22b96skBd2m3O/bSk7Yr29OlV9zhnnrSdmWndqk6mDkWzJxumDk9rSzyzLHnS3v6vgXbKxw1NDR0dXUpt17xIkQOqWt7pSFL3Z4uBuSQugrrq6tYH+8VL4DiqLrdrnJiXVfR7UIkjqo74PuePXtWr9e3tLS0t7enpKR88803ln3+7rvvbrrppkuXLtXV1SmT1J663dqsvz3Ze5f2vW4vPWm78lW7tO91FT31pO3KvXep/dNjYJg5vWPm9GVNZo5nzhyVzu7YXuFo7NixlZWVyq19vwhRf9leachS19kXA+p2dRXr43WeUpurnNjWdfZFavorICDA19e3ra1NuQTd8OHDLftcWFiYmprq6+s7bNiwkSNHHj582J5C3dqsvz054C7tdsdeetJ2ZXu61HaHe+lJ5/0Nugozh5nTE2aOZ84cleLOuXPnoqKilJ91Ol1lZWVWVtZVvz/AgT777LO4uLjAwEDruhcvXtTr9ZGRkevXr7/qF7f014YNG55//nnLr9Z16+rqYmNjb7755gMHDji26JkzZ3Q63YIFC8aNG7dly5ZudRUqfLVXv4SEhGRkZERHR0dGRs6bN2/UqFGWfR4zZsyRI0caGxvPnz+fn5/v2Nntnj1py4Fd2ktP2nJel6rDPZ9fZo47YOZ45szp7RqnTqWETXWUl5evXbs2Ozu7W92goKCysjKj0Thz5sw5c+aMHDnSURUPHDgQHR2dmJh49OhRZYl13VOnTun1+oKCgrlz5548eXLYsGGOqtvZ2Xns2LHvv/9er9enp6dPmjTp9ttvt16hurq6sLBw+vTpjqpov/Ly8ldffbWkpMTX1/dXv/rV3LlzLY/VmDFjVq1aNX369IiIiLS0tN4vyWs/d+hJW47q0t570pbzutRV3OH5Zea4A2aOZ84clc7u9PEKR85w1SsNOeNiQL1fXaUvF0AZmKte5cSpF6kZmIKCgptvvlmr1QYEBMycOfOrr76yvvWRRx7Jz8/fv39/fX19XFycA+u6c0/asr9L+3jFHwvndak63Pn5Zea4FjOnL8SbOSrFHdsrHG3evNlsNju7ru2Vhix1nXoxINurq1jq9usCKP1le5WTbo+zu51VliQpPj7+m2++aWpqamtrO3z4cEJCgvU+l5WVSZL04Ycfms3m1NRUB9Z1w5605cAu7aUnbTm1S9Xhhs8vM8dNMHM8dObY8znnfm3hgw8+GDVqVHR09M6dO2VZHjlyZGNjo3JTenq6Vqv18/OLiorKz893YN2PPvpo0KBBUb8oLS211D158mRSUlJkZGRCQsKuXbv6srUBPGI7d+5UPpFuqVtQUBAXFxcZGTl69Oi9e/c6vO4XX3yRlJQUHx+/bt06+X8f5/Pnz0dGRnZ2dvZxU/Z0SL/u+/zzz8fFxcXGxmZkZMj/u88TJ04MCwu7+eabT506ZWdd2zbrV0/23qV9r9tLT9quf
NUu7dfxKmx70nblq3ap/dNjYJg5V8XM6QtmjgfOHPXijrWioqJHH32UuqLWtee+nvZYeVpdO11zx0tdderac19Pe6w8ra4FlwilrlPqXov7TF3qUvfarXst7jN11azLRSQAAIDgiDsAAEBwxB0AACA44g4AABAccQcAAAiOuAMAAARH3AEAAIIj7gAAAME54GsGITZ7vvILYnPVV41BbMwc9ISvGQQAAOiRXWd3AAAA3B9ndwAAgOCIOwAAQHDEHQAAIDjiDgAAEBxxBwAACI64AwAABEfcAQAAgiPuAAAAwRF3AACA4Ig7AABAcMQdAAAgOOIOAAAQHHEHAAAIjrgDAAAER9wBAACCI+4AAADBEXcAAIDgiDsAAEBwxB0AACC4/wNeW27o5DoAAAACSURBVCEI/r8gawAAAABJRU5ErkJggg==" -/> + ```bash + #!/bin/bash -### Node Features for Selective Job Submission + #SBATCH --ntasks=64 + #SBATCH --time=01:00:00 + #SBATCH --account=<account> -The nodes in our HPC system are becoming more diverse in multiple aspects: hardware, mounted -storage, software. The system administrators can describe the set of properties and it is up to the -user to specify her/his requirements. These features should be thought of as changing over time -(e.g. a file system get stuck on a certain node). + module purge + module load <modules> -A feature can be used with the Slurm option `--constrain` or `-C` like -`srun -C fs_lustre_scratch2 ...` with `srun` or `sbatch`. Combinations like -`--constraint="fs_beegfs_global0`are allowed. For a detailed description of the possible -constraints, please refer to the Slurm documentation (<https://slurm.schedmd.com/srun.html>). + srun ./path/to/mpi_application + ``` -**Remark:** A feature is checked only for scheduling. Running jobs are not affected by changing -features. + * Submisson: `marie@login$ sbatch batch_script.sh` + * Run with fewer MPI tasks: `marie@login$ sbatch --ntasks=14 batch_script.sh` -### Available features on Taurus +## Manage and Control Jobs -| Feature | Description | -|:--------|:-------------------------------------------------------------------------| -| DA | subset of Haswell nodes with a high bandwidth to NVMe storage (island 6) | +### Job and Slurm Monitoring -#### File system features +On the command line, use `squeue` to watch the scheduling queue. This command will tell the reason, +why a job is not running (job status in the last column of the output). More information about job +parameters can also be determined with `scontrol -d show job <jobid>`. The following table holds +detailed descriptions of the possible job states: + +??? tip "Reason Table" + + | Reason | Long Description | + |:-------------------|:------------------| + | `Dependency` | This job is waiting for a dependent job to complete. | + | `None` | No reason is set for this job. | + | `PartitionDown` | The partition required by this job is in a down state. | + | `PartitionNodeLimit` | The number of nodes required by this job is outside of its partitions current limits. Can also indicate that required nodes are down or drained. | + | `PartitionTimeLimit` | The jobs time limit exceeds its partitions current time limit. | + | `Priority` | One or higher priority jobs exist for this partition. | + | `Resources` | The job is waiting for resources to become available. | + | `NodeDown` | A node required by the job is down. | + | `BadConstraints` | The jobs constraints can not be satisfied. | + | `SystemFailure` | Failure of the Slurm system, a filesystem, the network, etc. | + | `JobLaunchFailure` | The job could not be launched. This may be due to a filesystem problem, invalid program name, etc. | + | `NonZeroExitCode` | The job terminated with a non-zero exit code. | + | `TimeLimit` | The job exhausted its time limit. | + | `InactiveLimit` | The job reached the system inactive limit. 
| -A feature `fs_*` is active if a certain file system is mounted and available on a node. Access to -these file systems are tested every few minutes on each node and the Slurm features set accordingly. +In addition, the `sinfo` command gives you a quick status overview. -| Feature | Description | -|:-------------------|:---------------------------------------------------------------------| -| fs_lustre_scratch2 | `/scratch` mounted read-write (the OS mount point is `/lustre/scratch2)` | -| fs_lustre_ssd | `/lustre/ssd` mounted read-write | -| fs_warm_archive_ws | `/warm_archive/ws` mounted read-only | -| fs_beegfs_global0 | `/beegfs/global0` mounted read-write | +For detailed information on why your submitted job has not started yet, you can use the command -For certain projects, specific file systems are provided. For those, -additional features are available, like `fs_beegfs_<projectname>`. +```console +marie@login$ whypending <jobid> +``` -## Editing Jobs +### Editing Jobs Jobs that have not yet started can be altered. Using `scontrol update timelimit=4:00:00 -jobid=<jobid>` is is for example possible to modify the maximum runtime. scontrol understands many -different options, please take a look at the man page for more details. +jobid=<jobid>`, it is for example possible to modify the maximum runtime. `scontrol` understands +many different options, please take a look at the +[scontrol documentation](https://slurm.schedmd.com/scontrol.html) for more details. -## Job and Slurm Monitoring +### Canceling Jobs -On the command line, use `squeue` to watch the scheduling queue. This command will tell the reason, -why a job is not running (job status in the last column of the output). More information about job -parameters can also be determined with `scontrol -d show job <jobid>` Here are detailed descriptions -of the possible job status: - -| Reason | Long description | -|:-------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------| -| Dependency | This job is waiting for a dependent job to complete. | -| None | No reason is set for this job. | -| PartitionDown | The partition required by this job is in a DOWN state. | -| PartitionNodeLimit | The number of nodes required by this job is outside of its partitions current limits. Can also indicate that required nodes are DOWN or DRAINED. | -| PartitionTimeLimit | The jobs time limit exceeds its partitions current time limit. | -| Priority | One or higher priority jobs exist for this partition. | -| Resources | The job is waiting for resources to become available. | -| NodeDown | A node required by the job is down. | -| BadConstraints | The jobs constraints can not be satisfied. | -| SystemFailure | Failure of the Slurm system, a file system, the network, etc. | -| JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. | -| NonZeroExitCode | The job terminated with a non-zero exit code. | -| TimeLimit | The job exhausted its time limit. | -| InactiveLimit | The job reached the system InactiveLimit. | +The command `scancel <jobid>` kills a single job and removes it from the queue. By using `scancel -u +<username>`, you can send a canceling signal to all of your jobs at once. -In addition, the `sinfo` command gives you a quick status overview. +### Accounting + +The Slurm command `sacct` provides job statistics like memory usage, CPU time, energy usage etc. 
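+For instance, to see how much memory and walltime a finished job actually consumed compared to what
+was requested, a query along these lines can be used (the job id is a placeholder, and the listed
+format fields are only one possible selection):
+
+```console
+marie@login$ sacct --jobs=<JOBID> --format=JobID,JobName,Elapsed,Timelimit,ReqMem,MaxRSS,State
+```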
-For detailed information on why your submitted job has not started yet, you can use: `whypending -<jobid>`. +!!! hint "Learn from old jobs" -## Accounting + We highly encourage you to use `sacct` to learn from you previous jobs in order to better + estimate the requirements, e.g., runtime, for future jobs. -The Slurm command `sacct` provides job statistics like memory usage, CPU -time, energy usage etc. Examples: +`sacct` outputs the following fields by default. -```Shell Session +```console # show all own jobs contained in the accounting database -sacct -# show specific job -sacct -j <JOBID> -# specify fields -sacct -j <JOBID> -o JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy -# show all fields -sacct -j <JOBID> -o ALL +marie@login$ sacct + JobID JobName Partition Account AllocCPUS State ExitCode +------------ ---------- ---------- ---------- ---------- ---------- -------- +[...] ``` -Read the manpage (`man sacct`) for information on the provided fields. +We'd like to point your attention to the following options to gain insight in your jobs. -Note that sacct by default only shows data of the last day. If you want -to look further into the past without specifying an explicit job id, you -need to provide a startdate via the **-S** or **--starttime** parameter, -e.g +??? example "Show specific job" -```Shell Session -# show all jobs since the beginning of year 2020: -sacct -S 2020-01-01 -``` + ```console + marie@login$ sacct --jobs=<JOBID> + ``` -## Killing jobs +??? example "Show all fields for a specific job" -The command `scancel <jobid>` kills a single job and removes it from the queue. By using `scancel -u -<username>` you are able to kill all of your jobs at once. + ```console + marie@login$ sacct --jobs=<JOBID> --format=All + ``` -## Host List +??? example "Show specific fields" -If you want to place your job onto specific nodes, there are two options for doing this. Either use -`-p` to specify a host group that fits your needs. Or, use `-w` or (`--nodelist`) with a name node -nodes that will work for you. + ```console + marie@login$ sacct --jobs=<JOBID> --format=JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy + ``` -## Job Profiling +The manual page (`man sacct`) and the [sacct online reference](https://slurm.schedmd.com/sacct.html) +provide a comprehensive documentation regarding available fields and formats. -\<a href="%ATTACHURL%/hdfview_memory.png"> \<img alt="" height="272" -src="%ATTACHURL%/hdfview_memory.png" style="float: right; margin-left: -10px;" title="hdfview" width="324" /> \</a> +!!! hint "Time span" -Slurm offers the option to gather profiling data from every task/node of the job. Following data can -be gathered: + By default, `sacct` only shows data of the last day. If you want to look further into the past + without specifying an explicit job id, you need to provide a start date via the option + `--starttime` (or short: `-S`). A certain end date is also possible via `--endtime` (or `-E`). -- Task data, such as CPU frequency, CPU utilization, memory - consumption (RSS and VMSize), I/O -- Energy consumption of the nodes -- Infiniband data (currently deactivated) -- Lustre filesystem data (currently deactivated) +??? example "Show all jobs since the beginning of year 2021" -The data is sampled at a fixed rate (i.e. every 5 seconds) and is stored in a HDF5 file. + ```console + marie@login$ sacct -S 2021-01-01 [-E now] + ``` -**CAUTION**: Please be aware that the profiling data may be quiet large, depending on job size, -runtime, and sampling rate. 
Always remove the local profiles from -`/lustre/scratch2/profiling/${USER}`, either by running sh5util as shown above or by simply removing -those files. +## Jobs at Reservations -Usage examples: +How to ask for a reservation is described in the section +[reservations](overview.md#exclusive-reservation-of-hardware). +After we agreed with your requirements, we will send you an e-mail with your reservation name. Then, +you could see more information about your reservation with the following command: -```Shell Session -# create energy and task profiling data (--acctg-freq is the sampling rate in seconds) -srun --profile=All --acctg-freq=5,energy=5 -n 32 ./a.out -# create task profiling data only -srun --profile=All --acctg-freq=5 -n 32 ./a.out +```console +marie@login$ scontrol show res=<reservation name> +# e.g. scontrol show res=hpcsupport_123 +``` -# merge the node local files in /lustre/scratch2/profiling/${USER} to single file -# (without -o option output file defaults to job_<JOBID>.h5) -sh5util -j <JOBID> -o profile.h5 -# in jobscripts or in interactive sessions (via salloc): -sh5util -j ${SLURM_JOBID} -o profile.h5 +If you want to use your reservation, you have to add the parameter +`--reservation=<reservation name>` either in your sbatch script or to your `srun` or `salloc` command. -# view data: -module load HDFView -hdfview.sh profile.h5 -``` +## Node Features for Selective Job Submission -More information about profiling with Slurm: +The nodes in our HPC system are becoming more diverse in multiple aspects: hardware, mounted +storage, software. The system administrators can describe the set of properties and it is up to the +user to specify her/his requirements. These features should be thought of as changing over time +(e.g., a filesystem get stuck on a certain node). -- [Slurm Profiling](http://slurm.schedmd.com/hdf5_profile_user_guide.html) -- [sh5util](http://slurm.schedmd.com/sh5util.html) +A feature can be used with the Slurm option `--constrain` or `-C` like +`srun -C fs_lustre_scratch2 ...` with `srun` or `sbatch`. Combinations like +`--constraint="fs_beegfs_global0`are allowed. For a detailed description of the possible +constraints, please refer to the [Slurm documentation](https://slurm.schedmd.com/srun.html). -## Reservations +!!! hint -If you want to run jobs, which specifications are out of our job limitations, you could -[ask for a reservation](mailto:hpcsupport@zih.tu-dresden.de). Please add the following information -to your request mail: + A feature is checked only for scheduling. Running jobs are not affected by changing features. -- start time (please note, that the start time have to be later than - the day of the request plus 7 days, better more, because the longest - jobs run 7 days) -- duration or end time -- account -- node count or cpu count -- partition +### Available Features -After we agreed with your requirements, we will send you an e-mail with your reservation name. Then -you could see more information about your reservation with the following command: +| Feature | Description | +|:--------|:-------------------------------------------------------------------------| +| DA | Subset of Haswell nodes with a high bandwidth to NVMe storage (island 6) | -```Shell Session -scontrol show res=<reservation name> -# e.g. scontrol show res=hpcsupport_123 -``` +#### Filesystem Features -If you want to use your reservation, you have to add the parameter `--reservation=<reservation -name>` either in your sbatch script or to your `srun` or `salloc` command. 
+A feature `fs_*` is active if a certain filesystem is mounted and available on a node. Access to
+these filesystems is tested every few minutes on each node and the Slurm features are set accordingly.

-## Slurm External Links
+| Feature              | Description                                                          |
+|:---------------------|:---------------------------------------------------------------------|
+| `fs_lustre_scratch2` | `/scratch` mounted read-write (mount point is `/lustre/scratch2`)    |
+| `fs_lustre_ssd`      | `/ssd` mounted read-write (mount point is `/lustre/ssd`)             |
+| `fs_warm_archive_ws` | `/warm_archive/ws` mounted read-only                                 |
+| `fs_beegfs_global0`  | `/beegfs/global0` mounted read-write                                 |
+| `fs_beegfs`          | `/beegfs` mounted read-write                                         |

-- Manpages, tutorials, examples, etc: (http://slurm.schedmd.com/)
-- Comparison with other batch systems: (http://www.schedmd.com/slurmdocs/rosetta.html)
+For certain projects, specific filesystems are provided. For those,
+additional features are available, like `fs_beegfs_<projectname>`.
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md
index 187bd7cf82651718fb0b188edfa0c95f33621b20..65e445f354d08a3473e226cc97c45ff6c01e8c48 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md
@@ -1,5 +1,358 @@
-# SlurmExamples
+# Job Examples

-## Array-Job with Afterok-Dependency and DataMover Usage
+## Parallel Jobs

-TODO
+For submitting parallel jobs, a few rules have to be understood and followed. In general, they
+depend on the type of parallelization and architecture.
+
+### OpenMP Jobs
+
+An SMP-parallel job can only run within a node, so it is necessary to include the options `-N 1` and
+`-n 1`. The maximum number of processors for an SMP-parallel program is 896 and 56 on the partitions
+`taurussmp8` and `smp2`, respectively. Please refer to the
+[partitions section](partitions_and_limits.md#memory-limits) for up-to-date information. Using the
+option `--cpus-per-task=<N>`, Slurm will start one task and you will have `N` CPUs available for your
+job. An example job file would look like:
+
+!!! example "Job file for OpenMP application"
+
+    ```Bash
+    #!/bin/bash
+    #SBATCH --nodes=1
+    #SBATCH --tasks-per-node=1
+    #SBATCH --cpus-per-task=8
+    #SBATCH --time=08:00:00
+    #SBATCH -J Science1
+    #SBATCH --mail-type=end
+    #SBATCH --mail-user=your.name@tu-dresden.de
+
+    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
+    ./path/to/binary
+    ```
+
+### MPI Jobs
+
+For MPI-parallel jobs one typically allocates one core per task that has to be started.
+
+!!! warning "MPI libraries"
+
+    There are different MPI libraries on ZIH systems for the different micro-architectures. Thus,
+    you have to compile the binaries specifically for the target architecture and partition. Please
+    refer to the sections [building software](../software/building_software.md) and
+    [module environments](../software/modules.md#module-environments) for detailed
+    information.
+
+!!! example "Job file for MPI application"
+
+    ```Bash
+    #!/bin/bash
+    #SBATCH --ntasks=864
+    #SBATCH --time=08:00:00
+    #SBATCH -J Science1
+    #SBATCH --mail-type=end
+    #SBATCH --mail-user=your.name@tu-dresden.de
+
+    srun ./path/to/binary
+    ```
+
+### Multiple Programs Running Simultaneously in a Job
+
+In this short example, our goal is to run four instances of a program concurrently in a **single**
+batch script. Of course, we could also start a batch script four times with `sbatch` but this is not
+what we want to do here. However, you can also find an example of
+[how to run GPU programs simultaneously in a single job](#running-multiple-gpu-applications-simultaneously-in-a-batch-job)
+below.
+
+!!! example " "
+
+    ```Bash
+    #!/bin/bash
+    #SBATCH --ntasks=4
+    #SBATCH --cpus-per-task=1
+    #SBATCH --time=01:00:00
+    #SBATCH -J PseudoParallelJobs
+    #SBATCH --mail-type=end
+    #SBATCH --mail-user=your.name@tu-dresden.de
+
+    # The following sleep command was reported to fix warnings/errors with srun by users (feel free to uncomment).
+    #sleep 5
+    srun --exclusive --ntasks=1 ./path/to/binary &
+
+    #sleep 5
+    srun --exclusive --ntasks=1 ./path/to/binary &
+
+    #sleep 5
+    srun --exclusive --ntasks=1 ./path/to/binary &
+
+    #sleep 5
+    srun --exclusive --ntasks=1 ./path/to/binary &
+
+    echo "Waiting for parallel job steps to complete..."
+    wait
+    echo "All parallel job steps completed!"
+    ```
+
+## Requesting GPUs
+
+Slurm will allocate one or many GPUs for your job if requested. Please note that GPUs are only
+available in certain partitions, like `gpu2`, `gpu3` or `gpu2-interactive`. The option
+for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NODE]` (where `NUM_PER_NODE` can be `1`, `2` or
+`4`, meaning that one, two or four of the GPUs per node will be used for the job).
+
+!!! example "Job file to request a GPU"
+
+    ```Bash
+    #!/bin/bash
+    #SBATCH --nodes=2              # request 2 nodes
+    #SBATCH --mincpus=1            # allocate one task per node...
+    #SBATCH --ntasks=2             # ...which means 2 tasks in total (see note below)
+    #SBATCH --cpus-per-task=6      # use 6 threads per task
+    #SBATCH --gres=gpu:1           # use 1 GPU per node (i.e. use one GPU per task)
+    #SBATCH --time=01:00:00        # run for 1 hour
+    #SBATCH -A Project1            # account CPU time to Project1
+
+    srun ./your/cuda/application   # start your application (probably requires MPI to use both nodes)
+    ```
+
+Please be aware that the partitions `gpu`, `gpu1` and `gpu2` can only be used for non-interactive
+jobs which are submitted by `sbatch`. Interactive jobs (`salloc`, `srun`) will have to use the
+partition `gpu-interactive`. Slurm will automatically select the right partition if the partition
+parameter `-p, --partition` is omitted.
+
+!!! note
+
+    Due to an unresolved issue concerning the Slurm job scheduling behavior, it is currently not
+    practical to use `--ntasks-per-node` together with GPU jobs. If you want to use multiple nodes,
+    please use the parameters `--ntasks` and `--mincpus` instead. The value of `mincpus`*`nodes`
+    has to equal `ntasks` in this case.
+
+### Limitations of GPU Job Allocations
+
+The number of cores per node that are currently allowed to be allocated for GPU jobs is limited
+depending on how many GPUs are being requested. On the K80 nodes, you may only request up to 6
+cores per requested GPU (8 per GPU on the K20 nodes). This is because we do not wish that GPUs remain
+unusable due to all cores on a node being used by a single job which does not, at the same time,
+request all GPUs.
+
+E.g., if you specify `--gres=gpu:2`, your total number of cores per node (meaning:
+`ntasks`*`cpus-per-task`) may not exceed 12 (on the K80 nodes).
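+For instance, the following job file header stays within this limit on the K80 nodes. It is just a
+sketch illustrating the rule above (two GPUs, at most six cores per requested GPU):
+
+```Bash
+#!/bin/bash
+#SBATCH --ntasks=2
+#SBATCH --cpus-per-task=6      # 2 tasks x 6 cores = 12 cores, i.e., 6 cores per requested GPU
+#SBATCH --gres=gpu:2           # two GPUs on the node
+#SBATCH --time=01:00:00
+
+srun ./your/cuda/application
+```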
+Note that this also has implications for the use of the `--exclusive` parameter. Since this sets the
+number of allocated cores to 24 (or 16 on the K20X nodes), you also **must** request all four GPUs
+by specifying `--gres=gpu:4`, otherwise your job will not start. In the case of `--exclusive`, it won't
+be denied on submission, because this is evaluated in a later scheduling step. Jobs that directly
+request too many cores per GPU will be denied with the error message:
+
+```console
+Batch job submission failed: Requested node configuration is not available
+```
+
+### Running Multiple GPU Applications Simultaneously in a Batch Job
+
+Our starting point is a (serial) program that needs a single GPU and four CPU cores to perform its
+task (e.g. TensorFlow). The following batch script shows how to run such a job on the partition `ml`.
+
+!!! example
+
+    ```bash
+    #!/bin/bash
+    #SBATCH --ntasks=1
+    #SBATCH --cpus-per-task=4
+    #SBATCH --gres=gpu:1
+    #SBATCH --gpus-per-task=1
+    #SBATCH --time=01:00:00
+    #SBATCH --mem-per-cpu=1443
+    #SBATCH --partition=ml
+
+    srun some-gpu-application
+    ```
+
+When `srun` is used within a submission script, it inherits parameters from `sbatch`, including
+`--ntasks=1`, `--cpus-per-task=4`, etc. So we actually implicitly run the following:
+
+```bash
+srun --ntasks=1 --cpus-per-task=4 ... --partition=ml some-gpu-application
+```
+
+Now, our goal is to run four instances of this program concurrently in a single batch script. Of
+course, we could also start the above script multiple times with `sbatch`, but this is not what we want
+to do here.
+
+#### Solution
+
+In order to run multiple programs concurrently in a single batch script/allocation we have to do
+three things:
+
+1. Allocate enough resources to accommodate multiple instances of our program. This can be achieved
+   with an appropriate batch script header (see below).
+1. Start job steps with `srun` as background processes. This is achieved by adding an ampersand at the
+   end of the `srun` command.
+1. Make sure that each background process gets its private resources. We need to set the resource
+   fraction needed for a single run in the corresponding `srun` command. The total aggregated
+   resources of all job steps must fit in the allocation specified in the batch script header.
+   Additionally, the option `--exclusive` is needed to make sure that each job step is provided with
+   its private set of CPU and GPU resources. The following example shows how four independent
+   instances of the same program can be run concurrently from a single batch script. Each instance
+   (task) is equipped with 4 CPUs (cores) and one GPU.
+
+!!! example "Job file simultaneously executing four independent instances of the same program"
+
+    ```Bash
+    #!/bin/bash
+    #SBATCH --ntasks=4
+    #SBATCH --cpus-per-task=4
+    #SBATCH --gres=gpu:4
+    #SBATCH --gpus-per-task=1
+    #SBATCH --time=01:00:00
+    #SBATCH --mem-per-cpu=1443
+    #SBATCH --partition=ml
+
+    srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
+    srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
+    srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
+    srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application &
+
+    echo "Waiting for all job steps to complete..."
+    wait
+    echo "All jobs completed!"
+    ```
+
+In practice it is possible to leave out resource options in `srun` that do not differ from the ones
+inherited from the surrounding `sbatch` context. The following line would be sufficient to do the
+job in this example:
+
+```bash
+srun --exclusive --gres=gpu:1 --ntasks=1 some-gpu-application &
+```
+
+Yet, it adds some extra safety to leave them in, enabling the Slurm batch system to complain if not
+enough resources in total were specified in the header of the batch script.
+
+## Exclusive Jobs for Benchmarking
+
+Jobs on ZIH systems run, by default, in shared-mode, meaning that multiple jobs (from different users)
+can run at the same time on the same compute node. Sometimes, this behavior is not desired (e.g.
+for benchmarking purposes). Thus, the Slurm parameter `--exclusive` requests exclusive usage of
+resources.
+
+Setting `--exclusive` **only** makes sure that there will be **no other jobs running on your nodes**.
+It does not, however, mean that you automatically get access to all the resources which the node
+might provide without explicitly requesting them, e.g. you still have to request a GPU via the
+generic resources parameter (`gres`) to run on the partitions with GPU, or you still have to
+request all cores of a node if you need them. CPU cores can either be used for a task
+(`--ntasks`) or for multi-threading within the same task (`--cpus-per-task`). Since those two
+options are semantically different (e.g., the former will influence how many MPI processes will be
+spawned by `srun` whereas the latter does not), Slurm cannot determine automatically which of the
+two you might want to use. Since we use cgroups for separation of jobs, your job is not allowed to
+use more resources than requested.
+
+If you just want to use all available cores in a node, you have to specify how Slurm should organize
+them, like with `-p haswell -c 24` or `-p haswell --ntasks-per-node=24`.
+
+Here is a short example to ensure that a benchmark is not spoiled by other jobs, even if it doesn't
+use up all resources in the nodes:
+
+!!! example "Exclusive resources"
+
+    ```Bash
+    #!/bin/bash
+    #SBATCH -p haswell
+    #SBATCH --nodes=2
+    #SBATCH --ntasks-per-node=2
+    #SBATCH --cpus-per-task=8
+    #SBATCH --exclusive            # ensure that nobody spoils my measurement on 2 x 2 x 8 cores
+    #SBATCH --time=00:10:00
+    #SBATCH -J Benchmark
+    #SBATCH --mail-user=your.name@tu-dresden.de
+
+    srun ./my_benchmark
+    ```
+
+## Array Jobs
+
+Array jobs can be used to create a sequence of jobs that share the same executable and resource
+requirements, but have different input files, to be submitted, controlled, and monitored as a single
+unit. The option is `-a, --array=<indexes>` where the parameter `indexes` specifies the array
+indices. The following specifications are possible:
+
+* comma separated list, e.g., `--array=0,1,2,17`,
+* range based, e.g., `--array=0-42`,
+* step based, e.g., `--array=0-15:4`,
+* mix of comma separated and range based, e.g., `--array=0,1,2,16-42`.
+
+A maximum number of simultaneously running tasks from the job array may be specified using the `%`
+separator. The specification `--array=0-23%8` limits the number of simultaneously running tasks from
+this job array to 8.
+
+Within the job you can read the environment variables `SLURM_ARRAY_JOB_ID` and
+`SLURM_ARRAY_TASK_ID`, which are set to the first job ID of the array and individually for each
+step, respectively.
+
+Within an array job, you can use `%a` and `%A` in addition to `%j` and `%N` to make the output file
+name specific to the job:
+
+* `%A` will be replaced by the value of `SLURM_ARRAY_JOB_ID`
+* `%a` will be replaced by the value of `SLURM_ARRAY_TASK_ID`
+
example "Job file using job arrays" + + ```Bash + #!/bin/bash + #SBATCH --array 0-9 + #SBATCH -o arraytest-%A_%a.out + #SBATCH -e arraytest-%A_%a.err + #SBATCH --ntasks=864 + #SBATCH --time=08:00:00 + #SBATCH -J Science1 + #SBATCH --mail-type=end + #SBATCH --mail-user=your.name@tu-dresden.de + + echo "Hi, I am step $SLURM_ARRAY_TASK_ID in this array job $SLURM_ARRAY_JOB_ID" + ``` + +!!! note + + If you submit a large number of jobs doing heavy I/O in the Lustre filesystems you should limit + the number of your simultaneously running job with a second parameter like: + + ```Bash + #SBATCH --array=1-100000%100 + ``` + +Please read the Slurm documentation at https://slurm.schedmd.com/sbatch.html for further details. + +## Chain Jobs + +You can use chain jobs to create dependencies between jobs. This is often the case if a job relies +on the result of one or more preceding jobs. Chain jobs can also be used if the runtime limit of the +batch queues is not sufficient for your job. Slurm has an option +`-d, --dependency=<dependency_list>` that allows to specify that a job is only allowed to start if +another job finished. + +Here is an example of how a chain job can look like, the example submits 4 jobs (described in a job +file) that will be executed one after each other with different CPU numbers: + +!!! example "Script to submit jobs with dependencies" + + ```Bash + #!/bin/bash + TASK_NUMBERS="1 2 4 8" + DEPENDENCY="" + JOB_FILE="myjob.slurm" + + for TASKS in $TASK_NUMBERS ; do + JOB_CMD="sbatch --ntasks=$TASKS" + if [ -n "$DEPENDENCY" ] ; then + JOB_CMD="$JOB_CMD --dependency afterany:$DEPENDENCY" + fi + JOB_CMD="$JOB_CMD $JOB_FILE" + echo -n "Running command: $JOB_CMD " + OUT=`$JOB_CMD` + echo "Result: $OUT" + DEPENDENCY=`echo $OUT | awk '{print $4}'` + done + ``` + +## Array-Job with Afterok-Dependency and Datamover Usage + +This part is under construction. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_profiling.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_profiling.md new file mode 100644 index 0000000000000000000000000000000000000000..273a87710602b62feb97c342335b4c44f30ad09e --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_profiling.md @@ -0,0 +1,62 @@ +# Job Profiling + +Slurm offers the option to gather profiling data from every task/node of the job. Analyzing this +data allows for a better understanding of your jobs in terms of elapsed time, runtime and I/O +behavior, and many more. + +The following data can be gathered: + +* Task data, such as CPU frequency, CPU utilization, memory consumption (RSS and VMSize), I/O +* Energy consumption of the nodes +* Infiniband data (currently deactivated) +* Lustre filesystem data (currently deactivated) + +The data is sampled at a fixed rate (i.e. every 5 seconds) and is stored in a HDF5 file. + +!!! note "Data hygiene" + + Please be aware that the profiling data may be quiet large, depending on job size, runtime, and + sampling rate. Always remove the local profiles from `/lustre/scratch2/profiling/${USER}`, + either by running `sh5util` as shown above or by simply removing those files. + +## Examples + +The following examples of `srun` profiling command lines are meant to replace the current `srun` +line within your job file. + +??? 
example "Create profiling data" + + (--acctg-freq is the sampling rate in seconds) + + ```console + # Energy and task profiling + srun --profile=All --acctg-freq=5,energy=5 -n 32 ./a.out + # Task profiling data only + srun --profile=All --acctg-freq=5 -n 32 ./a.out + ``` + +??? example "Merge the node local files" + + ... in `/lustre/scratch2/profiling/${USER}` to single file. + + ```console + # (without -o option output file defaults to job_$JOBID.h5) + sh5util -j <JOBID> -o profile.h5 + # in jobscripts or in interactive sessions (via salloc): + sh5util -j ${SLURM_JOBID} -o profile.h5 + ``` + +??? example "View data" + + ```console + marie@login$ module load HDFView + marie@login$ hdfview.sh profile.h5 + ``` + + +{: align="center"} + +More information about profiling with Slurm: + +- [Slurm Profiling](http://slurm.schedmd.com/hdf5_profile_user_guide.html) +- [`sh5util`](http://slurm.schedmd.com/sh5util.html) diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/system_taurus.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/system_taurus.md deleted file mode 100644 index 3625bf4503d4b41d73fc7a9de6c02dabc3d3feec..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/system_taurus.md +++ /dev/null @@ -1,210 +0,0 @@ -# Taurus - -## Information about the Hardware - -Detailed information on the current HPC hardware can be found -[here.](../jobs_and_resources/hardware_taurus.md) - -## Applying for Access to the System - -Project and login application forms for taurus are available -[here](../access/overview.md). - -## Login to the System - -Login to the system is available via ssh at taurus.hrsk.tu-dresden.de. -There are several login nodes (internally called tauruslogin3 to -tauruslogin6). Currently, if you use taurus.hrsk.tu-dresden.de, you will -be placed on tauruslogin5. It might be a good idea to give the other -login nodes a try if the load on tauruslogin5 is rather high (there will -once again be load balancer soon, but at the moment, there is none). - -Please note that if you store data on the local disk (e.g. under /tmp), -it will be on only one of the three nodes. If you relogin and the data -is not there, you are probably on another node. - -You can find an list of fingerprints [here](../access/key_fingerprints.md). - -## Transferring Data from/to Taurus - -taurus has two specialized data transfer nodes. Both nodes are -accessible via `taurusexport.hrsk.tu-dresden.de`. Currently, only rsync, -scp and sftp to these nodes will work. A login via SSH is not possible -as these nodes are dedicated to data transfers. - -These nodes are located behind a firewall. By default, they are only -accessible from IP addresses from with the Campus of the TU Dresden. -External IP addresses can be enabled upon request. These requests should -be send via eMail to `servicedesk@tu-dresden.de` and mention the IP -address range (or node names), the desired protocol and the time frame -that the firewall needs to be open. - -We are open to discuss options to export the data in the scratch file -system via CIFS or other protocols. If you have a need for this, please -contact the Service Desk as well. - -**Phase 2:** The nodes taurusexport\[3,4\] provide access to the -`/scratch` file system of the second phase. - -## Compiling Parallel Applications - -You have to explicitly load a compiler module and an MPI module on -Taurus. Eg. with `module load GCC OpenMPI`. 
( [read more about -Modules](../software/runtime_environment.md), **todo link** (read more about -Compilers)(Compendium.Compilers)) - -Use the wrapper commands like e.g. `mpicc` (`mpiicc` for intel), -`mpicxx` (`mpiicpc`) or `mpif90` (`mpiifort`) to compile MPI source -code. To reveal the command lines behind the wrappers, use the option -`-show`. - -For running your code, you have to load the same compiler and MPI module -as for compiling the program. Please follow the following guiedlines to -run your parallel program using the batch system. - -## Batch System - -Applications on an HPC system can not be run on the login node. They -have to be submitted to compute nodes with dedicated resources for the -user's job. Normally a job can be submitted with these data: - -- number of CPU cores, -- requested CPU cores have to belong on one node (OpenMP programs) or - can distributed (MPI), -- memory per process, -- maximum wall clock time (after reaching this limit the process is - killed automatically), -- files for redirection of output and error messages, -- executable and command line parameters. - -The batch system on Taurus is Slurm. If you are migrating from LSF -(deimos, mars, atlas), the biggest difference is that Slurm has no -notion of batch queues any more. - -- [General information on the Slurm batch system](slurm.md) -- Slurm also provides process-level and node-level [profiling of - jobs](slurm.md#Job_Profiling) - -### Partitions - -Please note that the islands are also present as partitions for the -batch systems. They are called - -- romeo (Island 7 - AMD Rome CPUs) -- julia (large SMP machine) -- haswell (Islands 4 to 6 - Haswell CPUs) -- gpu (Island 2 - GPUs) - - gpu2 (K80X) -- smp2 (SMP Nodes) - -**Note:** usually you don't have to specify a partition explicitly with -the parameter -p, because SLURM will automatically select a suitable -partition depending on your memory and gres requirements. - -### Run-time Limits - -**Run-time limits are enforced**. This means, a job will be canceled as -soon as it exceeds its requested limit. At Taurus, the maximum run time -is 7 days. - -Shorter jobs come with multiple advantages:\<img alt="part.png" -height="117" src="%ATTACHURL%/part.png" style="float: right;" -title="part.png" width="284" /> - -- lower risk of loss of computing time, -- shorter waiting time for reservations, -- higher job fluctuation; thus, jobs with high priorities may start - faster. - -To bring down the percentage of long running jobs we restrict the number -of cores with jobs longer than 2 days to approximately 50% and with jobs -longer than 24 to 75% of the total number of cores. (These numbers are -subject to changes.) As best practice we advise a run time of about 8h. - -Please always try to make a good estimation of your needed time limit. -For this, you can use a command line like this to compare the requested -timelimit with the elapsed time for your completed jobs that started -after a given date: - - sacct -X -S 2021-01-01 -E now --format=start,JobID,jobname,elapsed,timelimit -s COMPLETED - -Instead of running one long job, you should split it up into a chain -job. Even applications that are not capable of chreckpoint/restart can -be adapted. The HOWTO can be found [here](../jobs_and_resources/checkpoint_restart.md), - -### Memory Limits - -**Memory limits are enforced.** This means that jobs which exceed their -per-node memory limit will be killed automatically by the batch system. 
-Memory requirements for your job can be specified via the *sbatch/srun* -parameters: **--mem-per-cpu=\<MB>** or **--mem=\<MB>** (which is "memory -per node"). The **default limit** is **300 MB** per cpu. - -Taurus has sets of nodes with a different amount of installed memory -which affect where your job may be run. To achieve the shortest possible -waiting time for your jobs, you should be aware of the limits shown in -the following table. - -| Partition | Nodes | # Nodes | Cores per Node | Avail. Memory per Core | Avail. Memory per Node | GPUs per node | -|:-------------------|:-----------------------------------------|:--------|:----------------|:-----------------------|:-----------------------|:------------------| -| `haswell64` | `taurusi[4001-4104,5001-5612,6001-6612]` | `1328` | `24` | `2541 MB` | `61000 MB` | `-` | -| `haswell128` | `taurusi[4105-4188]` | `84` | `24` | `5250 MB` | `126000 MB` | `-` | -| `haswell256` | `taurusi[4189-4232]` | `44` | `24` | `10583 MB` | `254000 MB` | `-` | -| `broadwell` | `taurusi[4233-4264]` | `32` | `28` | `2214 MB` | `62000 MB` | `-` | -| `smp2` | `taurussmp[3-7]` | `5` | `56` | `36500 MB` | `2044000 MB` | `-` | -| `gpu2` | `taurusi[2045-2106]` | `62` | `24` | `2583 MB` | `62000 MB` | `4 (2 dual GPUs)` | -| `gpu2-interactive` | `taurusi[2045-2108]` | `64` | `24` | `2583 MB` | `62000 MB` | `4 (2 dual GPUs)` | -| `hpdlf` | `taurusa[3-16]` | `14` | `12` | `7916 MB` | `95000 MB` | `3` | -| `ml` | `taurusml[1-32]` | `32` | `44 (HT: 176)` | `1443 MB*` | `254000 MB` | `6` | -| `romeo` | `taurusi[7001-7192]` | `192` | `128 (HT: 256)` | `1972 MB*` | `505000 MB` | `-` | -| `julia` | `taurussmp8` | `1` | `896` | `27343 MB*` | `49000000 MB` | `-` | - -\* note that the ML nodes have 4way-SMT, so for every physical core -allocated (e.g., with SLURM_HINT=nomultithread), you will always get -4\*1443MB because the memory of the other threads is allocated -implicitly, too. - -### Submission of Parallel Jobs - -To run MPI jobs ensure that the same MPI module is loaded as during -compile-time. In doubt, check you loaded modules with `module list`. If -your code has been compiled with the standard `bullxmpi` installation, -you can load the module via `module load bullxmpi`. Alternative MPI -libraries (`intelmpi`, `openmpi`) are also available. - -Please pay attention to the messages you get loading the module. They -are more up-to-date than this manual. - -## GPUs - -Island 2 of taurus contains a total of 128 NVIDIA Tesla K80 (dual) GPUs -in 64 nodes. - -More information on how to program applications for GPUs can be found -[GPU Programming](GPU Programming). - -The following software modules on taurus offer GPU support: - -- `CUDA` : The NVIDIA CUDA compilers -- `PGI` : The PGI compilers with OpenACC support - -## Hardware for Deep Learning (HPDLF) - -The partition hpdlf contains 14 servers. Each of them has: - -- 2 sockets CPU E5-2603 v4 (1.70GHz) with 6 cores each, -- 3 consumer GPU cards NVIDIA GTX1080, -- 96 GB RAM. - -## Energy Measurement - -Taurus contains sophisticated energy measurement instrumentation. -Especially HDEEM is available on the haswell nodes of Phase II. More -detailed information can be found at -**todo link** (EnergyMeasurement)(EnergyMeasurement). - -## Low level optimizations - -x86 processsors provide registers that can be used for optimizations and -performance monitoring. Taurus provides you access to such features via -the **todo link** (X86Adapt)(X86Adapt) software infrastructure. 
diff --git a/doc.zih.tu-dresden.de/docs/legal_notice.md b/doc.zih.tu-dresden.de/docs/legal_notice.md index 3412a3a0a511d26d1a8bf8e730161622fb7930d9..a5e187ee3f5eb9937e8eb01c33eed182fb2c423d 100644 --- a/doc.zih.tu-dresden.de/docs/legal_notice.md +++ b/doc.zih.tu-dresden.de/docs/legal_notice.md @@ -1,8 +1,10 @@ -# Legal Notice / Impressum +# Legal Notice + +## Impressum Es gilt das [Impressum der TU Dresden](https://tu-dresden.de/impressum) mit folgenden Änderungen: -## Ansprechpartner/Betreiber: +### Ansprechpartner/Betreiber: Technische Universität Dresden Zentrum für Informationsdienste und Hochleistungsrechnen @@ -12,7 +14,7 @@ Tel.: +49 351 463-40000 Fax: +49 351 463-42328 E-Mail: servicedesk@tu-dresden.de -## Konzeption, Technische Umsetzung, Anbieter: +### Konzeption, Technische Umsetzung, Anbieter: Technische Universität Dresden Zentrum für Informationsdienste und Hochleistungsrechnen @@ -22,3 +24,10 @@ Prof. Dr. Wolfgang E. Nagel Tel.: +49 351 463-35450 Fax: +49 351 463-37773 E-Mail: zih@tu-dresden.de + +## License + +This documentation and the repository have two licenses: + +* All documentation is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). +* All software components are licensed under MIT license. diff --git a/doc.zih.tu-dresden.de/docs/misc/HPC-Introduction.pdf b/doc.zih.tu-dresden.de/docs/misc/HPC-Introduction.pdf new file mode 100644 index 0000000000000000000000000000000000000000..71d47f04b75004fad2b9fd7181051c2beae4e2fe Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/misc/HPC-Introduction.pdf differ diff --git a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md index 59aa75e842e3875f99d458caec785c6bf9645a81..df7fc8b56a8a015b5a13a8c871b5163b2c1d473d 100644 --- a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md +++ b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md @@ -1,30 +1,27 @@ -# Big Data Frameworks: Apache Spark, Apache Flink, Apache Hadoop - -!!! note - - This page is under construction +# Big Data Analytics [Apache Spark](https://spark.apache.org/), [Apache Flink](https://flink.apache.org/) and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating -Big Data. These frameworks are also offered as software [modules](modules.md) on both `ml` and -`scs5` partition. You can check module versions and availability with the command - -```console -marie@login$ module av Spark -``` - -The **aim** of this page is to introduce users on how to start working with -these frameworks on ZIH systems, e. g. on the [HPC-DA](../jobs_and_resources/hpcda.md) system. +Big Data. These frameworks are also offered as software [modules](modules.md) in both `ml` and +`scs5` software environments. You can check module versions and availability with the command + +=== "Spark" + ```console + marie@login$ module avail Spark + ``` +=== "Flink" + ```console + marie@login$ module avail Flink + ``` **Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH systems and basic knowledge about data analysis and the batch system [Slurm](../jobs_and_resources/slurm.md). -The usage of Big Data frameworks is -different from other modules due to their master-worker approach. That -means, before an application can be started, one has to do additional steps. -In the following, we assume that a Spark application should be -started. 
+The usage of Big Data frameworks is different from other modules due to their master-worker +approach. That means, before an application can be started, one has to do additional steps. +In the following, we assume that a Spark application should be started and give alternative +commands for Flink where applicable. The steps are: @@ -34,190 +31,247 @@ The steps are: 1. Start the Spark application Apache Spark can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as well -as via [Jupyter notebook](#jupyter-notebook). All three ways are outlined in the following. - -!!! note - - It is recommended to use ssh keys to avoid entering the password - every time to log in to nodes. For the details, please check the - [external documentation](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-ssh-configuration-keypairs). +as via [Jupyter notebooks](#jupyter-notebook). All three ways are outlined in the following. ## Interactive Jobs ### Default Configuration -The Spark module is available for both `scs5` and `ml` partitions. -Thus, Spark can be executed using different CPU architectures, e.g., Haswell and Power9. - -Let us assume that two nodes should be used for the computation. Use a -`srun` command similar to the following to start an interactive session -using the Haswell partition. The following code snippet shows a job submission -to Haswell nodes with an allocation of two nodes with 60 GB main memory -exclusively for one hour: - -```console -marie@login$ srun --partition=haswell -N 2 --mem=60g --exclusive --time=01:00:00 --pty bash -l -``` +The Spark and Flink modules are available in both `scs5` and `ml` environments. +Thus, Spark and Flink can be executed using different CPU architectures, e.g., Haswell and Power9. -The command for different resource allocation on the `ml` partition is -similar, e. g. for a job submission to `ml` nodes with an allocation of one -node, one task per node, two CPUs per task, one GPU per node, with 10000 MB for one hour: +Let us assume that two nodes should be used for the computation. Use a `srun` command similar to +the following to start an interactive session using the partition haswell. The following code +snippet shows a job submission to haswell nodes with an allocation of two nodes with 60000 MB main +memory exclusively for one hour: ```console -marie@login$ srun --partition=ml -N 1 -n 1 -c 2 --gres=gpu:1 --mem-per-cpu=10000 --time=01:00:00 --pty bash +marie@login$ srun --partition=haswell --nodes=2 --mem=60000M --exclusive --time=01:00:00 --pty bash -l ``` -Once you have the shell, load Spark using the command +Once you have the shell, load desired Big Data framework using the command + +=== "Spark" + ```console + marie@compute$ module load Spark + ``` +=== "Flink" + ```console + marie@compute$ module load Flink + ``` + +Before the application can be started, the cluster with the allocated nodes needs to be set up. To +do this, configure the cluster first using the configuration template at `$SPARK_HOME/conf` for +Spark or `$FLINK_ROOT_DIR/conf` for Flink: + +=== "Spark" + ```console + marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf + ``` +=== "Flink" + ```console + marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf + ``` + +This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your `home` +directory, where `<JOB_ID>` stands for the id of the Slurm job. 
After that, you can start in +the usual way: + +=== "Spark" + ```console + marie@compute$ start-all.sh + ``` +=== "Flink" + ```console + marie@compute$ start-cluster.sh + ``` + +The necessary background processes should now be set up and you can start your application, e. g.: + +=== "Spark" + ```console + marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi \ + $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000 + ``` +=== "Flink" + ```console + marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar + ``` -```console -marie@compute$ module load Spark -``` +!!! warning -Before the application can be started, the Spark cluster needs to be set -up. To do this, configure Spark first using configuration template at -`$SPARK_HOME/conf`: + Do not delete the directory `cluster-conf-<JOB_ID>` while the job is still + running. This leads to errors. -```console -marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf -``` +### Custom Configuration -This places the configuration in a directory called -`cluster-conf-<JOB_ID>` in your `home` directory, where `<JOB_ID>` stands -for the id of the Slurm job. After that, you can start Spark in the -usual way: +The script `framework-configure.sh` is used to derive a configuration from a template. It takes two +parameters: -```console -marie@compute$ start-all.sh -``` +- The framework to set up (parameter `spark` for Spark, `flink` for Flink, and `hadoop` for Hadoop) +- A configuration template -The Spark processes should now be set up and you can start your -application, e. g.: +Thus, you can modify the configuration by replacing the default configuration template with a +customized one. This way, your custom configuration template is reusable for different jobs. You +can start with a copy of the default configuration ahead of your interactive session: + +=== "Spark" + ```console + marie@login$ cp -r $SPARK_HOME/conf my-config-template + ``` +=== "Flink" + ```console + marie@login$ cp -r $FLINK_ROOT_DIR/conf my-config-template + ``` + +After you have changed `my-config-template`, you can use your new template in an interactive job +with: + +=== "Spark" + ```console + marie@compute$ source framework-configure.sh spark my-config-template + ``` +=== "Flink" + ```console + marie@compute$ source framework-configure.sh flink my-config-template + ``` -```console -marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000 -``` +### Using Hadoop Distributed Filesystem (HDFS) -!!! warning +If you want to use Spark and HDFS together (or in general more than one framework), a scheme +similar to the following can be used: + +=== "Spark" + ```console + marie@compute$ module load Hadoop + marie@compute$ module load Spark + marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop + marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf + marie@compute$ start-dfs.sh + marie@compute$ start-all.sh + ``` +=== "Flink" + ```console + marie@compute$ module load Hadoop + marie@compute$ module load Flink + marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop + marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf + marie@compute$ start-dfs.sh + marie@compute$ start-cluster.sh + ``` - Do not delete the directory `cluster-conf-<JOB_ID>` while the job is still - running. This leads to errors. 
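+When you are done, you can stop the daemons again before your allocation ends. This is just a
+sketch using the stop counterparts of the start scripts above (`stop-dfs.sh` is only needed if you
+also started HDFS):
+
+=== "Spark"
+    ```console
+    marie@compute$ stop-all.sh
+    marie@compute$ stop-dfs.sh
+    ```
+=== "Flink"
+    ```console
+    marie@compute$ stop-cluster.sh
+    marie@compute$ stop-dfs.sh
+    ```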
+## Batch Jobs -### Custom Configuration +Using `srun` directly on the shell blocks the shell and launches an interactive job. Apart from +short test runs, it is **recommended to launch your jobs in the background using batch jobs**. For +that, you can conveniently put the parameters directly into the job file and submit it via +`sbatch [options] <job file>`. -The script `framework-configure.sh` is used to derive a configuration from -a template. It takes two parameters: +Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration, similar to the +example below: -- The framework to set up (Spark, Flink, Hadoop) -- A configuration template +??? example "example-starting-script.sbatch" + === "Spark" + ```bash + #!/bin/bash -l + #SBATCH --time=01:00:00 + #SBATCH --partition=haswell + #SBATCH --nodes=2 + #SBATCH --exclusive + #SBATCH --mem=60000M + #SBATCH --job-name="example-spark" -Thus, you can modify the configuration by replacing the default -configuration template with a customized one. This way, your custom -configuration template is reusable for different jobs. You can start -with a copy of the default configuration ahead of your interactive -session: + module load Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0 -```console -marie@login$ cp -r $SPARK_HOME/conf my-config-template -``` + function myExitHandler () { + stop-all.sh + } -After you have changed `my-config-template`, you can use your new template -in an interactive job with: + #configuration + . framework-configure.sh spark $SPARK_HOME/conf -```console -marie@compute$ source framework-configure.sh spark my-config-template -``` + #register cleanup hook in case something goes wrong + trap myExitHandler EXIT -### Using Hadoop Distributed Filesystem (HDFS) + start-all.sh -If you want to use Spark and HDFS together (or in general more than one -framework), a scheme similar to the following can be used: + spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000 -```console -marie@compute$ module load Hadoop -marie@compute$ module load Spark -marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop -marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf -marie@compute$ start-dfs.sh -marie@compute$ start-all.sh -``` + stop-all.sh -## Batch Jobs + exit 0 + ``` + === "Flink" + ```bash + #!/bin/bash -l + #SBATCH --time=01:00:00 + #SBATCH --partition=haswell + #SBATCH --nodes=2 + #SBATCH --exclusive + #SBATCH --mem=60000M + #SBATCH --job-name="example-flink" -Using `srun` directly on the shell blocks the shell and launches an -interactive job. Apart from short test runs, it is **recommended to -launch your jobs in the background using batch jobs**. For that, you can -conveniently put the parameters directly into the job file and submit it via -`sbatch [options] <job file>`. + module load Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0 -Please use a [batch job](../jobs_and_resources/slurm.md) similar to -[example-spark.sbatch](misc/example-spark.sbatch). + function myExitHandler () { + stop-cluster.sh + } -## Jupyter Notebook + #configuration + . framework-configure.sh flink $FLINK_ROOT_DIR/conf -There are two general options on how to work with Jupyter notebooks: -There is [JupyterHub](../access/jupyterhub.md), where you can simply -run your Jupyter notebook on HPC nodes (the preferable way). Also, you -can run a remote Jupyter server manually within a GPU job using -the modules and packages you need. 
You can find the manual server -setup [here](deep_learning.md). + #register cleanup hook in case something goes wrong + trap myExitHandler EXIT -### Preparation + #start the cluster + start-cluster.sh -If you want to run Spark in Jupyter notebooks, you have to prepare it first. This is comparable -to the [description for custom environments](../access/jupyterhub.md#conda-environment). -You start with an allocation: + #run your application + flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar -```console -marie@login$ srun --pty -n 1 -c 2 --mem-per-cpu=2500 -t 01:00:00 bash -l -``` + #stop the cluster + stop-cluster.sh -When a node is allocated, install the required package with Anaconda: + exit 0 + ``` -```console -marie@compute$ module load Anaconda3 -marie@compute$ cd -marie@compute$ mkdir user-kernel -marie@compute$ conda create --prefix $HOME/user-kernel/haswell-py3.6-spark python=3.6 -Collecting package metadata: done -Solving environment: done [...] +## Jupyter Notebook -marie@compute$ conda activate $HOME/user-kernel/haswell-py3.6-spark -marie@compute$ conda install ipykernel -Collecting package metadata: done -Solving environment: done [...] +You can run Jupyter notebooks with Spark and Flink on the ZIH systems in a similar way as described +on the [JupyterHub](../access/jupyterhub.md) page. -marie@compute$ python -m ipykernel install --user --name haswell-py3.6-spark --display-name="haswell-py3.6-spark" -Installed kernelspec haswell-py3.6-spark in [...] +### Spawning a Notebook -marie@compute$ conda install -c conda-forge findspark -marie@compute$ conda install pyspark +Go to [https://taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter). +In the tab "Advanced", go to the field "Preload modules" and select the following Spark or Flink +module: -marie@compute$ conda deactivate -``` +=== "Spark" + ``` + Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0 + ``` +=== "Flink" + ``` + Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0 + ``` -You are now ready to spawn a notebook with Spark. +When your Jupyter instance is started, you can set up Spark/Flink. Since the setup in the notebook +requires more steps than in an interactive session, we have created example notebooks that you can +use as a starting point for convenience: [SparkExample.ipynb](misc/SparkExample.ipynb), +[FlinkExample.ipynb](misc/FlinkExample.ipynb) -### Spawning a Notebook +!!! warning -Assuming that you have prepared everything as described above, you can go to -[https://taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter). -In the tab "Advanced", go -to the field "Preload modules" and select one of the Spark modules. -When your Jupyter instance is started, check whether the kernel that -you created in the preparation phase (see above) is shown in the top -right corner of the notebook. If it is not already selected, select the -kernel `haswell-py3.6-spark`. Then, you can set up Spark. Since the setup -in the notebook requires more steps than in an interactive session, we -have created an example notebook that you can use as a starting point -for convenience: [SparkExample.ipynb](misc/SparkExample.ipynb) + The notebooks only work with the Spark or Flink module mentioned above. When using other + Spark/Flink modules, it is possible that you have to do additional or other steps in order to + make Spark/Flink running. !!! 
note - You could work with simple examples in your home directory but according to the - [storage concept](../data_lifecycle/overview.md) - **please use [workspaces](../data_lifecycle/workspaces.md) for - your study and work projects**. For this reason, you have to use - advanced options of Jupyterhub and put "/" in "Workspace scope" field. + You could work with simple examples in your home directory, but, according to the + [storage concept](../data_lifecycle/overview.md), **please use + [workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. For this + reason, you have to use advanced options of Jupyterhub and put "/" in "Workspace scope" field. ## FAQ @@ -231,10 +285,9 @@ re-login to the ZIH system. Q: There are a lot of errors and warnings during the set up of the session -A: Please check the work capability on a simple example. The source of -warnings could be ssh etc, and it could be not affecting the frameworks +A: Please check the work capability on a simple example as shown in this documentation. !!! help - If you have questions or need advice, please see - [https://www.scads.de/transfer-2/beratung-und-support-en/](https://www.scads.de/transfer-2/beratung-und-support-en/) or contact the HPC support. + If you have questions or need advice, please use the contact form on + [https://scads.ai/contact/](https://scads.ai/contact/) or contact the HPC support. diff --git a/doc.zih.tu-dresden.de/docs/software/building_software.md b/doc.zih.tu-dresden.de/docs/software/building_software.md index 33aeaf919fa3bad56c14fdb6aa130eafac6c0d5d..c83932a16c1c0227cb160d4853cd1815626fc404 100644 --- a/doc.zih.tu-dresden.de/docs/software/building_software.md +++ b/doc.zih.tu-dresden.de/docs/software/building_software.md @@ -1,42 +1,39 @@ # Building Software -While it is possible to do short compilations on the login nodes, it is -generally considered good practice to use a job for that, especially -when using many parallel make processes. Note that starting on December -6th 2016, the /projects file system will be mounted read-only on all -compute nodes in order to prevent users from doing large I/O there -(which is what the /scratch is for). In consequence, you cannot compile -in /projects within a job anymore. If you wish to install software for -your project group anyway, you can use a build directory in the /scratch -file system instead: - -Every sane build system should allow you to keep your source code tree -and your build directory separate, some even demand them to be different -directories. Plus, you can set your installation prefix (the target -directory) back to your /projects folder and do the "make install" step -on the login nodes. - -For instance, when using CMake and keeping your source in /projects, you -could do the following: - - # save path to your source directory: - export SRCDIR=/projects/p_myproject/mysource - - # create a build directory in /scratch: - mkdir /scratch/p_myproject/mysoftware_build - - # change to build directory within /scratch: - cd /scratch/p_myproject/mysoftware_build - - # create Makefiles: - cmake -DCMAKE_INSTALL_PREFIX=/projects/p_myproject/mysoftware $SRCDIR - - # build in a job: - srun --mem-per-cpu=1500 -c 12 --pty make -j 12 - - # do the install step on the login node again: - make install - -As a bonus, your compilation should also be faster in the parallel -/scratch file system than it would be in the comparatively slow -NFS-based /projects file system. 
+While it is possible to do short compilations on the login nodes, it is generally considered good +practice to use a job for that, especially when using many parallel make processes. Since 2016, +the `/projects` filesystem is mounted read-only on all compute +nodes in order to prevent users from doing large I/O there (which is what the `/scratch` is for). +In consequence, you cannot compile in `/projects` within a job. If you wish to install +software for your project group anyway, you can use a build directory in the `/scratch` filesystem +instead. + +Every sane build system should allow you to keep your source code tree and your build directory +separate, some even demand them to be different directories. Plus, you can set your installation +prefix (the target directory) back to your `/projects` folder and do the "make install" step on the +login nodes. + +For instance, when using CMake and keeping your source in `/projects`, you could do the following: + +```console +# save path to your source directory: +marie@login$ export SRCDIR=/projects/p_marie/mysource + +# create a build directory in /scratch: +marie@login$ mkdir /scratch/p_marie/mysoftware_build + +# change to build directory within /scratch: +marie@login$ cd /scratch/p_marie/mysoftware_build + +# create Makefiles: +marie@login$ cmake -DCMAKE_INSTALL_PREFIX=/projects/p_marie/mysoftware $SRCDIR + +# build in a job: +marie@login$ srun --mem-per-cpu=1500 --cpus-per-task=12 --pty make -j 12 + +# do the install step on the login node again: +marie@login$ make install +``` + +As a bonus, your compilation should also be faster in the parallel `/scratch` filesystem than it +would be in the comparatively slow NFS-based `/projects` filesystem. diff --git a/doc.zih.tu-dresden.de/docs/software/cfd.md b/doc.zih.tu-dresden.de/docs/software/cfd.md index 492cb96d24f3761e2820fdba34eaa6b0a35db320..62ed65116e51ae8bbb593664f4bc48a3373d3a41 100644 --- a/doc.zih.tu-dresden.de/docs/software/cfd.md +++ b/doc.zih.tu-dresden.de/docs/software/cfd.md @@ -16,7 +16,7 @@ The OpenFOAM (Open Field Operation and Manipulation) CFD Toolbox can simulate an fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics, electromagnetics and the pricing of financial options. OpenFOAM is developed primarily by [OpenCFD Ltd](https://www.openfoam.com) and is freely available and open-source, -licensed under the GNU General Public Licence. +licensed under the GNU General Public License. The command `module spider OpenFOAM` provides the list of installed OpenFOAM versions. In order to use OpenFOAM, it is mandatory to set the environment by sourcing the `bashrc` (for users running @@ -42,7 +42,7 @@ marie@login$ # source $FOAM_CSH module load OpenFOAM source $FOAM_BASH cd /scratch/ws/1/marie-example-workspace # work directory using workspace - srun pimpleFoam -parallel > "$OUTFILE" + srun pimpleFoam -parallel > "$OUTFILE" ``` ## Ansys CFX @@ -62,7 +62,7 @@ geometry and mesh generator cfx5pre, and the post-processor cfx5post. 
module load ANSYS cd /scratch/ws/1/marie-example-workspace # work directory using workspace - cfx-parallel.sh -double -def StaticMixer.def + cfx-parallel.sh -double -def StaticMixer.def ``` ## Ansys Fluent diff --git a/doc.zih.tu-dresden.de/docs/software/compilers.md b/doc.zih.tu-dresden.de/docs/software/compilers.md index 4292602e02e77bf01ad04c8c01643aadcc8c580a..7bb9c3c4b9f3a65151d5292ff587decd306e35c9 100644 --- a/doc.zih.tu-dresden.de/docs/software/compilers.md +++ b/doc.zih.tu-dresden.de/docs/software/compilers.md @@ -55,10 +55,10 @@ pages or use the option `--help` to list all options of the compiler. | `-fprofile-use` | `-prof-use` | `-Mpfo` | use profile data for optimization | !!! note - We can not generally give advice as to which option should be used. - To gain maximum performance please test the compilers and a few combinations of - optimization flags. - In case of doubt, you can also contact [HPC support](../support.md) and ask the staff for help. + + We can not generally give advice as to which option should be used. To gain maximum performance + please test the compilers and a few combinations of optimization flags. In case of doubt, you + can also contact [HPC support](../support/support.md) and ask the staff for help. ### Architecture-specific Optimizations diff --git a/doc.zih.tu-dresden.de/docs/software/containers.md b/doc.zih.tu-dresden.de/docs/software/containers.md index a67a4a986881ffe09a16582adfeda719e6f90ccd..d15535933ef7f2b9e0330d07e35168f10fc22ded 100644 --- a/doc.zih.tu-dresden.de/docs/software/containers.md +++ b/doc.zih.tu-dresden.de/docs/software/containers.md @@ -2,94 +2,112 @@ [Containerization](https://www.ibm.com/cloud/learn/containerization) encapsulating or packaging up software code and all its dependencies to run uniformly and consistently on any infrastructure. On -Taurus [Singularity](https://sylabs.io/) used as a standard container solution. Singularity enables -users to have full control of their environment. This means that you don’t have to ask an HPC -support to install anything for you - you can put it in a Singularity container and run! As opposed -to Docker (the most famous container solution), Singularity is much more suited to being used in an -HPC environment and more efficient in many cases. Docker containers can easily be used in -Singularity. Information about the use of Singularity on Taurus can be found [here]**todo link**. - -In some cases using Singularity requires a Linux machine with root privileges (e.g. using the ml -partition), the same architecture and a compatible kernel. For many reasons, users on Taurus cannot -be granted root permissions. A solution is a Virtual Machine (VM) on the ml partition which allows -users to gain root permissions in an isolated environment. There are two main options on how to work -with VM on Taurus: - -1. [VM tools]**todo link**. Automative algorithms for using virtual machines; -1. [Manual method]**todo link**. It required more operations but gives you more flexibility and reliability. +ZIH systems [Singularity](https://sylabs.io/) is used as a standard container solution. Singularity +enables users to have full control of their environment. This means that you don’t have to ask the +HPC support to install anything for you - you can put it in a Singularity container and run! As +opposed to Docker (the most famous container solution), Singularity is much more suited to being +used in an HPC environment and more efficient in many cases. Docker containers can easily be used in +Singularity. 
Information about the use of Singularity on ZIH systems can be found on this page.
+
+In some cases using Singularity requires a Linux machine with root privileges (e.g. using the
+partition `ml`), the same architecture and a compatible kernel. For many reasons, users on ZIH
+systems cannot be granted root permissions. A solution is a Virtual Machine (VM) on the partition
+`ml` which allows users to gain root permissions in an isolated environment. There are two main
+options on how to work with Virtual Machines on ZIH systems:
+
+1. [VM tools](virtual_machines_tools.md): Automated algorithms for using virtual machines;
+1. [Manual method](virtual_machines.md): It requires more operations but gives you more flexibility
+   and reliability.

 ## Singularity

-If you wish to containerize your workflow/applications, you can use Singularity containers on
-Taurus. As opposed to Docker, this solution is much more suited to being used in an HPC environment.
-Existing Docker containers can easily be converted.
+If you wish to containerize your workflow and/or applications, you can use Singularity containers on
+ZIH systems. As opposed to Docker, this solution is much more suited to being used in an HPC
+environment.

-ZIH wiki sites:
+!!! note

-- [Example Definitions](singularity_example_definitions.md)
-- [Building Singularity images on Taurus](vm_tools.md)
-- [Hints on Advanced usage](singularity_recipe_hints.md)
+    It is not possible for users to generate new custom containers on ZIH systems directly, because
+    creating a new container requires root privileges.

-It is available on Taurus without loading any module.
+However, new containers can be created on your local workstation and moved to ZIH systems for
+execution. Follow the instructions for [locally installing Singularity](#local-installation) and
+[container creation](#container-creation). Moreover, existing Docker containers can easily be
+converted, see [Import a Docker container](#import-a-docker-container).

-### Local installation
+If you are already familiar with Singularity, you might be more interested in our [singularity
+recipes and hints](singularity_recipe_hints.md).

-One advantage of containers is that you can create one on a local machine (e.g. your laptop) and
-move it to the HPC system to execute it there. This requires a local installation of singularity.
-The easiest way to do so is:
+### Local Installation

-1. Check if go is installed by executing `go version`. If it is **not**:
+The local installation of Singularity comprises two steps: Make `go` available and then follow the
+instructions from the official documentation to install Singularity.

-```Bash
-wget <https://storage.googleapis.com/golang/getgo/installer_linux> && chmod +x
-installer_linux && ./installer_linux && source $HOME/.bash_profile
-```
+1. Check if `go` is installed by executing `go version`. If it is **not**:

-1. Follow the instructions to [install Singularity](https://github.com/sylabs/singularity/blob/master/INSTALL.md#clone-the-repo)
+    ```console
+    marie@local$ wget https://storage.googleapis.com/golang/getgo/installer_linux && chmod +x \
+    installer_linux && ./installer_linux && source $HOME/.bash_profile
+    ```

-clone the repo
+1. 
Instructions to + [install Singularity](https://github.com/sylabs/singularity/blob/master/INSTALL.md#clone-the-repo) + from the official documentation: -```Bash -mkdir -p ${GOPATH}/src/github.com/sylabs && cd ${GOPATH}/src/github.com/sylabs && git clone <https://github.com/sylabs/singularity.git> && cd -singularity -``` + Clone the repository -Checkout the version you want (see the [Github releases page](https://github.com/sylabs/singularity/releases) -for available releases), e.g. + ```console + marie@local$ mkdir -p ${GOPATH}/src/github.com/sylabs + marie@local$ cd ${GOPATH}/src/github.com/sylabs + marie@local$ git clone https://github.com/sylabs/singularity.git + marie@local$ cd singularity + ``` -```Bash -git checkout v3.2.1\ -``` + Checkout the version you want (see the [GitHub releases page](https://github.com/sylabs/singularity/releases) + for available releases), e.g. -Build and install + ```console + marie@local$ git checkout v3.2.1 + ``` -```Bash -cd ${GOPATH}/src/github.com/sylabs/singularity && ./mconfig && cd ./builddir && make && sudo -make install -``` + Build and install -### Container creation + ```console + marie@local$ cd ${GOPATH}/src/github.com/sylabs/singularity + marie@local$ ./mconfig && cd ./builddir && make + marie@local$ sudo make install + ``` -Since creating a new container requires access to system-level tools and thus root privileges, it is -not possible for users to generate new custom containers on Taurus directly. You can, however, -import an existing container from, e.g., Docker. +### Container Creation -In case you wish to create a new container, you can do so on your own local machine where you have -the necessary privileges and then simply copy your container file to Taurus and use it there. +!!! note -This does not work on our **ml** partition, as it uses Power9 as its architecture which is -different to the x86 architecture in common computers/laptops. For that you can use the -[VM Tools](vm_tools.md). + It is not possible for users to generate new custom containers on ZIH systems directly, because + creating a new container requires root privileges. -#### Creating a container +There are two possibilities: -Creating a container is done by writing a definition file and passing it to +1. Create a new container on your local workstation (where you have the necessary privileges), and + then copy the container file to ZIH systems for execution. +1. You can, however, import an existing container from, e.g., Docker. -```Bash -singularity build myContainer.sif myDefinition.def -``` +Both methods are outlined in the following. + +#### New Custom Container + +You can create a new custom container on your workstation, if you have root rights. + +!!! attention "Respect the micro-architectures" -NOTE: This must be done on a machine (or [VM](virtual_machines.md) with root rights. + You cannot create containers for the partition `ml`, as it bases on Power9 micro-architecture + which is different to the x86 architecture in common computers/laptops. For that you can use + the [VM Tools](virtual_machines_tools.md). + +Creating a container is done by writing a **definition file** and passing it to + +```console +marie@local$ singularity build myContainer.sif <myDefinition.def> +``` A definition file contains a bootstrap [header](https://sylabs.io/guides/3.2/user-guide/definition_files.html#header) @@ -99,20 +117,26 @@ where you install your software. The most common approach is to start from an existing docker image from DockerHub. 
For example, to start from an [Ubuntu image](https://hub.docker.com/_/ubuntu) copy the following into a new file -called ubuntu.def (or any other filename of your choosing) +called `ubuntu.def` (or any other filename of your choice) -```Bash -Bootstrap: docker<br />From: ubuntu:trusty<br /><br />%runscript<br /> echo "This is what happens when you run the container..."<br /><br />%post<br /> apt-get install g++ +```bash +Bootstrap: docker +From: ubuntu:trusty + +%runscript + echo "This is what happens when you run the container..." + +%post + apt-get install g++ ``` -Then you can call: +Then you can call -```Bash -singularity build ubuntu.sif ubuntu.def +```console +marie@local$ singularity build ubuntu.sif ubuntu.def ``` And it will install Ubuntu with g++ inside your container, according to your definition file. - More bootstrap options are available. The following example, for instance, bootstraps a basic CentOS 7 image. @@ -131,23 +155,25 @@ Include: yum ``` More examples of definition files can be found at -https://github.com/singularityware/singularity/tree/master/examples +https://github.com/singularityware/singularity/tree/master/examples. + +#### Import a Docker Container + +!!! hint -#### Importing a docker container + As opposed to bootstrapping a container, importing from Docker does **not require root + privileges** and therefore works on ZIH systems directly. You can import an image directly from the Docker repository (Docker Hub): -```Bash -singularity build my-container.sif docker://ubuntu:latest +```console +marie@local$ singularity build my-container.sif docker://ubuntu:latest ``` -As opposed to bootstrapping a container, importing from Docker does **not require root privileges** -and therefore works on Taurus directly. - -Creating a singularity container directly from a local docker image is possible but not recommended. -Steps: +Creating a singularity container directly from a local docker image is possible but not +recommended. The steps are: -```Bash +```console # Start a docker registry $ docker run -d -p 5000:5000 --restart=always --name registry registry:2 @@ -165,109 +191,122 @@ From: alpine $ singularity build --nohttps alpine.sif example.def ``` -#### Starting from a Dockerfile +#### Start from a Dockerfile -As singularity definition files and Dockerfiles are very similar you can start creating a definition +As Singularity definition files and Dockerfiles are very similar you can start creating a definition file from an existing Dockerfile by "translating" each section. -There are tools to automate this. One of them is \<a -href="<https://github.com/singularityhub/singularity-cli>" -target="\_blank">spython\</a> which can be installed with \`pip\` (add -\`--user\` if you don't want to install it system-wide): +There are tools to automate this. One of them is +[spython](https://github.com/singularityhub/singularity-cli) which can be installed with `pip` +(add `--user` if you don't want to install it system-wide): -`pip3 install -U spython` +```console +marie@local$ pip3 install -U spython +``` + +With this you can simply issue the following command to convert a Dockerfile in the current folder +into a singularity definition file: + +```console +marie@local$ spython recipe Dockerfile myDefinition.def +``` -With this you can simply issue the following command to convert a -Dockerfile in the current folder into a singularity definition file: +Please **verify** your generated definition and adjust where required! 
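
To illustrate what such a conversion looks like, consider a small, hypothetical `Dockerfile`
(the base image, packages and file names here are only placeholders for this sketch):

```bash
# Dockerfile (hypothetical example)
FROM ubuntu:20.04

ENV MY_GREETING="hello"

RUN apt-get update
RUN apt-get install -y g++

CMD ["echo", "Hello from the container"]
```

A Singularity definition file generated from it would look roughly like the following; the exact
output depends on the `spython` version, which is why the manual verification mentioned above is
important:

```bash
# generated definition file (roughly; verify and adjust!)
Bootstrap: docker
From: ubuntu:20.04

%post
    apt-get update
    apt-get install -y g++

%environment
    export MY_GREETING="hello"

%runscript
    exec echo "Hello from the container" "$@"
```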
-`spython recipe Dockerfile myDefinition.def<br />` +There are some notable changes between Singularity definitions and Dockerfiles: -Now please **verify** your generated definition and adjust where -required! +1. Command chains in Dockerfiles (`apt-get update && apt-get install foo`) must be split into + separate commands (`apt-get update; apt-get install foo`). Otherwise a failing command before the + ampersand is considered "checked" and does not fail the build. +1. The environment variables section in Singularity is only set on execution of the final image, not + during the build as with Docker. So `*ENV*` sections from Docker must be translated to an entry + in the `%environment` section and **additionally** set in the `%runscript` section if the + variable is used there. +1. `*VOLUME*` sections from Docker cannot be represented in Singularity containers. Use the runtime + option \`-B\` to bind folders manually. +1. `CMD` and `ENTRYPOINT` from Docker do not have a direct representation in Singularity. + The closest is to check if any arguments are given in the `%runscript` section and call the + command from `ENTRYPOINT` with those, if none are given call `ENTRYPOINT` with the + arguments of `CMD`: -There are some notable changes between singularity definitions and -Dockerfiles: 1 Command chains in Dockerfiles (\`apt-get update && -apt-get install foo\`) must be split into separate commands (\`apt-get -update; apt-get install foo). Otherwise a failing command before the -ampersand is considered "checked" and does not fail the build. 1 The -environment variables section in Singularity is only set on execution of -the final image, not during the build as with Docker. So \`*ENV*\` -sections from Docker must be translated to an entry in the -*%environment* section and **additionally** set in the *%runscript* -section if the variable is used there. 1 \`*VOLUME*\` sections from -Docker cannot be represented in Singularity containers. Use the runtime -option \`-B\` to bind folders manually. 1 *\`CMD\`* and *\`ENTRYPOINT\`* -from Docker do not have a direct representation in Singularity. The -closest is to check if any arguments are given in the *%runscript* -section and call the command from \`*ENTRYPOINT*\` with those, if none -are given call \`*ENTRYPOINT*\` with the arguments of \`*CMD*\`: -\<verbatim>if \[ $# -gt 0 \]; then \<ENTRYPOINT> "$@" else \<ENTRYPOINT> -\<CMD> fi\</verbatim> + ```bash + if [ $# -gt 0 ]; then + <ENTRYPOINT> "$@" + else + <ENTRYPOINT> <CMD> + fi + ``` -### Using the containers +### Use the Containers -#### Entering a shell in your container +#### Enter a Shell in Your Container A read-only shell can be entered as follows: -```Bash -singularity shell my-container.sif +```console +marie@login$ singularity shell my-container.sif ``` -**IMPORTANT:** In contrast to, for instance, Docker, this will mount various folders from the host -system including $HOME. This may lead to problems with, e.g., Python that stores local packages in -the home folder, which may not work inside the container. It also makes reproducibility harder. It -is therefore recommended to use `--contain/-c` to not bind $HOME (and others like `/tmp`) -automatically and instead set up your binds manually via `-B` parameter. Example: +!!! note -```Bash -singularity shell --contain -B /scratch,/my/folder-on-host:/folder-in-container my-container.sif -``` + In contrast to, for instance, Docker, this will mount various folders from the host system + including $HOME. 
This may lead to problems with, e.g., Python that stores local packages in the + home folder, which may not work inside the container. It also makes reproducibility harder. It + is therefore recommended to use `--contain/-c` to not bind `$HOME` (and others like `/tmp`) + automatically and instead set up your binds manually via `-B` parameter. Example: + + ```console + marie@login$ singularity shell --contain -B /scratch,/my/folder-on-host:/folder-in-container my-container.sif + ``` You can write into those folders by default. If this is not desired, add an `:ro` for read-only to the bind specification (e.g. `-B /scratch:/scratch:ro\`). Note that we already defined bind paths for `/scratch`, `/projects` and `/sw` in our global `singularity.conf`, so you needn't use the `-B` parameter for those. -If you wish, for instance, to install additional packages, you have to use the `-w` parameter to -enter your container with it being writable. This, again, must be done on a system where you have +If you wish to install additional packages, you have to use the `-w` parameter to +enter your container with it being writable. This, again, must be done on a system where you have the necessary privileges, otherwise you can only edit files that your user has the permissions for. E.g: -```Bash -singularity shell -w my-container.sif +```console +marie@local$ singularity shell -w my-container.sif Singularity.my-container.sif> yum install htop ``` The `-w` parameter should only be used to make permanent changes to your container, not for your -productive runs (it can only be used writeable by one user at the same time). You should write your -output to the usual Taurus file systems like `/scratch`. Launching applications in your container +productive runs (it can only be used writable by one user at the same time). You should write your +output to the usual ZIH filesystems like `/scratch`. Launching applications in your container -#### Running a command inside the container +#### Run a Command Inside the Container -While the "shell" command can be useful for tests and setup, you can also launch your applications +While the `shell` command can be useful for tests and setup, you can also launch your applications inside the container directly using "exec": -```Bash -singularity exec my-container.img /opt/myapplication/bin/run_myapp +```console +marie@login$ singularity exec my-container.img /opt/myapplication/bin/run_myapp ``` This can be useful if you wish to create a wrapper script that transparently calls a containerized application for you. E.g.: -```Bash +```bash #!/bin/bash X=`which singularity 2>/dev/null` if [ "z$X" = "z" ] ; then - echo "Singularity not found. Is the module loaded?" - exit 1 + echo "Singularity not found. Is the module loaded?" + exit 1 fi singularity exec /scratch/p_myproject/my-container.sif /opt/myapplication/run_myapp "$@" -The better approach for that however is to use `singularity run` for that, which executes whatever was set in the _%runscript_ section of the definition file with the arguments you pass to it. -Example: -Build the following definition file into an image: +``` + +The better approach is to use `singularity run`, which executes whatever was set in the `%runscript` +section of the definition file with the arguments you pass to it. 
Example: Build the following +definition file into an image: + +```bash Bootstrap: docker From: ubuntu:trusty @@ -285,33 +324,32 @@ singularity build my-container.sif example.def Then you can run your application via -```Bash +```console singularity run my-container.sif first_arg 2nd_arg ``` -Alternatively you can execute the container directly which is -equivalent: +Alternatively you can execute the container directly which is equivalent: -```Bash +```console ./my-container.sif first_arg 2nd_arg ``` With this you can even masquerade an application with a singularity container as if it was an actual program by naming the container just like the binary: -```Bash +```console mv my-container.sif myCoolAp ``` -### Use-cases +### Use-Cases -One common use-case for containers is that you need an operating system with a newer GLIBC version -than what is available on Taurus. E.g., the bullx Linux on Taurus used to be based on RHEL6 having a -rather dated GLIBC version 2.12, some binary-distributed applications didn't work on that anymore. -You can use one of our pre-made CentOS 7 container images (`/scratch/singularity/centos7.img`) to -circumvent this problem. Example: +One common use-case for containers is that you need an operating system with a newer +[glibc](https://www.gnu.org/software/libc/) version than what is available on ZIH systems. E.g., the +bullx Linux on ZIH systems used to be based on RHEL 6 having a rather dated glibc version 2.12, some +binary-distributed applications didn't work on that anymore. You can use one of our pre-made CentOS +7 container images (`/scratch/singularity/centos7.img`) to circumvent this problem. Example: -```Bash -$ singularity exec /scratch/singularity/centos7.img ldd --version +```console +marie@login$ singularity exec /scratch/singularity/centos7.img ldd --version ldd (GNU libc) 2.17 ``` diff --git a/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md b/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md index d482d89a45a3849054af19a75ccaf64daeb6e9eb..231ce447b0fa8157ebb9b4a8ea6dd9bb1542fa7b 100644 --- a/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md +++ b/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md @@ -1,133 +1,155 @@ # EasyBuild -Sometimes the \<a href="SoftwareModulesList" target="\_blank" -title="List of Modules">modules installed in the cluster\</a> are not -enough for your purposes and you need some other software or a different -version of a software. - -\<br />For most commonly used software, chances are high that there is -already a *recipe* that EasyBuild provides, which you can use. But what -is Easybuild? - -\<a href="<https://easybuilders.github.io/easybuild/>" -target="\_blank">EasyBuild\</a>\<span style="font-size: 1em;"> is the -software used to build and install software on, and create modules for, -Taurus.\</span> - -\<span style="font-size: 1em;">The aim of this page is to introduce -users to working with EasyBuild and to utilizing it to create -modules**.**\</span> - -**Prerequisites:** \<a href="Login" target="\_blank">access\</a> to the -Taurus system and basic knowledge about Linux, \<a href="SystemTaurus" -target="\_blank" title="SystemTaurus">Taurus\</a> and the \<a -href="RuntimeEnvironment" target="\_blank" -title="RuntimeEnvironment">modules system \</a>on Taurus. 
- -\<span style="font-size: 1em;">EasyBuild uses a configuration file -called recipe or "EasyConfig", which contains all the information about -how to obtain and build the software:\</span> +Sometimes the [modules](modules.md) installed in the cluster are not enough for your purposes and +you need some other software or a different version of a software. + +For most commonly used software, chances are high that there is already a *recipe* that EasyBuild +provides, which you can use. But what is EasyBuild? + +[EasyBuild](https://easybuild.io/) is the software used to build and install +software on ZIH systems. + +The aim of this page is to introduce users to working with EasyBuild and to utilizing it to create +modules. + +## Prerequisites + +1. [Shell access](../access/ssh_login.md) to ZIH systems +1. basic knowledge about: + - [the ZIH system](../jobs_and_resources/hardware_overview.md) + - [the module system](modules.md) on ZIH systems + +EasyBuild uses a configuration file called recipe or "EasyConfig", which contains all the +information about how to obtain and build the software: - Name - Version - Toolchain (think: Compiler + some more) - Download URL -- Buildsystem (e.g. configure && make or cmake && make) +- Build system (e.g. `configure && make` or `cmake && make`) - Config parameters - Tests to ensure a successful build -The "Buildsystem" part is implemented in so-called "EasyBlocks" and -contains the common workflow. Sometimes those are specialized to -encapsulate behaviour specific to multiple/all versions of the software. -\<span style="font-size: 1em;">Everything is written in Python, which -gives authors a great deal of flexibility.\</span> +The build system part is implemented in so-called "EasyBlocks" and contains the common workflow. +Sometimes, those are specialized to encapsulate behavior specific to multiple/all versions of the +software. Everything is written in Python, which gives authors a great deal of flexibility. ## Set up a custom module environment and build your own modules -Installation of the new software (or version) does not require any -specific credentials. +Installation of the new software (or version) does not require any specific credentials. -\<br />Prerequisites: 1 An existing EasyConfig 1 a place to put your -modules. \<span style="font-size: 1em;">Step by step guide:\</span> +### Prerequisites -1\. Create a \<a href="WorkSpaces" target="\_blank">workspace\</a> where -you'll install your modules. You need a place where your modules will be -placed. This needs to be done only once : +1. An existing EasyConfig +1. a place to put your modules. - ws_allocate -F scratch EasyBuild 50 # +### Step by step guide -2\. Allocate nodes. You can do this with interactive jobs (see the -example below) and/or put commands in a batch file and source it. The -latter is recommended for non-interactive jobs, using the command sbatch -in place of srun. For the sake of illustration, we use an interactive -job as an example. The node parameters depend, to some extent, on the -architecture you want to use. ML nodes for the Power9 and others for the -x86. We will use Haswell nodes. +**Step 1:** Create a [workspace](../data_lifecycle/workspaces.md#allocate-a-workspace) where you +install your modules. You need a place where your modules are placed. 
This needs to be done only +once: - srun -p haswell -N 1 -c 4 --time=08:00:00 --pty /bin/bash +```console +marie@login$ ws_allocate -F scratch EasyBuild 50 +marie@login$ ws_list | grep 'directory.*EasyBuild' + workspace directory : /scratch/ws/1/marie-EasyBuild +``` -\*Using EasyBuild on the login nodes is not allowed\* +**Step 2:** Allocate nodes. You can do this with interactive jobs (see the example below) and/or +put commands in a batch file and source it. The latter is recommended for non-interactive jobs, +using the command `sbatch` instead of `srun`. For the sake of illustration, we use an +interactive job as an example. Depending on the partitions that you want the module to be usable on +later, you need to select nodes with the same architecture. Thus, use nodes from partition ml for +building, if you want to use the module on nodes of that partition. In this example, we assume +that we want to use the module on nodes with x86 architecture and thus, we use Haswell nodes. -3\. Load EasyBuild module. +```console +marie@login$ srun --partition=haswell --nodes=1 --cpus-per-task=4 --time=08:00:00 --pty /bin/bash -l +``` - module load EasyBuild +!!! warning -\<br />4. Specify Workspace. The rest of the guide is based on it. -Please create an environment variable called \`WORKSPACE\` with the -location of your Workspace: + Using EasyBuild on the login nodes is not allowed. - WORKSPACE=<location_of_your_workspace> # For example: WORKSPACE=/scratch/ws/anpo879a-EasyBuild +**Step 3:** Specify the workspace. The rest of the guide is based on it. Please create an +environment variable called `WORKSPACE` with the path to your workspace: -5\. Load the correct modenv according to your current or target -architecture: \`ml modenv/scs5\` for x86 (default) or \`modenv/ml\` for -Power9 (ml partition). Load EasyBuild module +```console +marie@compute$ export WORKSPACE=/scratch/ws/1/marie-EasyBuild #see output of ws_list above +``` - ml modenv/scs5 - module load EasyBuild +**Step 4:** Load the correct module environment `modenv` according to your current or target +architecture: -6\. Set up your environment: +=== "x86 (default, e. g. partition haswell)" + ```console + marie@compute$ module load modenv/scs5 + ``` +=== "Power9 (partition ml)" + ```console + marie@ml$ module load modenv/ml + ``` - export EASYBUILD_ALLOW_LOADED_MODULES=EasyBuild,modenv/scs5 - export EASYBUILD_DETECT_LOADED_MODULES=unload - export EASYBUILD_BUILDPATH="/tmp/${USER}-EasyBuild${SLURM_JOB_ID:-}" - export EASYBUILD_SOURCEPATH="${WORKSPACE}/sources" - export EASYBUILD_INSTALLPATH="${WORKSPACE}/easybuild-$(basename $(readlink -f /sw/installed))" - export EASYBUILD_INSTALLPATH_MODULES="${EASYBUILD_INSTALLPATH}/modules" - module use "${EASYBUILD_INSTALLPATH_MODULES}/all" - export LMOD_IGNORE_CACHE=1 +**Step 5:** Load module `EasyBuild` -7\. \<span style="font-size: 13px;">Now search for an existing -EasyConfig: \</span> +```console +marie@compute$ module load EasyBuild +``` - eb --search TensorFlow +**Step 6:** Set up your environment: -\<span style="font-size: 13px;">8. 
Build the EasyConfig and its -dependencies\</span> +```console +marie@compute$ export EASYBUILD_ALLOW_LOADED_MODULES=EasyBuild,modenv/scs5 +marie@compute$ export EASYBUILD_DETECT_LOADED_MODULES=unload +marie@compute$ export EASYBUILD_BUILDPATH="/tmp/${USER}-EasyBuild${SLURM_JOB_ID:-}" +marie@compute$ export EASYBUILD_SOURCEPATH="${WORKSPACE}/sources" +marie@compute$ export EASYBUILD_INSTALLPATH="${WORKSPACE}/easybuild-$(basename $(readlink -f /sw/installed))" +marie@compute$ export EASYBUILD_INSTALLPATH_MODULES="${EASYBUILD_INSTALLPATH}/modules" +marie@compute$ module use "${EASYBUILD_INSTALLPATH_MODULES}/all" +marie@compute$ export LMOD_IGNORE_CACHE=1 +``` - eb TensorFlow-1.8.0-fosscuda-2018a-Python-3.6.4.eb -r +**Step 7:** Now search for an existing EasyConfig: -\<span style="font-size: 13px;">After this is done (may take A LONG -time), you can load it just like any other module.\</span> +```console +marie@compute$ eb --search TensorFlow +``` -9\. To use your custom build modules you only need to rerun step 4, 5, 6 -and execute the usual: +**Step 8:** Build the EasyConfig and its dependencies (option `-r`) - module load <name_of_your_module> # For example module load TensorFlow-1.8.0-fosscuda-2018a-Python-3.6.4 +```console +marie@compute$ eb TensorFlow-1.8.0-fosscuda-2018a-Python-3.6.4.eb -r +``` -The key is the \`module use\` command which brings your modules into -scope so \`module load\` can find them and the LMOD_IGNORE_CACHE line -which makes LMod pick up the custom modules instead of searching the +This may take a long time. After this is done, you can load it just like any other module. + +**Step 9:** To use your custom build modules you only need to rerun steps 3, 4, 5, 6 and execute +the usual: + +```console +marie@compute$ module load TensorFlow-1.8.0-fosscuda-2018a-Python-3.6.4 #replace with the name of your module +``` + +The key is the `module use` command, which brings your modules into scope, so `module load` can find +them. The `LMOD_IGNORE_CACHE` line makes `LMod` pick up the custom modules instead of searching the system cache. ## Troubleshooting -When building your EasyConfig fails, you can first check the log -mentioned and scroll to the bottom to see what went wrong. +When building your EasyConfig fails, you can first check the log mentioned and scroll to the bottom +to see what went wrong. + +It might also be helpful to inspect the build environment EasyBuild uses. For that you can run: + +```console +marie@compute$ eb myEC.eb --dump-env-script` +``` + +This command creates a sourceable `.env`-file with `module load` and `export` commands that show +what EasyBuild does before running, e.g., the configuration step. -It might also be helpful to inspect the build environment EB uses. For -that you can run \`eb myEC.eb --dump-env-script\` which creates a -sourceable .env file with \`module load\` and \`export\` commands that -show what EB does before running, e.g., the configure step. +It might also be helpful to use -It might also be helpful to use '\<span style="font-size: 1em;">export -LMOD_IGNORE_CACHE=0'\</span> +```console +marie@compute$ export LMOD_IGNORE_CACHE=0 +``` diff --git a/doc.zih.tu-dresden.de/docs/software/dask.md b/doc.zih.tu-dresden.de/docs/software/dask.md deleted file mode 100644 index 316aefe2395e077bec611fdbd0c080cce2af1940..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/dask.md +++ /dev/null @@ -1,136 +0,0 @@ -# Dask - -**Dask** is an open-source library for parallel computing. 
Dask is a flexible library for parallel -computing in Python. - -Dask natively scales Python. It provides advanced parallelism for analytics, enabling performance at -scale for some of the popular tools. For instance: Dask arrays scale Numpy workflows, Dask -dataframes scale Pandas workflows, Dask-ML scales machine learning APIs like Scikit-Learn and -XGBoost. - -Dask is composed of two parts: - -- Dynamic task scheduling optimized for computation and interactive - computational workloads. -- Big Data collections like parallel arrays, data frames, and lists - that extend common interfaces like NumPy, Pandas, or Python - iterators to larger-than-memory or distributed environments. These - parallel collections run on top of dynamic task schedulers. - -Dask supports several user interfaces: - -High-Level: - -- Arrays: Parallel NumPy -- Bags: Parallel lists -- DataFrames: Parallel Pandas -- Machine Learning : Parallel Scikit-Learn -- Others from external projects, like XArray - -Low-Level: - -- Delayed: Parallel function evaluation -- Futures: Real-time parallel function evaluation - -## Installation - -### Installation Using Conda - -Dask is installed by default in [Anaconda](https://www.anaconda.com/download/). To install/update -Dask on a Taurus with using the [conda](https://www.anaconda.com/download/) follow the example: - -```Bash -# Job submission in ml nodes with allocating: 1 node, 1 gpu per node, 4 hours -srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash -``` - -Create a conda virtual environment. We would recommend using a workspace. See the example (use -`--prefix` flag to specify the directory). - -**Note:** You could work with simple examples in your home directory (where you are loading by -default). However, in accordance with the -[HPC storage concept](../data_lifecycle/overview.md) please use a -[workspaces](../data_lifecycle/workspaces.md) for your study and work projects. - -```Bash -conda create --prefix /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test python=3.6 -``` - -By default, conda will locate the environment in your home directory: - -```Bash -conda create -n dask-test python=3.6 -``` - -Activate the virtual environment, install Dask and verify the installation: - -```Bash -ml modenv/ml -ml PythonAnaconda/3.6 -conda activate /scratch/ws/0/aabc1234-Workproject/conda-virtual-environment/dask-test python=3.6 -which python -which conda -conda install dask -python - -from dask.distributed import Client, progress -client = Client(n_workers=4, threads_per_worker=1) -client -``` - -### Installation Using Pip - -You can install everything required for most common uses of Dask (arrays, dataframes, etc) - -```Bash -srun -p ml -N 1 -n 1 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash - -cd /scratch/ws/0/aabc1234-Workproject/python-virtual-environment/dask-test - -ml modenv/ml -module load PythonAnaconda/3.6 -which python - -python3 -m venv --system-site-packages dask-test -source dask-test/bin/activate -python -m pip install "dask[complete]" - -python -from dask.distributed import Client, progress -client = Client(n_workers=4, threads_per_worker=1) -client -``` - -Distributed scheduler - -? - -## Run Dask on Taurus - -The preferred and simplest way to run Dask on HPC systems today both for new, experienced users or -administrator is to use [dask-jobqueue](https://jobqueue.dask.org/). 
- -You can install dask-jobqueue with `pip` or `conda` - -Installation with Pip - -```Bash -srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash -cd -/scratch/ws/0/aabc1234-Workproject/python-virtual-environment/dask-test -ml modenv/ml module load PythonAnaconda/3.6 which python - -source dask-test/bin/activate pip -install dask-jobqueue --upgrade # Install everything from last released version -``` - -Installation with Conda - -```Bash -srun -p haswell -N 1 -n 1 -c 4 --mem-per-cpu=2583 --time=01:00:00 --pty bash - -ml modenv/ml module load PythonAnaconda/3.6 source -dask-test/bin/activate - -conda install dask-jobqueue -c conda-forge\</verbatim> -``` diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics.md b/doc.zih.tu-dresden.de/docs/software/data_analytics.md new file mode 100644 index 0000000000000000000000000000000000000000..b4a5f7f8b9f86c9a47fec20b875970efd4d787b2 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics.md @@ -0,0 +1,36 @@ +# Data Analytics + +On ZIH systems, there are many possibilities for working with tools from the field of data +analytics. The boundaries between data analytics and machine learning are fluid. +Therefore, it may be worthwhile to search for a specific issue within the data analytics and +machine learning sections. + +The following tools are available on ZIH systems, among others: + +* [Python](data_analytics_with_python.md) +* [R](data_analytics_with_r.md) +* [RStudio](data_analytics_with_rstudio.md) +* [Big Data framework Spark](big_data_frameworks.md) +* [MATLAB and Mathematica](mathematics.md) + +Detailed information about frameworks for machine learning, such as [TensorFlow](tensorflow.md) +and [PyTorch](pytorch.md), can be found in the [machine learning](machine_learning.md) subsection. + +Other software, not listed here, can be searched with + +```console +marie@compute$ module spider <software_name> +``` + +Refer to the section covering [modules](modules.md) for further information on the modules system. +Additional software or special versions of [individual modules](custom_easy_build_environment.md) +can be installed individually by each user. If possible, the use of +[virtual environments](python_virtual_environments.md) is +recommended (e.g. for Python). Likewise, software can be used within [containers](containers.md). + +For the transfer of larger amounts of data into and within the system, the +[export nodes and datamover](../data_transfer/overview.md) should be used. +Data is stored in the [workspaces](../data_lifecycle/workspaces.md). +Software modules or virtual environments can also be installed in workspaces to enable +collaborative work even within larger groups. General recommendations for setting up workflows +can be found in the [experiments](../data_lifecycle/experiments.md) section. diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md new file mode 100644 index 0000000000000000000000000000000000000000..00ce0c5c4c3ddbd3654161bab69ee0a493cb4350 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md @@ -0,0 +1,451 @@ +# Python for Data Analytics + +Python is a high-level interpreted language widely used in research and science. Using ZIH system +allows you to work with Python quicker and more effective. Here, a general introduction to working +with Python on ZIH systems is given. 
Further documentation is available for specific +[machine learning frameworks](machine_learning.md). + +## Python Console and Virtual Environments + +Often, it is useful to create an isolated development environment, which can be shared among +a research group and/or teaching class. For this purpose, +[Python virtual environments](python_virtual_environments.md) can be used. + +The interactive Python interpreter can also be used on ZIH systems via an interactive job: + +```console +marie@login$ srun --partition=haswell --gres=gpu:1 --ntasks=1 --cpus-per-task=7 --pty --mem-per-cpu=8000 bash +marie@haswell$ module load Python +marie@haswell$ python +Python 3.8.6 (default, Feb 17 2021, 11:48:51) +[GCC 10.2.0] on linux +Type "help", "copyright", "credits" or "license" for more information. +>>> +``` + +## Jupyter Notebooks + +Jupyter notebooks allow to analyze data interactively using your web browser. One advantage of +Jupyter is, that code, documentation and visualization can be included in a single notebook, so that +it forms a unit. Jupyter notebooks can be used for many tasks, such as data cleaning and +transformation, numerical simulation, statistical modeling, data visualization and also machine +learning. + +On ZIH systems, a [JupyterHub](../access/jupyterhub.md) is available, which can be used to run a +Jupyter notebook on a node, using a GPU when needed. + +## Parallel Computing with Python + +### Pandas with Pandarallel + +[Pandas](https://pandas.pydata.org/){:target="_blank"} is a widely used library for data +analytics in Python. +In many cases, an existing source code using Pandas can be easily modified for parallel execution by +using the [pandarallel](https://github.com/nalepae/pandarallel/tree/v1.5.2) module. The number of +threads that can be used in parallel depends on the number of cores (parameter `--cpus-per-task`) +within the Slurm request, e.g. + +```console +marie@login$ srun --partition=haswell --cpus-per-task=4 --mem=2G --hint=nomultithread --pty --time=8:00:00 bash +``` + +The above request allows to use 4 parallel threads. + +The following example shows how to parallelize the apply method for pandas dataframes with the +pandarallel module. If the pandarallel module is not installed already, use a +[virtual environment](python_virtual_environments.md) to install the module. + +??? example + + ```python + import pandas as pd + import numpy as np + from pandarallel import pandarallel + + pandarallel.initialize() + # unfortunately the initialize method gets the total number of physical cores without + # taking into account allocated cores by Slurm, but the choice of the -c parameter is of relevance here + + N_rows = 10**5 + N_cols = 5 + df = pd.DataFrame(np.random.randn(N_rows, N_cols)) + + # here some function that needs to be executed in parallel + def transform(x): + return(np.mean(x)) + + print('calculate with normal apply...') + df.apply(func=transform, axis=1) + + print('calculate with pandarallel...') + df.parallel_apply(func=transform, axis=1) + ``` +For more examples of using pandarallel check out +[https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb](https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb). + +### Dask + +[Dask](https://dask.org/) is a flexible and open-source library +for parallel computing in Python. +It replaces some Python data structures with parallel versions +in order to provide advanced +parallelism for analytics, enabling performance at scale +for some of the popular tools. 
+For instance: Dask arrays replace NumPy arrays, +Dask dataframes replace Pandas dataframes. +Furthermore, Dask-ML scales machine learning APIs like Scikit-Learn and XGBoost. + +Dask is composed of two parts: + +- Dynamic task scheduling optimized for computation and interactive + computational workloads. +- Big Data collections like parallel arrays, data frames, and lists that extend common interfaces + like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. + These parallel collections run on top of dynamic task schedulers. + +Dask supports several user interfaces: + +- High-Level + - Arrays: Parallel NumPy + - Bags: Parallel lists + - DataFrames: Parallel Pandas + - Machine Learning: Parallel Scikit-Learn + - Others from external projects, like XArray +- Low-Level + - Delayed: Parallel function evaluation + - Futures: Real-time parallel function evaluation + +#### Dask Modules on ZIH Systems + +On ZIH systems, Dask is available as a module. +Check available versions and load your preferred one: + +```console +marie@compute$ module spider dask +------------------------------------------------------------------------------------------ + dask: +---------------------------------------------------------------------------------------------- + Versions: + dask/2.8.0-fosscuda-2019b-Python-3.7.4 + dask/2.8.0-Python-3.7.4 + dask/2.8.0 (E) +[...] +marie@compute$ module load dask/2.8.0-fosscuda-2019b-Python-3.7.4 +marie@compute$ python -c "import dask; print(dask.__version__)" +2021.08.1 +``` + +The preferred way is to use Dask as a separate module as was described above. +However, you can use it as part of the **Anaconda** module, e.g: `module load Anaconda3`. + +#### Scheduling by Dask + +Whenever you use functions on Dask collections (Dask Array, Dask Bag, etc.), Dask models these as +single tasks forming larger task graphs in the background without you noticing. +After Dask generates these task graphs, +it needs to execute them on parallel hardware. +This is the job of a task scheduler. +Please use Distributed scheduler for your +Dask computations on the cluster and avoid using a Single machine scheduler. + +##### Distributed Scheduler + +There are a variety of ways to set Distributed scheduler. +However, `dask.distributed` scheduler will be used for many of them. +To use the `dask.distributed` scheduler you must set up a Client: + +```python +from dask.distributed import Client +client = Client(...) # Connect to distributed cluster and override default +df.x.sum().compute() # This now runs on the distributed system +``` + +The idea behind Dask is to scale Python and distribute computation among the workers (multiple +machines, jobs). +The preferred and simplest way to run Dask on ZIH systems +today both for new or experienced users +is to use **[dask-jobqueue](https://jobqueue.dask.org/)**. + +However, Dask-jobqueue is slightly oriented toward +interactive analysis +usage, and it might be better to use tools like +**[Dask-mpi](https://docs.dask.org/en/latest/setup/hpc.html#using-mpi)** +in some routine batch production workloads. + +##### Dask-mpi + +You can launch a Dask network using +`mpirun` or `mpiexec` and the `dask-mpi` command line executable. +This depends on the [mpi4py library](#mpi4py-mpi-for-python). +For more detailed information, please check +[the official documentation](https://docs.dask.org/en/latest/setup/hpc.html#using-mpi). 
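
As a complement to the official documentation, the following is a minimal sketch of this approach.
It assumes that the `dask-mpi` package (and `mpi4py`) is available in your virtual environment and
that an MPI implementation is loaded; the file name `dask_mpi_example.py` is a placeholder:

```python
# dask_mpi_example.py - minimal sketch, not an official ZIH example
from dask_mpi import initialize
from dask.distributed import Client
import dask.array as da

# Rank 0 becomes the Dask scheduler, rank 1 runs this client code,
# all remaining MPI ranks become Dask workers.
initialize()

client = Client()  # connects to the scheduler started by initialize()

# small demonstration workload distributed over the workers
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.mean().compute())
```

Inside a job allocation, the script is then started with as many MPI ranks as tasks were
requested, e.g.:

```console
marie@compute$ mpirun python dask_mpi_example.py
```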
+ +##### Dask-jobqueue + +[Dask-jobqueue](https://jobqueue.dask.org/) can be used as the standard way +to use dask for most users. +It allows an easy deployment of Dask Distributed on HPC with Slurm +or other job queuing systems. + +Dask-jobqueue is available as an extension +for a Dask module (which can be loaded by: `module load dask`). + +The availability of the exact packages (such a Dask-jobqueue) +in the module can be checked by the +`module whatis <name_of_the_module>` command, e.g. `module whatis dask`. + +Moreover, it is possible to install and use `dask-jobqueue` +in your local python environments. +You can install Dask-jobqueue with `pip` or `conda`. + +###### Example of Using Dask-Jobqueue with SLURMCluster + +[Dask-jobqueue](https://jobqueue.dask.org/en/latest/howitworks.html#workers-vs-jobs) +allows running jobs on the ZIH system +inside the python code and scale computations over the jobs. +[Dask-jobqueue](https://jobqueue.dask.org/en/latest/howitworks.html#workers-vs-jobs) +creates a Dask Scheduler in the Python process +where the cluster object is instantiated. +Please check the example of a definition of the cluster object +for the partition `alpha` (queue at the dask terms) on the ZIH system: + +```python +from dask_jobqueue import SLURMCluster + +cluster = SLURMCluster(queue='alpha', + cores=8, + processes=2, + project='p_marie', + memory="8GB", + walltime="00:30:00") + +``` + +These parameters above specify the characteristics of a +single job or a single compute node, +rather than the characteristics of your computation as a whole. +It hasn’t actually launched any jobs yet. +For the full computation, you will then ask for a number of +jobs using the scale command, e.g : `cluster.scale(2)`. +Thus, you have to specify a `SLURMCluster` by `dask_jobqueue`, +scale it and use it for your computations. There is an example: + +```python +from distributed import Client +from dask_jobqueue import SLURMCluster +from dask import delayed + +cluster = SLURMCluster(queue='alpha', + cores=8, + processes=2, + project='p_marie', + memory="80GB", + walltime="00:30:00", + extra=['--resources gpu=1']) + +cluster.scale(2) #scale it to 2 workers! +client = Client(cluster) #command will show you number of workers (python objects corresponds to jobs) +``` + +Please have a look at the `extra` parameter in the script above. +This could be used to specify a +special hardware availability that the scheduler +is not aware of, for example, GPUs. +Please don't forget to specify the name of your project. + +The Python code for setting up Slurm clusters +and scaling clusters can be run by the `srun` +(but remember that using `srun` directly on the shell +blocks the shell and launches an +interactive job) or batch jobs or +[JupyterHub](../access/jupyterhub.md) with loaded Dask +(by module or by Python virtual environment). + +!!! note + The job to run original code (de facto an interface) with a setup should be simple and light. + Please don't use a lot of resources for that. + +The following example shows using +Dask by `dask-jobqueue` with `SLURMCluster` and `dask.array` +for the Monte-Carlo estimation of Pi. + +??? 
example "Example of using SLURMCluster" + + ```python + #use of dask-jobqueue for the estimation of Pi by Monte-Carlo method + + import time + from time import time, sleep + from dask.distributed import Client + from dask_jobqueue import SLURMCluster + import subprocess as sp + + import dask.array as da + import numpy as np + + #setting up the dashboard + + uid = int( sp.check_output('id -u', shell=True).decode('utf-8').replace('\n','') ) + portdash = 10001 + uid + + #create a Slurm cluster, please specify your project + + cluster = SLURMCluster(queue='alpha', cores=2, project='p_marie', memory="8GB", walltime="00:30:00", extra=['--resources gpu=1'], scheduler_options={"dashboard_address": f":{portdash}"}) + + #submit the job to the scheduler with the number of nodes (here 2) requested: + + cluster.scale(2) + + #wait for Slurm to allocate a resources + + sleep(120) + + #check resources + + client = Client(cluster) + client + + #real calculations with a Monte Carlo method + + def calc_pi_mc(size_in_bytes, chunksize_in_bytes=200e6): + """Calculate PI using a Monte Carlo estimate.""" + + size = int(size_in_bytes / 8) + chunksize = int(chunksize_in_bytes / 8) + + xy = da.random.uniform(0, 1, size=(size / 2, 2), chunks=(chunksize / 2, 2)) + + in_circle = ((xy ** 2).sum(axis=-1) < 1) + pi = 4 * in_circle.mean() + + return pi + + def print_pi_stats(size, pi, time_delta, num_workers): + """Print pi, calculate offset from true value, and print some stats.""" + print(f"{size / 1e9} GB\n" + f"\tMC pi: {pi : 13.11f}" + f"\tErr: {abs(pi - np.pi) : 10.3e}\n" + f"\tWorkers: {num_workers}" + f"\t\tTime: {time_delta : 7.3f}s") + + #let's loop over different volumes of double-precision random numbers and estimate it + + for size in (1e9 * n for n in (1, 10, 100)): + + start = time() + pi = calc_pi_mc(size).compute() + elaps = time() - start + + print_pi_stats(size, pi, time_delta=elaps, num_workers=len(cluster.scheduler.workers)) + + #Scaling the Cluster to twice its size and re-run the experiments + + new_num_workers = 2 * len(cluster.scheduler.workers) + + print(f"Scaling from {len(cluster.scheduler.workers)} to {new_num_workers} workers.") + + cluster.scale(new_num_workers) + + sleep(120) + + client + + #Re-run same experiments with doubled cluster + + for size in (1e9 * n for n in (1, 10, 100)): + + start = time() + pi = calc_pi_mc(size).compute() + elaps = time() - start + + print_pi_stats(size, pi, time_delta=elaps, num_workers=len(cluster.scheduler.workers)) + ``` + +Please check the availability of resources that you want to allocate +by the script for the example above. +You can do it with `sinfo` command. The script doesn't work +without available cluster resources. + +### Mpi4py - MPI for Python + +Message Passing Interface (MPI) is a standardized and +portable message-passing standard, designed to +function on a wide variety of parallel computing architectures. + +Mpi4py (MPI for Python) provides bindings of the MPI standard for +the Python programming language, +allowing any Python program to exploit multiple processors. + +Mpi4py is based on MPI-2 C++ bindings. It supports almost all MPI calls. +It supports communication of pickle-able Python objects. +Mpi4py provides optimized communication of NumPy arrays. + +Mpi4py is included in the SciPy-bundle modules on the ZIH system. + +```console +marie@compute$ module load SciPy-bundle/2020.11-foss-2020b +Module SciPy-bundle/2020.11-foss-2020b and 28 dependencies loaded. 
+marie@compute$ pip list +Package Version +----------------------------- ---------- +[...] +mpi4py 3.0.3 +[...] +``` + +Other versions of the package can be found with: + +```console +marie@compute$ module spider mpi4py +----------------------------------------------------------------------------------------------------------------------------------------- + mpi4py: +----------------------------------------------------------------------------------------------------------------------------------------- + Versions: + mpi4py/1.3.1 + mpi4py/2.0.0-impi + mpi4py/3.0.0 (E) + mpi4py/3.0.2 (E) + mpi4py/3.0.3 (E) + +Names marked by a trailing (E) are extensions provided by another module. + +----------------------------------------------------------------------------------------------------------------------------------------- + For detailed information about a specific "mpi4py" package (including how to load the modules), use the module's full name. + Note that names that have a trailing (E) are extensions provided by other modules. + For example: + + $ module spider mpi4py/3.0.3 +----------------------------------------------------------------------------------------------------------------------------------------- +``` + +Moreover, it is possible to install mpi4py in your local conda +environment. + +The example of mpi4py usage for the verification that +mpi4py is running correctly can be found below: + +```python +from mpi4py import MPI +comm = MPI.COMM_WORLD +print("%d of %d" % (comm.Get_rank(), comm.Get_size())) +``` + +For the multi-node case, use a script similar to this: + +```bash +#!/bin/bash +#SBATCH --nodes=2 +#SBATCH --partition=ml +#SBATCH --tasks-per-node=2 +#SBATCH --cpus-per-task=1 + +module load modenv/ml +module load PythonAnaconda/3.6 + +eval "$(conda shell.bash hook)" +conda activate /home/marie/conda-virtual-environment/kernel2 && srun python mpi4py_test.py #specify name of your virtual environment +``` + +For the verification of the multi-node case, +you can use the Python code from the previous part +(with verification of the installation) as a test file. diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md index 9c1e092a72d6294a9c5b91f0cd3459bc8e215ebb..72224113fdf8a9c6f4727d47771283dc1d0c1baa 100644 --- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md @@ -1,53 +1,41 @@ # R for Data Analytics [R](https://www.r-project.org/about.html) is a programming language and environment for statistical -computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, -classical statistical tests, time-series analysis, classification, etc) and graphical techniques. R -is an integrated suite of software facilities for data manipulation, calculation and -graphing. +computing and graphics. It provides a wide variety of statistical (linear and nonlinear modeling, +classical statistical tests, time-series analysis, classification, etc.), machine learning +algorithms and graphical techniques. R is an integrated suite of software facilities for data +manipulation, calculation and graphing. -R possesses an extensive catalogue of statistical and graphical methods. It includes machine -learning algorithms, linear regression, time series, statistical inference. - -We recommend using **Haswell** and/or **Romeo** partitions to work with R. 
For more details -see [here](../jobs_and_resources/hardware_taurus.md). +We recommend using the partitions Haswell and/or Romeo to work with R. For more details +see our [hardware documentation](../jobs_and_resources/hardware_overview.md). ## R Console -This is a quickstart example. The `srun` command is used to submit a real-time execution job -designed for interactive use with monitoring the output. Please check -[the Slurm page](../jobs_and_resources/slurm.md) for details. - -```Bash -# job submission on haswell nodes with allocating: 1 task, 1 node, 4 CPUs per task with 2541 mb per CPU(core) for 1 hour -tauruslogin$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash - -# Ensure that you are using the scs5 environment -module load modenv/scs5 -# Check all available modules for R with version 3.6 -module available R/3.6 -# Load default R module -module load R -# Checking the current R version -which R -# Start R console -R +In the following example, the `srun` command is used to start an interactive job, so that the output +is visible to the user. Please check the [Slurm page](../jobs_and_resources/slurm.md) for details. + +```console +marie@login$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash +marie@haswell$ module load modenv/scs5 +marie@haswell$ module load R/3.6 +[...] +Module R/3.6.0-foss-2019a and 56 dependencies loaded. +marie@haswell$ which R +marie@haswell$ /sw/installed/R/3.6.0-foss-2019a/bin/R ``` -Using `srun` is recommended only for short test runs, while for larger runs batch jobs should be -used. The examples can be found [here](get_started_with_hpcda.md) or -[here](../jobs_and_resources/slurm.md). +Using interactive sessions is recommended only for short test runs, while for larger runs batch jobs +should be used. Examples can be found on the [Slurm page](../jobs_and_resources/slurm.md). It is also possible to run `Rscript` command directly (after loading the module): -```Bash -# Run Rscript directly. For instance: Rscript /scratch/ws/0/marie-study_project/my_r_script.R -Rscript /path/to/script/your_script.R param1 param2 +```console +marie@haswell$ Rscript </path/to/script/your_script.R> <param1> <param2> ``` ## R in JupyterHub -In addition to using interactive and batch jobs, it is possible to work with **R** using +In addition to using interactive and batch jobs, it is possible to work with R using [JupyterHub](../access/jupyterhub.md). The production and test [environments](../access/jupyterhub.md#standard-environments) of @@ -55,66 +43,49 @@ JupyterHub contain R kernel. It can be started either in the notebook or in the ## RStudio -[RStudio](<https://rstudio.com/) is an integrated development environment (IDE) for R. It includes -a console, syntax-highlighting editor that supports direct code execution, as well as tools for -plotting, history, debugging and workspace management. RStudio is also available on Taurus. - -The easiest option is to run RStudio in JupyterHub directly in the browser. It can be started -similarly to a new kernel from [JupyterLab](../access/jupyterhub.md#jupyterlab) launcher. - - -{: align="center"} - -Please keep in mind that it is currently not recommended to use the interactive x11 job with the -desktop version of RStudio, as described, for example, in introduction HPC-DA slides. +For using R with RStudio please refer to the documentation on +[Data Analytics with RStudio](data_analytics_with_rstudio.md). 
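
As noted above, interactive sessions are only meant for short tests; longer computations should be
submitted to the batch system. A minimal job script for a serial R script could look like the
following sketch (partition, resource values, and the script path are placeholders to adapt):

```bash
#!/bin/bash
#SBATCH --partition=haswell
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2541
#SBATCH --time=01:00:00
#SBATCH --output=r_job.out
#SBATCH --error=r_job.err

module purge
module load modenv/scs5
module load R/3.6

Rscript /path/to/script/your_script.R param1 param2
```

Submit the job script with `sbatch <jobfile>`.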
## Install Packages in R -By default, user-installed packages are saved in the users home in a subfolder depending on -the architecture (x86 or PowerPC). Therefore the packages should be installed using interactive +By default, user-installed packages are saved in the users home in a folder depending on +the architecture (`x86` or `PowerPC`). Therefore the packages should be installed using interactive jobs on the compute node: -```Bash -srun -p haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash - -module purge -module load modenv/scs5 -module load R -R -e 'install.packages("package_name")' #For instance: 'install.packages("ggplot2")' +```console +marie@compute$ module load R +[...] +Module R/3.6.0-foss-2019a and 56 dependencies loaded. +marie@compute$ R -e 'install.packages("ggplot2")' +[...] ``` ## Deep Learning with R The deep learning frameworks perform extremely fast when run on accelerators such as GPU. -Therefore, using nodes with built-in GPUs ([ml](../jobs_and_resources/power9.md) or -[alpha](../jobs_and_resources/alpha_centauri.md) partitions) is beneficial for the examples here. +Therefore, using nodes with built-in GPUs, e.g., partitions [ml](../jobs_and_resources/power9.md) +and [alpha](../jobs_and_resources/alpha_centauri.md), is beneficial for the examples here. ### R Interface to TensorFlow The ["TensorFlow" R package](https://tensorflow.rstudio.com/) provides R users access to the -Tensorflow toolset. [TensorFlow](https://www.tensorflow.org/) is an open-source software library +TensorFlow framework. [TensorFlow](https://www.tensorflow.org/) is an open-source software library for numerical computation using data flow graphs. -```Bash -srun --partition=ml --ntasks=1 --nodes=1 --cpus-per-task=7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash +The respective modules can be loaded with the following -module purge -ml modenv/ml -ml TensorFlow -ml R - -which python -mkdir python-virtual-environments # Create a folder for virtual environments -cd python-virtual-environments -python3 -m venv --system-site-packages R-TensorFlow #create python virtual environment -source R-TensorFlow/bin/activate #activate environment -module list -which R +```console +marie@compute$ module load R/3.6.2-fosscuda-2019b +[...] +Module R/3.6.2-fosscuda-2019b and 63 dependencies loaded. +marie@compute$ module load TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 +Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 15 dependencies loaded. ``` -Please allocate the job with respect to -[hardware specification](../jobs_and_resources/hardware_taurus.md)! Note that the nodes on `ml` -partition have 4way-SMT, so for every physical core allocated, you will always get 4\*1443Mb=5772mb. +!!! warning + + Be aware that for compatibility reasons it is important to choose [modules](modules.md) with + the same toolchain version (in this case `fosscuda/2019b`). In order to interact with Python-based frameworks (like TensorFlow) `reticulate` R library is used. To configure it to point to the correct Python executable in your virtual environment, create @@ -122,23 +93,40 @@ a file named `.Rprofile` in your project directory (e.g. 
R-TensorFlow) with the contents: ```R -Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Anaconda3/2019.03/bin/python") #assign the output of the 'which python' from above to RETICULATE_PYTHON +Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python") #assign RETICULATE_PYTHON to the python executable ``` Let's start R, install some libraries and evaluate the result: -```R -install.packages("reticulate") -library(reticulate) -reticulate::py_config() -install.packages("tensorflow") -library(tensorflow) -tf$constant("Hello Tensorflow") #In the output 'Tesla V100-SXM2-32GB' should be mentioned +```rconsole +> install.packages(c("reticulate", "tensorflow")) +Installing packages into ‘~/R/x86_64-pc-linux-gnu-library/3.6’ +(as ‘lib’ is unspecified) +> reticulate::py_config() +python: /software/rome/Python/3.7.4-GCCcore-8.3.0/bin/python +libpython: /sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/libpython3.7m.so +pythonhome: /software/rome/Python/3.7.4-GCCcore-8.3.0:/software/rome/Python/3.7.4-GCCcore-8.3.0 +version: 3.7.4 (default, Mar 25 2020, 13:46:43) [GCC 8.3.0] +numpy: /software/rome/SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/numpy +numpy_version: 1.17.3 + +NOTE: Python version was forced by RETICULATE_PYTHON + +> library(tensorflow) +2021-08-26 16:11:47.110548: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 +> tf$constant("Hello TensorFlow") +2021-08-26 16:14:00.269248: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1 +2021-08-26 16:14:00.674878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: +pciBusID: 0000:0b:00.0 name: A100-SXM4-40GB computeCapability: 8.0 +coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s +[...] +tf.Tensor(b'Hello TensorFlow', shape=(), dtype=string) ``` ??? example + The example shows the use of the TensorFlow package with the R for the classification problem - related to the MNIST dataset. + related to the MNIST data set. ```R library(tensorflow) library(keras) @@ -214,20 +202,16 @@ tf$constant("Hello Tensorflow") #In the output 'Tesla V100-SXM2-32GB' sh ## Parallel Computing with R Generally, the R code is serial. However, many computations in R can be made faster by the use of -parallel computations. Taurus allows a vast number of options for parallel computations. Large -amounts of data and/or use of complex models are indications to use parallelization. - -### General Information about the R Parallelism - -There are various techniques and packages in R that allow parallelization. This section -concentrates on most general methods and examples. The Information here is Taurus-specific. +parallel computations. This section concentrates on most general methods and examples. The [parallel](https://www.rdocumentation.org/packages/parallel/versions/3.6.2) library will be used below. -**Warning:** Please do not install or update R packages related to parallelism as it could lead to -conflicts with other pre-installed packages. +!!! warning -### Basic Lapply-Based Parallelism + Please do not install or update R packages related to parallelism as it could lead to + conflicts with other preinstalled packages. + +### Basic lapply-Based Parallelism `lapply()` function is a part of base R. lapply is useful for performing operations on list-objects. 
Roughly speaking, lapply is a vectorization of the source code and it is the first step before @@ -243,6 +227,7 @@ This is a simple option for parallelization. It doesn't require much effort to r code to use `mclapply` function. Check out an example below. ??? example + ```R library(parallel) @@ -269,9 +254,9 @@ code to use `mclapply` function. Check out an example below. list_of_averages <- mclapply(X=sample_sizes, FUN=average, mc.cores=threads) # apply function "average" 100 times ``` -The disadvantages of using shared-memory parallelism approach are, that the number of parallel -tasks is limited to the number of cores on a single node. The maximum number of cores on a single -node can be found [here](../jobs_and_resources/hardware_taurus.md). +The disadvantages of using shared-memory parallelism approach are, that the number of parallel tasks +is limited to the number of cores on a single node. The maximum number of cores on a single node can +be found in our [hardware documentation](../jobs_and_resources/hardware_overview.md). Submitting a multicore R job to Slurm is very similar to submitting an [OpenMP Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks), @@ -305,9 +290,10 @@ running in parallel. The desired type of the cluster can be specified with a par This way of the R parallelism uses the [Rmpi](http://cran.r-project.org/web/packages/Rmpi/index.html) package and the [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (Message Passing Interface) as a -"backend" for its parallel operations. The MPI-based job in R is very similar to submitting an +"back-end" for its parallel operations. The MPI-based job in R is very similar to submitting an [MPI Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks) since both are running -multicore jobs on multiple nodes. Below is an example of running R script with the Rmpi on Taurus: +multicore jobs on multiple nodes. Below is an example of running R script with the Rmpi on the ZIH +system: ```Bash #!/bin/bash @@ -315,8 +301,8 @@ multicore jobs on multiple nodes. Below is an example of running R script with t #SBATCH --ntasks=32 # this parameter determines how many processes will be spawned, please use >=8 #SBATCH --cpus-per-task=1 #SBATCH --time=01:00:00 -#SBATCH -o test_Rmpi.out -#SBATCH -e test_Rmpi.err +#SBATCH --output=test_Rmpi.out +#SBATCH --error=test_Rmpi.err module purge module load modenv/scs5 @@ -333,10 +319,10 @@ However, in some specific cases, you can specify the number of nodes and the num tasks per node explicitly: ```Bash -#!/bin/bash #SBATCH --nodes=2 #SBATCH --tasks-per-node=16 #SBATCH --cpus-per-task=1 + module purge module load modenv/scs5 module load R @@ -348,6 +334,7 @@ Use an example below, where 32 global ranks are distributed over 2 nodes with 16 Each MPI rank has 1 core assigned to it. ??? example + ```R library(Rmpi) @@ -371,6 +358,7 @@ Each MPI rank has 1 core assigned to it. Another example: ??? example + ```R library(Rmpi) library(parallel) @@ -405,7 +393,7 @@ Another example: #snow::stopCluster(cl) # usually it hangs over here with OpenMPI > 2.0. In this case this command may be avoided, Slurm will clean up after the job finishes ``` -To use Rmpi and MPI please use one of these partitions: **haswell**, **broadwell** or **rome**. +To use Rmpi and MPI please use one of these partitions: `haswell`, `broadwell` or `rome`. Use `mpirun` command to start the R script. It is a wrapper that enables the communication between processes running on different nodes. 
It is important to use `-np 1` (the number of spawned @@ -422,6 +410,7 @@ parallel workers, you have to manually specify the number of nodes according to hardware specification and parameters of your job. ??? example + ```R library(parallel) @@ -456,7 +445,7 @@ hardware specification and parameters of your job. print(paste("Program finished")) ``` -#### FORK cluster +#### FORK Cluster The `type="FORK"` method behaves exactly like the `mclapply` function discussed in the previous section. Like `mclapply`, it can only use the cores available on a single node. However this method @@ -464,7 +453,7 @@ requires exporting the workspace data to other processes. The FORK method in a c `parLapply` function might be used in situations, where different source code should run on each parallel process. -### Other parallel options +### Other Parallel Options - [foreach](https://cran.r-project.org/web/packages/foreach/index.html) library. It is functionally equivalent to the @@ -476,7 +465,8 @@ parallel process. expression via futures - [Poor-man's parallelism](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-1-poor-man-s-parallelism) (simple data parallelism). It is the simplest, but not an elegant way to parallelize R code. - It runs several copies of the same R script where's each read different sectors of the input data + It runs several copies of the same R script where each copy reads a different part of the input + data. - [Hands-off (OpenMP)](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-2-hands-off-parallelism) method. R has [OpenMP](https://www.openmp.org/resources/) support. Thus using OpenMP is a simple method where you don't need to know much about the parallelism options in your code. Please be diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_rstudio.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_rstudio.md new file mode 100644 index 0000000000000000000000000000000000000000..51d1068e3d1c32796859037e51a37e71810259b6 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_rstudio.md @@ -0,0 +1,14 @@ +# Data Analytics with RStudio + +[RStudio](https://rstudio.com/) is an integrated development environment (IDE) for R. It includes +a console, syntax-highlighting editor that supports direct code execution, as well as tools for +plotting, history, debugging and workspace management. RStudio is also available on ZIH systems. + +The easiest option is to run RStudio in JupyterHub directly in the browser. It can be started +similarly to a new kernel from [JupyterLab](../access/jupyterhub.md#jupyterlab) launcher. + + +{: style="width:90%" } + +!!! tip + If an error "could not start RStudio in time" occurs, try reloading the web page with `F5`. diff --git a/doc.zih.tu-dresden.de/docs/software/debuggers.md b/doc.zih.tu-dresden.de/docs/software/debuggers.md index d88ca5f068f0145e8acc46407feca93a14968522..0d4bda97f61fe6453d6027406ff88145c4204cfb 100644 --- a/doc.zih.tu-dresden.de/docs/software/debuggers.md +++ b/doc.zih.tu-dresden.de/docs/software/debuggers.md @@ -73,8 +73,8 @@ modified by DDT available, which has better support for Fortran 90 (e.g. 
derive  - Intuitive graphical user interface and great support for parallel applications -- We have 1024 licences, so many user can use this tool for parallel debugging -- Don't expect that debugging an MPI program with 100ths of process will always work without +- We have 1024 licenses, so many user can use this tool for parallel debugging +- Don't expect that debugging an MPI program with hundreds of processes will always work without problems - The more processes and nodes involved, the higher is the probability for timeouts or other problems @@ -159,7 +159,7 @@ marie@login$ srun -n 1 valgrind ./myprog - Not recommended for MPI parallel programs, since usually the MPI library will throw a lot of errors. But you may use Valgrind the following way such that every rank - writes its own Valgrind logfile: + writes its own Valgrind log file: ```console marie@login$ module load Valgrind diff --git a/doc.zih.tu-dresden.de/docs/software/deep_learning.md b/doc.zih.tu-dresden.de/docs/software/deep_learning.md deleted file mode 100644 index da8c9c461fddc3c870ef418bb7db2b1ed493abe8..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/deep_learning.md +++ /dev/null @@ -1,333 +0,0 @@ -# Deep learning - -**Prerequisites**: To work with Deep Learning tools you obviously need [Login](../access/ssh_login.md) -for the Taurus system and basic knowledge about Python, Slurm manager. - -**Aim** of this page is to introduce users on how to start working with Deep learning software on -both the ml environment and the scs5 environment of the Taurus system. - -## Deep Learning Software - -### TensorFlow - -[TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source software library -for dataflow and differentiable programming across a range of tasks. - -TensorFlow is available in both main partitions -[ml environment and scs5 environment](modules.md#module-environments) -under the module name "TensorFlow". However, for purposes of machine learning and deep learning, we -recommend using Ml partition [HPC-DA](../jobs_and_resources/hpcda.md). For example: - -```Bash -module load TensorFlow -``` - -There are numerous different possibilities on how to work with [TensorFlow](tensorflow.md) on -Taurus. On this page, for all examples default, scs5 partition is used. Generally, the easiest way -is using the [modules system](modules.md) -and Python virtual environment (test case). However, in some cases, you may need directly installed -TensorFlow stable or night releases. For this purpose use the -[EasyBuild](custom_easy_build_environment.md), [Containers](tensorflow_container_on_hpcda.md) and see -[the example](https://www.tensorflow.org/install/pip). For examples of using TensorFlow for ml partition -with module system see [TensorFlow page for HPC-DA](tensorflow.md). - -Note: If you are going used manually installed TensorFlow release we recommend use only stable -versions. - -## Keras - -[Keras](https://keras.io/) is a high-level neural network API, written in Python and capable of -running on top of [TensorFlow](https://github.com/tensorflow/tensorflow) Keras is available in both -environments [ml environment and scs5 environment](modules.md#module-environments) under the module -name "Keras". - -On this page for all examples default scs5 partition used. There are numerous different -possibilities on how to work with [TensorFlow](tensorflow.md) and Keras -on Taurus. 
Generally, the easiest way is using the [module system](modules.md) and Python -virtual environment (test case) to see TensorFlow part above. -For examples of using Keras for ml partition with the module system see the -[Keras page for HPC-DA](keras.md). - -It can either use TensorFlow as its backend. As mentioned in Keras documentation Keras capable of -running on Theano backend. However, due to the fact that Theano has been abandoned by the -developers, we don't recommend use Theano anymore. If you wish to use Theano backend you need to -install it manually. To use the TensorFlow backend, please don't forget to load the corresponding -TensorFlow module. TensorFlow should be loaded automatically as a dependency. - -Test case: Keras with TensorFlow on MNIST data - -Go to a directory on Taurus, get Keras for the examples and go to the examples: - -```Bash -git clone https://github.com/fchollet/keras.git'>https://github.com/fchollet/keras.git -cd keras/examples/ -``` - -If you do not specify Keras backend, then TensorFlow is used as a default - -Job-file (schedule job with sbatch, check the status with 'squeue -u \<Username>'): - -```Bash -#!/bin/bash -#SBATCH --gres=gpu:1 # 1 - using one gpu, 2 - for using 2 gpus -#SBATCH --mem=8000 -#SBATCH -p gpu2 # select the type of nodes (options: haswell, smp, sandy, west, gpu, ml) K80 GPUs on Haswell node -#SBATCH --time=00:30:00 -#SBATCH -o HLR_<name_of_your_script>.out # save output under HLR_${SLURMJOBID}.out -#SBATCH -e HLR_<name_of_your_script>.err # save error messages under HLR_${SLURMJOBID}.err - -module purge # purge if you already have modules loaded -module load modenv/scs5 # load scs5 environment -module load Keras # load Keras module -module load TensorFlow # load TensorFlow module - -# if you see 'broken pipe error's (might happen in interactive session after the second srun -command) uncomment line below -# module load h5py - -python mnist_cnn.py -``` - -Keep in mind that you need to put the bash script to the same folder as an executable file or -specify the path. - -Example output: - -```Bash -x_train shape: (60000, 28, 28, 1) 60000 train samples 10000 test samples Train on 60000 samples, -validate on 10000 samples Epoch 1/12 - -128/60000 [..............................] - ETA: 12:08 - loss: 2.3064 - acc: 0.0781 256/60000 -[..............................] - ETA: 7:04 - loss: 2.2613 - acc: 0.1523 384/60000 -[..............................] - ETA: 5:22 - loss: 2.2195 - acc: 0.2005 - -... - -60000/60000 [==============================] - 128s 2ms/step - loss: 0.0296 - acc: 0.9905 - -val_loss: 0.0268 - val_acc: 0.9911 Test loss: 0.02677746053306255 Test accuracy: 0.9911 -``` - -## Datasets - -There are many different datasets designed for research purposes. If you would like to download some -of them, first of all, keep in mind that many machine learning libraries have direct access to -public datasets without downloading it (for example -[TensorFlow Datasets](https://www.tensorflow.org/datasets). - -If you still need to download some datasets, first of all, be careful with the size of the datasets -which you would like to download (some of them have a size of few Terabytes). Don't download what -you really not need to use! Use login nodes only for downloading small files (hundreds of the -megabytes). For downloading huge files use [DataMover](../data_transfer/data_mover.md). -For example, you can use command `dtwget` (it is an analogue of the general wget -command). This command submits a job to the data transfer machines. 
If you need to download or -allocate massive files (more than one terabyte) please contact the support before. - -### The ImageNet dataset - -The [ImageNet](http://www.image-net.org/) project is a large visual database designed for use in -visual object recognition software research. In order to save space in the file system by avoiding -to have multiple duplicates of this lying around, we have put a copy of the ImageNet database -(ILSVRC2012 and ILSVR2017) under `/scratch/imagenet` which you can use without having to download it -again. For the future, the ImageNet dataset will be available in `/warm_archive`. ILSVR2017 also -includes a dataset for recognition objects from a video. Please respect the corresponding -[Terms of Use](https://image-net.org/download.php). - -## Jupyter Notebook - -Jupyter notebooks are a great way for interactive computing in your web browser. Jupyter allows -working with data cleaning and transformation, numerical simulation, statistical modelling, data -visualization and of course with machine learning. - -There are two general options on how to work Jupyter notebooks using HPC: remote Jupyter server and -JupyterHub. - -These sections show how to run and set up a remote Jupyter server within a sbatch GPU job and which -modules and packages you need for that. - -**Note:** On Taurus, there is a [JupyterHub](../access/jupyterhub.md), where you do not need the -manual server setup described below and can simply run your Jupyter notebook on HPC nodes. Keep in -mind, that, with JupyterHub, you can't work with some special instruments. However, general data -analytics tools are available. - -The remote Jupyter server is able to offer more freedom with settings and approaches. - -### Preparation phase (optional) - -On Taurus, start an interactive session for setting up the -environment: - -```Bash -srun --pty -n 1 --cpus-per-task=2 --time=2:00:00 --mem-per-cpu=2500 --x11=first bash -l -i -``` - -Create a new subdirectory in your home, e.g. Jupyter - -```Bash -mkdir Jupyter cd Jupyter -``` - -There are two ways how to run Anaconda. The easiest way is to load the Anaconda module. The second -one is to download Anaconda in your home directory. - -1. Load Anaconda module (recommended): - -```Bash -module load modenv/scs5 module load Anaconda3 -``` - -1. Download latest Anaconda release (see example below) and change the rights to make it an -executable script and run the installation script: - -```Bash -wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh chmod 744 -Anaconda3-2019.03-Linux-x86_64.sh ./Anaconda3-2019.03-Linux-x86_64.sh - -(during installation you have to confirm the license agreement) -``` - -Next step will install the anaconda environment into the home -directory (/home/userxx/anaconda3). Create a new anaconda environment with the name "jnb". - -```Bash -conda create --name jnb -``` - -### Set environmental variables on Taurus - -In shell activate previously created python environment (you can -deactivate it also manually) and install Jupyter packages for this python environment: - -```Bash -source activate jnb conda install jupyter -``` - -If you need to adjust the configuration, you should create the template. Generate config files for -Jupyter notebook server: - -```Bash -jupyter notebook --generate-config -``` - -Find a path of the configuration file, usually in the home under `.jupyter` directory, e.g. 
-`/home//.jupyter/jupyter_notebook_config.py` - -Set a password (choose easy one for testing), which is needed later on to log into the server -in browser session: - -```Bash -jupyter notebook password Enter password: Verify password: -``` - -You get a message like that: - -```Bash -[NotebookPasswordApp] Wrote *hashed password* to -/home/<zih_user>/.jupyter/jupyter_notebook_config.json -``` - -I order to create an SSL certificate for https connections, you can create a self-signed -certificate: - -```Bash -openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem -``` - -Fill in the form with decent values. - -Possible entries for your Jupyter config (`.jupyter/jupyter_notebook*config.py*`). Uncomment below -lines: - -```Bash -c.NotebookApp.certfile = u'<path-to-cert>/mycert.pem' c.NotebookApp.keyfile = -u'<path-to-cert>/mykey.key' - -# set ip to '*' otherwise server is bound to localhost only c.NotebookApp.ip = '*' -c.NotebookApp.open_browser = False - -# copy hashed password from the jupyter_notebook_config.json c.NotebookApp.password = u'<your -hashed password here>' c.NotebookApp.port = 9999 c.NotebookApp.allow_remote_access = True -``` - -Note: `<path-to-cert>` - path to key and certificate files, for example: -(`/home/\<username>/mycert.pem`) - -### Slurm job file to run the Jupyter server on Taurus with GPU (1x K80) (also works on K20) - -```Bash -#!/bin/bash -l #SBATCH --gres=gpu:1 # request GPU #SBATCH --partition=gpu2 # use GPU partition -SBATCH --output=notebook_output.txt #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=02:30:00 -SBATCH --mem=4000M #SBATCH -J "jupyter-notebook" # job-name #SBATCH -A <name_of_your_project> - -unset XDG_RUNTIME_DIR # might be required when interactive instead of sbatch to avoid -'Permission denied error' srun jupyter notebook -``` - -Start the script above (e.g. with the name jnotebook) with sbatch command: - -```Bash -sbatch jnotebook.slurm -``` - -If you have a question about sbatch script see the article about [Slurm](../jobs_and_resources/slurm.md). - -Check by the command: `tail notebook_output.txt` the status and the **token** of the server. It -should look like this: - -```Bash -https://(taurusi2092.taurus.hrsk.tu-dresden.de or 127.0.0.1):9999/ -``` - -You can see the **server node's hostname** by the command: `squeue -u <username>`. - -Remote connect to the server - -There are two options on how to connect to the server: - -1. You can create an ssh tunnel if you have problems with the -solution above. Open the other terminal and configure ssh -tunnel: (look up connection values in the output file of Slurm job, e.g.) (recommended): - -```Bash -node=taurusi2092 #see the name of the node with squeue -u <your_login> -localport=8887 #local port on your computer remoteport=9999 -#pay attention on the value. It should be the same value as value in the notebook_output.txt ssh --fNL ${localport}:${node}:${remoteport} <zih_user>@taurus.hrsk.tu-dresden.de #configure -of the ssh tunnel for connection to your remote server pgrep -f "ssh -fNL ${localport}" -#verify that tunnel is alive -``` - -2. On your client (local machine) you now can connect to the server. You need to know the **node's - hostname**, the **port** of the server and the **token** to login (see paragraph above). - -You can connect directly if you know the IP address (just ping the node's hostname while logged on -Taurus). 
- -```Bash -#comand on remote terminal taurusi2092$> host taurusi2092 # copy IP address from output # paste -IP to your browser or call on local terminal e.g. local$> firefox https://<IP>:<PORT> # https -important to use SSL cert -``` - -To login into the Jupyter notebook site, you have to enter the **token**. -(`https://localhost:8887`). Now you can create and execute notebooks on Taurus with GPU support. - -If you would like to use [JupyterHub](../access/jupyterhub.md) after using a remote manually configured -Jupyter server (example above) you need to change the name of the configuration file -(`/home//.jupyter/jupyter_notebook_config.py`) to any other. - -### F.A.Q - -**Q:** - I have an error to connect to the Jupyter server (e.g. "open failed: administratively -prohibited: open failed") - -**A:** - Check the settings of your Jupyter config file. Is it all necessary lines uncommented, the -right path to cert and key files, right hashed password from .json file? Check is the used local -port [available](https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers) -Check local settings e.g. (`/etc/ssh/sshd_config`, `/etc/hosts`). - -**Q:** I have an error during the start of the interactive session (e.g. PMI2_Init failed to -initialize. Return code: 1) - -**A:** Probably you need to provide `--mpi=none` to avoid ompi errors (). -`srun --mpi=none --reservation \<...> -A \<...> -t 90 --mem=4000 --gres=gpu:1 ---partition=gpu2-interactive --pty bash -l` diff --git a/doc.zih.tu-dresden.de/docs/software/distributed_training.md b/doc.zih.tu-dresden.de/docs/software/distributed_training.md new file mode 100644 index 0000000000000000000000000000000000000000..b3c6733bc0c7150eeee561ec450d33a7db27d54a --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/distributed_training.md @@ -0,0 +1,347 @@ +# Distributed Training + +## Internal Distribution + +Training a machine learning model can be a very time-consuming task. +Distributed training allows scaling up deep learning tasks, +so we can train very large models and speed up training time. + +There are two paradigms for distributed training: + +1. data parallelism: +each device has a replica of the model and computes over different parts of the data. +2. model parallelism: +models are distributed over multiple devices. + +In the following, we will stick to the concept of data parallelism because it is a widely-used +technique. +There are basically two strategies to train the scattered data throughout the devices: + +1. synchronous training: devices (workers) are trained over different slices of the data and at the +end of each step gradients are aggregated. +2. asynchronous training: +all devices are independently trained over the data and update variables asynchronously. + +### Distributed TensorFlow + +[TensorFlow](https://www.tensorflow.org/guide/distributed_training) provides a high-end API to +train your model and distribute the training on multiple GPUs or machines with minimal code changes. + +The primary distributed training method in TensorFlow is `tf.distribute.Strategy`. +There are multiple strategies that distribute the training depending on the specific use case, +the data and the model. + +TensorFlow refers to the synchronous training as mirrored strategy. +There are two mirrored strategies available whose principles are the same: + +- `tf.distribute.MirroredStrategy` supports the training on multiple GPUs on one machine. +- `tf.distribute.MultiWorkerMirroredStrategy` for multiple machines, each with multiple GPUs. 
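+
+For the single-machine case, a minimal sketch could look like the following. The tiny Keras model
+is only a placeholder; creating the strategy and building and compiling the model inside its scope
+are the only changes compared to serial code:
+
+```python
+import tensorflow as tf
+
+# Synchronous data parallelism across all GPUs visible on this node.
+strategy = tf.distribute.MirroredStrategy()
+print("Number of replicas:", strategy.num_replicas_in_sync)
+
+with strategy.scope():
+    # Placeholder model; replace it with your own network.
+    model = tf.keras.Sequential([
+        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
+        tf.keras.layers.Dense(10, activation="softmax"),
+    ])
+    model.compile(optimizer="adam",
+                  loss="sparse_categorical_crossentropy",
+                  metrics=["accuracy"])
+
+# model.fit(train_dataset, epochs=10)  # provide your own tf.data.Dataset here
+```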
+ +The Central Storage Strategy applies to environments where the GPUs might not be able to store +the entire model: + +- `tf.distribute.experimental.CentralStorageStrategy` supports the case of a single machine +with multiple GPUs. + +The CPU holds the global state of the model and GPUs perform the training. + +In some cases asynchronous training might be the better choice, for example, if workers differ on +capability, are down for maintenance, or have different priorities. +The Parameter Server Strategy is capable of applying asynchronous training: + +- `tf.distribute.experimental.ParameterServerStrategy` requires several Parameter Servers and workers. + +The Parameter Server holds the parameters and is responsible for updating +the global state of the models. +Each worker runs the training loop independently. + +??? example "Multi Worker Mirrored Strategy" + + In this case, we will go through an example with Multi Worker Mirrored Strategy. + Multi-node training requires a `TF_CONFIG` environment variable to be set which will + be different on each node. + + ```console + marie@compute$ TF_CONFIG='{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, "task": {"index": 0, "type": "worker"}}' python main.py + ``` + + The `cluster` field describes how the cluster is set up (same on each node). + Here, the cluster has two nodes referred to as workers. + The `IP:port` information is listed in the `worker` array. + The `task` field varies from node to node. + It specifies the type and index of the node. + In this case, the training job runs on worker 0, which is `10.1.10.58:12345`. + We need to adapt this snippet for each node. + The second node will have `'task': {'index': 1, 'type': 'worker'}`. + + With two modifications, we can parallelize the serial code: + We need to initialize the distributed strategy: + + ```python + strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() + ``` + + And define the model under the strategy scope: + + ```python + with strategy.scope(): + model = resnet.resnet56(img_input=img_input, classes=NUM_CLASSES) + model.compile( + optimizer=opt, + loss='sparse_categorical_crossentropy', + metrics=['sparse_categorical_accuracy']) + model.fit(train_dataset, + epochs=NUM_EPOCHS) + ``` + + To run distributed training, the training script needs to be copied to all nodes, + in this case on two nodes. + TensorFlow is available as a module. + Check for the version. + The `TF_CONFIG` environment variable can be set as a prefix to the command. 
+    Now, run the script on the partition `alpha` simultaneously on both nodes:
+
+    ```bash
+    #!/bin/bash
+
+    #SBATCH --job-name=distr
+    #SBATCH --partition=alpha
+    #SBATCH --output=%j.out
+    #SBATCH --error=%j.err
+    #SBATCH --mem=64000
+    #SBATCH --nodes=2
+    #SBATCH --ntasks=2
+    #SBATCH --ntasks-per-node=1
+    #SBATCH --cpus-per-task=14
+    #SBATCH --gres=gpu:1
+    #SBATCH --time=01:00:00
+
+    function print_nodelist {
+        scontrol show hostname $SLURM_NODELIST
+    }
+    NODE_1=$(print_nodelist | awk '{print $1}' | sort -u | head -n 1)
+    NODE_2=$(print_nodelist | awk '{print $1}' | sort -u | tail -n 1)
+    IP_1=$(dig +short ${NODE_1}.taurus.hrsk.tu-dresden.de)
+    IP_2=$(dig +short ${NODE_2}.taurus.hrsk.tu-dresden.de)
+
+    module load modenv/hiera
+    module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 TensorFlow/2.4.1
+
+    # On the first node
+    TF_CONFIG='{"cluster": {"worker": ["'"${NODE_1}"':33562", "'"${NODE_2}"':33561"]}, "task": {"index": 0, "type": "worker"}}' srun -w ${NODE_1} -N 1 --ntasks=1 --gres=gpu:1 python main_ddl.py &
+
+    # On the second node
+    TF_CONFIG='{"cluster": {"worker": ["'"${NODE_1}"':33562", "'"${NODE_2}"':33561"]}, "task": {"index": 1, "type": "worker"}}' srun -w ${NODE_2} -N 1 --ntasks=1 --gres=gpu:1 python main_ddl.py &
+
+    wait
+    ```
+
+### Distributed PyTorch
+
+!!! note
+
+    This section is under construction
+
+PyTorch provides multiple ways to achieve data parallelism to train the deep learning models
+efficiently. These models are part of the `torch.distributed` sub-package that ships with the main
+deep learning package.
+
+The easiest method to quickly prototype if the model is trainable in a multi-GPU setting is to wrap
+the existing model with the `torch.nn.DataParallel` class as shown below,
+
+```python
+model = torch.nn.DataParallel(model)
+```
+
+Adding this single line of code to the existing application will let PyTorch know that the model
+needs to be parallelized. But since this method uses threading to achieve parallelism, it fails to
+achieve true parallelism due to the well-known issue of the Global Interpreter Lock that exists in
+Python. To work around this issue and gain performance benefits of parallelism, the use of
+`torch.nn.DistributedDataParallel` is recommended. This involves a few more code changes to set up,
+but further increases the performance of model training. The starting step is to initialize the
+process group by calling `torch.distributed.init_process_group()` with the appropriate back end
+such as NCCL, MPI or Gloo. The use of NCCL as back end is recommended as it is currently the fastest
+back end when using GPUs.
+
+#### Using Multiple GPUs with PyTorch
+
+The example below shows how to use model parallelism, which in contrast to
+data parallelism splits a single model onto different GPUs, rather than replicating the entire
+model on each GPU.
+The high-level idea of model parallelism is to place different sub-networks of a model onto
+different devices.
+As only part of a model operates on any individual device a set of devices can collectively
+serve a larger model.
+
+It is recommended to use
+[DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html),
+instead of this class, to do multi-GPU training, even if there is only a single node.
+See: Use `nn.parallel.DistributedDataParallel` instead of multiprocessing or `nn.DataParallel`.
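+
+A minimal sketch of this setup could look as follows. It assumes that the environment variables
+`RANK` and `WORLD_SIZE` are provided by the process launcher (e.g., `torchrun` or your job script);
+the linear layer is only a placeholder for your own model:
+
+```python
+import os
+
+import torch
+import torch.distributed as dist
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+# Sketch only: RANK and WORLD_SIZE are assumed to be set by the launcher.
+rank = int(os.environ["RANK"])
+world_size = int(os.environ["WORLD_SIZE"])
+
+# NCCL is the recommended back end when training on GPUs.
+dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
+
+# Each process drives exactly one GPU.
+local_gpu = rank % torch.cuda.device_count()
+torch.cuda.set_device(local_gpu)
+
+model = torch.nn.Linear(100, 10).cuda()  # placeholder for your own model
+ddp_model = DDP(model, device_ids=[local_gpu])
+# From here on, use ddp_model like a normal module; gradients are synchronized automatically.
+```
+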
+Check the [page](https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead) and +[Distributed Data Parallel](https://pytorch.org/docs/stable/notes/ddp.html#ddp). + +??? example "Parallel Model" + + The main aim of this model is to show the way how to effectively implement your neural network + on multiple GPUs. It includes a comparison of different kinds of models and tips to improve the + performance of your model. + **Necessary** parameters for running this model are **2 GPU** and 14 cores. + + Download: [example_PyTorch_parallel.zip (4.2 KB)](misc/example_PyTorch_parallel.zip) + + Remember that for using [JupyterHub service](../access/jupyterhub.md) for PyTorch, you need to + create and activate a virtual environment (kernel) with loaded essential modules. + + Run the example in the same way as the previous examples. + +#### Distributed Data-Parallel + +[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) +(DDP) implements data parallelism at the module level which can run across multiple machines. +Applications using DDP should spawn multiple processes and create a single DDP instance per process. +DDP uses collective communications in the +[torch.distributed](https://pytorch.org/tutorials/intermediate/dist_tuto.html) package to +synchronize gradients and buffers. + +Please also look at the [official tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). + +To use distributed data parallelism on ZIH systems, please make sure the value of +parameter `--ntasks-per-node=<N>` equals the number of GPUs you use per node. +Also, it can be useful to increase `memory/cpu` parameters if you run larger models. +Memory can be set up to: + +- `--mem=250G` and `--cpus-per-task=7` for the partition `ml`. +- `--mem=60G` and `--cpus-per-task=6` for the partition `gpu2`. + +Keep in mind that only one memory parameter (`--mem-per-cpu=<MB>` or `--mem=<MB>`) can be specified. + +## External Distribution + +### Horovod + +[Horovod](https://github.com/horovod/horovod) is the open-source distributed training framework +for TensorFlow, Keras and PyTorch. +It makes it easier to develop distributed deep learning projects and speeds them up. +Horovod scales well to a large number of nodes and has a strong focus on efficient training on +GPUs. + +#### Why use Horovod? + +Horovod allows you to easily take a single-GPU TensorFlow and PyTorch program and +train it on many GPUs! +In some cases, the MPI model is much more straightforward and requires far less code changes than +the distributed code from TensorFlow for instance, with parameter servers. +Horovod uses MPI and NCCL which gives in some cases better results than +pure TensorFlow and PyTorch. + +#### Horovod as Module + +Horovod is available as a module with **TensorFlow** or **PyTorch** for +**all** module environments. +Please check the [software module list](modules.md) for the current version of the software. +Horovod can be loaded like other software on ZIH system: + +```console +marie@compute$ module spider Horovod # Check available modules +------------------------------------------------------------------------------------------------ + Horovod: +------------------------------------------------------------------------------------------------ + Description: + Horovod is a distributed training framework for TensorFlow. 
+ + Versions: + Horovod/0.18.2-fosscuda-2019b-TensorFlow-2.0.0-Python-3.7.4 + Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python-3.7.4 + Horovod/0.21.1-TensorFlow-2.4.1 +[...] +marie@compute$ module load Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python-3.7.4 +``` + +Or if you want to use Horovod on the partition `alpha`, you can load it with the dependencies: + +```console +marie@alpha$ module spider Horovod #Check available modules +marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 Horovod/0.21.1-TensorFlow-2.4.1 +``` + +#### Horovod Installation + +However, if it is necessary to use another version of Horovod, it is possible to install it +manually. For that, you need to create a [virtual environment](python_virtual_environments.md) and +load the dependencies (e.g. MPI). +Installing TensorFlow can take a few hours and is not recommended. + +##### Install Horovod for TensorFlow with Python and Pip + +This example shows the installation of Horovod for TensorFlow. +Adapt as required and refer to the [Horovod documentation](https://horovod.readthedocs.io/en/stable/install_include.html) +for details. + +```console +marie@alpha$ HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod\[tensorflow\] +[...] +marie@alpha$ horovodrun --check-build +Horovod v0.19.5: + +Available Frameworks: + [X] TensorFlow + [ ] PyTorch + [ ] MXNet + +Available Controllers: + [X] MPI + [ ] Gloo + +Available Tensor Operations: + [X] NCCL + [ ] DDL + [ ] CCL + [X] MPI + [ ] Gloo +``` + +If you want to use OpenMPI then specify `HOROVOD_GPU_ALLREDUCE=MPI`. +To have better performance it is recommended to use NCCL instead of OpenMPI. + +##### Verify Horovod Works + +```pycon +>>> import tensorflow +2021-10-07 16:38:55.694445: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 +>>> import horovod.tensorflow as hvd #import horovod +>>> hvd.init() #initialize horovod +>>> hvd.size() +1 +>>> hvd.rank() +0 +>>> print('Hello from:', hvd.rank()) +Hello from: 0 +``` + +??? example + + Follow the steps in the + [official examples](https://github.com/horovod/horovod/tree/master/examples) + to parallelize your code. + In Horovod, each GPU gets pinned to a process. + You can easily start your job with the following bash script with four processes on two nodes: + + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --ntasks=4 + #SBATCH --ntasks-per-node=2 + #SBATCH --gres=gpu:2 + #SBATCH --partition=ml + #SBATCH --mem=250G + #SBATCH --time=01:00:00 + #SBATCH --output=run_horovod.out + + module load modenv/ml + module load Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python-3.7.4 + + srun python <your_program.py> + ``` + + Do not forget to specify the total number of tasks `--ntasks` and the number of tasks per node + `--ntasks-per-node` which must match the number of GPUs per node. diff --git a/doc.zih.tu-dresden.de/docs/software/fem_software.md b/doc.zih.tu-dresden.de/docs/software/fem_software.md index bd65ea9832462bae475841f2e3ed2fa8193e3355..160aeded633f50e9abfdfae6d74a7627257ca565 100644 --- a/doc.zih.tu-dresden.de/docs/software/fem_software.md +++ b/doc.zih.tu-dresden.de/docs/software/fem_software.md @@ -1,247 +1,238 @@ # FEM Software -For an up-to-date list of the installed software versions on our -cluster, please refer to SoftwareModulesList **TODO LINK** (SoftwareModulesList). +!!! 
hint "Its all in the modules" -## Abaqus - -[ABAQUS](http://www.hks.com) **TODO links to realestate site** is a general-purpose finite-element program -designed for advanced linear and nonlinear engineering analysis -applications with facilities for linking-in user developed material -models, elements, friction laws, etc. - -Eike Dohmen (from Inst.f. Leichtbau und Kunststofftechnik) sent us the -attached description of his ABAQUS calculations. Please try to adapt -your calculations in that way.\<br />Eike is normally a Windows-User and -his description contains also some hints for basic Unix commands. ( -ABAQUS-SLURM.pdf **TODO LINK** (%ATTACHURL%/ABAQUS-SLURM.pdf) - only in German) - -Please note: Abaqus calculations should be started with a batch script. -Please read the information about the Batch System **TODO LINK ** (BatchSystems) -SLURM. - -The detailed Abaqus documentation can be found at -abaqus **TODO LINK MISSING** (only accessible from within the -TU Dresden campus net). + All packages described in this section, are organized in so-called modules. To list the available versions of a package and load a + particular, e.g., ANSYS, version, invoke the commands -**Example - Thanks to Benjamin Groeger, Inst. f. Leichtbau und -Kunststofftechnik) ** - -1. Prepare an Abaqus input-file (here the input example from Benjamin) - -Rot-modell-BenjaminGroeger.inp **TODO LINK** (%ATTACHURL%/Rot-modell-BenjaminGroeger.inp) - -2. Prepare a batch script on taurus like this - -``` -#!/bin/bash<br> -### Thanks to Benjamin Groeger, Institut fuer Leichtbau und Kunststofftechnik, 38748<br />### runs on taurus and needs ca 20sec with 4cpu<br />### generates files: -### yyyy.com -### yyyy.dat -### yyyy.msg -### yyyy.odb -### yyyy.prt -### yyyy.sim -### yyyy.sta -#SBATCH --nodes=1 ### with >1 node abaqus needs a nodeliste -#SBATCH --ntasks-per-node=4 -#SBATCH --mem=500 ### memory (sum) -#SBATCH --time=00:04:00 -### give a name, what ever you want -#SBATCH --job-name=yyyy -### you get emails when the job will finished or failed -### set your right email -#SBATCH --mail-type=END,FAIL -#SBATCH --mail-user=xxxxx.yyyyyy@mailbox.tu-dresden.de -### set your project -#SBATCH -A p_xxxxxxx -### Abaqus have its own MPI -unset SLURM_GTIDS -### load and start -module load ABAQUS/2019 -abaqus interactive input=Rot-modell-BenjaminGroeger.inp job=yyyy cpus=4 mp_mode=mpi + ```console + marie@login$ module avail ANSYS + [...] + marie@login$ module load ANSYS/<version> + ``` -``` + The section [runtime environment](modules.md) provides a comprehensive overview + on the module system and relevant commands. -3. Start the batch script (name of our script is -"batch-Rot-modell-BenjaminGroeger") +## Abaqus -``` -sbatch batch-Rot-modell-BenjaminGroeger --->; you will get a jobnumber = JobID (for example 3130522) -``` +[Abaqus](https://www.3ds.com/de/produkte-und-services/simulia/produkte/abaqus/) is a general-purpose +finite element method program designed for advanced linear and nonlinear engineering analysis +applications with facilities for linking-in user developed material models, elements, friction laws, +etc. -4. Control the status of the job +### Guide by User -``` -squeue -u your_login -->; in column "ST" (Status) you will find a R=Running or P=Pending (waiting for resources) -``` +Eike Dohmen (from Inst. f. Leichtbau und Kunststofftechnik) sent us the description of his +Abaqus calculations. Please try to adapt your calculations in that way. 
Eike is normally a +Windows user and his description contains also some hints for basic Unix commands: +[Abaqus-Slurm.pdf (only in German)](misc/abaqus-slurm.pdf). -## ANSYS +### General -ANSYS is a general-purpose finite-element program for engineering -analysis, and includes preprocessing, solution, and post-processing -functions. It is used in a wide range of disciplines for solutions to -mechanical, thermal, and electronic problems. [ANSYS and ANSYS -CFX](http://www.ansys.com) used to be separate packages in the past and -are now combined. +Abaqus calculations should be started using a job file (aka. batch script). Please refer to the +page covering the [batch system Slurm](../jobs_and_resources/slurm.md) if you are not familiar with +Slurm or [writing job files](../jobs_and_resources/slurm.md#job-files). -ANSYS, like all other installed software, is organized in so-called -modules **TODO LINK** (RuntimeEnvironment). To list the available versions and load a -particular ANSYS version, type +??? example "Usage of Abaqus" -``` -module avail ANSYS -... -module load ANSYS/VERSION -``` + (Thanks to Benjamin Groeger, Inst. f. Leichtbau und Kunststofftechnik)). -In general, HPC-systems are not designed for interactive "GUI-working". -Even so, it is possible to start a ANSYS workbench on Taurus (login -nodes) interactively for short tasks. The second and recommended way is -to use batch files. Both modes are documented in the following. + 1. Prepare an Abaqus input-file. You can start with the input example from Benjamin: + [Rot-modell-BenjaminGroeger.inp](misc/Rot-modell-BenjaminGroeger.inp) + 2. Prepare a job file on ZIH systems like this + ```bash + #!/bin/bash + ### needs ca 20 sec with 4cpu + ### generates files: + ### yyyy.com + ### yyyy.dat + ### yyyy.msg + ### yyyy.odb + ### yyyy.prt + ### yyyy.sim + ### yyyy.sta + #SBATCH --nodes=1 # with >1 node Abaqus needs a nodeliste + #SBATCH --ntasks-per-node=4 + #SBATCH --mem=500 # total memory + #SBATCH --time=00:04:00 + #SBATCH --job-name=yyyy # give a name, what ever you want + #SBATCH --mail-type=END,FAIL # send email when the job finished or failed + #SBATCH --mail-user=<name>@mailbox.tu-dresden.de # set your email + #SBATCH -A p_xxxxxxx # charge compute time to your project + + + # Abaqus has its own MPI + unset SLURM_GTIDS + + # load module and start Abaqus + module load ABAQUS/2019 + abaqus interactive input=Rot-modell-BenjaminGroeger.inp job=yyyy cpus=4 mp_mode=mpi + ``` + 3. Start the job file (e.g., name `batch-Rot-modell-BenjaminGroeger.sh`) + ``` + marie@login$ sbatch batch-Rot-modell-BenjaminGroeger.sh # Slurm will provide the Job Id (e.g., 3130522) + ``` + 4. Control the status of the job + ``` + marie@login squeue -u your_login # in column "ST" (Status) you will find a R=Running or P=Pending (waiting for resources) + ``` + +## Ansys + +Ansys is a general-purpose finite element method program for engineering analysis, and includes +preprocessing, solution, and post-processing functions. It is used in a wide range of disciplines +for solutions to mechanical, thermal, and electronic problems. +[Ansys and Ansys CFX](http://www.ansys.com) used to be separate packages in the past and are now +combined. + +In general, HPC systems are not designed for interactive working with GUIs. Even so, it is possible to +start a Ansys workbench on the login nodes interactively for short tasks. The second and +**recommended way** is to use job files. Both modes are documented in the following. + +!!! 
note ""
+
+    Since the MPI library that Ansys uses internally (Platform MPI) has some problems integrating
+    seamlessly with Slurm, you have to unset the environment variable `SLURM_GTIDS` in your
+    environment before running the Ansys workbench in interactive and batch mode.

### Using Workbench Interactively

-For fast things, ANSYS workbench can be invoked interactively on the
-login nodes of Taurus. X11 forwarding needs to enabled when establishing
-the SSH connection. For OpenSSH this option is '-X' and it is valuable
-to use compression of all data via '-C'.
+Ansys workbench (`runwb2`) can be invoked interactively on the login nodes of ZIH systems for short tasks.
+[X11 forwarding](../access/ssh_login.md#x11-forwarding) needs to be enabled when establishing the SSH
+connection. For OpenSSH the corresponding option is `-X` and it is valuable to use compression of
+all data via `-C`.

-```
-# Connect to taurus, e.g. ssh -CX
-module load ANSYS/VERSION
-runwb2
+```console
+# SSH connection established using -CX
+marie@login$ module load ANSYS/<version>
+marie@login$ runwb2
```

-If more time is needed, a CPU has to be allocated like this (see topic
-batch systems **TODO LINK** (BatchSystems) for further information):
+If more time is needed, a CPU has to be allocated like this (see
+[batch systems Slurm](../jobs_and_resources/slurm.md) for further information):

+```console
+marie@login$ module load ANSYS/<version>
+marie@login$ srun -t 00:30:00 --x11=first [SLURM_OPTIONS] --pty bash
+[...]
+marie@login$ runwb2
```
-module load ANSYS/VERSION
-srun -t 00:30:00 --x11=first [SLURM_OPTIONS] --pty bash
-runwb2
-```
-
-**Note:** The software NICE Desktop Cloud Visualization (DCV) enables to
-remotly access OpenGL-3D-applications running on taurus using its GPUs
-(cf. virtual desktops **TODO LINK** (Compendium.VirtualDesktops)). Using ANSYS
-together with dcv works as follows:
-
-- Follow the instructions within virtual
-  desktops **TODO LINK** (Compendium.VirtualDesktops)
-```
-module load ANSYS
-```
+!!! hint "Better use DCV"

-```
-unset SLURM_GTIDS
-```
+    The software NICE Desktop Cloud Visualization (DCV) enables you to
+    remotely access OpenGL-3D-applications running on ZIH systems using its GPUs
+    (cf. [virtual desktops](virtual_desktops.md)).

-- Note the hints w.r.t. GPU support on dcv side
+Ansys can be used under DCV to make use of GPU acceleration. Follow the instructions within
+[virtual desktops](virtual_desktops.md) to set up a DCV session. Then, load an Ansys module, unset
+the environment variable `SLURM_GTIDS`, and finally start the workbench:

-```
-runwb2
+```console
+marie@gpu$ module load ANSYS
+marie@gpu$ unset SLURM_GTIDS
+marie@gpu$ runwb2
```

### Using Workbench in Batch Mode

-The ANSYS workbench (runwb2) can also be used in a batch script to start
-calculations (the solver, not GUI) from a workbench project into the
-background. To do so, you have to specify the -B parameter (for batch
-mode), -F for your project file, and can then either add different
-commands via -E parameters directly, or specify a workbench script file
-containing commands via -R.
+The Ansys workbench (`runwb2`) can also be used in a job file to start calculations (the solver,
+not GUI) from a workbench project into the background. To do so, you have to specify the `-B`
+parameter (for batch mode), `-F` for your project file, and can then either add different commands
+directly via `-E` parameters, or specify a workbench script file containing commands via `-R`.
-**NOTE:** Since the MPI library that ANSYS uses internally (Platform -MPI) has some problems integrating seamlessly with SLURM, you have to -unset the enviroment variable SLURM_GTIDS in your job environment before -running workbench. An example batch script could look like this: +??? example "Ansys Job File" + ```bash #!/bin/bash #SBATCH --time=0:30:00 #SBATCH --nodes=1 #SBATCH --ntasks=2 #SBATCH --mem-per-cpu=1000M + unset SLURM_GTIDS # Odd, but necessary! - unset SLURM_GTIDS # Odd, but necessary! - - module load ANSYS/VERSION + module load ANSYS/<version> runwb2 -B -F Workbench_Taurus.wbpj -E 'Project.Update' -E 'Save(Overwrite=True)' #or, if you wish to use a workbench replay file, replace the -E parameters with: -R mysteps.wbjn + ``` ### Running Workbench in Parallel -Unfortunately, the number of CPU cores you wish to use cannot simply be -given as a command line parameter to your runwb2 call. Instead, you have -to enter it into an XML file in your home. This setting will then be -used for all your runwb2 jobs. While it is also possible to edit this -setting via the Mechanical GUI, experience shows that this can be -problematic via X-Forwarding and we only managed to use the GUI properly -via DCV **TODO LINK** (DesktopCloudVisualization), so we recommend you simply edit -the XML file directly with a text editor of your choice. It is located +Unfortunately, the number of CPU cores you wish to use cannot simply be given as a command line +parameter to your `runwb2` call. Instead, you have to enter it into an XML file in your `home` +directory. This setting will then be **used for all** your `runwb2` jobs. While it is also possible +to edit this setting via the Mechanical GUI, experience shows that this can be problematic via +X11-forwarding and we only managed to use the GUI properly via [DCV](virtual_desktops.md), so we +recommend you simply edit the XML file directly with a text editor of your choice. It is located under: -'$HOME/.mw/Application Data/Ansys/v181/SolveHandlers.xml' +`$HOME/.mw/Application Data/Ansys/v181/SolveHandlers.xml` -(mind the space in there.) You might have to adjust the ANSYS Version -(v181) in the path. In this file, you can find the parameter +(mind the space in there.) You might have to adjust the Ansys version +(here `v181`) in the path to your preferred version. In this file, you can find the parameter - <MaxNumberProcessors>2</MaxNumberProcessors> +`<MaxNumberProcessors>2</MaxNumberProcessors>` -that you can simply change to something like 16 oder 24. For now, you -should stay within single-node boundaries, because multi-node -calculations require additional parameters. The number you choose should -match your used --cpus-per-task parameter in your sbatch script. +that you can simply change to something like 16 or 24. For now, you should stay within single-node +boundaries, because multi-node calculations require additional parameters. The number you choose +should match your used `--cpus-per-task` parameter in your job file. ## COMSOL Multiphysics -"[COMSOL Multiphysics](http://www.comsol.com) (formerly FEMLAB) is a -finite element analysis, solver and Simulation software package for -various physics and engineering applications, especially coupled -phenomena, or multiphysics." 
-[\[http://en.wikipedia.org/wiki/COMSOL_Multiphysics Wikipedia\]](
- http://en.wikipedia.org/wiki/COMSOL_Multiphysics Wikipedia)
+[COMSOL Multiphysics](http://www.comsol.com) (formerly FEMLAB) is a finite element analysis, solver
+and simulation software package for various physics and engineering applications, especially coupled
+phenomena, or multiphysics.

-Comsol may be used remotely on ZIH machines or locally on the desktop,
-using ZIH license server.
+COMSOL may be used remotely on ZIH systems or locally on the desktop, using the ZIH license server.

-For using Comsol on ZIH machines, the following operating modes (see
-Comsol manual) are recommended:
+For using COMSOL on ZIH systems, we recommend the interactive client-server mode (see COMSOL
+manual).

-- Interactive Client Server Mode
+### Client-Server Mode

-In this mode Comsol runs as server process on the ZIH machine and as
-client process on your local workstation. The client process needs a
-dummy license for installation, but no license for normal work. Using
-this mode is almost undistinguishable from working with a local
-installation. It works well with Windows clients. For this operation
-mode to work, you must build an SSH tunnel through the firewall of ZIH.
-For further information, see the Comsol manual.
+In this mode, COMSOL runs as a server process on the ZIH system and as a client process on your
+local workstation. The client process needs a dummy license for installation, but no license for
+normal work. Using this mode is almost indistinguishable from working with a local installation.
+It also works well with Windows clients. For this operation mode to work, you must build an SSH
+tunnel through the firewall of ZIH. For further information, please refer to the COMSOL manual.

-Example for starting the server process (4 cores, 10 GB RAM, max. 8
-hours running time):
+### Usage

-    module load COMSOL
-    srun -c4 -t 8:00 --mem-per-cpu=2500 comsol -np 4 server
+??? example "Server Process"

-- Interactive Job via Batchsystem SLURM
+    Start the server process with 4 cores, 10 GB RAM and max. 8 hours running time using an
+    interactive Slurm job like this:

-<!-- -->
+    ```console
+    marie@login$ module load COMSOL
+    marie@login$ srun -n 1 -c 4 --mem-per-cpu=2500 -t 8:00 comsol -np 4 server
+    ```

-    module load COMSOL
-    srun -n1 -c4 --mem-per-cpu=2500 -t 8:00 --pty --x11=first comsol -np 4
+??? example "Interactive Job"
+
+    If you'd like to work interactively using COMSOL, you can request an interactive job with,
+    e.g., 4 cores and 2500 MB RAM for 8 hours and X11 forwarding to open the COMSOL GUI:
+
+    ```console
+    marie@login$ module load COMSOL
+    marie@login$ srun -n 1 -c 4 --mem-per-cpu=2500 -t 8:00 --pty --x11=first comsol -np 4
+    ```

-Man sollte noch schauen, ob das Rendering unter Options -> Preferences
--> Graphics and Plot Windows auf Software-Rendering steht - und dann
-sollte man campusintern arbeiten knnen.
+    Please make sure that the option *Preferences* --> Graphics --> *Rendering* is set to *software
+    rendering*. Then, you can work from within the campus network.

-- Background Job via Batchsystem SLURM
+??? example "Background Job"

-<!-- -->
+    Interactive working is great for debugging and setting experiments up. But, if you have a huge
+    workload, you should definitely rely on job files. That is, you put the necessary steps to get
+    the work done into scripts and submit these scripts to the batch system. These two steps are
+    outlined:
+    1. Create a [job file](../jobs_and_resources/slurm.md#job-files), e.g.
+ ```bash #!/bin/bash #SBATCH --time=24:00:00 #SBATCH --nodes=2 @@ -251,21 +242,33 @@ sollte man campusintern arbeiten knnen. module load COMSOL srun comsol -mpi=intel batch -inputfile ./MyInputFile.mph - -Submit via: `sbatch <filename>` + ``` ## LS-DYNA -Both, the shared memory version and the distributed memory version (mpp) -are installed on all machines. +[LS-DYNA](https://www.dynamore.de/de) is a general-purpose, implicit and explicit FEM software for +nonlinear structural analysis. Both, the shared memory version and the distributed memory version +(`mpp`) are installed on ZIH systems. + +You need a job file (aka. batch script) to run the MPI version. -To run the MPI version on Taurus or Venus you need a batchfile (sumbmit -with `sbatch <filename>`) like: +??? example "Minimal Job File" + ```bash #!/bin/bash - #SBATCH --time=01:00:00 # walltime - #SBATCH --ntasks=16 # number of processor cores (i.e. tasks) + #SBATCH --time=01:00:00 # walltime + #SBATCH --ntasks=16 # number of processor cores (i.e. tasks) #SBATCH --mem-per-cpu=1900M # memory per CPU core - + module load ls-dyna srun mpp-dyna i=neon_refined01_30ms.k memory=120000000 + ``` + + Submit the job file to the batch system via + + ```console + marie@login$ sbatch <filename> + ``` + + Please refer to the section [Slurm](../jobs_and_resources/slurm.md) for further details and + options on the batch system as well as monitoring commands. diff --git a/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md b/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md deleted file mode 100644 index 850493f6d4a86b6d3220b03bf17a445dc2061979..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md +++ /dev/null @@ -1,353 +0,0 @@ -# Get started with HPC-DA - -HPC-DA (High-Performance Computing and Data Analytics) is a part of TU-Dresden general purpose HPC -cluster (Taurus). HPC-DA is the best **option** for **Machine learning, Deep learning** applications -and tasks connected with the big data. - -**This is an introduction of how to run machine learning applications on the HPC-DA system.** - -The main **aim** of this guide is to help users who have started working with Taurus and focused on -working with Machine learning frameworks such as TensorFlow or Pytorch. - -**Prerequisites:** To work with HPC-DA, you need [Login](../access/ssh_login.md) for the Taurus system -and preferably have basic knowledge about High-Performance computers and Python. - -**Disclaimer:** This guide provides the main steps on the way of using Taurus, for details please -follow links in the text. - -You can also find the information you need on the -[HPC-Introduction] **todo** %ATTACHURL%/HPC-Introduction.pdf?t=1585216700 and -[HPC-DA-Introduction] *todo** %ATTACHURL%/HPC-DA-Introduction.pdf?t=1585162693 presentation slides. - -## Why should I use HPC-DA? The architecture and feature of the HPC-DA - -HPC-DA built on the base of [Power9](https://www.ibm.com/it-infrastructure/power/power9) -architecture from IBM. HPC-DA created from -[AC922 IBM servers](https://www.ibm.com/ie-en/marketplace/power-systems-ac922), which was created -for AI challenges, analytics and working with, Machine learning, data-intensive workloads, -deep-learning frameworks and accelerated databases. POWER9 is the processor with state-of-the-art -I/O subsystem technology, including next-generation NVIDIA NVLink, PCIe Gen4 and OpenCAPI. 
-[Here](../jobs_and_resources/power9.md) you could find a detailed specification of the TU Dresden -HPC-DA system. - -The main feature of the Power9 architecture (ppc64le) is the ability to work the -[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link** -support. NV-Link technology allows increasing a total bandwidth of 300 gigabytes per second (GB/sec) - -- 10X the bandwidth of PCIe Gen 3. The bandwidth is a crucial factor for deep learning and machine - learning applications. - -**Note:** The Power9 architecture not so common as an x86 architecture. This means you are not so -flexible with choosing applications for your projects. Even so, the main tools and applications are -available. See available modules here. - -**Please use the ml partition if you need GPUs!** Otherwise using the x86 partitions (e.g Haswell) -most likely would be more beneficial. - -## Login - -### SSH Access - -The recommended way to connect to the HPC login servers directly via ssh: - -```Bash -ssh <zih-login>@taurus.hrsk.tu-dresden.de -``` - -Please put this command in the terminal and replace `<zih-login>` with your login that you received -during the access procedure. Accept the host verifying and enter your password. - -This method requires two conditions: -Linux OS, workstation within the campus network. For other options and -details check the [login page](../access/ssh_login.md). - -## Data management - -### Workspaces - -As soon as you have access to HPC-DA you have to manage your data. The main method of working with -data on Taurus is using Workspaces. You could work with simple examples in your home directory -(where you are loading by default). However, in accordance with the -[storage concept](../data_lifecycle/overview.md) -**please use** a [workspace](../data_lifecycle/workspaces.md) -for your study and work projects. - -You should create your workspace with a similar command: - -```Bash -ws_allocate -F scratch Machine_learning_project 50 #allocating workspase in scratch directory for 50 days -``` - -After the command, you will have an output with the address of the workspace based on scratch. Use -it to store the main data of your project. - -For different purposes, you should use different storage systems. To work as efficient as possible, -consider the following points: - -- Save source code etc. in `/home` or `/projects/...` -- Store checkpoints and other massive but temporary data with - workspaces in: `/scratch/ws/...` -- For data that seldom changes but consumes a lot of space, use - mid-term storage with workspaces: `/warm_archive/...` -- For large parallel applications where using the fastest file system - is a necessity, use with workspaces: `/lustre/ssd/...` -- Compilation in `/dev/shm`** or `/tmp` - -### Data moving - -#### Moving data to/from the HPC machines - -To copy data to/from the HPC machines, the Taurus [export nodes](../data_transfer/export_nodes.md) -should be used. They are the preferred way to transfer your data. There are three possibilities to -exchanging data between your local machine (lm) and the HPC machines (hm): **SCP, RSYNC, SFTP**. - -Type following commands in the local directory of the local machine. For example, the **`SCP`** -command was used. - -#### Copy data from lm to hm - -```Bash -scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> #Copy file from your local machine. 
For example: scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mastermann-Macine_learning_project/ - -scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> #Copy directory from your local machine. -``` - -#### Copy data from hm to lm - -```Bash -scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location> #Copy file. For example: scp mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mastermann-Macine_learning_project/helloworld.txt /home/mustermann/Downloads - -scp -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location> #Copy directory -``` - -#### Moving data inside the HPC machines. Datamover - -The best way to transfer data inside the Taurus is the [data mover](../data_transfer/data_mover.md). -It is the special data transfer machine providing the global file systems of each ZIH HPC system. -Datamover provides the best data speed. To load, move, copy etc. files from one file system to -another file system, you have to use commands with **dt** prefix, such as: - -`dtcp, dtwget, dtmv, dtrm, dtrsync, dttar, dtls` - -These commands submit a job to the data transfer machines that execute the selected command. Except -for the `dt` prefix, their syntax is the same as the shell command without the `dt`. - -```Bash -dtcp -r /scratch/ws/<name_of_your_workspace>/results /lustre/ssd/ws/<name_of_your_workspace>; #Copy from workspace in scratch to ssd. -dtwget https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz #Download archive CIFAR-100. -``` - -## BatchSystems. SLURM - -After logon and preparing your data for further work the next logical step is to start your job. For -these purposes, SLURM is using. Slurm (Simple Linux Utility for Resource Management) is an -open-source job scheduler that allocates compute resources on clusters for queued defined jobs. By -default, after your logging, you are using the login nodes. The intended purpose of these nodes -speaks for oneself. Applications on an HPC system can not be run there! They have to be submitted -to compute nodes (ml nodes for HPC-DA) with dedicated resources for user jobs. - -Job submission can be done with the command: `-srun [options] <command>.` - -This is a simple example which you could use for your start. The `srun` command is used to submit a -job for execution in real-time designed for interactive use, with monitoring the output. For some -details please check [the Slurm page](../jobs_and_resources/slurm.md). - -```Bash -srun -p ml -N 1 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with allocating: 1 node, 1 gpu per node, with 8000 mb on 1 hour. -``` - -However, using srun directly on the shell will lead to blocking and launch an interactive job. Apart -from short test runs, it is **recommended to launch your jobs into the background by using batch -jobs**. For that, you can conveniently put the parameters directly into the job file which you can -submit using `sbatch [options] <job file>.` - -This is the example of the sbatch file to run your application: - -```Bash -#!/bin/bash -#SBATCH --mem=8GB # specify the needed memory -#SBATCH -p ml # specify ml partition -#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. 
use one GPU per task) -#SBATCH --nodes=1 # request 1 node -#SBATCH --time=00:15:00 # runs for 10 minutes -#SBATCH -c 1 # how many cores per task allocated -#SBATCH -o HLR_name_your_script.out # save output message under HLR_${SLURMJOBID}.out -#SBATCH -e HLR_name_your_script.err # save error messages under HLR_${SLURMJOBID}.err - -module load modenv/ml -module load TensorFlow - -python machine_learning_example.py - -## when finished writing, submit with: sbatch <script_name> For example: sbatch machine_learning_script.slurm -``` - -The `machine_learning_example.py` contains a simple ml application based on the mnist model to test -your sbatch file. It could be found as the [attachment] **todo** -%ATTACHURL%/machine_learning_example.py in the bottom of the page. - -## Start your application - -As stated before HPC-DA was created for deep learning, machine learning applications. Machine -learning frameworks as TensorFlow and PyTorch are industry standards now. - -There are three main options on how to work with Tensorflow and PyTorch: - -1. **Modules** -1. **JupyterNotebook** -1. **Containers** - -### Modules - -The easiest way is using the [modules system](modules.md) and Python virtual environment. Modules -are a way to use frameworks, compilers, loader, libraries, and utilities. The module is a user -interface that provides utilities for the dynamic modification of a user's environment without -manual modifications. You could use them for srun , bath jobs (sbatch) and the Jupyterhub. - -A virtual environment is a cooperatively isolated runtime environment that allows Python users and -applications to install and update Python distribution packages without interfering with the -behaviour of other Python applications running on the same system. At its core, the main purpose of -Python virtual environments is to create an isolated environment for Python projects. - -**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. We recommend -using venv to work with Tensorflow and Pytorch on Taurus. It has been integrated into the standard -library under the [venv module](https://docs.python.org/3/library/venv.html). However, if you have -reasons (previously created environments etc) you could easily use conda. The conda is the second -way to use a virtual environment on the Taurus. -[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) -is an open-source package management system and environment management system from the Anaconda. - -As was written in the previous chapter, to start the application (using -modules) and to run the job exist two main options: - -- The `srun` command:** - -```Bash -srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=8000 bash #job submission in ml nodes with allocating: 1 node, 1 task per node, 2 CPUs per task, 1 gpu per node, with 8000 mb on 1 hour. - -module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - -mkdir python-virtual-environments #create folder for your environments -cd python-virtual-environments #go to folder -module load TensorFlow #load TensorFlow module to use python. Example output: Module Module TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4 and 31 dependencies loaded. 
-which python #check which python are you using -python3 -m venv --system-site-packages env #create virtual environment "env" which inheriting with global site packages -source env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$ -``` - -The inscription (env) at the beginning of each line represents that now you are in the virtual -environment. - -Now you can check the working capacity of the current environment. - -```Bash -python # start python -import tensorflow as tf -print(tf.__version__) # example output: 2.1.0 -``` - -The second and main option is using batch jobs (`sbatch`). It is used to submit a job script for -later execution. Consequently, it is **recommended to launch your jobs into the background by using -batch jobs**. To launch your machine learning application as well to srun job you need to use -modules. See the previous chapter with the sbatch file example. - -Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are available. (25.02.20) - -Note: However in case of using sbatch files to send your job you usually don't need a virtual -environment. - -### JupyterNotebook - -The Jupyter Notebook is an open-source web application that allows you to create documents -containing live code, equations, visualizations, and narrative text. Jupyter notebook allows working -with TensorFlow on Taurus with GUI (graphic user interface) in a **web browser** and the opportunity -to see intermediate results step by step of your work. This can be useful for users who dont have -huge experience with HPC or Linux. - -There is [JupyterHub](../access/jupyterhub.md) on Taurus, where you can simply run your Jupyter -notebook on HPC nodes. Also, for more specific cases you can run a manually created remote jupyter -server. You can find the manual server setup [here](deep_learning.md). However, the simplest option -for beginners is using JupyterHub. - -JupyterHub is available at -[taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter) - -After logging, you can start a new session and configure it. There are simple and advanced forms to -set up your session. On the simple form, you have to choose the "IBM Power (ppc64le)" architecture. -You can select the required number of CPUs and GPUs. For the acquaintance with the system through -the examples below the recommended amount of CPUs and 1 GPU will be enough. -With the advanced form, you can use -the configuration with 1 GPU and 7 CPUs. To access for all your workspaces use " / " in the -workspace scope. Please check updates and details [here](../access/jupyterhub.md). - -Several Tensorflow and PyTorch examples for the Jupyter notebook have been prepared based on some -simple tasks and models which will give you an understanding of how to work with ML frameworks and -JupyterHub. It could be found as the [attachment] **todo** %ATTACHURL%/machine_learning_example.py -in the bottom of the page. A detailed explanation and examples for TensorFlow can be found -[here](tensorflow_on_jupyter_notebook.md). For the Pytorch - [here](pytorch.md). Usage information -about the environments for the JupyterHub could be found [here](../access/jupyterhub.md) in the chapter -*Creating and using your own environment*. - -Versions: TensorFlow 1.14, 1.15, 2.0, 2.1; PyTorch 1.1, 1.3 are -available. (25.02.20) - -### Containers - -Some machine learning tasks such as benchmarking require using containers. 
A container is a standard -unit of software that packages up code and all its dependencies so the application runs quickly and -reliably from one computing environment to another. Using containers gives you more flexibility -working with modules and software but at the same time requires more effort. - -On Taurus [Singularity](https://sylabs.io/) is used as a standard container solution. Singularity -enables users to have full control of their environment. This means that **you dont have to ask an -HPC support to install anything for you - you can put it in a Singularity container and run!**As -opposed to Docker (the beat-known container solution), Singularity is much more suited to being used -in an HPC environment and more efficient in many cases. Docker containers also can easily be used by -Singularity from the [DockerHub](https://hub.docker.com) for instance. Also, some containers are -available in [Singularity Hub](https://singularity-hub.org/). - -The simplest option to start working with containers on HPC-DA is importing from Docker or -SingularityHub container with TensorFlow. It does **not require root privileges** and so works on -Taurus directly: - -```Bash -srun -p ml -N 1 --gres=gpu:1 --time=02:00:00 --pty --mem-per-cpu=8000 bash #allocating resourses from ml nodes to start the job to create a container. -singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version -singularity run --nv my-ML-container.sif #run my-ML-container.sif container with support of the Nvidia's GPU. You could also entertain with your container by commands: singularity shell, singularity exec -``` - -There are two sources for containers for Power9 architecture with -Tensorflow and PyTorch on the board: - -* [Tensorflow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le): - Community-supported ppc64le docker container for TensorFlow. -* [PowerAI container](https://hub.docker.com/r/ibmcom/powerai/): - Official Docker container with Tensorflow, PyTorch and many other packages. - Heavy container. It requires a lot of space. Could be found on Taurus. - -Note: You could find other versions of software in the container on the "tag" tab on the docker web -page of the container. - -To use not a pure Tensorflow, PyTorch but also with some Python packages -you have to use the definition file to create the container -(bootstrapping). For details please see the [Container](containers.md) page -from our wiki. Bootstrapping **has required root privileges** and -Virtual Machine (VM) should be used! There are two main options on how -to work with VM on Taurus: [VM tools](vm_tools.md) - automotive algorithms -for using virtual machines; [Manual method](virtual_machines.md) - it requires more -operations but gives you more flexibility and reliability. 
- -- [machine_learning_example.py] **todo** %ATTACHURL%/machine_learning_example.py: - machine_learning_example.py -- [example_TensofFlow_MNIST.zip] **todo** %ATTACHURL%/example_TensofFlow_MNIST.zip: - example_TensofFlow_MNIST.zip -- [example_Pytorch_MNIST.zip] **todo** %ATTACHURL%/example_Pytorch_MNIST.zip: - example_Pytorch_MNIST.zip -- [example_Pytorch_image_recognition.zip] **todo** %ATTACHURL%/example_Pytorch_image_recognition.zip: - example_Pytorch_image_recognition.zip -- [example_TensorFlow_Automobileset.zip] **todo** %ATTACHURL%/example_TensorFlow_Automobileset.zip: - example_TensorFlow_Automobileset.zip -- [HPC-Introduction.pdf] **todo** %ATTACHURL%/HPC-Introduction.pdf: - HPC-Introduction.pdf -- [HPC-DA-Introduction.pdf] **todo** %ATTACHURL%/HPC-DA-Introduction.pdf : - HPC-DA-Introduction.pdf diff --git a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md index 9847cc9dbfec4137eada70dbc23285c7825effc7..070176efcb2ab0f463da30675841ade0e0a585a3 100644 --- a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md +++ b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md @@ -2,8 +2,9 @@ ## Directive Based GPU Programming -Directives are special compiler commands in your C/C++ or Fortran source code. The tell the compiler -how to parallelize and offload work to a GPU. This section explains how to use this technique. +Directives are special compiler commands in your C/C++ or Fortran source code. They tell the +compiler how to parallelize and offload work to a GPU. This section explains how to use this +technique. ### OpenACC @@ -19,10 +20,11 @@ newer for full support for the NVIDIA Tesla K20x GPUs at ZIH. #### Using OpenACC with PGI compilers -* For compilaton please add the compiler flag `-acc`, to enable OpenACC interpreting by the compiler; -* `-Minfo` will tell you what the compiler is actually doing to your code; +* For compilation, please add the compiler flag `-acc` to enable OpenACC interpreting by the + compiler; +* `-Minfo` tells you what the compiler is actually doing to your code; * If you only want to use the created binary at ZIH resources, please also add `-ta=nvidia:keple`; -* OpenACC Turorial: intro1.pdf, intro2.pdf. +* OpenACC Tutorial: intro1.pdf, intro2.pdf. ### HMPP @@ -38,4 +40,4 @@ use the following slides as an introduction: * Introduction to CUDA; * Advanced Tuning for NVIDIA Kepler GPUs. -In order to compiler an application with CUDA use the `nvcc` compiler command. +In order to compile an application with CUDA use the `nvcc` compiler command. diff --git a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md new file mode 100644 index 0000000000000000000000000000000000000000..8f61fe49fd56642aaded82cf711ca92d0035b99f --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md @@ -0,0 +1,365 @@ +# Hyperparameter Optimization (OmniOpt) + +Classical simulation methods as well as machine learning methods (e.g. neural networks) have a large +number of hyperparameters that significantly determine the accuracy, efficiency, and transferability +of the method. In classical simulations, the hyperparameters are usually determined by adaptation to +measured values. Esp. in neural networks, the hyperparameters determine the network architecture: +number and type of layers, number of neurons, activation functions, measures against overfitting +etc. 
The most common methods to determine hyperparameters are intuitive testing, grid search or +random search. + +The tool OmniOpt performs hyperparameter optimization within a broad range of applications as +classical simulations or machine learning algorithms. OmniOpt is robust and it checks and installs +all dependencies automatically and fixes many problems in the background. While OmniOpt optimizes, +no further intervention is required. You can follow the ongoing output live in the console. +Overhead of OmniOpt is minimal and virtually imperceptible. + +## Quick start with OmniOpt + +The following instructions demonstrate the basic usage of OmniOpt on the ZIH system, based on the +hyperparameter optimization for a neural network. + +The typical OmniOpt workflow comprises at least the following steps: + +1. [Prepare application script and software environment](#prepare-application-script-and-software-environment) +1. [Configure and run OmniOpt](#configure-and-run-omniopt) +1. [Check and evaluate OmniOpt results](#check-and-evaluate-omniopt-results) + +### Prepare Application Script and Software Environment + +The following example application script was created from +[https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) +as a starting point. +Therein, a neural network is trained on the MNIST Fashion data set. + +There are the following script preparation steps for OmniOpt: + +1. Changing hard-coded hyperparameters (chosen here: batch size, epochs, size of layer 1 and 2) into + command line parameters. Esp. for this example, the Python module `argparse` (see the docs at + [https://docs.python.org/3/library/argparse.html](https://docs.python.org/3/library/argparse.html) + is used. + + ??? note "Parsing arguments in Python" + There are many ways for parsing arguments into Python scripts. The easiest approach is + the `sys` module (see + [www.geeksforgeeks.org/how-to-use-sys-argv-in-python](https://www.geeksforgeeks.org/how-to-use-sys-argv-in-python)), + which would be fully sufficient for usage with OmniOpt. Nevertheless, this basic approach + has no consistency checks or error handling etc. + +1. Mark the output of the optimization target (chosen here: average loss) by prefixing it with the + RESULT string. OmniOpt takes the **last appearing value** prefixed with the RESULT string. In + the example, different epochs are performed and the average from the last epoch is caught by + OmniOpt. Additionally, the `RESULT` output has to be a **single line**. After all these changes, + the final script is as follows (with the lines containing relevant changes highlighted). + + ??? example "Final modified Python script: MNIST Fashion " + + ```python linenums="1" hl_lines="18-33 52-53 66-68 72 74 76 85 125-126" + #!/usr/bin/env python + # coding: utf-8 + + # # Example for using OmniOpt + # + # source code taken from: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html + # parameters under consideration:# + # 1. batch size + # 2. epochs + # 3. size output layer 1 + # 4. 
size output layer 2 + + import torch + from torch import nn + from torch.utils.data import DataLoader + from torchvision import datasets + from torchvision.transforms import ToTensor, Lambda, Compose + import argparse + + # parsing hpyerparameters as arguments + parser = argparse.ArgumentParser(description="Demo application for OmniOpt for hyperparameter optimization, example: neural network on MNIST fashion data.") + + parser.add_argument("--out-layer1", type=int, help="the number of outputs of layer 1", default = 512) + parser.add_argument("--out-layer2", type=int, help="the number of outputs of layer 2", default = 512) + parser.add_argument("--batchsize", type=int, help="batchsize for training", default = 64) + parser.add_argument("--epochs", type=int, help="number of epochs", default = 5) + + args = parser.parse_args() + + batch_size = args.batchsize + epochs = args.epochs + num_nodes_out1 = args.out_layer1 + num_nodes_out2 = args.out_layer2 + + # Download training data from open data sets. + training_data = datasets.FashionMNIST( + root="data", + train=True, + download=True, + transform=ToTensor(), + ) + + # Download test data from open data sets. + test_data = datasets.FashionMNIST( + root="data", + train=False, + download=True, + transform=ToTensor(), + ) + + # Create data loaders. + train_dataloader = DataLoader(training_data, batch_size=batch_size) + test_dataloader = DataLoader(test_data, batch_size=batch_size) + + for X, y in test_dataloader: + print("Shape of X [N, C, H, W]: ", X.shape) + print("Shape of y: ", y.shape, y.dtype) + break + + # Get cpu or gpu device for training. + device = "cuda" if torch.cuda.is_available() else "cpu" + print("Using {} device".format(device)) + + # Define model + class NeuralNetwork(nn.Module): + def __init__(self, out1, out2): + self.o1 = out1 + self.o2 = out2 + super(NeuralNetwork, self).__init__() + self.flatten = nn.Flatten() + self.linear_relu_stack = nn.Sequential( + nn.Linear(28*28, out1), + nn.ReLU(), + nn.Linear(out1, out2), + nn.ReLU(), + nn.Linear(out2, 10), + nn.ReLU() + ) + + def forward(self, x): + x = self.flatten(x) + logits = self.linear_relu_stack(x) + return logits + + model = NeuralNetwork(out1=num_nodes_out1, out2=num_nodes_out2).to(device) + print(model) + + loss_fn = nn.CrossEntropyLoss() + optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) + + def train(dataloader, model, loss_fn, optimizer): + size = len(dataloader.dataset) + for batch, (X, y) in enumerate(dataloader): + X, y = X.to(device), y.to(device) + + # Compute prediction error + pred = model(X) + loss = loss_fn(pred, y) + + # Backpropagation + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if batch % 200 == 0: + loss, current = loss.item(), batch * len(X) + print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]") + + def test(dataloader, model, loss_fn): + size = len(dataloader.dataset) + num_batches = len(dataloader) + model.eval() + test_loss, correct = 0, 0 + with torch.no_grad(): + for X, y in dataloader: + X, y = X.to(device), y.to(device) + pred = model(X) + test_loss += loss_fn(pred, y).item() + correct += (pred.argmax(1) == y).type(torch.float).sum().item() + test_loss /= num_batches + correct /= size + print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n") + + + #print statement esp. for OmniOpt (single line!!) 
+ print(f"RESULT: {test_loss:>8f} \n") + + for t in range(epochs): + print(f"Epoch {t+1}\n-------------------------------") + train(train_dataloader, model, loss_fn, optimizer) + test(test_dataloader, model, loss_fn) + print("Done!") + ``` + +1. Testing script functionality and determine software requirements for the chosen + [partition](../jobs_and_resources/partitions_and_limits.md). In the following, the alpha + partition is used. Please note the parameters `--out-layer1`, `--batchsize`, `--epochs` when + calling the Python script. Additionally, note the `RESULT` string with the output for OmniOpt. + + ??? hint "Hint for installing Python modules" + + Note that for this example the module `torchvision` is not available on the partition `alpha` + and it is installed by creating a [virtual environment](python_virtual_environments.md). It is + recommended to install such a virtual environment into a + [workspace](../data_lifecycle/workspaces.md). + + ```console + marie@login$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0 + marie@login$ mkdir </path/to/workspace/python-environments> #create folder + marie@login$ virtualenv --system-site-packages </path/to/workspace/python-environments/torchvision_env> + marie@login$ source </path/to/workspace/python-environments/torchvision_env>/bin/activate #activate virtual environment + marie@login$ pip install torchvision #install torchvision module + ``` + + ```console + # Job submission on alpha nodes with 1 GPU on 1 node with 800 MB per CPU + marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash + marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0 + # Activate virtual environment + marie@alpha$ source </path/to/workspace/python-environments/torchvision_env>/bin/activate + The following have been reloaded with a version change: + 1) modenv/scs5 => modenv/hiera + + Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies loaded. + marie@alpha$ python </path/to/your/script/mnistFashion.py> --out-layer1=200 --batchsize=10 --epochs=3 + [...] + Epoch 3 + ------------------------------- + loss: 1.422406 [ 0/60000] + loss: 0.852647 [10000/60000] + loss: 1.139685 [20000/60000] + loss: 0.572221 [30000/60000] + loss: 1.516888 [40000/60000] + loss: 0.445737 [50000/60000] + Test Error: + Accuracy: 69.5%, Avg loss: 0.878329 + + RESULT: 0.878329 + + Done! + ``` + +Using the modified script within OmniOpt requires configuring and loading of the software +environment. The recommended way is to wrap the necessary calls in a shell script. + +??? example "Example for wrapping with shell script" + + ```bash + #!/bin/bash -l + # ^ Shebang-Line, so that it is known that this is a bash file + # -l means 'load this as login shell', so that /etc/profile gets loaded and you can use 'module load' or 'ml' as usual + + # If you don't use this script via `./run.sh' or just `srun run.sh', but like `srun bash run.sh', please add the '-l' there too. + # Like this: + # srun bash -l run.sh + + # Load modules your program needs, always specify versions! + module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.7.1 + source </path/to/workspace/python-environments/torchvision_env>/bin/activate #activate virtual environment + + # Load your script. $@ is all the parameters that are given to this shell file. 
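+    # For illustration: with the GUI configuration shown in the next section, OmniOpt would invoke
+    # this wrapper roughly as
+    #   bash run-mnist-fashion.sh --out-layer1=($x_0) --batchsize=($x_1) --epochs=($x_2)
+    # so "$@" simply forwards the chosen hyperparameter values to the Python script.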
+    python </path/to/your/script/mnistFashion.py> $@
+    ```
+
+When the wrapped shell script is running properly, the preparations are finished and the next step
+is configuring OmniOpt.
+
+### Configure and Run OmniOpt
+
+Configuring OmniOpt is done via the GUI at
+[https://imageseg.scads.ai/omnioptgui/](https://imageseg.scads.ai/omnioptgui/).
+This GUI guides through the configuration process and as a result, a configuration file is created
+automatically according to the GUI input. If you are more familiar with using OmniOpt later on,
+this configuration file can be modified directly without using the GUI.
+
+A screenshot of
+[the GUI](https://imageseg.scads.ai/omnioptgui/?maxevalserror=5&mem_per_worker=1000&number_of_parameters=3&param_0_values=10%2C50%2C100&param_1_values=8%2C16%2C32&param_2_values=10%2C15%2C30&param_0_name=out-layer1&param_1_name=batchsize&param_2_name=batchsize&account=&projectname=mnist_fashion_optimization_set_1&partition=alpha&searchtype=tpe.suggest&param_0_type=hp.choice&param_1_type=hp.choice&param_2_type=hp.choice&max_evals=1000&objective_program=bash%20%3C%2Fpath%2Fto%2Fwrapper-script%2Frun-mnist-fashion.sh%3E%20--out-layer1%3D%28%24x_0%29%20--batchsize%3D%28%24x_1%29%20--epochs%3D%28%24x_2%29&workdir=%3C%2Fscratch%2Fws%2Fomniopt-workdir%2F%3E),
+including a proper configuration for the MNIST fashion example, is shown below.
+
+Please modify the paths for `objective program` and `workdir` according to your needs.
+
+
+{: align="center"}
+
+Using OmniOpt for a first trial example, it is often sufficient to concentrate on the following
+configuration parameters:
+
+1. **Optimization run name:** A name for an OmniOpt run and its corresponding configuration.
+1. **Partition:** Choose the partition on the ZIH system that fits the program's needs.
+1. **Enable GPU:** Decide whether a program could benefit from GPU usage or not.
+1. **Workdir:** The directory where OmniOpt saves its necessary files and all results. Derived
+   from the optimization run name, each configuration creates a single directory.
+   Make sure that this working directory is writable from the compute nodes. It is recommended to
+   use a [workspace](../data_lifecycle/workspaces.md).
+1. **Objective program:** Provide all information for program execution. Typically, this will
+   contain the command for executing a wrapper script.
+1. **Parameters:** The hyperparameters to be optimized with the names OmniOpt should use. For the
+   example here, the variable names are identical to the input parameters of the Python script.
+   However, these names can be chosen differently, since the connection to OmniOpt is realized via
+   the variables (`$x_0`), (`$x_1`), etc. from the GUI section "Objective program". Please note that
+   it is not necessary to name the parameters explicitly in your script but only within the OmniOpt
+   configuration.
+
+After all parameters are entered into the GUI, the call for OmniOpt is generated automatically and
+displayed on the right. This command contains all necessary instructions (including requesting
+resources with Slurm). **Thus, this command can be executed directly on a login node on the ZIH
+system.**
+
+
+{: align="center"}
+
+After executing this command, OmniOpt does all the magic in the background and no further actions
+are necessary.
+
+??? hint "Hints on the working directory"
+
+    1. Starting OmniOpt without providing a working directory will store OmniOpt's files in the present directory.
+    1. Within the given working directory, a new folder named "omniopt" is created by default.
+    1. Within one OmniOpt working directory, there can be multiple optimization projects.
+    1. It is possible to have as many working directories as you want (with multiple optimization runs).
+    1. It is recommended to use a [workspace](../data_lifecycle/workspaces.md) as working directory, but not the home directory.
+
+### Check and Evaluate OmniOpt Results
+
+To get informed about the current status of OmniOpt or to look into results, use the evaluation
+tool of OmniOpt. Switch to the OmniOpt folder and run `evaluate-run.sh`.
+
+```console
+marie@login$ bash </scratch/ws/omniopt-workdir/>evaluate-run.sh
+```
+
+After initializing and checking for updates in the background, OmniOpt asks you to select the
+optimization run of interest. After selecting the optimization run, there will be a menu with the
+items as shown below. If OmniOpt still has running jobs, additional menu items appear that refer to
+these running jobs (image shown below to the right).
+
+evaluation options (all jobs finished) | evaluation options (still running jobs)
+:--------------------------------------------------------------:|:-------------------------:
+ | 
+
+For now, we assume that OmniOpt has finished already.
+In order to look into the results, there are the following basic approaches.
+
+1. **Graphical approach:**
+   There are basically two graphical approaches: two-dimensional scatter plots and parallel plots.
+
+   Below, a parallel plot from the MNIST fashion example is shown.
+   {: align="center"}
+
+   ??? hint "Hints on parallel plots"
+
+       Parallel plots are suitable especially for dealing with multiple dimensions. The parallel
+       plot created by OmniOpt is an interactive `html` file that is stored in the OmniOpt working
+       directory under `projects/<name_of_optimization_run>/parallel-plot`. The interactivity
+       of this plot is intended to make optimal combinations of the hyperparameters visible more
+       easily. Get more information about this interactivity by clicking the "Help" button at the
+       top of the graphic (see red arrow on the image above).
+
+       After creating a 2D scatter plot or a parallel plot, OmniOpt will try to display the
+       corresponding file (`html`, `png`) directly on the ZIH system. Therefore, it is necessary to
+       log in via SSH with the option `-X` (X11 forwarding), e.g., `ssh -X taurus.hrsk.tu-dresden.de`.
+       Nevertheless, because of latency using X11 forwarding, it is recommended to download the created
+       files and explore them on the local machine (esp. for the parallel plot). The created files are
+       saved at `projects/<name_of_optimization_run>/{2d-scatterplots,parallel-plot}`.
+
+1. **Getting the raw data:**
+   As a second approach, the raw data of the optimization process can be exported as a CSV file.
+   The created output files are stored in the folder `projects/<name_of_optimization_run>/csv`.
diff --git a/doc.zih.tu-dresden.de/docs/software/keras.md b/doc.zih.tu-dresden.de/docs/software/keras.md
deleted file mode 100644
index 356e5b17e0ed1a3224ef815629e456391192b5ba..0000000000000000000000000000000000000000
--- a/doc.zih.tu-dresden.de/docs/software/keras.md
+++ /dev/null
@@ -1,237 +0,0 @@
-# Keras
-
-This is an introduction on how to run a
-Keras machine learning application on the new machine learning partition
-of Taurus.
-
-Keras is a high-level neural network API,
-written in Python and capable of running on top of
-[TensorFlow](https://github.com/tensorflow/tensorflow).
-In this page, [Keras](https://www.tensorflow.org/guide/keras) will be -considered as a TensorFlow's high-level API for building and training -deep learning models. Keras includes support for TensorFlow-specific -functionality, such as [eager execution](https://www.tensorflow.org/guide/keras#eager_execution) -, [tf.data](https://www.tensorflow.org/api_docs/python/tf/data) pipelines -and [estimators](https://www.tensorflow.org/guide/estimator). - -On the machine learning nodes (machine learning partition), you can use -the tools from [IBM Power AI](./power_ai.md). PowerAI is an enterprise -software distribution that combines popular open-source deep learning -frameworks, efficient AI development tools (Tensorflow, Caffe, etc). - -In machine learning partition (modenv/ml) Keras is available as part of -the Tensorflow library at Taurus and also as a separate module named -"Keras". For using Keras in machine learning partition you have two -options: - -- use Keras as part of the TensorFlow module; -- use Keras separately and use Tensorflow as an interface between - Keras and GPUs. - -**Prerequisites**: To work with Keras you, first of all, need -[access](../access/ssh_login.md) for the Taurus system, loaded -Tensorflow module on ml partition, activated Python virtual environment. -Basic knowledge about Python, SLURM system also required. - -**Aim** of this page is to introduce users on how to start working with -Keras and TensorFlow on the [HPC-DA](../jobs_and_resources/hpcda.md) -system - part of the TU Dresden HPC system. - -There are three main options on how to work with Keras and Tensorflow on -the HPC-DA: 1. Modules; 2. JupyterNotebook; 3. Containers. One of the -main ways is using the **TODO LINK MISSING** (Modules -system)(RuntimeEnvironment#Module_Environments) and Python virtual -environment. Please see the -[Python page](./python.md) for the HPC-DA -system. - -The information about the Jupyter notebook and the **JupyterHub** could -be found [here](../access/jupyterhub.md). The use of -Containers is described [here](tensorflow_container_on_hpcda.md). - -Keras contains numerous implementations of commonly used neural-network -building blocks such as layers, -[objectives](https://en.wikipedia.org/wiki/Objective_function), -[activation functions](https://en.wikipedia.org/wiki/Activation_function) -[optimizers](https://en.wikipedia.org/wiki/Mathematical_optimization), -and a host of tools -to make working with image and text data easier. Keras, for example, has -a library for preprocessing the image data. - -The core data structure of Keras is a -**model**, a way to organize layers. The Keras functional API is the way -to go for defining as simple (sequential) as complex models, such as -multi-output models, directed acyclic graphs, or models with shared -layers. - -## Getting started with Keras - -This example shows how to install and start working with TensorFlow and -Keras (using the module system). To get started, import [tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras) -as part of your TensorFlow program setup. -tf.keras is TensorFlow's implementation of the [Keras API -specification](https://keras.io/). This is a modified example that we -used for the [Tensorflow page](./tensorflow.md). 
- -```bash -srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=8000 bash - -module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - -mkdir python-virtual-environments -cd python-virtual-environments -module load TensorFlow #example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. -which python -python3 -m venv --system-site-packages env #create virtual environment "env" which inheriting with global site packages -source env/bin/activate #example output: (env) bash-4.2$ -module load TensorFlow -python -import tensorflow as tf -from tensorflow.keras import layers - -print(tf.VERSION) #example output: 1.10.0 -print(tf.keras.__version__) #example output: 2.1.6-tf -``` - -As was said the core data structure of Keras is a **model**, a way to -organize layers. In Keras, you assemble *layers* to build *models*. A -model is (usually) a graph of layers. For our example we use the most -common type of model is a stack of layers. The below [example](https://www.tensorflow.org/guide/keras#model_subclassing) -of using the advanced model with model -subclassing and custom layers illustrate using TF-Keras API. - -```python -import tensorflow as tf -from tensorflow.keras import layers -import numpy as np - -# Numpy arrays to train and evaluate a model -data = np.random.random((50000, 32)) -labels = np.random.random((50000, 10)) - -# Create a custom layer by subclassing -class MyLayer(layers.Layer): - - def __init__(self, output_dim, **kwargs): - self.output_dim = output_dim - super(MyLayer, self).__init__(**kwargs) - -# Create the weights of the layer - def build(self, input_shape): - shape = tf.TensorShape((input_shape[1], self.output_dim)) -# Create a trainable weight variable for this layer - self.kernel = self.add_weight(name='kernel', - shape=shape, - initializer='uniform', - trainable=True) - super(MyLayer, self).build(input_shape) -# Define the forward pass - def call(self, inputs): - return tf.matmul(inputs, self.kernel) - -# Specify how to compute the output shape of the layer given the input shape. - def compute_output_shape(self, input_shape): - shape = tf.TensorShape(input_shape).as_list() - shape[-1] = self.output_dim - return tf.TensorShape(shape) - -# Serializing the layer - def get_config(self): - base_config = super(MyLayer, self).get_config() - base_config['output_dim'] = self.output_dim - return base_config - - @classmethod - def from_config(cls, config): - return cls(**config) -# Create a model using your custom layer -model = tf.keras.Sequential([ - MyLayer(10), - layers.Activation('softmax')]) - -# The compile step specifies the training configuration -model.compile(optimizer=tf.compat.v1.train.RMSPropOptimizer(0.001), - loss='categorical_crossentropy', - metrics=['accuracy']) - -# Trains for 10 epochs(steps). -model.fit(data, labels, batch_size=32, epochs=10) -``` - -## Running the sbatch script on ML modules (modenv/ml) - -Generally, for machine learning purposes ml partition is used but for -some special issues, SCS5 partition can be useful. The following sbatch -script will automatically execute the above Python script on ml -partition. If you have a question about the sbatch script see the -article about [SLURM](./../jobs_and_resources/binding_and_distribution_of_tasks.md). -Keep in mind that you need to put the executable file (Keras_example) with -python code to the same folder as bash script or specify the path. 
- -```bash -#!/bin/bash -#SBATCH --mem=4GB # specify the needed memory -#SBATCH -p ml # specify ml partition -#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task) -#SBATCH --nodes=1 # request 1 node -#SBATCH --time=00:05:00 # runs for 5 minutes -#SBATCH -c 16 # how many cores per task allocated -#SBATCH -o HLR_Keras_example.out # save output message under HLR_${SLURMJOBID}.out -#SBATCH -e HLR_Keras_example.err # save error messages under HLR_${SLURMJOBID}.err - -module load modenv/ml -module load TensorFlow - -python Keras_example.py - -## when finished writing, submit with: sbatch <script_name> -``` - -Output results and errors file you can see in the same folder in the -corresponding files after the end of the job. Part of the example -output: - -``` -...... -Epoch 9/10 -50000/50000 [==============================] - 2s 37us/sample - loss: 11.5159 - acc: 0.1000 -Epoch 10/10 -50000/50000 [==============================] - 2s 37us/sample - loss: 11.5159 - acc: 0.1020 -``` - -## Tensorflow 2 - -[TensorFlow 2.0](https://blog.tensorflow.org/2019/09/tensorflow-20-is-now-available.html) -is a significant milestone for the -TensorFlow and the community. There are multiple important changes for -users. - -Tere are a number of TensorFlow 2 modules for both ml and scs5 -partitions in Taurus (2.0 (anaconda), 2.0 (python), 2.1 (python)) -(11.04.20). Please check **TODO MISSING DOC**(the software modules list)(./SoftwareModulesList.md -for the information about available -modules. - -<span style="color:red">**NOTE**</span>: Tensorflow 2 of the -current version is loading by default as a Tensorflow module. - -TensorFlow 2.0 includes many API changes, such as reordering arguments, -renaming symbols, and changing default values for parameters. Thus in -some cases, it makes code written for the TensorFlow 1 not compatible -with TensorFlow 2. However, If you are using the high-level APIs -**(tf.keras)** there may be little or no action you need to take to make -your code fully TensorFlow 2.0 [compatible](https://www.tensorflow.org/guide/migrate). -It is still possible to run 1.X code, -unmodified ([except for contrib](https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md) -), in TensorFlow 2.0: - -```python -import tensorflow.compat.v1 as tf -tf.disable_v2_behavior() #instead of "import tensorflow as tf" -``` - -To make the transition to TF 2.0 as seamless as possible, the TensorFlow -team has created the [tf_upgrade_v2](https://www.tensorflow.org/guide/upgrade) -utility to help transition legacy code to the new API. - -## F.A.Q: diff --git a/doc.zih.tu-dresden.de/docs/software/licenses.md b/doc.zih.tu-dresden.de/docs/software/licenses.md index af7a4e376f22a0711df8eaff944bd7830367cacd..3173cf98a1b9987c87a74e5175fc7746236613d9 100644 --- a/doc.zih.tu-dresden.de/docs/software/licenses.md +++ b/doc.zih.tu-dresden.de/docs/software/licenses.md @@ -1,6 +1,6 @@ # Use of External Licenses -It is possible (please [contact the support team](../support.md) first) for users to install +It is possible (please [contact the support team](../support/support.md) first) for users to install their own software and use their own license servers, e.g. FlexLM. 
The outbound IP addresses from ZIH systems are:
diff --git a/doc.zih.tu-dresden.de/docs/software/lo2s.md b/doc.zih.tu-dresden.de/docs/software/lo2s.md
new file mode 100644
index 0000000000000000000000000000000000000000..cf34feccfca15e1e37d5278f30117aaba827e800
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/software/lo2s.md
@@ -0,0 +1,141 @@
+# lo2s - Lightweight Node-Level Performance Monitoring
+
+`lo2s` creates parallel OTF2 traces with a focus on both application and system view.
+The traces can contain any of the following information:
+
+* From running threads
+    * Calling context samples based on instruction overflows
+    * The calling context samples are annotated with the disassembled assembler instruction string
+    * The frame pointer-based call-path for each calling context sample
+    * Per-thread performance counter readings
+    * Which thread was scheduled on which CPU at what time
+* From the system
+    * Metrics from tracepoints (e.g., the selected C-state or P-state)
+    * The node-level system tree (CPUs (HW-threads), cores, packages)
+    * CPU power measurements (x86_energy)
+    * Microarchitecture specific metrics (x86_adapt, per package or core)
+    * Arbitrary metrics through plugins (Score-P compatible)
+
+In general, `lo2s` operates either in **process monitoring** or **system monitoring** mode.
+
+With **process monitoring**, all information is grouped by each thread of a monitored process
+group - it shows you *on which CPU each monitored thread is running*. `lo2s` either acts as a
+prefix command to run the process (and also tracks its children) or `lo2s` attaches to a running
+process.
+
+In the **system monitoring** mode, information is grouped by logical CPU - it shows you
+*which thread was running on a given CPU*. Metrics are also shown per CPU.
+
+In both modes, `lo2s` always groups system-level metrics (e.g., tracepoints) by their respective
+system hardware component.
+
+## Usage
+
+Only the basic usage is shown in this Wiki. For a more detailed explanation, refer to the
+[lo2s website](https://github.com/tud-zih-energy/lo2s).
+
+Before using `lo2s`, set up the correct environment with
+
+```console
+marie@login$ module load lo2s
+```
+
+As `lo2s` is built upon [perf](perf_tools.md), its usage and limitations are very similar to those
+of `perf`. In particular, you can use `lo2s` as a prefix command just like `perf`. Even some of the
+command line arguments are inspired by `perf`. The main difference to `perf` is that `lo2s` will
+output a [Vampir trace](vampir.md), which allows a full-blown performance analysis almost like
+[Score-P](scorep.md).
+
+To record the behavior of an application, prefix the application run with `lo2s`. We recommend
+using the double dash `--` to prevent mixing command line arguments between `lo2s` and the user
+application. In the following example, we run `lo2s` on the application `sleep 2`.
+
+```console
+marie@compute$ lo2s --no-kernel -- sleep 2
+[ lo2s: sleep 2 (0), 1 threads, 0.014082s CPU, 2.03315s total ]
+[ lo2s: 5 wakeups, wrote 2.48 KiB lo2s_trace_2021-10-12T12-39-06 ]
+```
+
+This will record the application in the `process monitoring mode`. This means that the application's
+process, its forked processes, and threads are recorded and can be analyzed using Vampir.
+The main view will represent each process and thread over time. There will be a metric "CPU"
+indicating for each process on which CPU it was executed during the runtime.
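+
+As a minimal sketch for inspecting such a trace (assuming a `Vampir` module is available as
+described on the [Vampir page](vampir.md), and that the OTF2 anchor file inside the trace
+directory is named `traces.otf2`), you could open the trace from the example above with:
+
+```console
+marie@login$ module load Vampir
+marie@login$ vampir lo2s_trace_2021-10-12T12-39-06/traces.otf2
+```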
+
+## Required Permissions
+
+By design, `lo2s` almost exclusively utilizes Linux Kernel facilities such as perf and tracepoints
+to perform the application measurements. For security reasons, these facilities require special
+permissions, in particular `perf_event_paranoid` and read permissions to the `debugfs` under
+`/sys/kernel/debug`.
+
+Luckily, for the `process monitoring mode` the default settings allow you to run `lo2s` just fine.
+All you need to do is pass the `--no-kernel` parameter like in the example above.
+
+For the `system monitoring mode` you can get the required permission with the Slurm parameter
+`--exclusive`. (Note: Regardless of the actual requested processes per node, you will accrue
+cpu-hours as if you had reserved all cores on the node.)
+
+## Memory Requirements
+
+When requesting memory for your jobs, you need to take into account that `lo2s` needs a substantial
+amount of memory for its operation. Unfortunately, the amount of memory depends on the application.
+The amount mainly scales with the number of processes spawned by the traced application. For each
+process, there is a fixed-size buffer. This should be fine for a typical HPC application, but
+can lead to extreme cases where the buffers are orders of magnitude larger than the resulting trace.
+For instance, recording a CMake run, which spawns hundreds of processes, each running only for
+a few milliseconds, leaves each buffer almost empty. Still, the buffers need to be allocated
+and thus require a lot of memory.
+
+Given such a case, we recommend using the `system monitoring mode` instead, as the memory in this
+mode scales with the number of logical CPUs instead of the number of processes.
+
+## Advanced Topic: System Monitoring
+
+The `system monitoring mode` gives a different view. As the name implies, the focus isn't on processes
+anymore, but on the system as a whole. In particular, a trace recorded in this mode will show a timeline
+for each logical CPU of the system. To enable this mode, you need to pass the `-a` parameter.
+
+```console
+marie@compute$ lo2s -a
+^C[ lo2s (system mode): monitored processes: 0, 0.136623s CPU, 13.7872s total ]
+[ lo2s (system mode): 36 wakeups, wrote 301.39 KiB lo2s_trace_2021-11-01T09-44-31 ]
+```
+
+Note: As you can read in the above example, `lo2s` monitored zero processes even though it was run
+in the `system monitoring mode`. Certainly, there are more than zero processes running on a system.
+However, as the user accounts on our HPC systems are limited to seeing only their own processes and
+`lo2s` records in the scope of the user, it will only see the user's own processes. Hence, in the
+example above, no other processes are visible to `lo2s`.
+
+When using the `system monitoring mode` without passing a program, `lo2s` will run indefinitely.
+You can stop the measurement by sending `lo2s` a `SIGINT` signal or by hitting `ctrl+C`. However, if
+you pass a program, `lo2s` will start that program and run the measurement until the started process
+finishes. Of course, the process and any of its child processes and threads will be visible in the
+resulting trace.
+
+```console
+marie@compute$ lo2s -a -- sleep 10
+[ lo2s (system mode): sleep 10 (0), 1 threads, monitored processes: 1, 0.133598s CPU, 10.3996s total ]
+[ lo2s (system mode): 39 wakeups, wrote 280.39 KiB lo2s_trace_2021-11-01T09-55-04 ]
+```
+
+Like in the `process monitoring mode`, `lo2s` can also sample instructions in the system monitoring mode.
+You can enable the instruction sampling by passing the parameter `--instruction-sampling` to `lo2s`. + +```console +marie@compute$ lo2s -a --instruction-sampling -- make -j +[ lo2s (system mode): make -j (0), 268 threads, monitored processes: 286, 258.789s CPU, 445.076s total ] +[ lo2s (system mode): 3815 wakeups, wrote 39.24 MiB lo2s_trace_2021-10-29T15-08-44 ] +``` + +## Advanced Topic: Metric Plugins + +`Lo2s` is compatible with [Score-P](scorep.md) metric plugins, but only a subset will work. +In particular, `lo2s` only supports asynchronous plugins with the per host or once scope. +You can find a large set of plugins in the [Score-P Organization on GitHub](https://github.com/score-p). + +To activate plugins, you can use the same environment variables as with Score-P, or with `LO2S` as +prefix: + + - LO2S_METRIC_PLUGINS + - LO2S_METRIC_PLUGIN + - LO2S_METRIC_PLUGIN_PLUGIN diff --git a/doc.zih.tu-dresden.de/docs/software/machine_learning.md b/doc.zih.tu-dresden.de/docs/software/machine_learning.md index e80e6c346dfbeff977fdf74fc251507cc171bbcb..f2e5f24aa9f4f8e5f8fb516310b842584d30a614 100644 --- a/doc.zih.tu-dresden.de/docs/software/machine_learning.md +++ b/doc.zih.tu-dresden.de/docs/software/machine_learning.md @@ -1,59 +1,169 @@ # Machine Learning -On the machine learning nodes, you can use the tools from [IBM Power -AI](power_ai.md). +This is an introduction of how to run machine learning applications on ZIH systems. +For machine learning purposes, we recommend to use the partitions `alpha` and/or `ml`. -## Interactive Session Examples +## Partition `ml` -### Tensorflow-Test +The compute nodes of the partition ML are built on the base of +[Power9 architecture](https://www.ibm.com/it-infrastructure/power/power9) from IBM. The system was created +for AI challenges, analytics and working with data-intensive workloads and accelerated databases. - tauruslogin6 :~> srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=10000 bash - srun: job 4374195 queued and waiting for resources - srun: job 4374195 has been allocated resources - taurusml22 :~> ANACONDA2_INSTALL_PATH='/opt/anaconda2' - taurusml22 :~> ANACONDA3_INSTALL_PATH='/opt/anaconda3' - taurusml22 :~> export PATH=$ANACONDA3_INSTALL_PATH/bin:$PATH - taurusml22 :~> source /opt/DL/tensorflow/bin/tensorflow-activate - taurusml22 :~> tensorflow-test - Basic test of tensorflow - A Hello World!!!... +The main feature of the nodes is the ability to work with the +[NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link** +support that allows a total bandwidth with up to 300 GB/s. Each node on the +partition ML has 6x Tesla V-100 GPUs. You can find a detailed specification of the partition in our +[Power9 documentation](../jobs_and_resources/power9.md). - #or: - taurusml22 :~> module load TensorFlow/1.10.0-PythonAnaconda-3.6 +!!! note -Or to use the whole node: `--gres=gpu:6 --exclusive --pty` + The partition ML is based on the Power9 architecture, which means that the software built + for x86_64 will not work on this partition. Also, users need to use the modules which are + specially build for this architecture (from `modenv/ml`). -### In Singularity container: +### Modules - rotscher@tauruslogin6:~> srun -p ml --gres=gpu:6 --pty bash - [rotscher@taurusml22 ~]$ singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> export PATH=/opt/anaconda3/bin:$PATH - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> . 
/opt/DL/tensorflow/bin/tensorflow-activate - Singularity powerai-1.5.3-all-ubuntu16.04-py3.img:~> tensorflow-test +On the partition ML load the module environment: -## Additional libraries +```console +marie@ml$ module load modenv/ml +The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml +``` + +### Power AI + +There are tools provided by IBM, that work on partition ML and are related to AI tasks. +For more information see our [Power AI documentation](power_ai.md). + +## Partition: Alpha + +Another partition for machine learning tasks is Alpha. It is mainly dedicated to +[ScaDS.AI](https://scads.ai/) topics. Each node on Alpha has 2x AMD EPYC CPUs, 8x NVIDIA A100-SXM4 +GPUs, 1 TB RAM and 3.5 TB local space (`/tmp`) on an NVMe device. You can find more details of the +partition in our [Alpha Centauri](../jobs_and_resources/alpha_centauri.md) documentation. + +### Modules + +On the partition alpha load the module environment: + +```console +marie@alpha$ module load modenv/hiera +The following have been reloaded with a version change: 1) modenv/ml => modenv/hiera +``` + +!!! note + + On partition Alpha, the most recent modules are build in `hiera`. Alternative modules might be + build in `scs5`. + +## Machine Learning via Console + +### Python and Virtual Environments + +Python users should use a [virtual environment](python_virtual_environments.md) when conducting +machine learning tasks via console. + +For more details on machine learning or data science with Python see +[data analytics with Python](data_analytics_with_python.md). + +### R + +R also supports machine learning via console. It does not require a virtual environment due to a +different package management. + +For more details on machine learning or data science with R see +[data analytics with R](data_analytics_with_r.md#r-console). + +## Machine Learning with Jupyter + +The [Jupyter Notebook](https://jupyter.org/) is an open-source web application that allows you to +create documents containing live code, equations, visualizations, and narrative text. +[JupyterHub](../access/jupyterhub.md) allows to work with machine learning frameworks (e.g. +TensorFlow or PyTorch) on ZIH systems and to run your Jupyter notebooks on HPC nodes. + +After accessing JupyterHub, you can start a new session and configure it. For machine learning +purposes, select either partition **Alpha** or **ML** and the resources, your application requires. + +In your session you can use [Python](data_analytics_with_python.md#jupyter-notebooks), +[R](data_analytics_with_r.md#r-in-jupyterhub) or [RStudio](data_analytics_with_rstudio.md) for your +machine learning and data science topics. + +## Machine Learning with Containers + +Some machine learning tasks require using containers. In the HPC domain, the +[Singularity](https://singularity.hpcng.org/) container system is a widely used tool. Docker +containers can also be used by Singularity. You can find further information on working with +containers on ZIH systems in our [containers documentation](containers.md). + +There are two sources for containers for Power9 architecture with TensorFlow and PyTorch on the +board: + +* [TensorFlow-ppc64le](https://hub.docker.com/r/ibmcom/tensorflow-ppc64le): + Community-supported `ppc64le` docker container for TensorFlow. +* [PowerAI container](https://hub.docker.com/r/ibmcom/powerai/): + Official Docker container with TensorFlow, PyTorch and many other packages. + +!!! 
note + + You could find other versions of software in the container on the "tag" tab on the docker web + page of the container. + +In the following example, we build a Singularity container with TensorFlow from the DockerHub and +start it: + +```console +marie@ml$ singularity build my-ML-container.sif docker://ibmcom/tensorflow-ppc64le #create a container from the DockerHub with the last TensorFlow version +[...] +marie@ml$ singularity run --nv my-ML-container.sif #run my-ML-container.sif container supporting the Nvidia's GPU. You can also work with your container by: singularity shell, singularity exec +[...] +``` + +## Additional Libraries for Machine Learning The following NVIDIA libraries are available on all nodes: -| | | -|-------|---------------------------------------| -| NCCL | /usr/local/cuda/targets/ppc64le-linux | -| cuDNN | /usr/local/cuda/targets/ppc64le-linux | +| Name | Path | +|-------|-----------------------------------------| +| NCCL | `/usr/local/cuda/targets/ppc64le-linux` | +| cuDNN | `/usr/local/cuda/targets/ppc64le-linux` | + +!!! note -Note: For optimal NCCL performance it is recommended to set the -**NCCL_MIN_NRINGS** environment variable during execution. You can try -different values but 4 should be a pretty good starting point. + For optimal NCCL performance it is recommended to set the + **NCCL_MIN_NRINGS** environment variable during execution. You can try + different values but 4 should be a pretty good starting point. - export NCCL_MIN_NRINGS=4 +```console +marie@compute$ export NCCL_MIN_NRINGS=4 +``` -\<span style="color: #222222; font-size: 1.385em;">HPC\</span> +### HPC-Related Software The following HPC related software is installed on all nodes: -| | | -|------------------|------------------------| -| IBM Spectrum MPI | /opt/ibm/spectrum_mpi/ | -| PGI compiler | /opt/pgi/ | -| IBM XLC Compiler | /opt/ibm/xlC/ | -| IBM XLF Compiler | /opt/ibm/xlf/ | -| IBM ESSL | /opt/ibmmath/essl/ | -| IBM PESSL | /opt/ibmmath/pessl/ | +| Name | Path | +|------------------|--------------------------| +| IBM Spectrum MPI | `/opt/ibm/spectrum_mpi/` | +| PGI compiler | `/opt/pgi/` | +| IBM XLC Compiler | `/opt/ibm/xlC/` | +| IBM XLF Compiler | `/opt/ibm/xlf/` | +| IBM ESSL | `/opt/ibmmath/essl/` | +| IBM PESSL | `/opt/ibmmath/pessl/` | + +## Datasets for Machine Learning + +There are many different datasets designed for research purposes. If you would like to download some +of them, keep in mind that many machine learning libraries have direct access to public datasets +without downloading it, e.g. [TensorFlow Datasets](https://www.tensorflow.org/datasets). If you +still need to download some datasets use [datamover](../data_transfer/datamover.md) machine. + +### The ImageNet Dataset + +The ImageNet project is a large visual database designed for use in visual object recognition +software research. In order to save space in the filesystem by avoiding to have multiple duplicates +of this lying around, we have put a copy of the ImageNet database (ILSVRC2012 and ILSVR2017) under +`/scratch/imagenet` which you can use without having to download it again. For the future, the +ImageNet dataset will be available in +[Warm Archive](../data_lifecycle/workspaces.md#mid-term-storage). ILSVR2017 also includes a dataset +for recognition objects from a video. Please respect the corresponding +[Terms of Use](https://image-net.org/download.php). 
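+
+As a minimal sketch of using this copy without downloading anything, point your training workflow
+directly at `/scratch/imagenet`. The script name `train.py` and its `--data-dir` option below are
+hypothetical placeholders for your own code.
+
+```console
+marie@alpha$ ls /scratch/imagenet                          # inspect the provided copy
+marie@alpha$ python train.py --data-dir /scratch/imagenet  # hypothetical script and option
+```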
diff --git a/doc.zih.tu-dresden.de/docs/software/mathematics.md b/doc.zih.tu-dresden.de/docs/software/mathematics.md index 3ae820eda962a63a1ff59c55536865f1437d582a..5b8e23b2fd3ed373bdf7bf6394ae3b2faf98ce74 100644 --- a/doc.zih.tu-dresden.de/docs/software/mathematics.md +++ b/doc.zih.tu-dresden.de/docs/software/mathematics.md @@ -1,11 +1,9 @@ # Mathematics Applications -!!! cite +!!! cite "Galileo Galilei" Nature is written in mathematical language. - (Galileo Galilei) - <!--*Please do not run expensive interactive sessions on the login nodes. Instead, use* `srun --pty--> <!--...` *to let the batch system place it on a compute node.*--> @@ -16,16 +14,16 @@ interface capabilities within a document-like user interface paradigm. ### Fonts -To remotely use the graphical frontend, you have to add the Mathematica fonts to the local -fontmanager. +To remotely use the graphical front-end, you have to add the Mathematica fonts to the local +font manager. #### Linux Workstation You need to copy the fonts from ZIH systems to your local system and expand the font path -```bash -localhost$ scp -r taurus.hrsk.tu-dresden.de:/sw/global/applications/mathematica/10.0/SystemFiles/Fonts/Type1/ ~/.fonts -localhost$ xset fp+ ~/.fonts/Type1 +```console +marie@local$ scp -r taurus.hrsk.tu-dresden.de:/sw/global/applications/mathematica/10.0/SystemFiles/Fonts/Type1/ ~/.fonts +marie@local$ xset fp+ ~/.fonts/Type1 ``` #### Windows Workstation @@ -95,41 +93,41 @@ interfaces with the Maple symbolic engine, allowing it to be part of a full comp Running MATLAB via the batch system could look like this (for 456 MB RAM per core and 12 cores reserved). Please adapt this to your needs! -```bash -zih$ module load MATLAB -zih$ srun -t 8:00 -c 12 --mem-per-cpu=456 --pty --x11=first bash -zih$ matlab +```console +marie@login$ module load MATLAB +marie@login$ srun --time=8:00 --cpus-per-task=12 --mem-per-cpu=456 --pty --x11=first bash +marie@compute$ matlab ``` With following command you can see a list of installed software - also the different versions of matlab. -```bash -zih$ module avail +```console +marie@login$ module avail ``` Please choose one of these, then load the chosen software with the command: ```bash -zih$ module load MATLAB/version +marie@login$ module load MATLAB/<version> ``` Or use: -```bash -zih$ module load MATLAB +```console +marie@login$ module load MATLAB ``` (then you will get the most recent Matlab version. -[Refer to the modules section for details.](../software/runtime_environment.md#modules)) +[Refer to the modules section for details.](../software/modules.md#modules)) ### Interactive If X-server is running and you logged in at ZIH systems, you should allocate a CPU for your work with command -```bash -zih$ srun --pty --x11=first bash +```console +marie@login$ srun --pty --x11=first bash ``` - now you can call "matlab" (you have 8h time to work with the matlab-GUI) @@ -140,8 +138,9 @@ Using Scripts You have to start matlab-calculation as a Batch-Job via command -```bash -srun --pty matlab -nodisplay -r basename_of_your_matlab_script #NOTE: you must omit the file extension ".m" here, because -r expects a matlab command or function call, not a file-name. +```console +marie@login$ srun --pty matlab -nodisplay -r basename_of_your_matlab_script +# NOTE: you must omit the file extension ".m" here, because -r expects a matlab command or function call, not a file-name. ``` !!! 
info "License occupying" @@ -149,20 +148,20 @@ srun --pty matlab -nodisplay -r basename_of_your_matlab_script #NOTE: you must o While running your calculations as a script this way is possible, it is generally frowned upon, because you are occupying Matlab licenses for the entire duration of your calculation when doing so. Since the available licenses are limited, it is highly recommended you first compile your script via - the Matlab Compiler (mcc) before running it for a longer period of time on our systems. That way, + the Matlab Compiler (`mcc`) before running it for a longer period of time on our systems. That way, you only need to check-out a license during compile time (which is relatively short) and can run as many instances of your calculation as you'd like, since it does not need a license during runtime when compiled to a binary. You can find detailed documentation on the Matlab compiler at -[Mathworks' help pages](https://de.mathworks.com/help/compiler/). +[MathWorks' help pages](https://de.mathworks.com/help/compiler/). -### Using the MATLAB Compiler (mcc) +### Using the MATLAB Compiler Compile your `.m` script into a binary: ```bash -mcc -m name_of_your_matlab_script.m -o compiled_executable -R -nodisplay -R -nosplash +marie@login$ mcc -m name_of_your_matlab_script.m -o compiled_executable -R -nodisplay -R -nosplash ``` This will also generate a wrapper script called `run_compiled_executable.sh` which sets the required @@ -174,41 +173,35 @@ Then run the binary via the wrapper script in a job (just a simple example, you [sbatch script](../jobs_and_resources/slurm.md#job-submission) for that) ```bash -zih$ srun ./run_compiled_executable.sh $EBROOTMATLAB +marie@login$ srun ./run_compiled_executable.sh $EBROOTMATLAB ``` ### Parallel MATLAB #### With 'local' Configuration -- If you want to run your code in parallel, please request as many - cores as you need! -- start a batch job with the number N of processes -- example for N= 4: \<pre> srun -c 4 --pty --x11=first bash\</pre> -- run Matlab with the GUI or the CLI or with a script -- inside use \<pre>matlabpool open 4\</pre> to start parallel - processing +- If you want to run your code in parallel, please request as many cores as you need! +- Start a batch job with the number `N` of processes, e.g., `srun --cpus-per-task=4 --pty + --x11=first bash -l` +- Run Matlab with the GUI or the CLI or with a script +- Inside Matlab use `matlabpool open 4` to start parallel processing -- example for 1000\*1000 matrixmutliplication - -!!! example +!!! example "Example for 1000*1000 matrix-matrix multiplication" ```bash R = distributed.rand(1000); D = R * R ``` -- to close parallel task: -`matlabpool close` +- Close parallel task using `matlabpool close` -#### With Parfor +#### With parfor -- start a batch job with the number N of processes (e.g. N=12) -- inside use `matlabpool open N` or - `matlabpool(N)` to start parallel processing. It will use +- Start a batch job with the number `N` of processes (,e.g., `N=12`) +- Inside use `matlabpool open N` or `matlabpool(N)` to start parallel processing. It will use the 'local' configuration by default. -- Use 'parfor' for a parallel loop, where the **independent** loop - iterations are processed by N threads +- Use `parfor` for a parallel loop, where the **independent** loop iterations are processed by `N` + threads !!! 
example diff --git a/doc.zih.tu-dresden.de/docs/software/misc/FlinkExample.ipynb b/doc.zih.tu-dresden.de/docs/software/misc/FlinkExample.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..5a867b8750704ea92a318087d82bb0ca3355018d --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/misc/FlinkExample.ipynb @@ -0,0 +1,159 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "!{sys.executable} -m pip install apache-flink --user" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "echo $FLINK_ROOT_DIR\n", + "echo $JAVA_HOME\n", + "hostname\n", + "if [ ! -d $HOME/jupyter-flink-conf ]\n", + "then\n", + "cp -r $FLINK_ROOT_DIR/conf $HOME/jupyter-flink-conf\n", + "chmod -R u+w $HOME/jupyter-flink-conf\n", + "fi" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "import os\n", + "os.environ['FLINK_CONF_DIR'] = os.environ['HOME'] + '/cluster-conf-' + os.environ['SLURM_JOBID'] + '/flink'\n", + "os.environ['PYTHONPATH'] = os.environ['PYTHONPATH'] + ':' + os.environ['HOME'] + '/.local/lib/python3.6/site-packages'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!SHELL=/bin/bash bash framework-configure.sh flink $HOME/jupyter-flink-conf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "exitcode = os.system('start-cluster.sh')\n", + "if not exitcode:\n", + " print(\"started Flink cluster successful\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "echo \"This is a short story for you. In this story nothing is happening. 
Have a nice day!\" > myFlinkTestFile" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pyflink.datastream import StreamExecutionEnvironment\n", + "from pyflink.datastream.connectors import FileSource\n", + "from pyflink.datastream.connectors import StreamFormat\n", + "from pyflink.common.watermark_strategy import WatermarkStrategy\n", + "from pyflink.common.typeinfo import Types\n", + "\n", + "env = StreamExecutionEnvironment.get_execution_environment()\n", + "env.set_parallelism(2)\n", + "#set the Python executable for the workers\n", + "env.set_python_executable(sys.executable)\n", + "# define the source\n", + "ds = env.from_source(source=FileSource.for_record_stream_format(StreamFormat.text_line_format(),\n", + " \"myFlinkTestFile\").process_static_file_set().build(),\n", + " watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),\n", + " source_name=\"file_source\")\n", + "\n", + "def split(line):\n", + " yield from line.split()\n", + "\n", + " \n", + "# compute word count\n", + "ds = ds.flat_map(split) \\\n", + " .map(lambda i: (i, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()])) \\\n", + " .key_by(lambda i: i[0]) \\\n", + " .reduce(lambda i, j: (i[0], i[1] + j[1])) \\\n", + " .map(lambda i: print(i))\n", + "\n", + "# submit for execution\n", + "env.execute()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "stop-cluster.sh" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!ps -ef | grep -i java" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pkill -f \"java\"" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-evaluate-menu.png b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-evaluate-menu.png new file mode 100644 index 0000000000000000000000000000000000000000..6d425818925017b52e455ddfb92b00904a0f302d Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-evaluate-menu.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-graph-result.png b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-graph-result.png new file mode 100644 index 0000000000000000000000000000000000000000..8dbbec668465134bbd35a78d63052b7c7d253d0e Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-graph-result.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-parallel-plot.png b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-parallel-plot.png new file mode 100644 index 0000000000000000000000000000000000000000..3702d69383fe4cb248456102f97e8a7fc8127ca0 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-parallel-plot.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-still-running-jobs.png b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-still-running-jobs.png new file mode 100644 index 
0000000000000000000000000000000000000000..d4cd05138805d13e6eedd61b3ad8b0c5c9416afe Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/OmniOpt-still-running-jobs.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/Pytorch_jupyter_module.png b/doc.zih.tu-dresden.de/docs/software/misc/Pytorch_jupyter_module.png new file mode 100644 index 0000000000000000000000000000000000000000..5f3e324da2114dc24382f57dfeb14c10554d60f5 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/Pytorch_jupyter_module.png differ diff --git a/Compendium_attachments/FEMSoftware/Rot-modell-BenjaminGroeger.inp b/doc.zih.tu-dresden.de/docs/software/misc/Rot-modell-BenjaminGroeger.inp similarity index 100% rename from Compendium_attachments/FEMSoftware/Rot-modell-BenjaminGroeger.inp rename to doc.zih.tu-dresden.de/docs/software/misc/Rot-modell-BenjaminGroeger.inp diff --git a/doc.zih.tu-dresden.de/docs/software/misc/SparkExample.ipynb b/doc.zih.tu-dresden.de/docs/software/misc/SparkExample.ipynb index ffe1aa174859fe6697f65af7ce7bd09d526e4bc1..959b536b85dd3d5d01c79217b697506a7517d4f3 100644 --- a/doc.zih.tu-dresden.de/docs/software/misc/SparkExample.ipynb +++ b/doc.zih.tu-dresden.de/docs/software/misc/SparkExample.ipynb @@ -1,5 +1,24 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "!{sys.executable} -m pip install findspark --user" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!which python" + ] + }, { "cell_type": "code", "execution_count": null, @@ -9,7 +28,13 @@ "%%bash\n", "echo $SPARK_HOME\n", "echo $JAVA_HOME\n", - "hostname" + "hostname\n", + "if [ ! -d $HOME/jupyter-spark-conf ]\n", + "then\n", + "cp -r $SPARK_HOME/conf $HOME/jupyter-spark-conf\n", + "chmod -R u+w $HOME/jupyter-spark-conf\n", + "echo \"ml `ml -t list Spark` 2>/dev/null\" >> $HOME/jupyter-spark-conf/spark-env.sh\n", + "fi" ] }, { @@ -21,7 +46,8 @@ "import sys\n", "import os\n", "os.environ['PYSPARK_PYTHON'] = sys.executable\n", - "os.environ['SPARK_CONF_DIR'] = os.environ['HOME'] + '/cluster-conf-' + os.environ['SLURM_JOBID'] + '/spark'" + "os.environ['SPARK_CONF_DIR'] = os.environ['HOME'] + '/cluster-conf-' + os.environ['SLURM_JOBID'] + '/spark'\n", + "os.environ['PYTHONPATH'] = os.environ['PYTHONPATH'] + ':' + os.environ['HOME'] + '/.local/lib/python3.6/site-packages'" ] }, { @@ -30,7 +56,7 @@ "metadata": {}, "outputs": [], "source": [ - "!SHELL=/bin/bash bash framework-configure.sh spark $SPARK_HOME/conf " + "!SHELL=/bin/bash bash framework-configure.sh spark $HOME/jupyter-spark-conf" ] }, { @@ -48,8 +74,16 @@ "metadata": {}, "outputs": [], "source": [ - "#import findspark\n", - "#findspark.init()\n", + "import findspark\n", + "findspark.init(os.environ['SPARK_HOME'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "import platform\n", "import pyspark\n", "from pyspark import SparkContext" @@ -109,14 +143,16 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "!pkill -f \"pyspark-shell\"" + ] } ], "metadata": { "kernelspec": { - "display_name": "haswell-py3.6-spark", + "display_name": "Python 3", "language": "python", - "name": "haswell-py3.6-spark" + "name": "python3" }, "language_info": { "codemirror_mode": { diff --git a/Compendium_attachments/FEMSoftware/ABAQUS-SLURM.pdf b/doc.zih.tu-dresden.de/docs/software/misc/abaqus-slurm.pdf 
similarity index 100% rename from Compendium_attachments/FEMSoftware/ABAQUS-SLURM.pdf rename to doc.zih.tu-dresden.de/docs/software/misc/abaqus-slurm.pdf diff --git a/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_r_RStudio_launcher.png b/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_r_RStudio_launcher.png deleted file mode 100644 index fd50be1824655ef7e39c2adf74287fa14a716148..0000000000000000000000000000000000000000 Binary files a/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_r_RStudio_launcher.png and /dev/null differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_rstudio_launcher.jpg b/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_rstudio_launcher.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8f12eb7e8afc8c1c12c1d772ccb391791ec3b550 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/data_analytics_with_rstudio_launcher.jpg differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch b/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch deleted file mode 100644 index 5a418a9c5e98f70b027618a4da1158010619556b..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch +++ /dev/null @@ -1,27 +0,0 @@ -#!/bin/bash -#SBATCH --time=00:03:00 -#SBATCH --partition=haswell -#SBATCH --nodes=1 -#SBATCH --exclusive -#SBATCH --mem=60G -#SBATCH -J "example-spark" - -ml Spark - -function myExitHandler () { - stop-all.sh -} - -#configuration -. framework-configure.sh spark $SPARK_HOME/conf - -#register cleanup hook in case something goes wrong -trap myExitHandler EXIT - -start-all.sh - -spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 1000 - -stop-all.sh - -exit 0 diff --git a/Compendium_attachments/PyTorch/example_PyTorch_parallel.zip b/doc.zih.tu-dresden.de/docs/software/misc/example_PyTorch_parallel.zip similarity index 100% rename from Compendium_attachments/PyTorch/example_PyTorch_parallel.zip rename to doc.zih.tu-dresden.de/docs/software/misc/example_PyTorch_parallel.zip diff --git a/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-GUI.png b/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-GUI.png new file mode 100644 index 0000000000000000000000000000000000000000..c292e7cefb46224585894acc8623e1bfa9878052 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-GUI.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-final-command.png b/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-final-command.png new file mode 100644 index 0000000000000000000000000000000000000000..b0b714462939f9acbd2e25e0d0eb39b431dba5de Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/hyperparameter_optimization-OmniOpt-final-command.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png b/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png new file mode 100644 index 0000000000000000000000000000000000000000..1327ee6304faf4b293c385981a750f362063ecbf Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/tensorflow_jupyter_module.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/zsh_autocd.png b/doc.zih.tu-dresden.de/docs/software/misc/zsh_autocd.png new file mode 100644 
index 0000000000000000000000000000000000000000..1d30a13f2dcc3af6e706fe8849aff6ee0739a76c Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/zsh_autocd.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/zsh_autocomplete_parameters.png b/doc.zih.tu-dresden.de/docs/software/misc/zsh_autocomplete_parameters.png new file mode 100644 index 0000000000000000000000000000000000000000..374e34a84ee88d6c0c9d47c47af609d01fc2c63c Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/zsh_autocomplete_parameters.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/zsh_autosuggestion.png b/doc.zih.tu-dresden.de/docs/software/misc/zsh_autosuggestion.png new file mode 100644 index 0000000000000000000000000000000000000000..872ed226a3f66e78063ad610e5edd8c0463a2922 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/zsh_autosuggestion.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/zsh_syntax_highlighting.png b/doc.zih.tu-dresden.de/docs/software/misc/zsh_syntax_highlighting.png new file mode 100644 index 0000000000000000000000000000000000000000..0e1e888c2bab317d1309289c07582dc08cdd1858 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/zsh_syntax_highlighting.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/misc/zsh_typo.png b/doc.zih.tu-dresden.de/docs/software/misc/zsh_typo.png new file mode 100644 index 0000000000000000000000000000000000000000..de04ba3d061cfb3c402e8b6d02bd7f60698e69c8 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/software/misc/zsh_typo.png differ diff --git a/doc.zih.tu-dresden.de/docs/software/modules.md b/doc.zih.tu-dresden.de/docs/software/modules.md index 8f5a0ae2c4792fd92c458dc89033b2058a1e22de..b4aa437d270b4dda1a64f655d3c8a9db9238df2c 100644 --- a/doc.zih.tu-dresden.de/docs/software/modules.md +++ b/doc.zih.tu-dresden.de/docs/software/modules.md @@ -1,199 +1,375 @@ # Modules -Usage of software on HPC systems is managed by a **modules system**. A module is a user interface -that provides utilities for the dynamic modification of a user's environment (e.g., *PATH*, -*LD_LIBRARY_PATH* etc.) to access the compilers, loader, libraries, and utilities. With the help -of modules, users can smoothly switch between different versions of installed software packages -and libraries. +Usage of software on HPC systems is managed by a **modules system**. -For all applications, tools, libraries etc. the correct environment can be easily set by the command +!!! note "Module" -``` -module load -``` + A module is a user interface that provides utilities for the dynamic modification of a user's + environment, e.g. prepending paths to: -e.g: `module load MATLAB`. If several versions are installed they can be chosen like: `module load -MATLAB/2019b`. + * `PATH` + * `LD_LIBRARY_PATH` + * `MANPATH` + * and more -A list of all modules shows by command + to help you to access compilers, loader, libraries and utilities. -``` -module available -#or -module avail -#or -ml av + By using modules, you can smoothly switch between different versions of + installed software packages and libraries. -``` +## Module Commands -Other important commands are: +Using modules is quite straightforward and the following table lists the basic commands. 
-```Bash -module help #show all module options -module list #list all user-installed modules -module purge #remove all user-installed modules -module spider #search for modules across all environments, can take a parameter -module load <modname> #load module modname -module rm <modname> #unload module modname -module switch <mod> <mod2> #unload module mod1; load module mod2 -``` +| Command | Description | +|:------------------------------|:-----------------------------------------------------------------| +| `module help` | Show all module options | +| `module list` | List active modules in the user environment | +| `module purge` | Remove modules from the user environment | +| `module avail [modname]` | List all available modules | +| `module spider [modname]` | Search for modules across all environments | +| `module load <modname>` | Load module `modname` in the user environment | +| `module unload <modname>` | Remove module `modname` from the user environment | +| `module switch <mod1> <mod2>` | Replace module `mod1` with module `mod2` | -Module files are ordered by their topic on Taurus. By default, with `module available` you will see -all available module files and topics. If you just wish to see the installed versions of a certain -module, you can use `module av <softwarename>` and all available versions of the exact software will -be displayed. +Module files are ordered by their topic on ZIH systems. By default, with `module avail` you will +see all topics and their available module files. If you just wish to see the installed versions of a +certain module, you can use `module avail softwarename` and it will display the available versions of +`softwarename` only. -## Module environments +### Examples -On Taurus, there exist different module environments, each containing a set of software modules. -They are activated via the meta module modenv which has different versions, one of which is loaded -by default. You can switch between them by simply loading the desired modenv-version, e.g.: +???+ example "Finding available software" + This examples illustrates the usage of the command `module avail` to search for available Matlab + installations. + + ```console + marie@compute$ module avail matlab + + ------------------------------ /sw/modules/scs5/math ------------------------------ + MATLAB/2017a MATLAB/2018b MATLAB/2020a + MATLAB/2018a MATLAB/2019b MATLAB/2021a (D) + + Wo: + D: Standard Modul. + + Verwenden Sie "module spider" um alle verfügbaren Module anzuzeigen. + Verwenden Sie "module keyword key1 key2 ...", um alle verfügbaren Module + anzuzeigen, die mindestens eines der Schlüsselworte enthält. + ``` + +???+ example "Loading and removing modules" + + A particular module or several modules are loaded into your environment using the `module load` + command. The counter part to remove a module or several modules is `module unload`. + + ```console + marie@compute$ module load Python/3.8.6 + Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies loaded. + ``` + +???+ example "Removing all modules" + + To remove all loaded modules from your environment with one keystroke, invoke + + ```console + marie@compute$ module purge + Die folgenden Module wurden nicht entladen: + (Benutzen Sie "module --force purge" um alle Module zu entladen): + + 1) modenv/scs5 + Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies unloaded. + ``` + +### Front-End ml + +There is a front end for the module command, which helps you to type less. It is `ml`. 
+ Any module command can be given after `ml`: + +| ml Command | module Command | +|:------------------|:------------------------------------------| +| `ml` | `module list` | +| `ml foo bar` | `module load foo bar` | +| `ml -foo -bar baz`| `module unload foo bar; module load baz` | +| `ml purge` | `module purge` | +| `ml show foo` | `module show foo` | + +???+ example "Usage of front-end ml" + + ```console + marie@compute$ ml +Python/3.8.6 + Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies loaded. + marie@compute$ ml + + Derzeit geladene Module: + 1) modenv/scs5 (S) 5) bzip2/1.0.8-GCCcore-10.2.0 9) SQLite/3.33.0-GCCcore-10.2.0 13) Python/3.8.6-GCCcore-10.2.0 + 2) GCCcore/10.2.0 6) ncurses/6.2-GCCcore-10.2.0 10) XZ/5.2.5-GCCcore-10.2.0 + 3) zlib/1.2.11-GCCcore-10.2.0 7) libreadline/8.0-GCCcore-10.2.0 11) GMP/6.2.0-GCCcore-10.2.0 + 4) binutils/2.35-GCCcore-10.2.0 8) Tcl/8.6.10-GCCcore-10.2.0 12) libffi/3.3-GCCcore-10.2.0 + + Wo: + S: Das Modul ist angeheftet. Verwenden Sie "--force", um das Modul zu entladen. + + marie@compute$ ml -Python/3.8.6 +ANSYS/2020R2 + Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies unloaded. + Module ANSYS/2020R2 loaded. + ``` + +## Module Environments + +On ZIH systems, there exist different **module environments**, each containing a set of software modules. +They are activated via the meta module `modenv` which has different versions, one of which is loaded +by default. You can switch between them by simply loading the desired modenv-version, e.g. + +```console +marie@compute$ module load modenv/ml ``` -module load modenv/ml -``` -| modenv/scs5 | SCS5 software | default | -| | | | -| modenv/ml | HPC-DA software (for use on the "ml" partition) | | -| modenv/hiera | WIP hierarchical module tree | | -| modenv/classic | Manually built pre-SCS5 (AE4.0) software | default | -| | | | - -The old modules (pre-SCS5) are still available after loading the corresponding `modenv` version -(classic), however, due to changes in the libraries of the operating system, it is not guaranteed -that they still work under SCS5. Please don't use modenv/classic if you do not absolutely have to. -Most software is available under modenv/scs5, too, just be aware of the possibly different spelling -(case-sensitivity). - -The command `module spider <modname>` allows searching for specific software in all modenv -environments. It will also display information on how to load a found module when giving a precise +### modenv/scs5 (default) + +* SCS5 software +* usually optimized for Intel processors (Partitions: `haswell`, `broadwell`, `gpu2`, `julia`) + +### modenv/ml + +* data analytics software (for use on the partition ml) +* necessary to run most software on the partition ml +(The instruction set [Power ISA](https://en.wikipedia.org/wiki/Power_ISA#Power_ISA_v.3.0) +is different from the usual x86 instruction set. +Thus the 'machine code' of other modenvs breaks). + +### modenv/hiera + +* uses a hierarchical module load scheme +* optimized software for AMD processors (Partitions: romeo, alpha) + +### modenv/classic + +* deprecated, old software. Is not being curated. +* may break due to library inconsistencies with the operating system. +* please don't use software from that modenv + +### Searching for Software + +The command `module spider <modname>` allows searching for a specific software across all modenv +environments. It will also display information on how to load a particular module when giving a precise module (with version) as the parameter. 
-## Per-architecture builds +??? example "Spider command" + + ```console + marie@login$ module spider p7zip + + ---------------------------------------------------------------------------------------------------------------------------------------------------------- + p7zip: + ---------------------------------------------------------------------------------------------------------------------------------------------------------- + Beschreibung: + p7zip is a quick port of 7z.exe and 7za.exe (command line version of 7zip) for Unix. 7-Zip is a file archiver with highest compression ratio. + + Versionen: + p7zip/9.38.1 + p7zip/17.03-GCCcore-10.2.0 + p7zip/17.03 + + ---------------------------------------------------------------------------------------------------------------------------------------------------------- + Um detaillierte Informationen über ein bestimmtes "p7zip"-Modul zu erhalten (auch wie das Modul zu laden ist), verwenden sie den vollständigen Namen des Moduls. + Zum Beispiel: + $ module spider p7zip/17.03 + ---------------------------------------------------------------------------------------------------------------------------------------------------------- + ``` + +In some cases a desired software is available as an extension of a module. + +??? example "Extension module" + ```console hl_lines="9" + marie@login$ module spider tensorboard + + -------------------------------------------------------------------------------------------------------------------------------- + tensorboard: + -------------------------------------------------------------------------------------------------------------------------------- + Versions: + tensorboard/2.4.1 (E) + + Names marked by a trailing (E) are extensions provided by another module. + [...] + ``` + + You retrieve further information using the `spider` command. + + ```console + marie@login$ module spider tensorboard/2.4.1 + + -------------------------------------------------------------------------------------------------------------------------------- + tensorboard: tensorboard/2.4.1 (E) + -------------------------------------------------------------------------------------------------------------------------------- + This extension is provided by the following modules. To access the extension you must load one of the following modules. Note that any module names in parentheses show the module location in the software hierarchy. + + TensorFlow/2.4.1 (modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5) + TensorFlow/2.4.1-fosscuda-2019b-Python-3.7.4 (modenv/ml) + TensorFlow/2.4.1-foss-2020b (modenv/scs5) + + Names marked by a trailing (E) are extensions provided by another module. + ``` + + Finaly, you can load the dependencies and `tensorboard/2.4.1` and check the version. + + ```console + marie@login$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 + + The following have been reloaded with a version change: + 1) modenv/scs5 => modenv/hiera + + Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5 and 15 dependencies loaded. + marie@login$ module load TensorFlow/2.4.1 + Module TensorFlow/2.4.1 and 34 dependencies loaded. + marie@login$ tensorboard --version + 2.4.1 + ``` + +## Per-Architecture Builds Since we have a heterogeneous cluster, we do individual builds of some of the software for each architecture present. This ensures that, no matter what partition the software runs on, a build -optimized for the host architecture is used automatically. 
This is achieved by having -'/sw/installed' symlinked to different directories on the compute nodes. +optimized for the host architecture is used automatically. +For that purpose we have created symbolic links on the compute nodes, +at the system path `/sw/installed`. However, not every module will be available for each node type or partition. Especially when introducing new hardware to the cluster, we do not want to rebuild all of the older module versions and in some cases cannot fall-back to a more generic build either. That's why we provide the script: `ml_arch_avail` that displays the availability of modules for the different node architectures. +### Example Invocation of ml_arch_avail + +```console +marie@compute$ ml_arch_avail TensorFlow/2.4.1 +TensorFlow/2.4.1: haswell, rome +TensorFlow/2.4.1: haswell, rome ``` -ml_arch_avail CP2K -Example output: +The command shows all modules that match on `TensorFlow/2.4.1`, and their respective availability. +Note that this will not work for meta-modules that do not have an installation directory +(like some tool chain modules). -#CP2K/6.1-foss-2019a: haswell, rome -#CP2K/5.1-intel-2018a: haswell -#CP2K/6.1-foss-2019a-spglib: haswell, rome -#CP2K/6.1-intel-2018a: haswell -#CP2K/6.1-intel-2018a-spglib: haswell -``` +## Advanced Usage -The command shows all modules that match on CP2K, and their respective availability. Note that this -will not work for meta-modules that do not have an installation directory (like some toolchain -modules). +For writing your own module files please have a look at the +[Guide for writing project and private module files](private_modules.md). -## Project and User Private Modules +## Troubleshooting -Private module files allow you to load your own installed software packages into your environment -and to handle different versions without getting into conflicts. Private modules can be setup for a -single user as well as all users of project group. The workflow and settings for user private module -files is described in the following. The [settings for project private -modules](#project-private-modules) differ only in details. +### When I log in, the wrong modules are loaded by default -The command +Reset your currently loaded modules with `module purge` +(or `module purge --force` if you also want to unload your basic `modenv` module). +Then run `module save` to overwrite the +list of modules you load by default when logging in. -``` -module use <path_to_module_files> -``` +### I can't load module TensorFlow -adds directory by user choice to the list of module directories that are searched by the `module` -command. Within directory `../privatemodules` user can add directories for every software user wish -to install and add also in this directory a module file for every version user have installed. -Further information about modules can be found [here](http://modules.sourceforge.net/). +Check the dependencies by e.g. calling `module spider TensorFlow/2.4.1` +it will list a number of modules that need to be loaded +before the TensorFlow module can be loaded. -This is an example of work a private module file: +??? example "Loading the dependencies" -- create a directory in your home directory: + ```console + marie@compute$ module load TensorFlow/2.4.1 + Lmod hat den folgenden Fehler erkannt: Diese Module existieren, aber + können nicht wie gewünscht geladen werden: "TensorFlow/2.4.1" + Versuchen Sie: "module spider TensorFlow/2.4.1" um anzuzeigen, wie die Module + geladen werden. 
-``` -cd -mkdir privatemodules && cd privatemodules -mkdir testsoftware && cd testsoftware -``` -- add the directory in the list of module directories: + marie@compute$ module spider TensorFlow/2.4.1 -``` -module use $HOME/privatemodules -``` + ---------------------------------------------------------------------------------- + TensorFlow: TensorFlow/2.4.1 + ---------------------------------------------------------------------------------- + Beschreibung: + An open-source software library for Machine Intelligence -- create a file with the name `1.0` with a test software in the `testsoftware` directory (use e.g. -echo, emacs, etc): -``` -#%Module###################################################################### -## -## testsoftware modulefile -## -proc ModulesHelp { } { - puts stderr "Loads testsoftware" -} - -set version 1.0 -set arch x86_64 -set path /home/<user>/opt/testsoftware/$version/$arch/ - -prepend-path PATH $path/bin -prepend-path LD_LIBRARY_PATH $path/lib - -if [ module-info mode load ] { - puts stderr "Load testsoftware version $version" -} -``` + Sie müssen alle Module in einer der nachfolgenden Zeilen laden bevor Sie das Modul "TensorFlow/2.4.1" laden können. -- check the availability of the module with `ml av`, the output should look like this: + modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 + This extension is provided by the following modules. To access the extension you must load one of the following modules. Note that any module names in parentheses show the module location in the software hierarchy. -``` ---------------------- /home/masterman/privatemodules --------------------- - testsoftware/1.0 -``` -- load the test module with `module load testsoftware`, the output: + TensorFlow/2.4.1 (modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5) -``` -Load testsoftware version 1.0 -Module testsoftware/1.0 loaded. -``` -### Project Private Modules + This module provides the following extensions: -Private module files allow you to load project- or group-wide installed software into your -environment and to handle different versions without getting into conflicts. + absl-py/0.10.0 (E), astunparse/1.6.3 (E), cachetools/4.2.0 (E), dill/0.3.3 (E), gast/0.3.3 (E), google-auth-oauthlib/0.4.2 (E), google-auth/1.24.0 (E), google-pasta/0.2.0 (E), grpcio/1.32.0 (E), gviz-api/1.9.0 (E), h5py/2.10.0 (E), Keras-Preprocessing/1.1.2 (E), Markdown/3.3.3 (E), oauthlib/3.1.0 (E), opt-einsum/3.3.0 (E), portpicker/1.3.1 (E), pyasn1-modules/0.2.8 (E), requests-oauthlib/1.3.0 (E), rsa/4.7 (E), tblib/1.7.0 (E), tensorboard-plugin-profile/2.4.0 (E), tensorboard-plugin-wit/1.8.0 (E), tensorboard/2.4.1 (E), tensorflow-estimator/2.4.0 (E), TensorFlow/2.4.1 (E), termcolor/1.1.0 (E), Werkzeug/1.0.1 (E), wrapt/1.12.1 (E) -The module files have to be stored in your global projects directory -`/projects/p_projectname/privatemodules`. An example of a module file can be found in the section -above. To use a project-wide module file you have to add the path to the module file to the module -environment with the command + Help: + Description + =========== + An open-source software library for Machine Intelligence -``` -module use /projects/p_projectname/privatemodules -``` -After that, the modules are available in your module environment and you can load the modules with -the `module load` command. 
+ More information + ================ + - Homepage: https://www.tensorflow.org/ + + + Included extensions + =================== + absl-py-0.10.0, astunparse-1.6.3, cachetools-4.2.0, dill-0.3.3, gast-0.3.3, + google-auth-1.24.0, google-auth-oauthlib-0.4.2, google-pasta-0.2.0, + grpcio-1.32.0, gviz-api-1.9.0, h5py-2.10.0, Keras-Preprocessing-1.1.2, + Markdown-3.3.3, oauthlib-3.1.0, opt-einsum-3.3.0, portpicker-1.3.1, + pyasn1-modules-0.2.8, requests-oauthlib-1.3.0, rsa-4.7, tblib-1.7.0, + tensorboard-2.4.1, tensorboard-plugin-profile-2.4.0, tensorboard-plugin- + wit-1.8.0, TensorFlow-2.4.1, tensorflow-estimator-2.4.0, termcolor-1.1.0, + Werkzeug-1.0.1, wrapt-1.12.1 + + + Names marked by a trailing (E) are extensions provided by another module. + + + + marie@compute$ ml +modenv/hiera +GCC/10.2.0 +CUDA/11.1.1 +OpenMPI/4.0.5 +TensorFlow/2.4.1 + + Die folgenden Module wurden in einer anderen Version erneut geladen: + 1) GCC/7.3.0-2.30 => GCC/10.2.0 3) binutils/2.30-GCCcore-7.3.0 => binutils/2.35 + 2) GCCcore/7.3.0 => GCCcore/10.2.0 4) modenv/scs5 => modenv/hiera -## Using Private Modules and Programs in the $HOME Directory + Module GCCcore/7.3.0, binutils/2.30-GCCcore-7.3.0, GCC/7.3.0-2.30, GCC/7.3.0-2.30 and 3 dependencies unloaded. + Module GCCcore/7.3.0, GCC/7.3.0-2.30, GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, TensorFlow/2.4.1 and 50 dependencies loaded. + marie@compute$ module list -An automated backup system provides security for the HOME-directories on the cluster on a daily -basis. This is the reason why we urge users to store (large) temporary data (like checkpoint files) -on the /scratch -Filesystem or at local scratch disks. + Derzeit geladene Module: + 1) modenv/hiera (S) 28) Tcl/8.6.10 + 2) GCCcore/10.2.0 29) SQLite/3.33.0 + 3) zlib/1.2.11 30) GMP/6.2.0 + 4) binutils/2.35 31) libffi/3.3 + 5) GCC/10.2.0 32) Python/3.8.6 + 6) CUDAcore/11.1.1 33) pybind11/2.6.0 + 7) CUDA/11.1.1 34) SciPy-bundle/2020.11 + 8) numactl/2.0.13 35) Szip/2.1.1 + 9) XZ/5.2.5 36) HDF5/1.10.7 + 10) libxml2/2.9.10 37) cURL/7.72.0 + 11) libpciaccess/0.16 38) double-conversion/3.1.5 + 12) hwloc/2.2.0 39) flatbuffers/1.12.0 + 13) libevent/2.1.12 40) giflib/5.2.1 + 14) Check/0.15.2 41) ICU/67.1 + 15) GDRCopy/2.1-CUDA-11.1.1 42) JsonCpp/1.9.4 + 16) UCX/1.9.0-CUDA-11.1.1 43) NASM/2.15.05 + 17) libfabric/1.11.0 44) libjpeg-turbo/2.0.5 + 18) PMIx/3.1.5 45) LMDB/0.9.24 + 19) OpenMPI/4.0.5 46) nsync/1.24.0 + 20) OpenBLAS/0.3.12 47) PCRE/8.44 + 21) FFTW/3.3.8 48) protobuf/3.14.0 + 22) ScaLAPACK/2.1.0 49) protobuf-python/3.14.0 + 23) cuDNN/8.0.4.30-CUDA-11.1.1 50) flatbuffers-python/1.12 + 24) NCCL/2.8.3-CUDA-11.1.1 51) typing-extensions/3.7.4.3 + 25) bzip2/1.0.8 52) libpng/1.6.37 + 26) ncurses/6.2 53) snappy/1.1.8 + 27) libreadline/8.0 54) TensorFlow/2.4.1 -**Please note**: We have set `ulimit -c 0` as a default to prevent users from filling the disk with -the dump of a crashed program. bash -users can use `ulimit -Sc unlimited` to enable the debugging -via analyzing the core file (limit coredumpsize unlimited for tcsh). + Wo: + S: Das Modul ist angeheftet. Verwenden Sie "--force", um das Modul zu entladen. 
+ ``` diff --git a/doc.zih.tu-dresden.de/docs/software/mpi_usage_error_detection.md b/doc.zih.tu-dresden.de/docs/software/mpi_usage_error_detection.md index 8d1d7e17a02c3dd2ab572216899cd37f7a9aee3a..b083e80cf9962a01a6580f8b5393912ebd2c3f40 100644 --- a/doc.zih.tu-dresden.de/docs/software/mpi_usage_error_detection.md +++ b/doc.zih.tu-dresden.de/docs/software/mpi_usage_error_detection.md @@ -40,7 +40,7 @@ Besides loading a MUST module, no further changes are needed during compilation ### Running your Application with MUST -In order to run your application with MUST you need to replace the srun command with mustrun: +In order to run your application with MUST you need to replace the `srun` command with `mustrun`: ```console marie@login$ mustrun -np <number of MPI processes> ./<your binary> @@ -65,14 +65,14 @@ marie@login$ mustrun -np 4 ./fancy-program [MUST] Execution finished, inspect "/home/marie/MUST_Output.html"! ``` -Besides replacing the srun command you need to be aware that **MUST always allocates an extra +Besides replacing the `srun` command you need to be aware that **MUST always allocates an extra process**, i.e. if you issue a `mustrun -np 4 ./a.out` then MUST will start 5 processes instead. This is usually not critical, however in batch jobs **make sure to allocate an extra CPU for this task**. Finally, MUST assumes that your application may crash at any time. To still gather correctness results under this assumption is extremely expensive in terms of performance overheads. Thus, if -your application does not crash, you should add an "--must:nocrash" to the mustrun command to make +your application does not crash, you should add `--must:nocrash` to the `mustrun` command to make MUST aware of this knowledge. Overhead is drastically reduced with this switch. ### Result Files diff --git a/doc.zih.tu-dresden.de/docs/software/nanoscale_simulations.md b/doc.zih.tu-dresden.de/docs/software/nanoscale_simulations.md index 1621756f01b01732b675d62dd97a7aa7543c74f2..7b40b9e480755b099c866f0dec2d9a707ea684af 100644 --- a/doc.zih.tu-dresden.de/docs/software/nanoscale_simulations.md +++ b/doc.zih.tu-dresden.de/docs/software/nanoscale_simulations.md @@ -230,7 +230,7 @@ Module ORCA/4.2.1-gompi-2019b and 11 dependencies loaded. ## Siesta -[Siesta](https://www.uam.es/siesta) (Spanish Initiative for Electronic Simulations with +[Siesta](https://siesta-project.org/siesta) (Spanish Initiative for Electronic Simulations with Thousands of Atoms) is both a method and its computer program implementation, to perform electronic structure calculations and ab initio molecular dynamics simulations of molecules and solids. diff --git a/doc.zih.tu-dresden.de/docs/software/ngc_containers.md b/doc.zih.tu-dresden.de/docs/software/ngc_containers.md new file mode 100644 index 0000000000000000000000000000000000000000..f19612d9a3310f869a483c20328d51168317552a --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/ngc_containers.md @@ -0,0 +1,161 @@ +# GPU-accelerated Containers for Deep Learning (NGC Containers) + +A [container](containers.md) is an executable and portable unit of software. +On ZIH systems, [Singularity](https://sylabs.io/) is used as a standard container solution. + +[NGC](https://developer.nvidia.com/ai-hpc-containers), +a registry of highly GPU-optimized software, +has been enabling scientists and researchers by providing regularly updated +and validated containers of HPC and AI applications. 
+NGC containers are **GPU-optimized** containers
+for deep learning, machine learning, visualization:
+
+- Built-in libraries and dependencies;
+- Faster training with Automatic Mixed Precision (AMP);
+- Opportunity to scale up from single-node to multi-node systems;
+- Performance optimized.
+
+!!! note "Advantages of NGC containers"
+    - NGC containers are highly optimized for cluster usage.
+      The performance provided by NGC containers is comparable to the performance
+      provided by the modules on the ZIH system (which is potentially the most performant way).
+      NGC containers are a quick and efficient way to apply the best models
+      on your dataset on a ZIH system;
+    - NGC containers allow using an exact version of the software
+      without installing it with all prerequisites manually.
+      Manual installation can result in poor performance (e.g., using conda to install software).
+
+## Run NGC Containers on the ZIH System
+
+The first step is to choose the software (container) you want to run.
+The [NVIDIA NGC catalog](https://ngc.nvidia.com/catalog)
+contains a host of GPU-optimized containers for deep learning,
+machine learning, visualization, and HPC applications that are tested
+for performance, security, and scalability.
+It is necessary to register to have full access to the catalog.
+
+To find a container that fits the requirements of your task, please check
+the [official examples page](https://github.com/NVIDIA/DeepLearningExamples),
+which lists the main containers with their features and peculiarities.
+
+### Run NGC Container on a Single GPU
+
+!!! note
+    Almost all NGC containers can work with a single GPU.
+
+To use NGC containers, it is necessary to understand the main Singularity commands.
+
+If you are not familiar with Singularity's syntax, please find the information on the
+[official page](https://sylabs.io/guides/3.0/user-guide/quick_start.html#interact-with-images).
+However, the most important commands are explained below.
+
+Create a container from an image in the NGC catalog
+(for this example, the partition `alpha` is used):
+
+```console
+marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
+
+marie@compute$ cd /scratch/ws/<name_of_your_workspace>/containers    #please create a Workspace
+
+marie@compute$ singularity pull pytorch:21.08-py3.sif docker://nvcr.io/nvidia/pytorch:21.08-py3
+```
+
+Now, you have a fully functional PyTorch container.
+
+Please pay attention: running `srun` directly on the shell blocks the terminal and starts an
+interactive job. Apart from short test runs, it is recommended to launch your jobs in the
+background by using batch jobs. For that, you can conveniently put the parameters directly into a
+job file, which you can submit using the `sbatch` command (a sketch of such a job file is given
+below).
+
+In the majority of cases, the container doesn't contain the dataset for training models.
+To download the dataset, please follow the
+[instructions](https://github.com/NVIDIA/DeepLearningExamples) for the exact container.
+Alternatively, you can find the instructions in a README file inside the container:
+
+```console
+marie@compute$ singularity exec pytorch:21.06-py3_beegfs vim /workspace/examples/resnet50v1.5/README.md
+```
+
+It is recommended to run the container with a single command.
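+For the batch job mentioned above, a minimal job file could look like the following sketch.
+The partition, resources, image name, and training command are assumptions that you need to adapt
+to your use case (see the container's README for the actual training invocation):
+
+```bash
+#!/bin/bash
+#SBATCH --partition=alpha
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --gres=gpu:1
+#SBATCH --time=08:00:00
+#SBATCH --mem=50000
+
+# Run the training script inside the container pulled above;
+# replace the placeholder with the script and options from the container's README.
+singularity exec --nv pytorch:21.08-py3.sif python <your_training_script.py>
+```
+
+You can submit such a file with `sbatch <job_file_name>`.
+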
+However, for educational purposes, the separate commands are presented below:
+
+```console
+marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 bash
+```
+
+Run a shell within a container with the `singularity shell` command:
+
+```console
+marie@compute$ singularity shell --nv -B /scratch/imagenet:/data/imagenet pytorch:21.06-py3
+```
+
+The flag `--nv` in the command above enables Nvidia GPU support, and the flag `-B` specifies a
+user-bind path.
+
+Run the training inside the container:
+
+```console
+marie@container$ python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node=1 \
+    --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu \
+    --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 \
+    --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 \
+    --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
+```
+
+!!! warning
+    Please keep in mind that it is necessary to specify the amount of resources that you use inside
+    the container, especially if you have allocated more resources in the cluster. Usually, you
+    can do this with flags such as `--nproc_per_node`. You can find more information in the README
+    file inside the container.
+
+As an example, please find the full command to run the ResNet50 model
+on the ImageNet dataset inside the PyTorch container:
+
+```console
+marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=1 --ntasks=1 --gres=gpu:1 --time=08:00:00 --pty --mem=50000 \
+    singularity exec --nv -B /scratch/ws/0/marie-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \
+    python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 1 \
+    --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu --raport-file raport.json \
+    -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 \
+    --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet
+```
+
+### Multi-GPU Usage
+
+The majority of the NGC containers allow you to use multiple GPUs from one node
+to run the model inside the container.
+However, the NGC containers were made by Nvidia for Nvidia's own clusters,
+which differ from the ZIH systems.
+Moreover, editing NGC containers requires root privileges, which is only possible
+via the workflow described on the [containers](containers.md) page.
+Thus, there is no guarantee that all NGC containers work right out of the box.
+
+However, PyTorch and TensorFlow containers support multi-GPU usage.
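+Before starting a multi-GPU training, it can be helpful to verify how many GPUs are actually
+visible inside the container. A minimal check could look like this (the image name is the one used
+in the examples above, and a GPU allocation is assumed):
+
+```console
+marie@alpha$ singularity exec --nv pytorch:21.06-py3 python -c "import torch; print(torch.cuda.device_count())"
+```
+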
+ +An example of using the PyTorch container for the training of the ResNet50 model +on the classification task on the ImageNet dataset is presented below: + +```console +marie@login$ srun --partition=alpha --nodes=1 --ntasks-per-node=8 --ntasks=8 --gres=gpu:8 --time=08:00:00 --pty --mem=700G bash +``` + +```console +marie@alpha$ singularity exec --nv -B /scratch/ws/0/marie-ImgNet/imagenet:/data/imagenet pytorch:21.06-py3 \ + python /workspace/examples/resnet50v1.5/multiproc.py --nnodes=1 --nproc_per_node 8 \ + --node_rank=0 /workspace/examples/resnet50v1.5/main.py --data-backend dali-cpu \ + --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 \ + --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 \ + --wd 3.0517578125e-05 -b 256 --epochs 90 /data/imagenet +``` + +Please pay attention to the parameter `--nproc_per_node`. +The value is equal to 8 because 8 GPUs per node were allocated with `--gres=gpu:8`. + +### Multi-node Usage + +There are few NGC containers with Multi-node support +[available](https://github.com/NVIDIA/DeepLearningExamples). +Moreover, the realization of the multi-node usage depends on the authors +of the exact container. +Thus, right now, it is not possible to run NGC containers with multi-node support +on the ZIH system without changing the source code inside the container. diff --git a/doc.zih.tu-dresden.de/docs/software/overview.md b/doc.zih.tu-dresden.de/docs/software/overview.md index 835d22204fcda298899f49d4b2a95b092b7e3da1..9d2d86d7c06989acfcb9415f908fc1453538b6a8 100644 --- a/doc.zih.tu-dresden.de/docs/software/overview.md +++ b/doc.zih.tu-dresden.de/docs/software/overview.md @@ -12,7 +12,7 @@ so called dotfiles in your home directory, e.g., `~/.bashrc` or `~/.bash_profile ## Software Environment There are different options to work with software on ZIH systems: [modules](#modules), -[JupyterNotebook](#jupyternotebook) and [containers](#containers). Brief descriptions and related +[Jupyter Notebook](#jupyternotebook) and [containers](#containers). Brief descriptions and related links on these options are provided below. !!! note @@ -21,19 +21,9 @@ links on these options are provided below. * `scs5` environment for the x86 architecture based compute resources * and `ml` environment for the Machine Learning partition based on the Power9 architecture. -According to [What software do I need]**todo link**, first of all, check the [Software module -list]**todo link**. - -<!--Work with the software on ZIH systems could be started only after allocating the resources by [batch--> -<!--systems]**todo link**.--> - -<!--After logging in, you are on one of the login nodes. They are not meant for work, but only for the--> -<!--login process and short tests. Allocating resources will be done by batch system--> -<!--[SLURM](../jobs_and_resources/slurm.md).--> - ## Modules -Usage of software on HPC systems, e.g., frameworks, compilers, loader and libraries, is +Usage of software on ZIH systems, e.g., frameworks, compilers, loader and libraries, is almost always managed by a **modules system**. Thus, it is crucial to be familiar with the [modules concept and its commands](modules.md). A module is a user interface that provides utilities for the dynamic modification of a user's environment without manual modifications. @@ -47,8 +37,8 @@ The [Jupyter Notebook](https://jupyter.org/) is an open-source web application t documents containing live code, equations, visualizations, and narrative text. 
There is a [JupyterHub](../access/jupyterhub.md) service on ZIH systems, where you can simply run your Jupyter notebook on compute nodes using [modules](#modules), preloaded or custom virtual environments. -Moreover, you can run a [manually created remote jupyter server](deep_learning.md) for more specific -cases. +Moreover, you can run a [manually created remote Jupyter server](../archive/install_jupyter.md) +for more specific cases. ## Containers diff --git a/doc.zih.tu-dresden.de/docs/software/papi.md b/doc.zih.tu-dresden.de/docs/software/papi.md index 9d96cc58f4453692ad7b57abe3e56abda1539290..2de80b4e8a0f420a6b42cd01a3de027b5fb89be2 100644 --- a/doc.zih.tu-dresden.de/docs/software/papi.md +++ b/doc.zih.tu-dresden.de/docs/software/papi.md @@ -20,8 +20,8 @@ To collect performance events, PAPI provides two APIs, the *high-level* and *low The high-level API provides the ability to record performance events inside instrumented regions of serial, multi-processing (MPI, SHMEM) and thread (OpenMP, Pthreads) parallel applications. It is -designed for simplicity, not flexibility. For more details click -[here](https://bitbucket.org/icl/papi/wiki/PAPI-HL.md). +designed for simplicity, not flexibility. More details can be found in the +[PAPI wiki High-Level API description](https://bitbucket.org/icl/papi/wiki/PAPI-HL.md). The following code example shows the use of the high-level API by marking a code section. @@ -86,19 +86,19 @@ more output files in JSON format. ### Low-Level API -The low-level API manages hardware events in user-defined groups -called Event Sets. It is meant for experienced application programmers and tool developers wanting -fine-grained measurement and control of the PAPI interface. It provides access to both PAPI preset -and native events, and supports all installed components. For more details on the low-level API, -click [here](https://bitbucket.org/icl/papi/wiki/PAPI-LL.md). +The low-level API manages hardware events in user-defined groups called Event Sets. It is meant for +experienced application programmers and tool developers wanting fine-grained measurement and +control of the PAPI interface. It provides access to both PAPI preset and native events, and +supports all installed components. The PAPI wiki contains also a page with more details on the +[low-level API](https://bitbucket.org/icl/papi/wiki/PAPI-LL.md). ## Usage on ZIH Systems Before you start a PAPI measurement, check which events are available on the desired architecture. -For this purpose PAPI offers the tools `papi_avail` and `papi_native_avail`. If you want to measure +For this purpose, PAPI offers the tools `papi_avail` and `papi_native_avail`. If you want to measure multiple events, please check which events can be measured concurrently using the tool -`papi_event_chooser`. For more details on the PAPI tools click -[here](https://bitbucket.org/icl/papi/wiki/PAPI-Overview.md#markdown-header-papi-utilities). +`papi_event_chooser`. The PAPI wiki contains more details on +[the PAPI tools](https://bitbucket.org/icl/papi/wiki/PAPI-Overview.md#markdown-header-papi-utilities). !!! hint @@ -133,8 +133,7 @@ compile your application against the PAPI library. !!! hint The PAPI modules on ZIH systems are only installed with the default `perf_event` component. If you - want to measure, e.g., GPU events, you have to install your own PAPI. Instructions on how to - download and install PAPI can be found - [here](https://bitbucket.org/icl/papi/wiki/Downloading-and-Installing-PAPI.md). 
To install PAPI - with additional components, you have to specify them during configure, for details click - [here](https://bitbucket.org/icl/papi/wiki/PAPI-Overview.md#markdown-header-components). + want to measure, e.g., GPU events, you have to install your own PAPI. Please see the + [external instructions on how to download and install PAPI](https://bitbucket.org/icl/papi/wiki/Downloading-and-Installing-PAPI.md). + To install PAPI with additional components, you have to specify them during configure as + described for the [Installation of Components](https://bitbucket.org/icl/papi/wiki/PAPI-Overview.md#markdown-header-components). diff --git a/doc.zih.tu-dresden.de/docs/software/perf_tools.md b/doc.zih.tu-dresden.de/docs/software/perf_tools.md index 16007698726b0430f84ef20acc80cb9e1766d64d..83398f49cb68a3255e051ae866a3679124559bef 100644 --- a/doc.zih.tu-dresden.de/docs/software/perf_tools.md +++ b/doc.zih.tu-dresden.de/docs/software/perf_tools.md @@ -1,8 +1,8 @@ # Introduction `perf` consists of two parts: the kernel space implementation and the userland tools. This wiki -entry focusses on the latter. These tools are installed on taurus, and others and provides support -for sampling applications and reading performance counters. +entry focusses on the latter. These tools are installed on ZIH systems, and others and provides +support for sampling applications and reading performance counters. ## Configuration @@ -34,18 +34,18 @@ Run `perf stat <Your application>`. This will provide you with a general overview on some counters. ```Bash -Performance counter stats for 'ls':= - 2,524235 task-clock # 0,352 CPUs utilized - 15 context-switches # 0,006 M/sec - 0 CPU-migrations # 0,000 M/sec - 292 page-faults # 0,116 M/sec - 6.431.241 cycles # 2,548 GHz - 3.537.620 stalled-cycles-frontend # 55,01% frontend cycles idle - 2.634.293 stalled-cycles-backend # 40,96% backend cycles idle - 6.157.440 instructions # 0,96 insns per cycle - # 0,57 stalled cycles per insn - 1.248.527 branches # 494,616 M/sec - 34.044 branch-misses # 2,73% of all branches +Performance counter stats for 'ls':= + 2,524235 task-clock # 0,352 CPUs utilized + 15 context-switches # 0,006 M/sec + 0 CPU-migrations # 0,000 M/sec + 292 page-faults # 0,116 M/sec + 6.431.241 cycles # 2,548 GHz + 3.537.620 stalled-cycles-frontend # 55,01% frontend cycles idle + 2.634.293 stalled-cycles-backend # 40,96% backend cycles idle + 6.157.440 instructions # 0,96 insns per cycle + # 0,57 stalled cycles per insn + 1.248.527 branches # 494,616 M/sec + 34.044 branch-misses # 2,73% of all branches 0,007167707 seconds time elapsed ``` @@ -142,10 +142,10 @@ If you added a callchain, it also gives you a callchain profile.\<br /> \*Discla not an appropriate way to gain exact numbers. So this is merely a rough overview and not guaranteed to be absolutely correct.\*\<span style="font-size: 1em;"> \</span> -### On Taurus +### On ZIH systems -On Taurus, users are not allowed to see the kernel functions. If you have multiple events defined, -then the first thing you select in `perf report` is the type of event. Press right +On ZIH systems, users are not allowed to see the kernel functions. If you have multiple events +defined, then the first thing you select in `perf report` is the type of event. Press right ```Bash Available samples @@ -165,7 +165,7 @@ If you'd select cycles, you would get such a screen: ```Bash Events: 96 cycles + 49,13% test_gcc_perf test_gcc_perf [.] main.omp_fn.0 -+ 34,48% test_gcc_perf test_gcc_perf [.] 
++ 34,48% test_gcc_perf test_gcc_perf [.] + 6,92% test_gcc_perf test_gcc_perf [.] omp_get_thread_num@plt + 5,20% test_gcc_perf libgomp.so.1.0.0 [.] omp_get_thread_num + 2,25% test_gcc_perf test_gcc_perf [.] main.omp_fn.1 diff --git a/doc.zih.tu-dresden.de/docs/software/pika.md b/doc.zih.tu-dresden.de/docs/software/pika.md index 36aab905dbf33602c64333e2a695070ffc0ad9db..deecced31ce928fcb2347a286d7f13a83ed05d17 100644 --- a/doc.zih.tu-dresden.de/docs/software/pika.md +++ b/doc.zih.tu-dresden.de/docs/software/pika.md @@ -2,19 +2,19 @@ PIKA is a hardware performance monitoring stack to identify inefficient HPC jobs. Users of ZIH systems have the possibility to visualize and analyze the efficiency of their jobs via the -[PIKA web interface](https://selfservice.zih.tu-dresden.de/l/index.php/hpcportal/jobmonitoring/z../jobs_and_resources). +[PIKA web interface](https://selfservice.zih.tu-dresden.de/l/index.php/hpcportal/jobmonitoring/zih/jobs). !!! hint To understand this small guide, it is recommended to open the - [web interface](https://selfservice.zih.tu-dresden.de/l/index.php/hpcportal/jobmonitoring/z../jobs_and_resources) + [web interface](https://selfservice.zih.tu-dresden.de/l/index.php/hpcportal/jobmonitoring/zih/jobs) in a separate window. Furthermore, at least one real HPC job should have been submitted. ## Overview PIKA consists of several components and tools. It uses the collection daemon collectd, InfluxDB to store time-series data and MariaDB to store job metadata. Furthermore, it provides a powerful -[web interface](https://selfservice.zih.tu-dresden.de/l/index.php/hpcportal/jobmonitoring/z../jobs_and_resources) +[web interface](https://selfservice.zih.tu-dresden.de/l/index.php/hpcportal/jobmonitoring/zih/jobs) for the visualization and analysis of job performance data. ## Table View and Job Search @@ -90,7 +90,7 @@ reason for further investigation, since not all HUs are equally utilized. To identify imbalances between HUs over time, the visualization modes *Best* and *Lowest* are a first indicator how much the HUs differ in terms of resource usage. The timelines *Best* and -*Lowest* show the recoded performance data of the best/lowest average HU over time. +*Lowest* show the recorded performance data of the best/lowest average HU over time. ## Footprint Visualization @@ -111,7 +111,7 @@ investigating their correlation. ## Hints If users wish to perform their own measurement of performance counters using performance tools other -than PIKA, it is recommended to disable PIKA monitoring. This can be done using the following slurm +than PIKA, it is recommended to disable PIKA monitoring. This can be done using the following Slurm flags in the job script: ```Bash diff --git a/doc.zih.tu-dresden.de/docs/software/power_ai.md b/doc.zih.tu-dresden.de/docs/software/power_ai.md index dc0fa59b3fc53e180bd620dde71df5597c33298f..b4beda5cec2b8b2e1ede4729df7434b6e8c8e7d5 100644 --- a/doc.zih.tu-dresden.de/docs/software/power_ai.md +++ b/doc.zih.tu-dresden.de/docs/software/power_ai.md @@ -2,81 +2,56 @@ There are different documentation sources for users to learn more about the PowerAI Framework for Machine Learning. In the following the links -are valid for PowerAI version 1.5.4 +are valid for PowerAI version 1.5.4. -## General Overview: +!!! warning + The information provided here is available from IBM and can be used on partition ml only! 
-- \<a - href="<https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.3/welcome/welcome.htm>" - target="\_blank" title="Landing Page">Landing Page\</a> (note that - you can select different PowerAI versions with the drop down menu - "Change Product or version") -- \<a - href="<https://developer.ibm.com/linuxonpower/deep-learning-powerai/>" - target="\_blank" title="PowerAI Developer Portal">PowerAI Developer - Portal \</a>(Some Use Cases and examples) -- \<a - href="<https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html>" - target="\_blank" title="Included Software Packages">Included - Software Packages\</a> (note that you can select different PowerAI - versions with the drop down menu "Change Product or version") - -## Specific User Howtos. Getting started with...: +## General Overview -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted.htm>" - target="\_blank" title="Getting Started with PowerAI">PowerAI\</a> -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_caffe.html>" - target="\_blank" title="Caffe">Caffe\</a> -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorflow.html?view=kc>" - target="\_blank" title="Tensorflow">TensorFlow\</a> -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorflow_prob.html?view=kc>" - target="\_blank" title="Tensorflow Probability">TensorFlow - Probability\</a>\<br />This release of PowerAI includes TensorFlow - Probability. TensorFlow Probability is a library for probabilistic - reasoning and statistical analysis in TensorFlow. -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorboard.html?view=kc>" - target="\_blank" title="Tensorboard">TensorBoard\</a> -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_snapml.html>" - target="\_blank">Snap ML\</a>\<br />This release of PowerAI includes - Snap Machine Learning (Snap ML). Snap ML is a library for training - generalized linear models. It is being developed at IBM with the - vision to remove training time as a bottleneck for machine learning - applications. Snap ML supports many classical machine learning - models and scales gracefully to data sets with billions of examples - or features. It also offers distributed training, GPU acceleration, - and supports sparse data structures. -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_pytorch.html>" - target="\_blank">PyTorch\</a>\<br />This release of PowerAI includes - the community development preview of PyTorch 1.0 (rc1). PowerAI's - PyTorch includes support for IBM's Distributed Deep Learning (DDL) - and Large Model Support (LMS). -- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_caffe2ONNX.html>" - target="\_blank">Caffe2 and ONNX\</a>\<br />This release of PowerAI - includes a Technology Preview of Caffe2 and ONNX. Caffe2 is a - companion to PyTorch. PyTorch is great for experimentation and rapid - development, while Caffe2 is aimed at production environments. ONNX - (Open Neural Network Exchange) provides support for moving models - between those frameworks. 
-- \<a - href="<https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_ddl.html?view=kc>" - target="\_blank" title="Distributed Deep Learning">Distributed Deep - Learning\</a> (DDL). \<br />Works up to 4 TaurusML worker nodes. - (Larger models with more nodes are possible with PowerAI Enterprise) +- [PowerAI Introduction](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.3/welcome/welcome.htm) + (note that you can select different PowerAI versions with the drop down menu + "Change Product or version") +- [PowerAI Developer Portal](https://developer.ibm.com/linuxonpower/deep-learning-powerai/) + (Some Use Cases and examples) +- [Included Software Packages](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html) + (note that you can select different PowerAI versions with the drop down menu "Change Product + or version") + +## Specific User Guides + +- [Getting Started with PowerAI](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted.htm) +- [Caffe](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_caffe.html) +- [TensorFlow](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorflow.html?view=kc) +- [TensorFlow Probability](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorflow_prob.html?view=kc) + This release of PowerAI includes TensorFlow Probability. TensorFlow Probability is a library + for probabilistic reasoning and statistical analysis in TensorFlow. +- [TensorBoard](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_tensorboard.html?view=kc) +- [Snap ML](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_snapml.html) + This release of PowerAI includes Snap Machine Learning (Snap ML). Snap ML is a library for + training generalized linear models. It is being developed at IBM with the + vision to remove training time as a bottleneck for machine learning + applications. Snap ML supports many classical machine learning + models and scales gracefully to data sets with billions of examples + or features. It also offers distributed training, GPU acceleration, + and supports sparse data structures. +- [PyTorch](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_pytorch.html) + This release of PowerAI includes + the community development preview of PyTorch 1.0 (rc1). PowerAI's + PyTorch includes support for IBM's Distributed Deep Learning (DDL) + and Large Model Support (LMS). +- [Caffe2 and ONNX](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_caffe2ONNX.html) + This release of PowerAI includes a Technology Preview of Caffe2 and ONNX. Caffe2 is a + companion to PyTorch. PyTorch is great for experimentation and rapid + development, while Caffe2 is aimed at production environments. ONNX + (Open Neural Network Exchange) provides support for moving models + between those frameworks. +- [Distributed Deep Learning](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_ddl.html?view=kc) + Distributed Deep Learning (DDL). Works on up to 4 nodes on partition `ml`. ## PowerAI Container We have converted the official Docker container to Singularity. 
Here is a documentation about the Docker base container, including a table with the individual software versions of the packages installed within the -container: - -- \<a href="<https://hub.docker.com/r/ibmcom/powerai/>" - target="\_blank">PowerAI Docker Container Docu\</a> +container: [PowerAI Docker Container](https://hub.docker.com/r/ibmcom/powerai/). diff --git a/doc.zih.tu-dresden.de/docs/software/private_modules.md b/doc.zih.tu-dresden.de/docs/software/private_modules.md new file mode 100644 index 0000000000000000000000000000000000000000..4b79463f05988afd689b5fa18bddc758c16dfaa7 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/private_modules.md @@ -0,0 +1,105 @@ +# Project and User Private Modules + +Private module files allow you to load your own installed software packages into your environment +and to handle different versions without getting into conflicts. Private modules can be setup for a +single user as well as all users of project group. The workflow and settings for user private module +files is described in the following. The [settings for project private +modules](#project-private-modules) differ only in details. + +In order to use your own module files please use the command +`module use <path_to_module_files>`. It will add the path to the list of module directories +that are searched by lmod (i.e. the `module` command). You may use a directory `privatemodules` +within your home or project directory to setup your own module files. + +Please see the [Environment Modules open source project's web page](http://modules.sourceforge.net/) +for further information on writing module files. + +## 1. Create Directories + +```console +marie@compute$ cd $HOME +marie@compute$ mkdir --verbose --parents privatemodules/testsoftware +marie@compute$ cd privatemodules/testsoftware +``` + +(create a directory in your home directory) + +## 2. Notify lmod + +```console +marie@compute$ module use $HOME/privatemodules +``` + +(add the directory in the list of module directories) + +## 3. Create Modulefile + +Create a file with the name `1.0` with a +test software in the `testsoftware` directory you created earlier +(using your favorite editor) and paste the following text into it: + +``` +#%Module###################################################################### +## +## testsoftware modulefile +## +proc ModulesHelp { } { + puts stderr "Loads testsoftware" +} + +set version 1.0 +set arch x86_64 +set path /home/<user>/opt/testsoftware/$version/$arch/ + +prepend-path PATH $path/bin +prepend-path LD_LIBRARY_PATH $path/lib + +if [ module-info mode load ] { + puts stderr "Load testsoftware version $version" +} +``` + +## 4. Check lmod + +Check the availability of the module with `ml av`, the output should look like this: + +``` +--------------------- /home/masterman/privatemodules --------------------- + testsoftware/1.0 +``` + +## 5. Load Module + +Load the test module with `module load testsoftware`, the output should look like this: + +```console +Load testsoftware version 1.0 +Module testsoftware/1.0 loaded. +``` + +## Project Private Modules + +Private module files allow you to load project- or group-wide installed software into your +environment and to handle different versions without getting into conflicts. + +The module files have to be stored in your global projects directory +`/projects/p_projectname/privatemodules`. An example of a module file can be found in the section +above. 
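+For example, the user private module file from the section above could be made available to all
+members of a project by copying it into the project directory (a sketch; `p_projectname` is a
+placeholder for your project):
+
+```console
+marie@login$ mkdir --parents /projects/p_projectname/privatemodules/testsoftware
+marie@login$ cp $HOME/privatemodules/testsoftware/1.0 /projects/p_projectname/privatemodules/testsoftware/
+```
+
+Keep in mind that paths inside the copied module file (e.g., the `set path` line) may need to be
+adjusted to a project-wide installation location.
+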
To use a project-wide module file you have to add the path to the module file to the module +environment with the command + +```console +marie@compute$ module use /projects/p_projectname/privatemodules +``` + +After that, the modules are available in your module environment and you can load the modules with +the `module load` command. + +## Using Private Modules and Programs in the $HOME Directory + +An automated backup system provides security for the HOME-directories on the cluster on a daily +basis. This is the reason why we urge users to store (large) temporary data (like checkpoint files) +on the /scratch filesystem or at local scratch disks. + +**Please note**: We have set `ulimit -c 0` as a default to prevent users from filling the disk with +the dump of crashed programs. `bash` users can use `ulimit -Sc unlimited` to enable the debugging +via analyzing the core file. diff --git a/doc.zih.tu-dresden.de/docs/software/python.md b/doc.zih.tu-dresden.de/docs/software/python.md deleted file mode 100644 index b9bde2e2324d2d413c65f1cb4a6b34d45f5225bf..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/python.md +++ /dev/null @@ -1,298 +0,0 @@ -# Python for Data Analytics - -Python is a high-level interpreted language widely used in research and -science. Using HPC allows you to work with python quicker and more -effective. Taurus allows working with a lot of available packages and -libraries which give more useful functionalities and allow use all -features of Python and to avoid minuses. - -**Prerequisites:** To work with PyTorch you obviously need [access](../access/ssh_login.md) for the -Taurus system and basic knowledge about Python, Numpy and SLURM system. - -**Aim** of this page is to introduce users on how to start working with Python on the -[HPC-DA](../jobs_and_resources/power9.md) system - part of the TU Dresden HPC system. - -There are three main options on how to work with Keras and Tensorflow on the HPC-DA: 1. Modules; 2. -[JupyterNotebook](../access/jupyterhub.md); 3.[Containers](containers.md). The main way is using -the [Modules system](modules.md) and Python virtual environment. - -Note: You could work with simple examples in your home directory but according to -[HPCStorageConcept2019](../data_lifecycle/overview.md) please use **workspaces** -for your study and work projects. - -## Virtual environment - -There are two methods of how to work with virtual environments on -Taurus: - -1. **Vitualenv** is a standard Python tool to create isolated Python environments. - It is the preferred interface for - managing installations and virtual environments on Taurus and part of the Python modules. - -2. **Conda** is an alternative method for managing installations and -virtual environments on Taurus. Conda is an open-source package -management system and environment management system from Anaconda. The -conda manager is included in all versions of Anaconda and Miniconda. - -**Note:** Keep in mind that you **cannot** use virtualenv for working -with the virtual environments previously created with conda tool and -vice versa! Prefer virtualenv whenever possible. - -This example shows how to start working -with **Virtualenv** and Python virtual environment (using the module system) - -```Bash -srun -p ml -N 1 -n 1 -c 7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash #Job submission in ml nodes with 1 gpu on 1 node. - -mkdir python-environments # Optional: Create folder. Please use Workspaces! 
- -module load modenv/ml # Changing the environment. Example output: The following have been reloaded with a version change: 1 modenv/scs5 => modenv/ml -ml av Python #Check the available modules with Python -module load Python #Load default Python. Example output: Module Python/3.7 4-GCCcore-8.3.0 with 7 dependencies loaded -which python #Check which python are you using -virtualenv --system-site-packages python-environments/envtest #Create virtual environment -source python-environments/envtest/bin/activate #Activate virtual environment. Example output: (envtest) bash-4.2$ -python #Start python - -from time import gmtime, strftime -print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) #Example output: 2019-11-18 13:54:16 -deactivate #Leave the virtual environment -``` - -The [virtualenv](https://virtualenv.pypa.io/en/latest/) Python module (Python 3) provides support -for creating virtual environments with their own sitedirectories, optionally isolated from system -site directories. Each virtual environment has its own Python binary (which matches the version of -the binary that was used to create this environment) and can have its own independent set of -installed Python packages in its site directories. This allows you to manage separate package -installations for different projects. It essentially allows us to create a virtual isolated Python -installation and install packages into that virtual installation. When you switch projects, you can -simply create a new virtual environment and not have to worry about breaking the packages installed -in other environments. - -In your virtual environment, you can use packages from the (Complete List of -Modules)(SoftwareModulesList) or if you didn't find what you need you can install required packages -with the command: `pip install`. With the command `pip freeze`, you can see a list of all installed -packages and their versions. - -This example shows how to start working with **Conda** and virtual -environment (with using module system) - -```Bash -srun -p ml -N 1 -n 1 -c 7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash # Job submission in ml nodes with 1 gpu on 1 node. - -module load modenv/ml -mkdir conda-virtual-environments #create a folder -cd conda-virtual-environments #go to folder -which python #check which python are you using -module load PythonAnaconda/3.6 #load Anaconda module -which python #check which python are you using now - -conda create -n conda-testenv python=3.6 #create virtual environment with the name conda-testenv and Python version 3.6 -conda activate conda-testenv #activate conda-testenv virtual environment - -conda deactivate #Leave the virtual environment -``` - -You can control where a conda environment -lives by providing a path to a target directory when creating the -environment. For example, the following command will create a new -environment in a workspace located in `scratch` - -```Bash -conda create --prefix /scratch/ws/<name_of_your_workspace>/conda-virtual-environment/<name_of_your_environment> -``` - -Please pay attention, -using srun directly on the shell will lead to blocking and launch an -interactive job. Apart from short test runs, it is **recommended to -launch your jobs into the background by using Slurm**. For that, you can conveniently put -the parameters directly into the job file which you can submit using -`sbatch [options] <job file>.` - -## Jupyter Notebooks - -Jupyter notebooks are a great way for interactive computing in your web -browser. 
Jupyter allows working with data cleaning and transformation, -numerical simulation, statistical modelling, data visualization and of -course with machine learning. - -There are two general options on how to work Jupyter notebooks using -HPC. - -On Taurus, there is [JupyterHub](../access/jupyterhub.md) where you can simply run your Jupyter -notebook on HPC nodes. Also, you can run a remote jupyter server within a sbatch GPU job and with -the modules and packages you need. The manual server setup you can find [here](deep_learning.md). - -With Jupyterhub you can work with general -data analytics tools. This is the recommended way to start working with -the Taurus. However, some special instruments could not be available on -the Jupyterhub. - -**Keep in mind that the remote Jupyter server can offer more freedom with settings and approaches.** - -## MPI for Python - -Message Passing Interface (MPI) is a standardized and portable -message-passing standard designed to function on a wide variety of -parallel computing architectures. The Message Passing Interface (MPI) is -a library specification that allows HPC to pass information between its -various nodes and clusters. MPI designed to provide access to advanced -parallel hardware for end-users, library writers and tool developers. - -### Why use MPI? - -MPI provides a powerful, efficient and portable way to express parallel -programs. -Among many parallel computational models, message-passing has proven to be an effective one. - -### Parallel Python with mpi4py - -Mpi4py(MPI for Python) package provides bindings of the MPI standard for -the python programming language, allowing any Python program to exploit -multiple processors. - -#### Why use mpi4py? - -Mpi4py based on MPI-2 C++ bindings. It supports almost all MPI calls. -This implementation is popular on Linux clusters and in the SciPy -community. Operations are primarily methods of communicator objects. It -supports communication of pickleable Python objects. Mpi4py provides -optimized communication of NumPy arrays. - -Mpi4py is included as an extension of the SciPy-bundle modules on -taurus. - -Please check the SoftwareModulesList for the modules availability. The availability of the mpi4py -in the module you can check by -the `module whatis <name_of_the module>` command. The `module whatis` -command displays a short information and included extensions of the -module. - -Moreover, it is possible to install mpi4py in your local conda -environment: - -```Bash -srun -p ml --time=04:00:00 -n 1 --pty --mem-per-cpu=8000 bash #allocate recources -module load modenv/ml -module load PythonAnaconda/3.6 #load module to use conda -conda create --prefix=<location_for_your_environment> python=3.6 anaconda #create conda virtual environment - -conda activate <location_for_your_environment> #activate your virtual environment - -conda install -c conda-forge mpi4py #install mpi4py - -python #start python - -from mpi4py import MPI #verify your mpi4py -comm = MPI.COMM_WORLD -print("%d of %d" % (comm.Get_rank(), comm.Get_size())) -``` - -### Horovod - -[Horovod](https://github.com/horovod/horovod) is the open source distributed training -framework for TensorFlow, Keras, PyTorch. It is supposed to make it easy -to develop distributed deep learning projects and speed them up with -TensorFlow. - -#### Why use Horovod? - -Horovod allows you to easily take a single-GPU TensorFlow and Pytorch -program and successfully train it on many GPUs! 
In -some cases, the MPI model is much more straightforward and requires far -less code changes than the distributed code from TensorFlow for -instance, with parameter servers. Horovod uses MPI and NCCL which gives -in some cases better results than pure TensorFlow and PyTorch. - -#### Horovod as a module - -Horovod is available as a module with **TensorFlow** or **PyTorch**for **all** module environments. -Please check the [software module list](modules.md) for the current version of the software. -Horovod can be loaded like other software on the Taurus: - -```Bash -ml av Horovod #Check available modules with Python -module load Horovod #Loading of the module -``` - -#### Horovod installation - -However, if it is necessary to use Horovod with **PyTorch** or use -another version of Horovod it is possible to install it manually. To -install Horovod you need to create a virtual environment and load the -dependencies (e.g. MPI). Installing PyTorch can take a few hours and is -not recommended - -**Note:** You could work with simple examples in your home directory but **please use workspaces -for your study and work projects** (see the Storage concept). - -Setup: - -```Bash -srun -N 1 --ntasks-per-node=6 -p ml --time=08:00:00 --pty bash #allocate a Slurm job allocation, which is a set of resources (nodes) -module load modenv/ml #Load dependencies by using modules -module load OpenMPI/3.1.4-gcccuda-2018b -module load Python/3.6.6-fosscuda-2018b -module load cuDNN/7.1.4.18-fosscuda-2018b -module load CMake/3.11.4-GCCcore-7.3.0 -virtualenv --system-site-packages <location_for_your_environment> #create virtual environment -source <location_for_your_environment>/bin/activate #activate virtual environment -``` - -Or when you need to use conda: - -```Bash -srun -N 1 --ntasks-per-node=6 -p ml --time=08:00:00 --pty bash #allocate a Slurm job allocation, which is a set of resources (nodes) -module load modenv/ml #Load dependencies by using modules -module load OpenMPI/3.1.4-gcccuda-2018b -module load PythonAnaconda/3.6 -module load cuDNN/7.1.4.18-fosscuda-2018b -module load CMake/3.11.4-GCCcore-7.3.0 - -conda create --prefix=<location_for_your_environment> python=3.6 anaconda #create virtual environment - -conda activate <location_for_your_environment> #activate virtual environment -``` - -Install Pytorch (not recommended) - -```Bash -cd /tmp -git clone https://github.com/pytorch/pytorch #clone Pytorch from the source -cd pytorch #go to folder -git checkout v1.7.1 #Checkout version (example: 1.7.1) -git submodule update --init #Update dependencies -python setup.py install #install it with python -``` - -##### Install Horovod for Pytorch with python and pip - -In the example presented installation for the Pytorch without -TensorFlow. Adapt as required and refer to the horovod documentation for -details. 
- -```Bash -HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod -``` - -##### Verify that Horovod works - -```Bash -python #start python -import torch #import pytorch -import horovod.torch as hvd #import horovod -hvd.init() #initialize horovod -hvd.size() -hvd.rank() -print('Hello from:', hvd.rank()) -``` - -##### Horovod with NCCL - -If you want to use NCCL instead of MPI you can specify that in the -install command after loading the NCCL module: - -```Bash -module load NCCL/2.3.7-fosscuda-2018b -HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod -``` diff --git a/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md b/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md new file mode 100644 index 0000000000000000000000000000000000000000..67b10817c738b414a3302388b5cca3392ff96bb1 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md @@ -0,0 +1,124 @@ +# Python Virtual Environments + +Virtual environments allow users to install additional Python packages and create an isolated +run-time environment. We recommend using `virtualenv` for this purpose. In your virtual environment, +you can use packages from the [modules list](modules.md) or if you didn't find what you need you can +install required packages with the command: `pip install`. With the command `pip freeze`, you can +see a list of all installed packages and their versions. + +There are two methods of how to work with virtual environments on ZIH systems: + +1. **virtualenv** is a standard Python tool to create isolated Python environments. + It is the preferred interface for + managing installations and virtual environments on ZIH system and part of the Python modules. + +2. **conda** is an alternative method for managing installations and +virtual environments on ZIH system. conda is an open-source package +management system and environment management system from Anaconda. The +conda manager is included in all versions of Anaconda and Miniconda. + +!!! warning + + Keep in mind that you **cannot** use virtualenv for working + with the virtual environments previously created with conda tool and + vice versa! Prefer virtualenv whenever possible. + +## Python Virtual Environment + +This example shows how to start working with **virtualenv** and Python virtual environment (using +the module system). + +!!! hint + + We recommend to use [workspaces](../data_lifecycle/workspaces.md) for your virtual + environments. + +At first, we check available Python modules and load the preferred version: + +```console +marie@compute$ module avail Python #Check the available modules with Python +[...] +marie@compute$ module load Python #Load default Python +Module Python/3.7 2-GCCcore-8.2.0 with 10 dependencies loaded +marie@compute$ which python #Check which python are you using +/sw/installed/Python/3.7.2-GCCcore-8.2.0/bin/python +``` + +Then create the virtual environment and activate it. + +```console +marie@compute$ ws_allocate -F scratch python_virtual_environment 1 +Info: creating workspace. +/scratch/ws/1/python_virtual_environment +[...] +marie@compute$ virtualenv --system-site-packages /scratch/ws/1/python_virtual_environment/env #Create virtual environment +[...] +marie@compute$ source /scratch/ws/1/python_virtual_environment/env/bin/activate #Activate virtual environment. 
Example output: (envtest) bash-4.2$
+```
+
+Now you can work in this isolated environment, without interfering with other tasks running on the
+system. Note that the prefix `(env)` at the beginning of each line indicates that you are inside
+the virtual environment. You can deactivate the environment as follows:
+
+```console
+(env) marie@compute$ deactivate    #Leave the virtual environment
+```
+
+## Conda Virtual Environment
+
+This example shows how to start working with **conda** and a virtual environment (using the module
+system). At first, we create a workspace that will hold the conda virtual environment:
+
+```console
+marie@compute$ ws_allocate -F scratch conda_virtual_environment 1
+Info: creating workspace.
+/scratch/ws/1/conda_virtual_environment
+[...]
+```
+
+Then, we load Anaconda, create an environment in our workspace and activate the environment:
+
+```console
+marie@compute$ module load Anaconda3    #load Anaconda module
+marie@compute$ conda create --prefix /scratch/ws/1/conda_virtual_environment/conda-env python=3.6    #create virtual environment with Python version 3.6
+marie@compute$ conda activate /scratch/ws/1/conda_virtual_environment/conda-env    #activate conda-env virtual environment
+```
+
+Now you can work in this isolated environment, without interfering with other tasks running on the
+system. Note that the prefix `(conda-env)` at the beginning of each line indicates that you are
+inside the virtual environment. You can deactivate the conda environment as follows:
+
+```console
+(conda-env) marie@compute$ conda deactivate    #Leave the virtual environment
+```
+
+??? example
+
+    This is an example on the partition `alpha`. The example creates a virtual environment and
+    installs the package `torchvision` with pip.
+    ```console
+    marie@login$ srun --partition=alpha-interactive --nodes=1 --gres=gpu:1 --time=01:00:00 --pty bash
+    marie@alpha$ mkdir python-environments                               # please use workspaces
+    marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch
+    Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies loaded.
+    marie@alpha$ which python
+    /sw/installed/Python/3.8.6-GCCcore-10.2.0/bin/python
+    marie@alpha$ pip list
+    [...]
+    marie@alpha$ virtualenv --system-site-packages python-environments/my-torch-env
+    created virtual environment CPython3.8.6.final.0-64 in 42960ms
+      creator CPython3Posix(dest=~/python-environments/my-torch-env, clear=False, global=True)
+      seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=~/.local/share/virtualenv)
+        added seed packages: pip==21.1.3, setuptools==57.2.0, wheel==0.36.2
+      activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
+    marie@alpha$ source python-environments/my-torch-env/bin/activate
+    (my-torch-env) marie@alpha$ pip install torchvision
+    [...]
+    Installing collected packages: torchvision
+    Successfully installed torchvision-0.10.0
+    [...]
+ (my-torch-env) marie@alpha$ python -c "import torchvision; print(torchvision.__version__)" + 0.10.0+cu102 + (my-torch-env) marie@alpha$ deactivate + ``` diff --git a/doc.zih.tu-dresden.de/docs/software/pytorch.md b/doc.zih.tu-dresden.de/docs/software/pytorch.md index cd476d7296e271e6f7eecf3a84b4af1f80c4ee84..e84f3aac54a88e0984b0da17e3e3527fe37e7b46 100644 --- a/doc.zih.tu-dresden.de/docs/software/pytorch.md +++ b/doc.zih.tu-dresden.de/docs/software/pytorch.md @@ -1,260 +1,165 @@ -# Pytorch for Data Analytics +# PyTorch [PyTorch](https://pytorch.org/) is an open-source machine learning framework. It is an optimized tensor library for deep learning using GPUs and CPUs. -PyTorch is a machine learning tool developed by Facebooks AI division -to process large-scale object detection, segmentation, classification, etc. -PyTorch provides a core datastructure, the tensor, a multi-dimensional array that shares many -similarities with Numpy arrays. -PyTorch also consumed Caffe2 for its backend and added support of ONNX. - -**Prerequisites:** To work with PyTorch you obviously need [access](../access/ssh_login.md) for the -Taurus system and basic knowledge about Python, Numpy and SLURM system. - -**Aim** of this page is to introduce users on how to start working with PyTorch on the -[HPC-DA](../jobs_and_resources/power9.md) system - part of the TU Dresden HPC system. - -There are numerous different possibilities of how to work with PyTorch on Taurus. -Here we will consider two main methods. +PyTorch is a machine learning tool developed by Facebook's AI division to process large-scale +object detection, segmentation, classification, etc. +PyTorch provides a core data structure, the tensor, a multi-dimensional array that shares many +similarities with NumPy arrays. -1\. The first option is using Jupyter notebook with HPC-DA nodes. The easiest way is by using -[Jupyterhub](../access/jupyterhub.md). It is a recommended way for beginners in PyTorch and users -who are just starting their work with Taurus. - -2\. The second way is using the Modules system and Python or conda virtual environment. -See [the Python page](python.md) for the HPC-DA system. - -Note: The information on working with the PyTorch using Containers could be found -[here](containers.md). - -## Get started with PyTorch +Please check the software modules list via -### Virtual environment +```console +marie@login$ module spider pytorch +``` -For working with PyTorch and python packages using virtual environments (kernels) is necessary. +to find out, which PyTorch modules are available. -Creating and using your kernel (environment) has the benefit that you can install your preferred -python packages and use them in your notebooks. +We recommend using partitions `alpha` and/or `ml` when working with machine learning workflows +and the PyTorch library. +You can find detailed hardware specification in our +[hardware documentation](../jobs_and_resources/hardware_overview.md). -A virtual environment is a cooperatively isolated runtime environment that allows Python users and -applications to install and upgrade Python distribution packages without interfering with -the behaviour of other Python applications running on the same system. So the -[Virtual environment](https://docs.python.org/3/glossary.html#term-virtual-environment) -is a self-contained directory tree that contains a Python installation for a particular version of -Python, plus several additional packages. 
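+Once a PyTorch module or virtual environment is loaded (as described below), this similarity can
+be illustrated with a small sketch:
+
+```console
+marie@compute$ python -c "import torch; t = torch.arange(6).reshape(2, 3); print(t); print(t.numpy())"
+```
+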
At its core, the main purpose of -Python virtual environments is to create an isolated environment for Python projects. -Python virtual environment is the main method to work with Deep Learning software as PyTorch on the -HPC-DA system. +## PyTorch Console -### Conda and Virtualenv +On the partition `alpha`, load the module environment: -There are two methods of how to work with virtual environments on -Taurus: +```console +# Job submission on alpha nodes with 1 gpu on 1 node with 800 Mb per CPU +marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash +marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0 +Die folgenden Module wurden in einer anderen Version erneut geladen: + 1) modenv/scs5 => modenv/hiera -1.**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. -In general, It is the preferred interface for managing installations and virtual environments -on Taurus. -It has been integrated into the standard library under the -[venv module](https://docs.python.org/3/library/venv.html). -We recommend using **venv** to work with Python packages and Tensorflow on Taurus. +Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies loaded. +``` -2\. The **conda** command is the interface for managing installations and virtual environments on -Taurus. -The **conda** is a tool for managing and deploying applications, environments and packages. -Conda is an open-source package management system and environment management system from Anaconda. -The conda manager is included in all versions of Anaconda and Miniconda. -**Important note!** Due to the use of Anaconda to create PyTorch modules for the ml partition, -it is recommended to use the conda environment for working with the PyTorch to avoid conflicts over -the sources of your packages (pip or conda). +??? hint "Torchvision on partition `alpha`" -**Note:** Keep in mind that you **cannot** use conda for working with the virtual environments -previously created with Vitualenv tool and vice versa + On the partition `alpha`, the module torchvision is not yet available within the module + system. (19.08.2021) + Torchvision can be made available by using a virtual environment: -This example shows how to install and start working with PyTorch (with -using module system) + ```console + marie@alpha$ virtualenv --system-site-packages python-environments/torchvision_env + marie@alpha$ source python-environments/torchvision_env/bin/activate + marie@alpha$ pip install torchvision --no-deps + ``` - srun -p ml -N 1 -n 1 -c 2 --gres=gpu:1 --time=01:00:00 --pty --mem-per-cpu=5772 bash #Job submission in ml nodes with 1 gpu on 1 node with 2 CPU and with 5772 mb for each cpu. - module load modenv/ml #Changing the environment. Example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - mkdir python-virtual-environments #Create folder - cd python-virtual-environments #Go to folder - module load PythonAnaconda/3.6 #Load Anaconda with Python. Example output: Module Module PythonAnaconda/3.6 loaded. - which python #Check which python are you using - python3 -m venv --system-site-packages envtest #Create virtual environment - source envtest/bin/activate #Activate virtual environment. Example output: (envtest) bash-4.2$ - module load PyTorch #Load PyTorch module. Example output: Module PyTorch/1.1.0-PythonAnaconda-3.6 loaded. 
- python #Start python - import torch - torch.version.__version__ #Example output: 1.1.0 + Using the **--no-deps** option for "pip install" is necessary here as otherwise the PyTorch + version might be replaced and you will run into trouble with the CUDA drivers. -Keep in mind that using **srun** directly on the shell will lead to blocking and launch an -interactive job. Apart from short test runs, -it is **recommended to launch your jobs into the background by using batch jobs**. -For that, you can conveniently put the parameters directly into the job file -which you can submit using *sbatch [options] <job_file_name>*. +On the partition `ml`: -## Running the model and examples +```console +# Job submission in ml nodes with 1 gpu on 1 node with 800 Mb per CPU +marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash +``` -Below are examples of Jupyter notebooks with PyTorch models which you can run on ml nodes of HPC-DA. +After calling -There are two ways how to work with the Jupyter notebook on HPC-DA system. You can use a -[remote Jupyter server](deep_learning.md) or [JupyterHub](../access/jupyterhub.md). -Jupyterhub is a simple and recommended way to use PyTorch. -We are using Jupyterhub for our examples. +```console +marie@login$ module spider pytorch +``` -Prepared examples of PyTorch models give you an understanding of how to work with -Jupyterhub and PyTorch models. It can be useful and instructive to start -your acquaintance with PyTorch and HPC-DA system from these simple examples. +we know that we can load PyTorch (including torchvision) with -JupyterHub is available here: [taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter) +```console +marie@ml$ module load modenv/ml torchvision/0.7.0-fossCUDA-2019b-Python-3.7.4-PyTorch-1.6.0 +Module torchvision/0.7.0-fossCUDA-2019b-Python-3.7.4-PyTorch-1.6.0 and 55 dependencies loaded. +``` -After login, you can start a new session by clicking on the button. +Now, we check that we can access PyTorch: -**Note:** Detailed guide (with pictures and instructions) how to run the Jupyterhub -you could find on [the page](../access/jupyterhub.md). +```console +marie@{ml,alpha}$ python -c "import torch; print(torch.__version__)" +``` -Please choose the "IBM Power (ppc64le)". You need to download an example -(prepared as jupyter notebook file) that already contains all you need for the start of the work. -Please put the file into your previously created virtual environment in your working directory or -use the kernel for your notebook [see Jupyterhub page](../access/jupyterhub.md). +The following example shows how to create a python virtual environment and import PyTorch. -Note: You could work with simple examples in your home directory but according to -[HPCStorageConcept2019](../data_lifecycle/overview.md) please use **workspaces** -for your study and work projects. -For this reason, you have to use advanced options of Jupyterhub and put "/" in "Workspace scope" field. +```console +# Create folder +marie@ml$ mkdir python-environments +# Check which python are you using +marie@ml$ which python +/sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python +# Create virtual environment "env" which inheriting with global site packages +marie@ml$ virtualenv --system-site-packages python-environments/env +[...] +# Activate virtual environment "env". 
Example output: (env) bash-4.2$
+marie@ml$ source python-environments/env/bin/activate
+marie@ml$ python -c "import torch; print(torch.__version__)"
+```

-To download the first example (from the list below) into your previously created
-virtual environment you could use the following command:
+## PyTorch in JupyterHub

-    ws_list                      #list of your workspaces
-    cd <name_of_your_workspace>  #go to workspace
+In addition to using interactive and batch jobs, it is possible to work with PyTorch using
+JupyterHub. The production and test environments of JupyterHub contain Python kernels that come
+with PyTorch support.

-    wget https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/PyTorch/example_MNIST_Pytorch.zip
-    unzip example_MNIST_Pytorch.zip
+
+{: align="center"}

-Also, you could use kernels for all notebooks, not only for them which
-placed in your virtual environment. See the [jupyterhub](../access/jupyterhub.md) page.
+## Distributed PyTorch

-Examples:
+For details on how to run PyTorch with multiple GPUs and/or multiple nodes, see
+[distributed training](distributed_training.md).

-1\. Simple MNIST model. The MNIST database is a large database of handwritten digits that is
-commonly used for training various image processing systems. PyTorch allows us to import and
-download the MNIST dataset directly from the Torchvision - package consists of datasets,
-model architectures and transformations.
-The model contains a neural network with sequential architecture and typical modules
-for this kind of models. Recommended parameters for running this model are 1 GPU and 7 cores (28 thread)
+## Migrate PyTorch-script from CPU to GPU

-(example_MNIST_Pytorch.zip)
+It is recommended to use GPUs when using large training data sets. While TensorFlow automatically
+uses GPUs if they are available, in PyTorch you have to move your tensors manually.

-### Running the model
+First, you need to import `torch`, which also provides the `torch.cuda` module:

-Open [JupyterHub](../access/jupyterhub.md) and follow instructions above.
+```python3
+import torch
+```

-In Jupyterhub documents are organized with tabs and a very versatile split-screen feature.
-On the left side of the screen, you can open your file. Use 'File-Open from Path'
-to go to your workspace (e.g. `scratch/ws/<username-name_of_your_ws>`).
-You could run each cell separately step by step and analyze the result of each step.
-Default command for running one cell Shift+Enter'. Also, you could run all cells with the command '
-run all cells' in the 'Run' Tab.
+Then you define a `device` variable, which is set to 'cuda' automatically when a GPU is available,
+with this code:

-## Components and advantages of the PyTorch
+```python3
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+```

-### Pre-trained networks
+You then have to move all of your tensors to the selected device. This looks like this:

-The PyTorch gives you an opportunity to use pre-trained models and networks for your purposes
-(as a TensorFlow for instance) especially for computer vision and image recognition. As you know
-computer vision is one of the fields that have been most impacted by the advent of deep learning.
+```python3 +x_train = torch.FloatTensor(x_train).to(device) +y_train = torch.FloatTensor(y_train).to(device) +``` -We will use a network trained on ImageNet, taken from the TorchVision project, -which contains a few of the best performing neural network architectures for computer vision, -such as AlexNet, one of the early breakthrough networks for image recognition, and ResNet, -which won the ImageNet classification, detection, and localization competitions, in 2015. -[TorchVision](https://pytorch.org/vision/stable/index.html) also has easy access to datasets like -ImageNet and other utilities for getting up -to speed with computer vision applications in PyTorch. -The pre-defined models can be found in torchvision.models. +Remember that this does not break backward compatibility when you port the script back to a computer +without GPU, because without GPU, `device` is set to 'cpu'. -**Important note**: For the ml nodes only the Torchvision 0.2.2. is available (10.11.20). -The last updates from IBM include only Torchvision 0.4.1 CPU version. -Be careful some features from modern versions of Torchvision are not available in the 0.2.2 -(e.g. some kinds of `transforms`). Always check the version with: `print(torchvision.__version__)` +### Caveats -Examples: +#### Moving Data Back to the CPU-Memory -1. Image recognition example. This PyTorch script is using Resnet to single image classification. -Recommended parameters for running this model are 1 GPU and 7 cores (28 thread). +The CPU cannot directly access variables stored on the GPU. If you want to use the variables, e.g., +in a `print` statement or when editing with NumPy or anything that is not PyTorch, you have to move +them back to the CPU-memory again. This then may look like this: -(example_Pytorch_image_recognition.zip) +```python3 +cpu_x_train = x_train.cpu() +print(cpu_x_train) +... +error_train = np.sqrt(metrics.mean_squared_error(y_train[:,1].cpu(), y_prediction_train[:,1])) +``` -Remember that for using [JupyterHub service](../access/jupyterhub.md) -for PyTorch you need to create and activate -a virtual environment (kernel) with loaded essential modules (see "envtest" environment form the virtual -environment example. +Remember that, without `.detach()` before the CPU, if you change `cpu_x_train`, `x_train` will also +be changed. If you want to treat them independently, use -Run the example in the same way as the previous example (MNIST model). +```python3 +cpu_x_train = x_train.detach().cpu() +``` -### Using Multiple GPUs with PyTorch +Now you can change `cpu_x_train` without `x_train` being affected. -Effective use of GPUs is essential, and it implies using parallelism in -your code and model. Data Parallelism and model parallelism are effective instruments -to improve the performance of your code in case of GPU using. +#### Speed Improvements and Batch Size -The data parallelism is a widely-used technique. It replicates the same model to all GPUs, -where each GPU consumes a different partition of the input data. You could see this method [here](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html). - -The example below shows how to solve that problem by using model -parallel, which, in contrast to data parallelism, splits a single model -onto different GPUs, rather than replicating the entire model on each -GPU. The high-level idea of model parallel is to place different sub-networks of a model onto different -devices. 
As the only part of a model operates on any individual device, a set of devices can -collectively serve a larger model. - -It is recommended to use [DistributedDataParallel] -(https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), -instead of this class, to do multi-GPU training, even if there is only a single node. -See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. -Check the [page](https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead) and -[Distributed Data Parallel](https://pytorch.org/docs/stable/notes/ddp.html#ddp). - -Examples: - -1\. The parallel model. The main aim of this model to show the way how -to effectively implement your neural network on several GPUs. It -includes a comparison of different kinds of models and tips to improve -the performance of your model. **Necessary** parameters for running this -model are **2 GPU** and 14 cores (56 thread). - -(example_PyTorch_parallel.zip) - -Remember that for using [JupyterHub service](../access/jupyterhub.md) -for PyTorch you need to create and activate -a virtual environment (kernel) with loaded essential modules. - -Run the example in the same way as the previous examples. - -#### Distributed data-parallel - -[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) -(DDP) implements data parallelism at the module level which can run across multiple machines. -Applications using DDP should spawn multiple processes and create a single DDP instance per process. -DDP uses collective communications in the [torch.distributed] -(https://pytorch.org/tutorials/intermediate/dist_tuto.html) -package to synchronize gradients and buffers. - -The tutorial could be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). - -To use distributed data parallelisation on Taurus please use following -parameters: `--ntasks-per-node` -parameter to the number of GPUs you use -per node. Also, it could be useful to increase `memomy/cpu` parameters -if you run larger models. Memory can be set up to: - ---mem=250000 and --cpus-per-task=7 for the **ml** partition. - ---mem=60000 and --cpus-per-task=6 for the **gpu2** partition. - -Keep in mind that only one memory parameter (`--mem-per-cpu` = <MB> or `--mem`=<MB>) can be specified - -## F.A.Q - -- (example_MNIST_Pytorch.zip) -- (example_Pytorch_image_recognition.zip) -- (example_PyTorch_parallel.zip) +When you have a lot of very small data points, the speed may actually decrease when you try to train +them on the GPU. This is because moving data from the CPU-memory to the GPU-memory takes time. If +this occurs, please try using a very large batch size. This way, copying back and forth only takes +places a few times and the bottleneck may be reduced. diff --git a/doc.zih.tu-dresden.de/docs/software/runtime_environment.md b/doc.zih.tu-dresden.de/docs/software/runtime_environment.md deleted file mode 100644 index 1bca8daa7cfa08f3b58b19e5608c2e333b9055f9..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/runtime_environment.md +++ /dev/null @@ -1,181 +0,0 @@ -# Runtime Environment - -Make sure you know how to work with a Linux system. Documentations and tutorials can be easily -found on the internet or in your library. - -## Modules - -To allow the user to switch between different versions of installed programs and libraries we use a -*module concept*. 
A module is a user interface that provides utilities for the dynamic modification -of a user's environment, i.e., users do not have to manually modify their environment variables ( -`PATH` , `LD_LIBRARY_PATH`, ...) to access the compilers, loader, libraries, and utilities. - -For all applications, tools, libraries etc. the correct environment can be easily set by e.g. -`module load Mathematica`. If several versions are installed they can be chosen like `module load -MATLAB/2019b`. A list of all modules shows `module avail`. Other important commands are: - -| Command | Description | -|:------------------------------|:-----------------------------------------------------------------| -| `module help` | show all module options | -| `module list` | list all user-installed modules | -| `module purge` | remove all user-installed modules | -| `module avail` | list all available modules | -| `module spider` | search for modules across all environments, can take a parameter | -| `module load <modname>` | load module `modname` | -| `module unload <modname>` | unloads module `modname` | -| `module switch <mod1> <mod2>` | unload module `mod1` ; load module `mod2` | - -Module files are ordered by their topic on our HPC systems. By default, with `module av` you will -see all available module files and topics. If you just wish to see the installed versions of a -certain module, you can use `module av softwarename` and it will display the available versions of -`softwarename` only. - -### Lmod: An Alternative Module Implementation - -Historically, the module command on our HPC systems has been provided by the rather dated -*Environment Modules* software which was first introduced in 1991. As of late 2016, we also offer -the new and improved [LMOD](https://www.tacc.utexas.edu/research-development/tacc-projects/lmod) as -an alternative. It has a handful of advantages over the old Modules implementation: - -- all modulefiles are cached, which especially speeds up tab - completion with bash -- sane version ordering (9.0 \< 10.0) -- advanced version requirement functions (atleast, between, latest) -- auto-swapping of modules (if a different version was already loaded) -- save/auto-restore of loaded module sets (module save) -- multiple language support -- properties, hooks, ... -- depends_on() function for automatic dependency resolution with - reference counting - -### Module Environments - -On Taurus, there exist different module environments, each containing a set of software modules. -They are activated via the meta module **modenv** which has different versions, one of which is -loaded by default. You can switch between them by simply loading the desired modenv-version, e.g.: - -```Bash -module load modenv/ml -``` - -| | | | -|--------------|------------------------------------------------------------------------|---------| -| modenv/scs5 | SCS5 software | default | -| modenv/ml | HPC-DA software (for use on the "ml" partition) | | -| modenv/hiera | Hierarchical module tree (for use on the "romeo" and "gpu3" partition) | | - -The old modules (pre-SCS5) are still available after loading **modenv**/**classic**, however, due to -changes in the libraries of the operating system, it is not guaranteed that they still work under -SCS5. Please don't use modenv/classic if you do not absolutely have to. Most software is available -under modenv/scs5, too, just be aware of the possibly different spelling (case-sensitivity). 
- -You can use `module spider \<modname>` to search for a specific -software in all modenv environments. It will also display information on -how to load a found module when giving a precise module (with version) -as the parameter. - -Also see the information under [SCS5 software](../software/scs5_software.md). - -### Per-Architecture Builds - -Since we have a heterogenous cluster, we do individual builds of some of the software for each -architecture present. This ensures that, no matter what partition the software runs on, a build -optimized for the host architecture is used automatically. This is achieved by having -`/sw/installed` symlinked to different directories on the compute nodes. - -However, not every module will be available for each node type or partition. Especially when -introducing new hardware to the cluster, we do not want to rebuild all of the older module versions -and in some cases cannot fall-back to a more generic build either. That's why we provide the script: -`ml_arch_avail` that displays the availability of modules for the different node architectures. - -E.g.: - -```Bash -$ ml_arch_avail CP2K -CP2K/6.1-foss-2019a: haswell, rome -CP2K/5.1-intel-2018a: sandy, haswell -CP2K/6.1-foss-2019a-spglib: haswell, rome -CP2K/6.1-intel-2018a: sandy, haswell -CP2K/6.1-intel-2018a-spglib: haswell -``` - -shows all modules that match on CP2K, and their respective availability. Note that this will not -work for meta-modules that do not have an installation directory (like some toolchain modules). - -### Private User Module Files - -Private module files allow you to load your own installed software into your environment and to -handle different versions without getting into conflicts. - -You only have to call `module use <path to your module files>`, which adds your directory to the -list of module directories that are searched by the `module` command. Within the privatemodules -directory you can add directories for each software you wish to install and add - also in this -directory - a module file for each version you have installed. Further information about modules can -be found at <https://lmod.readthedocs.io> . - -**todo** quite old - -This is an example of a private module file: - -```Bash -dolescha@venus:~/module use $HOME/privatemodules - -dolescha@venus:~/privatemodules> ls -null testsoftware - -dolescha@venus:~/privatemodules/testsoftware> ls -1.0 - -dolescha@venus:~> module av -------------------------------- /work/home0/dolescha/privatemodules --------------------------- -null testsoftware/1.0 - -dolescha@venus:~> module load testsoftware -Load testsoftware version 1.0 - -dolescha@venus:~/privatemodules/testsoftware> cat 1.0 -#%Module###################################################################### -## -## testsoftware modulefile -## -proc ModulesHelp { } { - puts stderr "Loads testsoftware" -} - -set version 1.0 -set arch x86_64 -set path /home/<user>/opt/testsoftware/$version/$arch/ - -prepend-path PATH $path/bin -prepend-path LD_LIBRARY_PATH $path/lib - -if [ module-info mode load ] { - puts stderr "Load testsoftware version $version" -} -``` - -### Private Project Module Files - -Private module files allow you to load your group-wide installed software into your environment and -to handle different versions without getting into conflicts. - -The module files have to be stored in your global projects directory, e.g. -`/projects/p_projectname/privatemodules`. An example for a module file can be found in the section -above. 
- -To use a project-wide module file you have to add the path to the module file to the module -environment with following command `module use /projects/p_projectname/privatemodules`. - -After that, the modules are available in your module environment and you -can load the modules with `module load` . - -## Misc - -An automated [backup](../data_lifecycle/file_systems.md#backup-and-snapshots-of-the-file-system) -system provides security for the HOME-directories on `Taurus` and `Venus` on a daily basis. This is -the reason why we urge our users to store (large) temporary data (like checkpoint files) on the -/scratch -Filesystem or at local scratch disks. - -`Please note`: We have set `ulimit -c 0` as a default to prevent you from filling the disk with the -dump of a crashed program. `bash` -users can use `ulimit -Sc unlimited` to enable the debugging via -analyzing the core file (limit coredumpsize unlimited for tcsh). diff --git a/doc.zih.tu-dresden.de/docs/software/scorep.md b/doc.zih.tu-dresden.de/docs/software/scorep.md index eeea99ad110477282ec3897d69d65e800885cda8..0e2dc6c2358c95f47373a2f046f3fe4d643ae643 100644 --- a/doc.zih.tu-dresden.de/docs/software/scorep.md +++ b/doc.zih.tu-dresden.de/docs/software/scorep.md @@ -144,7 +144,7 @@ After the application run, you will find an experiment directory in your current which contains all recorded data. In general, you can record a profile and/or a event trace. Whether a profile and/or a trace is recorded, is specified by the environment variables `SCOREP_ENABLE_PROFILING` and `SCOREP_ENABLE_TRACING` (see -[documentation](https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/latest/html/measurement.html)). +[official Score-P documentation](https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/latest/html/measurement.html)). If the value of this variables is zero or false, profiling/tracing is disabled. Otherwise Score-P will record a profile and/or trace. By default, profiling is enabled and tracing is disabled. For more information please see the list of Score-P measurement diff --git a/doc.zih.tu-dresden.de/docs/software/scs5_software.md b/doc.zih.tu-dresden.de/docs/software/scs5_software.md index f1606236729c0354e5129b71c5c93c14325cb097..b5a1bef60d20cdc9989c8db82f766d31a96d3cdc 100644 --- a/doc.zih.tu-dresden.de/docs/software/scs5_software.md +++ b/doc.zih.tu-dresden.de/docs/software/scs5_software.md @@ -21,7 +21,7 @@ remove it and accept the new one after comparing its fingerprint with those list ## Using Software Modules Starting with SCS5, we only provide -[Lmod](../software/runtime_environment.md#lmod-an-alternative-module-implementation) as the +[Lmod](../software/modules.md#lmod-an-alternative-module-implementation) as the environment module tool of choice. 
As usual, you can get a list of the available software modules via: @@ -38,7 +38,7 @@ There is a special module that is always loaded (sticky) called | | | | |----------------|-------------------------------------------------|---------| | modenv/scs5 | SCS5 software | default | -| modenv/ml | HPC-DA software (for use on the "ml" partition) | | +| modenv/ml | software for data analytics (partition ml) | | | modenv/classic | Manually built pre-SCS5 (AE4.0) software | hidden | The old modules (pre-SCS5) are still available after loading the diff --git a/doc.zih.tu-dresden.de/docs/software/singularity_example_definitions.md b/doc.zih.tu-dresden.de/docs/software/singularity_example_definitions.md deleted file mode 100644 index 28fe94a9d510e577148d7d0c2f526136e813d4ba..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/singularity_example_definitions.md +++ /dev/null @@ -1,110 +0,0 @@ -# Singularity Example Definitions - -## Basic example - -A usual workflow to create Singularity Definition consists of the -following steps: - -- Start from base image -- Install dependencies - - Package manager - - Other sources -- Build & Install own binaries -- Provide entrypoints & metadata - -An example doing all this: - -```Bash -Bootstrap: docker -From: alpine - -%post - . /.singularity.d/env/10-docker*.sh - - apk add g++ gcc make wget cmake - - wget https://github.com/fmtlib/fmt/archive/5.3.0.tar.gz - tar -xf 5.3.0.tar.gz - mkdir build && cd build - cmake ../fmt-5.3.0 -DFMT_TEST=OFF - make -j$(nproc) install - cd .. - rm -r fmt-5.3.0* - - cat hello.cpp -#include <fmt/format.h> - -int main(int argc, char** argv){ - if(argc == 1) fmt::print("No arguments passed!\n"); - else fmt::print("Hello {}!\n", argv[1]); -} -EOF - - g++ hello.cpp -o hello -lfmt - mv hello /usr/bin/hello - -%runscript - hello "$@" - -%labels - Author Alexander Grund - Version 1.0.0 - -%help - Display a greeting using the fmt library - - Usage: - ./hello -``` - -## CUDA + CuDNN + OpenMPI - -- Chosen CUDA version depends on installed driver of host -- OpenMPI needs PMI for SLURM integration -- OpenMPI needs CUDA for GPU copy-support -- OpenMPI needs ibverbs libs for Infiniband -- openmpi-mca-params.conf required to avoid warnings on fork (OK on - taurus) -- Environment variables SLURM_VERSION, OPENMPI_VERSION can be set to - choose different version when building the container - -``` -Bootstrap: docker -From: nvidia/cuda-ppc64le:10.1-cudnn7-devel-ubuntu18.04 - -%labels - Author ZIH - Requires CUDA driver 418.39+. - -%post - . /.singularity.d/env/10-docker*.sh - - apt-get update - apt-get install -y cuda-compat-10.1 - apt-get install -y libibverbs-dev ibverbs-utils - # Install basic development tools - apt-get install -y gcc g++ make wget python - apt-get autoremove; apt-get clean - - cd /tmp - - : ${SLURM_VERSION:=17-02-11-1} - wget https://github.com/SchedMD/slurm/archive/slurm-${SLURM_VERSION}.tar.gz - tar -xf slurm-${SLURM_VERSION}.tar.gz - cd slurm-slurm-${SLURM_VERSION} - ./configure --prefix=/usr/ --sysconfdir=/etc/slurm --localstatedir=/var --disable-debug - make -C contribs/pmi2 -j$(nproc) install - cd .. 
- rm -rf slurm-* - - : ${OPENMPI_VERSION:=3.1.4} - wget https://download.open-mpi.org/release/open-mpi/v${OPENMPI_VERSION%.*}/openmpi-${OPENMPI_VERSION}.tar.gz - tar -xf openmpi-${OPENMPI_VERSION}.tar.gz - cd openmpi-${OPENMPI_VERSION}/ - ./configure --prefix=/usr/ --with-pmi --with-verbs --with-cuda - make -j$(nproc) install - echo "mpi_warn_on_fork = 0" >> /usr/etc/openmpi-mca-params.conf - echo "btl_openib_warn_default_gid_prefix = 0" >> /usr/etc/openmpi-mca-params.conf - cd .. - rm -rf openmpi-* -``` diff --git a/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md b/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md index 5e4388fcf95ed06370d7d633544ee685113df1a7..b8304b57de0f1ae5da98341c92f6d9067b838ecd 100644 --- a/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md +++ b/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md @@ -1,6 +1,117 @@ -# Singularity Recipe Hints +# Singularity Recipes and Hints -## GUI (X11) applications +## Example Definitions + +### Basic Example + +A usual workflow to create Singularity Definition consists of the following steps: + +* Start from base image +* Install dependencies + * Package manager + * Other sources +* Build and install own binaries +* Provide entry points and metadata + +An example doing all this: + +```bash +Bootstrap: docker +From: alpine + +%post + . /.singularity.d/env/10-docker*.sh + + apk add g++ gcc make wget cmake + + wget https://github.com/fmtlib/fmt/archive/5.3.0.tar.gz + tar -xf 5.3.0.tar.gz + mkdir build && cd build + cmake ../fmt-5.3.0 -DFMT_TEST=OFF + make -j$(nproc) install + cd .. + rm -r fmt-5.3.0* + + cat hello.cpp +#include <fmt/format.h> + +int main(int argc, char** argv){ + if(argc == 1) fmt::print("No arguments passed!\n"); + else fmt::print("Hello {}!\n", argv[1]); +} +EOF + + g++ hello.cpp -o hello -lfmt + mv hello /usr/bin/hello + +%runscript + hello "$@" + +%labels + Author Alexander Grund + Version 1.0.0 + +%help + Display a greeting using the fmt library + + Usage: + ./hello +``` + +### CUDA + CuDNN + OpenMPI + +* Chosen CUDA version depends on installed driver of host +* OpenMPI needs PMI for Slurm integration +* OpenMPI needs CUDA for GPU copy-support +* OpenMPI needs `ibverbs` library for Infiniband +* `openmpi-mca-params.conf` required to avoid warnings on fork (OK on ZIH systems) +* Environment variables `SLURM_VERSION` and `OPENMPI_VERSION` can be set to choose different + version when building the container + +```bash +Bootstrap: docker +From: nvidia/cuda-ppc64le:10.1-cudnn7-devel-ubuntu18.04 + +%labels + Author ZIH + Requires CUDA driver 418.39+. + +%post + . /.singularity.d/env/10-docker*.sh + + apt-get update + apt-get install -y cuda-compat-10.1 + apt-get install -y libibverbs-dev ibverbs-utils + # Install basic development tools + apt-get install -y gcc g++ make wget python + apt-get autoremove; apt-get clean + + cd /tmp + + : ${SLURM_VERSION:=17-02-11-1} + wget https://github.com/SchedMD/slurm/archive/slurm-${SLURM_VERSION}.tar.gz + tar -xf slurm-${SLURM_VERSION}.tar.gz + cd slurm-slurm-${SLURM_VERSION} + ./configure --prefix=/usr/ --sysconfdir=/etc/slurm --localstatedir=/var --disable-debug + make -C contribs/pmi2 -j$(nproc) install + cd .. 
+ rm -rf slurm-* + + : ${OPENMPI_VERSION:=3.1.4} + wget https://download.open-mpi.org/release/open-mpi/v${OPENMPI_VERSION%.*}/openmpi-${OPENMPI_VERSION}.tar.gz + tar -xf openmpi-${OPENMPI_VERSION}.tar.gz + cd openmpi-${OPENMPI_VERSION}/ + ./configure --prefix=/usr/ --with-pmi --with-verbs --with-cuda + make -j$(nproc) install + echo "mpi_warn_on_fork = 0" >> /usr/etc/openmpi-mca-params.conf + echo "btl_openib_warn_default_gid_prefix = 0" >> /usr/etc/openmpi-mca-params.conf + cd .. + rm -rf openmpi-* +``` + +## Hints + +### GUI (X11) Applications Running GUI applications inside a singularity container is possible out of the box. Check the following definition: @@ -15,25 +126,25 @@ yum install -y xeyes This image may be run with -```Bash +```console singularity exec xeyes.sif xeyes. ``` -This works because all the magic is done by singularity already like setting $DISPLAY to the outside -display and mounting $HOME so $HOME/.Xauthority (X11 authentication cookie) is found. When you are -using \`--contain\` or \`--no-home\` you have to set that cookie yourself or mount/copy it inside -the container. Similar for \`--cleanenv\` you have to set $DISPLAY e.g. via +This works because all the magic is done by Singularity already like setting `$DISPLAY` to the outside +display and mounting `$HOME` so `$HOME/.Xauthority` (X11 authentication cookie) is found. When you are +using `--contain` or `--no-home` you have to set that cookie yourself or mount/copy it inside +the container. Similar for `--cleanenv` you have to set `$DISPLAY`, e.g., via -```Bash +```console export SINGULARITY_DISPLAY=$DISPLAY ``` -When you run a container as root (via \`sudo\`) you may need to allow root for your local display +When you run a container as root (via `sudo`) you may need to allow root for your local display port: `xhost +local:root\` -### Hardware acceleration +### Hardware Acceleration -If you want hardware acceleration you **may** need [VirtualGL](https://virtualgl.org). An example +If you want hardware acceleration, you **may** need [VirtualGL](https://virtualgl.org). An example definition file is as follows: ```Bash @@ -55,25 +166,28 @@ rm VirtualGL-*.rpm yum install -y mesa-dri-drivers # for e.g. intel integrated GPU drivers. Replace by your driver ``` -You can now run the application with vglrun: +You can now run the application with `vglrun`: -```Bash +```console singularity exec vgl.sif vglrun glxgears ``` -**Attention:**Using VirtualGL may not be required at all and could even decrease the performance. To -check install e.g. glxgears as above and your graphics driver (or use the VirtualGL image from -above) and disable vsync: +!!! warning -``` + Using VirtualGL may not be required at all and could even decrease the performance. + +To check install, e.g., `glxgears` as above and your graphics driver (or use the VirtualGL image +from above) and disable `vsync`: + +```console vblank_mode=0 singularity exec vgl.sif glxgears ``` -Compare the FPS output with the glxgears prefixed by vglrun (see above) to see which produces more +Compare the FPS output with the `glxgears` prefixed by `vglrun` (see above) to see which produces more FPS (or runs at all). 
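+
+A quick way to do this comparison could look like the following (a sketch that reuses the `vgl.sif`
+image built from the definition above; absolute FPS numbers will differ on your system):
+
+```console
+vblank_mode=0 singularity exec vgl.sif glxgears    # plain glxgears, vsync disabled
+singularity exec vgl.sif vglrun glxgears           # the same program run through VirtualGL
+```
+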
-**NVIDIA GPUs** need the `--nv` parameter for the singularity command:
+**NVIDIA GPUs** need the `--nv` parameter for the Singularity command:

-``Bash
+```console
 singularity exec --nv vgl.sif glxgears
 ```
diff --git a/doc.zih.tu-dresden.de/docs/software/tensorboard.md b/doc.zih.tu-dresden.de/docs/software/tensorboard.md
new file mode 100644
index 0000000000000000000000000000000000000000..d2c838d3961d8f48794e544ce1ca7846d24e7325
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/software/tensorboard.md
@@ -0,0 +1,84 @@
+# TensorBoard
+
+TensorBoard is a visualization toolkit for TensorFlow and offers a variety of functionalities such
+as presentation of loss and accuracy, visualization of the model graph or profiling of the
+application.
+
+## Using JupyterHub
+
+The easiest way to use TensorBoard is via [JupyterHub](../access/jupyterhub.md). The default
+TensorBoard log directory is set to `/tmp/<username>/tf-logs` on the compute node where the
+Jupyter session is running. In order to show your own directory with logs, it can be soft-linked
+to the default folder. Open a "New Launcher" menu (`Ctrl+Shift+L`) and select "Terminal" session.
+It will start a new terminal on the respective compute node. Create a directory
+`/tmp/$USER/tf-logs` and link it with your log directory via
+`ln -s <your-tensorboard-target-directory> <local-tf-logs-directory>`:
+
+```Bash
+mkdir -p /tmp/$USER/tf-logs
+ln -s <your-tensorboard-target-directory> /tmp/$USER/tf-logs
+```
+
+Update the TensorBoard tab if needed with `F5`.
+
+## Using TensorBoard from Module Environment
+
+On ZIH systems, TensorBoard is also available as an extension of the TensorFlow module. To check
+whether a specific TensorFlow module provides TensorBoard, use the following command:
+
+```console hl_lines="9"
+marie@compute$ module spider TensorFlow/2.3.1
+[...]
+      Included extensions
+      ===================
+      absl-py-0.10.0, astor-0.8.0, astunparse-1.6.3, cachetools-4.1.1, gast-0.3.3,
+      google-auth-1.21.3, google-auth-oauthlib-0.4.1, google-pasta-0.2.0,
+      grpcio-1.32.0, Keras-Preprocessing-1.1.2, Markdown-3.2.2, oauthlib-3.1.0, opt-
+      einsum-3.3.0, pyasn1-modules-0.2.8, requests-oauthlib-1.3.0, rsa-4.6,
+      tensorboard-2.3.0, tensorboard-plugin-wit-1.7.0, TensorFlow-2.3.1, tensorflow-
+      estimator-2.3.0, termcolor-1.1.0, Werkzeug-1.0.1, wrapt-1.12.1
+```
+
+If TensorBoard occurs in the `Included extensions` section of the output, TensorBoard is available.
+
+To use TensorBoard, you have to connect via ssh to the ZIH system as usual, schedule an interactive
+job and load a TensorFlow module:
+
+```console
+marie@compute$ module load TensorFlow/2.3.1
+Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 47 dependencies loaded.
+```
+
+Then, create a workspace for the event data that should be visualized in TensorBoard. If you
+already have an event data directory, you can skip that step.
+
+```console
+marie@compute$ ws_allocate -F scratch tensorboard_logdata 1
+Info: creating workspace.
+/scratch/ws/1/marie-tensorboard_logdata
+[...]
+```
+
+Now, you can run your TensorFlow application. Note that you might have to adapt your code to make it
+accessible for TensorBoard. Please find further information on the official [TensorBoard website](https://www.tensorflow.org/tensorboard/get_started).
+Then, you can start TensorBoard and pass the directory of the event data:
+
+```console
+marie@compute$ tensorboard --logdir /scratch/ws/1/marie-tensorboard_logdata --bind_all
+[...]
+TensorBoard 2.3.0 at http://taurusi8034.taurus.hrsk.tu-dresden.de:6006/
+[...]
+``` + +TensorBoard then returns a server address on Taurus, e.g. `taurusi8034.taurus.hrsk.tu-dresden.de:6006` + +For accessing TensorBoard now, you have to set up some port forwarding via ssh to your local +machine: + +```console +marie@local$ ssh -N -f -L 6006:taurusi8034.taurus.hrsk.tu-dresden.de:6006 <zih-login>@taurus.hrsk.tu-dresden.de +``` + +Now, you can see the TensorBoard in your browser at `http://localhost:6006/`. + +Note that you can also use TensorBoard in an [sbatch file](../jobs_and_resources/slurm.md). diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow.md b/doc.zih.tu-dresden.de/docs/software/tensorflow.md index 346eb9a1da4e0728c2751773d656ac70d00a60c4..09a8352a32648178f3634a4099eee52ad6c0ccd0 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow.md @@ -1,264 +1,156 @@ # TensorFlow -## Introduction - -This is an introduction of how to start working with TensorFlow and run -machine learning applications on the [HPC-DA](../jobs_and_resources/hpcda.md) system of Taurus. - -\<span style="font-size: 1em;">On the machine learning nodes (machine -learning partition), you can use the tools from [IBM PowerAI](power_ai.md) or the other -modules. PowerAI is an enterprise software distribution that combines popular open-source -deep learning frameworks, efficient AI development tools (Tensorflow, Caffe, etc). For -this page and examples was used [PowerAI version 1.5.4](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html) - -[TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source -software library for dataflow and differentiable programming across many -tasks. It is a symbolic math library, used primarily for machine -learning applications. It has a comprehensive, flexible ecosystem of tools, libraries and -community resources. It is available on taurus along with other common machine -learning packages like Pillow, SciPY, Numpy. - -**Prerequisites:** To work with Tensorflow on Taurus, you obviously need -[access](../access/ssh_login.md) for the Taurus system and basic knowledge about Python, SLURM system. - -**Aim** of this page is to introduce users on how to start working with -TensorFlow on the \<a href="HPCDA" target="\_self">HPC-DA\</a> system - -part of the TU Dresden HPC system. - -There are three main options on how to work with Tensorflow on the -HPC-DA: **1.** **Modules,** **2.** **JupyterNotebook, 3. Containers**. The best option is -to use [module system](../software/runtime_environment.md#Module_Environments) and -Python virtual environment. Please see the next chapters and the [Python page](python.md) for the -HPC-DA system. - -The information about the Jupyter notebook and the **JupyterHub** could -be found [here](../access/jupyterhub.md). The use of -Containers is described [here](tensorflow_container_on_hpcda.md). - -On Taurus, there exist different module environments, each containing a set -of software modules. The default is *modenv/scs5* which is already loaded, -however for the HPC-DA system using the "ml" partition you need to use *modenv/ml*. -To find out which partition are you using use: `ml list`. 
-You can change the module environment with the command: - - module load modenv/ml - -The machine learning partition is based on the PowerPC Architecture (ppc64le) -(Power9 processors), which means that the software built for x86_64 will not -work on this partition, so you most likely can't use your already locally -installed packages on Taurus. Also, users need to use the modules which are -specially made for the ml partition (from modenv/ml) and not for the rest -of Taurus (e.g. from modenv/scs5). - -Each node on the ml partition has 6x Tesla V-100 GPUs, with 176 parallel threads -on 44 cores per node (Simultaneous multithreading (SMT) enabled) and 256GB RAM. -The specification could be found [here](../jobs_and_resources/power9.md). - -%RED%Note:<span class="twiki-macro ENDCOLOR"></span> Users should not -reserve more than 28 threads per each GPU device so that other users on -the same node still have enough CPUs for their computations left. - -## Get started with Tensorflow - -This example shows how to install and start working with TensorFlow -(with using modules system) and the python virtual environment. Please, -check the next chapter for the details about the virtual environment. - - srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with 1 gpu on 1 node with 8000 mb. - - module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - - mkdir python-environments #create folder - module load TensorFlow #load TensorFlow module. Example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. - which python #check which python are you using - virtualenvv --system-site-packages python-environments/env #create virtual environment "env" which inheriting with global site packages - source python-environments/env/bin/activate #Activate virtual environment "env". Example output: (env) bash-4.2$ - python #start python - import tensorflow as tf - print(tf.VERSION) #example output: 1.10.0 - -Keep in mind that using **srun** directly on the shell will be blocking -and launch an interactive job. Apart from short test runs, it is -recommended to launch your jobs into the background by using batch -jobs:\<span> **sbatch \[options\] \<job file>** \</span>. The example -will be presented later on the page. - -As a Tensorflow example, we will use a \<a -href="<https://www.tensorflow.org/tutorials>" target="\_blank">simple -mnist model\</a>. Even though this example is in Python, the information -here will still apply to other tools. - -The ml partition has very efficacious GPUs to offer. Do not assume that -more power means automatically faster computational speed. The GPU is -only one part of a typical machine learning application. Do not forget -that first the input data needs to be loaded and in most cases even -rescaled or augmented. If you do not specify that you want to use more -than the default one worker (=one CPU thread), then it is very likely -that your GPU computes faster, than it receives the input data. It is, -therefore, possible, that you will not be any faster, than on other GPU -partitions. \<span style="font-size: 1em;">You can solve this by using -multithreading when loading your input data. 
The \</span>\<a -href="<https://keras.io/models/sequential/#fit_generator>" -target="\_blank">fit_generator\</a>\<span style="font-size: 1em;"> -method supports multiprocessing, just set \`use_multiprocessing\` to -\`True\`, \</span>\<a href="Slurm#Job_Submission" -target="\_blank">request more Threads\</a>\<span style="font-size: -1em;"> from SLURM and set the \`Workers\` amount accordingly.\</span> - -The example below with a \<a -href="<https://www.tensorflow.org/tutorials>" target="\_blank">simple -mnist model\</a> of the python script illustrates using TF-Keras API -from TensorFlow. \<a href="<https://www.tensorflow.org/guide/keras>" -target="\_top">Keras\</a> is TensorFlows high-level API. - -**You can read in detail how to work with Keras on Taurus \<a -href="Keras" target="\_blank">here\</a>.** - - import tensorflow as tf - # Load and prepare the MNIST dataset. Convert the samples from integers to floating-point numbers: - mnist = tf.keras.datasets.mnist - - (x_train, y_train),(x_test, y_test) = mnist.load_data() - x_train, x_test = x_train / 255.0, x_test / 255.0 - - # Build the tf.keras model by stacking layers. Select an optimizer and loss function used for training - model = tf.keras.models.Sequential([ - tf.keras.layers.Flatten(input_shape=(28, 28)), - tf.keras.layers.Dense(512, activation=tf.nn.relu), - tf.keras.layers.Dropout(0.2), - tf.keras.layers.Dense(10, activation=tf.nn.softmax) - ]) - model.compile(optimizer='adam', - loss='sparse_categorical_crossentropy', - metrics=['accuracy']) - - # Train and evaluate model - model.fit(x_train, y_train, epochs=5) - model.evaluate(x_test, y_test) - -The example can train an image classifier with \~98% accuracy based on -this dataset. - -## Python virtual environment - -A virtual environment is a cooperatively isolated runtime environment -that allows Python users and applications to install and update Python -distribution packages without interfering with the behaviour of other -Python applications running on the same system. At its core, the main -purpose of Python virtual environments is to create an isolated -environment for Python projects. - -**Vitualenv**is a standard Python tool to create isolated Python -environments and part of the Python installation/module. We recommend -using virtualenv to work with Tensorflow and Pytorch on Taurus.\<br -/>However, if you have reasons (previously created environments etc) you -can also use conda which is the second way to use a virtual environment -on the Taurus. \<a -href="<https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html>" -target="\_blank">Conda\</a> is an open-source package management system -and environment management system. Note that using conda means that -working with other modules from taurus will be harder or impossible. -Hence it is highly recommended to use virtualenv. - -## Running the sbatch script on ML modules (modenv/ml) and SCS5 modules (modenv/scs5) - -Generally, for machine learning purposes the ml partition is used but -for some special issues, the other partitions can be useful also. The -following sbatch script can execute the above Python script both on ml -partition or gpu2 partition.\<br /> When not using the -TensorFlow-Anaconda modules you may need some additional modules that -are not included (e.g. when using the TensorFlow module from modenv/scs5 -on gpu2).\<br />If you have a question about the sbatch script see the -article about \<a href="Slurm" target="\_blank">SLURM\</a>. 
Keep in mind -that you need to put the executable file (machine_learning_example.py) -with python code to the same folder as the bash script file -\<script_name>.sh (see below) or specify the path. - - #!/bin/bash - #SBATCH --mem=8GB # specify the needed memory - #SBATCH -p ml # specify ml partition or gpu2 partition - #SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task) - #SBATCH --nodes=1 # request 1 node - #SBATCH --time=00:10:00 # runs for 10 minutes - #SBATCH -c 7 # how many cores per task allocated - #SBATCH -o HLR_<name_your_script>.out # save output message under HLR_${SLURMJOBID}.out - #SBATCH -e HLR_<name_your_script>.err # save error messages under HLR_${SLURMJOBID}.err - - if [ "$SLURM_JOB_PARTITION" == "ml" ]; then - module load modenv/ml - module load TensorFlow/2.0.0-PythonAnaconda-3.7 - else - module load modenv/scs5 - module load TensorFlow/2.0.0-fosscuda-2019b-Python-3.7.4 - module load Pillow/6.2.1-GCCcore-8.3.0 # Optional - module load h5py/2.10.0-fosscuda-2019b-Python-3.7.4 # Optional - fi - - python machine_learning_example.py - - ## when finished writing, submit with: sbatch <script_name> - -Output results and errors file can be seen in the same folder in the -corresponding files after the end of the job. Part of the example -output: - - 1600/10000 [===>..........................] - ETA: 0s - 3168/10000 [========>.....................] - ETA: 0s - 4736/10000 [=============>................] - ETA: 0s - 6304/10000 [=================>............] - ETA: 0s - 7872/10000 [======================>.......] - ETA: 0s - 9440/10000 [===========================>..] - ETA: 0s - 10000/10000 [==============================] - 0s 38us/step - -## TensorFlow 2 - -[TensorFlow -2.0](https://blog.tensorflow.org/2019/09/tensorflow-20-is-now-available.html) -is a significant milestone for TensorFlow and the community. There are -multiple important changes for users. TensorFlow 2.0 removes redundant -APIs, makes APIs more consistent (Unified RNNs, Unified Optimizers), and -better integrates with the Python runtime with Eager execution. Also, -TensorFlow 2.0 offers many performance improvements on GPUs. - -There are a number of TensorFlow 2 modules for both ml and scs5 modenvs -on Taurus. Please check\<a href="SoftwareModulesList" target="\_blank"> -the software modules list\</a> for the information about available -modules or use - - module spider TensorFlow - -%RED%Note:<span class="twiki-macro ENDCOLOR"></span> Tensorflow 2 will -be loaded by default when loading the Tensorflow module without -specifying the version. - -\<span style="font-size: 1em;">TensorFlow 2.0 includes many API changes, -such as reordering arguments, renaming symbols, and changing default -values for parameters. Thus in some cases, it makes code written for the -TensorFlow 1 not compatible with TensorFlow 2. However, If you are using -the high-level APIs (tf.keras) there may be little or no action you need -to take to make your code fully TensorFlow 2.0 \<a -href="<https://www.tensorflow.org/guide/migrate>" -target="\_blank">compatible\</a>. 
It is still possible to run 1.X code, -unmodified ( [except for -contrib](https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md)), -in TensorFlow 2.0:\</span> - - import tensorflow.compat.v1 as tf - tf.disable_v2_behavior() #instead of "import tensorflow as tf" - -To make the transition to TF 2.0 as seamless as possible, the TensorFlow -team has created the -[`tf_upgrade_v2`](https://www.tensorflow.org/guide/upgrade) utility to -help transition legacy code to the new API. - -## FAQ: - -Q: Which module environment should I use? modenv/ml, modenv/scs5, -modenv/hiera - -A: On the ml partition use modenv/ml, on rome and gpu3 use modenv/hiera, -else stay with the default of modenv/scs5. - -Q: How to change the module environment and know more about modules? - -A: [Modules](../software/runtime_environment.md#Modules) +[TensorFlow](https://www.tensorflow.org) is a free end-to-end open-source software library for data +flow and differentiable programming across many tasks. It is a symbolic math library, used primarily +for machine learning applications. It has a comprehensive, flexible ecosystem of tools, libraries +and community resources. + +Please check the software modules list via + +```console +marie@compute$ module spider TensorFlow +[...] +``` + +to find out, which TensorFlow modules are available on your partition. + +On ZIH systems, TensorFlow 2 is the default module version. For compatibility hints between +TensorFlow 2 and TensorFlow 1, see the corresponding [section below](#compatibility-tf2-and-tf1). + +We recommend using partitions **Alpha** and/or **ML** when working with machine learning workflows +and the TensorFlow library. You can find detailed hardware specification in our +[Hardware](../jobs_and_resources/hardware_overview.md) documentation. + +## TensorFlow Console + +On the partition Alpha, load the module environment: + +```console +marie@alpha$ module load modenv/scs5 +``` + +Alternatively you can use `modenv/hiera` module environment, where the newest versions are +available + +```console +marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 + +The following have been reloaded with a version change: + 1) modenv/scs5 => modenv/hiera + +Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5 and 15 dependencies loaded. +marie@alpha$ module avail TensorFlow + +-------------- /sw/modules/hiera/all/MPI/GCC-CUDA/10.2.0-11.1.1/OpenMPI/4.0.5 ------------------- + Horovod/0.21.1-TensorFlow-2.4.1 TensorFlow/2.4.1 + +[...] +``` + +On the partition ML load the module environment: + +```console +marie@ml$ module load modenv/ml +The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml +``` + +This example shows how to install and start working with TensorFlow using the modules system. + +```console +marie@ml$ module load TensorFlow +Module TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4 and 47 dependencies loaded. +``` + +Now we can use TensorFlow. Nevertheless when working with Python in an interactive job, we recommend +to use a virtual environment. In the following example, we create a python virtual environment and +import TensorFlow: + +!!! example + + ```console + marie@ml$ ws_allocate -F scratch python_virtual_environment 1 + Info: creating workspace. + /scratch/ws/1/python_virtual_environment + [...] + marie@ml$ which python #check which python are you using + /sw/installed/Python/3.7.2-GCCcore-8.2.0 + marie@ml$ virtualenv --system-site-packages /scratch/ws/1/python_virtual_environment/env + [...] 
+ marie@ml$ source /scratch/ws/1/python_virtual_environment/env/bin/activate + marie@ml$ python -c "import tensorflow as tf; print(tf.__version__)" + [...] + 2.3.1 + ``` + +## TensorFlow in JupyterHub + +In addition to interactive and batch jobs, it is possible to work with TensorFlow using +JupyterHub. The production and test environments of JupyterHub contain Python and R kernels, that +both come with TensorFlow support. However, you can specify the TensorFlow version when spawning +the notebook by pre-loading a specific TensorFlow module: + + +{: align="center"} + +!!! hint + + You can also define your own Jupyter kernel for more specific tasks. Please read about Jupyter + kernels and virtual environments in our + [JupyterHub](../access/jupyterhub.md#creating-and-using-your-own-environment) documentation. + +## TensorFlow in Containers + +Another option to use TensorFlow are containers. In the HPC domain, the +[Singularity](https://singularity.hpcng.org/) container system is a widely used tool. In the +following example, we use the tensorflow-test in a Singularity container: + +```console +marie@ml$ singularity shell --nv /scratch/singularity/powerai-1.5.3-all-ubuntu16.04-py3.img +Singularity>$ export PATH=/opt/anaconda3/bin:$PATH +Singularity>$ source activate /opt/anaconda3 #activate conda environment +(base) Singularity>$ . /opt/DL/tensorflow/bin/tensorflow-activate +(base) Singularity>$ tensorflow-test +Basic test of tensorflow - A Hello World!!!... +[...] +``` + +## TensorFlow with Python or R + +For further information on TensorFlow in combination with Python see +[data analytics with Python](data_analytics_with_python.md), for R see +[data analytics with R](data_analytics_with_r.md). + +## Distributed TensorFlow + +For details on how to run TensorFlow with multiple GPUs and/or multiple nodes, see +[distributed training](distributed_training.md). + +## Compatibility TF2 and TF1 + +TensorFlow 2.0 includes many API changes, such as reordering arguments, renaming symbols, and +changing default values for parameters. Thus in some cases, it makes code written for the TensorFlow +1.X not compatible with TensorFlow 2.X. However, If you are using the high-level APIs (`tf.keras`) +there may be little or no action you need to take to make your code fully +[TensorFlow 2.0](https://www.tensorflow.org/guide/migrate) compatible. It is still possible to +run 1.X code, unmodified (except for contrib), in TensorFlow 2.0: + +```python +import tensorflow.compat.v1 as tf +tf.disable_v2_behavior() #instead of "import tensorflow as tf" +``` + +To make the transition to TensorFlow 2.0 as seamless as possible, the TensorFlow team has created +the tf_upgrade_v2 utility to help transition legacy code to the new API. + +## Keras + +[Keras](https://keras.io) is a high-level neural network API, written in Python and capable +of running on top of TensorFlow. Please check the software modules list via + +```console +marie@compute$ module spider Keras +[...] +``` + +to find out, which Keras modules are available on your partition. TensorFlow should be automatically +loaded as a dependency. After loading the module, you can use Keras as usual. 
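+
+As a minimal sketch of such usage (assuming a loaded TensorFlow module and the MNIST example data
+available via `tf.keras.datasets`), building and training a small model with the Keras API could
+look like this:
+
+```python
+import tensorflow as tf
+
+# Load the MNIST example data (downloaded on first use) and scale pixel values to [0, 1]
+(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
+x_train, x_test = x_train / 255.0, x_test / 255.0
+
+# Small sequential model: flatten the images, one hidden layer, softmax output
+model = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(input_shape=(28, 28)),
+    tf.keras.layers.Dense(128, activation='relu'),
+    tf.keras.layers.Dense(10, activation='softmax'),
+])
+model.compile(optimizer='adam',
+              loss='sparse_categorical_crossentropy',
+              metrics=['accuracy'])
+
+# Train for a few epochs and evaluate on the held-out test set
+model.fit(x_train, y_train, epochs=5)
+model.evaluate(x_test, y_test)
+```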
diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow_container_on_hpcda.md b/doc.zih.tu-dresden.de/docs/software/tensorflow_container_on_hpcda.md deleted file mode 100644 index 7b77f7da32f720efa0145971b1d3b9b9612a3e92..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow_container_on_hpcda.md +++ /dev/null @@ -1,85 +0,0 @@ -# Container on HPC-DA (TensorFlow, PyTorch) - -<span class="twiki-macro RED"></span> **Note: This page is under -construction** <span class="twiki-macro ENDCOLOR"></span> - -\<span style="font-size: 1em;">A container is a standard unit of -software that packages up code and all its dependencies so the -application runs quickly and reliably from one computing environment to -another.\</span> - -**Prerequisites:** To work with Tensorflow, you need \<a href="Login" -target="\_blank">access\</a> for the Taurus system and basic knowledge -about containers, Linux systems. - -**Aim** of this page is to introduce users on how to use Machine -Learning Frameworks such as TensorFlow or PyTorch on the \<a -href="HPCDA" target="\_self">HPC-DA\</a> system - part of the TU Dresden -HPC system. - -Using a container is one of the options to use Machine learning -workflows on Taurus. Using containers gives you more flexibility working -with modules and software but at the same time required more effort. - -\<span style="font-size: 1em;">On Taurus \</span>\<a -href="<https://sylabs.io/>" target="\_blank">Singularity\</a>\<span -style="font-size: 1em;"> used as a standard container solution. -Singularity enables users to have full control of their environment. -Singularity containers can be used to package entire scientific -workflows, software and libraries, and even data. This means that -\</span>**you dont have to ask an HPC support to install anything for -you - you can put it in a Singularity container and run!**\<span -style="font-size: 1em;">As opposed to Docker (the most famous container -solution), Singularity is much more suited to being used in an HPC -environment and more efficient in many cases. Docker containers also can -easily be used in Singularity.\</span> - -Future information is relevant for the HPC-DA system (ML partition) -based on Power9 architecture. - -In some cases using Singularity requires a Linux machine with root -privileges, the same architecture and a compatible kernel. For many -reasons, users on Taurus cannot be granted root permissions. A solution -is a Virtual Machine (VM) on the ml partition which allows users to gain -root permissions in an isolated environment. There are two main options -on how to work with VM on Taurus: - -1\. [VM tools](vm_tools.md). Automative algorithms for using virtual -machines; - -2\. [Manual method](virtual_machines.md). It required more operations but gives you -more flexibility and reliability. - -Short algorithm to run the virtual machine manually: - - srun -p ml -N 1 -c 4 --hint=nomultithread --cloud=kvm --pty /bin/bash<br />cat ~/.cloud_$SLURM_JOB_ID #Example output: ssh root@192.168.0.1<br />ssh root@192.168.0.1 #Copy and paste output from the previous command <br />./mount_host_data.sh - -with VMtools: - -VMtools contains two main programs: -**\<span>buildSingularityImage\</span>** and -**\<span>startInVM.\</span>** - -Main options on how to create a container on ML nodes: - -1\. Create a container from the definition - -1.1 Create a Singularity definition from the Dockerfile. - -\<span style="font-size: 1em;">2. 
Importing container from the \</span> -[DockerHub](https://hub.docker.com/search?q=ppc64le&type=image&page=1)\<span -style="font-size: 1em;"> or \</span> -[SingularityHub](https://singularity-hub.org/) - -Two main sources for the Tensorflow containers for the Power9 -architecture: - -<https://hub.docker.com/r/ibmcom/tensorflow-ppc64le> - -<https://hub.docker.com/r/ibmcom/powerai> - -Pytorch: - -<https://hub.docker.com/r/ibmcom/powerai> - --- Main.AndreiPolitov - 2020-01-03 diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md b/doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md deleted file mode 100644 index e011dfd2dc35d7dc5ef1576d7a5dbefa5d52f6d4..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md +++ /dev/null @@ -1,252 +0,0 @@ -# Tensorflow on Jupyter Notebook - -%RED%Note: This page is under construction<span -class="twiki-macro ENDCOLOR"></span> - -Disclaimer: This page dedicates a specific question. For more general -questions please check the JupyterHub webpage. - -The Jupyter Notebook is an open-source web application that allows you -to create documents that contain live code, equations, visualizations, -and narrative text. \<span style="font-size: 1em;">Jupyter notebook -allows working with TensorFlow on Taurus with GUI (graphic user -interface) and the opportunity to see intermediate results step by step -of your work. This can be useful for users who dont have huge experience -with HPC or Linux. \</span> - -**Prerequisites:** To work with Tensorflow and jupyter notebook you need -\<a href="Login" target="\_blank">access\</a> for the Taurus system and -basic knowledge about Python, Slurm system and the Jupyter notebook. - -\<span style="font-size: 1em;"> **This page aims** to introduce users on -how to start working with TensorFlow on the [HPCDA](../jobs_and_resources/hpcda.md) system - part -of the TU Dresden HPC system with a graphical interface.\</span> - -## Get started with Jupyter notebook - -Jupyter notebooks are a great way for interactive computing in your web -browser. Jupyter allows working with data cleaning and transformation, -numerical simulation, statistical modelling, data visualization and of -course with machine learning. - -\<span style="font-size: 1em;">There are two general options on how to -work Jupyter notebooks using HPC. \</span> - -- \<span style="font-size: 1em;">There is \</span>**\<a - href="JupyterHub" target="\_self">jupyterhub\</a>** on Taurus, where - you can simply run your Jupyter notebook on HPC nodes. JupyterHub is - available [here](https://taurus.hrsk.tu-dresden.de/jupyter) -- For more specific cases you can run a manually created **remote - jupyter server.** \<span style="font-size: 1em;"> You can find the - manual server setup [here](deep_learning.md). - -\<span style="font-size: 13px;">Keep in mind that with Jupyterhub you -can't work with some special instruments. However general data analytics -tools are available. Still and all, the simplest option for beginners is -using JupyterHub.\</span> - -## Virtual environment - -\<span style="font-size: 1em;">For working with TensorFlow and python -packages using virtual environments (kernels) is necessary.\</span> - -Interactive code interpreters that are used by Jupyter Notebooks are -called kernels.\<br />Creating and using your kernel (environment) has -the benefit that you can install your preferred python packages and use -them in your notebooks. 
- -A virtual environment is a cooperatively isolated runtime environment -that allows Python users and applications to install and upgrade Python -distribution packages without interfering with the behaviour of other -Python applications running on the same system. So the [Virtual -environment](https://docs.python.org/3/glossary.html#term-virtual-environment) -is a self-contained directory tree that contains a Python installation -for a particular version of Python, plus several additional packages. At -its core, the main purpose of Python virtual environments is to create -an isolated environment for Python projects. Python virtual environment is -the main method to work with Deep Learning software as TensorFlow on the -[HPCDA](../jobs_and_resources/hpcda.md) system. - -### Conda and Virtualenv - -There are two methods of how to work with virtual environments on -Taurus. **Vitualenv (venv)** is a -standard Python tool to create isolated Python environments. We -recommend using venv to work with Tensorflow and Pytorch on Taurus. It -has been integrated into the standard library under -the [venv](https://docs.python.org/3/library/venv.html). -However, if you have reasons (previously created environments etc) you -could easily use conda. The conda is the second way to use a virtual -environment on the Taurus. -[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) -is an open-source package management system and environment management system -from the Anaconda. - -**Note:** Keep in mind that you **can not** use conda for working with -the virtual environments previously created with Vitualenv tool and vice -versa! - -This example shows how to start working with environments and prepare -environment (kernel) for working with Jupyter server - - srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=8000 bash #Job submission in ml nodes with 1 gpu on 1 node with 8000 mb. - - module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml - - mkdir python-virtual-environments #create folder for your environments - cd python-virtual-environments #go to folder - module load TensorFlow #load TensorFlow module. Example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. - which python #check which python are you using - python3 -m venv --system-site-packages env #create virtual environment "env" which inheriting with global site packages - source env/bin/activate #Activate virtual environment "env". Example output: (env) bash-4.2$ - module load TensorFlow #load TensorFlow module in the virtual environment - -The inscription (env) at the beginning of each line represents that now -you are in the virtual environment. - -Now you can check the working capacity of the current environment. - - python #start python - import tensorflow as tf - print(tf.VERSION) #example output: 1.14.0 - -### Install Ipykernel - -Ipykernel is an interactive Python shell and a Jupyter kernel to work -with Python code in Jupyter notebooks. The IPython kernel is the Python -execution backend for Jupyter. The Jupyter Notebook -automatically ensures that the IPython kernel is available. - -``` - (env) bash-4.2$ pip install ipykernel #example output: Collecting ipykernel - ... - #example output: Successfully installed ... ipykernel-5.1.0 ipython-7.5.0 ... 
- - (env) bash-4.2$ python -m ipykernel install --user --name env --display-name="env" - - #example output: Installed kernelspec my-kernel in .../.local/share/jupyter/kernels/env - [install now additional packages for your notebooks] -``` - -Deactivate the virtual environment - - (env) bash-4.2$ deactivate - -So now you have a virtual environment with included TensorFlow module. -You can use this workflow for your purposes particularly for the simple -running of your jupyter notebook with Tensorflow code. - -## Examples and running the model - -Below are brief explanations examples of Jupyter notebooks with -Tensorflow models which you can run on ml nodes of HPC-DA. Prepared -examples of TensorFlow models give you an understanding of how to work -with jupyterhub and tensorflow models. It can be useful and instructive -to start your acquaintance with Tensorflow and HPC-DA system from these -simple examples. - -You can use a [remote Jupyter server](../access/jupyterhub.md). For simplicity, we -will recommend using Jupyterhub for our examples. - -JupyterHub is available [here](https://taurus.hrsk.tu-dresden.de/jupyter) - -Please check updates and details [JupyterHub](../access/jupyterhub.md). However, -the general pipeline can be briefly explained as follows. - -After logging, you can start a new session and configure it. There are -simple and advanced forms to set up your session. On the simple form, -you have to choose the "IBM Power (ppc64le)" architecture. You can -select the required number of CPUs and GPUs. For the acquaintance with -the system through the examples below the recommended amount of CPUs and -1 GPU will be enough. With the advanced form, you can use the -configuration with 1 GPU and 7 CPUs. To access all your workspaces -use " / " in the workspace scope. - -You need to download the file with a jupyter notebook that already -contains all you need for the start of the work. Please put the file -into your previously created virtual environment in your working -directory or use the kernel for your notebook. - -Note: You could work with simple examples in your home directory but according to -[new storage concept](../data_lifecycle/overview.md) please use -[workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. -For this reason, you have to use advanced options and put "/" in "Workspace scope" field. - -To download the first example (from the list below) into your previously -created virtual environment you could use the following command: - -``` - ws_list - cd <name_of_your_workspace> #go to workspace - - wget https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TensorFlowOnJupyterNotebook/Mnistmodel.zip - unzip Example_TensorFlow_Automobileset.zip -``` - -Also, you could use kernels for all notebooks, not only for them which placed -in your virtual environment. See the [jupyterhub](../access/jupyterhub.md) page. - -### Examples: - -1\. Simple MNIST model. The MNIST database is a large database of -handwritten digits that is commonly used for \<a -href="<https://en.wikipedia.org/wiki/Training_set>" title="Training -set">t\</a>raining various image processing systems. This model -illustrates using TF-Keras API. \<a -href="<https://www.tensorflow.org/guide/keras>" -target="\_top">Keras\</a> is TensorFlow's high-level API. Tensorflow and -Keras allow us to import and download the MNIST dataset directly from -their API. 
Recommended parameters for running this model is 1 GPU and 7 -cores (28 thread) - -[doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TensorFlowOnJupyterNotebook/Mnistmodel.zip]**todo**(Mnistmodel.zip) - -### Running the model - -\<span style="font-size: 1em;">Documents are organized with tabs and a -very versatile split-screen feature. On the left side of the screen, you -can open your file. Use 'File-Open from Path' to go to your workspace -(e.g. /scratch/ws/\<username-name_of_your_ws>). You could run each cell -separately step by step and analyze the result of each step. Default -command for running one cell Shift+Enter'. Also, you could run all cells -with the command 'run all cells' how presented on the picture -below\</span> - -**todo** \<img alt="Screenshot_from_2019-09-03_15-20-16.png" height="250" -src="Screenshot_from_2019-09-03_15-20-16.png" -title="Screenshot_from_2019-09-03_15-20-16.png" width="436" /> - -#### Additional advanced models - -1\. A simple regression model uses [Automobile -dataset](https://archive.ics.uci.edu/ml/datasets/Automobile). In a -regression problem, we aim to predict the output of a continuous value, -in this case, we try to predict fuel efficiency. This is the simple -model created to present how to work with a jupyter notebook for the -TensorFlow models. Recommended parameters for running this model is 1 -GPU and 7 cores (28 thread) - -[doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TensorFlowOnJupyterNotebook/Example_TensorFlow_Automobileset.zip]**todo**(Example_TensorFlow_Automobileset.zip) - -2\. The regression model uses the -[dataset](https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data) -with meteorological data from the Beijing airport and the US embassy. -The data set contains almost 50 thousand on instances and therefore -needs more computational effort. Recommended parameters for running this -model is 1 GPU and 7 cores (28 threads) - -[doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/TensorFlowOnJupyterNotebook/Example_TensorFlow_Meteo_airport.zip]**todo**(Example_TensorFlow_Meteo_airport.zip) - -**Note**: All examples created only for study purposes. The main aim is -to introduce users of the HPC-DA system of TU-Dresden with TensorFlow -and Jupyter notebook. Examples do not pretend to completeness or -science's significance. Feel free to improve the models and use them for -your study. - -- [Mnistmodel.zip]**todo**(Mnistmodel.zip): Mnistmodel.zip -- [Example_TensorFlow_Automobileset.zip]**todo**(Example_TensorFlow_Automobileset.zip): - Example_TensorFlow_Automobileset.zip -- [Example_TensorFlow_Meteo_airport.zip]**todo**(Example_TensorFlow_Meteo_airport.zip): - Example_TensorFlow_Meteo_airport.zip -- [Example_TensorFlow_3D_road_network.zip]**todo**(Example_TensorFlow_3D_road_network.zip): - Example_TensorFlow_3D_road_network.zip diff --git a/doc.zih.tu-dresden.de/docs/software/vampir.md b/doc.zih.tu-dresden.de/docs/software/vampir.md index 465d28925302091bf0e0d66156753452c3608912..9df5eb62a0d461da97fcb2ce28f461d9042e93a2 100644 --- a/doc.zih.tu-dresden.de/docs/software/vampir.md +++ b/doc.zih.tu-dresden.de/docs/software/vampir.md @@ -146,8 +146,8 @@ marie@local$ ssh -L 30000:taurusi1253:30055 taurus.hrsk.tu-dresden.de ``` Now, the port 30000 on your desktop is connected to the VampirServer port 30055 at the compute node -taurusi1253 of Taurus. Finally, start your local Vampir client and establish a remote connection to -`localhost`, port 30000 as described in the manual. +`taurusi1253` of the ZIH system. 
Finally, start your local Vampir client and establish a remote +connection to `localhost`, port 30000 as described in the manual. ```console marie@local$ vampir diff --git a/doc.zih.tu-dresden.de/docs/software/virtual_desktops.md b/doc.zih.tu-dresden.de/docs/software/virtual_desktops.md index 123c323b2d3acc9c24863ca203179f8338da4dce..cb809c3a99022c8ec5c5d3e9a98b96b8533baa0b 100644 --- a/doc.zih.tu-dresden.de/docs/software/virtual_desktops.md +++ b/doc.zih.tu-dresden.de/docs/software/virtual_desktops.md @@ -1,89 +1,3 @@ -# Virtual desktops +# Virtual Desktops -Use WebVNC or NICE DCV to run GUI applications on HPC resources. - -<span class="twiki-macro TABLE" columnwidths="10%,45%,45%"></span> - -| | | | -|----------------|-------------------------------------------------------|-------------------------------------------------| -| | **WebVNC** | **NICE DCV** | -| **use case** | all GUI applications that do \<u>not need\</u> OpenGL | only GUI applications that \<u>need\</u> OpenGL | -| **partitions** | all\* (except partitions with GPUs (gpu2, hpdlf, ml) | dcv | - -## Launch a virtual desktop - -<span class="twiki-macro TABLE" columnwidths="10%,45%,45%"></span> \| -**step 1** \| Navigate to \<a href="<https://taurus.hrsk.tu-dresden.de>" -target="\_blank"><https://taurus.hrsk.tu-dresden.de>\</a>. There is our -[JupyterHub](../access/jupyterhub.md) instance. \|\| \| **step 2** \| -Click on the "advanced" tab and choose a preset: \|\| - -| | | | -|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------| -| ^ | **WebVNC** | **DCV** | -| **step 3** | Optional: Finetune your session with the available SLURM job parameters or assign a certain project or reservation. Then save your settings in a new preset for future use. | | -| **step 4** | Click on "Spawn". JupyterHub starts now a SLURM job for you. If everything is ready the JupyterLab interface will appear to you. | | -| **step 5"** | Click on the **button "WebVNC"** to start a virtual desktop. | Click on the \*button "NICE DCV"to start a virtual desktop. | -| ^ | The virtual desktop starts in a new tab or window. | | - -### Demonstration - -\<video controls="" width="320" style="border: 1px solid black">\<source -src="<https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/VirtualDesktops/start-virtual-desktop-dcv.mp4>" -type="video/mp4">\<source -src="<https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/VirtualDesktops/start-virtual-desktop-dcv.webm>" -type="video/webm">\</video> - -### Using the quickstart feature - -JupyterHub can start a job automatically if the URL contains certain -parameters. 
- -<span class="twiki-macro TABLE" columnwidths="10%,45%,45%"></span> - -| | | | -|----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| examples | \<a href="<https://taurus.hrsk.tu-dresden.de/jupyter/hub/spawn#/>\~(partition\~'interactive\~cpuspertask\~'2\~mempercpu\~'2583)" target="\_blank" style="font-size: 1.5em">WebVNC\</a> | \<a href="<https://taurus.hrsk.tu-dresden.de/jupyter/hub/spawn#/>\~(partition\~'dcv\~cpuspertask\~'6\~gres\~'gpu\*3a1\~mempercpu\~'2583)" target="\_blank" style="font-size: 1.5em">NICE DCV\</a> | -| details about the examples | `interactive` partition, 2 CPUs with 2583 MB RAM per core, no GPU | `dcv` partition, 6 CPUs with 2583 MB RAM per core, 1 GPU | -| link creator | Use the spawn form to set your preferred options. The browser URL will be updated with the corresponding parameters. | | - -If you close the browser tabs or windows or log out from your local -machine, you are able to open the virtual desktop later again - as long -as the session runs. But please remember that a SLURM job is running in -the background which has a certain timelimit. - -## Reconnecting to a session - -In order to reconnect to an active instance of WebVNC, simply repeat the -steps required to start a session, beginning - if required - with the -login, then clicking "My server", then by pressing the "+" sign on the -upper left corner. Provided your server is still running and you simply -closed the window or logged out without stopping your server, you will -find your WebVNC desktop the way you left it. - -## Terminate a remote session - -<span class="twiki-macro TABLE" columnwidths="10%,90%"></span> \| **step -1** \| Close the VNC viewer tab or window. \| \| **step 2** \| Click on -File \> Log Out in the JupyterLab main menu. Now you get redirected to -the JupyterLab control panel. If you don't have your JupyterLab tab or -window anymore, navigate directly to \<a -href="<https://taurus.hrsk.tu-dresden.de/jupyter/hub/home>" -target="\_blank"><https://taurus.hrsk.tu-dresden.de/jupyter/hub/home>\</a>. -\| \| **step 3** \| Click on "Stop My Server". This cancels the SLURM -job and terminates your session. \| - -### Demonstration - -\<video controls="" width="320" style="border: 1px solid black">\<source -src="<https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/VirtualDesktops/terminate-virtual-desktop-dcv.mp4>" -type="video/mp4">\<source -src="<https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/VirtualDesktops/terminate-virtual-desktop-dcv.webm>" -type="video/webm">\</video> - -**Remark:** This does not work if you click on the "Logout"-Btn in your -virtual desktop. Instead this will just close your DCV session or cause -a black screen in your WebVNC window without a possibility to recover a -virtual desktop in the same Jupyter session. The solution for now would -be to terminate the whole jupyter session and start a new one like -mentioned above. 
+coming soon diff --git a/doc.zih.tu-dresden.de/docs/software/virtual_machines.md b/doc.zih.tu-dresden.de/docs/software/virtual_machines.md index 5104c7b35587aaeaca86d64419ffd8965d2fa27b..2527bbe91cbb735824598cc90311b88df2eab808 100644 --- a/doc.zih.tu-dresden.de/docs/software/virtual_machines.md +++ b/doc.zih.tu-dresden.de/docs/software/virtual_machines.md @@ -1,88 +1,89 @@ -# Virtual machine on Taurus +# Virtual Machines -The following instructions are primarily aimed at users who want to build their -[Singularity](containers.md) containers on Taurus. +The following instructions are primarily aimed at users who want to build their own +[Singularity](containers.md) containers on ZIH systems. The Singularity container setup requires a Linux machine with root privileges, the same architecture -and a compatible kernel. If some of these requirements can not be fulfilled, then there is -also the option of using the provided virtual machines on Taurus. +and a compatible kernel. If some of these requirements cannot be fulfilled, then there is also the +option of using the provided virtual machines (VM) on ZIH systems. -Currently, starting VMs is only possible on ML and HPDLF nodes. The VMs on the ML nodes are used to -build singularity containers for the Power9 architecture and the HPDLF nodes to build singularity -containers for the x86 architecture. +Currently, starting VMs is only possible on partitions `ml` and `hpdlf`. The VMs on the ML nodes are +used to build singularity containers for the Power9 architecture and the HPDLF nodes to build +Singularity containers for the x86 architecture. -## Create a virtual machine +## Create a Virtual Machine -The `--cloud=kvm` SLURM parameter specifies that a virtual machine should be started. +The Slurm parameter `--cloud=kvm` specifies that a virtual machine should be started. -### On Power9 architecture +### On Power9 Architecture -```Bash -rotscher@tauruslogin3:~> srun -p ml -N 1 -c 4 --hint=nomultithread --cloud=kvm --pty /bin/bash +```console +marie@login$ srun --partition=ml --nodes=1 --cpus-per-task=4 --hint=nomultithread --cloud=kvm --pty /bin/bash srun: job 6969616 queued and waiting for resources srun: job 6969616 has been allocated resources bash-4.2$ ``` -### On x86 architecture +### On x86 Architecture -```Bash -rotscher@tauruslogin3:~> srun -p hpdlf -N 1 -c 4 --hint=nomultithread --cloud=kvm --pty /bin/bash +```console +marie@login$ srun --partition=hpdlf --nodes=1 --cpus-per-task=4 --hint=nomultithread --cloud=kvm --pty /bin/bash srun: job 2969732 queued and waiting for resources srun: job 2969732 has been allocated resources bash-4.2$ ``` -## Access virtual machine +## Access a Virtual Machine -Since the security issue on Taurus, we restricted the file system permissions. Now you have to wait -until the file /tmp/${SLURM_JOB_USER}\_${SLURM_JOB_ID}/activate is created, then you can try to ssh -into the virtual machine (VM), but it could be that the VM needs some more seconds to boot and start -the SSH daemon. So you may need to try the `ssh` command multiple times till it succeeds. +After a security issue on ZIH systems, we restricted the filesystem permissions. Now, you have to +wait until the file `/tmp/${SLURM_JOB_USER}_${SLURM_JOB_ID}/activate` is created. Then, you can try +to connect via `ssh` into the virtual machine, but it could be that the virtual machine needs some +more seconds to boot and accept the connection. So you may need to try the `ssh` command multiple +times till it succeeds. 
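If you prefer not to retry by hand, a small shell sketch like the following waits for the
activation file and then sources it. This assumes the variables from the path above are set in
your job shell; the 5-second polling interval is an arbitrary choice.

```bash
# Wait until the activation file exists, then source it to connect to the VM.
until [[ -f "/tmp/${SLURM_JOB_USER}_${SLURM_JOB_ID}/activate" ]]; do
    sleep 5
done
source "/tmp/${SLURM_JOB_USER}_${SLURM_JOB_ID}/activate"
```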
-```Bash -bash-4.2$ cat /tmp/rotscher_2759627/activate +```console +bash-4.2$ cat /tmp/marie_2759627/activate #!/bin/bash -if ! grep -q -- "Key for the VM on the ml partition" "/home/rotscher/.ssh/authorized_keys" >& /dev/null; then - cat "/tmp/rotscher_2759627/kvm.pub" >> "/home/rotscher/.ssh/authorized_keys" +if ! grep -q -- "Key for the VM on the partition ml" "/home/marie/.ssh/authorized_keys" > /dev/null; then + cat "/tmp/marie_2759627/kvm.pub" >> "/home/marie/.ssh/authorized_keys" else - sed -i "s|.*Key for the VM on the ml partition.*|ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3siZfQ6vQ6PtXPG0RPZwtJXYYFY73TwGYgM6mhKoWHvg+ZzclbBWVU0OoU42B3Ddofld7TFE8sqkHM6M+9jh8u+pYH4rPZte0irw5/27yM73M93q1FyQLQ8Rbi2hurYl5gihCEqomda7NQVQUjdUNVc6fDAvF72giaoOxNYfvqAkw8lFyStpqTHSpcOIL7pm6f76Jx+DJg98sXAXkuf9QK8MurezYVj1qFMho570tY+83ukA04qQSMEY5QeZ+MJDhF0gh8NXjX/6+YQrdh8TklPgOCmcIOI8lwnPTUUieK109ndLsUFB5H0vKL27dA2LZ3ZK+XRCENdUbpdoG2Czz Key for the VM on the ml partition|" "/home/rotscher/.ssh/authorized_keys" + sed -i "s|.*Key for the VM on the partition ml.*|ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3siZfQ6vQ6PtXPG0RPZwtJXYYFY73TwGYgM6mhKoWHvg+ZzclbBWVU0OoU42B3Ddofld7TFE8sqkHM6M+9jh8u+pYH4rPZte0irw5/27yM73M93q1FyQLQ8Rbi2hurYl5gihCEqomda7NQVQUjdUNVc6fDAvF72giaoOxNYfvqAkw8lFyStpqTHSpcOIL7pm6f76Jx+DJg98sXAXkuf9QK8MurezYVj1qFMho570tY+83ukA04qQSMEY5QeZ+MJDhF0gh8NXjX/6+YQrdh8TklPgOCmcIOI8lwnPTUUieK109ndLsUFB5H0vKL27dA2LZ3ZK+XRCENdUbpdoG2Czz Key for the VM on the partition ml|" "/home/marie/.ssh/authorized_keys" fi -ssh -i /tmp/rotscher_2759627/kvm root@192.168.0.6 -bash-4.2$ source /tmp/rotscher_2759627/activate +ssh -i /tmp/marie_2759627/kvm root@192.168.0.6 +bash-4.2$ source /tmp/marie_2759627/activate Last login: Fri Jul 24 13:53:48 2020 from gateway -[root@rotscher_2759627 ~]# +[root@marie_2759627 ~]# ``` -## Example usage +## Example Usage ## Automation -We provide [Tools](vm_tools.md) to automate these steps. You may just type `startInVM --arch=power9` -on a tauruslogin node and you will be inside the VM with everything mounted. +We provide [tools](virtual_machines_tools.md) to automate these steps. You may just type `startInVM +--arch=power9` on a login node and you will be inside the VM with everything mounted. ## Known Issues ### Temporary Memory -The available space inside the VM can be queried with `df -h`. Currently the whole VM has 8G and -with the installed operating system, 6.6GB of available space. +The available space inside the VM can be queried with `df -h`. Currently the whole VM has 8 GB and +with the installed operating system, 6.6 GB of available space. -Sometimes the Singularity build might fail because of a disk out-of-memory error. In this case it +Sometimes, the Singularity build might fail because of a disk out-of-memory error. In this case, it might be enough to delete leftover temporary files from Singularity: -```Bash +```console rm -rf /tmp/sbuild-* ``` If that does not help, e.g., because one build alone needs more than the available disk memory, then it will be necessary to use the tmp folder on scratch. In order to ensure that the files in the -temporary folder will be owned by root, it is necessary to set up an image inside /scratch/tmp -instead of using it directly. E.g., to create a 25GB of temporary memory image: +temporary folder will be owned by root, it is necessary to set up an image inside `/scratch/tmp` +instead of using it directly. 
E.g., to create a 25 GB of temporary memory image: -```Bash +```console tmpDir="$( mktemp -d --tmpdir=/host_data/tmp )" && tmpImg="$tmpDir/singularity-build-temp-dir" export LANG_BACKUP=$LANG unset LANG @@ -90,13 +91,17 @@ truncate -s 25G "$tmpImg.ext4" && echo yes | mkfs.ext4 "$tmpImg.ext4" export LANG=$LANG_BACKUP ``` -The image can now be mounted and with the **SINGULARITY_TMPDIR** environment variable can be +The image can now be mounted and with the `SINGULARITY_TMPDIR` environment variable can be specified as the temporary directory for Singularity builds. Unfortunately, because of an open Singularity [bug](https://github.com/sylabs/singularity/issues/32) it is should be avoided to mount -the image using **/dev/loop0**. +the image using `/dev/loop0`. -```Bash -mkdir -p "$tmpImg" && i=1 && while test -e "/dev/loop$i"; do (( ++i )); done && mknod -m 0660 "/dev/loop$i" b 7 "$i"<br />mount -o loop="/dev/loop$i" "$tmpImg"{.ext4,}<br /><br />export SINGULARITY_TMPDIR="$tmpImg"<br /><br />singularity build my-container.{sif,def} +```console +mkdir -p "$tmpImg" && i=1 && while test -e "/dev/loop$i"; do (( ++i )); done && mknod -m 0660 "/dev/loop$i" b 7 "$i" +mount -o loop="/dev/loop$i" "$tmpImg"{.ext4,} + +export SINGULARITY_TMPDIR="$tmpImg" +singularity build my-container.{sif,def} ``` The architecture of the base image is automatically chosen when you use an image from DockerHub. @@ -106,4 +111,4 @@ Bootstraps **shub** and **library** should be avoided. ### Transport Endpoint is not Connected This happens when the SSHFS mount gets unmounted because it is not very stable. It is sufficient to -run `\~/mount_host_data.sh` again or just the sshfs command inside that script. +run `~/mount_host_data.sh` again or just the SSHFS command inside that script. diff --git a/doc.zih.tu-dresden.de/docs/software/virtual_machines_tools.md b/doc.zih.tu-dresden.de/docs/software/virtual_machines_tools.md new file mode 100644 index 0000000000000000000000000000000000000000..fbec2e51bc453cc17e2d131d7229c50ff90aa23f --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/virtual_machines_tools.md @@ -0,0 +1,135 @@ +# Singularity on Partition `ml` + +!!! note "Root privileges" + + Building Singularity containers from a recipe on ZIH system is normally not possible due to the + requirement of root (administrator) rights, see [Containers](containers.md). For obvious reasons + users cannot be granted root permissions. + +The solution is to build your container on your local Linux workstation using Singularity and copy +it to ZIH systems for execution. + +**This does not work on the partition `ml`** as it uses the Power9 architecture which your +workstation likely doesn't. + +For this, we provide a Virtual Machine (VM) on the partition `ml` which allows users to gain root +permissions in an isolated environment. The workflow to use this manually is described for +[virtual machines](virtual_machines.md) but is quite cumbersome. + +To make this easier, two programs are provided: `buildSingularityImage` and `startInVM`, which do +what they say. The latter is for more advanced use cases, so you should be fine using +`buildSingularityImage`, see the following section. + +!!! note "SSH key without password" + + You need to have your default SSH key without a password for the scripts to work as + entering a password through the scripts is not supported. + +**The recommended workflow** is to create and test a definition file locally. You usually start from +a base Docker container. 
Those typically exist for different architectures but with a common name
+(e.g. `ubuntu:18.04`). Singularity automatically uses the correct Docker container for your current
+architecture when building. So, in most cases, you can write your definition file, build it and test
+it locally, then move it to ZIH systems and build it on Power9 (partition `ml`) without any further
+changes. However, sometimes Docker containers for different architectures have different suffixes,
+in which case you'd need to change that when moving to ZIH systems.
+
+## Build a Singularity Container in a Job
+
+To build a Singularity container for the Power9 architecture on ZIH systems, simply run:
+
+```console
+marie@login$ buildSingularityImage --arch=power9 myContainer.sif myDefinition.def
+```
+
+To build a Singularity image for the x86 architecture, run:
+
+```console
+marie@login$ buildSingularityImage --arch=x86 myContainer.sif myDefinition.def
+```
+
+These commands will submit a batch job and immediately return. If you want it to block while the
+image is built and see live output, add the option `--interactive`:
+
+```console
+marie@login$ buildSingularityImage --arch=power9 --interactive myContainer.sif myDefinition.def
+```
+
+There are more options available, which can be shown by running `buildSingularityImage --help`. All
+have reasonable defaults. The most important ones are:
+
+* `--time <time>`: Set a higher job time if the default time is not
+  enough to build your image and your job is canceled before completing. The format is the same as
+  for Slurm.
+* `--tmp-size=<size in GB>`: Set the size used for the temporary
+  location of the Singularity container, basically the size of the extracted container.
+* `--output=<file>`: Path to a file used for (log) output generated
+  while building your container.
+* Various Singularity options are passed through, e.g.
+  `--notest, --force, --update`. See, e.g., `singularity --help` for details.
+
+For **advanced users**, it is also possible to manually request a job with a VM (`srun -p ml
+--cloud=kvm ...`) and then use this script to build a Singularity container from within the job. In
+this case, the `--arch` and other Slurm-related parameters are not required. The advantage of using
+this script is that it automates the waiting for the VM and mounting of host directories into it
+(can also be done with `startInVM`) and creates a temporary directory usable with Singularity inside
+the VM, controlled by the `--tmp-size` parameter.
+
+## Filesystem
+
+**Read here if you have problems like "File not found".**
+
+As the build starts in a VM, you may not have access to all your files. It is usually bad practice
+to refer to local files from inside a definition file anyway, as this reduces reproducibility.
+However, common directories are available by default. For others, care must be taken. In short:
+
+* `/home/$USER` and `/scratch/$USER` are available and should be used. `/scratch/<group>` also
+  works for all groups the user is in.
+* `/projects/<group>` works similarly, but is read-only! So don't use it to store your generated
+  container directly, but rather move the container there afterwards.
+* `/tmp` is the VM-local temporary directory. All files put here will be lost!
+
+If the current directory is inside (or equal to) one of the above (except `/tmp`), then relative paths
+for container and definition work, as the script changes to the VM equivalent of the current
+directory. Otherwise, you need to use absolute paths. Using `~` in place of `$HOME` does work too.
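For instance, a build relying on relative paths could look like the following; the directory and
file names are only placeholders:

```console
marie@login$ cd /scratch/marie
marie@login$ buildSingularityImage --arch=power9 myContainer.sif myDefinition.def
```

Because `/scratch/$USER` is one of the directories available inside the VM, both the container and
the definition file can be given relative to the current directory here.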
+ +Under the hood, the filesystem of ZIH systems is mounted via SSHFS at `/host_data`. So if you need any +other files, they can be found there. + +There is also a new SSH key named `kvm` which is created by the scripts and authorized inside the VM +to allow for password-less access to SSHFS. This is stored at `~/.ssh/kvm` and regenerated if it +does not exist. It is also added to `~/.ssh/authorized_keys`. Note that removing the key file does +not remove it from `authorized_keys`, so remove it manually if you need to. It can be easily +identified by the comment on the key. However, removing this key is **NOT** recommended, as it +needs to be re-generated on every script run. + +## Start a Job in a VM + +Especially when developing a Singularity definition file, it might be useful to get a shell directly +on a VM. To do so on the power9-architecture, simply run: + +```console +startInVM --arch=power9 +``` + +To do so on the x86-architecture, run: + +```console +startInVM --arch=x86 +``` + +This will execute an `srun` command with the `--cloud=kvm` parameter, wait till the VM is ready, +mount all folders (just like `buildSingularityImage`, see the Filesystem section above) and come +back with a bash inside the VM. Inside that you are root, so you can directly execute `singularity +build` commands. + +As usual, more options can be shown by running `startInVM --help`, the most important one being +`--time`. + +There are two special use cases for this script: + +1. Execute an arbitrary command inside the VM instead of getting a bash by appending the command to + the script. Example: `startInVM --arch=power9 singularity build ~/myContainer.sif ~/myDefinition.de` +1. Use the script in a job manually allocated via srun/sbatch. This will work the same as when + running outside a job but will **not** start a new job. This is useful for using it inside batch + scripts, when you already have an allocation or need special arguments for the job system. Again, + you can run an arbitrary command by passing it to the script. diff --git a/doc.zih.tu-dresden.de/docs/software/visualization.md b/doc.zih.tu-dresden.de/docs/software/visualization.md index 328acc490f5fa5c65e687d50bf9f43ceae44c541..f1e551c968cb4478069c98e691eef11bce7ccb01 100644 --- a/doc.zih.tu-dresden.de/docs/software/visualization.md +++ b/doc.zih.tu-dresden.de/docs/software/visualization.md @@ -49,10 +49,10 @@ marie@login$ mpiexec -bind-to -help` or from [mpich wiki](https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Process-core_Binding%7Cwiki.mpich.org). -In the following, we provide two examples on how to use `pvbatch` from within a jobfile and an +In the following, we provide two examples on how to use `pvbatch` from within a job file and an interactive allocation. -??? example "Example jobfile" +??? example "Example job file" ```Bash #!/bin/bash @@ -97,7 +97,7 @@ cards (GPUs) specified by the device index. For that, make sure to use the modul *-egl*, e.g., `ParaView/5.9.0-RC1-egl-mpi-Python-3.8`, and pass the option `--egl-device-index=$CUDA_VISIBLE_DEVICES`. -??? example "Example jobfile" +??? example "Example job file" ```Bash #!/bin/bash @@ -171,7 +171,7 @@ are outputed.* This contains the node name which your job and server runs on. However, since the node names of the cluster are not present in the public domain name system (only cluster-internally), you cannot just use this line as-is for connection with your client. 
**You first have to resolve** the name to an IP -address on ZIH systems: Suffix the nodename with `-mn` to get the management network (ethernet) +address on ZIH systems: Suffix the node name with `-mn` to get the management network (ethernet) address, and pass it to a lookup-tool like `host` in another SSH session: ```console diff --git a/doc.zih.tu-dresden.de/docs/software/vm_tools.md b/doc.zih.tu-dresden.de/docs/software/vm_tools.md deleted file mode 100644 index 5a4d58a7e2ac7a1532d5029312e3ff3b479d7939..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/vm_tools.md +++ /dev/null @@ -1,123 +0,0 @@ -# Singularity on Power9 / ml partition - -Building Singularity containers from a recipe on Taurus is normally not possible due to the -requirement of root (administrator) rights, see [Containers](containers.md). For obvious reasons -users on Taurus cannot be granted root permissions. - -The solution is to build your container on your local Linux machine by executing something like - -```Bash -sudo singularity build myContainer.sif myDefinition.def -``` - -Then you can copy the resulting myContainer.sif to Taurus and execute it there. - -This does **not** work on the ml partition as it uses the Power9 architecture which your laptop -likely doesn't. - -For this we provide a Virtual Machine (VM) on the ml partition which allows users to gain root -permissions in an isolated environment. The workflow to use this manually is described at -[another page](virtual_machines.md) but is quite cumbersome. - -To make this easier two programs are provided: `buildSingularityImage` and `startInVM` which do what -they say. The latter is for more advanced use cases so you should be fine using -*buildSingularityImage*, see the following section. - -**IMPORTANT:** You need to have your default SSH key without a password for the scripts to work as -entering a password through the scripts is not supported. - -**The recommended workflow** is to create and test a definition file locally. You usually start from -a base Docker container. Those typically exist for different architectures but with a common name -(e.g. 'ubuntu:18.04'). Singularity automatically uses the correct Docker container for your current -architecture when building. So in most cases you can write your definition file, build it and test -it locally, then move it to Taurus and build it on Power9 without any further changes. However, -sometimes Docker containers for different architectures have different suffixes, in which case you'd -need to change that when moving to Taurus. - -## Building a Singularity container in a job - -To build a singularity container on Taurus simply run: - -```Bash -buildSingularityImage --arch=power9 myContainer.sif myDefinition.def -``` - -This command will submit a batch job and immediately return. Note that while "power9" is currently -the only supported architecture, the parameter is still required. If you want it to block while the -image is built and see live output, use the parameter `--interactive`: - -```Bash -buildSingularityImage --arch=power9 --interactive myContainer.sif myDefinition.def -``` - -There are more options available which can be shown by running `buildSingularityImage --help`. All -have reasonable defaults.The most important ones are: - -- `--time <time>`: Set a higher job time if the default time is not - enough to build your image and your job is cancelled before completing. The format is the same - as for SLURM. 
-- `--tmp-size=<size in GB>`: Set a size used for the temporary - location of the Singularity container. Basically the size of the extracted container. -- `--output=<file>`: Path to a file used for (log) output generated - while building your container. -- Various singularity options are passed through. E.g. - `--notest, --force, --update`. See, e.g., `singularity --help` for details. - -For **advanced users** it is also possible to manually request a job with a VM (`srun -p ml ---cloud=kvm ...`) and then use this script to build a Singularity container from within the job. In -this case the `--arch` and other SLURM related parameters are not required. The advantage of using -this script is that it automates the waiting for the VM and mounting of host directories into it -(can also be done with `startInVM`) and creates a temporary directory usable with Singularity inside -the VM controlled by the `--tmp-size` parameter. - -## Filesystem - -**Read here if you have problems like "File not found".** - -As the build starts in a VM you may not have access to all your files. It is usually bad practice -to refer to local files from inside a definition file anyway as this reduces reproducibility. -However common directories are available by default. For others, care must be taken. In short: - -- `/home/$USER`, `/scratch/$USER` are available and should be used `/scratch/\<group>` also works for -- all groups the users is in `/projects/\<group>` similar, but is read-only! So don't use this to - store your generated container directly, but rather move it here afterwards -- /tmp is the VM local temporary directory. All files put here will be lost! - -If the current directory is inside (or equal to) one of the above (except `/tmp`), then relative paths -for container and definition work as the script changes to the VM equivalent of the current -directory. Otherwise you need to use absolute paths. Using `~` in place of `$HOME` does work too. - -Under the hood, the filesystem of Taurus is mounted via SSHFS at `/host_data`, so if you need any -other files they can be found there. - -There is also a new SSH key named "kvm" which is created by the scripts and authorized inside the VM -to allow for password-less access to SSHFS. This is stored at `~/.ssh/kvm` and regenerated if it -does not exist. It is also added to `~/.ssh/authorized_keys`. Note that removing the key file does -not remove it from `authorized_keys`, so remove it manually if you need to. It can be easily -identified by the comment on the key. However, removing this key is **NOT** recommended, as it -needs to be re-generated on every script run. - -## Starting a Job in a VM - -Especially when developing a Singularity definition file it might be useful to get a shell directly -on a VM. To do so simply run: - -```Bash -startInVM --arch=power9 -``` - -This will execute an `srun` command with the `--cloud=kvm` parameter, wait till the VM is ready, -mount all folders (just like `buildSingularityImage`, see the Filesystem section above) and come -back with a bash inside the VM. Inside that you are root, so you can directly execute `singularity -build` commands. - -As usual more options can be shown by running `startInVM --help`, the most important one being -`--time`. - -There are 2 special use cases for this script: 1 Execute an arbitrary command inside the VM instead -of getting a bash by appending the command to the script. 
Example: \<pre>startInVM --arch=power9 -singularity build \~/myContainer.sif \~/myDefinition.def\</pre> 1 Use the script in a job manually -allocated via srun/sbatch. This will work the same as when running outside a job but will **not** -start a new job. This is useful for using it inside batch scripts, when you already have an -allocation or need special arguments for the job system. Again you can run an arbitrary command by -passing it to the script. diff --git a/doc.zih.tu-dresden.de/docs/software/zsh.md b/doc.zih.tu-dresden.de/docs/software/zsh.md new file mode 100644 index 0000000000000000000000000000000000000000..147758a6a66dd84aeb040c80d0000110f4af882c --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/zsh.md @@ -0,0 +1,238 @@ +# ZSH + +!!! warning + Though all efforts have been made to ensure the accuracy and + currency of the content on this website, please be advised that + some content might be out of date and there is no continuous + website support available. In case of any ambiguity or doubts, + users are advised to do their own research on the content's + accuracy and currency. + +The [ZSH](https://www.zsh.org), short for `z-shell`, is an alternative shell for Linux that offers +many convenience features for productive use that `bash`, the default shell, does not offer. + +This should be a short introduction to `zsh` and offer some examples that are especially useful +on ZIH systems. + +## `oh-my-zsh` + +`oh-my-zsh` is a plugin that adds many features to the `zsh` with a very simple install. Simply run: + +``` +marie@login$ sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" +``` + +and then, if it is not already your login shell, run `zsh` or re-login. + +The rest of this document assumes that you have `oh-my-zsh` installed and running. + +## Features + +### Themes + +There are many different themes for the `zsh`. See the +[GitHub-page of `oh-my-zsh`](https://github.com/ohmyzsh/ohmyzsh) for more details. + +### Auto-completion + +`zsh` offers more auto-completion features than `bash`. You can auto-complete programs, filenames, parameters, +`man`-pages and a lot more, and you can cycle through the suggestions with `TAB`-button. + + + +### Syntax-highlighting + +When you add this line to your `~/.zshrc` with `oh-my-zsh` installed, you get syntax-highlighting directly +in the shell: + +```bash +plugins+=( + zsh-syntax-highlighting +) +``` + + + +### Typo-correction + +With + +```bash +setopt correct_all +ENABLE_CORRECTION="true" +``` + +in `~/.zshrc` you get correction suggestions when the shell thinks +that it might be what you want, e.g. when a command +is expected to be handed an existing file. + + + +### Automatic `cd` + +Adding `AUTO_CD` to `~/.zshrc` file allows to leave out the `cd` when a folder name is provided. + +```bash +setopt AUTO_CD +``` + + + +### `fish`-like auto-suggestions + +Install [`zsh-autosuggestions`](https://github.com/zsh-users/zsh-autosuggestions) to get `fish`-shell-like +auto-suggestions of previous commands that start with the same letters and that you can complete with +the right arrow key. + + + +??? 
example "Addons for your shell" + === "`bash`" + ```bash + # Create a new directory and directly `cd` into it + mcd () { + mkdir -p $1 + cd $1 + } + + # Find the largest files in the current directory easily + function treesizethis { + du -k --max-depth=1 | sort -nr | awk ' + BEGIN { + split("KB,MB,GB,TB", Units, ","); + } + { + u = 1; + while ($1 >= 1024) { + $1 = $1 / 1024; + u += 1 + } + $1 = sprintf("%.1f %s", $1, Units[u]); + print $0; + } + ' + } + + #This allows you to run `slurmlogpath $SLURM_ID` and get the log-path directly in stdout: + function slurmlogpath { + scontrol show job $1 | sed -n -e 's/^\s*StdOut=//p' + } + + # `ftails` follow-tails a slurm-log. Call it without parameters to tail the only running job or + # get a list of running jobs or use `ftails $JOBID` to tail a specific job + function ftails { + JOBID=$1 + if [[ -z $JOBID ]]; then + JOBS=$(squeue --format="%i \\'%j\\' " --me | grep -v JOBID) + NUMBER_OF_JOBS=$(echo "$JOBS" | wc -l) + JOBID= + if [[ "$NUMBER_OF_JOBS" -eq 1 ]]; then + JOBID=$(echo $JOBS | sed -e "s/'//g" | sed -e 's/ .*//') + else + JOBS=$(echo $JOBS | tr -d '\n') + JOBID=$(eval "whiptail --title 'Choose jobs to tail' --menu 'Choose Job to tail' 25 78 16 $JOBS" 3>&1 1>&2 2>&3) + fi + fi + SLURMLOGPATH=$(slurmlogpath $JOBID) + if [[ -e $SLURMLOGPATH ]]; then + tail -n100 -f $SLURMLOGPATH + else + echo "No slurm-log-file found" + fi + } + + #With this, you only need to type `sq` instead of `squeue -u $USER`. + alias sq="squeue --me" + ``` + === "`zsh`" + ```bash + # Create a new directory and directly `cd` into it + mcd () { + mkdir -p $1 + cd $1 + } + + # Find the largest files in the current directory easily + function treesizethis { + du -k --max-depth=1 | sort -nr | awk ' + BEGIN { + split("KB,MB,GB,TB", Units, ","); + } + { + u = 1; + while ($1 >= 1024) { + $1 = $1 / 1024; + u += 1 + } + $1 = sprintf("%.1f %s", $1, Units[u]); + print $0; + } + ' + } + + #This allows you to run `slurmlogpath $SLURM_ID` and get the log-path directly in stdout: + function slurmlogpath { + scontrol show job $1 | sed -n -e 's/^\s*StdOut=//p' + } + + # `ftails` follow-tails a slurm-log. Call it without parameters to tail the only running job or + # get a list of running jobs or use `ftails $JOBID` to tail a specific job + function ftails { + JOBID=$1 + if [[ -z $JOBID ]]; then + JOBS=$(squeue --format="%i \\'%j\\' " --me | grep -v JOBID) + NUMBER_OF_JOBS=$(echo "$JOBS" | wc -l) + JOBID= + if [[ "$NUMBER_OF_JOBS" -eq 1 ]]; then + JOBID=$(echo $JOBS | sed -e "s/'//g" | sed -e 's/ .*//') + else + JOBS=$(echo $JOBS | tr -d '\n') + JOBID=$(eval "whiptail --title 'Choose jobs to tail' --menu 'Choose Job to tail' 25 78 16 $JOBS" 3>&1 1>&2 2>&3) + fi + fi + SLURMLOGPATH=$(slurmlogpath $JOBID) + if [[ -e $SLURMLOGPATH ]]; then + tail -n100 -f $SLURMLOGPATH + else + echo "No slurm-log-file found" + fi + } + + #With this, you only need to type `sq` instead of `squeue -u $USER`. + alias sq="squeue --me" + + #This will automatically replace `...` with `../..` and `....` with `../../..` + # and so on (each additional `.` adding another `/..`) when typing commands: + rationalise-dot() { + if [[ $LBUFFER = *.. ]]; then + LBUFFER+=/.. + else + LBUFFER+=. + fi + } + zle -N rationalise-dot + bindkey . 
rationalise-dot + + # This allows auto-completion for `module load`: + function _module { + MODULE_COMMANDS=( + '-t:Show computer parsable output' + 'load:Load a module' + 'unload:Unload a module' + 'spider:Search for a module' + 'avail:Show available modules' + 'list:List loaded modules' + ) + + MODULE_COMMANDS_STR=$(printf "\n'%s'" "${MODULE_COMMANDS[@]}") + + eval "_describe 'command' \"($MODULE_COMMANDS_STR)\"" + _values -s ' ' 'flags' $(ml -t avail | sed -e 's#/$##' | tr '\n' ' ') + } + + compdef _module "module" + ``` + +## Setting `zsh` as default-shell + +Please ask HPC support if you want to set the `zsh` as your default login shell. diff --git a/doc.zih.tu-dresden.de/docs/specific_software.md b/doc.zih.tu-dresden.de/docs/specific_software.md deleted file mode 100644 index fd98e303e5448ae7ce128ddfbc4e78c63e754075..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/specific_software.md +++ /dev/null @@ -1,44 +0,0 @@ -# Use of Specific Software (packages, libraries, etc) - -## Modular System - -The modular concept is the easiest way to work with the software on Taurus. It allows to user to -switch between different versions of installed programs and provides utilities for the dynamic -modification of a user's environment. The information can be found [here]**todo link**. - -### Private project and user modules files - -[Private project module files]**todo link** allow you to load your group-wide installed software -into your environment and to handle different versions. It allows creating your own software -environment for the project. You can create a list of modules that will be loaded for every member -of the team. It gives opportunity on unifying work of the team and defines the reproducibility of -results. Private modules can be loaded like other modules with module load. - -[Private user module files]**todo link** allow you to load your own installed software into your -environment. It works in the same manner as to project modules but for your private use. - -## Use of containers - -[Containerization]**todo link** encapsulating or packaging up software code and all its dependencies -to run uniformly and consistently on any infrastructure. On Taurus [Singularity]**todo link** used -as a standard container solution. Singularity enables users to have full control of their -environment. This means that you don’t have to ask an HPC support to install anything for you - you -can put it in a Singularity container and run! As opposed to Docker (the most famous container -solution), Singularity is much more suited to being used in an HPC environment and more efficient in -many cases. Docker containers can easily be used in Singularity. Information about the use of -Singularity on Taurus can be found [here]**todo link**. - -In some cases using Singularity requires a Linux machine with root privileges (e.g. using the ml -partition), the same architecture and a compatible kernel. For many reasons, users on Taurus cannot -be granted root permissions. A solution is a Virtual Machine (VM) on the ml partition which allows -users to gain root permissions in an isolated environment. There are two main options on how to work -with VM on Taurus: - - 1. [VM tools]**todo link**. Automative algorithms for using virtual machines; - 1. [Manual method]**todo link**. It required more operations but gives you more flexibility and reliability. 
- -Additional Information: Examples of the definition for the Singularity container ([here]**todo -link**) and some hints ([here]**todo link**). - -Useful links: [Containers]**todo link**, [Custom EasyBuild Environment]**todo link**, [Virtual -machine on Taurus]**todo link** diff --git a/doc.zih.tu-dresden.de/docs/support.md b/doc.zih.tu-dresden.de/docs/support.md deleted file mode 100644 index d85f71226115f277cef27bdb6841e276e85ec1d9..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/support.md +++ /dev/null @@ -1,19 +0,0 @@ -# What if everything didn't help? - -## Create a Ticket: how do I do that? - -The best way to ask about the help is to create a ticket. In order to do that you have to write a -message to the <a href="mailto:hpcsupport@zih.tu-dresden.de">hpcsupport@zih.tu-dresden.de</a> with a -detailed description of your problem. If possible please add logs, used environment and write a -minimal executable example for the purpose to recreate the error or issue. - -## Communication with HPC Support - -There is the HPC support team who is responsible for the support of HPC users and stable work of the -cluster. You could find the [details]**todo link** in the right part of any page of the compendium. -However, please, before the contact with the HPC support team check the documentation carefully -(starting points: [main page]**todo link**, [HPC-DA]**todo link**), use a search and then create a -ticket. The ticket is a preferred way to solve the issue, but in some terminable cases, you can call -to ask for help. - -Useful link: [Further Documentation]**todo link** diff --git a/doc.zih.tu-dresden.de/docs/support/support.md b/doc.zih.tu-dresden.de/docs/support/support.md new file mode 100644 index 0000000000000000000000000000000000000000..c2c9fbda8bbb70c1dddb82fb384b69a8201e6fb8 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/support/support.md @@ -0,0 +1,31 @@ +# How to Ask for Support + +## Create a Ticket + +The best way to ask for help send a message to +[hpcsupport@zih.tu-dresden.de](mailto:hpcsupport@zih.tu-dresden.de) with a +detailed description of your problem. + +It should include: + +- Who is reporting? (login name) +- Where have you seen the problem? (name of the HPC system and/or of the node) +- When has the issue occurred? Maybe, when did it work last? +- What exactly happened? + +If possible include + +- job ID, +- batch script, +- filesystem path, +- loaded modules and environment, +- output and error logs, +- steps to reproduce the error. + +This email automatically opens a trouble ticket which will be tracked by the HPC team. Please +always keep the ticket number in the subject on your answers so that our system can keep track +on our communication. + +For a new request, please simply send a new email (without any ticket number). + +!!! hint "Please try to find an answer in this documentation first." diff --git a/doc.zih.tu-dresden.de/docs/tests.md b/doc.zih.tu-dresden.de/docs/tests.md deleted file mode 100644 index 7601eb3748d21ce8d414cdb24c7ebef9c0a68cd4..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/tests.md +++ /dev/null @@ -1,12 +0,0 @@ -# Tests - -Dies ist eine Seite zum Testen der Markdown-Syntax. 
- -```python -import os - -def debug(mystring): - print("Debug: ", mystring) - -debug("Dies ist ein Syntax-Highligthing-Test") -``` diff --git a/doc.zih.tu-dresden.de/hackathon.md b/doc.zih.tu-dresden.de/hackathon.md deleted file mode 100644 index 4a49d2b68ede0134d9672d6b8513ceb8d0210060..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/hackathon.md +++ /dev/null @@ -1,71 +0,0 @@ -# Hackathon June 2021 - -The goals for the hackathon are: - -* Familiarize main editors (ZIH admin group and domain experts) with new workflow and system -* Bringing new compendium to life by - 1. Transferring content from old compendium into new structure and system - 1. Fixing checks - 1. Reviewing and updating transferred content - -## twiki2md - -The script `twiki2md` converts twiki source files into markdown source files using pandoc. It outputs the -markdown source files according to the old pages tree into subdirectories. The output and **starting -point for transferring** old content into the new system can be found at branch `preview` within -directory `twiki2md/root/`. - -## Steps - -### Familiarize with New Wiki System - -* Make sure your are member of the [repository](https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium). - If not, ask Danny Rotscher for adding you. -* Clone repository and checkout branch `preview` - -```Shell Session -~ git clone git@gitlab.hrz.tu-chemnitz.de:zih/hpc-compendium/hpc-compendium.git -~ cd hpc-compendium -~ git checkout preview -``` - -* Open terminal and build documentation using `mkdocs` - * [using mkdocs](README.md#preview-using-mkdocs) - * [installing dependencies](README.md#install-dependencies) - -### Transferring Content - -1. Grab a markdown source file from `twiki2md/root/` directory (a topic you are comfortable with) -1. Find place in new structure according to -[Typical Project Schedule](https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/TypicalProjectSchedule) - * Create new feature branch holding your work `~ git checkout -b <BRANCHNAME>`, whereas branch name can - be `<FILENAME>` for simplicity - * Copy reviewed markdown source file to `docs/` directory via - `~ git mv twiki2md/root/<FILENAME>.md doc.zih.tu-dresden.de/docs/<SUBDIR>/<FILENAME>.md` - * Update navigation section in `mkdocs.yaml` -1. Commit and push to feature branch via -```Shell Session -~ git commit docs/<SUBDIR>/<FILENAME>.md mkdocs.yaml -m "MESSAGE" -~ git push origin <BRANCHNAME> -``` -1. Run checks locally and fix the issues. Otherwise the pipeline will fail. - * [Check links](README.md#check-links) (There might be broken links which can only be solved - with ongoing transfer of content.) - * [Check pages structure](README.md#check-pages-structure) - * [Markdown Linter](README.md#markdown-linter) -1. Create - [merge request](https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium/-/merge_requests) - against `preview` branch - -### Review Content - -The following steps are optional in a sense, that the first goal of the hackathon is to transfer all -old pages into new structure. 
If this is done, the content of the files need to be reviewed: - - * Remove outdated information - * Update content - * Apply [writing style](README.md#writing-style) - * Replace or remove (leftover) html constructs in markdown source file - * Add ticks for code blocks and command if necessary - * Fix internal links (mark as todo if necessary) - * Review and update, remove outdated content diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index 5331be71aa55c06dfdb1ec70813697a423dff3e7..4a4a5eb7db818e29c9223f8ccaa64b43b367946b 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -18,19 +18,22 @@ nav: - Security Restrictions: access/security_restrictions.md - Transfer of Data: - Overview: data_transfer/overview.md - - Data Mover: data_transfer/data_mover.md + - Datamover: data_transfer/datamover.md - Export Nodes: data_transfer/export_nodes.md - Environment and Software: - Overview: software/overview.md - Environment: - Modules: software/modules.md - - Runtime Environment: software/runtime_environment.md + - Private Modulefiles: software/private_modules.md - Custom EasyBuild Modules: software/custom_easy_build_environment.md + - Python Virtual Environments: software/python_virtual_environments.md + - ZSH: software/zsh.md - Containers: - Singularity: software/containers.md - - Singularity Recicpe Hints: software/singularity_recipe_hints.md - - Singularity Example Definitions: software/singularity_example_definitions.md - - VM tools: software/vm_tools.md + - Singularity Recipes and Hints: software/singularity_recipe_hints.md + - Virtual Machines Tools: software/virtual_machines_tools.md + - Virtual Machines: software/virtual_machines.md + - NGC Containers: software/ngc_containers.md - Applications: - Licenses: software/licenses.md - Computational Fluid Dynamics (CFD): software/cfd.md @@ -38,23 +41,21 @@ nav: - Nanoscale Simulations: software/nanoscale_simulations.md - FEM Software: software/fem_software.md - Visualization: software/visualization.md - - HPC-DA: - - Get started with HPC-DA: software/get_started_with_hpcda.md - - Machine Learning: software/machine_learning.md - - Deep Learning: software/deep_learning.md + - Data Analytics: + - Overview: software/data_analytics.md - Data Analytics with R: software/data_analytics_with_r.md - - Data Analytics with Python: software/python.md - - TensorFlow: - - TensorFlow Overview: software/tensorflow.md - - TensorFlow in Container: software/tensorflow_container_on_hpcda.md - - TensorFlow in JupyterHub: software/tensorflow_on_jupyter_notebook.md - - Keras: software/keras.md - - Dask: software/dask.md - - Power AI: software/power_ai.md + - Data Analytics with RStudio: software/data_analytics_with_rstudio.md + - Data Analytics with Python: software/data_analytics_with_python.md + - Big Data Analytics: software/big_data_frameworks.md + - Machine Learning: + - Overview: software/machine_learning.md + - TensorFlow: software/tensorflow.md + - TensorBoard: software/tensorboard.md - PyTorch: software/pytorch.md - - Apache Spark, Apache Flink, Apache Hadoop: software/big_data_frameworks.md + - Distributed Training: software/distributed_training.md + - Hyperparameter Optimization (OmniOpt): software/hyperparameter_optimization.md + - PowerAI: software/power_ai.md - SCS5 Migration Hints: software/scs5_software.md - - Virtual Machines: software/virtual_machines.md - Virtual Desktops: software/virtual_desktops.md - Software Development and Tools: - Overview: 
software/software_development_overview.md @@ -65,10 +66,10 @@ nav: - Debugging: software/debuggers.md - MPI Error Detection: software/mpi_usage_error_detection.md - Score-P: software/scorep.md + - lo2s: software/lo2s.md - PAPI Library: software/papi.md - Pika: software/pika.md - - Perf Tools: software/perf_tools.md - - Score-P: software/scorep.md + - Perf Tools: software/perf_tools.md - Vampir: software/vampir.md - Data Life Cycle Management: - Overview: data_lifecycle/overview.md @@ -79,38 +80,28 @@ nav: - BeeGFS: data_lifecycle/beegfs.md - Warm Archive: data_lifecycle/warm_archive.md - Intermediate Archive: data_lifecycle/intermediate_archive.md - - Quotas: data_lifecycle/quotas.md - Workspaces: data_lifecycle/workspaces.md - Preservation of Research Data: data_lifecycle/preservation_research_data.md - Structuring Experiments: data_lifecycle/experiments.md - HPC Resources and Jobs: - Overview: jobs_and_resources/overview.md - - Batch Systems: jobs_and_resources/batch_systems.md - HPC Resources: - - Hardware Taurus: jobs_and_resources/hardware_taurus.md + - Overview: jobs_and_resources/hardware_overview.md - AMD Rome Nodes: jobs_and_resources/rome_nodes.md - IBM Power9 Nodes: jobs_and_resources/power9.md - NVMe Storage: jobs_and_resources/nvme_storage.md - Alpha Centauri: jobs_and_resources/alpha_centauri.md - HPE Superdome Flex: jobs_and_resources/sd_flex.md - - Checkpoint/Restart: jobs_and_resources/checkpoint_restart.md - - Overview2: jobs_and_resources/index.md - - Taurus: jobs_and_resources/system_taurus.md - - Slurm Examples: jobs_and_resources/slurm_examples.md - - Slurm: jobs_and_resources/slurm.md - - HPC-DA: jobs_and_resources/hpcda.md - - Binding And Distribution Of Tasks: jobs_and_resources/binding_and_distribution_of_tasks.md - # - Queue Policy: jobs/policy.md - # - Examples: jobs/examples/index.md - # - Affinity: jobs/affinity/index.md - # - Interactive: jobs/interactive.md - # - Best Practices: jobs/best-practices.md - # - Reservations: jobs/reservations.md - # - Monitoring: jobs/monitoring.md - # - FAQs: jobs/jobs-faq.md - #- Tests: tests.md - - Support: support.md - - Archive: + - Running Jobs: + - Batch System Slurm: jobs_and_resources/slurm.md + - Job Examples: jobs_and_resources/slurm_examples.md + - Partitions and Limits : jobs_and_resources/partitions_and_limits.md + - Checkpoint/Restart: jobs_and_resources/checkpoint_restart.md + - Job Profiling: jobs_and_resources/slurm_profiling.md + - Binding And Distribution Of Tasks: jobs_and_resources/binding_and_distribution_of_tasks.md + - Support: + - How to Ask for Support: support/support.md + - Archive of the Old Wiki: - Overview: archive/overview.md - Bio Informatics: archive/bioinformatics.md - CXFS End of Support: archive/cxfs_end_of_support.md @@ -119,6 +110,7 @@ nav: - Phase2 Migration: archive/phase2_migration.md - Platform LSF: archive/platform_lsf.md - BeeGFS on Demand: archive/beegfs_on_demand.md + - Install JupyterHub: archive/install_jupyter.md - Switched-Off Systems: - Overview: archive/systems_switched_off.md - From Deimos to Atlas: archive/migrate_to_atlas.md @@ -133,32 +125,48 @@ nav: - UNICORE Rest API: archive/unicore_rest_api.md - VampirTrace: archive/vampirtrace.md - Windows Batchjobs: archive/windows_batch.md - - + - Contribute: + - How-To: contrib/howto_contribute.md + - Content Rules: contrib/content_rules.md + - Browser-based Editing: contrib/contribute_browser.md + - Work Locally Using Containers: contrib/contribute_container.md + # Project Information + site_name: ZIH HPC Compendium 
site_description: ZIH HPC Compendium site_author: ZIH Team site_dir: public -site_url: https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium +site_url: https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium + # uncomment next 3 lines if link to repo should not be displayed in the navbar + repo_name: GitLab hpc-compendium -repo_url: https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium -edit_uri: blob/master/docs/ +repo_url: https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium +edit_uri: blob/main/doc.zih.tu-dresden.de/docs/ # Configuration -#strict: true + +# strict: true theme: + # basetheme + name: material + # disable fonts being loaded from google fonts + font: false language: en + # dir containing all customizations + custom_dir: tud_theme favicon: assets/images/Logo_klein.png + # logo in header and footer + logo: assets/images/TUD_Logo_weiss_57.png second_logo: assets/images/zih_weiss.png features: @@ -166,6 +174,7 @@ theme: # extends base css/js extra_css: + - stylesheets/extra.css extra_javascript: @@ -180,6 +189,8 @@ markdown_extensions: permalink: True - attr_list - footnotes + - pymdownx.tabbed: + alternate_style: true extra: tud_homepage: https://tu-dresden.de @@ -188,7 +199,9 @@ extra: zih_homepage: https://tu-dresden.de/zih zih_name: "center for information services and high performance computing (ZIH)" hpcsupport_mail: hpcsupport@zih.tu-dresden.de + # links in footer + footer: - link: /legal_notice name: "Legal Notice / Impressum" diff --git a/doc.zih.tu-dresden.de/requirements.txt b/doc.zih.tu-dresden.de/requirements.txt deleted file mode 100644 index 272b09c7c7ffb6b945eaa66e14e2e695f5502f17..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/requirements.txt +++ /dev/null @@ -1,6 +0,0 @@ -# Documentation static site generator & deployment tool -mkdocs>=1.1.2 - -# Add custom theme if not inside a theme_dir -# (https://github.com/mkdocs/mkdocs/wiki/MkDocs-Themes) -mkdocs-material>=7.1.0 diff --git a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css index 8e3b70cd5218006970db1a4453a7dafacd3dea97..1a0b6cfdd9f2d2ad9abdde37b6c3eb64c896de78 100644 --- a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css +++ b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css @@ -32,19 +32,24 @@ .md-typeset h5 { font-family: 'Open Sans Semibold'; line-height: 130%; + margin: 0.2em; } .md-typeset h1 { font-family: 'Open Sans Regular'; - font-size: 1.6rem; + font-size: 1.6rem; + margin-bottom: 0.5em; } .md-typeset h2 { - font-size: 1.4rem; + font-size: 1.2rem; + margin: 0.5em; + border-bottom-style: solid; + border-bottom-width: 1px; } .md-typeset h3 { - font-size: 1.2rem; + font-size: 1.1rem; } .md-typeset h4 { @@ -52,8 +57,8 @@ } .md-typeset h5 { - font-size: 0.9rem; - line-height: 120%; + font-size: 0.8rem; + text-transform: initial; } strong { @@ -151,23 +156,6 @@ strong { width: 125px; } -.md-header__button.md-icon { - display: flex; - justify-content: center; - align-items: center; -} - -@media screen and (min-width: 76.25rem) { - .md-header__button.md-icon { - display: none; - } -} - -@media screen and (min-width: 60rem) { - .md-header__button.md-icon { - display: none; - } -} /* toc */ /* operation-status */ .operation-status-logo { @@ -180,6 +168,7 @@ hr.solid { p { padding: 0 0.6rem; + margin: 0.2em; } /* main */ diff --git a/doc.zih.tu-dresden.de/util/check-bash-syntax.sh b/doc.zih.tu-dresden.de/util/check-bash-syntax.sh new file mode 100755 index 
0000000000000000000000000000000000000000..ac0fcd4621741d7f094e29aaf772f283b64c284d --- /dev/null +++ b/doc.zih.tu-dresden.de/util/check-bash-syntax.sh @@ -0,0 +1,79 @@ +#!/bin/bash + +set -euo pipefail + +scriptpath=${BASH_SOURCE[0]} +basedir=`dirname "$scriptpath"` +basedir=`dirname "$basedir"` + +function usage () { + echo "$0 [options]" + echo "Search for bash files that have an invalid syntax." + echo "" + echo "Options:" + echo " -a Search in all bash files (default: git-changed files)" + echo " -f=FILE Search in a specific bash file" + echo " -s Silent mode" + echo " -h Show help message" +} + +# Options +all_files=false +silent=false +file="" +while getopts ":ahsf:" option; do + case $option in + a) + all_files=true + ;; + f) + file=$2 + shift + ;; + s) + silent=true + ;; + h) + usage + exit;; + \?) # Invalid option + echo "Error: Invalid option." + usage + exit;; + esac +done + +branch="origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME:-preview}" + +if [ $all_files = true ]; then + echo "Search in all bash files." + files=`git ls-tree --full-tree -r --name-only HEAD $basedir/docs/ | grep '\.sh$' || true` +elif [[ ! -z $file ]]; then + files=$file +else + echo "Search in git-changed files." + files=`git diff --name-only "$(git merge-base HEAD "$branch")" | grep '\.sh$' || true` +fi + + +cnt=0 +for f in $files; do + if ! bash -n $f; then + if [ $silent = false ]; then + echo "Bash file $f has invalid syntax" + fi + ((cnt=cnt+1)) + fi +done + +case $cnt in + 1) + echo "Bash files with invalid syntax: 1 match found" + ;; + *) + echo "Bash files with invalid syntax: $cnt matches found" + ;; +esac +if [ $cnt -gt 0 ]; then + exit 1 +fi diff --git a/doc.zih.tu-dresden.de/util/check-empty-page.sh b/doc.zih.tu-dresden.de/util/check-empty-page.sh new file mode 100755 index 0000000000000000000000000000000000000000..7c4fdc2cd07b167b39b0b0ece58e199df0df6d84 --- /dev/null +++ b/doc.zih.tu-dresden.de/util/check-empty-page.sh @@ -0,0 +1,11 @@ +#!/bin/bash + +set -euo pipefail + +scriptpath=${BASH_SOURCE[0]} +basedir=`dirname "$scriptpath"` +basedir=`dirname "$basedir"` + +if find $basedir -name \*.md -exec wc -l {} \; | grep '^0 '; then + exit 1 +fi diff --git a/doc.zih.tu-dresden.de/util/check-filesize.sh b/doc.zih.tu-dresden.de/util/check-filesize.sh new file mode 100755 index 0000000000000000000000000000000000000000..9b11b09c742a387513a265da28aca57d5533516b --- /dev/null +++ b/doc.zih.tu-dresden.de/util/check-filesize.sh @@ -0,0 +1,48 @@ +#!/bin/bash + +# BSD 3-Clause License +# +# Copyright (c) 2017, The Regents of the University of California, through +# Lawrence Berkeley National Laboratory (subject to receipt of any required +# approvals from the U.S. Dept. of Energy). All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# * Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# +# * Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# * Neither the name of the copyright holder nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +large_files_present=false +branch="origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME:-preview}" +source_hash=`git merge-base HEAD "$branch"` + +for f in $(git diff $source_hash --name-only); do + fs=$(wc -c $f | awk '{print $1}') + if [ $fs -gt 1048576 ]; then + echo $f 'is over 1M ('$fs' bytes)' + large_files_present=true + fi +done + +if [ "$large_files_present" == true ]; then + exit 1 +fi diff --git a/doc.zih.tu-dresden.de/util/check-links.sh b/doc.zih.tu-dresden.de/util/check-links.sh index e553f9c4828a2286a5f053181dd09eaaa28746ad..a1b28c271d654f117f344490fd3875e70f77b15e 100755 --- a/doc.zih.tu-dresden.de/util/check-links.sh +++ b/doc.zih.tu-dresden.de/util/check-links.sh @@ -8,52 +8,96 @@ ## ## Author: Martin.Schroschk@tu-dresden.de -set -euo pipefail +set -eo pipefail + +scriptpath=${BASH_SOURCE[0]} +basedir=`dirname "$scriptpath"` +basedir=`dirname "$basedir"` usage() { - echo "Usage: bash $0" + cat <<-EOF +usage: $0 [file | -a] +If file is given, checks whether all links in it are reachable. +If parameter -a (or --all) is given instead of the file, checks all markdown files. +Otherwise, checks whether any changed file contains broken links. +EOF } -# Any arguments? -if [ $# -gt 0 ]; then - usage - exit 1 -fi - mlc=markdown-link-check if ! command -v $mlc &> /dev/null; then echo "INFO: $mlc not found in PATH (global module)" mlc=./node_modules/markdown-link-check/$mlc if ! command -v $mlc &> /dev/null; then echo "INFO: $mlc not found (local module)" - echo "INFO: See CONTRIBUTE.md for information." - echo "INFO: Abort." exit 1 fi fi echo "mlc: $mlc" +LINK_CHECK_CONFIG="$basedir/util/link-check-config.json" +if [ ! -f "$LINK_CHECK_CONFIG" ]; then + echo $LINK_CHECK_CONFIG does not exist + exit 1 +fi + branch="preview" if [ -n "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME" ]; then branch="origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME" fi -any_fails=false +function checkSingleFile(){ + theFile="$1" + if [ -e "$theFile" ]; then + echo "Checking links in $theFile" + if ! $mlc -q -c "$LINK_CHECK_CONFIG" -p "$theFile"; then + return 1 + fi + fi + return 0 +} -files=$(git diff --name-only "$(git merge-base HEAD "$branch")") +function checkFiles(){ +any_fails=false +echo "Check files:" +echo "$files" +echo "" for f in $files; do - if [ "${f: -3}" == ".md" ]; then - # do not check links for deleted files - if [ -e x.txt ]; then - echo "Checking links for $f" - if ! $mlc -q -p "$f"; then - any_fails=true - fi - fi + if ! 
checkSingleFile "$f"; then + any_fails=true fi done if [ "$any_fails" == true ]; then exit 1 fi +} + +function checkAllFiles(){ +files=$(git ls-tree --full-tree -r --name-only HEAD $basedir/ | grep '.md$' || true) +checkFiles +} + +function checkChangedFiles(){ +files=$(git diff --name-only "$(git merge-base HEAD "$branch")" | grep '.md$' || true) +checkFiles +} + +if [ $# -eq 1 ]; then + case $1 in + help | -help | --help) + usage + exit + ;; + -a | --all) + checkAllFiles + ;; + *) + checkSingleFile "$1" + ;; + esac +elif [ $# -eq 0 ]; then + checkChangedFiles +else + usage +fi diff --git a/doc.zih.tu-dresden.de/util/check-no-floating.sh b/doc.zih.tu-dresden.de/util/check-no-floating.sh index 6f94039f3125f87502b1583e699140e15e0e5f5f..4fbc5affe7c670c9dc2d998447c29e3a1e99fe55 100755 --- a/doc.zih.tu-dresden.de/util/check-no-floating.sh +++ b/doc.zih.tu-dresden.de/util/check-no-floating.sh @@ -4,30 +4,41 @@ if [ ${#} -ne 1 ]; then echo "Usage: ${0} <path>" fi -DOCUMENT_ROOT=${1} +basedir=${1} +DOCUMENT_ROOT=${basedir}/docs +maxDepth=4 +expectedFooter="$DOCUMENT_ROOT/legal_notice.md $DOCUMENT_ROOT/accessibility.md $DOCUMENT_ROOT/data_protection_declaration.md" -check_md() { - awk -F'/' '{print $0,NF,$NF}' <<< "${1}" | while IFS=' ' read string depth md; do - #echo "string=${string} depth=${depth} md=${md}" +MSG=$(find ${DOCUMENT_ROOT} -name "*.md" | awk -F'/' '{print $0,NF}' | while IFS=' ' read string depth + do + #echo "string=${string} depth=${depth}" # max depth check - if [ "${depth}" -gt "5" ]; then - echo "max depth (4) exceeded for ${string}" - exit -1 + if [ "${depth}" -gt $maxDepth ]; then + echo "max depth ($maxDepth) exceeded for ${string}" fi + md=${string#${DOCUMENT_ROOT}/} + # md included in nav - if ! sed -n '/nav:/,/^$/p' ${2}/mkdocs.yml | grep --quiet ${md}; then - echo "${md} is not included in nav" - exit -1 + numberOfReferences=`sed -n '/nav:/,/^$/p' ${basedir}/mkdocs.yml | grep -c ${md}` + if [ $numberOfReferences -eq 0 ]; then + # fallback: md included in footer + if [[ "${expectedFooter}" =~ ${string} ]]; then + numberOfReferencesInFooter=`sed -n '/footer:/,/^$/p' ${basedir}/mkdocs.yml | grep -c /${md%.md}` + if [ $numberOfReferencesInFooter -eq 0 ]; then + echo "${md} is not included in footer" + elif [ $numberOfReferencesInFooter -ne 1 ]; then + echo "${md} is included $numberOfReferencesInFooter times in footer" + fi + else + echo "${md} is not included in nav" + fi + elif [ $numberOfReferences -ne 1 ]; then + echo "${md} is included $numberOfReferences times in nav" fi done -} - -export -f check_md - -#find ${DOCUMENT_ROOT}/docs -name "*.md" -exec bash -c 'check_md "${0#${1}}" "${1}"' {} ${DOCUMENT_ROOT} \; -MSG=$(find ${DOCUMENT_ROOT}/docs -name "*.md" -exec bash -c 'check_md "${0#${1}}" "${1}"' {} ${DOCUMENT_ROOT} \;) +) if [ ! 
-z "${MSG}" ]; then echo "${MSG}" exit -1 diff --git a/doc.zih.tu-dresden.de/util/check-spelling.sh b/doc.zih.tu-dresden.de/util/check-spelling.sh index 7fa9d2824d4a61ce86ae258d656acfe90c574269..0d574c1e6adeadacb895f31209b16a9d7f25a123 100755 --- a/doc.zih.tu-dresden.de/util/check-spelling.sh +++ b/doc.zih.tu-dresden.de/util/check-spelling.sh @@ -7,6 +7,7 @@ basedir=`dirname "$scriptpath"` basedir=`dirname "$basedir"` wordlistfile=$(realpath $basedir/wordlist.aspell) branch="origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME:-preview}" +files_to_skip=(doc.zih.tu-dresden.de/docs/accessibility.md doc.zih.tu-dresden.de/docs/data_protection_declaration.md data_protection_declaration.md) aspellmode= if aspell dump modes | grep -q markdown; then aspellmode="--mode=markdown" @@ -14,9 +15,10 @@ fi function usage() { cat <<-EOF -usage: $0 [file] +usage: $0 [file | -a] If file is given, outputs all words of the file, that the spell checker cannot recognize. -If file is omitted, checks whether any changed file contains more unrecognizable words than before the change. +If parameter -a (or --all) is given instead of the file, checks all markdown files. +Otherwise, checks whether any changed file contains more unrecognizable words than before the change. If you are sure a word is correct, you can put it in $wordlistfile. EOF } @@ -29,12 +31,52 @@ function getNumberOfAspellOutputLines(){ getAspellOutput | wc -l } +function isWordlistSorted(){ + #Unfortunately, sort depends on locale and docker does not provide much. + #Therefore, it uses bytewise comparison. We avoid problems with the command tr. + if sed 1d "$wordlistfile" | tr [:upper:] [:lower:] | sort -C; then + return 1 + fi + return 0 +} + +function shouldSkipFile(){ + printf '%s\n' "${files_to_skip[@]}" | grep -xq $1 +} + +function checkAllFiles(){ + any_fails=false + + if isWordlistSorted; then + echo "Unsorted wordlist in $wordlistfile" + any_fails=true + fi + + files=$(git ls-tree --full-tree -r --name-only HEAD $basedir/ | grep .md) + while read file; do + if [ "${file: -3}" == ".md" ]; then + if shouldSkipFile ${file}; then + echo "Skip $file" + else + echo "Check $file" + echo "-- File $file" + if { cat "$file" | getAspellOutput | tee /dev/fd/3 | grep -xq '.*'; } 3>&1; then + any_fails=true + fi + fi + fi + done <<< "$files" + + if [ "$any_fails" == true ]; then + return 1 + fi + return 0 +} + function isMistakeCountIncreasedByChanges(){ any_fails=false - #Unfortunately, sort depends on locale and docker does not provide much. - #Therefore, it uses bytewise comparison. We avoid problems with the command tr. - if ! 
sed 1d "$wordlistfile" | tr [:upper:] [:lower:] | sort -C; then + if isWordlistSorted; then echo "Unsorted wordlist in $wordlistfile" any_fails=true fi @@ -48,9 +90,7 @@ function isMistakeCountIncreasedByChanges(){ while read oldfile; do read newfile if [ "${newfile: -3}" == ".md" ]; then - if [[ $newfile == *"accessibility.md"* || - $newfile == *"data_protection_declaration.md"* || - $newfile == *"legal_notice.md"* ]]; then + if shouldSkipFile ${newfile:2}; then echo "Skip $newfile" else echo "Check $newfile" @@ -70,7 +110,8 @@ function isMistakeCountIncreasedByChanges(){ fi if [ $current_count -gt $previous_count ]; then echo "-- File $newfile" - echo "Change increases spelling mistake count (from $previous_count to $current_count)" + echo "Change increases spelling mistake count (from $previous_count to $current_count), misspelled/unknown words:" + cat "$newfile" | getAspellOutput any_fails=true fi fi @@ -89,6 +130,9 @@ if [ $# -eq 1 ]; then usage exit ;; + -a | --all) + checkAllFiles + ;; *) cat "$1" | getAspellOutput ;; diff --git a/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh b/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh new file mode 100755 index 0000000000000000000000000000000000000000..f3cfa673ce063a674cb2f850d7f7da252a6ab093 --- /dev/null +++ b/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh @@ -0,0 +1,176 @@ +#!/bin/bash + +set -euo pipefail + +scriptpath=${BASH_SOURCE[0]} +basedir=`dirname "$scriptpath"` +basedir=`dirname "$basedir"` + +#This is the ruleset. Each rule consists of a message (first line), a tab-separated list of files to skip (second line) and a pattern specification (third line). +#A pattern specification is a tab-separated list of fields: +#The first field represents whether the match should be case-sensitive (s) or insensitive (i). +#The second field represents the pattern that should not be contained in any file that is checked. +#Further fields represent patterns with exceptions. +#For example, the first rule says: +# The pattern \<io\> should not be present in any file (case-insensitive match), except when it appears as ".io". +ruleset="The word \"IO\" should not be used, use \"I/O\" instead. +doc.zih.tu-dresden.de/docs/contrib/content_rules.md +i \<io\> \.io +\"SLURM\" (only capital letters) should not be used, use \"Slurm\" instead. +doc.zih.tu-dresden.de/docs/contrib/content_rules.md +s \<SLURM\> +\"File system\" should be written as \"filesystem\", except when used as part of a proper name. +doc.zih.tu-dresden.de/docs/contrib/content_rules.md +i file \+system HDFS +Use \"ZIH systems\" or \"ZIH system\" instead of \"Taurus\". \"taurus\" is only allowed when used in ssh commands and other very specific situations. +doc.zih.tu-dresden.de/docs/contrib/content_rules.md doc.zih.tu-dresden.de/docs/archive/phase2_migration.md +i \<taurus\> taurus\.hrsk /taurus /TAURUS ssh ^[0-9]\+:Host taurus$ +\"HRSKII\" should be avoided, use \"ZIH system\" instead. +doc.zih.tu-dresden.de/docs/contrib/content_rules.md +i \<hrskii\> +The term \"HPC-DA\" should be avoided. Depending on the situation, use \"data analytics\" or similar. +doc.zih.tu-dresden.de/docs/contrib/content_rules.md +i hpc[ -]\+da\> +\"ATTACHURL\" was a keyword in the old wiki, don't use it. + +i attachurl +Replace \"todo\" with real content. +doc.zih.tu-dresden.de/docs/archive/system_triton.md +i \<todo\> <!--.*todo.*--> +Replace variations of \"Coming soon\" with real content. + +i \(\<coming soon\>\|This .* under construction\|posted here\) +Avoid spaces at end of lines. 
+doc.zih.tu-dresden.de/docs/accessibility.md +i [[:space:]]$ +When referencing partitions, put keyword \"partition\" in front of partition name, e. g. \"partition ml\", not \"ml partition\". +doc.zih.tu-dresden.de/docs/contrib/content_rules.md +i \(alpha\|ml\|haswell\|romeo\|gpu\|smp\|julia\|hpdlf\|scs5\|dcv\)-\?\(interactive\)\?[^a-z]*partition +Give hints in the link text. Words such as \"here\" or \"this link\" are meaningless. +doc.zih.tu-dresden.de/docs/contrib/content_rules.md +i \[\s\?\(documentation\|here\|more info\|this \(link\|page\|subsection\)\|slides\?\|manpage\)\s\?\] +Use \"workspace\" instead of \"work space\" or \"work-space\". +doc.zih.tu-dresden.de/docs/contrib/content_rules.md +i work[ -]\+space" + +function grepExceptions () { + if [ $# -gt 0 ]; then + firstPattern=$1 + shift + grep -v "$firstPattern" | grepExceptions "$@" + else + cat - + fi +} + +function checkFile(){ + f=$1 + echo "Check wording in file $f" + while read message; do + IFS=$'\t' read -r -a files_to_skip + skipping="" + if (printf '%s\n' "${files_to_skip[@]}" | grep -xq $f); then + skipping=" -- skipping" + fi + IFS=$'\t' read -r flags pattern exceptionPatterns + while IFS=$'\t' read -r -a exceptionPatternsArray; do + if [ $silent = false ]; then + echo " Pattern: $pattern$skipping" + fi + if [ -z "$skipping" ]; then + grepflag= + case "$flags" in + "i") + grepflag=-i + ;; + esac + if grep -n $grepflag $color "$pattern" "$f" | grepExceptions "${exceptionPatternsArray[@]}" ; then + number_of_matches=`grep -n $grepflag $color "$pattern" "$f" | grepExceptions "${exceptionPatternsArray[@]}" | wc -l` + ((cnt=cnt+$number_of_matches)) + if [ $silent = false ]; then + echo " $message" + fi + fi + fi + done <<< $exceptionPatterns + done <<< $ruleset +} + +function usage () { + echo "$0 [options]" + echo "Search forbidden patterns in markdown files." + echo "" + echo "Options:" + echo " -a Search in all markdown files (default: git-changed files)" + echo " -f Search in a specific markdown file" + echo " -s Silent mode" + echo " -h Show help message" + echo " -c Show git matches in color" +} + +# Options +all_files=false +silent=false +file="" +color="" +while getopts ":ahsf:c" option; do + case $option in + a) + all_files=true + ;; + f) + file=$2 + shift + ;; + s) + silent=true + ;; + c) + color=" --color=always " + ;; + h) + usage + exit;; + \?) # Invalid option + echo "Error: Invalid option." + usage + exit;; + esac +done + +branch="origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME:-preview}" + +if [ $all_files = true ]; then + echo "Search in all markdown files." + files=$(git ls-tree --full-tree -r --name-only HEAD $basedir/ | grep .md) +elif [[ ! -z $file ]]; then + files=$file +else + echo "Search in git-changed files." + files=`git diff --name-only "$(git merge-base HEAD "$branch")"` +fi + +echo "... $files ..." +cnt=0 +if [[ ! 
-z $file ]]; then + checkFile $file +else + for f in $files; do + if [ "${f: -3}" == ".md" -a -f "$f" ]; then + checkFile $f + fi + done +fi + +echo "" +case $cnt in + 1) + echo "Forbidden Patterns: 1 match found" + ;; + *) + echo "Forbidden Patterns: $cnt matches found" + ;; +esac +if [ $cnt -gt 0 ]; then + exit 1 +fi diff --git a/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.testdoc b/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.testdoc new file mode 100644 index 0000000000000000000000000000000000000000..2b674702cd81304662b439a61d2fe15246ef8215 --- /dev/null +++ b/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.testdoc @@ -0,0 +1,46 @@ +# Diese Datei versucht alles falsch zu machen, worauf grep-forbidden-words.sh checkt. + +`i \[\s\?\(documentation\|here\|this \(link\|page\|subsection\)\|slides\?\|manpage\)\s\?\]` + +Man kann Workspace schreiben oder aber auch +work-Space, beides sollte auffallen. + +Die ML-Partition, +die Alpha-Partition, +die Haswell-Partition, +die Romeo-Partition, +die GPU-Partition, +die SMP-Partition, +die Julia-Partition, +die HPDLF-Partition, +die scs5-Partition (was ist das überhaupt?), +alle gibt es auch in interaktiv: +Die ML-interactive partition, +die Alpha-interactive partition, +die Haswell-interactive Partition, +die Romeo-interactive partition, +die GPU-interactive partition, +die SMP-interactive partition, +die Julia-interactive partition, +die HPDLF-interactive partition, +die scs5-interactive partition (was ist das überhaupt?), +alle diese Partitionen existieren, aber man darf sie nicht benennen. +``` +Denn sonst kommt das Leerzeichenmonster und packt Leerzeichen ans Ende der Zeile. +``` + +TODO: io sollte mit SLURM laufen. + +Das HDFS ist ein sehr gutes +file system auf taurus. + +Taurus ist erreichbar per +taurus.hrsk oder per +/taurus oder per +/TAURUS + +Was ist hrskii? Keine Ahnung! + +Was ist HPC-DA? Ist es ein attachurl? See (this page). +Or (here). +Or (manpage). diff --git a/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh b/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh deleted file mode 100755 index aa20c5a06de665a4420d8c6d41061ee0d6459015..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh +++ /dev/null @@ -1,118 +0,0 @@ -#!/bin/bash - -set -euo pipefail - -scriptpath=${BASH_SOURCE[0]} -basedir=`dirname "$scriptpath"` -basedir=`dirname "$basedir"` - -#This is the ruleset. Each line represents a rule of tab-separated fields. -#The first field represents whether the match should be case-sensitive (s) or insensitive (i). -#The second field represents the pattern that should not be contained in any file that is checked. -#Further fields represent patterns with exceptions. -#For example, the first rule says: -# The pattern \<io\> should not be present in any file (case-insensitive match), except when it appears as ".io". -ruleset="i \<io\> \.io -s \<SLURM\> -i file \+system -i \<taurus\> taurus\.hrsk /taurus -i \<hrskii\> -i hpc \+system -i hpc[ -]\+da\> -i work[ -]\+space" - -function grepExceptions () { - if [ $# -gt 0 ]; then - firstPattern=$1 - shift - grep -v "$firstPattern" | grepExceptions "$@" - else - cat - - fi -} - -function usage () { - echo "$0 [options]" - echo "Search forbidden patterns in markdown files." 
- echo "" - echo "Options:" - echo " -a Search in all markdown files (default: git-changed files)" - echo " -f Search in a specific markdown file" - echo " -s Silent mode" - echo " -h Show help message" -} - -# Options -all_files=false -silent=false -file="" -while getopts ":ahsf:" option; do - case $option in - a) - all_files=true - ;; - f) - file=$2 - shift - ;; - s) - silent=true - ;; - h) - usage - exit;; - \?) # Invalid option - echo "Error: Invalid option." - usage - exit;; - esac -done - -branch="origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME:-preview}" - -if [ $all_files = true ]; then - echo "Search in all markdown files." - files=$(git ls-tree --full-tree -r --name-only HEAD $basedir/docs/ | grep .md) -elif [[ ! -z $file ]]; then - files=$file -else - echo "Search in git-changed files." - files=`git diff --name-only "$(git merge-base HEAD "$branch")"` -fi - -echo "... $files ..." -cnt=0 -for f in $files; do - if [ "$f" != doc.zih.tu-dresden.de/README.md -a "${f: -3}" == ".md" -a -f "$f" ]; then - echo "Check wording in file $f" - while IFS=$'\t' read -r flags pattern exceptionPatterns; do - while IFS=$'\t' read -r -a exceptionPatternsArray; do - if [ $silent = false ]; then - echo " Pattern: $pattern" - fi - grepflag= - case "$flags" in - "i") - grepflag=-i - ;; - esac - if grep -n $grepflag "$pattern" "$f" | grepExceptions "${exceptionPatternsArray[@]}" ; then - ((cnt=cnt+1)) - fi - done <<< $exceptionPatterns - done <<< $ruleset - fi -done - -echo "" -case $cnt in - 1) - echo "Forbidden Patterns: 1 match found" - ;; - *) - echo "Forbidden Patterns: $cnt matches found" - ;; -esac -if [ $cnt -gt 0 ]; then - exit 1 -fi diff --git a/doc.zih.tu-dresden.de/util/link-check-config.json b/doc.zih.tu-dresden.de/util/link-check-config.json new file mode 100644 index 0000000000000000000000000000000000000000..fdbb8373f2ebe4d14098d1af5eb62c15733c8f3c --- /dev/null +++ b/doc.zih.tu-dresden.de/util/link-check-config.json @@ -0,0 +1,10 @@ +{ + "ignorePatterns": [ + { + "pattern": "^https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/merge_requests/new$" + }, + { + "pattern": "^https://doc.zih.tu-dresden.de/preview$" + } + ] +} diff --git a/doc.zih.tu-dresden.de/util/lint-changes.sh b/doc.zih.tu-dresden.de/util/lint-changes.sh index ba277da7ae8e3ea367424153a8f116ba3e9d6d2c..05ee5784468bed8d49adbbad8c9389bd3823590b 100755 --- a/doc.zih.tu-dresden.de/util/lint-changes.sh +++ b/doc.zih.tu-dresden.de/util/lint-changes.sh @@ -7,13 +7,16 @@ if [ -n "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME" ]; then branch="origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME" fi +configfile=$(dirname $0)/../.markdownlintrc +echo "config: $configfile" + any_fails=false files=$(git diff --name-only "$(git merge-base HEAD "$branch")") for f in $files; do if [ "${f: -3}" == ".md" ]; then echo "Linting $f" - if ! markdownlint "$f"; then + if ! markdownlint -c $configfile "$f"; then any_fails=true fi fi diff --git a/doc.zih.tu-dresden.de/util/pre-commit b/doc.zih.tu-dresden.de/util/pre-commit new file mode 100755 index 0000000000000000000000000000000000000000..1cc901e00efbece94209bfa6c4bbbc54aad682e9 --- /dev/null +++ b/doc.zih.tu-dresden.de/util/pre-commit @@ -0,0 +1,90 @@ +#!/bin/bash +function testPath(){ +path_to_test=doc.zih.tu-dresden.de/docs/$1 +test -f "$path_to_test" || echo $path_to_test does not exist +} + +if ! `docker image inspect hpc-compendium:latest > /dev/null 2>&1` +then + echo Container not built, building... + docker build -t hpc-compendium . 
+fi + +export -f testPath + +exit_ok=yes +branch="origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME:-preview}" +if [ -f "$GIT_DIR/MERGE_HEAD" ] +then + source_hash=`git merge-base HEAD "$branch"` +else + source_hash=`git rev-parse HEAD` +fi +#Remove everything except lines beginning with --- or +++ +files=`git diff $source_hash | sed -E -n 's#^(---|\+\+\+) ((/|./)[^[:space:]]+)$#\2#p'` +#Assume that we have pairs of lines (starting with --- and +++). +while read oldfile; do + read newfile + + if [ "$newfile" == doc.zih.tu-dresden.de/mkdocs.yml ] + then + echo Testing "$newfile" + sed -n '/^ *- /s#.*: \([A-Za-z_/]*.md\).*#\1#p' doc.zih.tu-dresden.de/mkdocs.yml | xargs -L1 -I {} bash -c "testPath '{}'" + if [ $? -ne 0 ] + then + exit_ok=no + fi + elif [[ "$newfile" =~ ^b/doc.zih.tu-dresden.de/(.*.md)$ ]] + then + filepattern=${BASH_REMATCH[1]} + + echo "Linting..." + docker run --name=hpc-compendium --rm -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium markdownlint $filepattern + if [ $? -ne 0 ] + then + exit_ok=no + fi + + echo "Checking links..." + docker run --name=hpc-compendium --rm -w /docs --mount src="$(pwd)"/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium markdown-link-check $filepattern + if [ $? -ne 0 ] + then + exit_ok=no + fi + fi +done <<< "$files" + +echo "Testing syntax of bash files..." +docker run --name=hpc-compendium --rm -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/check-bash-syntax.sh +if [ $? -ne 0 ] +then + exit_ok=no +fi + +echo "Spell-checking..." +docker run --name=hpc-compendium --rm -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/check-spelling.sh +if [ $? -ne 0 ] +then + exit_ok=no +fi + +echo "Forbidden words checking..." +docker run --name=hpc-compendium --rm -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh +if [ $? -ne 0 ] +then + exit_ok=no +fi + +echo "Looking for empty files..." +docker run --name=hpc-compendium --rm -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/check-empty-page.sh +if [ $? 
-ne 0 ] +then + exit_ok=no +fi + +if [ $exit_ok == yes ] +then + exit 0 +else + exit 1 +fi diff --git a/doc.zih.tu-dresden.de/util/test-grep-forbidden-patterns.sh b/doc.zih.tu-dresden.de/util/test-grep-forbidden-patterns.sh new file mode 100755 index 0000000000000000000000000000000000000000..1e98caf528d9b1a1d640e9dd3e5c7dc23ec937ea --- /dev/null +++ b/doc.zih.tu-dresden.de/util/test-grep-forbidden-patterns.sh @@ -0,0 +1,13 @@ +#!/bin/bash + +expected_match_count=32 + +number_of_matches=$(bash ./doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh -f doc.zih.tu-dresden.de/util/grep-forbidden-patterns.testdoc -c -c | grep "Forbidden Patterns:" | sed -e 's/.*: //' | sed -e 's/ matches.*//') + +if [ $number_of_matches -eq $expected_match_count ]; then + echo "Test OK" + exit 0 +else + echo "Test failed: $expected_match_count matches expected, but only $number_of_matches found" + exit 1 +fi diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell index 30eaee21e2befa638eefe67e87a591f7dbc6c708..262f5eeae1b153648d59418137b8bac2dc2cf5fb 100644 --- a/doc.zih.tu-dresden.de/wordlist.aspell +++ b/doc.zih.tu-dresden.de/wordlist.aspell @@ -1,70 +1,199 @@ -personal_ws-1.1 en 1805 +personal_ws-1.1 en 203 +Abaqus +Addon +Addons +ALLREDUCE Altix +Amber Amdahl's analytics +Analytics anonymized -Anonymized +Ansys +APIs +AVX +awk BeeGFS benchmarking BLAS +broadwell bsub +bullx +CCM ccNUMA +centauri +CentOS +CFX +cgroups +checkpointing +Chemnitz citable +CLI +CMake +COMSOL +conda +config +CONFIG +cpu CPU +CPUID +cpus CPUs +crossentropy +css +CSV CUDA +cuDNN CXFS +dask +Dask +dataframes +DataFrames +datamover +DataParallel +dataset +DCV +ddl +DDP DDR DFG +distr +DistributedDataParallel +DMTCP +DNS +Dockerfile +Dockerfiles +DockerHub +dockerized +dotfile +dotfiles +downtime +downtimes +EasyBlocks EasyBuild +EasyConfig +ecryptfs engl english +env +EPYC +Espresso +ESSL +facto fastfs FFT FFTW filesystem filesystems -Filesystem +flink Flink +FlinkExample +FMA +foreach Fortran +Galilei +Gauss +Gaussian +GBit +GDB +GDDR GFLOPS gfortran GiB +gifferent +GitHub +GitLab +GitLab's +glibc +Gloo gnuplot -Gnuplot +gpu GPU +GPUs +gres +GROMACS +GUIs hadoop -Haswell +haswell +HBM +HDF HDFS +HDFView +hiera +horovod Horovod +horovodrun +hostname +Hostnames +hpc HPC +hpcsupport +HPE HPL +html +hvd +hyperparameter +hyperparameters +hyperthreading icc icpc ifort ImageNet +img Infiniband +init inode +Instrumenter +IOPS +IPs +ipynb +ISA Itanium +jobqueue jpg +jss +jupyter Jupyter JupyterHub JupyterLab Keras KNL +Kunststofftechnik +LAMMPS LAPACK +lapply +Leichtbau LINPACK +linter +Linter +lmod LoadLeveler +localhost lsf -LSF lustre +markdownlint +Mathematica +MathKernel +MathWorks +matlab MEGWARE +mem +Memcheck MiB +Microarchitecture MIMD +Miniconda +mkdocs MKL +MNIST +MobaXTerm +modenv +modenvs +modulefile Montecito mountpoint -MPI +mpi +Mpi mpicc mpiCC mpicxx @@ -72,12 +201,33 @@ mpif mpifort mpirun multicore +multiphysics +Multiphysics multithreaded +Multithreading +NAMD +natively +nbgitpuller +nbsp +NCCL Neptun NFS +NGC +nodelist +NODELIST +NRINGS +ntasks +NUM NUMA NUMAlink +NumPy Nutzungsbedingungen +Nvidia +NVLINK +NVMe +NWChem +OME +OmniOpt OPARI OpenACC OpenBLAS @@ -86,57 +236,158 @@ OpenGL OpenMP openmpi OpenMPI +OpenSSH Opteron +OTF +overfitting +pandarallel +Pandarallel PAPI parallelization +parallelize +parallelized +parfor pdf +perf Perf +performant +PESSL +PGI PiB Pika pipelining +PMI png +PowerAI +ppc +pre +Pre +Preload +preloaded +preloading +prepend +preprocessing +PSOCK +Pthread 
Pthreads +pty +PuTTY +pymdownx +PythonAnaconda +pytorch +PyTorch +Quantum +queue +quickstart +Quickstart +randint reachability +README +reproducibility +requeueing +resnet +ResNet +RHEL +Rmpi rome romeo RSA +RSS +RStudio +Rsync +runnable +runtime +Runtime +sacct salloc +Sandybridge Saxonid sbatch ScaDS -ScaLAPACK +scalability scalable +ScaLAPACK Scalasca scancel +Scikit +SciPy scontrol scp scs +SFTP SGEMM SGI SHA SHMEM SLES Slurm +SLURMCluster SMP -queue +SMT +SparkExample +spython +squeue srun ssd -SSD +SSHFS +STAR stderr stdout +subdirectories +subdirectory +SubMathKernel +Superdome SUSE +SXM TBB +TCP +TensorBoard +tensorflow TensorFlow TFLOPS Theano tmp +todo +ToDo +toolchain +toolchains +torchvision +Torchvision tracefile tracefiles +tracepoints +transferability Trition +undistinguishable +unencrypted +uplink +userspace +Valgrind Vampir +VampirServer VampirTrace VampirTrace's +VASP +vectorization +venv +virtualenv +VirtualGL +VMs +VMSize +VPN WebVNC +WinSCP +Workdir +workspace workspaces +XArray Xeon +XGBoost +XLC +XLF +Xming +yaml +zih ZIH +ZIH's +ZSH diff --git a/twiki2md/root/Applications/Electrodynamics.md b/twiki2md/root/Applications/Electrodynamics.md deleted file mode 100644 index c7b99613df2abfbad8b708d27cae01925a382925..0000000000000000000000000000000000000000 --- a/twiki2md/root/Applications/Electrodynamics.md +++ /dev/null @@ -1,17 +0,0 @@ -# Electromagnetic Field Simulation - -The following applications are installed at ZIH: - -| | | | -|----------|----------|------------| -| | **Mars** | **Deimos** | -| **HFSS** | | 11.0.2 | - -## HFSS - -[HFSS](http://www.ansoft.com/products/hf/hfss/) is the industry-standard -simulation tool for 3D full-wave electromagnetic field simulation. HFSS -provides E- and H-fields, currents, S-parameters and near and far -radiated field results. 
- --- Main.mark - 2010-01-06 diff --git a/twiki2md/root/Applications/SoftwareModulesList.md b/twiki2md/root/Applications/SoftwareModulesList.md deleted file mode 100644 index 70feb25fc836e0cee838889ef15cc9ca284d9cfa..0000000000000000000000000000000000000000 --- a/twiki2md/root/Applications/SoftwareModulesList.md +++ /dev/null @@ -1,962 +0,0 @@ -### SCS5 Environment - -<span class="twiki-macro TABLE">headerrows= 1"</span> <span -class="twiki-macro EDITTABLE" -format="| text, 30, Software | text, 40, Kategorie | text, 30, Letzte Änderung | text, 30, SGI-UV | text, 30, Taurus | " -changerows="on"></span> - -| Software | Category | Last change | Venus | Taurus | -|:--------------------------|:----------|:------------|:------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| ABAQUS | cae | 2020-04-30 | \- | 2019 | -| ABINIT | chem | 2019-07-12 | \- | 8.6.3\<br />8.10.3 | -| ACE | lib | 2018-11-22 | \- | 6.5.1 | -| ACTC | lib | 2018-11-22 | \- | 1.1 | -| AFNI | bio | 2018-11-22 | \- | 20180521 | -| AMDLibM | perf | 2019-08-20 | \- | 3.4.0 | -| AMDuProf | perf | 2020-11-26 | \- | 3.3.462\<br />3.1.35 | -| ANSA | cae | 2021-04-08 | \- | 20.1.4 | -| ANSYS | tools | 2020-10-23 | \- | 2020R2\<br />19.5\<br />19.2 | -| ANTLR | tools | 2020-09-29 | \- | 2.7.7 | -| ANTs | data | 2020-10-06 | \- | 2.3.4 | -| APR | tools | 2019-11-08 | \- | 1.7.0\<br />1.6.3 | -| APR-util | tools | 2019-11-08 | \- | 1.6.1 | -| ASE | chem | 2020-09-29 | \- | 3.19.0\<br />3.18.1\<br />3.16.2 | -| ATK | vis | 2020-09-02 | \- | 2.34.1\<br />2.28.1 | -| Advisor | perf | 2018-11-22 | \- | 2018 | -| Anaconda3 | lang | 2019-10-17 | \- | 2019.03 | -| AnsysEM | phys | 2018-11-22 | \- | 19.0 | -| Arrow | data | 2020-12-01 | \- | 0.16.0\<br />0.14.1 | -| AtomPAW | chem | 2019-09-11 | \- | 4.1.0.6 | -| Autoconf | devel | 2021-02-17 | \- | 2.69 | -| Automake | devel | 2021-02-17 | \- | 1.16.2\<br />1.16.1\<br />1.15.1\<br />1.15 | -| Autotools | devel | 2021-02-17 | \- | 20200321\<br />20180311\<br />20170619\<br />20150215 | -| Bazel | devel | 2021-03-19 | \- | 3.7.2\<br />3.7.1\<br />3.4.1\<br />0.29.1\<br />0.26.1\<br />0.20.0\<br />0.16.0\<br />0.12.0 | -| BigDataFrameworkConfigure | devel | 2019-09-16 | \- | 0.0.2\<br />0.0.1 | -| Bison | lang | 2020-12-11 | \- | 3.7.1\<br />3.5.3\<br />3.3.2\<br />3.0.5\<br />3.0.4\<br />2.7 | -| Blitz++ | lib | 2019-04-09 | \- | 0.10 | -| Boost | devel | 2021-04-09 | \- | 1.74.0\<br />1.72.0\<br />1.71.0\<br />1.70.0\<br />1.69.0\<br />1.68.0\<br />1.67.0\<br />1.66.0\<br />1.61.0\<br />1.55.0 | -| Boost.Python | lib | 2019-02-26 | \- | 1.66.0 | -| CDO | data | 2020-10-06 | \- | 1.9.8\<br />1.9.4 | -| CFITSIO | lib | 2019-06-03 | \- | 3.45 | -| CGAL | numlib | 2020-08-20 | \- | 4.14.3\<br />4.14.1\<br />4.11.1 | -| CGNS | cae | 2020-10-09 | \- | 3.3.1 | -| CMake | devel | 2021-02-17 | \- | 3.9.5\<br />3.9.1\<br />3.8.0\<br />3.18.4\<br />3.16.4\<br />3.15.3\<br />3.14.5\<br />3.13.3\<br />3.12.1\<br />3.11.4\<br />3.10.2\<br />3.10.1\<br />3.10.0 | -| COMSOL | phys | 2021-02-10 | \- | 5.6\<br />5.5\<br />5.4 | -| CP2K | chem | 2019-12-12 | \- | 6.1\<br />5.1 | -| CUDA | system | 2021-02-17 | \- | 9.2.88\<br />9.1.85\<br />11.1.1\<br />11.0.2\<br />10.1.243\<br />10.0.130 | -| CUDAcore | system | 2021-02-17 | \- | 11.1.1\<br />11.0.2 | -| Check | lib | 2021-02-17 | \- | 0.15.2 | -| Clang | compiler | 
2020-12-09 | \- | 9.0.1\<br />5.0.0 | -| ClustalW2 | bio | 2018-11-22 | \- | 2.1 | -| CubeGUI | perf | 2021-01-29 | \- | 4.4.4\<br />4.4 | -| CubeLib | perf | 2020-10-12 | \- | 4.4.4\<br />4.4 | -| CubeW | perf | 2018-11-22 | \- | 4.4 | -| CubeWriter | perf | 2020-10-12 | \- | 4.4.3 | -| Cython | lang | 2018-11-22 | \- | 0.28.5\<br />0.28.2 | -| DAMASK | phys | 2020-12-10 | \- | 3.0.0\<br />2.0.3-2992\<br />2.0.3 | -| DASH | lib | 2018-11-22 | \- | dash | -| DB | tools | 2021-02-17 | \- | 18.1.40 | -| DBus | devel | 2020-08-11 | \- | 1.13.8\<br />1.13.6\<br />1.13.12 | -| DFTB+ | phys | 2021-04-12 | \- | 19.1\<br />18.2\<br />18.1 | -| DMTCP | tools | 2018-12-06 | \- | 2.5.2\<br />2.5.1 | -| DOLFIN | math | 2020-12-21 | \- | 2019.1.0\<br />2018.1.0.post1\<br />2017.2.0 | -| Delft3D | geo | 2020-01-29 | \- | 6.03 | -| Devel-NYTProf | perf | 2019-08-30 | \- | 6.06 | -| Doxygen | devel | 2021-03-03 | \- | 1.8.20\<br />1.8.17\<br />1.8.16\<br />1.8.15\<br />1.8.14\<br />1.8.13 | -| Dyninst | tools | 2019-01-21 | \- | 9.3.2 | -| ELPA | math | 2021-01-05 | \- | 2019.11.001\<br />2018.11.001\<br />2016.11.001.pre | -| ELSI | math | 2020-02-27 | \- | 2.5.0 | -| EMBOSS | bio | 2019-02-11 | \- | 6.6.0 | -| ESPResSo | phys | 2019-11-19 | \- | 4.1.1 | -| ETSF_IO | lib | 2020-10-15 | \- | 1.0.4 | -| EasyBuild | tools | 2021-03-08 | \- | 4.3.3\<br />4.3.2\<br />4.3.1\<br />4.3.0\<br />4.2.2\<br />4.2.0\<br />4.1.2\<br />4.1.1\<br />4.1.0\<br />4.0.1\<br />3.9.4\<br />3.9.3\<br />3.9.2\<br />3.9.1\<br />3.8.1\<br />3.8.0\<br />3.7.1\<br />3.7.0\<br />3.6.2\<br />3.6.1 | -| Eigen | math | 2021-02-17 | \- | 3.3.8\<br />3.3.7\<br />3.3.4 | -| Emacs | tools | 2020-02-20 | \- | 25.3 | -| ErlangOTP | lang | 2019-03-13 | \- | 21.3-no | -| FFC | math | 2019-02-26 | \- | 2018.1.0 | -| FFTW | numlib | 2021-02-17 | \- | 3.3.8\<br />3.3.7\<br />3.3.4 | -| FFmpeg | vis | 2020-08-11 | \- | 4.2.2\<br />4.2.1\<br />4.1.3\<br />4.1\<br />3.4.2 | -| FIAT | math | 2019-02-26 | \- | 2018.1.0 | -| FLANN | lib | 2019-08-13 | \- | 1.8.4 | -| FLTK | vis | 2019-11-08 | \- | 1.3.5\<br />1.3.4 | -| FSL | bio | 2020-10-30 | \- | 6.0.2\<br />5.0.11 | -| Flink | devel | 2019-09-16 | \- | 1.9.0\<br />1.8.1 | -| FoX | lib | 2018-11-22 | \- | 4.1.2 | -| FreeSurfer | bio | 2020-07-22 | \- | 7.1.1\<br />6.0.0\<br />5.3.0\<br />5.1.0 | -| FriBidi | vis | 2020-08-11 | \- | 1.0.9\<br />1.0.5\<br />1.0.1 | -| GAMS | math | 2018-11-28 | \- | 25.1.3 | -| GCC | compiler | 2020-12-11 | \- | 9.3.0\<br />9.1.0-2.32\<br />8.3.0\<br />8.2.0-2.31.1\<br />8.2.0-2.30\<br />7.3.0-2.30\<br />6.4.0-2.28\<br />10.2.0 | -| GCCcore | compiler | 2020-12-11 | \- | 9.3.0\<br />9.1.0\<br />8.3.0\<br />8.2.0\<br />7.3.0\<br />6.4.0\<br />6.3.0\<br />10.2.0 | -| GCL | compiler | 2018-11-22 | \- | 2.6.12 | -| GDAL | data | 2020-10-06 | \- | 3.0.2\<br />3.0.0\<br />2.2.3 | -| GDB | debugger | 2020-08-11 | \- | 9.1\<br />8.1 | -| GDRCopy | lib | 2021-02-17 | \- | 2.1 | -| GEOS | math | 2020-10-06 | \- | 3.8.0\<br />3.7.2\<br />3.6.2 | -| GL2PS | vis | 2019-11-08 | \- | 1.4.0 | -| GLM | lib | 2018-12-11 | \- | 0.9.9.0 | -| GLPK | tools | 2020-08-28 | \- | 4.65 | -| GLib | vis | 2020-08-11 | \- | 2.64.1\<br />2.62.0\<br />2.60.1\<br />2.54.3 | -| GLibmm | vis | 2020-09-24 | \- | 2.49.7 | -| GMP | math | 2021-02-17 | \- | 6.2.0\<br />6.1.2 | -| GObject-Introspection | devel | 2020-09-02 | \- | 1.63.1\<br />1.60.1\<br />1.54.1 | -| GPAW | chem | 2020-01-15 | \- | 19.8.1 | -| GPAW-setups | chem | 2020-01-15 | \- | 0.9.20000 | -| GPI2 | base | 2019-05-29 | \- | 
next-27-05-19\<br />1.3.0 | -| GROMACS | bio | 2020-10-12 | \- | 2020\<br />2019.4\<br />2018.2 | -| GSL | numlib | 2020-08-11 | \- | 2.6\<br />2.5\<br />2.4 | -| GTK+ | vis | 2020-01-13 | \- | 2.24.32 | -| Gdk-Pixbuf | vis | 2020-09-02 | \- | 2.38.2\<br />2.36.12\<br />2.36.11 | -| Ghostscript | tools | 2020-08-28 | \- | 9.52\<br />9.50\<br />9.27\<br />9.23\<br />9.22 | -| GitPython | lib | 2020-04-16 | \- | 3.1.1\<br />3.0.3 | -| GlobalArrays | lib | 2020-10-12 | \- | 5.7 | -| Go | compiler | 2019-06-26 | \- | 1.12 | -| GraphicsMagick | vis | 2019-11-08 | \- | 1.3.33\<br />1.3.31\<br />1.3.28 | -| Guile | lang | 2020-08-11 | \- | 1.8.8 | -| Gurobi | math | 2020-12-01 | \- | 9.1.0\<br />9.0.1\<br />8.0.1 | -| HDF | data | 2018-11-22 | \- | 4.2.13 | -| HDF5 | data | 2021-03-17 | \- | 1.10.7\<br />1.10.6\<br />1.10.5\<br />1.10.2\<br />1.10.1 | -| HDFView | vis | 2019-02-20 | \- | 2.14 | -| Hadoop | devel | 2019-08-29 | \- | 2.7.7 | -| HarfBuzz | vis | 2020-09-02 | \- | 2.6.4\<br />2.4.0\<br />2.2.0\<br />1.7.5 | -| Horovod | tools | 2020-01-24 | \- | 0.18.2 | -| Hyperopt | lib | 2020-02-19 | \- | 0.2.2\<br />0.1.1 | -| Hypre | numlib | 2020-12-08 | \- | 2.18.2\<br />2.14.0 | -| ICU | lib | 2021-03-17 | \- | 67.1\<br />66.1\<br />64.2\<br />61.1\<br />56.1 | -| IPython | tools | 2019-11-19 | \- | 7.7.0\<br />6.4.0\<br />6.3.1 | -| ImageMagick | vis | 2020-08-28 | \- | 7.0.9-5\<br />7.0.8-46\<br />7.0.8-11\<br />7.0.7-39\<br />7.0.10-1 | -| Inspector | tools | 2019-01-22 | \- | 2019\<br />2018 | -| Ipopt | lib | 2018-11-22 | \- | 3.12.11 | -| JUnit | devel | 2019-02-13 | \- | 4.12 | -| JasPer | vis | 2020-08-11 | \- | 2.0.14\<br />1.900.1 | -| Java | lang | 2020-12-02 | \- | 14.0.2\<br />11.0.2\<br />1.8.0-162\<br />1.8.0-161 | -| JsonCpp | lib | 2021-03-17 | \- | 1.9.4\<br />1.9.3 | -| Julia | lang | 2020-07-14 | \- | 1.4.2\<br />1.1.1\<br />1.0.2 | -| Keras | math | 2020-03-20 | \- | 2.3.1\<br />2.2.4\<br />2.2.0 | -| LAME | data | 2020-08-11 | \- | 3.100 | -| LAMMPS | chem | 2020-08-11 | \- | 7Aug19\<br />3Mar2020\<br />20180316\<br />12Dec2018 | -| LLVM | compiler | 2020-09-25 | \- | 9.0.1\<br />9.0.0\<br />8.0.1\<br />7.0.1\<br />6.0.0\<br />5.0.1 | -| LMDB | lib | 2021-03-17 | \- | 0.9.24 | -| LS-DYNA | cae | 2020-12-11 | \- | DEV-81069\<br />12.0.0\<br />11.1.0\<br />11.0.0\<br />10.1.0 | -| LS-Opt | cae | 2019-08-20 | \- | 6.0.0\<br />5.2.1 | -| LS-PrePost | cae | 2019-05-03 | \- | 4.6\<br />4.5\<br />4.3 | -| LibTIFF | lib | 2021-02-17 | \- | 4.1.0\<br />4.0.9\<br />4.0.10 | -| LibUUID | lib | 2019-11-08 | \- | 1.0.3 | -| Libint | chem | 2019-10-14 | \- | 1.1.6 | -| LittleCMS | vis | 2020-08-28 | \- | 2.9 | -| M4 | devel | 2020-12-11 | \- | 1.4.18\<br />1.4.17 | -| MATIO | lib | 2018-12-11 | \- | 1.5.12 | -| MATLAB | math | 2021-03-22 | \- | 2021a\<br />2020a\<br />2019b\<br />2018b\<br />2018a\<br />2017a | -| MDAnalysis | phys | 2020-04-21 | \- | 0.20.1 | -| METIS | math | 2020-08-20 | \- | 5.1.0 | -| MPFR | math | 2020-08-20 | \- | 4.0.2\<br />4.0.1 | -| MUMPS | math | 2020-12-08 | \- | 5.2.1\<br />5.1.2 | -| MUST | perf | 2019-01-25 | \- | 1.6.0-rc3 | -| Mako | devel | 2020-08-11 | \- | 1.1.2\<br />1.1.0\<br />1.0.8\<br />1.0.7 | -| Mathematica | math | 2018-11-22 | \- | 11.3.0\<br />11.2.0 | -| Maven | devel | 2020-04-29 | \- | 3.6.3 | -| Maxima | math | 2018-11-22 | \- | 5.42.1 | -| Mercurial | tools | 2018-11-22 | \- | 4.6.1 | -| Mesa | vis | 2020-08-11 | \- | 20.0.2\<br />19.1.7\<br />19.0.1\<br />18.1.1\<br />17.3.6 | -| Meson | tools | 2021-02-17 | \- | 0.55.3\<br 
/>0.53.2\<br />0.51.2\<br />0.50.0 | -| Mesquite | math | 2020-02-12 | \- | 2.3.0 | -| Miniconda2 | lang | 2018-11-22 | \- | 4.5.11 | -| Miniconda3 | lang | 2019-10-17 | \- | 4.5.4 | -| MongoDB | data | 2019-07-29 | \- | 4.0.3 | -| NAMD | chem | 2018-11-22 | \- | 2.12 | -| NASM | lang | 2021-02-17 | \- | 2.15.05\<br />2.14.02\<br />2.13.03\<br />2.13.01 | -| NCCL | lib | 2021-03-17 | \- | 2.8.3\<br />2.4.8\<br />2.4.2\<br />2.3.7 | -| NCO | tools | 2020-09-29 | \- | 4.9.3 | -| NFFT | lib | 2019-08-09 | \- | 3.5.1 | -| NLTK | data | 2018-12-05 | \- | 3.4 | -| NLopt | numlib | 2020-08-28 | \- | 2.6.1\<br />2.4.2 | -| NSPR | lib | 2020-08-11 | \- | 4.25\<br />4.21\<br />4.20 | -| NSS | lib | 2020-08-11 | \- | 3.51\<br />3.45\<br />3.42.1\<br />3.39 | -| NWChem | chem | 2018-11-22 | \- | 6.8.revision47\<br />6.6.revision27746 | -| Nektar++ | math | 2020-02-19 | \- | 5.0.0 | -| NetLogo | math | 2018-11-22 | \- | 6.0.4-64 | -| Ninja | tools | 2021-02-17 | \- | 1.9.0\<br />1.10.1\<br />1.10.0 | -| OPARI2 | perf | 2020-10-12 | \- | 2.0.5\<br />2.0.3 | -| ORCA | chem | 2020-03-04 | \- | 4.2.1\<br />4.1.1 | -| OTF2 | perf | 2020-10-12 | \- | 2.2\<br />2.1.1 | -| Octave | math | 2019-11-08 | \- | 5.1.0 | -| Octopus | chem | 2020-10-15 | \- | 8.4\<br />10.1 | -| OpenBLAS | numlib | 2021-02-17 | \- | 0.3.9\<br />0.3.7\<br />0.3.5\<br />0.3.12\<br />0.3.1\<br />0.2.20 | -| OpenBabel | chem | 2018-11-22 | \- | 2.4.1 | -| OpenCV | vis | 2020-01-13 | \- | 4.0.1\<br />3.4.1 | -| OpenFOAM | cae | 2020-09-23 | \- | v2006\<br />v1912\<br />v1806\<br />8\<br />7\<br />6\<br />5.0\<br />4.1\<br />2.3.1 | -| OpenFOAM-Extend | cae | 2020-02-12 | \- | 4.0 | -| OpenMPI | mpi | 2021-02-17 | \- | 4.0.5\<br />4.0.4\<br />4.0.3\<br />4.0.1\<br />3.1.6\<br />3.1.4\<br />3.1.3\<br />3.1.2\<br />3.1.1\<br />2.1.5\<br />2.1.2\<br />1.10.7 | -| OpenMX | phys | 2020-12-08 | \- | 3.9.2 | -| OpenMolcas | chem | 2020-10-12 | \- | 19.11\<br />18.09 | -| OpenPGM | system | 2019-11-05 | \- | 5.2.122 | -| OpenSSL | system | 2019-11-08 | \- | 1.1.1b\<br />1.0.2l | -| PAPI | perf | 2020-10-12 | \- | 6.0.0\<br />5.7.0\<br />5.6.0 | -| PCL | vis | 2019-08-13 | \- | 1.9.1 | -| PCRE | devel | 2021-03-17 | \- | 8.44\<br />8.43\<br />8.41 | -| PCRE2 | devel | 2020-08-11 | \- | 10.34\<br />10.33\<br />10.32 | -| PDFCrop | tools | 2018-11-22 | \- | 0.4b | -| PDT | perf | 2020-10-12 | \- | 3.25.1\<br />3.25 | -| PETSc | numlib | 2020-12-08 | \- | 3.9.4\<br />3.9.3\<br />3.8.3\<br />3.7.7\<br />3.13.3\<br />3.12.4\<br />3.11.0\<br />3.10.5 | -| PFFT | numlib | 2020-10-15 | \- | 1.0.8 | -| PGI | compiler | 2020-02-07 | \- | 19.4\<br />19.10\<br />18.7\<br />18.4\<br />18.10\<br />17.7\<br />17.10 | -| PLUMED | chem | 2020-08-11 | \- | 2.6.0\<br />2.5.1\<br />2.4.0 | -| PLY | lib | 2019-02-26 | \- | 3.11 | -| PMIx | lib | 2021-02-24 | \- | 3.1.5\<br />3.1.1 | -| PROJ | lib | 2020-10-06 | \- | 6.2.1\<br />6.0.0\<br />5.0.0 | -| Pandoc | tools | 2019-10-14 | \- | 2.5 | -| Pango | vis | 2020-09-02 | \- | 1.44.7\<br />1.43.0\<br />1.42.4\<br />1.41.1 | -| ParMETIS | math | 2020-02-12 | \- | 4.0.3 | -| ParMGridGen | math | 2020-02-12 | \- | 1.0 | -| ParaView | vis | 2020-11-12 | \- | 5.9.0-RC1\<br />5.8.0\<br />5.7.0\<br />5.6.2\<br />5.5.2\<br />5.4.1 | -| Perl | lang | 2021-02-17 | \- | 5.32.0\<br />5.30.2\<br />5.30.0\<br />5.28.1\<br />5.28.0\<br />5.26.1 | -| Pillow | vis | 2021-02-17 | \- | 8.0.1\<br />7.0.0\<br />6.2.1\<br />5.0.0 | -| Pillow-SIMD | vis | 2018-11-22 | \- | 5.0.0 | -| PnetCDF | data | 2019-03-06 | \- | 1.9.0 | -| PyQt5 | vis | 
2018-11-22 | \- | 5.10.1 | -| PyTorch | devel | 2021-01-04 | \- | 1.6.0\<br />0.3.1 | -| PyYAML | lib | 2020-04-16 | \- | 5.1.2\<br />5.1\<br />3.13\<br />3.12 | -| Python | lang | 2021-02-17 | \- | 3.8.6\<br />3.8.2\<br />3.7.4\<br />3.7.2\<br />3.6.6\<br />3.6.4\<br />2.7.18\<br />2.7.16\<br />2.7.15\<br />2.7.14 | -| Qhull | math | 2019-11-08 | \- | 2019.1\<br />2015.2 | -| Qt | devel | 2018-11-22 | \- | 4.8.7 | -| Qt5 | devel | 2020-08-11 | \- | 5.9.3\<br />5.14.1\<br />5.13.1\<br />5.12.3\<br />5.10.1 | -| QuantumESPRESSO | chem | 2021-01-05 | \- | 6.6\<br />6.5\<br />6.4.1\<br />6.3\<br />6.2 | -| Qwt | lib | 2020-09-24 | \- | 6.1.4 | -| R | lang | 2020-10-06 | \- | 4.0.0\<br />3.6.2\<br />3.6.0\<br />3.5.1\<br />3.4.4 | -| RDFlib | lib | 2020-09-29 | \- | 4.2.2 | -| RELION | bio | 2019-02-27 | \- | 3.0\<br />2.1 | -| ROOT | data | 2019-06-03 | \- | 6.14.06 | -| Ruby | lang | 2019-11-14 | \- | 2.6.3\<br />2.6.1 | -| SCOTCH | math | 2020-12-08 | \- | 6.0.9\<br />6.0.6\<br />6.0.5\<br />6.0.4\<br />5.1.12b | -| SCons | devel | 2020-01-24 | \- | 3.1.1\<br />3.0.5\<br />3.0.1 | -| SHARC | chem | 2019-01-07 | \- | 2.0 | -| SIONlib | lib | 2020-10-12 | \- | 1.7.6\<br />1.7.4\<br />1.7.2 | -| SIP | lang | 2018-11-22 | \- | 4.19.8\<br />4.19.12 | -| SLEPc | numlib | 2020-02-27 | \- | 3.9.2\<br />3.12.2 | -| SPM | math | 2018-11-22 | \- | 12-r7219 | -| SQLite | devel | 2021-02-17 | \- | 3.33.0\<br />3.31.1\<br />3.29.0\<br />3.27.2\<br />3.26.0\<br />3.24.0\<br />3.21.0\<br />3.20.1 | -| STAR-CCM+ | cae | 2021-03-19 | \- | 15.06.008\<br />15.04.010-R8\<br />15.02.007\<br />14.04.011\<br />14.02.012\<br />13.06.012-R8\<br />13.04.011\<br />13.02.013-R8 | -| SUNDIALS | math | 2018-11-22 | \- | 2.7.0 | -| SWASH | phys | 2020-03-18 | \- | 6.01\<br />5.01 | -| SWIG | devel | 2020-10-06 | \- | 4.0.1\<br />3.0.12 | -| ScaFaCoS | math | 2020-08-11 | \- | 1.0.1 | -| ScaLAPACK | numlib | 2021-02-17 | \- | 2.1.0\<br />2.0.2 | -| Scalasca | perf | 2021-02-02 | \- | 2.5\<br />2.4 | -| SciPy-bundle | lang | 2021-02-17 | \- | 2020.11\<br />2020.03\<br />2019.10\<br />2019.03 | -| Score-P | perf | 2020-10-19 | \- | 6.0\<br />4.0 | -| Serf | tools | 2019-11-08 | \- | 1.3.9 | -| Siesta | phys | 2020-02-27 | \- | 4.1-b4\<br />4.1-b3\<br />4.1 | -| Six | lib | 2018-11-22 | \- | 1.11.0 | -| Spark | devel | 2020-10-22 | \- | 3.0.1\<br />2.4.4\<br />2.4.3 | -| Subversion | tools | 2019-11-08 | \- | 1.9.7\<br />1.12.0 | -| SuiteSparse | numlib | 2020-09-02 | \- | 5.7.1\<br />5.6.0\<br />5.4.0\<br />5.1.2 | -| SuperLU | numlib | 2019-01-15 | \- | 5.2.1 | -| SuperLU_DIST | numlib | 2019-01-16 | \- | 6.1.0 | -| SuperLU_MT | numlib | 2019-01-16 | \- | 3.1 | -| Szip | tools | 2021-03-03 | \- | 2.1.1 | -| Tcl | lang | 2021-02-17 | \- | 8.6.9\<br />8.6.8\<br />8.6.7\<br />8.6.10 | -| TensorFlow | lib | 2021-03-15 | \- | 2.4.1\<br />2.3.1\<br />2.1.0\<br />2.0.0\<br />1.8.0\<br />1.15.0\<br />1.10.0 | -| Tk | vis | 2021-02-17 | \- | 8.6.9\<br />8.6.8\<br />8.6.10 | -| Tkinter | lang | 2021-02-17 | \- | 3.8.6\<br />3.8.2\<br />3.7.4\<br />3.7.2\<br />3.6.6\<br />3.6.4\<br />2.7.15\<br />2.7.14 | -| TotalView | debugger | 2020-09-24 | \- | 8.14.1-8 | -| Trilinos | numlib | 2019-02-26 | \- | 12.12.1 | -| UCX | lib | 2021-02-24 | \- | 1.9.0\<br />1.8.0\<br />1.5.1 | -| UDUNITS | phys | 2020-08-28 | \- | 2.2.26 | -| UFL | cae | 2019-02-26 | \- | 2018.1.0 | -| UnZip | tools | 2021-02-17 | \- | 6.0 | -| VASP | phys | 2020-05-19 | \- | 5.4.4 | -| VMD | vis | 2018-11-22 | \- | 1.9.3 | -| VSEARCH | bio | 2018-11-22 | \- | 2.8.4 | -| 
VTK | vis | 2020-10-06 | \- | 8.2.0\<br />8.1.1\<br />8.1.0\<br />5.10.1 | -| VTune | tools | 2020-04-28 | \- | 2020\<br />2019\<br />2018 | -| Valgrind | debugger | 2019-11-05 | \- | 3.14.0\<br />3.13.0 | -| Vampir | tools | 2021-01-13 | \- | unstable\<br />9.9.0\<br />9.8.0\<br />9.7.1\<br />9.7.0\<br />9.6.1\<br />9.5.0\<br />9.11\<br />9.10.0 | -| Voro++ | math | 2020-08-11 | \- | 0.4.6 | -| WRF | geo | 2018-12-12 | \- | 3.8.1 | -| Wannier90 | chem | 2020-12-14 | \- | 2.1.0\<br />2.0.1.1\<br />1.2 | -| X11 | vis | 2021-02-17 | \- | 20201008\<br />20200222\<br />20190717\<br />20190311\<br />20180604\<br />20180131 | -| XML-Parser | data | 2018-11-22 | \- | 2.44-01 | -| XZ | tools | 2021-02-17 | \- | 5.2.5\<br />5.2.4\<br />5.2.3\<br />5.2.2 | -| YAXT | tools | 2020-10-06 | \- | 0.6.2\<br />0.6.0 | -| Yasm | lang | 2020-08-11 | \- | 1.3.0 | -| ZeroMQ | devel | 2019-11-05 | \- | 4.3.2\<br />4.2.5 | -| Zip | tools | 2021-03-17 | \- | 3.0 | -| ace | lib | 2018-11-22 | \- | 6.5.0 | -| ant | devel | 2020-09-02 | \- | 1.10.7\<br />1.10.1 | -| archspec | tools | 2020-08-11 | \- | 0.1.0 | -| arpack-ng | numlib | 2019-11-08 | \- | 3.7.0\<br />3.6.1\<br />3.5.0 | -| asciidoc | base | 2018-11-22 | \- | 8.6.9 | -| at-spi2-atk | vis | 2020-09-02 | \- | 2.34.1 | -| at-spi2-core | vis | 2020-09-02 | \- | 2.34.0 | -| auto_ml | lang | 2019-10-29 | \- | 2.9.9 | -| basemap | vis | 2019-04-02 | \- | 1.0.7 | -| binutils | tools | 2020-12-11 | \- | 2.35\<br />2.34\<br />2.32\<br />2.31.1\<br />2.30\<br />2.28\<br />2.27\<br />2.26 | -| bzip2 | tools | 2021-02-17 | \- | 1.0.8\<br />1.0.6 | -| cURL | tools | 2021-02-17 | \- | 7.72.0\<br />7.69.1\<br />7.66.0\<br />7.63.0\<br />7.60.0\<br />7.58.0 | -| cairo | vis | 2020-08-28 | \- | 1.16.0\<br />1.14.12 | -| cftime | data | 2019-07-17 | \- | 1.0.1 | -| chrpath | tools | 2018-11-22 | \- | 0.16 | -| ctags | devel | 2018-11-22 | \- | 5.8 | -| cuDNN | numlib | 2021-03-17 | \- | 8.0.4.30\<br />7.6.4.38\<br />7.4.2.24\<br />7.1.4.18\<br />7.1.4\<br />7.0.5 | -| ddt | tools | 2021-04-12 | \- | 20.2.1\<br />20.0.1\<br />18.2.2 | -| dftd3-lib | chem | 2021-02-17 | \- | 0.9 | -| dill | data | 2019-10-29 | \- | 0.3.1.1\<br />0.3.1 | -| double-conversion | lib | 2021-03-17 | \- | 3.1.5\<br />3.1.4 | -| ecCodes | tools | 2020-10-06 | \- | 2.8.2\<br />2.17.0 | -| expat | tools | 2021-02-17 | \- | 2.2.9\<br />2.2.7\<br />2.2.6\<br />2.2.5 | -| flair | vis | 2019-01-25 | \- | 2.3-0 | -| flair-geoviewer | vis | 2019-01-25 | \- | 2.3-0 | -| flatbuffers | devel | 2021-03-17 | \- | 1.12.0 | -| flatbuffers-python | devel | 2021-03-19 | \- | 1.12 | -| flex | lang | 2020-12-11 | \- | 2.6.4\<br />2.6.3\<br />2.6.0\<br />2.5.39 | -| fontconfig | vis | 2021-02-17 | \- | 2.13.92\<br />2.13.1\<br />2.13.0\<br />2.12.6 | -| foss | toolchain | 2021-02-17 | \- | 2020b\<br />2020a\<br />2019b\<br />2019a\<br />2018b\<br />2018a | -| fosscuda | toolchain | 2021-02-17 | \- | 2020b\<br />2020a\<br />2019b\<br />2019a\<br />2018b | -| freeglut | lib | 2019-11-08 | \- | 3.0.0 | -| freetype | vis | 2021-02-17 | \- | 2.9.1\<br />2.9\<br />2.10.3\<br />2.10.1 | -| future | lib | 2018-11-22 | \- | 0.16.0 | -| gc | lib | 2020-08-11 | \- | 7.6.12\<br />7.6.10\<br />7.6.0 | -| gcccuda | toolchain | 2021-02-17 | \- | 2020b\<br />2020a\<br />2019b\<br />2019a\<br />2018b | -| gettext | tools | 2021-02-17 | \- | 0.21\<br />0.20.1\<br />0.19.8.1\<br />0.19.8 | -| gflags | devel | 2020-09-29 | \- | 2.2.2 | -| giflib | lib | 2021-03-17 | \- | 5.2.1 | -| git | tools | 2021-03-17 | \- | 2.28.0\<br />2.23.0\<br 
/>2.21.0\<br />2.19.1\<br />2.18.0\<br />2.16.1 | -| git-cola | tools | 2018-11-22 | \- | 3.2 | -| git-lfs | tools | 2019-06-26 | \- | 2.7.2 | -| glew | devel | 2020-09-23 | \- | 2.1.0 | -| glog | devel | 2020-09-29 | \- | 0.4.0 | -| gmsh | vis | 2019-11-26 | \- | 4.4.1 | -| gnuplot | vis | 2019-11-08 | \- | 5.2.6\<br />5.2.5\<br />5.2.4\<br />5.2.2 | -| golf | toolchain | 2018-11-22 | \- | 2018a | -| gomkl | toolchain | 2019-07-17 | \- | 2019a | -| gompi | toolchain | 2021-02-17 | \- | 2020b\<br />2020a\<br />2019b\<br />2019a\<br />2018b\<br />2018a | -| gompic | toolchain | 2021-02-17 | \- | 2020b\<br />2020a\<br />2019b\<br />2019a\<br />2018b | -| gperf | devel | 2021-02-17 | \- | 3.1 | -| gperftools | tools | 2018-11-22 | \- | 2.7 | -| grib_api | data | 2018-11-22 | \- | 1.27.0 | -| gzip | tools | 2020-08-11 | \- | 1.9\<br />1.8\<br />1.10 | -| h5py | data | 2020-08-11 | \- | 2.9.0\<br />2.8.0\<br />2.7.1\<br />2.10.0 | -| help2man | tools | 2020-12-11 | \- | 1.47.8\<br />1.47.6\<br />1.47.4\<br />1.47.16\<br />1.47.12\<br />1.47.10 | -| hwloc | system | 2021-02-17 | \- | 2.2.0\<br />2.0.3\<br />1.11.8\<br />1.11.12\<br />1.11.11\<br />1.11.10 | -| hypothesis | tools | 2021-02-17 | \- | 5.41.2\<br />4.44.2 | -| icc | compiler | 2019-10-11 | \- | 2019.1.144\<br />2019.0.117\<br />2018.3.222\<br />2018.1.163 | -| iccifort | toolchain | 2020-09-21 | \- | 2020.2.254\<br />2020.1.217\<br />2019.5.281\<br />2019.1.144\<br />2019.0.117\<br />2018.3.222\<br />2018.1.163 | -| ifort | compiler | 2019-10-11 | \- | 2019.1.144\<br />2019.0.117\<br />2018.3.222\<br />2018.1.163 | -| iimpi | toolchain | 2020-08-03 | \- | 2020a\<br />2019b\<br />2019a\<br />2018b\<br />2018a | -| imkl | numlib | 2021-01-05 | \- | 2020.1.217\<br />2019.5.281\<br />2019.1.144\<br />2018.3.222\<br />2018.1.163 | -| impi | mpi | 2020-08-13 | \- | 2019.7.217\<br />2018.5.288\<br />2018.4.274\<br />2018.3.222\<br />2018.1.163 | -| intel | toolchain | 2020-08-03 | \- | 2020a\<br />2019b\<br />2019a\<br />2018b\<br />2018a | -| intltool | devel | 2021-02-17 | \- | 0.51.0 | -| iomkl | toolchain | 2021-01-05 | \- | 2020a\<br />2019a\<br />2018a | -| iompi | toolchain | 2021-01-05 | \- | 2020a\<br />2019a\<br />2018a | -| itac | tools | 2018-11-22 | \- | 2018.3.022 | -| kim-api | chem | 2020-08-11 | \- | 2.1.3 | -| libGLU | vis | 2020-08-11 | \- | 9.0.1\<br />9.0.0 | -| libGridXC | chem | 2020-02-27 | \- | 0.8.5 | -| libPSML | data | 2020-02-27 | \- | 1.1.8 | -| libarchive | tools | 2021-02-17 | \- | 3.4.3 | -| libcerf | math | 2019-11-08 | \- | 1.7\<br />1.5\<br />1.11 | -| libcint | lib | 2019-01-09 | \- | 3.0.14 | -| libdap | lib | 2020-09-29 | \- | 3.20.6 | -| libdrm | lib | 2020-08-11 | \- | 2.4.99\<br />2.4.97\<br />2.4.92\<br />2.4.91\<br />2.4.100 | -| libelf | devel | 2020-12-11 | \- | 0.8.13 | -| libepoxy | lib | 2020-09-02 | \- | 1.5.4 | -| libevent | lib | 2021-02-24 | \- | 2.1.8\<br />2.1.12\<br />2.1.11 | -| libfabric | lib | 2021-02-17 | \- | 1.11.0 | -| libffi | lib | 2021-02-17 | \- | 3.3\<br />3.2.1 | -| libgd | lib | 2019-11-08 | \- | 2.2.5 | -| libgeotiff | lib | 2020-10-06 | \- | 1.5.1\<br />1.4.2 | -| libglvnd | lib | 2020-08-11 | \- | 1.2.0 | -| libharu | lib | 2018-11-22 | \- | 2.3.0 | -| libiconv | lib | 2021-03-03 | \- | 1.16 | -| libjpeg-turbo | lib | 2021-02-17 | \- | 2.0.5\<br />2.0.4\<br />2.0.3\<br />2.0.2\<br />2.0.0\<br />1.5.3\<br />1.5.2 | -| libmatheval | lib | 2020-08-11 | \- | 1.1.11 | -| libpciaccess | system | 2021-02-17 | \- | 0.16\<br />0.14 | -| libpng | lib | 2021-02-17 | \- | 
1.6.37\<br />1.6.36\<br />1.6.34\<br />1.6.32 | -| libreadline | lib | 2021-02-17 | \- | 8.0\<br />7.0 | -| libsigc++ | devel | 2020-09-24 | \- | 2.10.1 | -| libsndfile | lib | 2020-08-28 | \- | 1.0.28 | -| libsodium | lib | 2019-11-05 | \- | 1.0.17\<br />1.0.16 | -| libssh2 | tools | 2018-11-22 | \- | 1.8.0 | -| libtirpc | lib | 2020-09-29 | \- | 1.2.6 | -| libtool | lib | 2021-02-17 | \- | 2.4.6 | -| libunistring | lib | 2020-08-11 | \- | 0.9.7\<br />0.9.10 | -| libunwind | lib | 2020-08-11 | \- | 1.3.1\<br />1.2.1 | -| libvdwxc | chem | 2020-01-15 | \- | 0.4.0 | -| libxc | chem | 2021-01-05 | \- | 4.3.4\<br />4.2.3\<br />3.0.1 | -| libxml++ | lib | 2020-09-24 | \- | 2.40.1 | -| libxml2 | lib | 2021-02-17 | \- | 2.9.9\<br />2.9.8\<br />2.9.7\<br />2.9.4\<br />2.9.10 | -| libxslt | lib | 2020-10-27 | \- | 1.1.34\<br />1.1.33\<br />1.1.32 | -| libxsmm | math | 2019-10-14 | \- | 1.8.3\<br />1.10 | -| libyaml | lib | 2020-04-16 | \- | 0.2.2\<br />0.2.1\<br />0.1.7 | -| likwid | tools | 2020-10-14 | \- | 5.0.1 | -| lo2s | perf | 2020-01-27 | \- | 1.3.0\<br />1.2.2\<br />1.1.1\<br />1.0.2\<br />1.0.1 | -| log4cxx | lang | 2020-02-18 | \- | 0.10.0 | -| lpsolve | math | 2018-11-22 | \- | 5.5.2.5 | -| lz4 | lib | 2020-08-11 | \- | 1.9.2\<br />1.9.1 | -| magma | math | 2021-01-04 | \- | 2.5.4\<br />2.3.0 | -| makedepend | devel | 2018-11-22 | \- | 1.0.5 | -| matplotlib | vis | 2021-02-17 | \- | 3.3.3\<br />3.2.1\<br />3.1.1\<br />3.0.3\<br />3.0.0\<br />2.1.2 | -| mkl-dnn | lib | 2018-11-22 | \- | 0.13 | -| molmod | math | 2020-08-11 | \- | 1.4.5 | -| motif | vis | 2018-11-22 | \- | 2.3.8 | -| ncdf4 | math | 2020-10-06 | \- | 1.17 | -| ncurses | devel | 2021-02-17 | \- | 6.2\<br />6.1\<br />6.0 | -| netCDF | data | 2021-03-03 | \- | 4.7.4\<br />4.7.1\<br />4.6.2\<br />4.6.1\<br />4.6.0 | -| netCDF-Fortran | data | 2021-03-04 | \- | 4.5.3\<br />4.5.2\<br />4.4.5\<br />4.4.4 | -| netcdf4-python | data | 2019-07-17 | \- | 1.4.3 | -| nettle | lib | 2020-01-24 | \- | 3.5.1\<br />3.4.1\<br />3.4 | -| nextstrain | bio | 2020-07-20 | \- | 2.0.0.post1 | -| nfft | math | 2018-11-28 | \- | 3.3.2DLR | -| nsync | devel | 2021-03-17 | \- | 1.24.0 | -| numactl | tools | 2021-02-17 | \- | 2.0.13\<br />2.0.12\<br />2.0.11 | -| numba | lang | 2020-09-25 | \- | 0.47.0 | -| numeca | cae | 2018-11-22 | \- | all | -| nvidia-nsight | tools | 2020-01-21 | \- | 2019.3.1 | -| p7zip | tools | 2018-11-22 | \- | 9.38.1 | -| parallel | tools | 2020-02-25 | \- | 20190922\<br />20190622\<br />20180822 | -| patchelf | tools | 2019-08-09 | \- | 0.9 | -| petsc4py | tools | 2019-02-26 | \- | 3.9.1 | -| pigz | tools | 2018-11-22 | \- | 2.4 | -| pixman | vis | 2020-08-28 | \- | 0.38.4\<br />0.38.0\<br />0.34.0 | -| pkg-config | devel | 2021-02-17 | \- | 0.29.2 | -| pkgconfig | devel | 2021-03-17 | \- | 1.5.1\<br />1.3.1 | -| pocl | lib | 2020-02-19 | \- | 1.4 | -| pompi | toolchain | 2018-11-22 | \- | 2018.04 | -| protobuf | devel | 2021-03-17 | \- | 3.6.1.2\<br />3.6.1\<br />3.14.0\<br />3.10.0 | -| protobuf-python | devel | 2021-03-17 | \- | 3.14.0\<br />3.10.0 | -| pybind11 | lib | 2021-02-17 | \- | 2.6.0\<br />2.4.3\<br />2.2.4 | -| pyscf | chem | 2019-11-13 | \- | 1.6.1\<br />1.6.0 | -| pytecplot | data | 2020-04-06 | \- | 1.0.0 | -| pytest | tools | 2019-02-26 | \- | 3.8.0 | -| qrupdate | numlib | 2019-11-08 | \- | 1.1.2 | -| re2c | tools | 2020-08-11 | \- | 1.3\<br />1.2.1 | -| rgdal | geo | 2019-11-27 | \- | 1.4-4 | -| rstudio | lang | 2020-02-18 | \- | 1.2.5001\<br />1.2.1335\<br />1.1.456 | -| scikit-learn | data | 
2020-09-24 | \- | 0.21.3 | -| scorep_plugin_fileparser | perf | 2018-11-22 | \- | 1.3.1 | -| sf | lib | 2020-10-06 | \- | 0.9-5 | -| slepc4py | tools | 2019-02-26 | \- | 3.9.0 | -| snakemake | tools | 2020-04-16 | \- | 5.7.1\<br />5.14.0 | -| snappy | lib | 2021-03-17 | \- | 1.1.8\<br />1.1.7 | -| source-highlight | tools | 2018-11-22 | \- | 3.1.8 | -| spacy | lang | 2018-12-05 | \- | 2.0.18 | -| spglib | chem | 2019-12-12 | \- | 1.14.1 | -| tbb | lib | 2020-08-11 | \- | 2020.1\<br />2019-U4\<br />2018-U5 | -| tcsh | tools | 2018-11-22 | \- | 6.20.00 | -| tecplot360ex | vis | 2020-07-20 | \- | 2019r1 | -| texinfo | devel | 2020-08-11 | \- | 6.7\<br />6.6\<br />6.5 | -| tmux | tools | 2021-04-09 | \- | 3.1c\<br />2.3 | -| torchvision | vis | 2018-11-22 | \- | 0.2.1 | -| tqdm | lib | 2020-09-29 | \- | 4.41.1 | -| typing-extensions | devel | 2021-03-19 | \- | 3.7.4.3 | -| utf8proc | lib | 2019-11-08 | \- | 2.3.0 | -| util-linux | tools | 2021-02-17 | \- | 2.36\<br />2.35\<br />2.34\<br />2.33\<br />2.32\<br />2.31.1 | -| wheel | tools | 2018-11-22 | \- | 0.31.1\<br />0.31.0 | -| x264 | vis | 2020-08-11 | \- | 20191217\<br />20190925\<br />20190413\<br />20181203\<br />20180128 | -| x265 | vis | 2020-08-11 | \- | 3.3\<br />3.2\<br />3.0\<br />2.9\<br />2.6 | -| xbitmaps | devel | 2018-11-22 | \- | 1.1.1 | -| xmlf90 | data | 2020-02-27 | \- | 1.5.4 | -| xorg-macros | devel | 2021-02-17 | \- | 1.19.2\<br />1.19.1 | -| xprop | vis | 2019-11-08 | \- | 1.2.4\<br />1.2.3\<br />1.2.2 | -| xproto | devel | 2018-11-22 | \- | 7.0.31 | -| yaff | chem | 2020-08-11 | \- | 1.6.0 | -| zlib | lib | 2020-12-11 | \- | 1.2.8\<br />1.2.11 | -| zsh | tools | 2021-01-06 | \- | 5.8 | -| zstd | lib | 2020-08-11 | \- | 1.4.4 | - -### Classic Environment - -<span class="twiki-macro TABLE">headerrows= 1"</span> <span -class="twiki-macro EDITTABLE" -format="| text, 30, Software | text, 40, Kategorie | text, 30, Letzte Änderung | text, 30, SGI-UV | text, 30, Taurus | " -changerows="on"></span> - -| Software | Category | Last change | Venus | Taurus | -|:-------------------------|:-------------|:------------|:-----------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| AVS-Express | applications | 2015-06-26 | \- | mpepst8.2\<br />8.2 | -| FPLO | applications | 2018-04-13 | \- | 18.00-53\<br />14.00-49 | -| VirtualGL | tools | 2013-10-02 | \- | | -| abaqus | applications | 2018-04-19 | 2017\<br />2016 | *2018* \<br /> 6.9-EF1\<br />6.13\<br />6.12\<br />2017\<br />2016 | -| abinit | applications | 2013-11-21 | 7.2.1 | 7.2.1 | -| ace | lib | 2018-11-22 | \- | 6.3.3 | -| adolc | libraries | 2014-07-24 | *2.5.0* \<br /> 2.4.1 | *2.5.0* \<br /> 2.4.1 | -| afni | applications | 2014-07-01 | \- | 2011-12-21-1014 | -| amber | applications | 2017-10-06 | \- | 15 | -| ansys | applications | 2018-09-04 | 18.0\<br />17.1\<br />16.1 | 19.0\<br />18.2\<br />18.1\<br />18.0\<br />17.2\<br />17.1\<br />17.0\<br />16.1 | -| ansysem | applications | 2017-07-20 | \- | 16.0 | -| asm | tools | 2017-03-17 | \- | 5.2 | -| autoconf | tools | 2013-10-30 | 2.69 | 2.69 | -| automake | tools | 2014-09-11 | 1.14\<br />1.12.2\<br />1.12 | 1.14\<br />1.12.2 | -| autotools | tools | 2017-02-01 | \- | default\<br />2015 | -| bazel | 
compilers | 2017-07-13 | \- | 0.5.2 | -| bison | libraries | 2015-03-11 | \- | 3.0.4 | -| blcr | tools | 2016-03-02 | \- | | -| boost | libraries | 2019-03-29 | *1.49* \<br /> 1.69.0\<br />1.54\<br />1.51.0 | *1.54.0* \<br /> 1.66.0\<br />1.65.1\<br />1.65.0\<br />1.64.0\<br />1.63.0\<br />1.62.0\<br />1.61.0\<br />1.60.0\<br />1.59.0\<br />1.58.0\<br />1.57.0\<br />1.56.0\<br />1.55.0\<br />1.49 | -| bowtie | applications | 2013-01-16 | 0.12.8 | \- | -| bullxmpi | libraries | 2016-10-14 | \- | *1.2.8.4* \<br /> 1.2.9.2 | -| casita | tools | 2017-06-08 | \- | 1.9 | -| cdo | tools | 2013-04-08 | 1.6.0 | \- | -| ceph | libraries | 2017-01-13 | \- | 11.1 | -| cereal | libraries | 2016-12-07 | \- | 1.2.1 | -| cg | libraries | 2015-09-18 | \- | 3.1 | -| cgal | libraries | 2018-03-08 | \- | 4.11.1 | -| clFFT | libraries | 2017-07-12 | \- | 2.12.2dev\<br />2.12.2\<br />2.12.1\<br />2.12.0\<br />2.10.0 | -| clang | compilers | 2018-08-20 | \- | 4.0.0 | -| cmake | tools | 2019-04-03 | 3.3.1\<br />3.11.4\<br />2.8.2\<br />2.8.12.2\<br />2.8.11 | *3.10.1* \<br /> 3.9.0\<br />3.6.2\<br />3.3.1\<br />2.8.2\<br />2.8.12.2\<br />2.8.11 | -| collectl | tools | 2017-12-05 | 4.2.0\<br />4.1.2\<br />3.6.7\<br />3.5.1 | 4.2.0\<br />4.1.2\<br />3.6.7\<br />3.5.1 | -| comsol | applications | 2018-04-20 | \- | *5.3a* \<br /> 5.3 | -| conn | libraries | 2017-07-12 | \- | 17f | -| cp2k | applications | 2017-12-05 | 2.5 | *5.1* \<br /> r16298\<br />r15503\<br />r14075\<br />r13178\<br />3.0\<br />2.6.2\<br />2.6\<br />2.4\<br />2.3\<br />130326 | -| cpufrequtils | libraries | 2017-02-16 | \- | gcc5.3.0 | -| ctool | libraries | 2013-04-17 | 2.12 | 2.12 | -| cube | tools | 2018-05-31 | *4.3* | *4.3* \<br /> 4.4 | -| cuda | libraries | 2018-06-07 | \- | *9.2.88* \<br /> 9.1.85\<br />9.0.176\<br />8.0.61\<br />8.0.44\<br />7.5.18\<br />7.0.28 | -| curl | libraries | 2013-01-18 | 7.28.1 | \- | -| cusp | libraries | 2014-04-22 | \- | 0.4.0\<br />0.3.1 | -| cython | libraries | 2016-08-26 | \- | 0.24.1\<br />0.24\<br />0.19.2 | -| dalton | applications | 2016-04-07 | \- | 2016.0 | -| darshan | tools | 2017-09-02 | \- | darshan-3.1.4 | -| dash | libraries | 2019-02-14 | \- | dash | -| dataheap | libraries | 2017-01-12 | \- | 1.2\<br />1.1 | -| ddt | tools | 2021-04-12 | 4.2\<br />4.0\<br />3.2.1 | *18.0.1* \<br /> 6.0.5\<br />6.0 | -| dftb+ | applications | 2017-06-26 | \- | mpi\<br />1.3\<br />1.2.2 | -| dmtcp | tools | 2017-10-18 | \- | *2.5.1-ib* \<br /> ib-id | -| doxygen | tools | 2016-01-27 | \- | 1.8.11\<br />1.7.4 | -| dune | libraries | 2014-05-28 | \- | 2.2.1 | -| dyninst | libraries | 2016-02-16 | 8.1.1 | 8.2.1\<br />8.1.1 | -| eigen | libraries | 2017-03-23 | \- | 3.3.3\<br />3.2.0 | -| eirods | tools | 2013-12-11 | \- | 3.1 | -| eman2 | applications | 2017-06-21 | \- | 2.2 | -| ensight | applications | 2015-07-13 | \- | 10.1.5a\<br />10.0 | -| extrae | applications | 2017-01-03 | \- | 3.4.1 | -| fftw | libraries | 2017-03-18 | \- | 3.3.6pl1\<br />3.3.5\<br />3.3.4 | -| firestarter | applications | 2016-04-20 | \- | 1.4 | -| flex | lang | 2020-12-11 | \- | 2.5.39 | -| fme | applications | 2017-04-21 | \- | 2017 | -| freecad | applications | 2014-05-12 | \- | *0.14* \<br /> 0.13 | -| freeglut | lib | 2019-11-08 | 2.8.1 | 2.8.1 | -| freesurfer | applications | 2015-12-04 | \- | *5.3.0* \<br /> 5.1.0 | -| fsl | libraries | 2018-01-17 | \- | 5.0.5\<br />5.0.4\<br />5.0.10 | -| ga | libraries | 2015-06-22 | \- | 5.2 | -| gamess | applications | 2014-12-11 | \- | 2013 | -| gams | applications | 2017-07-10 | \- | 
24.8\<br />24.3.3 | -| gaussian | applications | 2017-06-01 | g16\<br />g09d01\<br />g09b01\<br />g09\<br />g03 | *g16* \<br /> g09d01\<br />g09b01\<br />g09 | -| gautomatch | applications | 2017-06-21 | \- | 0.53 | -| gcc | compilers | 2019-04-02 | 8.3.0\<br />7.1.0\<br />6.3.0\<br />5.5.0\<br />4.9.3\<br />4.9.1\<br />4.8.2\<br />4.8.0\<br />4.7.1 | *7.1.0* \<br /> 6.3.0\<br />6.2.0\<br />5.5.0\<br />5.3.0\<br />5.2.0\<br />5.1.0\<br />4.9.3\<br />4.9.1\<br />4.8.2\<br />4.8.0\<br />4.7.1\<br />4.6.2 | -| gcl | compilers | 2017-03-13 | \- | 2.6.12 | -| gcovr | tools | 2015-02-12 | *3.2* | *3.2* | -| gctf | applications | 2017-06-21 | \- | 0.50 | -| gdb | tools | 2017-08-03 | 7.9.1 | 7.5 | -| gdk | tools | 2015-12-14 | \- | 352 | -| geany | tools | 2014-05-12 | \- | 1.24.1 | -| ghc | compilers | 2015-01-16 | 7.6.3 | 7.6.3 | -| git | tools | 2021-03-17 | *1.8.3.1* \<br /> 2.17.1\<br />1.7.7\<br />1.7.4.1\<br />1.7.3.2 | *2.15.1* \<br /> 2.7.3\<br />1.9.0 | -| glib | libraries | 2016-10-27 | \- | 2.50.1\<br />2.44.1 | -| gmap | tools | 2018-07-13 | \- | 2018-07-04 | -| gmock | tools | 2013-10-17 | \- | *1.6.0* | -| gnuplot | vis | 2019-11-08 | 4.6.1\<br />4.4.0 | 4.6.1\<br />4.4.0 | -| gpaw | applications | 2016-02-19 | \- | 0.11.0 | -| gperftools | tools | 2018-11-22 | \- | gperftools-2.6.1.lua | -| gpi2 | libraries | 2018-04-25 | \- | *git* \<br /> 1.3.0\<br />1.2.2\<br />1.1.0 | -| gpi2-mpi | libraries | 2015-03-30 | \- | *1.1.1* | -| gpudevkit | libraries | 2016-07-25 | \- | 352-79 | -| grads | applications | 2014-08-05 | 2.0.2 | \- | -| grid | tools | 2014-12-09 | \- | 2012 | -| gromacs | applications | 2018-01-22 | \- | *5.1.3* \<br /> 5.1.4\<br />5.1.1\<br />5.1\<br />4.6.7\<br />4.5.5\<br />3.3.3 | -| gsl | libraries | 2015-01-28 | \- | 1.16 | -| gulp | applications | 2015-10-09 | \- | 4.3 | -| gurobi | applications | 2017-06-28 | 7.0.2\<br />6.0.4 | 7.0.2\<br />6.0.4 | -| h5utils | tools | 2014-08-18 | \- | 1.12.1 | -| haskell-platform | tools | 2015-01-16 | 2013.2.0.0 | 2013.2.0.0 | -| hdeem | libraries | 2018-01-22 | \- | *deprecated* \<br /> 2.2.20ms\<br />2.2.2\<br />2.2.19ms\<br />2.2.16ms\<br />2.2.15ms\<br />2.2.13ms\<br />2.1.9ms\<br />2.1.5\<br />2.1.4\<br />2.1.10ms | -| hdf5 | libraries | 2018-02-09 | 1.8.14\<br />1.8.10 | hdfview\<br />1.8.19\<br />1.8.18\<br />1.8.16\<br />1.8.15\<br />1.8.14\<br />1.8.10\<br />1.6.5\<br />1.10.1 | -| hip | libraries | 2018-08-27 | \- | git | -| hoomd-blue | applications | 2016-07-29 | \- | 2.0.1 | -| hpc-x | libraries | 2017-04-10 | \- | 1.8.0 | -| hpctoolkit | tools | 2013-05-28 | 5.3.2 | \- | -| hpx | libraries | 2017-09-15 | \- | hpx | -| htop | tools | 2016-11-04 | 1.0.2 | 1.0.2 | -| hwloc | system | 2021-02-17 | \- | 1.11.8\<br />1.11.6 | -| hyperdex | tools | 2015-08-20 | \- | default\<br />1.8.1\<br />1.7.1 | -| hyperopt | libraries | 2018-03-19 | \- | 0.1 | -| imagemagick | applications | 2015-03-18 | \- | 6.9.0 | -| intel | toolchain | 2020-08-03 | *2013* \<br /> 2017.2.174\<br />2016.2.181\<br />2016.1.150\<br />2015.3.187\<br />2015.2.164\<br />2013-sp1\<br />11.1.069 | *2018.1.163* \<br /> 2018.0.128\<br />2017.4.196\<br />2017.2.174\<br />2017.1.132\<br />2017.0.020\<br />2016.2.181\<br />2016.1.150\<br />2015.3.187\<br />2015.2.164\<br />2015.1.133\<br />2013-sp1\<br />2013\<br />12.1\<br />11.1.069 | -| intelmpi | libraries | 2017-11-21 | \- | *2018.1.163* \<br /> 5.1.3.181\<br />5.1.2.150\<br />5.0.3.048\<br />5.0.1.035\<br />2018.0.128\<br />2017.3.196\<br />2017.2.174\<br />2017.1.132\<br />2017.0.098\<br />2013 | -| 
iotop | tools | 2013-07-16 | \- | 0.5 | -| iotrack | tools | 2013-07-16 | \- | 0.5 | -| java | tools | 2015-11-17 | jre1.6.0-21\<br />jdk1.8.0-66\<br />jdk1.7.0-25\<br />jdk1.7.0-03 | jdk1.8.0-66\<br />jdk1.7.0-25 | -| julia | compilers | 2018-05-15 | \- | *0.6.2* \<br /> 0.4.6\<br />0.4.1 | -| knime | applications | 2017-03-20 | \- | 3.3.1\<br />3.1.0\<br />2.11.3-24\<br />2.11.3 | -| lammps | applications | 2017-08-31 | 2014sep\<br />2014jun\<br />2013feb | *2016jul* \<br /> 2017aug\<br />2016may\<br />2015aug\<br />2014sep\<br />2014feb\<br />2013feb\<br />2013aug | -| lbfgsb | libraries | 2013-08-02 | *3.0* \<br /> 2.1 | *3.0* \<br /> 2.1 | -| libnbc | libraries | 2014-05-28 | \- | 1.1.1 | -| libssh2 | tools | 2018-11-22 | \- | 1.8.0 | -| libsvm | tools | 2015-11-20 | \- | 3.20 | -| libtool | lib | 2021-02-17 | \- | 2.4.2 | -| libunwind | lib | 2020-09-25 | \- | 1.1 | -| libxc | chem | 2021-01-05 | \- | 3.0.0\<br />2.2.2 | -| liggghts | applications | 2014-05-28 | \- | 2.3.8\<br />2.3.2 | -| llview | tools | 2015-01-28 | \- | | -| llvm | compilers | 2018-04-20 | \- | *4.0.0* \<br /> ykt\<br />3.9.1\<br />3.7\<br />3.4\<br />3.3.1 | -| lo2s | perf | 2020-01-27 | \- | 2018-02-13\<br />2017-12-06\<br />2017-08-07 | -| ls-dyna | applications | 2017-12-05 | *9.0.1* \<br /> 971\<br />7.0 | *10.0.0* \<br /> dev-121559\<br />971\<br />9.0.1\<br />8.1\<br />7.1.2\<br />7.1.1\<br />7.0\<br />6.0 | -| ls-dyna-usermat | applications | 2016-08-25 | \- | 9.0.1-s\<br />9.0.1-d\<br />7.1.2-d\<br />7.1.1-s\<br />7.1.1-d | -| ls-prepost | applications | 2016-08-22 | \- | *4.3* | -| lumerical | applications | 2016-06-01 | \- | fdtd-8.11.422 | -| m4 | tools | 2013-10-30 | \- | 1.4.16 | -| m4ri | libraries | 2017-03-27 | \- | 20140914 | -| make | tools | 2018-02-21 | \- | 4.2 | -| map | tools | 2016-11-22 | \- | 6.0.5 | -| mathematica | applications | 2015-10-16 | \- | *10.0* \<br /> 8.0 | -| matlab | applications | 2019-02-26 | deprecated.lua\<br />2017a.lua\<br />2016b.lua\<br />2015b.lua\<br />2014a.lua\<br />2013a.lua\<br />2012a.lua\<br />2010b.lua\<br />2010a.lua | \- | -| maxima | applications | 2017-03-15 | \- | 5.39.0 | -| med | libraries | 2017-09-27 | \- | 3.2.0 | -| meep | applications | 2015-04-23 | \- | 1.3\<br />1.2.1 | -| mercurial | tools | 2014-10-22 | 3.1.2 | 3.1.2 | -| metis | libraries | 2013-12-17 | 5.1.0 | 5.1.0\<br />4.0.3 | -| mkl | libraries | 2017-05-10 | 2013 | 2017\<br />2015\<br />2013 | -| modenv | environment | 2020-03-25 | \- | scs5.lua\<br />ml.lua\<br />hiera.lua\<br />classic.lua | -| mongodb | applications | 2018-03-19 | \- | 3.6.3 | -| motioncor2 | applications | 2017-06-21 | \- | 01-30-2017 | -| mpb | applications | 2014-08-19 | \- | 1.4.2 | -| mpi4py | libraries | 2016-05-02 | 1.3.1 | 2.0.0\<br />1.3.1 | -| mpirt | libraries | 2017-11-21 | \- | *2018.1.163* \<br /> 5.1.3.181\<br />5.1.2.150\<br />5.0.3.048\<br />5.0.1.035\<br />2018.0.128\<br />2017.3.196\<br />2017.2.174\<br />2017.1.132\<br />2017.0.098\<br />2013 | -| mumps | libraries | 2017-05-11 | \- | 5.1.1 | -| must | tools | 2018-02-01 | \- | 1.5.0\<br />1.4.0 | -| mvapich2 | libraries | 2017-05-23 | \- | 2.2 | -| mysql | tools | 2013-12-06 | \- | 6.0.11 | -| namd | applications | 2015-09-08 | \- | *2.10* \<br /> 2.9 | -| nco | tools | 2013-08-01 | 4.3.0 | \- | -| nedit | tools | 2013-04-30 | 5.6\<br />5.5 | 5.6\<br />5.5 | -| netbeans | applications | 2018-03-07 | \- | 8.2 | -| netcdf | libraries | 2018-02-09 | 4.1.3 | 4.6.0\<br />4.4.0\<br />4.3.3.1\<br />4.1.3 | -| netlogo | applications | 2017-08-08 | \- 
| 6.0.1\<br />5.3.0\<br />5.2.0 | -| nsys | tools | 2018-09-11 | \- | 2018.1.1.36\<br />2018.0.1.173 | -| numeca | cae | 2018-11-22 | \- | all | -| nwchem | applications | 2016-02-15 | 6.3 | *6.6* \<br /> custom\<br />6.5patched\<br />6.5\<br />6.3.r2\<br />6.3 | -| octave | applications | 2018-03-23 | \- | 3.8.1 | -| octopus | applications | 2017-03-07 | \- | 6.0 | -| openbabel | applications | 2014-03-07 | 2.3.2 | 2.3.2 | -| opencl | libraries | 2015-07-13 | \- | 1.2-4.4.0.117 | -| openems | applications | 2017-08-01 | \- | 0.0.35 | -| openfoam | applications | 2020-10-15 | \- | *2.3.0* \<br /> v1712\<br />v1706\<br />5.0\<br />4.0\<br />2.4.0\<br />2.3.1\<br />2.2.2 | -| openmpi | libraries | 2018-02-01 | \- | *1.10.2* \<br /> 3.0.0\<br />2.1.1\<br />2.1.0\<br />1.8.8\<br />1.10.4\<br />1.10.3 | -| opentelemac | applications | 2017-09-29 | \- | v7p2r3\<br />v7p1r1 | -| oprofile | tools | 2013-06-05 | 0.9.8 | \- | -| orca | applications | 2017-07-27 | \- | 4.0.1\<br />4.0.0.2\<br />3.0.3 | -| otf2 | libraries | 2018-02-12 | \- | *2.0* \<br /> 2.1\<br />1.4\<br />1.3 | -| papi | libraries | 2017-11-06 | 5.1.0 | 5.5.1\<br />5.4.3\<br />5.4.1 | -| parallel | tools | 2020-02-25 | \- | 20170222 | -| paraview | applications | 2016-03-03 | \- | *4.1.0* \<br /> 4.0.1 | -| parmetis | libraries | 2018-01-12 | \- | 4.0.3 | -| pasha | applications | 2013-11-14 | 1.0.9 | \- | -| pathscale | compilers | 2016-03-04 | \- | enzo-6.0.858\<br />enzo-6.0.749 | -| pdt | tools | 2015-09-18 | 3.18.1 | 3.18.1 | -| perf | tools | 2016-06-14 | | | -| perl | applications | 2015-01-29 | 5.20.1\<br />5.12.1 | 5.20.1\<br />5.12.1 | -| petsc | libraries | 2018-03-19 | *3.3-p6* \<br /> 3.1-p8 | *3.3-p6* \<br /> 3.8.3-64bit\<br />3.8.3\<br />3.4.4\<br />3.4.3\<br />3.3-p7-64bit\<br />3.3-p7\<br />3.2-p7\<br />3.1-p8-p\<br />3.1-p8 | -| pgi | compilers | 2018-09-07 | 14.9\<br />14.7\<br />14.6\<br />14.3\<br />13.4 | *18.3* \<br /> 17.7\<br />17.4\<br />17.1\<br />16.9\<br />16.5\<br />16.4\<br />16.10\<br />16.1\<br />15.9\<br />14.9 | -| pigz | tools | 2018-11-22 | \- | 2.3.4 | -| prope-env | tools | 2017-05-02 | \- | *.1.0* | -| protobuf | devel | 2021-03-17 | \- | 3.5.0\<br />3.2.0 | -| pycuda | libraries | 2016-10-28 | \- | *2016.1.2* \<br /> 2013.1.1\<br />2012.1 | -| pyslurm | libraries | 2017-11-09 | \- | 16.05.8 | -| python | libraries | 2018-01-17 | 3.6\<br />3.3.0\<br />2.7.5\<br />2.7 | *3.6* \<br /> intelpython3\<br />intelpython2\<br />3.5.2\<br />3.4.3\<br />3.3.0\<br />3.1.2\<br />2.7.6\<br />2.7.5\<br />2.7 | -| q-chem | applications | 2016-12-12 | \- | 4.4 | -| qt | libraries | 2016-10-26 | \- | *4.8.1* \<br /> 5.4.1\<br />4.8.7 | -| quantum_espresso | applications | 2016-09-13 | *5.0.3* \<br /> 5.0.2 | 5.3.0\<br />5.1.2\<br />5.0.3 | -| r | applications | 2016-02-18 | \- | 3.2.1\<br />2.15.3 | -| ramdisk | tools | 2016-07-21 | 1.0 | \- | -| read-nvml-clocks-pci | tools | 2018-02-22 | \- | 1.0 | -| readex | tools | 2018-06-13 | \- | pre\<br />beta-1806\<br />beta-1805\<br />beta-1804\<br />beta\<br />alpha | -| redis | tools | 2016-06-28 | \- | 3.2.1 | -| relion | applications | 2017-06-21 | \- | 2.1 | -| repoclient | applications | 2017-01-18 | \- | 1.4.1 | -| ripgrep | tools | 2017-02-17 | \- | 0.3.2 | -| robinhood | tools | 2017-05-04 | \- | 2.4.3 | -| root | applications | 2015-02-27 | \- | 6.02.05 | -| rstudio | lang | 2020-02-18 | \- | 0.98.1103 | -| ruby | tools | 2014-07-21 | \- | 2.1.2 | -| samrai | libraries | 2016-04-22 | \- | 3.10.0 | -| samtools | tools | 2018-07-13 | 0.1.18 | 1.8 | -| 
scafes | libraries | 2017-05-05 | *2.3.0* \<br /> 2.2.0\<br />2.0.0\<br />1.0.0 | *2.3.0* \<br /> 2.2.0\<br />2.1.0\<br />2.0.1\<br />2.0.0\<br />1.0.0 | -| scala | compilers | 2015-06-22 | \- | 2.11.4\<br />2.10.4 | -| scalapack | libraries | 2013-11-21 | 2.0.2 | \- | -| scalasca | tools | 2018-02-06 | \- | *2.3.1* | -| scons | tools | 2015-11-19 | 2.3.4 | 2.4.1\<br />2.3.4 | -| scorep | tools | 2018-10-01 | 1.3.0 | *3.0* \<br /> try\<br />trunk\<br />ompt\<br />java\<br />dev-io\<br />3.1 | -| scorep-apapi | libraries | 2018-01-09 | \- | gcc-2018-01-09 | -| scorep-cpu-energy | libraries | 2017-01-20 | \- | r217\<br />r211\<br />r117\<br />2017-01-20\<br />2016-04-07 | -| scorep-cpu-id | libraries | 2014-08-13 | \- | r117 | -| scorep-dataheap | libraries | 2015-07-28 | \- | *2015-07-28* \<br /> r191\<br />r122 | -| scorep-dev | tools | 2017-07-19 | \- | *05* | -| scorep-hdeem | libraries | 2018-06-19 | \- | *2016-12-20* \<br /> sync\<br />2017-12-08a\<br />2017-12-08.lua\<br />2016-11-21 | -| scorep-plugin-x86-energy | libraries | 2018-06-19 | \- | xmpi\<br />intelmpi\<br />2017-09-06\<br />2017-09-05 | -| scorep-printmetrics | libraries | 2018-02-26 | \- | 2018-02-26 | -| scorep-uncore | libraries | 2018-06-21 | \- | *2018-01-24* \<br /> 2016-03-29 | -| scorep_plugin_x86_energy | libraries | 2018-07-04 | \- | intel-2018\<br />gcc-7.1.0\<br />2017-07-14 | -| scout | compilers | 2015-06-22 | 1.6.0 | 1.6.0 | -| sed | tools | 2018-02-21 | \- | 4.4 | -| sftp | tools | 2014-04-10 | \- | 6.6 | -| shifter | tools | 2016-06-09 | \- | 16.04.0pre1 | -| siesta | applications | 2017-05-29 | 3.1-pl20 | 4.0\<br />3.2-pl4 | -| singularity | tools | 2019-02-13 | | ff69c5f3 | -| sionlib | tools | 2017-06-29 | \- | 1.6.1\<br />1.5.5 | -| siox | libraries | 2016-10-27 | \- | 2016-10-27\<br />2016-10-26 | -| spm | applications | 2014-07-09 | \- | 8-r4667 | -| spm12 | libraries | 2017-07-12 | \- | r6906 | -| spparks | applications | 2016-06-30 | \- | 2016feb | -| sqlite3 | libraries | 2016-07-27 | 3.8.2 | 3.8.10 | -| sra-tools | tools | 2018-07-19 | \- | 2.9.1 | -| stack | tools | 2016-06-23 | \- | 1.1.2 | -| star | applications | 2017-10-25 | \- | *12.06* \<br /> 9.06\<br />12.04\<br />12.02\<br />11.02\<br />10.04 | -| subread | tools | 2018-07-13 | \- | 1.6.2 | -| suitesparse | libraries | 2017-08-25 | \- | *4.5.4* \<br /> 4.2.1 | -| superlu | libraries | 2017-08-25 | \- | *5.2.1* | -| superlu_dist | libraries | 2017-03-21 | \- | 5.1.3 | -| superlu_mt | libraries | 2017-03-21 | \- | 3.1 | -| svn | tools | 2016-03-16 | 1.8.11\<br />1.7.3 | *1.9.3* \<br /> 1.8.8\<br />1.8.11 | -| swig | tools | 2016-04-08 | 2.0 | 3.0.8\<br />2.0 | -| swipl | tools | 2017-02-28 | \- | 7.4.0-rc2 | -| tcl | applications | 2017-08-07 | \- | 8.6.6 | -| tcltk | applications | 2015-02-06 | \- | 8.4.20 | -| tecplot360 | applications | 2018-05-22 | 2015\<br />2013 | 2018r1\<br />2017r2\<br />2017r1\<br />2016r2\<br />2015r2\<br />2015\<br />2013\<br />2010 | -| tesseract | libraries | 2016-06-28 | \- | 3.04 | -| texinfo | devel | 2020-08-11 | \- | 5.2 | -| theodore | libraries | 2016-05-27 | \- | 1.3 | -| tiff | libraries | 2013-09-16 | 3.9.2 | 3.9.2 | -| tinker | applications | 2014-03-04 | \- | 6.3 | -| tmux | tools | 2021-04-09 | \- | 2.2 | -| totalview | tools | 2017-11-03 | *2017.2.11* \<br /> 8.9.2-0\<br />8.8.0-1\<br />8.13.0-0\<br />8.11.0-3 | *2017.2.11* \<br /> 8.9.2-0\<br />8.8.0-1\<br />8.13.0-0\<br />8.11.0-3 | -| trilinos | applications | 2016-04-13 | \- | 12.6.1 | -| trinityrnaseq | applications | 2013-05-16 | 
r2013-02-25 | \- | -| turbomole | applications | 2017-01-10 | 7.1\<br />6.6\<br />6.5 | 7.1\<br />6.6\<br />6.5 | -| valgrind | tools | 2017-08-08 | 3.8.1 | *3.10.1* \<br /> r15216\<br />3.8.1\<br />3.13.0 | -| vampir | tools | 2019-03-04 | *9.6.1* \<br /> 9.7\<br />9.5.0\<br />9.4.0\<br />9.3.0\<br />8.5.0 | *9.5.0* \<br /> 9.4.0\<br />9.3.0\<br />9.3\<br />9.2.0\<br />9.1.0\<br />9.0.0\<br />8.5.0\<br />8.4.1\<br />8.3.0 | -| vampirlive | tools | 2018-02-27 | \- | | -| vampirtrace | tools | 2016-03-29 | *5.14.4* | *5.14.4* | -| vampirtrace-plugins | libraries | 2014-08-06 | \- | x86\<br />power-1.1\<br />power-1.0\<br />apapi | -| vasp | applications | 2017-11-08 | *5.3* \<br /> 5.2 | 5.4.4\<br />5.4.1\<br />5.3 | -| visit | applications | 2016-12-13 | \- | 2.4.2\<br />2.12.0 | -| vmd | applications | 2016-12-15 | \- | 1.9.3 | -| vt_dataheap | libraries | 2014-08-14 | \- | r190 | -| vtk | libraries | 2016-08-16 | 5.10.1 | 5.10.1 | -| wannier90 | libraries | 2016-04-21 | 1.2 | 2.0.1\<br />1.2 | -| wget | tools | 2015-05-12 | 1.16.3 | 1.16.3 | -| wxwidgets | libraries | 2017-03-15 | \- | 3.0.2 | -| yade | applications | 2014-05-22 | \- | | -| zlib | lib | 2020-12-11 | \- | 1.2.8 | - -### ML Environment - -<span class="twiki-macro TABLE">headerrows= 1"</span> <span -class="twiki-macro EDITTABLE" -format="| text, 30, Software | text, 40, Kategorie | text, 30, Letzte Änderung | text, 30, SGI-UV | text, 30, Taurus | " -changerows="on"></span> - -| Software | Category | Last change | Venus | Taurus | -|:--------------------------|:----------|:------------|:------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| ASE | chem | 2020-09-29 | \- | 3.19.0 | -| ATK | vis | 2020-10-12 | \- | 2.34.1\<br />2.28.1 | -| Anaconda3 | lang | 2019-11-13 | \- | 2019.07\<br />2019.03 | -| Arrow | data | 2020-11-30 | \- | 0.16.0\<br />0.14.1 | -| Autoconf | devel | 2020-10-14 | \- | 2.69 | -| Automake | devel | 2020-10-14 | \- | 1.16.1\<br />1.15.1 | -| Autotools | devel | 2020-10-14 | \- | 20180311\<br />20170619 | -| Bazel | devel | 2021-03-22 | \- | 3.7.1\<br />3.4.1\<br />2.0.0\<br />1.1.0\<br />0.29.1\<br />0.26.1\<br />0.25.2\<br />0.20.0\<br />0.18.0 | -| BigDataFrameworkConfigure | devel | 2019-09-16 | \- | 0.0.2\<br />0.0.1 | -| Bison | lang | 2020-10-14 | \- | 3.5.3\<br />3.3.2\<br />3.0.5\<br />3.0.4 | -| Boost | devel | 2020-11-30 | \- | 1.71.0\<br />1.70.0\<br />1.69.0\<br />1.67.0\<br />1.66.0 | -| CMake | devel | 2020-11-02 | \- | 3.9.5\<br />3.9.1\<br />3.16.4\<br />3.15.3\<br />3.13.3\<br />3.12.1\<br />3.11.4\<br />3.10.2 | -| CUDA | system | 2020-10-14 | \- | 9.2.88\<br />11.0.2\<br />10.1.243\<br />10.1.105 | -| CUDAcore | system | 2020-10-14 | \- | 11.0.2 | -| Check | lib | 2020-10-14 | \- | 0.15.2 | -| Clang | compiler | 2020-02-20 | \- | 9.0.1 | -| CubeLib | perf | 2020-07-23 | \- | 4.4.4\<br />4.4 | -| CubeW | perf | 2019-02-04 | \- | 4.4 | -| CubeWriter | perf | 2020-07-23 | \- | 4.4.3 | -| DBus | devel | 2019-09-11 | \- | 1.13.8 | -| Devel-NYTProf | perf | 2019-08-30 | \- | 6.06 | -| Doxygen | devel | 2019-07-17 | \- | 1.8.14\<br />1.8.13 | -| EasyBuild | tools | 2021-03-08 | \- | 4.3.3\<br />4.3.2\<br />4.3.1\<br />4.3.0\<br />4.2.2\<br />4.2.0\<br />4.1.2\<br />4.1.1\<br />4.1.0\<br />4.0.1\<br />3.9.4\<br />3.9.3\<br />3.9.2\<br />3.9.1\<br />3.8.1\<br />3.8.0\<br />3.7.1\<br />3.7.0\<br 
/>3.6.2\<br />3.6.1 | -| Eigen | math | 2020-11-02 | \- | 3.3.7 | -| FFTW | numlib | 2020-10-14 | \- | 3.3.8\<br />3.3.7\<br />3.3.6 | -| FFmpeg | vis | 2020-09-29 | \- | 4.2.1\<br />4.1 | -| Flink | devel | 2019-09-16 | \- | 1.9.0\<br />1.8.1 | -| FriBidi | lang | 2020-09-29 | \- | 1.0.5 | -| GCC | compiler | 2020-10-14 | \- | 9.3.0\<br />8.3.0\<br />8.2.0-2.31.1\<br />7.3.0-2.30\<br />6.4.0-2.28 | -| GCCcore | compiler | 2020-10-14 | \- | 9.3.0\<br />8.3.0\<br />8.2.0\<br />7.3.0\<br />6.4.0 | -| GDAL | data | 2019-08-19 | \- | 2.2.3 | -| GDRCopy | lib | 2020-10-14 | \- | 2.1 | -| GEOS | math | 2019-08-19 | \- | 3.6.2 | -| GLib | vis | 2020-02-19 | \- | 2.62.0\<br />2.60.1\<br />2.54.3 | -| GMP | math | 2020-11-02 | \- | 6.2.0\<br />6.1.2 | -| GObject-Introspection | devel | 2020-10-12 | \- | 1.63.1\<br />1.54.1 | -| GSL | numlib | 2020-02-19 | \- | 2.6\<br />2.5 | -| GTK+ | vis | 2019-02-15 | \- | 2.24.32 | -| Gdk-Pixbuf | vis | 2019-02-15 | \- | 2.36.12 | -| Ghostscript | tools | 2020-02-19 | \- | 9.50\<br />9.27 | -| HDF5 | data | 2020-02-19 | \- | 1.10.5\<br />1.10.2\<br />1.10.1 | -| Hadoop | devel | 2019-09-16 | \- | 2.7.7 | -| HarfBuzz | vis | 2019-02-15 | \- | 2.2.0 | -| Horovod | tools | 2020-08-04 | \- | 0.19.5\<br />0.18.2 | -| Hyperopt | lib | 2020-02-19 | \- | 0.2.2 | -| ICU | lib | 2020-02-19 | \- | 64.2\<br />61.1\<br />56.1 | -| ImageMagick | vis | 2020-02-20 | \- | 7.0.9-5\<br />7.0.8-46 | -| JUnit | devel | 2020-01-21 | \- | 4.12 | -| JasPer | vis | 2020-02-19 | \- | 2.0.14 | -| Java | lang | 2020-02-26 | \- | 11.0.6\<br />1.8.0-162\<br />1.8-191-b26 | -| JsonCpp | lib | 2020-10-30 | \- | 1.9.3 | -| Keras | math | 2019-06-28 | \- | 2.2.4 | -| LAME | data | 2020-06-24 | \- | 3.100 | -| LLVM | compiler | 2020-09-24 | \- | 9.0.0\<br />8.0.1\<br />7.0.1\<br />6.0.0\<br />5.0.1 | -| LMDB | lib | 2020-10-30 | \- | 0.9.24 | -| LibTIFF | lib | 2020-02-19 | \- | 4.0.9\<br />4.0.10 | -| LibUUID | lib | 2019-08-02 | \- | 1.0.3 | -| LittleCMS | vis | 2020-02-20 | \- | 2.9 | -| M4 | devel | 2020-10-14 | \- | 1.4.18\<br />1.4.17 | -| MPFR | math | 2020-06-24 | \- | 4.0.2 | -| Mako | devel | 2020-02-19 | \- | 1.1.0\<br />1.0.8\<br />1.0.7 | -| Mesa | vis | 2020-02-19 | \- | 19.1.7\<br />19.0.1\<br />18.1.1\<br />17.3.6 | -| Meson | tools | 2020-11-03 | \- | 0.55.1\<br />0.51.2\<br />0.50.0 | -| MongoDB | data | 2019-08-05 | \- | 4.0.3 | -| NASM | lang | 2020-02-19 | \- | 2.14.02\<br />2.13.03 | -| NCCL | lib | 2020-02-18 | \- | 2.4.8\<br />2.4.2\<br />2.3.7 | -| NLopt | numlib | 2020-02-19 | \- | 2.6.1\<br />2.4.2 | -| NSPR | lib | 2019-09-11 | \- | 4.21 | -| NSS | lib | 2019-09-11 | \- | 3.42.1 | -| Ninja | tools | 2020-11-03 | \- | 1.9.0\<br />1.10.0 | -| OPARI2 | perf | 2020-07-23 | \- | 2.0.5\<br />2.0.3 | -| OTF2 | perf | 2020-07-23 | \- | 2.2\<br />2.1.1 | -| OpenBLAS | numlib | 2020-10-14 | \- | 0.3.9\<br />0.3.7\<br />0.3.5\<br />0.3.1\<br />0.2.20 | -| OpenCV | vis | 2019-02-21 | \- | 4.0.1 | -| OpenMPI | mpi | 2021-02-10 | \- | 4.0.3\<br />3.1.4\<br />3.1.3\<br />3.1.1 | -| OpenPGM | system | 2019-09-11 | \- | 5.2.122 | -| PAPI | perf | 2020-07-23 | \- | 6.0.0\<br />5.6.0 | -| PCRE | devel | 2020-02-19 | \- | 8.43\<br />8.41 | -| PCRE2 | devel | 2019-09-11 | \- | 10.33 | -| PDT | perf | 2020-07-23 | \- | 3.25 | -| PGI | compiler | 2019-05-14 | \- | 19.4 | -| PMIx | lib | 2020-10-14 | \- | 3.1.5 | -| PROJ | lib | 2019-08-19 | \- | 5.0.0 | -| Pango | vis | 2019-02-15 | \- | 1.42.4 | -| Perl | lang | 2020-10-14 | \- | 5.30.2\<br />5.30.0\<br />5.28.1\<br />5.28.0\<br />5.26.1 | 
-| Pillow | vis | 2020-06-24 | \- | 6.2.1 | -| PowerAI | data | 2019-12-10 | \- | 1.7.0.a0\<br />1.6.1 | -| PyTorch | devel | 2020-09-29 | \- | 1.6.0\<br />1.3.1\<br />1.1.0 | -| PyTorch-Geometric | devel | 2020-09-29 | \- | 1.6.1 | -| PyYAML | lib | 2020-02-18 | \- | 5.1.2\<br />3.13 | -| Python | lang | 2020-11-02 | \- | 3.8.2\<br />3.7.4\<br />3.7.2\<br />3.6.6\<br />3.6.4\<br />2.7.16\<br />2.7.15\<br />2.7.14 | -| PythonAnaconda | lang | 2019-12-10 | \- | 3.7\<br />3.6 | -| Qt5 | devel | 2019-09-12 | \- | 5.12.3 | -| R | lang | 2020-08-20 | \- | 3.6.2\<br />3.6.0\<br />3.4.4 | -| RDFlib | lib | 2020-09-29 | \- | 4.2.2 | -| SCons | devel | 2019-09-11 | \- | 3.0.5 | -| SIONlib | lib | 2020-07-23 | \- | 1.7.6 | -| SQLite | devel | 2020-11-02 | \- | 3.31.1\<br />3.29.0\<br />3.27.2\<br />3.24.0\<br />3.21.0\<br />3.20.1 | -| SWIG | devel | 2020-10-30 | \- | 4.0.1\<br />3.0.12 | -| ScaLAPACK | numlib | 2020-10-14 | \- | 2.1.0\<br />2.0.2 | -| SciPy-bundle | lang | 2020-11-02 | \- | 2020.03\<br />2019.10 | -| Score-P | perf | 2020-07-23 | \- | 6.0\<br />4.1 | -| Six | lib | 2019-02-05 | \- | 1.11.0 | -| Spark | devel | 2020-09-29 | \- | 3.0.1\<br />2.4.4\<br />2.4.3 | -| SpectrumMPI | mpi | 2019-01-14 | \- | system | -| Szip | tools | 2020-02-18 | \- | 2.1.1 | -| Tcl | lang | 2020-11-02 | \- | 8.6.9\<br />8.6.8\<br />8.6.7\<br />8.6.10 | -| TensorFlow | lib | 2020-10-30 | \- | 2.3.1\<br />2.2.0\<br />2.1.0\<br />2.0.0\<br />1.15.0\<br />1.14.0 | -| Tk | vis | 2020-02-19 | \- | 8.6.9\<br />8.6.8 | -| Tkinter | lang | 2020-09-23 | \- | 3.7.4\<br />3.6.6 | -| UCX | lib | 2020-10-14 | \- | 1.8.0 | -| UDUNITS | phys | 2020-02-19 | \- | 2.2.26 | -| UnZip | tools | 2020-11-02 | \- | 6.0 | -| Vampir | perf | 2020-11-30 | \- | 9.9.0\<br />9.8.0\<br />9.7.1\<br />9.11\<br />9.10.0 | -| X11 | vis | 2020-11-03 | \- | 20200222\<br />20190717\<br />20190311\<br />20180604\<br />20180131 | -| XML-Parser | data | 2019-02-15 | \- | 2.44-01 | -| XZ | tools | 2020-10-14 | \- | 5.2.5\<br />5.2.4\<br />5.2.3 | -| Yasm | lang | 2020-09-29 | \- | 1.3.0 | -| ZeroMQ | devel | 2019-09-11 | \- | 4.3.2 | -| Zip | tools | 2020-07-30 | \- | 3.0 | -| ant | devel | 2020-10-12 | \- | 1.10.7\<br />1.10.1 | -| binutils | tools | 2020-10-14 | \- | 2.34\<br />2.32\<br />2.31.1\<br />2.30\<br />2.28 | -| bokeh | tools | 2020-09-29 | \- | 1.4.0 | -| bzip2 | tools | 2020-11-02 | \- | 1.0.8\<br />1.0.6 | -| cURL | tools | 2020-11-02 | \- | 7.69.1\<br />7.66.0\<br />7.63.0\<br />7.60.0\<br />7.58.0 | -| cairo | vis | 2020-02-19 | \- | 1.16.0\<br />1.14.12 | -| cftime | data | 2019-07-17 | \- | 1.0.1 | -| cuDNN | numlib | 2020-02-18 | \- | 7.6.4.38\<br />7.4.2.24\<br />7.1.4.18 | -| dask | data | 2020-09-29 | \- | 2.8.0 | -| dill | data | 2019-10-29 | \- | 0.3.1.1 | -| double-conversion | lib | 2020-08-13 | \- | 3.1.4 | -| expat | tools | 2020-10-14 | \- | 2.2.9\<br />2.2.7\<br />2.2.6\<br />2.2.5 | -| flatbuffers | devel | 2020-10-30 | \- | 1.12.0 | -| flatbuffers-python | devel | 2021-04-10 | \- | 1.12 | -| flex | lang | 2020-10-14 | \- | 2.6.4\<br />2.6.3 | -| fontconfig | vis | 2020-11-03 | \- | 2.13.92\<br />2.13.1\<br />2.13.0\<br />2.12.6 | -| fosscuda | toolchain | 2020-10-14 | \- | 2020a\<br />2019b\<br />2019a\<br />2018b | -| freetype | vis | 2020-11-03 | \- | 2.9.1\<br />2.9\<br />2.10.1 | -| future | lib | 2019-02-05 | \- | 0.16.0 | -| gcccuda | toolchain | 2020-10-14 | \- | 2020a\<br />2019b\<br />2019a\<br />2018b | -| gettext | tools | 2020-11-03 | \- | 0.20.1\<br />0.19.8.1 | -| gflags | devel | 2020-06-24 | \- | 
2.2.2 | -| giflib | lib | 2020-10-30 | \- | 5.2.1 | -| git | tools | 2020-02-18 | \- | 2.23.0\<br />2.18.0 | -| glog | devel | 2020-06-24 | \- | 0.4.0 | -| golf | toolchain | 2019-01-14 | \- | 2018a | -| gompic | toolchain | 2020-10-14 | \- | 2020a\<br />2019b\<br />2019a\<br />2018b | -| gperf | devel | 2020-11-03 | \- | 3.1 | -| gsmpi | toolchain | 2019-01-14 | \- | 2018a | -| gsolf | toolchain | 2019-01-14 | \- | 2018a | -| h5py | data | 2020-07-30 | \- | 2.8.0\<br />2.10.0 | -| help2man | tools | 2020-10-14 | \- | 1.47.8\<br />1.47.6\<br />1.47.4\<br />1.47.12 | -| hwloc | system | 2020-10-14 | \- | 2.2.0\<br />2.0.3\<br />1.11.12\<br />1.11.11\<br />1.11.10 | -| hypothesis | tools | 2020-06-24 | \- | 4.44.2 | -| intltool | devel | 2020-11-03 | \- | 0.51.0 | -| libGLU | vis | 2020-02-19 | \- | 9.0.1\<br />9.0.0 | -| libdrm | lib | 2020-02-19 | \- | 2.4.99\<br />2.4.97\<br />2.4.92\<br />2.4.91 | -| libevent | lib | 2020-10-14 | \- | 2.1.8\<br />2.1.11 | -| libfabric | lib | 2021-02-10 | \- | 1.11.0 | -| libffi | lib | 2020-11-02 | \- | 3.3\<br />3.2.1 | -| libgeotiff | lib | 2019-08-19 | \- | 1.4.2 | -| libjpeg-turbo | lib | 2020-02-19 | \- | 2.0.3\<br />2.0.2\<br />2.0.0\<br />1.5.3 | -| libpciaccess | system | 2020-10-14 | \- | 0.16\<br />0.14 | -| libpng | lib | 2020-11-03 | \- | 1.6.37\<br />1.6.36\<br />1.6.34 | -| libreadline | lib | 2020-10-14 | \- | 8.0\<br />7.0 | -| libsndfile | lib | 2020-02-19 | \- | 1.0.28 | -| libsodium | lib | 2019-09-11 | \- | 1.0.17 | -| libtool | lib | 2020-10-14 | \- | 2.4.6 | -| libunwind | lib | 2020-09-25 | \- | 1.3.1\<br />1.2.1 | -| libxml2 | lib | 2020-10-14 | \- | 2.9.9\<br />2.9.8\<br />2.9.7\<br />2.9.4\<br />2.9.10 | -| libxslt | lib | 2020-10-27 | \- | 1.1.34\<br />1.1.33\<br />1.1.32 | -| libyaml | lib | 2020-01-24 | \- | 0.2.2\<br />0.2.1 | -| magma | math | 2020-09-29 | \- | 2.5.1 | -| matplotlib | vis | 2020-09-23 | \- | 3.1.1\<br />3.0.3 | -| ncurses | devel | 2020-10-14 | \- | 6.2\<br />6.1\<br />6.0 | -| netCDF | data | 2019-07-17 | \- | 4.6.1\<br />4.6.0 | -| netcdf4-python | data | 2019-07-17 | \- | 1.4.3 | -| nettle | lib | 2020-02-19 | \- | 3.5.1\<br />3.4.1\<br />3.4 | -| nsync | devel | 2020-10-30 | \- | 1.24.0 | -| numactl | tools | 2020-10-14 | \- | 2.0.13\<br />2.0.12\<br />2.0.11 | -| numba | lang | 2020-09-25 | \- | 0.47.0 | -| pixman | vis | 2020-02-19 | \- | 0.38.4\<br />0.38.0\<br />0.34.0 | -| pkg-config | devel | 2020-10-14 | \- | 0.29.2 | -| pkgconfig | devel | 2020-02-18 | \- | 1.5.1\<br />1.3.1 | -| pocl | lib | 2020-04-22 | \- | 1.4 | -| protobuf | devel | 2020-01-24 | \- | 3.6.1.2\<br />3.6.1\<br />3.10.0 | -| protobuf-python | devel | 2020-10-30 | \- | 3.10.0 | -| pybind11 | lib | 2020-11-02 | \- | 2.4.3 | -| re2c | tools | 2019-09-11 | \- | 1.1.1 | -| rstudio | lang | 2020-01-21 | \- | 1.2.5001 | -| scikit-image | vis | 2020-09-29 | \- | 0.16.2 | -| scikit-learn | data | 2020-09-24 | \- | 0.21.3 | -| snappy | lib | 2020-10-30 | \- | 1.1.7 | -| spleeter | tools | 2020-10-05 | \- | 1.5.4 | -| torchvision | vis | 2021-03-11 | \- | 0.7.0 | -| tqdm | lib | 2020-09-29 | \- | 4.41.1 | -| typing-extensions | devel | 2021-04-10 | \- | 3.7.4.3 | -| util-linux | tools | 2020-11-03 | \- | 2.35\<br />2.34\<br />2.33\<br />2.32.1\<br />2.32\<br />2.31.1 | -| wheel | tools | 2019-01-30 | \- | 0.31.1 | -| x264 | vis | 2020-06-24 | \- | 20190925\<br />20181203 | -| x265 | vis | 2020-09-29 | \- | 3.2\<br />3.0 | -| xorg-macros | devel | 2020-10-14 | \- | 1.19.2 | -| zlib | lib | 2020-10-14 | \- | 1.2.11 | -| zsh | tools | 
2021-01-06 | \- | 5.8 | diff --git a/twiki2md/root/PerformanceTools/IOTrack.md b/twiki2md/root/PerformanceTools/IOTrack.md deleted file mode 100644 index f20334c8ead2ae2fcb75e39a7b1096ec524e0cbc..0000000000000000000000000000000000000000 --- a/twiki2md/root/PerformanceTools/IOTrack.md +++ /dev/null @@ -1,27 +0,0 @@ -# Introduction - -IOTrack is a small tool developed at ZIH that tracks the I/O requests of -all processes and dumps a statistic per process at the end of the -program run. - -# How it works - -On taurus load the module via - - module load iotrack - -Then, instead of running your normal command, put "iotrack" in front of -it. So, - - python xyz.py arg1 arg2 - -changes to: - - iotrack python xyz.py arg1 arg2 - -# Technical Details - -The functionality is implemented in a library that is preloaded via -LD_PRELOAD. Thus, this will not work for static binaries. - --- Main.MichaelKluge - 2013-07-16 diff --git a/twiki2md/root/SoftwareDevelopment/PerformanceTools.md b/twiki2md/root/SoftwareDevelopment/PerformanceTools.md deleted file mode 100644 index eb353e8d62612b27761368ce9d6a6326918e1d2f..0000000000000000000000000000000000000000 --- a/twiki2md/root/SoftwareDevelopment/PerformanceTools.md +++ /dev/null @@ -1,15 +0,0 @@ -# Performance Tools - -- [Score-P](ScoreP) - tool suite for profiling, event tracing, and - online analysis of HPC applications -- [VampirTrace](VampirTrace) - recording performance relevant data at - runtime -- [Vampir](Vampir) - visualizing performance data from your program -- [Hardware performance counters - PAPI](PapiLibrary) - generic - performance counters -- [perf tools](PerfTools) - general performance statistic -- [IOTrack](IOTrack) - I/O statistics -- [EnergyMeasurement](EnergyMeasurement) - energy/power measurements - on taurus - --- Main.mark - 2009-12-16 diff --git a/twiki2md/root/SystemTaurus/EnergyMeasurement.md b/twiki2md/root/SystemTaurus/EnergyMeasurement.md deleted file mode 100644 index 607263056fd593b5f0ae62474a1c63961c8b31aa..0000000000000000000000000000000000000000 --- a/twiki2md/root/SystemTaurus/EnergyMeasurement.md +++ /dev/null @@ -1,310 +0,0 @@ -# Energy Measurement Infrastructure - -All nodes of the HPC machine Taurus are equipped with power -instrumentation that allow the recording and accounting of power -dissipation and energy consumption data. The data is made available -through several different interfaces, which will be described below. - -## System Description - -The Taurus system is split into two phases. While both phases are -equipped with energy-instrumented nodes, the instrumentation -significantly differs in the number of instrumented nodes and their -spatial and temporal granularity. - -### Phase 1 - -In phase one, the 270 Sandy Bridge nodes are equipped with node-level -power instrumentation that is stored in the Dataheap infrastructure at a -rate of 1Sa/s and further the energy consumption of a job is available -in SLURM (see below). - -### Phase 2 - -In phase two, all of the 1456 Haswell DLC nodes are equipped with power -instrumentation. In addition to the access methods of phase one, users -will also be able to access the measurements through a C API to get the -full temporal and spatial resolution, as outlined below: - -- ** Blade:**1 kSa/s for the whole node, includes both sockets, DRAM, - SSD, and other on-board consumers. Since the system is directly - water cooled, no cooling components are included in the blade - consumption. 
-- **Voltage regulators (VR):** 100 Sa/s for each of the six VR - measurement points, one for each socket and four for eight DRAM - lanes (two lanes bundled). - -The GPU blades of each Phase as well as the Phase I Westmere partition -also have 1 Sa/s power instrumentation but have a lower accuracy. - -HDEEM is now generally available on all nodes in the "haswell" -partition. - -## Summary of Measurement Interfaces - -| Interface | Sensors | Rate | Phase I | Phase II Haswell | -|:-------------------------------------------|:----------------|:--------------------------------|:--------|:-----------------| -| Dataheap (C, Python, VampirTrace, Score-P) | Blade, (CPU) | 1 Sa/s | yes | yes | -| HDEEM\* (C, Score-P) | Blade, CPU, DDR | 1 kSa/s (Blade), 100 Sa/s (VRs) | no | yes | -| HDEEM Command Line Interface | Blade, CPU, DDR | 1 kSa/s (Blade), 100 Sa/s (VR) | no | yes | -| SLURM Accounting (sacct) | Blade | Per Job Energy | yes | yes | -| SLURM Profiling (hdf5) | Blade | up to 1 Sa/s | yes | yes | - -Note: Please specify `-p haswell --exclusive` along with your job -request if you wish to use hdeem. - -## Accuracy - -HDEEM measurements have an accuracy of 2 % for Blade (node) -measurements, and 5 % for voltage regulator (CPU, DDR) measurements. - -## Command Line Interface - -The HDEEM infrastructure can be controlled through command line tools -that are made available by loading the **hdeem** module. They are -commonly used on the node under test to start, stop, and query the -measurement device. - -- **startHdeem**: Start a measurement. After the command succeeds, the - measurement data with the 1000 / 100 Sa/s described above will be - recorded on the Board Management Controller (BMC), which is capable - of storing up to 8h of measurement data. -- **stopHdeem**: Stop a measurement. No further data is recorded and - the previously recorded data remains available on the BMC. -- **printHdeem**: Read the data from the BMC. By default, the data is - written into a CSV file, whose name can be controlled using the - **-o** argument. -- **checkHdeem**: Print the status of the measurement device. -- **clearHdeem**: Reset and clear the measurement device. No further - data can be read from the device after this command is executed - before a new measurement is started. - -## Integration in Application Performance Traces - -The per-node power consumption data can be included as metrics in -application traces by using the provided metric plugins for Score-P (and -VampirTrace). The plugins are provided as modules and set all necessary -environment variables that are required to record data for all nodes -that are part of the current job. - -For 1 Sa/s Blade values (Dataheap): - -- [Score-P](ScoreP): use the module **`scorep-dataheap`** -- [VampirTrace](VampirTrace): use the module - **vampirtrace-plugins/power-1.1** - -For 1000 Sa/s (Blade) and 100 Sa/s (CPU{0,1}, DDR{AB,CD,EF,GH}): - -- [Score-P](ScoreP): use the module **\<span - class="WYSIWYG_TT">scorep-hdeem\</span>**\<br />Note: %ENDCOLOR%This - module requires a recent version of "scorep/sync-...". Please use - the latest that fits your compiler & MPI version.**\<br />** -- [VampirTrace](VampirTrace): not supported - -By default, the modules are set up to record the power data for the -nodes they are used on. For further information on how to change this -behavior, please use module show on the respective module. 
- - # Example usage with gcc - % module load scorep/trunk-2016-03-17-gcc-xmpi-cuda7.5 - % module load scorep-dataheap - % scorep gcc application.c -o application - % srun ./application - -Once the application is finished, a trace will be available that allows -you to correlate application functions with the component power -consumption of the parallel application. Note: For energy measurements, -only tracing is supported in Score-P/VampirTrace. The modules therefore -disables profiling and enables tracing, please use [Vampir](Vampir) to -view the trace. - -\<img alt="demoHdeem_high_low_vampir_3.png" height="262" -src="%ATTACHURL%/demoHdeem_high_low_vampir_3.png" width="695" /> - -%RED%Note<span class="twiki-macro ENDCOLOR"></span>: the power -measurement modules **`scorep-dataheap`** and **`scorep-hdeem`** are -dynamic and only need to be loaded during execution. However, -**`scorep-hdeem`** does require the application to be linked with a -certain version of Score-P. - -By default,** `scorep-dataheap`**records all sensors that are available. -Currently this is the total node consumption and for Phase II the CPUs. -**`scorep-hdeem`** also records all available sensors (node, 2x CPU, 4x -DDR) by default. You can change the selected sensors by setting the -environment variables: - - # For HDEEM - % export SCOREP_METRIC_HDEEM_PLUGIN=Blade,CPU* - # For Dataheap - % export SCOREP_METRIC_DATAHEAP_PLUGIN=localhost/watts - -For more information on how to use Score-P, please refer to the -[respective documentation](ScoreP). - -## Access Using Slurm Tools - -[Slurm](Slurm) maintains its own database of job information, including -energy data. There are two main ways of accessing this data, which are -described below. - -### Post-Mortem Per-Job Accounting - -This is the easiest way of accessing information about the energy -consumed by a job and its job steps. The Slurm tool `sacct` allows users -to query post-mortem energy data for any past job or job step by adding -the field `ConsumedEnergy` to the `--format` parameter: - - $> sacct --format="jobid,jobname,ntasks,submit,start,end,ConsumedEnergy,nodelist,state" -j 3967027 - JobID JobName NTasks Submit Start End ConsumedEnergy NodeList State - ------------ ---------- -------- ------------------- ------------------- ------------------- -------------- --------------- ---------- - 3967027 bash 2014-01-07T12:25:42 2014-01-07T12:25:52 2014-01-07T12:41:20 taurusi1159 COMPLETED - 3967027.0 sleep 1 2014-01-07T12:26:07 2014-01-07T12:26:07 2014-01-07T12:26:18 0 taurusi1159 COMPLETED - 3967027.1 sleep 1 2014-01-07T12:29:06 2014-01-07T12:29:06 2014-01-07T12:29:16 1.67K taurusi1159 COMPLETED - 3967027.2 sleep 1 2014-01-07T12:33:25 2014-01-07T12:33:25 2014-01-07T12:33:36 1.84K taurusi1159 COMPLETED - 3967027.3 sleep 1 2014-01-07T12:34:06 2014-01-07T12:34:06 2014-01-07T12:34:11 1.09K taurusi1159 COMPLETED - 3967027.4 sleep 1 2014-01-07T12:38:03 2014-01-07T12:38:03 2014-01-07T12:39:44 18.93K taurusi1159 COMPLETED - -The job consisted of 5 job steps, each executing a sleep of a different -length. Note that the ConsumedEnergy metric is only applicable to -exclusive jobs. - -### - -### Slurm Energy Profiling - -The `srun` tool offers several options for profiling job steps by adding -the `--profile` parameter. Possible profiling options are `All`, -`Energy`, `Task`, `Lustre`, and `Network`. In all cases, the profiling -information is stored in an hdf5 file that can be inspected using -available hdf5 tools, e.g., `h5dump`. 
The files are stored under -`/scratch/profiling/` for each job, job step, and node. A description of -the data fields in the file can be found -[here](http://slurm.schedmd.com/hdf5_profile_user_guide.html#HDF5). In -general, the data files contain samples of the current **power** -consumption on a per-second basis: - - $> srun -p sandy --acctg-freq=2,energy=1 --profile=energy sleep 10 - srun: job 3967674 queued and waiting for resources - srun: job 3967674 has been allocated resources - $> h5dump /scratch/profiling/jschuch/3967674_0_taurusi1073.h5 - [...] - DATASET "Energy_0000000002 Data" { - DATATYPE H5T_COMPOUND { - H5T_STRING { - STRSIZE 24; - STRPAD H5T_STR_NULLTERM; - CSET H5T_CSET_ASCII; - CTYPE H5T_C_S1; - } "Date_Time"; - H5T_STD_U64LE "Time"; - H5T_STD_U64LE "Power"; - H5T_STD_U64LE "CPU_Frequency"; - } - DATASPACE SIMPLE { ( 1 ) / ( 1 ) } - DATA { - (0): { - "", - 1389097545, # timestamp - 174, # power value - 1 - } - } - } - -## - -## Using the HDEEM C API - -Note: Please specify -p haswell --exclusive along with your job request -if you wish to use hdeem. - -Please download the offical documentation at \<font face="Calibri" -size="2"> [\<font -color="#0563C1">\<u>http://www.bull.com/download-hdeem-library-reference-guide\</u>\</font>](http://www.bull.com/download-hdeem-library-reference-guide)\</font> - -The HDEEM headers and sample code are made available via the hdeem -module. To find the location of the hdeem installation use - - % module show hdeem - ------------------------------------------------------------------- - /sw/modules/taurus/libraries/hdeem/2.1.9ms: - - conflict hdeem - module-whatis Load hdeem version 2.1.9ms - prepend-path PATH /sw/taurus/libraries/hdeem/2.1.9ms/include - setenv HDEEM_ROOT /sw/taurus/libraries/hdeem/2.1.9ms - ------------------------------------------------------------------- - -You can find an example of how to use the API under -\<span>$HDEEM_ROOT/sample.\</span> - -## Access Using the Dataheap Infrastructure - -In addition to the energy accounting data that is stored by Slurm, this -information is also written into our local data storage and analysis -infrastructure called -[Dataheap](http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/dataheap/). -From there, the data can be used in various ways, such as including it -into application performance trace data or querying through a Python -interface. - -The Dataheap infrastructure is designed to store various types of -time-based samples from different data sources. In the case of the -energy measurements on Taurus, the data is stored as a timeline of power -values which allows the reconstruction of the power and energy -consumption over time. The timestamps are stored as UNIX timestamps with -a millisecond granularity. The data is stored for each node in the form -of `nodename/watts`, e.g., `taurusi1073/watts`. Further metrics might -already be available or might be added in the future for which -information is available upon request. - -**Note**: The dataheap infrastructure can only be accessed from inside -the university campus network. - -### Using the Python Interface - -The module `dataheap/1.0` provides a Python module that can be used to -query the data in the Dataheap for personalized data analysis. 
The -following is an example of how to use the interface: - - import time - import os - from dhRequest import dhClient - - # Connect to the dataheap manager - dhc = dhClient() - dhc.connect(os.environ['DATAHEAP_MANAGER_ADDR'], int(os.environ['DATAHEAP_MANAGER_PORT'])) - - # take timestamps - tbegin = dhc.getTimeStamp() - # workload - os.system("srun -n 6 a.out") - tend = dhc.getTimeStamp() - - # wait for the data to get to the - # dataheap - time.sleep(5) - - # replace this with name of the node the job ran on - # Note: use multiple requests if the job used multiple nodes - countername = "taurusi1159/watts" - - # query the dataheap - integral = dhc.storageRequest("INTEGRAL(%d,%d,\"%s\", 0)"%(tbegin, tend, countername)) - # Remember: timestamps are stored in millisecond UNIX timestamps - energy = integral/1000 - - print energy - - timeline = dhc.storageRequest("TIMELINE(%d,%d,\"%s\", 0)"%(tbegin, tend, countername)) - - # output a list of all timestamp/power-value pairs - print timeline - -## More information and Citing - -More information can be found in the paper \<a -href="<http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7016382>" -title="HDEEM Paper E2SC 2014">HDEEM: high definition energy efficiency -monitoring\</a> by Hackenberg et al. Please cite this paper if you are -using HDEEM for your scientific work. diff --git a/twiki2md/root/SystemTaurus/RunningNxGpuAppsInOneJob.md b/twiki2md/root/SystemTaurus/RunningNxGpuAppsInOneJob.md deleted file mode 100644 index 2152522aeaa03d97841a8a9ca1d5ed844a1a0449..0000000000000000000000000000000000000000 --- a/twiki2md/root/SystemTaurus/RunningNxGpuAppsInOneJob.md +++ /dev/null @@ -1,85 +0,0 @@ -# Running Multiple GPU Applications Simultaneously in a Batch Job - -Keywords: slurm, job, gpu, multiple, instances, application, program, -background, parallel, serial, concurrently, simultaneously - -## Objective - -Our starting point is a (serial) program that needs a single GPU and -four CPU cores to perform its task (e.g. TensorFlow). The following -batch script shows how to run such a job on the Taurus partition called -"ml". - - #!/bin/bash - #SBATCH --ntasks=1 - #SBATCH --cpus-per-task=4 - #SBATCH --gres=gpu:1 - #SBATCH --gpus-per-task=1 - #SBATCH --time=01:00:00 - #SBATCH --mem-per-cpu=1443 - #SBATCH --partition=ml - - srun some-gpu-application - -When srun is used within a submission script, it inherits parameters -from sbatch, including --ntasks=1, --cpus-per-task=4, etc. So we -actually implicitly run the following - - srun --ntasks=1 --cpus-per-task=4 ... --partition=ml some-gpu-application - -Now, our goal is to run four instances of this program concurrently in a -**single** batch script. Of course we could also start the above script -multiple times with sbatch, but this is not what we want to do here. - -## Solution - -In order to run multiple programs concurrently in a single batch -script/allocation we have to do three things: - -1\. Allocate enough resources to accommodate multiple instances of our -program. This can be achieved with an appropriate batch script header -(see below). - -2\. Start job steps with srun as background processes. This is achieved -by adding an ampersand at the end of the srun command - -3\. Make sure that each background process gets its private resources. -We need to set the resource fraction needed for a single run in the -corresponding srun command. The total aggregated resources of all job -steps must fit in the allocation specified in the batch script header. 
-Additionally, the option --exclusive is needed to make sure that each -job step is provided with its private set of CPU and GPU resources. - -The following example shows how four independent instances of the same -program can be run concurrently from a single batch script. Each -instance (task) is equipped with 4 CPUs (cores) and one GPU. - - #!/bin/bash - #SBATCH --ntasks=4 - #SBATCH --cpus-per-task=4 - #SBATCH --gres=gpu:4 - #SBATCH --gpus-per-task=1 - #SBATCH --time=01:00:00 - #SBATCH --mem-per-cpu=1443 - #SBATCH --partition=ml - - srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & - srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & - srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & - srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & - - echo "Waiting for all job steps to complete..." - wait - echo "All jobs completed!" - -In practice it is possible to leave out resource options in srun that do -not differ from the ones inherited from the surrounding sbatch context. -The following line would be sufficient to do the job in this example: - - srun --exclusive --gres=gpu:1 --ntasks=1 some-gpu-application & - -Yet, it adds some extra safety to leave them in, enabling the SLURM -scheduler to complain if not enough resources in total were specified in -the header of the batch script. - --- Main.HolgerBrunst - 2021-04-16 diff --git a/twiki2md/root/WebHome/Accessibility.md b/twiki2md/root/WebHome/Accessibility.md deleted file mode 100644 index 022418cf2d58baf9e223c90117e744752bcc762e..0000000000000000000000000000000000000000 --- a/twiki2md/root/WebHome/Accessibility.md +++ /dev/null @@ -1,54 +0,0 @@ -# Erklrung zur Barrierefreiheit - -Diese Erklrung zur Barrierefreiheit gilt fr die unter -<https://doc.zih.tu-dresden.de> verffentlichte Website der Technischen -Universitt Dresden. - -Als ffentliche Stelle im Sinne des Barrierefreie-Websites-Gesetz -(BfWebG) ist die Technische Universitt Dresden bemht, ihre Websites und -mobilen Anwendungen im Einklang mit den Bestimmungen des -Barrierefreie-Websites-Gesetz (BfWebG) in Verbindung mit der -Barrierefreie-Informationstechnik-Verordnung (BITV 2.0) barrierefrei -zugnglich zu machen. - -## Erstellung dieser Erklrung zur Barrierefreiheit - -Diese Erklrung wurde am 17.09.2020 erstellt und zuletzt am 17.09.2020 -aktualisiert. Grundlage der Erstellung dieser Erklrung zur -Barrierefreiheit ist eine am 17.09.2020 von der TU Dresden durchgefhrte -Selbstbewertung. - -## Stand der Barrierefreiheit - -Es wurde bisher noch kein BITV-Test fr die Website durchgefhrt. Dieser -ist bis 30.11.2020 geplant. - -## Kontakt - -Sollten Ihnen Mngel in Bezug auf die barrierefreie Gestaltung auffallen, -knnen Sie uns diese ber das Formular [Barriere -melden](https://tu-dresden.de/barrierefreiheit/barriere-melden) -mitteilen und im zugnglichen Format anfordern. Alternativ knnen Sie sich -direkt an die Meldestelle fr Barrieren wenden (Koordinatorin: Mandy -Weickert, E-Mail: <barrieren@tu-dresden.de>, Telefon: [+49 351 -463-42022](tel:+49-351-463-42022), Fax: [+49 351 -463-42021](tel:+49-351-463-42021), Besucheradresse: Nthnitzer Strae 46, -APB 1102, 01187 Dresden). 
- -## Durchsetzungsverfahren - -Wenn wir Ihre Rckmeldungen aus Ihrer Sicht nicht befriedigend -bearbeiten, knnen Sie sich an die Schsische Durchsetzungsstelle wenden: - -**Beauftragter der Schsischen Staatsregierung fr die Belange von -Menschen mit Behinderungen**\<br /> Albertstrae 10\<br /> 01097 Dresden - -Postanschrift: Archivstrae 1, 01097 Dresden\<br /> E-Mail: -<info.behindertenbeauftragter@sk.sachsen.de>\<br /> Telefon: [+49 351 -564-12161](tel:+49-351-564-12161)\<br /> Fax: [+49 351 -564-12169](tel:+49-351-564-12169)\<br /> Webseite: -<https://www.inklusion.sachsen.de> - -\<div id="footer"> \</div> - --- Main.MatthiasKraeusslein - 2020-09-18 diff --git a/twiki2md/root/WebHome/FurtherDocumentation.md b/twiki2md/root/WebHome/FurtherDocumentation.md deleted file mode 100644 index 2e6586a43633c20327fae0132b90d3fe73c1bb7c..0000000000000000000000000000000000000000 --- a/twiki2md/root/WebHome/FurtherDocumentation.md +++ /dev/null @@ -1,81 +0,0 @@ -# Further Documentation - - - -## Libraries and Compiler - -- <http://www.intel.com/software/products/mkl/index.htm> -- <http://www.intel.com/software/products/ipp/index.htm> -- <http://www.ball-project.org/> -- <http://www.intel.com/software/products/compilers/> - Intel Compiler - Suite -- <http://www.pgroup.com/doc> - PGI Compiler -- <http://pathscale.com/ekopath.html> - PathScale Compilers - -## Tools - -- <http://www.allinea.com/downloads/userguide.pdf> - Allinea DDT - Manual -- <http://www.totalviewtech.com/support/documentation.html> - - Totalview Documentation -- <http://www.gnu.org/software/gdb/documentation/> - GNU Debugger -- <http://vampir-ng.de> - official homepage of Vampir, an outstanding - tool for performance analysis developed at ZIH. -- <http://www.fz-juelich.de/zam/kojak/> - homepage of KOJAK at the FZ - Jlich. Parts of this project are used by Vampirtrace. -- <http://www.intel.com/software/products/threading/index.htm> - -## OpenMP - -You will find a lot of information at the following web pages: - -- <http://www.openmp.org> -- <http://www.compunity.org> - -## MPI - -The following sites may be interesting: - -- <http://www.mcs.anl.gov/mpi/> - the MPI homepage. -- <http://www.mpi-forum.org/> - Message Passing Interface (MPI) Forum - Home Page -- <http://www.open-mpi.org/> - the dawn of a new standard for a more - fail-tolerant MPI. -- The manual for SGI-MPI (installed on Mars ) can be found at: - -<http://techpubs.sgi.com/library/manuals/3000/007-3773-003/pdf/007-3773-003.pdf> - -## SGI developer forum - -The web sites behind -<http://www.sgi.com/developers/resources/tech_pubs.html> are full of -most detailed information on SGI systems. Have a look onto the section -'Linux Publications'. You will be redirected to the public part of SGI's -technical publication repository. - -- Linux Application Tuning Guide -- Linux Programmer's Guide, The -- Linux Device Driver Programmer's Guide -- Linux Kernel Internals.... and more. 
## Intel Itanium

There is a lot of additional material regarding the Itanium CPU:

- <http://www.intel.com/design/itanium/manuals/iiasdmanual.htm>
- <http://www.intel.com/design/archives/processors/itanium/index.htm>
- <http://www.intel.com/design/itanium2/documentation.htm>

You will find the following manuals:

- Intel Itanium Processor Floating-Point Software Assistance Handler (FPSWA)
- Intel Itanium Architecture Software Developer's Manual, Volume 1: Application Architecture
- Intel Itanium Architecture Software Developer's Manual, Volume 2: System Architecture
- Intel Itanium Architecture Software Developer's Manual, Volume 3: Instruction Set
- Intel Itanium 2 Processor Reference Manual for Software Development and Optimization
- Itanium Architecture Assembly Language Reference Guide

diff --git a/twiki2md/root/WebHome/TypicalProjectSchedule.md b/twiki2md/root/WebHome/TypicalProjectSchedule.md
deleted file mode 100644
index c7b404ea81e20010d505392e1a67caec71f8b678..0000000000000000000000000000000000000000
--- a/twiki2md/root/WebHome/TypicalProjectSchedule.md
+++ /dev/null
@@ -1,546 +0,0 @@

# Typical project schedule

## 0. Application for HPC login

In order to use the HPC systems installed at ZIH, a project application form has to be filled in. The HPC project manager should hold a professorship (university) or head a research group. You may also apply for the "Schnupperaccount" (trial account) for one year. Check the [Access](Access) page for details.

## 1. Request for resources

Important note: Taurus runs Linux. To work effectively, you should know how to work with [Linux](https://en.wikipedia.org/wiki/Linux)-based systems and the [Linux shell](https://ubuntu.com/tutorials/command-line-for-beginners#1-overview). Beginners can find many tutorials on the internet, [for example](https://swcarpentry.github.io/shell-novice/).

### 1.1 How do I determine the required CPU / GPU hours?

Taurus is focused on data-intensive computing and is designed for highly parallel code. Keep this in mind when transferring sequential code from a local machine. Once you know the execution time of your sequential program, [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) gives a rough estimate of the achievable parallel runtime: with a parallelizable fraction p of the program running on N processors, the speedup is limited to 1 / ((1 - p) + p/N). Think in advance about the parallelization strategy for your project.

### 1.2 What software do I need? What is already available (in the correct version)?

Good practice on HPC clusters is to use software and packages that can be parallelized. Open-source software is preferable to proprietary software. The majority of popular programming languages, scientific applications and packages are available on Taurus or can be installed in different ways. First of all, check the [Software module list](SoftwareModulesList). There are two different software environments: **scs5** (the regular one) and **ml** (the environment for the Machine Learning partition). Keep in mind that Taurus has a Linux-based operating system.

## 2. Access to the cluster

### SSH access

**Important note:** SSH access to Taurus is only possible from **inside** the TU Dresden campus.
Users from outside must use the [VPN](https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/zugang_datennetz/vpn).

The recommended way to connect to the HPC login servers is directly via ssh:

    ssh <zih-login>@taurus.hrsk.tu-dresden.de

Enter this command in a terminal and replace `<zih-login>` with the login you received during the access procedure. Accept the host verification and enter your password. You will then be placed on a login node, in your Taurus home directory.

This method requires two conditions: a Linux OS and a workstation within the campus network. For other options and details check the [Login](Login) page.

Useful links: [Access](Access), [Project Request Form](ProjectRequestForm), [Terms Of Use](TermsOfUse)

## 3. Available software, use of the software

According to 1.2, first of all check the [Software module list](SoftwareModulesList). Keep in mind that there are two different environments: **scs5** (for the x86 architecture) and **ml** (the environment for the Machine Learning partition based on the Power9 architecture).

Work with software on Taurus can only be started after allocating resources via the [batch system](BatchSystems). By default you are on the login nodes, which are meant only for logging in, not for computations. Resources are allocated by the batch system [SLURM](Slurm).

There are several ways to work with software on Taurus:

**a. Modules**

The easiest way to start working with software is the [Modules system](RuntimeEnvironment#Module_Environments). Modules are a way to use frameworks, compilers, loaders, libraries and utilities. A module is a user interface that provides utilities for the dynamic modification of a user's environment without manual changes. You can use modules with **srun**, batch jobs (**sbatch**) and JupyterHub.
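As a minimal sketch of this workflow (the module name and version are placeholders; check `module avail` for what is actually installed):

    # allocate a small interactive session on a compute node (1 core, 1 hour)
    srun --ntasks=1 --cpus-per-task=1 --time=01:00:00 --pty bash

    # list the available software and load what you need
    module avail
    module load Python/3.8.6
    module list

    # run your program inside the allocation
    python my_script.py
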
**b. Jupyter Notebook**

The Jupyter Notebook is an open-source web application that allows creating documents containing live code, equations, visualizations and narrative text. There is a [JupyterHub](JupyterHub) on Taurus, where you can simply run your Jupyter notebook on HPC nodes using modules and preloaded or custom virtual environments. Moreover, you can run a [manually created remote Jupyter server](DeepLearning#Jupyter_notebook) for more specific cases.

**c. Containers**

Some tasks require containers. On Taurus this can be done with [Singularity](https://sylabs.io/). Details can be found in the [chapter on containers](TypicalProjectSchedule#Use_of_containers).

Useful links: [Libraries](Libraries), [Deep Learning](DeepLearning), [Jupyter Hub](JupyterHub), [Big Data Frameworks](BigDataFrameworks:ApacheSparkApacheFlinkApacheHadoop), [R](DataAnalyticsWithR), [Applications for various fields of science](Applications)

## 4. Create a project structure. Data management

Correct organisation of the project structure is a straightforward way to efficient work of the whole team. There have to be rules and regulations for working on the project that every member follows. Uniformity across the project can be achieved if every member of the team uses the same **data storage** (or set of storages), the same **set of software** (packages, libraries, etc.), and if **access rights** to project data are considered and set up correctly.

### 4.1 Data storage and management

#### 4.1.1 Taxonomy of file systems

As soon as you have access to Taurus you have to manage your data. The main [concept](HPCStorageConcept2019) of working with data on Taurus is the use of [Workspaces](WorkSpaces). Use them properly:

- Use the `/home` directory for a limited amount of personal data, simple examples and the results of calculations. The home directory is not a working directory! However, the `/home` file system is backed up using snapshots.
- Use a **workspace** as the place for working data (i.e. datasets); recommendations for choosing the correct storage system for a workspace are given below.

**Recommendations for choosing a storage system:** For data that seldom changes but consumes a lot of space, the **warm_archive** can be used (note that it is mounted **read-only** on the compute nodes). For a series of calculations that work on the same data, please use a **scratch**-based workspace. **SSD**, in its turn, is the fastest available file system, made only for large parallel applications running with millions of small I/O (input/output) operations. If a batch job needs a directory for temporary data, **SSD** is a good choice as well; the data can be deleted afterwards.

Note: Keep in mind that every workspace has a storage duration (e.g. ssd: 30 days). Be careful with the expiration date, otherwise the data could vanish. The core data of your project should be [backed up](FileSystems#Backup_and_snapshots_of_the_file_system) and [archived](PreservationResearchData) (for the most [important](https://www.dcc.ac.uk/guidance/how-guides/five-steps-decide-what-data-keep) data).
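A typical workspace lifecycle, sketched with the workspace tools referenced on the [Workspaces](WorkSpaces) page (the name, file system and duration below are examples; the exact options may differ, check `ws_allocate -h`):

    # allocate a workspace named "ml_project" on the scratch file system for 30 days
    ws_allocate -F scratch ml_project 30

    # list your existing workspaces and their expiration dates
    ws_list

    # release the workspace once the data is no longer needed
    ws_release -F scratch ml_project
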
#### 4.1.2 Backup

The backup is a crucial part of any project. Organize it at the beginning of the project. If you lose or delete your data in the non-backed-up file systems, it cannot be restored! The backup on Taurus is **only** available in the `/home` and `/projects` file systems! Backed-up files can be restored by the user. Details can be found [here](FileSystems#Backup_and_snapshots_of_the_file_system).

#### 4.1.3 Folder structure and organizing data

Organizing living data using the file system helps with the consistency and structuredness of the project. We recommend following rules for your work regarding:

- Organizing the data: Never change the original data; automate the organization of the data; clearly separate intermediate and final output in the filenames; carry identifier and original name along in your analysis pipeline; make outputs clearly identifiable; document your analysis steps.
- Naming data: Keep names short but meaningful; keep standard file endings; file names do not replace documentation and metadata; use the standards of your discipline; make rules for your project, document them and keep to them (see the [README recommendations](TypicalProjectSchedule#README_recommendation) below).

This is an example of a hierarchical organisation of the folder structure. Use it as a visual illustration of the above:

![Organizing data using file systems](%ATTACHURL%/Organizing_Data-using_file_systems.png)

Keep the [input-process-output pattern](https://en.wikipedia.org/wiki/IPO_model#Programming) in mind when working with the folder structure.

#### 4.1.4 README recommendation

In general, a [README](https://en.wikipedia.org/wiki/README) is just general information about a software/project that lives in the same directory/repository as the project. A README is used to explain the details of the project and the **structure** of the project/folder in a short way. We recommend using a README for the entire project as well as for every important folder in the project.

Example of a structure for the README. Think first: What is calculated and why? (description); What is expected? (software and version). Example text file:

    Title:
    User:
    Date:
    Description:
    Software:
    Version:

#### 4.1.5 Metadata

Another important aspect is [metadata](http://dublincore.org/resources/metadata-basics/). It is worthwhile to add [metadata](PreservationResearchData#Why_should_I_add_Meta_45Data_to_my_data_63) to the data of your project on Taurus. [Metadata standards](https://en.wikipedia.org/wiki/Metadata_standard) make this easier (e.g. [Dublin Core](https://dublincore.org/), [OME](https://www.openmicroscopy.org/)).

#### 4.1.6 Data hygiene

Don't forget about data hygiene: Classify your current data into critical (need it now), necessary (need it later) or unnecessary (redundant, trivial or obsolete); track and classify data throughout its lifecycle (from creation, storage and use to sharing, archiving and destruction); erase the data you don't need throughout its lifecycle.

### 4.2 Software packages

As written before, the module concept is the basic concept for using software on Taurus. Uniformity of the project has to be achieved by using the same set of software on every level of the project. This can be done by using environments. Two types of environments should be distinguished: the runtime environment (at the project level, using scripts to load [modules](RuntimeEnvironment)) and the Python virtual environment. The environment concept makes it possible to use the same version of the software on every level of the project and for every project member.
#### Private individual and project module files

[Private individual and project module files](RuntimeEnvironment#Private_Project_Module_Files) will be discussed in [chapter 7](TypicalProjectSchedule#A_7._Use_of_specific_software_40packages_44_libraries_44_etc_41). A project module list is a powerful instrument for effective teamwork.

#### Python virtual environment

If you are working with Python, it is crucial to use a virtual environment on Taurus. The main purpose of Python virtual environments (not to be confused with the software environments for modules) is to create an isolated environment for Python projects: a self-contained directory tree that contains a Python installation for a particular version of Python plus a number of additional packages.

**Virtualenv (venv)** is the standard Python tool to create isolated Python environments. We recommend using venv to work with TensorFlow and PyTorch on Taurus. It has been integrated into the standard library under the [venv module](https://docs.python.org/3/library/venv.html). **Conda** is the second way to use a virtual environment on Taurus. [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) is an open-source package management and environment management system from Anaconda.

[Detailed information](Python#Virtual_environment) about using virtual environments is available.
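A minimal sketch of creating and using a venv-based environment (the module name, version and workspace path are only placeholders):

    # load a Python module first (version is an example)
    module load Python/3.8.6

    # create the environment inside a workspace, not in /home
    python -m venv /scratch/ws/<user>-ml_project/venv

    # activate it and install the packages your project needs
    source /scratch/ws/<user>-ml_project/venv/bin/activate
    pip install --upgrade pip
    pip install numpy

Every member of the team can activate the same environment in this way, which keeps package versions consistent across the project.
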
### 4.3 Application software availability

Software created for the purpose of the project should be available to all members of the group. Instructions on how to use the software (installation of packages, compilation, etc.) should be documented, so that everybody can work comfortably, efficiently and safely.

### 4.4 Access rights

The concept of **permissions** and **ownership** is crucial in Linux. See the [HPC introduction](%PUBURL%/Compendium/WebHome/HPC-Introduction.pdf?t=1602081321) slides to understand the main concept. The standard Linux commands for changing permissions (e.g. `chmod`) are valid on Taurus as well. The **group** access level contains the members of your project group. Be careful with the 'write' permission and never allow the original data to be changed.

Useful links: [Data Management](DataManagement), [File Systems](FileSystems), [Get Started with HPC-DA](GetStartedWithHPCDA), [Project Management](ProjectManagement), [Preservation research data](PreservationResearchData)

## 5. Data moving

### 5.1 Moving data to/from the HPC machines

To copy data to/from the HPC machines, the Taurus [export nodes](ExportNodes) should be used as the preferred way. There are three possibilities for exchanging data between your local machine (lm) and the HPC machines (hm): **SCP, RSYNC, SFTP**. Type the following commands in the terminal of the local machine; the `scp` command is used in the examples below.

#### Copy data from lm to hm

    scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>          # copy a file from your local machine, e.g.: scp helloworld.txt mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mustermann-Machine_learning_project/

    scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location>  # copy a directory from your local machine

#### Copy data from hm to lm

    scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location>          # copy a file, e.g.: scp mustermann@taurusexport.hrsk.tu-dresden.de:/scratch/ws/mustermann-Machine_learning_project/helloworld.txt /home/mustermann/Downloads

    scp -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location>  # copy a directory

### 5.2 Moving data inside the HPC machines. Datamover

The best way to transfer data inside Taurus is the [Datamover](DataMover). It is a special data transfer machine that provides the best transfer speed. To load, move, copy, etc. files from one file system to another, use the commands with the **dt** prefix: **dtcp, dtwget, dtmv, dtrm, dtrsync, dttar, dtls**. These commands submit a job to the data transfer machines, which execute the selected command. Except for the 'dt' prefix, their syntax is the same as that of the corresponding shell command.

Keep in mind: The warm_archive is not writable by jobs. However, you can store data in the warm archive with the Datamover.

Useful links: [Data Mover](DataMover), [Export Nodes](ExportNodes)

## 6. Use of hardware

To run software, do calculations or compile your code, the compute nodes have to be used. The login nodes, which are meant for logging in, cannot be used for your computations. Submit your tasks (as [jobs](https://en.wikipedia.org/wiki/Job_(computing))) to the compute nodes. [SLURM](Slurm), the scheduler that handles your jobs, is used on Taurus for this purpose. The [HPC Introduction](%PUBURL%/Compendium/WebHome/HPC-Introduction.pdf) is a good resource to get started with it.

### 6.1 What do I need: a CPU or a GPU?

The main difference between CPU and GPU architectures is that a CPU is designed to handle a wide range of tasks quickly, but is limited in the number of tasks it can run concurrently. GPUs can process data much faster than a CPU due to massive parallelism (although the amount of data a single GPU core can handle is small), but they are not as versatile as CPUs.

### 6.2 Selection of suitable hardware

Available [hardware](HardwareTaurus): normal compute nodes (Haswell with [64, 128 or 256 GB RAM](SystemTaurus#Run_45time_and_Memory_Limits), Broadwell, [Rome](RomeNodes)), large [SMP nodes](SDFlex), and accelerator (GPU) nodes (gpu2 partition, [ml partition](Power9)).

The exact partition can be specified with the `-p` flag of the srun command or in your batch job.

The majority of basic tasks can be done on the conventional nodes like Haswell. SLURM will automatically select a suitable partition depending on your memory and --gres (GPU) requirements. If you do not specify a partition, you will most likely be directed to the Haswell partition (1328 nodes in total).
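For illustration, a minimal job file that selects a partition explicitly could look like the following sketch (the partition name and resource values are examples and have to be adjusted to your application):

    #!/bin/bash
    #SBATCH --partition=ml        # explicitly select a partition, here the Power9 GPU partition
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --gres=gpu:1          # request one GPU
    #SBATCH --time=01:00:00
    #SBATCH --mem-per-cpu=1443

    srun some-gpu-application
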
#### Parallel jobs

**MPI jobs**: For MPI jobs, SLURM typically allocates one core per task. Several nodes can be allocated if necessary. SLURM will automatically find suitable hardware. Normal compute nodes are perfect for this.

**OpenMP jobs**: An SMP-parallel job can only run **within a node**, so it is necessary to include the options **-N 1** and **-n 1**. With `--cpus-per-task N`, SLURM will start one task and you will have N CPUs available. The maximum number of processors for an SMP-parallel program on Taurus is 896 (the [SMP](SDFlex) island).

**GPU** partitions are best suited for **repetitive** and **highly parallel** computing tasks. If you have a task with potential [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism#:~:text=Data%20parallelism%20is%20parallelization%20across,on%20each%20element%20in%20parallel.), you most likely need the GPUs. Beyond video rendering, GPUs excel in tasks such as machine learning, financial simulations and risk modelling. Use the gpu2 and ml partitions only if you need GPUs! Otherwise the x86 partitions (e.g. Haswell) are most likely more beneficial.

**Interactive jobs**: SLURM can forward your X11 credentials to the first (or even all) nodes of a job with the --x11 option. To use X11 in an interactive job you have to specify the -X flag for the ssh login.

### 6.3 Interactive vs. sbatch

Using srun directly on the shell blocks the shell and launches an interactive job. Apart from short test runs, it is **recommended to launch your jobs in the background using batch jobs**. For that, you can conveniently put the parameters directly into the job file, which you can submit using `sbatch [options] <job file>`.

### 6.4 Processing of data for input and output

Pre-processing and post-processing of data is a crucial part of the majority of data-dependent projects, and the quality of this work influences the computations. However, pre- and post-processing can in many cases be done completely or partially on a local PC and then [transferred](TypicalProjectSchedule#A_5._Data_moving) to Taurus. Please use Taurus for the computation-intensive tasks.

Useful links: [Batch Systems](BatchSystems), [Hardware Taurus](HardwareTaurus), [HPC-DA](HPCDA), [Slurm](Slurm)

## 7. Use of specific software (packages, libraries, etc)

### 7.1 Modular system

The module concept is the easiest way to work with software on Taurus. It allows the user to switch between different versions of installed programs and provides utilities for the dynamic modification of a user's environment. More information can be found [here](RuntimeEnvironment#Modules).

#### Private project and user module files

[Private project module files](RuntimeEnvironment#Private_Project_Module_Files) allow you to load software installed group-wide into your environment and to handle different versions. They allow you to create your own software environment for the project. You can create a list of modules that will be loaded by every member of the team. This helps to unify the work of the team and supports the reproducibility of results. Private modules can be loaded like other modules with `module load`.

[Private user module files](RuntimeEnvironment#Private_User_Module_Files) allow you to load your own installed software into your environment. They work in the same manner as project modules, but for your private use.

### 7.2 Use of containers

[Containerization](https://www.ibm.com/cloud/learn/containerization) means encapsulating or packaging software code and all its dependencies so that it runs uniformly and consistently on any infrastructure. On Taurus, [Singularity](https://sylabs.io/) is used as the standard container solution. Singularity enables users to have full control of their environment. This means that you don't have to ask HPC support to install anything for you; you can put it in a Singularity container and run it! As opposed to Docker (the most famous container solution), Singularity is much better suited to an HPC environment and is more efficient in many cases. Docker containers can easily be used in Singularity. Information about the use of Singularity on Taurus can be found [here](Container).
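A small sketch of this workflow (the image name is only an example):

    # pull a Docker image and convert it into a Singularity image file
    singularity pull docker://ubuntu:20.04

    # run a single command inside the container
    singularity exec ubuntu_20.04.sif cat /etc/os-release

    # or work interactively inside the container
    singularity shell ubuntu_20.04.sif
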
In some cases using Singularity requires a Linux machine with root privileges (e.g. when using the ml partition), the same architecture and a compatible kernel. For many reasons, users on Taurus cannot be granted root permissions. A solution is a Virtual Machine (VM) on the ml partition, which allows users to gain root permissions in an isolated environment. There are two main options for working with VMs on Taurus:

1. [VM tools](VMTools): automated workflows for using virtual machines.
2. [Manual method](Cloud): requires more steps but gives you more flexibility and reliability.

Additional information: examples of definition files for Singularity containers can be found [here](SingularityExampleDefinitions), and some hints [here](SingularityRecipeHints).

Useful links: [Containers](Container), [Custom EasyBuild Environment](CustomEasyBuildEnvironment), [Cloud](Cloud)

## 8. Structuring experiments

- Input data
- Calculation results
- Log files
- Submission scripts (examples / code for survival)

## What if everything didn't help?

### Create a ticket: how do I do that?

The best way to ask for help is to create a ticket. To do that, write a message to <hpcsupport@zih.tu-dresden.de> with a detailed description of your problem. If possible, please add logs and the environment used, and include a minimal executable example so that the error or issue can be reproduced.

### Communication with HPC support

The HPC support team is responsible for supporting HPC users and for the stable operation of the cluster. You can find the [contact details](https://tu-dresden.de/zih/hochleistungsrechnen/support) in the right part of any page of the compendium. However, before contacting the HPC support team, please check the documentation carefully (starting points: [main page](WebHome), [HPC-DA](HPCDA)) and use the [search](WebSearch), then create a ticket. The ticket is the preferred way to resolve an issue, but in urgent cases you can call to ask for help.

Useful link: [Further Documentation](FurtherDocumentation)

-- Main.AndreiPolitov - 2020-09-14