diff --git a/Dockerfile b/Dockerfile index f6bf9841524472c7af2522ce9cd641e9c5dbd824..57490c2509a22302ba13ed4bd05d32f0d7b0fb51 100644 --- a/Dockerfile +++ b/Dockerfile @@ -10,7 +10,7 @@ RUN pip install mkdocs>=1.1.2 mkdocs-material>=7.1.0 # Linter # ########## -RUN apt update && apt install -y nodejs npm aspell +RUN apt update && apt install -y nodejs npm aspell git RUN npm install -g markdownlint-cli markdown-link-check diff --git a/doc.zih.tu-dresden.de/README.md b/doc.zih.tu-dresden.de/README.md index dc65e18d33561b7e1bc5e50d73f8bfd4085f3f27..f1d0e97563caae06b8859b8c0632e7dacc2fb641 100644 --- a/doc.zih.tu-dresden.de/README.md +++ b/doc.zih.tu-dresden.de/README.md @@ -11,7 +11,7 @@ long describing complex steps, contributing is quite easy - trust us. Users can contribute to the documentation via the [issue tracking system](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/issues). For that, open an issue to report typos and missing documentation or request for more precise -wording etc. ZIH staff will get in touch with you to resolve the issue and improve the +wording etc. ZIH staff will get in touch with you to resolve the issue and improve the documentation. **Reminder:** Non-documentation issues and requests need to be send as ticket to @@ -107,7 +107,7 @@ Open `http://127.0.0.1:8000` with a web browser to preview the local copy of the You can also use `docker` to build a container from the `Dockerfile`, if you are familiar with it. This may take a while, as mkdocs and other necessary software needs to be downloaded. -Building a container with the documentation inside could be done with the following steps: +Building a container could be done with the following steps: ```Bash cd /PATH/TO/hpc-compendium @@ -137,7 +137,7 @@ echo http://$(docker inspect -f "{{.NetworkSettings.IPAddress}}" $(docker ps -qf ``` The running container automatically takes care of file changes and rebuilds the -documentation. If you want to check whether the markdown files are formatted +documentation. If you want to check whether the markdown files are formatted properly, use the following command: ```Bash @@ -247,7 +247,8 @@ There are two important branches in this repository: - Preview: - Branch containing recent changes which will be soon merged to main branch (protected branch) - - Served at [todo url](todo url) from TUD VPN + - Served at [https://doc.zih.tu-dresden.de/preview](https://doc.zih.tu-dresden.de/preview) from + TUD-ZIH VPN - Main: Branch which is deployed at [https://doc.zih.tu-dresden.de](https://doc.zih.tu-dresden.de) holding the current documentation (protected branch) @@ -387,280 +388,3 @@ BigDataFrameworksApacheSparkApacheFlinkApacheHadoop.md is not included in nav pika.md is not included in nav specific_software.md is not included in nav ``` - -### Pre-commit Git Hook - -You can automatically run checks whenever you try to commit a change. In this case, failing checks -prevent commits (unless you use option `--no-verify`). This can be accomplished by adding a -pre-commit hook to your local clone of the repository. The following code snippet shows how to do -that: - -```bash -cp doc.zih.tu-dresden.de/util/pre-commit .git/hooks/ -``` - -!!! note - The pre-commit hook only works, if you can use docker without using `sudo`. If this is not - already the case, use the command `adduser $USER docker` to enable docker commands without - `sudo` for the current user. Restart the docker daemons afterwards. 
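For orientation, the full sequence on a Debian-like workstation might look like the following sketch; the group name `docker` and the systemd service name are assumptions that can differ between distributions:

```bash
sudo adduser "$USER" docker      # add the current user to the docker group
sudo systemctl restart docker    # restart the docker daemon
newgrp docker                    # or log out and back in to refresh group membership
docker run --rm hello-world      # verify that docker now works without sudo
```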
- -## Content Rules - -**Remark:** Avoid using tabs both in markdown files and in `mkdocs.yaml`. Type spaces instead. - -### New Page and Pages Structure - -The pages structure is defined in the configuration file [mkdocs.yaml](mkdocs.yml). - -```Shell Session -docs/ - - Home: index.md - - Application for HPC Login: application.md - - Request for Resources: req_resources.md - - Access to the Cluster: access.md - - Available Software and Usage: - - Overview: software/overview.md - ... -``` - -To add a new page to the documentation follow these two steps: - -1. Create a new markdown file under `docs/subdir/file_name.md` and put the documentation inside. The - sub directory and file name should follow the pattern `fancy_title_and_more.md`. -1. Add `subdir/file_name.md` to the configuration file `mkdocs.yml` by updating the navigation - section. - -Make sure that the new page **is not floating**, i.e., it can be reached directly from the documentation -structure. - -### Markdown - -1. Please keep things simple, i.e., avoid using fancy markdown dialects. - * [Cheat Sheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) - * [Style Guide](https://github.com/google/styleguide/blob/gh-pages/docguide/style.md) - -1. Do not add large binary files or high resolution images to the repository. See this valuable - document for [image optimization](https://web.dev/fast/#optimize-your-images). - -1. [Admonitions](https://squidfunk.github.io/mkdocs-material/reference/admonitions/) may be -actively used, especially for longer code examples, warnings, tips, important information that -should be highlighted, etc. Code examples, longer than half screen height should collapsed -(and indented): - -??? example - ```Bash - [...] - # very long example here - [...] - ``` - -### Writing Style - -**TODO** Guide [Issue #14](#14) - -* Capitalize headings, e.g. *Exclusive Reservation of Hardware* -* Give keywords in link texts, e.g. [Code Blocks](#code-blocks-and-syntax-highlighting) is more - descriptive than [this subsection](#code-blocks-and-syntax-highlighting) - -### Spelling and Technical Wording - -To provide a consistent and high quality documentation, and help users to find the right pages, -there is a list of conventions w.r.t. spelling and technical wording. - -* Language settings: en_us -* `I/O` not `IO` -* `Slurm` not `SLURM` -* `Filesystem` not `file system` -* `ZIH system` and `ZIH systems` not `Taurus`, `HRSKII`, `our HPC systems`, etc. -* `Workspace` not `work space` -* avoid term `HPC-DA` -* Partition names after the keyword *partition*: *partition `ml`* not *ML partition*, *ml - partition*, *`ml` partition*, *"ml" partition*, etc. - -### Code Blocks and Command Prompts - -Showing commands and sample output is an important part of all technical documentation. To make -things as clear for readers as possible and provide a consistent documentation, some rules have to -be followed. - -1. Use ticks to mark code blocks and commands, not italic font. -1. Specify language for code blocks ([see below](#code-blocks-and-syntax-highlighting)). -1. All code blocks and commands should be runnable from a login node or a node within a specific - partition (e.g., `ml`). -1. It should be clear from the prompt, where the command is run (e.g. local machine, login node or - specific partition). 
- -#### Prompts - -We follow this rules regarding prompts: - -| Host/Partition | Prompt | -|------------------------|------------------| -| Login nodes | `marie@login$` | -| Arbitrary compute node | `marie@compute$` | -| `haswell` partition | `marie@haswell$` | -| `ml` partition | `marie@ml$` | -| `alpha` partition | `marie@alpha$` | -| `alpha` partition | `marie@alpha$` | -| `romeo` partition | `marie@romeo$` | -| `julia` partition | `marie@julia$` | -| Localhost | `marie@local$` | - -*Remarks:* - -* **Always use a prompt**, even there is no output provided for the shown command. -* All code blocks should use long parameter names (e.g. Slurm parameters), if available. -* All code blocks which specify some general command templates, e.g. containing `<` and `>` - (see [Placeholders](#mark-placeholders)), should use `bash` for the code block. Additionally, - an example invocation, perhaps with output, should be given with the normal `console` code block. - See also [Code Block description below](#code-blocks-and-syntax-highlighting). -* Using some magic, the prompt as well as the output is identified and will not be copied! -* Stick to the [generic user name](#data-privacy-and-generic-user-name) `marie`. - -#### Code Blocks and Syntax Highlighting - -This project makes use of the extension -[pymdownx.highlight](https://squidfunk.github.io/mkdocs-material/reference/code-blocks/) for syntax -highlighting. There is a complete list of supported -[language short codes](https://pygments.org/docs/lexers/). - -For consistency, use the following short codes within this project: - -With the exception of command templates, use `console` for shell session and console: - -```` markdown -```console -marie@login$ ls -foo -bar -``` -```` - -Make sure that shell session and console code blocks are executable on the login nodes of HPC system. - -Command templates use [Placeholders](#mark-placeholders) to mark replaceable code parts. Command -templates should give a general idea of invocation and thus, do not contain any output. Use a -`bash` code block followed by an invocation example (with `console`): - -```` markdown -```bash -marie@local$ ssh -NL <local port>:<compute node>:<remote port> <zih login>@tauruslogin.hrsk.tu-dresden.de -``` - -```console -marie@local$ ssh -NL 5901:172.24.146.46:5901 marie@tauruslogin.hrsk.tu-dresden.de -``` -```` - -Also use `bash` for shell scripts such as jobfiles: - -```` markdown -```bash -#!/bin/bash -#SBATCH --nodes=1 -#SBATCH --time=01:00:00 -#SBATCH --output=slurm-%j.out - -module load foss - -srun a.out -``` -```` - -!!! important - - Use long parameter names where possible to ease understanding. - -`python` for Python source code: - -```` markdown -```python -from time import gmtime, strftime -print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) -``` -```` - -`pycon` for Python console: - -```` markdown -```pycon ->>> from time import gmtime, strftime ->>> print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) -2021-08-03 07:20:33 -``` -```` - -Line numbers can be added via - -```` markdown -```bash linenums="1" -#!/bin/bash - -#SBATCH -N 1 -#SBATCH -n 23 -#SBATCH -t 02:10:00 - -srun a.out -``` -```` - -_Result_: - - - -Specific Lines can be highlighted by using - -```` markdown -```bash hl_lines="2 3" -#!/bin/bash - -#SBATCH -N 1 -#SBATCH -n 23 -#SBATCH -t 02:10:00 - -srun a.out -``` -```` - -_Result_: - - - -### Data Privacy and Generic User Name - -Where possible, replace login, project name and other private data with clearly arbitrary placeholders. 
-E.g., use the generic login `marie` and the corresponding project name `p_marie`. - -```console -marie@login$ ls -l -drwxr-xr-x 3 marie p_marie 4096 Jan 24 2020 code -drwxr-xr-x 3 marie p_marie 4096 Feb 12 2020 data --rw-rw---- 1 marie p_marie 4096 Jan 24 2020 readme.md -``` - -### Mark Omissions - -If showing only a snippet of a long output, omissions are marked with `[...]`. - -### Mark Placeholders - -Stick to the Unix rules on optional and required arguments, and selection of item sets: - -* `<required argument or value>` -* `[optional argument or value]` -* `{choice1|choice2|choice3}` - -## Graphics and Attachments - -All graphics and attachments are saved within `misc` directory of the respective sub directory in -`docs`. - -The syntax to insert a graphic or attachment into a page is - -```Bash - -{: align="center"} -``` - -The attribute `align` is optional. By default, graphics are left aligned. **Note:** It is crucial to -have `{: align="center"}` on a new line. diff --git a/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md b/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md index 6b40f3bad658df5a171d8b46e5e34f8ae7a1ee95..b9c0d1cd8f894c6944b52daa07fa09c772c73dc0 100644 --- a/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md +++ b/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md @@ -4,8 +4,8 @@ NICE DCV enables remote accessing OpenGL 3D applications running on ZIH systems server's GPUs. If you don't need OpenGL acceleration, you might also want to try our [WebVNC](graphical_applications_with_webvnc.md) solution. -Look [here](https://docs.aws.amazon.com/dcv/latest/userguide/client-web.html) if you want to know -if your browser is supported by DCV. +See [the official DCV documentation](https://docs.aws.amazon.com/dcv/latest/userguide/client-web.html) +if you want to know whether your browser is supported by DCV. ## Access with JupyterHub diff --git a/doc.zih.tu-dresden.de/docs/access/jupyterhub.md b/doc.zih.tu-dresden.de/docs/access/jupyterhub.md index dcdd9363c8d406d7227b97abce91ad67298e9a67..b6b0f25d3963da0529f26274a3daf4bdfcb0bbe0 100644 --- a/doc.zih.tu-dresden.de/docs/access/jupyterhub.md +++ b/doc.zih.tu-dresden.de/docs/access/jupyterhub.md @@ -41,7 +41,7 @@ settings. You can: - modify batch system parameters to your needs ([more about batch system Slurm](../jobs_and_resources/slurm.md)) - assign your session to a project or reservation -- load modules from the [module system](../software/runtime_environment.md) +- load modules from the [module system](../software/modules.md) - choose a different standard environment (in preparation for future software updates or testing additional features) @@ -189,7 +189,7 @@ Here is a short list of some included software: \* generic = all partitions except ml -\*\* R is loaded from the [module system](../software/runtime_environment.md) +\*\* R is loaded from the [module system](../software/modules.md) ### Creating and Using a Custom Environment diff --git a/doc.zih.tu-dresden.de/docs/access/security_restrictions.md b/doc.zih.tu-dresden.de/docs/access/security_restrictions.md index bcdc0f578c8e1c7674d5eb42395870636359729b..b43d631c07fc47bf55da932dbb0d11aca4cf2ecf 100644 --- a/doc.zih.tu-dresden.de/docs/access/security_restrictions.md +++ b/doc.zih.tu-dresden.de/docs/access/security_restrictions.md @@ -15,8 +15,8 @@ The most important items for ZIH systems are: * Ideally, there should be no private key on ZIH system except for local use. 
* Keys to other systems must be passphrase-protected! * **ssh to ZIH systems** is only possible from inside TU Dresden campus - (`login[1,2].zih.tu-dresden.de` will be blacklisted). Users from outside can use VPN (see - [here](https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/zugang_datennetz/vpn)). + (`login[1,2].zih.tu-dresden.de` will be blacklisted). Users from outside can use + [VPN](https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/zugang_datennetz/vpn). * **ssh from ZIH system** is only possible inside TU Dresden campus. (Direct SSH access to other computing centers was the spreading vector of the recent incident.) diff --git a/doc.zih.tu-dresden.de/docs/access/ssh_login.md b/doc.zih.tu-dresden.de/docs/access/ssh_login.md index 59e304f3a337f0b30a804ba21e4d4396452cd4c8..69dc79576910d37b001aaaff4cfc43c8ab583b18 100644 --- a/doc.zih.tu-dresden.de/docs/access/ssh_login.md +++ b/doc.zih.tu-dresden.de/docs/access/ssh_login.md @@ -88,6 +88,7 @@ in it (you can omit lines starting with `#`): ```bash Host taurus + #For login (shell access) HostName taurus.hrsk.tu-dresden.de #Put your ZIH-Login after keyword "User": User marie @@ -98,6 +99,15 @@ Host taurus #Enable X11 forwarding for graphical applications and compression. You don't need parameter -X and -C when invoking ssh then. ForwardX11 yes Compression yes +Host taurusexport + #For copying data without shell access + HostName taurusexport.hrsk.tu-dresden.de + #Put your ZIH-Login after keyword "User": + User marie + #Path to private key: + IdentityFile ~/.ssh/id_ed25519 + #Don't try other keys if you have more: + IdentitiesOnly yes ``` Afterwards, you can connect to the ZIH system using: @@ -106,6 +116,9 @@ Afterwards, you can connect to the ZIH system using: marie@local$ ssh taurus ``` +If you want to copy data from/to ZIH systems, please refer to [Export Nodes: Transfer Data to/from +ZIH's Filesystems](../data_transfer/export_nodes.md) for more information on export nodes. + ### X11-Forwarding If you plan to use an application with graphical user interface (GUI), you need to enable diff --git a/doc.zih.tu-dresden.de/docs/application/overview.md b/doc.zih.tu-dresden.de/docs/application/overview.md index 6ab0da135480e6a9621b492a2d9b4fe956f7e2cb..59e6e6e78833b63dd358ecaeda361135aba7ef30 100644 --- a/doc.zih.tu-dresden.de/docs/application/overview.md +++ b/doc.zih.tu-dresden.de/docs/application/overview.md @@ -5,7 +5,7 @@ The HPC project manager should hold a professorship (university) or head a resea also apply for a "Schnupperaccount" (trial account) for one year to find out if the machine is useful for your application. -An other able use case is to request resources for a courses. +An other able use case is to request resources for a courses. To learn more about applying for a project or a course, check the following page: [https://tu-dresden.de/zih/hochleistungsrechnen/zugang][1] diff --git a/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md b/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md index 8c2235f933fb41f5e590e880fdeb92ce6e950dfc..e221188dcd1c33ef66815d38bffd4a8c5866f48e 100644 --- a/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md +++ b/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md @@ -3,7 +3,7 @@ !!! warning This documentation page is outdated. - The up-to date documentation on BeeGFS can be found [here](../data_lifecycle/beegfs.md). + Please see the [new BeeGFS page](../data_lifecycle/beegfs.md). 
**Prerequisites:** To work with TensorFlow you obviously need a [login](../application/overview.md) to the ZIH systems and basic knowledge about Linux, mounting, and batch system Slurm. diff --git a/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md b/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md index 0d50ecc6c8ec26c30fccaf7882abee6f2070d55b..3d59d1cc7cf9e93e9a7f3ca78d22100978a72b8f 100644 --- a/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md +++ b/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md @@ -1,5 +1,9 @@ # Jupyter Installation +!!! warning + + This page is outdated! + Jupyter notebooks allow to analyze data interactively using your web browser. One advantage of Jupyter is, that code, documentation and visualization can be included in a single notebook, so that it forms a unit. Jupyter notebooks can be used for many tasks, such as data cleaning and @@ -41,17 +45,17 @@ one is to download Anaconda in your home directory. 1. Load Anaconda module (recommended): ```console -marie@compute module load modenv/scs5 -marie@compute module load Anaconda3 +marie@compute$ module load modenv/scs5 +marie@compute$ module load Anaconda3 ``` 1. Download latest Anaconda release (see example below) and change the rights to make it an executable script and run the installation script: ```console -marie@compute wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh -marie@compute chmod u+x Anaconda3-2019.03-Linux-x86_64.sh -marie@compute ./Anaconda3-2019.03-Linux-x86_64.sh +marie@compute$ wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh +marie@compute$ chmod u+x Anaconda3-2019.03-Linux-x86_64.sh +marie@compute$ ./Anaconda3-2019.03-Linux-x86_64.sh ``` (during installation you have to confirm the license agreement) @@ -60,7 +64,7 @@ Next step will install the anaconda environment into the home directory (`/home/userxx/anaconda3`). Create a new anaconda environment with the name `jnb`. ```console -marie@compute conda create --name jnb +marie@compute$ conda create --name jnb ``` ## Set environmental variables @@ -69,15 +73,15 @@ In the shell, activate previously created python environment (you can deactivate it also manually) and install Jupyter packages for this python environment: ```console -marie@compute source activate jnb -marie@compute conda install jupyter +marie@compute$ source activate jnb +marie@compute$ conda install jupyter ``` If you need to adjust the configuration, you should create the template. Generate configuration files for Jupyter notebook server: ```console -marie@compute jupyter notebook --generate-config +marie@compute$ jupyter notebook --generate-config ``` Find a path of the configuration file, usually in the home under `.jupyter` directory, e.g. 
@@ -87,28 +91,30 @@ Set a password (choose easy one for testing), which is needed later on to log in in browser session: ```console -marie@compute jupyter notebook password Enter password: Verify password: +marie@compute$ jupyter notebook password +Enter password: +Verify password: ``` You get a message like that: -```console +```bash [NotebookPasswordApp] Wrote *hashed password* to -/home/<zih_user>/.jupyter/jupyter_notebook_config.json +/home/marie/.jupyter/jupyter_notebook_config.json ``` I order to create a certificate for secure connections, you can create a self-signed certificate: ```console -marie@compute openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem +marie@compute$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem ``` Fill in the form with decent values. Possible entries for your Jupyter configuration (`.jupyter/jupyter_notebook*config.py*`). -```console +```bash c.NotebookApp.certfile = u'<path-to-cert>/mycert.pem' c.NotebookApp.keyfile = u'<path-to-cert>/mykey.key' @@ -124,11 +130,11 @@ c.NotebookApp.allow_remote_access = True !!! note `<path-to-cert>` - path to key and certificate files, for example: - (`/home/<zih_user>/mycert.pem`) + (`/home/marie/mycert.pem`) ## Slurm job file to run the Jupyter server on ZIH system with GPU (1x K80) (also works on K20) -```console +```bash #!/bin/bash -l #SBATCH --gres=gpu:1 # request GPU #SBATCH --partition=gpu2 # use partition GPU 2 @@ -138,7 +144,7 @@ c.NotebookApp.allow_remote_access = True #SBATCH --time=02:30:00 #SBATCH --mem=4000M #SBATCH -J "jupyter-notebook" # job-name -#SBATCH -A <name_of_your_project> +#SBATCH -A p_marie unset XDG_RUNTIME_DIR # might be required when interactive instead of sbatch to avoid 'Permission denied error' srun jupyter notebook @@ -146,7 +152,7 @@ srun jupyter notebook Start the script above (e.g. with the name `jnotebook`) with sbatch command: -```console +```bash sbatch jnotebook.slurm ``` @@ -155,11 +161,9 @@ If you have a question about sbatch script see the article about [Slurm](../jobs Check by the command: `tail notebook_output.txt` the status and the **token** of the server. It should look like this: -```console -https://(taurusi2092.taurus.hrsk.tu-dresden.de or 127.0.0.1):9999/ -``` +`https://(taurusi2092.taurus.hrsk.tu-dresden.de or 127.0.0.1):9999/` -You can see the **server node's hostname** by the command: `squeue -u <username>`. +You can see the **server node's hostname** by the command: `squeue --me`. ### Remote connect to the server @@ -169,7 +173,7 @@ There are two options on how to connect to the server: solution above. Open the other terminal and configure ssh tunnel: (look up connection values in the output file of Slurm job, e.g.) (recommended): -```console +```bash node=taurusi2092 #see the name of the node with squeue -u <your_login> localport=8887 #local port on your computer remoteport=9999 #pay attention on the value. It should be the same value as value in the notebook_output.txt @@ -183,12 +187,12 @@ pgrep -f "ssh -fNL ${localport}" #verify that tunnel is alive You can connect directly if you know the IP address (just ping the node's hostname while logged on ZIH system). 
-```console -#comand on remote terminal -taurusi2092$> host taurusi2092 -# copy IP address from output +```bash +#command on remote terminal +marie@taurusi2092$ host taurusi2092 +# copy IP address from output # paste IP to your browser or call on local terminal e.g.: -local$> firefox https://<IP>:<PORT> # https important to use SSL cert +marie@local$ firefox https://<IP>:<PORT> # https important to use SSL cert ``` To login into the Jupyter notebook site, you have to enter the **token**. diff --git a/doc.zih.tu-dresden.de/docs/archive/no_ib_jobs.md b/doc.zih.tu-dresden.de/docs/archive/no_ib_jobs.md index 9ccce6361bcaa0bc024644f348708354d269a04f..49007a12354190a0fdde97a14a1a6bda922ea38d 100644 --- a/doc.zih.tu-dresden.de/docs/archive/no_ib_jobs.md +++ b/doc.zih.tu-dresden.de/docs/archive/no_ib_jobs.md @@ -25,8 +25,8 @@ Infiniband access if (and only if) they have set the `--tmp`-option as well: >units can be specified using the suffix \[K\|M\|G\|T\]. This option >applies to job allocations. -Keep in mind: Since the scratch file system are not available and the -project file system is read-only mounted at the compute nodes you have +Keep in mind: Since the scratch filesystem are not available and the +project filesystem is read-only mounted at the compute nodes you have to work in /tmp. A simple job script should do this: @@ -34,7 +34,7 @@ A simple job script should do this: - create a temporary directory on the compute node in `/tmp` and go there - start the application (under /sw/ or /projects/)using input data - from somewhere in the project file system + from somewhere in the project filesystem - archive and transfer the results to some global location ```Bash diff --git a/doc.zih.tu-dresden.de/docs/archive/system_altix.md b/doc.zih.tu-dresden.de/docs/archive/system_altix.md index 951b06137a599fc95239e5d50144fd2fa205e096..aa61353f4bec0c143b7c86892d8f3cb0a3c41d00 100644 --- a/doc.zih.tu-dresden.de/docs/archive/system_altix.md +++ b/doc.zih.tu-dresden.de/docs/archive/system_altix.md @@ -22,9 +22,9 @@ The jobs for these partitions (except Neptun) are scheduled by the [Platform LSF batch system running on `mars.hrsk.tu-dresden.de`. The actual placement of a submitted job may depend on factors like memory size, number of processors, time limit. -### File Systems +### Filesystems -All partitions share the same CXFS file systems `/work` and `/fastfs`. +All partitions share the same CXFS filesystems `/work` and `/fastfs`. ### ccNUMA Architecture @@ -123,8 +123,8 @@ nodes with dedicated resources for the user's job. Normally a job can be submitt #### LSF -The batch system on Atlas is LSF. For general information on LSF, please follow -[this link](platform_lsf.md). +The batch system on Atlas is LSF, see also the +[general information on LSF](platform_lsf.md). #### Submission of Parallel Jobs diff --git a/doc.zih.tu-dresden.de/docs/archive/system_atlas.md b/doc.zih.tu-dresden.de/docs/archive/system_atlas.md index 0e744c4ab702afac9d3ac413ccfb5abd58fef817..2bebd5511e69f98370aea0c721cee272f940fbc6 100644 --- a/doc.zih.tu-dresden.de/docs/archive/system_atlas.md +++ b/doc.zih.tu-dresden.de/docs/archive/system_atlas.md @@ -22,7 +22,7 @@ kernel. Currently, the following hardware is installed: Mars and Deimos users: Please read the [migration hints](migrate_to_atlas.md). -All nodes share the `/home` and `/fastfs` file system with our other HPC systems. Each +All nodes share the `/home` and `/fastfs` filesystem with our other HPC systems. Each node has 180 GB local disk space for scratch mounted on `/tmp`. 
The jobs for the compute nodes are scheduled by the [Platform LSF](platform_lsf.md) batch system from the login nodes `atlas.hrsk.tu-dresden.de` . @@ -86,8 +86,8 @@ user's job. Normally a job can be submitted with these data: #### LSF -The batch system on Atlas is LSF. For general information on LSF, please follow -[this link](platform_lsf.md). +The batch system on Atlas is LSF, see also the +[general information on LSF](platform_lsf.md). #### Submission of Parallel Jobs diff --git a/doc.zih.tu-dresden.de/docs/archive/system_venus.md b/doc.zih.tu-dresden.de/docs/archive/system_venus.md index 2c0a1fe2b83b1c4e7d09f5e2f6495db8658cb7f9..56acf9b47081726c9662150f638ff430e099020c 100644 --- a/doc.zih.tu-dresden.de/docs/archive/system_venus.md +++ b/doc.zih.tu-dresden.de/docs/archive/system_venus.md @@ -19,9 +19,9 @@ the Linux operating system SLES 11 SP 3 with a kernel version 3.x. From our experience, most parallel applications benefit from using the additional hardware hyperthreads. -### File Systems +### Filesystems -Venus uses the same `home` file system as all our other HPC installations. +Venus uses the same `home` filesystem as all our other HPC installations. For computations, please use `/scratch`. ## Usage @@ -77,8 +77,8 @@ nodes with dedicated resources for the user's job. Normally a job can be submitt - files for redirection of output and error messages, - executable and command line parameters. -The batch system on Venus is Slurm. For general information on Slurm, please follow -[this link](../jobs_and_resources/slurm.md). +The batch system on Venus is Slurm. Please see +[general information on Slurm](../jobs_and_resources/slurm.md). #### Submission of Parallel Jobs @@ -92,10 +92,10 @@ On Venus, you can only submit jobs with a core number which is a multiple of 8 ( srun -n 16 a.out ``` -**Please note:** There are different MPI libraries on Taurus and Venus, +**Please note:** There are different MPI libraries on Venus than on other ZIH systems, so you have to compile the binaries specifically for their target. -#### File Systems +#### Filesystems - The large main memory on the system allows users to create RAM disks within their own jobs. diff --git a/doc.zih.tu-dresden.de/docs/contrib/content_rules.md b/doc.zih.tu-dresden.de/docs/contrib/content_rules.md index b6b21e2315f7bbc6b81775f4c38a8021ddda48d5..2be83c1f78668abb764586741a7de764b5baa112 100644 --- a/doc.zih.tu-dresden.de/docs/contrib/content_rules.md +++ b/doc.zih.tu-dresden.de/docs/contrib/content_rules.md @@ -51,6 +51,8 @@ should be highlighted, etc. Code examples, longer than half screen height should ## Writing Style * Capitalize headings, e.g. *Exclusive Reservation of Hardware* +* Give keywords in link texts, e.g. [Code Blocks](#code-blocks-and-syntax-highlighting) is more + descriptive than [this subsection](#code-blocks-and-syntax-highlighting) * Use active over passive voice * Write with confidence. This confidence should be reflected in the documentation, so that the readers trust and follow it. @@ -65,8 +67,11 @@ there is a list of conventions w.r.t. spelling and technical wording. * `I/O` not `IO` * `Slurm` not `SLURM` * `Filesystem` not `file system` -* `ZIH system` and `ZIH systems` not `Taurus` etc. if possible +* `ZIH system` and `ZIH systems` not `Taurus`, `HRSKII`, `our HPC systems`, etc. * `Workspace` not `work space` +* avoid term `HPC-DA` +* Partition names after the keyword *partition*: *partition `ml`* not *ML partition*, *ml + partition*, *`ml` partition*, *"ml" partition*, etc. 
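For illustration, a sentence that follows these wording conventions might read as in this sketch (the wording itself is made up for this example):

```markdown
Copy your results from the `/scratch` filesystem of the ZIH systems before the workspace expires,
and submit follow-up jobs to the partition `ml`.
```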
### Long Options @@ -124,7 +129,7 @@ For consistency, use the following short codes within this project: With the exception of command templates, use `console` for shell session and console: -```` markdown +````markdown ```console marie@login$ ls foo @@ -138,7 +143,7 @@ Command templates use [Placeholders](#mark-placeholders) to mark replaceable cod templates should give a general idea of invocation and thus, do not contain any output. Use a `bash` code block followed by an invocation example (with `console`): -```` markdown +````markdown ```bash marie@local$ ssh -NL <local port>:<compute node>:<remote port> <zih login>@tauruslogin.hrsk.tu-dresden.de ``` @@ -150,7 +155,7 @@ marie@local$ ssh -NL 5901:172.24.146.46:5901 marie@tauruslogin.hrsk.tu-dresden.d Also use `bash` for shell scripts such as job files: -```` markdown +````markdown ```bash #!/bin/bash #SBATCH --nodes=1 @@ -169,7 +174,7 @@ srun a.out `python` for Python source code: -```` markdown +````markdown ```python from time import gmtime, strftime print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) @@ -178,7 +183,7 @@ print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) `pycon` for Python console: -```` markdown +````markdown ```pycon >>> from time import gmtime, strftime >>> print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) @@ -188,7 +193,7 @@ print(strftime("%Y-%m-%d %H:%M:%S", gmtime())) Line numbers can be added via -```` markdown +````markdown ```bash linenums="1" #!/bin/bash @@ -200,6 +205,10 @@ srun a.out ``` ```` +_Result_: + + + Specific Lines can be highlighted by using ```` markdown @@ -214,6 +223,10 @@ srun a.out ``` ```` +_Result_: + + + ### Data Privacy and Generic User Name Where possible, replace login, project name and other private data with clearly arbitrary placeholders. diff --git a/doc.zih.tu-dresden.de/docs/contrib/contribute_browser.md b/doc.zih.tu-dresden.de/docs/contrib/contribute_browser.md new file mode 100644 index 0000000000000000000000000000000000000000..45e8018d263300c03101f1374b6350ce58a131dd --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/contrib/contribute_browser.md @@ -0,0 +1,105 @@ +# Contribution Guide for Browser-based Editing + +In the following, it is outlined how to contribute to the +[HPC documentation](https://doc.zih.tu-dresden.de/) of +[TU Dresden/ZIH](https://tu-dresden.de/zih/) by means of GitLab's web interface using a standard web +browser only. + +## Preparation + +First of all, you need an account on [gitlab.hrz.tu-chemnitz.de](https://gitlab.hrz.tu-chemnitz.de). +Secondly, you need access to the project +[ZIH/hpcsupport/hpc-compendium](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium). + +The project is publicly visible, i.e., it is open to the world and any signed-in user has the +[Guest role](https://gitlab.hrz.tu-chemnitz.de/help/user/permissions.md) on this repository. Guests +have only very +[limited permissions](https://gitlab.hrz.tu-chemnitz.de/help/user/permissions.md#project-members-permissions). +In particular, as guest, you can contribute to the documentation by +[creating issues](howto_contribute.md#contribute-via-issue), but you cannot edit files and create +new branches. + +To be granted the role **Developer**, please request access by clicking the corresponding button. + + + +Once you are granted the developer role, choose "ZIH/hpcsupport/hpc-compendium" in your project list. + +!!! 
hint "Git basics" + + If you are not familiar with the basics of git-based document revision control yet, please have + a look at [Gitlab tutorials](https://gitlab.hrz.tu-chemnitz.de/help/gitlab-basics/index.md). + +## Create a Branch + +Your contribution starts by creating your own branch of the repository that will hold your edits and +additions. Create your branch by clicking on "+" near "preview->hpc-compendium/" as depicted in +the figure and click "New branch". + + + +By default, the new branch should be created from the `preview` branch, as pre-selected. + +Define a branch name that briefly describes what you plan to change, e.g., `edits-in-document-xyz`. +Then, click on "Create branch" as depicted in this figure: + + + +As a result, you should now see your branch's name on top of your list of repository files as +depicted here: + + + +## Editing Existing Articles + +Navigate the depicted document hierarchy under `doc.zih.tu-dresden.de/docs` until you find the +article to be edited. A click on the article's name opens a textual representation of the article. +In the top right corner of it, you find the button "Edit" to be clicked in order to make changes. +Once you completed your changes, click on "Commit changes". Please add meaningful comment about the +changes you made under "Commit message". Feel free to do as many changes and commits as you wish in +your branch of the repository. + +## Adding New Article + +Navigate the depicted document hierarchy under `doc.zih.tu-dresden.de/docs` to find a topic that +fits best to your article. To start a completely new article, click on "+ New file" as depicted +here: + + + +Set a file name that corresponds well to your article like `application_xyz.md`. +(The file name should follow the pattern `fancy_title_and_more.md`.) +Once you completed your initial edits, click on "commit". + + + +Finally, the new article needs to be added to the navigation section of the configuration file +`doc.zih.tu-dresden.de/mkdocs.yaml`. + +## Submitting Articles for Publication + +Once you are satisfied with your edits, you are ready for publication. +Therefore, your edits need to undergo an internal review process and pass the CI/CD pipeline tests. +This process is triggered by creating a "merge request", which serves the purpose of merging your edits +into the `preview` branch of the repository. + +* Click on "Merge requests" (in the menu to the left) as depicted below. +* Then, click on the button "New merge request". +* Select your source branch (for example `edits-in-document-xyz`) and click on "Compare branches and + continue". (The target branch is always `preview`. This is pre-selected - do not change!) +* The next screen will give you an overview of your changes. Please provide a meaningful + description of the contributions. Once you checked them, click on "Create merge request". + + + +## Revision of Articles + +As stated earlier, all changes undergo a review process. +This covers automated checks contained in the CI/CD pipeline and the review by a maintainer. +You can follow this process under +[Merge requests](https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium/-/merge_requests) +(where you initiated your merge request). +If you are asked to make corrections or changes, follow the directions as indicated. +Once your merge request has been accepted, the merge request will be closed and the branch will be deleted. +At this point, there is nothing else to do for you. 
+Except probably for waiting a little while until your changes become visible on the official web site. diff --git a/doc.zih.tu-dresden.de/docs/contrib/contribute_container.md b/doc.zih.tu-dresden.de/docs/contrib/contribute_container.md index d3b87d46d6f45af76665b49a74fb3ed7f580edcb..dd44fafa136d63ae80267226f70dc00563507ba3 100644 --- a/doc.zih.tu-dresden.de/docs/contrib/contribute_container.md +++ b/doc.zih.tu-dresden.de/docs/contrib/contribute_container.md @@ -86,7 +86,25 @@ To avoid a lot of retyping, use the following in your shell: alias wiki="docker run --name=hpc-compendium --rm -it -w /docs --mount src=$PWD/doc.zih.tu-dresden.de,target=/docs,type=bind hpc-compendium bash -c" ``` -You are now ready to use the different checks +You are now ready to use the different checks, however we suggest to try the pre-commit hook. + +#### Pre-commit Git Hook + +We recommend to automatically run checks whenever you try to commit a change. In this case, failing +checks prevent commits (unless you use option `--no-verify`). This can be accomplished by adding a +pre-commit hook to your local clone of the repository. The following code snippet shows how to do +that: + +```bash +cp doc.zih.tu-dresden.de/util/pre-commit .git/hooks/ +``` + +!!! note + The pre-commit hook only works, if you can use docker without using `sudo`. If this is not + already the case, use the command `adduser $USER docker` to enable docker commands without + `sudo` for the current user. Restart the docker daemons afterwards. + +Read on if you want to run a specific check. #### Linter diff --git a/doc.zih.tu-dresden.de/docs/contrib/howto_contribute.md b/doc.zih.tu-dresden.de/docs/contrib/howto_contribute.md index e9a1a20833cead3533ad08303ac8445c7ee54b0f..e0d91cccc3f534e0d7057b72f1d6479f8932b6aa 100644 --- a/doc.zih.tu-dresden.de/docs/contrib/howto_contribute.md +++ b/doc.zih.tu-dresden.de/docs/contrib/howto_contribute.md @@ -14,9 +14,9 @@ For that, open an issue to report typos and missing documentation or request for wording etc. ZIH staff will get in touch with you to resolve the issue and improve the documentation. -??? tip "GIF: Create GitLab Issue" +??? tip "Create an issue in GitLab" -  +  {: align=center} !!! 
warning "HPC support" diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_branch_indicator.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_branch_indicator.png new file mode 100644 index 0000000000000000000000000000000000000000..1c024c55142a12d390d4eaf8306632ed80e0eb9a Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_branch_indicator.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_commit_file.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_commit_file.png new file mode 100644 index 0000000000000000000000000000000000000000..3df543cb2940c808a24bc7be023691aba40ff9c7 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_commit_file.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_branch.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_branch.png new file mode 100644 index 0000000000000000000000000000000000000000..8e9bca4e7fcc8014f725c1c1d024037e23a64204 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_branch.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_file.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_file.png new file mode 100644 index 0000000000000000000000000000000000000000..30fed32f3c5a12b91dc0c7cd2250978653ea84f6 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_create_new_file.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_new_merge_request.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_new_merge_request.png new file mode 100644 index 0000000000000000000000000000000000000000..e74b1ec4d43c6017fa7d1e6326996c30795c71a6 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_new_merge_request.png differ diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/cb_set_branch_name.png b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_set_branch_name.png new file mode 100644 index 0000000000000000000000000000000000000000..4da02249faeea31495c792bc045d593d9b989a04 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/cb_set_branch_name.png differ diff --git a/doc.zih.tu-dresden.de/misc/highlight_lines.png b/doc.zih.tu-dresden.de/docs/contrib/misc/highlight_lines.png similarity index 100% rename from doc.zih.tu-dresden.de/misc/highlight_lines.png rename to doc.zih.tu-dresden.de/docs/contrib/misc/highlight_lines.png diff --git a/doc.zih.tu-dresden.de/misc/lines.png b/doc.zih.tu-dresden.de/docs/contrib/misc/lines.png similarity index 100% rename from doc.zih.tu-dresden.de/misc/lines.png rename to doc.zih.tu-dresden.de/docs/contrib/misc/lines.png diff --git a/doc.zih.tu-dresden.de/docs/contrib/misc/request_access.png b/doc.zih.tu-dresden.de/docs/contrib/misc/request_access.png new file mode 100644 index 0000000000000000000000000000000000000000..c051e93b6a149ed69e95e5d9b653a80110836266 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/contrib/misc/request_access.png differ diff --git a/doc.zih.tu-dresden.de/docs/data_transfer/export_nodes.md b/doc.zih.tu-dresden.de/docs/data_transfer/export_nodes.md index 9ccba626f713f7be9aa72488866fbc34776cc5ee..80ea758c57b09601cadd001aa018c56a2f219a3f 100644 --- a/doc.zih.tu-dresden.de/docs/data_transfer/export_nodes.md +++ b/doc.zih.tu-dresden.de/docs/data_transfer/export_nodes.md @@ -12,7 +12,11 @@ The export nodes are reachable under the hostname `taurusexport.hrsk.tu-dresden. ## Access From Linux There are at least three tool to exchange data between your local workstation and ZIH systems. 
All -are explained in the following abstract in more detail. +are explained in the following section in more detail. + +!!! important + The following explanations require that you have already set up your [SSH configuration + ](../access/ssh_login.md#configuring-default-parameters-for-ssh). ### SCP @@ -22,20 +26,20 @@ in a directory, the option `-r` has to be specified. ??? example "Example: Copy a file from your workstation to ZIH systems" - ```console - marie@local$ scp <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> + ```bash + marie@local$ scp <file> taurusexport:<target-location> # Add -r to copy whole directory - marie@local$ scp -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> + marie@local$ scp -r <directory> taurusexport:<target-location> ``` ??? example "Example: Copy a file from ZIH systems to your workstation" ```console - marie@login$ scp <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location> + marie@login$ scp taurusexport:<file> <target-location> # Add -r to copy whole directory - marie@login$ scp -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location> + marie@login$ scp -r taurusexport:<directory> <target-location> ``` ### SFTP @@ -48,7 +52,7 @@ use compression to increase performance. ```console # Enter virtual command line -marie@local$ sftp <zih-user>@taurusexport.hrsk.tu-dresden.de +marie@local$ sftp taurusexport # Exit virtual command line sftp> exit # or @@ -63,7 +67,7 @@ from this virtual command line, then you have to prefix the command with the let ??? example "Example: Copy a file from your workstation to ZIH systems" ```console - marie@local$ sftp <zih-user>@taurusexport.hrsk.tu-dresden.de + marie@local$ sftp taurusexport # Copy file sftp> put <file> # Copy directory @@ -73,7 +77,7 @@ from this virtual command line, then you have to prefix the command with the let ??? example "Example: Copy a file from ZIH systems to your local workstation" ```console - marie@local$ sftp <zih-user>@taurusexport.hrsk.tu-dresden.de + marie@local$ sftp taurusexport # Copy file sftp> get <file> # Copy directory @@ -95,18 +99,18 @@ the local machine. ```console # Copy file - marie@local$ rsync <file> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> + marie@local$ rsync <file> taurusexport:<target-location> # Copy directory - marie@local$ rsync -r <directory> <zih-user>@taurusexport.hrsk.tu-dresden.de:<target-location> + marie@local$ rsync -r <directory> taurusexport:<target-location> ``` ??? 
example "Example: Copy a file from ZIH systems to your local workstation" ```console # Copy file - marie@local$ rsync <zih-user>@taurusexport.hrsk.tu-dresden.de:<file> <target-location> + marie@local$ rsync taurusexport:<file> <target-location> # Copy directory - marie@local$ rsync -r <zih-user>@taurusexport.hrsk.tu-dresden.de:<directory> <target-location> + marie@local$ rsync -r taurusexport:<directory> <target-location> ``` ## Access From Windows diff --git a/doc.zih.tu-dresden.de/docs/data_transfer/overview.md b/doc.zih.tu-dresden.de/docs/data_transfer/overview.md index 095fa14a96d514f6daea6b8edc8850651ba5f367..c2f4fe1e669b17b4f0cdf21c39e4072d4b80fa5d 100644 --- a/doc.zih.tu-dresden.de/docs/data_transfer/overview.md +++ b/doc.zih.tu-dresden.de/docs/data_transfer/overview.md @@ -2,15 +2,15 @@ ## Moving Data to/from ZIH Systems -There are at least three tools to exchange data between your local workstation and ZIH systems: +There are at least three tools for exchanging data between your local workstation and ZIH systems: `scp`, `rsync`, and `sftp`. Please refer to the offline or online man pages of [scp](https://www.man7.org/linux/man-pages/man1/scp.1.html), [rsync](https://man7.org/linux/man-pages/man1/rsync.1.html), and [sftp](https://man7.org/linux/man-pages/man1/sftp.1.html) for detailed information. -No matter what tool you prefer, it is crucial that the **export nodes** are used prefered way to -copy data to/from ZIH systems. Please follow the linkt to documentation on [export -nodes](export_nodes.md) for further reference and examples. +No matter what tool you prefer, it is crucial that the **export nodes** are used as preferred way to +copy data to/from ZIH systems. Please follow the link to the documentation on +[export nodes](export_nodes.md) for further reference and examples. ## Moving Data Inside ZIH Systems: Datamover diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md index 24d3907def65508bc521a0fd3109b9792c76f19b..60d43b4e73f285901931f652c55aedabc393c451 100644 --- a/doc.zih.tu-dresden.de/docs/index.md +++ b/doc.zih.tu-dresden.de/docs/index.md @@ -13,8 +13,7 @@ Issues concerning this documentation can reported via the GitLab Please check for any already existing issue before submitting your issue in order to avoid duplicate issues. -Contributions from user-side are highly welcome. Please refer to -the detailed [documentation](contrib/howto_contribute.md) to get started. +Contributions from user-side are highly welcome. Please find out more in our [guidelines how to contribute](contrib/howto_contribute.md). **Reminder:** Non-documentation issues and requests need to be send as ticket to [hpcsupport@zih.tu-dresden.de](mailto:hpcsupport@zih.tu-dresden.de). diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md index ca813dbe4b627f2ac74b33163f285c6caa93348b..3d342f628fc7abfeb851500d3cc6fc785d1a03e2 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md @@ -104,7 +104,7 @@ can be preloaded in "Preload modules (modules load):" field. ### Containers Singularity containers enable users to have full control of their software environment. -Detailed information about containers can be found [here](../software/containers.md). +For more information, see the [Singularity container details](../software/containers.md). 
Nvidia [NGC](https://developer.nvidia.com/blog/how-to-run-ngc-deep-learning-containers-with-singularity/) diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md index 1b395644aa972113ac887c764c9a651f56826093..218bd3d4b186efcd583c3fb6c092b4e0dbad3180 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md @@ -12,7 +12,7 @@ users and the ZIH. - Login-Nodes (`tauruslogin[3-6].hrsk.tu-dresden.de`) - each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 each with 12 cores - @ 2.50GHz, MultiThreading Disabled, 64 GB RAM, 128 GB SSD local disk + @ 2.50GHz, Multithreading Disabled, 64 GB RAM, 128 GB SSD local disk - IPs: 141.30.73.\[102-105\] - Transfer-Nodes (`taurusexport3/4.hrsk.tu-dresden.de`, DNS Alias `taurusexport.hrsk.tu-dresden.de`) @@ -25,7 +25,7 @@ users and the ZIH. - 32 nodes, each with - 8 x NVIDIA A100-SXM4 - - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, MultiThreading disabled + - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading disabled - 1 TB RAM - 3.5 TB local memory at NVMe device at `/tmp` - Hostnames: `taurusi[8001-8034]` @@ -35,7 +35,7 @@ users and the ZIH. ## Island 7 - AMD Rome CPUs - 192 nodes, each with - - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, MultiThreading + - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, Multithreading enabled, - 512 GB RAM - 200 GB /tmp on local SSD local disk @@ -66,7 +66,7 @@ For machine learning, we have 32 IBM AC922 nodes installed with this configurati ## Island 4 to 6 - Intel Haswell CPUs - 1456 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) - @ 2.50GHz, MultiThreading disabled, 128 GB SSD local disk + @ 2.50GHz, Multithreading disabled, 128 GB SSD local disk - Hostname: `taurusi4[001-232]`, `taurusi5[001-612]`, `taurusi6[001-612]` - Varying amounts of main memory (selected automatically by the batch @@ -87,7 +87,7 @@ For machine learning, we have 32 IBM AC922 nodes installed with this configurati ### Extension of Island 4 with Broadwell CPUs * 32 nodes, each witch 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz - (**14 cores**), MultiThreading disabled, 64 GB RAM, 256 GB SSD local disk + (**14 cores**), Multithreading disabled, 64 GB RAM, 256 GB SSD local disk * from the users' perspective: Broadwell is like Haswell * Hostname: `taurusi[4233-4264]` * Slurm partition `broadwell` @@ -95,7 +95,7 @@ For machine learning, we have 32 IBM AC922 nodes installed with this configurati ## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs * 64 nodes, each with 2x Intel(R) Xeon(R) CPU E5-E5-2680 v3 (12 cores) - @ 2.50GHz, MultiThreading Disabled, 64 GB RAM (2.67 GB per core), + @ 2.50GHz, Multithreading Disabled, 64 GB RAM (2.67 GB per core), 128 GB SSD local disk, 4x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs * Hostname: `taurusi2[045-108]` * Slurm Partition `gpu` @@ -104,7 +104,7 @@ For machine learning, we have 32 IBM AC922 nodes installed with this configurati ## SMP Nodes - up to 2 TB RAM - 5 Nodes each with 4x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ - 2.20GHz, MultiThreading Disabled, 2 TB RAM + 2.20GHz, Multithreading Disabled, 2 TB RAM - Hostname: `taurussmp[3-7]` - Slurm partition `smp2` @@ -116,7 +116,7 @@ For machine learning, we have 32 IBM AC922 nodes installed with this configurati ## Island 2 Phase 1 - Intel Sandybridge CPUs + NVIDIA K20x GPUs - 44 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2450 (8 cores) @ - 2.10GHz, 
MultiThreading Disabled, 48 GB RAM (3 GB per core), 128 GB + 2.10GHz, Multithreading Disabled, 48 GB RAM (3 GB per core), 128 GB SSD local disk, 2x NVIDIA Tesla K20x (6 GB GDDR RAM) GPUs - Hostname: `taurusi2[001-044]` - Slurm partition `gpu1` diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md index 13b4e5b127c7d1013b1868e823522599fbca55e2..a5bb1980e342b8f1c19ecb6b610a5d481cd98268 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md @@ -1,9 +1,9 @@ # Batch System Slurm -When log in to ZIH systems, you are placed on a login node. There you can manage your +When logging in to ZIH systems, you are placed on a login node. There, you can manage your [data life cycle](../data_lifecycle/overview.md), [setup experiments](../data_lifecycle/experiments.md), and -edit and prepare jobs. The login nodes are not suited for computational work! From the login nodes, +edit and prepare jobs. The login nodes are not suited for computational work! From the login nodes, you can interact with the batch system, e.g., submit and monitor your jobs. ??? note "Batch System" @@ -32,7 +32,7 @@ ZIH uses the batch system Slurm for resource management and job scheduling. Just specify the resources you need in terms of cores, memory, and time and your Slurm will place your job on the system. -This pages provides a brief overview on +This page provides a brief overview on * [Slurm options](#options) to specify resource requirements, * how to submit [interactive](#interactive-jobs) and [batch jobs](#batch-jobs), @@ -60,39 +60,39 @@ There are three basic Slurm commands for job submission and execution: Using `srun` directly on the shell will be blocking and launch an [interactive job](#interactive-jobs). Apart from short test runs, it is recommended to submit your jobs to Slurm for later execution by using [batch jobs](#batch-jobs). For that, you can conveniently -put the parameters directly in a [job file](#job-files) which you can submit using `sbatch [options] -<job file>`. +put the parameters directly in a [job file](#job-files), which you can submit using `sbatch +[options] <job file>`. -During runtime, the environment variable `SLURM_JOB_ID` will be set to the id of your job. The job +At runtime, the environment variable `SLURM_JOB_ID` is set to the id of your job. The job id is unique. The id allows you to [manage and control](#manage-and-control-jobs) your jobs. ## Options -The following table holds the most important options for `srun/sbatch/salloc` to specify resource +The following table contains the most important options for `srun/sbatch/salloc` to specify resource requirements and control communication. ??? tip "Options Table" | Slurm Option | Description | |:---------------------------|:------------| - | `-n, --ntasks=<N>` | number of (MPI) tasks (default: 1) | - | `-N, --nodes=<N>` | number of nodes; there will be `--ntasks-per-node` processes started on each node | - | `--ntasks-per-node=<N>` | number of tasks per allocated node to start (default: 1) | - | `-c, --cpus-per-task=<N>` | number of CPUs per task; needed for multithreaded (e.g. 
OpenMP) jobs; typically `N` should be equal to `OMP_NUM_THREADS` | - | `-p, --partition=<name>` | type of nodes where you want to execute your job (refer to [partitions](partitions_and_limits.md)) | - | `--mem-per-cpu=<size>` | memory need per allocated CPU in MB | - | `-t, --time=<HH:MM:SS>` | maximum runtime of the job | - | `--mail-user=<your email>` | get updates about the status of the jobs | - | `--mail-type=ALL` | for what type of events you want to get a mail; valid options: `ALL`, `BEGIN`, `END`, `FAIL`, `REQUEUE` | - | `-J, --job-name=<name>` | name of the job shown in the queue and in mails (cut after 24 chars) | - | `--no-requeue` | disable requeueing of the job in case of node failure (default: enabled) | - | `--exclusive` | exclusive usage of compute nodes; you will be charged for all CPUs/cores on the node | - | `-A, --account=<project>` | charge resources used by this job to the specified project | - | `-o, --output=<filename>` | file to save all normal output (stdout) (default: `slurm-%j.out`) | - | `-e, --error=<filename>` | file to save all error output (stderr) (default: `slurm-%j.out`) | - | `-a, --array=<arg>` | submit an array job ([examples](slurm_examples.md#array-jobs)) | - | `-w <node1>,<node2>,...` | restrict job to run on specific nodes only | - | `-x <node1>,<node2>,...` | exclude specific nodes from job | + | `-n, --ntasks=<N>` | Number of (MPI) tasks (default: 1) | + | `-N, --nodes=<N>` | Number of nodes; there will be `--ntasks-per-node` processes started on each node | + | `--ntasks-per-node=<N>` | Number of tasks per allocated node to start (default: 1) | + | `-c, --cpus-per-task=<N>` | Number of CPUs per task; needed for multithreaded (e.g. OpenMP) jobs; typically `N` should be equal to `OMP_NUM_THREADS` | + | `-p, --partition=<name>` | Type of nodes where you want to execute your job (refer to [partitions](partitions_and_limits.md)) | + | `--mem-per-cpu=<size>` | Memory need per allocated CPU in MB | + | `-t, --time=<HH:MM:SS>` | Maximum runtime of the job | + | `--mail-user=<your email>` | Get updates about the status of the jobs | + | `--mail-type=ALL` | For what type of events you want to get a mail; valid options: `ALL`, `BEGIN`, `END`, `FAIL`, `REQUEUE` | + | `-J, --job-name=<name>` | Name of the job shown in the queue and in mails (cut after 24 chars) | + | `--no-requeue` | Disable requeueing of the job in case of node failure (default: enabled) | + | `--exclusive` | Exclusive usage of compute nodes; you will be charged for all CPUs/cores on the node | + | `-A, --account=<project>` | Charge resources used by this job to the specified project | + | `-o, --output=<filename>` | File to save all normal output (stdout) (default: `slurm-%j.out`) | + | `-e, --error=<filename>` | File to save all error output (stderr) (default: `slurm-%j.out`) | + | `-a, --array=<arg>` | Submit an array job ([examples](slurm_examples.md#array-jobs)) | + | `-w <node1>,<node2>,...` | Restrict job to run on specific nodes only | + | `-x <node1>,<node2>,...` | Exclude specific nodes from job | !!! note "Output and Error Files" @@ -109,19 +109,19 @@ requirements and control communication. ### Host List If you want to place your job onto specific nodes, there are two options for doing this. Either use -`-p, --partion=<name>` to specify a host group aka. [partition](partitions_and_limits.md) that fits -your needs. Or, use `-w, --nodelist=<host1,host2,..>`) with a list of hosts that will work for you. +`-p, --partition=<name>` to specify a host group aka. 
[partition](partitions_and_limits.md) that fits +your needs. Or, use `-w, --nodelist=<host1,host2,..>` with a list of hosts that will work for you. ## Interactive Jobs Interactive activities like editing, compiling, preparing experiments etc. are normally limited to -the login nodes. For longer interactive sessions you can allocate cores on the compute node with the -command `salloc`. It takes the same options like `sbatch` to specify the required resources. +the login nodes. For longer interactive sessions, you can allocate cores on the compute node with +the command `salloc`. It takes the same options as `sbatch` to specify the required resources. `salloc` returns a new shell on the node, where you submitted the job. You need to use the command `srun` in front of the following commands to have these commands executed on the allocated resources. If you allocate more than one task, please be aware that `srun` will run the command on -each allocated task! +each allocated task by default! The syntax for submitting a job is @@ -132,16 +132,23 @@ marie@login$ srun [options] <command> An example of an interactive session looks like: ```console -marie@login$ srun --pty -n 1 -c 4 --time=1:00:00 --mem-per-cpu=1700 bash -marie@login$ srun: job 13598400 queued and waiting for resources -marie@login$ srun: job 13598400 has been allocated resources +marie@login$ srun --pty --ntasks=1 --cpus-per-task=4 --time=1:00:00 --mem-per-cpu=1700 bash -l +srun: job 13598400 queued and waiting for resources +srun: job 13598400 has been allocated resources marie@compute$ # Now, you can start interactive work with e.g. 4 cores ``` +!!! note "Using `module` commands" + + The [module commands](../software/modules.md) are made available by sourcing the files + `/etc/profile` and `~/.bashrc`. This is done automatically by passing the parameter `-l` to your + shell, as shown in the example above. If you missed adding `-l` at submitting the interactive + session, no worry, you can source this files also later on manually. + !!! note "Partition `interactive`" A dedicated partition `interactive` is reserved for short jobs (< 8h) with not more than one job - per user. Please check the availability of nodes there with `sinfo -p interactive`. + per user. Please check the availability of nodes there with `sinfo --partition=interactive`. ### Interactive X11/GUI Jobs @@ -176,10 +183,10 @@ Batch jobs are encapsulated within [job files](#job-files) and submitted to the environment settings and the commands for executing the application. Using batch jobs and job files has multiple advantages: -* You can reproduce your experiments and work, because it's all steps are saved in a file. +* You can reproduce your experiments and work, because all steps are saved in a file. * You can easily share your settings and experimental setup with colleagues. -* Submit your job file to the scheduling system for later execution. In the meanwhile, you can grab - a coffee and proceed with other work (,e.g., start writing a paper). +* You can submit your job file to the scheduling system for later execution. In the meanwhile, you can + grab a coffee and proceed with other work (e.g., start writing a paper). !!! hint "The syntax for submitting a job file to Slurm is" @@ -208,7 +215,7 @@ srun ./application [options] # Execute parallel application with srun ``` The following two examples show the basic resource specifications for a pure OpenMP application and -a pure MPI application, respectively. 
Within the section [Job Examples](slurm_examples.md) we +a pure MPI application, respectively. Within the section [Job Examples](slurm_examples.md), we provide a comprehensive collection of job examples. ??? example "Job file OpenMP" @@ -230,7 +237,7 @@ provide a comprehensive collection of job examples. ``` * Submisson: `marie@login$ sbatch batch_script.sh` - * Run with fewer CPUs: `marie@login$ sbatch -c 14 batch_script.sh` + * Run with fewer CPUs: `marie@login$ sbatch --cpus-per-task=14 batch_script.sh` ??? example "Job file MPI" @@ -248,7 +255,7 @@ provide a comprehensive collection of job examples. ``` * Submisson: `marie@login$ sbatch batch_script.sh` - * Run with fewer MPI tasks: `marie@login$ sbatch --ntasks 14 batch_script.sh` + * Run with fewer MPI tasks: `marie@login$ sbatch --ntasks=14 batch_script.sh` ## Manage and Control Jobs @@ -289,14 +296,14 @@ marie@login$ whypending <jobid> ### Editing Jobs Jobs that have not yet started can be altered. Using `scontrol update timelimit=4:00:00 -jobid=<jobid>` it is for example possible to modify the maximum runtime. `scontrol` understands many -different options, please take a look at the [man page](https://slurm.schedmd.com/scontrol.html) for -more details. +jobid=<jobid>`, it is for example possible to modify the maximum runtime. `scontrol` understands +many different options, please take a look at the +[scontrol documentation](https://slurm.schedmd.com/scontrol.html) for more details. ### Canceling Jobs The command `scancel <jobid>` kills a single job and removes it from the queue. By using `scancel -u -<username>` you can send a canceling signal to all of your jobs at once. +<username>`, you can send a canceling signal to all of your jobs at once. ### Accounting @@ -317,34 +324,34 @@ marie@login$ sacct [...] ``` -We'd like to point your attention to the following options gain insight in your jobs. +We'd like to point your attention to the following options to gain insight in your jobs. ??? example "Show specific job" ```console - marie@login$ sacct -j <JOBID> + marie@login$ sacct --jobs=<JOBID> ``` ??? example "Show all fields for a specific job" ```console - marie@login$ sacct -j <JOBID> -o All + marie@login$ sacct --jobs=<JOBID> --format=All ``` ??? example "Show specific fields" ```console - marie@login$ sacct -j <JOBID> -o JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy + marie@login$ sacct --jobs=<JOBID> --format=JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy ``` -The manual page (`man sacct`) and the [online reference](https://slurm.schedmd.com/sacct.html) +The manual page (`man sacct`) and the [sacct online reference](https://slurm.schedmd.com/sacct.html) provide a comprehensive documentation regarding available fields and formats. !!! hint "Time span" By default, `sacct` only shows data of the last day. If you want to look further into the past - without specifying an explicit job id, you need to provide a start date via the `-S` option. - A certain end date is also possible via `-E`. + without specifying an explicit job id, you need to provide a start date via the option + `--starttime` (or short: `-S`). A certain end date is also possible via `--endtime` (or `-E`). ??? example "Show all jobs since the beginning of year 2021" @@ -356,7 +363,7 @@ provide a comprehensive documentation regarding available fields and formats. How to ask for a reservation is described in the section [reservations](overview.md#exclusive-reservation-of-hardware). 
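A confirmed reservation is referenced by name at submission time via the Slurm option
`--reservation`. The following is only a sketch; the reservation name `p_number_crunch_123` is a
placeholder for the name you receive from ZIH (see below):

```console
marie@login$ sbatch --reservation=p_number_crunch_123 batch_script.sh
```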
-After we agreed with your requirements, we will send you an e-mail with your reservation name. Then +After we agreed with your requirements, we will send you an e-mail with your reservation name. Then, you could see more information about your reservation with the following command: ```console @@ -387,7 +394,7 @@ constraints, please refer to the [Slurm documentation](https://slurm.schedmd.com | Feature | Description | |:--------|:-------------------------------------------------------------------------| -| DA | subset of Haswell nodes with a high bandwidth to NVMe storage (island 6) | +| DA | Subset of Haswell nodes with a high bandwidth to NVMe storage (island 6) | #### Filesystem Features diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md index 396657db06766eaab6f8694ca4bed4f8014cf7f4..2af016d0188ae4f926b45e7b8fdc14b039e8baa3 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md @@ -39,7 +39,7 @@ For MPI-parallel jobs one typically allocates one core per task that has to be s There are different MPI libraries on ZIH systems for the different micro archtitectures. Thus, you have to compile the binaries specifically for the target architecture and partition. Please refer to the sections [building software](../software/building_software.md) and - [module environments](../software/runtime_environment.md#module-environments) for detailed + [module environments](../software/modules.md#module-environments) for detailed information. !!! example "Job file for MPI application" diff --git a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md index 9bc564d05a310005edc1d5564549db8da08ee415..5636a870ae8821094f2bad1e2248fc08be767b9e 100644 --- a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md +++ b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks_spark.md @@ -1,9 +1,5 @@ # Big Data Frameworks: Apache Spark -!!! note - - This page is under construction - [Apache Spark](https://spark.apache.org/), [Apache Flink](https://flink.apache.org/) and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating Big Data. These frameworks are also offered as software [modules](modules.md) in both `ml` and @@ -13,18 +9,13 @@ Big Data. These frameworks are also offered as software [modules](modules.md) in marie@login$ module avail Spark ``` -The **aim** of this page is to introduce users on how to start working with -these frameworks on ZIH systems. - **Prerequisites:** To work with the frameworks, you need [access](../access/ssh_login.md) to ZIH systems and basic knowledge about data analysis and the batch system [Slurm](../jobs_and_resources/slurm.md). -The usage of Big Data frameworks is -different from other modules due to their master-worker approach. That -means, before an application can be started, one has to do additional steps. -In the following, we assume that a Spark application should be -started. +The usage of Big Data frameworks is different from other modules due to their master-worker +approach. That means, before an application can be started, one has to do additional steps. +In the following, we assume that a Spark application should be started. The steps are: @@ -34,13 +25,7 @@ The steps are: 1. 
Start the Spark application Apache Spark can be used in [interactive](#interactive-jobs) and [batch](#batch-jobs) jobs as well -as via [Jupyter notebook](#jupyter-notebook). All three ways are outlined in the following. - -!!! note - - It is recommended to use ssh keys to avoid entering the password - every time to log in to nodes. For the details, please check the - [external documentation](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-ssh-configuration-keypairs). +as via [Jupyter notebooks](#jupyter-notebook). All three ways are outlined in the following. ## Interactive Jobs @@ -49,22 +34,13 @@ as via [Jupyter notebook](#jupyter-notebook). All three ways are outlined in the The Spark module is available in both `scs5` and `ml` environments. Thus, Spark can be executed using different CPU architectures, e.g., Haswell and Power9. -Let us assume that two nodes should be used for the computation. Use a -`srun` command similar to the following to start an interactive session -using the partition haswell. The following code snippet shows a job submission -to haswell nodes with an allocation of two nodes with 60 GB main memory -exclusively for one hour: - -```console -marie@login$ srun --partition=haswell -N 2 --mem=60g --exclusive --time=01:00:00 --pty bash -l -``` - -The command for different resource allocation on the partition `ml` is -similar, e. g. for a job submission to `ml` nodes with an allocation of one -node, one task per node, two CPUs per task, one GPU per node, with 10000 MB for one hour: +Let us assume that two nodes should be used for the computation. Use a `srun` command similar to +the following to start an interactive session using the partition haswell. The following code +snippet shows a job submission to haswell nodes with an allocation of two nodes with 50 GB main +memory exclusively for one hour: ```console -marie@login$ srun --partition=ml -N 1 -n 1 -c 2 --gres=gpu:1 --mem-per-cpu=10000 --time=01:00:00 --pty bash +marie@login$ srun --partition=haswell --nodes=2 --mem=50g --exclusive --time=01:00:00 --pty bash -l ``` Once you have the shell, load Spark using the command @@ -73,25 +49,22 @@ Once you have the shell, load Spark using the command marie@compute$ module load Spark ``` -Before the application can be started, the Spark cluster needs to be set -up. To do this, configure Spark first using configuration template at -`$SPARK_HOME/conf`: +Before the application can be started, the Spark cluster needs to be set up. To do this, configure +Spark first using configuration template at `$SPARK_HOME/conf`: ```console marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf ``` -This places the configuration in a directory called -`cluster-conf-<JOB_ID>` in your `home` directory, where `<JOB_ID>` stands -for the id of the Slurm job. After that, you can start Spark in the -usual way: +This places the configuration in a directory called `cluster-conf-<JOB_ID>` in your `home` +directory, where `<JOB_ID>` stands for the id of the Slurm job. After that, you can start Spark in +the usual way: ```console marie@compute$ start-all.sh ``` -The Spark processes should now be set up and you can start your -application, e. g.: +The Spark processes should now be set up and you can start your application, e. 
g.: ```console marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000 @@ -104,24 +77,22 @@ marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOM ### Custom Configuration -The script `framework-configure.sh` is used to derive a configuration from -a template. It takes two parameters: +The script `framework-configure.sh` is used to derive a configuration from a template. It takes two +parameters: - The framework to set up (Spark, Flink, Hadoop) - A configuration template -Thus, you can modify the configuration by replacing the default -configuration template with a customized one. This way, your custom -configuration template is reusable for different jobs. You can start -with a copy of the default configuration ahead of your interactive -session: +Thus, you can modify the configuration by replacing the default configuration template with a +customized one. This way, your custom configuration template is reusable for different jobs. You +can start with a copy of the default configuration ahead of your interactive session: ```console marie@login$ cp -r $SPARK_HOME/conf my-config-template ``` -After you have changed `my-config-template`, you can use your new template -in an interactive job with: +After you have changed `my-config-template`, you can use your new template in an interactive job +with: ```console marie@compute$ source framework-configure.sh spark my-config-template @@ -129,8 +100,8 @@ marie@compute$ source framework-configure.sh spark my-config-template ### Using Hadoop Distributed Filesystem (HDFS) -If you want to use Spark and HDFS together (or in general more than one -framework), a scheme similar to the following can be used: +If you want to use Spark and HDFS together (or in general more than one framework), a scheme +similar to the following can be used: ```console marie@compute$ module load Hadoop @@ -143,20 +114,49 @@ marie@compute$ start-all.sh ## Batch Jobs -Using `srun` directly on the shell blocks the shell and launches an -interactive job. Apart from short test runs, it is **recommended to -launch your jobs in the background using batch jobs**. For that, you can -conveniently put the parameters directly into the job file and submit it via +Using `srun` directly on the shell blocks the shell and launches an interactive job. Apart from +short test runs, it is **recommended to launch your jobs in the background using batch jobs**. For +that, you can conveniently put the parameters directly into the job file and submit it via `sbatch [options] <job file>`. -Please use a [batch job](../jobs_and_resources/slurm.md) similar to -[example-spark.sbatch](misc/example-spark.sbatch). +Please use a [batch job](../jobs_and_resources/slurm.md) with a configuration, similar to the +example below: + +??? example "spark.sbatch" + ```bash + #!/bin/bash -l + #SBATCH --time=00:05:00 + #SBATCH --partition=haswell + #SBATCH --nodes=2 + #SBATCH --exclusive + #SBATCH --mem=50G + #SBATCH --job-name="example-spark" + + ml Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0 + + function myExitHandler () { + stop-all.sh + } + + #configuration + . 
framework-configure.sh spark $SPARK_HOME/conf + + #register cleanup hook in case something goes wrong + trap myExitHandler EXIT + + start-all.sh + + spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000 + + stop-all.sh + + exit 0 + ``` ## Jupyter Notebook -There are two general options on how to work with Jupyter notebooks: -There is [JupyterHub](../access/jupyterhub.md), where you can simply -run your Jupyter notebook on HPC nodes (the preferable way). +You can run Jupyter notebooks with Spark on the ZIH systems in a similar way as described on the +[JupyterHub](../access/jupyterhub.md) page. ### Preparation @@ -165,25 +165,25 @@ to [normal Python virtual environments](../software/python_virtual_environments. You start with an allocation: ```console -marie@login$ srun --pty -n 1 -c 2 --mem-per-cpu=2500 -t 01:00:00 bash -l +marie@login$ srun --pty --ntasks=1 --cpus-per-task=2 --mem-per-cpu=2500 --time=01:00:00 bash -l ``` -When a node is allocated, install he required packages: +When a node is allocated, install the required packages: ```console -marie@compute$ cd +marie@compute$ cd $HOME marie@compute$ mkdir jupyter-kernel +marie@compute$ module load Python marie@compute$ virtualenv --system-site-packages jupyter-kernel/env #Create virtual environment [...] marie@compute$ source jupyter-kernel/env/bin/activate #Activate virtual environment. -marie@compute$ pip install ipykernel +(env) marie@compute$ pip install ipykernel [...] -marie@compute$ python -m ipykernel install --user --name haswell-py3.7-spark --display-name="haswell-py3.7-spark" +(env) marie@compute$ python -m ipykernel install --user --name haswell-py3.7-spark --display-name="haswell-py3.7-spark" Installed kernelspec haswell-py3.7-spark in [...] -marie@compute$ pip install findspark - -marie@compute$ deactivate +(env) marie@compute$ pip install findspark +(env) marie@compute$ deactivate ``` You are now ready to spawn a notebook with Spark. @@ -192,23 +192,19 @@ You are now ready to spawn a notebook with Spark. Assuming that you have prepared everything as described above, you can go to [https://taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter). -In the tab "Advanced", go -to the field "Preload modules" and select one of the Spark modules. -When your Jupyter instance is started, check whether the kernel that -you created in the preparation phase (see above) is shown in the top -right corner of the notebook. If it is not already selected, select the -kernel `haswell-py3.7-spark`. Then, you can set up Spark. Since the setup -in the notebook requires more steps than in an interactive session, we -have created an example notebook that you can use as a starting point -for convenience: [SparkExample.ipynb](misc/SparkExample.ipynb) +In the tab "Advanced", go to the field "Preload modules" and select one of the Spark modules. When +your Jupyter instance is started, check whether the kernel that you created in the preparation +phase (see above) is shown in the top right corner of the notebook. If it is not already selected, +select the kernel `haswell-py3.7-spark`. Then, you can set up Spark. Since the setup in the +notebook requires more steps than in an interactive session, we have created an example notebook +that you can use as a starting point for convenience: [SparkExample.ipynb](misc/SparkExample.ipynb) !!! 
note - You could work with simple examples in your home directory but according to the - [storage concept](../data_lifecycle/overview.md) - **please use [workspaces](../data_lifecycle/workspaces.md) for - your study and work projects**. For this reason, you have to use - advanced options of Jupyterhub and put "/" in "Workspace scope" field. + You could work with simple examples in your home directory, but, according to the + [storage concept](../data_lifecycle/overview.md), **please use + [workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. For this + reason, you have to use advanced options of Jupyterhub and put "/" in "Workspace scope" field. ## FAQ @@ -222,10 +218,9 @@ re-login to the ZIH system. Q: There are a lot of errors and warnings during the set up of the session -A: Please check the work capability on a simple example. The source of -warnings could be ssh etc, and it could be not affecting the frameworks +A: Please check the work capability on a simple example as shown in this documentation. !!! help - If you have questions or need advice, please see - [https://www.scads.de/transfer-2/beratung-und-support-en/](https://www.scads.de/transfer-2/beratung-und-support-en/) or contact the HPC support. + If you have questions or need advice, please use the contact form on + [https://scads.ai/contact/](https://scads.ai/contact/) or contact the HPC support. diff --git a/doc.zih.tu-dresden.de/docs/software/building_software.md b/doc.zih.tu-dresden.de/docs/software/building_software.md index 33aeaf919fa3bad56c14fdb6aa130eafac6c0d5d..c3bd76ce331034247b162630b08d36f982ebc45d 100644 --- a/doc.zih.tu-dresden.de/docs/software/building_software.md +++ b/doc.zih.tu-dresden.de/docs/software/building_software.md @@ -1,42 +1,39 @@ # Building Software -While it is possible to do short compilations on the login nodes, it is -generally considered good practice to use a job for that, especially -when using many parallel make processes. Note that starting on December -6th 2016, the /projects file system will be mounted read-only on all -compute nodes in order to prevent users from doing large I/O there -(which is what the /scratch is for). In consequence, you cannot compile -in /projects within a job anymore. If you wish to install software for -your project group anyway, you can use a build directory in the /scratch -file system instead: - -Every sane build system should allow you to keep your source code tree -and your build directory separate, some even demand them to be different -directories. Plus, you can set your installation prefix (the target -directory) back to your /projects folder and do the "make install" step -on the login nodes. - -For instance, when using CMake and keeping your source in /projects, you -could do the following: - - # save path to your source directory: - export SRCDIR=/projects/p_myproject/mysource - - # create a build directory in /scratch: - mkdir /scratch/p_myproject/mysoftware_build - - # change to build directory within /scratch: - cd /scratch/p_myproject/mysoftware_build - - # create Makefiles: - cmake -DCMAKE_INSTALL_PREFIX=/projects/p_myproject/mysoftware $SRCDIR - - # build in a job: - srun --mem-per-cpu=1500 -c 12 --pty make -j 12 - - # do the install step on the login node again: - make install - -As a bonus, your compilation should also be faster in the parallel -/scratch file system than it would be in the comparatively slow -NFS-based /projects file system. 
+While it is possible to do short compilations on the login nodes, it is generally considered good +practice to use a job for that, especially when using many parallel make processes. Note that +starting on December 6th 2016, the `/projects` filesystem will be mounted read-only on all compute +nodes in order to prevent users from doing large I/O there (which is what the `/scratch` is for). +In consequence, you cannot compile in `/projects` within a job anymore. If you wish to install +software for your project group anyway, you can use a build directory in the `/scratch` filesystem +instead: + +Every sane build system should allow you to keep your source code tree and your build directory +separate, some even demand them to be different directories. Plus, you can set your installation +prefix (the target directory) back to your `/projects` folder and do the "make install" step on the +login nodes. + +For instance, when using CMake and keeping your source in `/projects`, you could do the following: + +```console +# save path to your source directory: +marie@login$ export SRCDIR=/projects/p_myproject/mysource + +# create a build directory in /scratch: +marie@login$ mkdir /scratch/p_myproject/mysoftware_build + +# change to build directory within /scratch: +marie@login$ cd /scratch/p_myproject/mysoftware_build + +# create Makefiles: +marie@login$ cmake -DCMAKE_INSTALL_PREFIX=/projects/p_myproject/mysoftware $SRCDIR + +# build in a job: +marie@login$ srun --mem-per-cpu=1500 --cpus-per-task=12 --pty make -j 12 + +# do the install step on the login node again: +marie@login$ make install +``` + +As a bonus, your compilation should also be faster in the parallel `/scratch` filesystem than it +would be in the comparatively slow NFS-based `/projects` filesystem. diff --git a/doc.zih.tu-dresden.de/docs/software/containers.md b/doc.zih.tu-dresden.de/docs/software/containers.md index 93c2762667be1e5addecff38c3cf38d08ac60d7e..bbb3e80772f3fcc71480e4555fb146f602806804 100644 --- a/doc.zih.tu-dresden.de/docs/software/containers.md +++ b/doc.zih.tu-dresden.de/docs/software/containers.md @@ -31,9 +31,9 @@ environment. creating a new container requires root privileges. However, new containers can be created on your local workstation and moved to ZIH systems for -execution. Follow the instructions for [locally install Singularity](#local-installation) and -[container creation](#container-creation). Moreover, existing Docker container can easily be -converted, which is documented [here](#importing-a-docker-container). +execution. Follow the instructions for [locally installing Singularity](#local-installation) and +[container creation](#container-creation). Moreover, existing Docker container can easily be +converted, see [Import a docker container](#importing-a-docker-container). If you are already familar with Singularity, you might be more intressted in our [singularity recipes and hints](singularity_recipe_hints.md). 
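As a quick, hedged sketch of the conversion mentioned above: a public Docker image can usually be
pulled and built into a Singularity image on your local workstation and then copied to the ZIH
systems. The image name, tag, and copy target below are placeholders; the linked sections describe
the full procedure:

```console
marie@local$ singularity build my_container.sif docker://ubuntu:20.04
marie@local$ scp my_container.sif taurus.hrsk.tu-dresden.de:
```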
diff --git a/doc.zih.tu-dresden.de/docs/software/fem_software.md b/doc.zih.tu-dresden.de/docs/software/fem_software.md index aa8917ad3c59fbd1c60a349e9030ff830f904234..3be2314889bfe45f9554fb499c4d757337bef33d 100644 --- a/doc.zih.tu-dresden.de/docs/software/fem_software.md +++ b/doc.zih.tu-dresden.de/docs/software/fem_software.md @@ -11,7 +11,7 @@ marie@login$ module load ANSYS/<version> ``` - The section [runtime environment](runtime_environment.md) provides a comprehensive overview + The section [runtime environment](modules.md) provides a comprehensive overview on the module system and relevant commands. ## Abaqus diff --git a/doc.zih.tu-dresden.de/docs/software/lo2s.md b/doc.zih.tu-dresden.de/docs/software/lo2s.md new file mode 100644 index 0000000000000000000000000000000000000000..cf34feccfca15e1e37d5278f30117aaba827e800 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/lo2s.md @@ -0,0 +1,141 @@ +# lo2s - Lightweight Node-Level Performance Monitoring + +`lo2s` creates parallel OTF2 traces with a focus on both application and system view. +The traces can contain any of the following information: + +* From running threads + * Calling context samples based on instruction overflows + * The calling context samples are annotated with the disassembled assembler instruction string + * The frame pointer-based call-path for each calling context sample + * Per-thread performance counter readings + * Which thread was scheduled on which CPU at what time +* From the system + * Metrics from tracepoints (e.g., the selected C-state or P-state) + * The node-level system tree (CPUs (HW-threads), cores, packages) + * CPU power measurements (x86_energy) + * Microarchitecture specific metrics (x86_adapt, per package or core) + * Arbitrary metrics through plugins (Score-P compatible) + +In general, `lo2s` operates either in **process monitoring** or **system monitoring** mode. + +With **process monitoring**, all information is grouped by each thread of a monitored process +group - it shows you *on which CPU is each monitored thread running*. `lo2s` either acts as a +prefix command to run the process (and also tracks its children) or `lo2s` attaches to a running +process. + +In the **system monitoring** mode, information is grouped by logical CPU - it shows you +*which thread was running on a given CPU*. Metrics are also shown per CPU. + +In both modes, `lo2s` always groups system-level metrics (e.g., tracepoints) by their respective +system hardware component. + +## Usage + +Only the basic usage is shown in this Wiki. For a more detailed explanation, refer to the +[Lo2s website](https://github.com/tud-zih-energy/lo2s). + +Before using `lo2s`, set up the correct environment with + +```console +marie@login$ module load lo2s +``` + +As `lo2s` is built upon [perf](perf_tools.md), its usage and limitations are very similar to that. +In particular, you can use `lo2s` as a prefix command just like `perf`. Even some of the command +line arguments are inspired by `perf`. The main difference to `perf` is that `lo2s` will output +a [Vampir trace](vampir.md), which allows a full-blown performance analysis almost like +[Score-P](scorep.md). + +To record the behavior of an application, prefix the application run with `lo2s`. We recommend +using the double dash `--` to prevent mixing command line arguments between `lo2s` and the user +application. In the following example, we run `lo2s` on the application `sleep 2`. 
+
+```console
+marie@compute$ lo2s --no-kernel -- sleep 2
+[ lo2s: sleep 2 (0), 1 threads, 0.014082s CPU, 2.03315s total ]
+[ lo2s: 5 wakeups, wrote 2.48 KiB lo2s_trace_2021-10-12T12-39-06 ]
+```
+
+This will record the application in the `process monitoring mode`. This means that the application's
+process, its forked processes, and threads are recorded and can be analyzed using Vampir.
+The main view will represent each process and thread over time. There will be a metric "CPU"
+indicating for each process on which CPU it was executed during the runtime.
+
+## Required Permissions
+
+By design, `lo2s` almost exclusively utilizes Linux Kernel facilities such as perf and tracepoints
+to perform the application measurements. For security reasons, these facilities require special
+permissions, in particular `perf_event_paranoid` and read permissions to the `debugfs` under
+`/sys/kernel/debug`.
+
+Luckily, for the `process monitoring mode`, the default settings allow you to run `lo2s` just fine.
+All you need to do is pass the `--no-kernel` parameter like in the example above.
+
+For the `system monitoring mode`, you can get the required permission with the Slurm parameter
+`--exclusive`. (Note: Regardless of the actually requested processes per node, you will accrue
+CPU-hours as if you had reserved all cores on the node.)
+
+## Memory Requirements
+
+When requesting memory for your jobs, you need to take into account that `lo2s` needs a substantial
+amount of memory for its operation. Unfortunately, the amount of memory depends on the application.
+The amount mainly scales with the number of processes spawned by the traced application. For each
+process, there is a fixed-size buffer. This should be fine for a typical HPC application, but it
+can lead to extreme cases where the buffers are orders of magnitude larger than the resulting trace.
+For instance, recording a CMake run, which spawns hundreds of processes that each run only for
+a few milliseconds, leaves each buffer almost empty. Still, the buffers need to be allocated
+and thus require a lot of memory.
+
+In such a case, we recommend using the `system monitoring mode` instead, as the memory in this
+mode scales with the number of logical CPUs instead of the number of processes.
+
+## Advanced Topic: System Monitoring
+
+The `system monitoring mode` gives a different view. As the name implies, the focus isn't on processes
+anymore, but on the system as a whole. In particular, a trace recorded in this mode will show a timeline
+for each logical CPU of the system. To enable this mode, you need to pass the `-a` parameter.
+
+```console
+marie@compute$ lo2s -a
+^C[ lo2s (system mode): monitored processes: 0, 0.136623s CPU, 13.7872s total ]
+[ lo2s (system mode): 36 wakeups, wrote 301.39 KiB lo2s_trace_2021-11-01T09-44-31 ]
+```
+
+Note: As you can read in the above example, `lo2s` monitored zero processes even though it was run
+in the `system monitoring mode`. Certainly, there are more processes than that running on a system.
+However, as the user accounts on our HPC systems are limited to only see their own processes and `lo2s`
+records in the scope of the user, it will only see the user's own processes. Hence, in the example
+above, there are no other processes visible.
+
+When using the `system monitoring mode` without passing a program, `lo2s` will run indefinitely.
+You can stop the measurement by sending `lo2s` a `SIGINT` signal or by hitting `ctrl+C`.
However, if you pass +a program, `lo2s` will start that program and run the measurement until the started process finishes. +Of course, the process and any of its child processes and threads will be visible in the resulting trace. + +```console +marie@compute$ lo2s -a -- sleep 10 +[ lo2s (system mode): sleep 10 (0), 1 threads, monitored processes: 1, 0.133598s CPU, 10.3996s total ] +[ lo2s (system mode): 39 wakeups, wrote 280.39 KiB lo2s_trace_2021-11-01T09-55-04 ] +``` + +Like in the `process monitoring mode`, `lo2s` can also sample instructions in the system monitoring mode. +You can enable the instruction sampling by passing the parameter `--instruction-sampling` to `lo2s`. + +```console +marie@compute$ lo2s -a --instruction-sampling -- make -j +[ lo2s (system mode): make -j (0), 268 threads, monitored processes: 286, 258.789s CPU, 445.076s total ] +[ lo2s (system mode): 3815 wakeups, wrote 39.24 MiB lo2s_trace_2021-10-29T15-08-44 ] +``` + +## Advanced Topic: Metric Plugins + +`Lo2s` is compatible with [Score-P](scorep.md) metric plugins, but only a subset will work. +In particular, `lo2s` only supports asynchronous plugins with the per host or once scope. +You can find a large set of plugins in the [Score-P Organization on GitHub](https://github.com/score-p). + +To activate plugins, you can use the same environment variables as with Score-P, or with `LO2S` as +prefix: + + - LO2S_METRIC_PLUGINS + - LO2S_METRIC_PLUGIN + - LO2S_METRIC_PLUGIN_PLUGIN diff --git a/doc.zih.tu-dresden.de/docs/software/mathematics.md b/doc.zih.tu-dresden.de/docs/software/mathematics.md index 9629e76b77cd8779a993c6c1f3bc5b0fe68d1140..5b8e23b2fd3ed373bdf7bf6394ae3b2faf98ce74 100644 --- a/doc.zih.tu-dresden.de/docs/software/mathematics.md +++ b/doc.zih.tu-dresden.de/docs/software/mathematics.md @@ -21,9 +21,9 @@ font manager. You need to copy the fonts from ZIH systems to your local system and expand the font path -```bash -localhost$ scp -r taurus.hrsk.tu-dresden.de:/sw/global/applications/mathematica/10.0/SystemFiles/Fonts/Type1/ ~/.fonts -localhost$ xset fp+ ~/.fonts/Type1 +```console +marie@local$ scp -r taurus.hrsk.tu-dresden.de:/sw/global/applications/mathematica/10.0/SystemFiles/Fonts/Type1/ ~/.fonts +marie@local$ xset fp+ ~/.fonts/Type1 ``` #### Windows Workstation @@ -93,41 +93,41 @@ interfaces with the Maple symbolic engine, allowing it to be part of a full comp Running MATLAB via the batch system could look like this (for 456 MB RAM per core and 12 cores reserved). Please adapt this to your needs! -```bash -zih$ module load MATLAB -zih$ srun -t 8:00 -c 12 --mem-per-cpu=456 --pty --x11=first bash -zih$ matlab +```console +marie@login$ module load MATLAB +marie@login$ srun --time=8:00 --cpus-per-task=12 --mem-per-cpu=456 --pty --x11=first bash +marie@compute$ matlab ``` With following command you can see a list of installed software - also the different versions of matlab. -```bash -zih$ module avail +```console +marie@login$ module avail ``` Please choose one of these, then load the chosen software with the command: ```bash -zih$ module load MATLAB/version +marie@login$ module load MATLAB/<version> ``` Or use: -```bash -zih$ module load MATLAB +```console +marie@login$ module load MATLAB ``` (then you will get the most recent Matlab version. 
-[Refer to the modules section for details.](../software/runtime_environment.md#modules)) +[Refer to the modules section for details.](../software/modules.md#modules)) ### Interactive If X-server is running and you logged in at ZIH systems, you should allocate a CPU for your work with command -```bash -zih$ srun --pty --x11=first bash +```console +marie@login$ srun --pty --x11=first bash ``` - now you can call "matlab" (you have 8h time to work with the matlab-GUI) @@ -138,8 +138,9 @@ Using Scripts You have to start matlab-calculation as a Batch-Job via command -```bash -srun --pty matlab -nodisplay -r basename_of_your_matlab_script #NOTE: you must omit the file extension ".m" here, because -r expects a matlab command or function call, not a file-name. +```console +marie@login$ srun --pty matlab -nodisplay -r basename_of_your_matlab_script +# NOTE: you must omit the file extension ".m" here, because -r expects a matlab command or function call, not a file-name. ``` !!! info "License occupying" @@ -160,7 +161,7 @@ You can find detailed documentation on the Matlab compiler at Compile your `.m` script into a binary: ```bash -mcc -m name_of_your_matlab_script.m -o compiled_executable -R -nodisplay -R -nosplash +marie@login$ mcc -m name_of_your_matlab_script.m -o compiled_executable -R -nodisplay -R -nosplash ``` This will also generate a wrapper script called `run_compiled_executable.sh` which sets the required @@ -172,41 +173,35 @@ Then run the binary via the wrapper script in a job (just a simple example, you [sbatch script](../jobs_and_resources/slurm.md#job-submission) for that) ```bash -zih$ srun ./run_compiled_executable.sh $EBROOTMATLAB +marie@login$ srun ./run_compiled_executable.sh $EBROOTMATLAB ``` ### Parallel MATLAB #### With 'local' Configuration -- If you want to run your code in parallel, please request as many - cores as you need! -- start a batch job with the number N of processes -- example for N= 4: `srun -c 4 --pty --x11=first bash` -- run Matlab with the GUI or the CLI or with a script -- inside use `matlabpool open 4` to start parallel - processing +- If you want to run your code in parallel, please request as many cores as you need! +- Start a batch job with the number `N` of processes, e.g., `srun --cpus-per-task=4 --pty + --x11=first bash -l` +- Run Matlab with the GUI or the CLI or with a script +- Inside Matlab use `matlabpool open 4` to start parallel processing -- example for 1000*1000 matrix multiplication - -!!! example +!!! example "Example for 1000*1000 matrix-matrix multiplication" ```bash R = distributed.rand(1000); D = R * R ``` -- to close parallel task: -`matlabpool close` +- Close parallel task using `matlabpool close` #### With parfor -- start a batch job with the number N of processes (e.g. N=12) -- inside use `matlabpool open N` or - `matlabpool(N)` to start parallel processing. It will use +- Start a batch job with the number `N` of processes (,e.g., `N=12`) +- Inside use `matlabpool open N` or `matlabpool(N)` to start parallel processing. It will use the 'local' configuration by default. -- Use `parfor` for a parallel loop, where the **independent** loop - iterations are processed by N threads +- Use `parfor` for a parallel loop, where the **independent** loop iterations are processed by `N` + threads !!! 
example diff --git a/doc.zih.tu-dresden.de/docs/software/misc/SparkExample.ipynb b/doc.zih.tu-dresden.de/docs/software/misc/SparkExample.ipynb index ffe1aa174859fe6697f65af7ce7bd09d526e4bc1..67eb37e898667946a0a6dbdf60bc104908e12601 100644 --- a/doc.zih.tu-dresden.de/docs/software/misc/SparkExample.ipynb +++ b/doc.zih.tu-dresden.de/docs/software/misc/SparkExample.ipynb @@ -9,7 +9,13 @@ "%%bash\n", "echo $SPARK_HOME\n", "echo $JAVA_HOME\n", - "hostname" + "hostname\n", + "if [ ! -d $HOME/jupyter-spark-conf ]\n", + "then\n", + "cp -r $SPARK_HOME/conf $HOME/jupyter-spark-conf\n", + "chmod -R u+w $HOME/jupyter-spark-conf\n", + "echo \"ml `ml -t list Spark` 2>/dev/null\" >> $HOME/jupyter-spark-conf/spark-env.sh\n", + "fi" ] }, { @@ -30,7 +36,7 @@ "metadata": {}, "outputs": [], "source": [ - "!SHELL=/bin/bash bash framework-configure.sh spark $SPARK_HOME/conf " + "!SHELL=/bin/bash bash framework-configure.sh spark $HOME/jupyter-spark-conf" ] }, { @@ -48,8 +54,6 @@ "metadata": {}, "outputs": [], "source": [ - "#import findspark\n", - "#findspark.init()\n", "import platform\n", "import pyspark\n", "from pyspark import SparkContext" @@ -104,6 +108,15 @@ "!ps -ef | grep -i java" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pkill -f \"pyspark-shell\"" + ] + }, { "cell_type": "code", "execution_count": null, @@ -114,9 +127,9 @@ ], "metadata": { "kernelspec": { - "display_name": "haswell-py3.6-spark", + "display_name": "haswell-py3.7-spark", "language": "python", - "name": "haswell-py3.6-spark" + "name": "haswell-py3.7-spark" }, "language_info": { "codemirror_mode": { @@ -128,7 +141,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.10" + "version": "3.7.4" } }, "nbformat": 4, diff --git a/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch b/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch deleted file mode 100644 index 2fcf3aa39b8e66b004fa0fed621475e3200f9d76..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/misc/example-spark.sbatch +++ /dev/null @@ -1,27 +0,0 @@ -#!/bin/bash -#SBATCH --time=00:03:00 -#SBATCH --partition=haswell -#SBATCH --nodes=1 -#SBATCH --exclusive -#SBATCH --mem=50G -#SBATCH -J "example-spark" - -ml Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0 - -function myExitHandler () { - stop-all.sh -} - -#configuration -. framework-configure.sh spark $SPARK_HOME/conf - -#register cleanup hook in case something goes wrong -trap myExitHandler EXIT - -start-all.sh - -spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000 - -stop-all.sh - -exit 0 diff --git a/doc.zih.tu-dresden.de/docs/software/modules.md b/doc.zih.tu-dresden.de/docs/software/modules.md index 8f5a0ae2c4792fd92c458dc89033b2058a1e22de..58f200d25f01d52385626776b53c93f38e999397 100644 --- a/doc.zih.tu-dresden.de/docs/software/modules.md +++ b/doc.zih.tu-dresden.de/docs/software/modules.md @@ -1,199 +1,326 @@ # Modules -Usage of software on HPC systems is managed by a **modules system**. A module is a user interface -that provides utilities for the dynamic modification of a user's environment (e.g., *PATH*, -*LD_LIBRARY_PATH* etc.) to access the compilers, loader, libraries, and utilities. With the help -of modules, users can smoothly switch between different versions of installed software packages -and libraries. +Usage of software on HPC systems is managed by a **modules system**. 
-For all applications, tools, libraries etc. the correct environment can be easily set by the command +!!! note "Module" -``` -module load -``` + A module is a user interface that provides utilities for the dynamic modification of a user's + environment, e.g. prepending paths to: -e.g: `module load MATLAB`. If several versions are installed they can be chosen like: `module load -MATLAB/2019b`. + * `PATH` + * `LD_LIBRARY_PATH` + * `MANPATH` + * and more -A list of all modules shows by command + to help you to access compilers, loader, libraries and utilities. -``` -module available -#or -module avail -#or -ml av + By using modules, you can smoothly switch between different versions of + installed software packages and libraries. -``` +## Module Commands -Other important commands are: +Using modules is quite straightforward and the following table lists the basic commands. -```Bash -module help #show all module options -module list #list all user-installed modules -module purge #remove all user-installed modules -module spider #search for modules across all environments, can take a parameter -module load <modname> #load module modname -module rm <modname> #unload module modname -module switch <mod> <mod2> #unload module mod1; load module mod2 -``` +| Command | Description | +|:------------------------------|:-----------------------------------------------------------------| +| `module help` | Show all module options | +| `module list` | List active modules in the user environment | +| `module purge` | Remove modules from the user environment | +| `module avail [modname]` | List all available modules | +| `module spider [modname]` | Search for modules across all environments | +| `module load <modname>` | Load module `modname` in the user environment | +| `module unload <modname>` | Remove module `modname` from the user environment | +| `module switch <mod1> <mod2>` | Replace module `mod1` with module `mod2` | -Module files are ordered by their topic on Taurus. By default, with `module available` you will see -all available module files and topics. If you just wish to see the installed versions of a certain -module, you can use `module av <softwarename>` and all available versions of the exact software will -be displayed. +Module files are ordered by their topic on ZIH systems. By default, with `module avail` you will +see all topics and their available module files. If you just wish to see the installed versions of a +certain module, you can use `module avail softwarename` and it will display the available versions of +`softwarename` only. -## Module environments +### Examples -On Taurus, there exist different module environments, each containing a set of software modules. -They are activated via the meta module modenv which has different versions, one of which is loaded -by default. You can switch between them by simply loading the desired modenv-version, e.g.: +???+ example "Finding available software" + This examples illustrates the usage of the command `module avail` to search for available Matlab + installations. + + ```console + marie@compute$ module avail matlab + + ------------------------------ /sw/modules/scs5/math ------------------------------ + MATLAB/2017a MATLAB/2018b MATLAB/2020a + MATLAB/2018a MATLAB/2019b MATLAB/2021a (D) + + Wo: + D: Standard Modul. + + Verwenden Sie "module spider" um alle verfügbaren Module anzuzeigen. + Verwenden Sie "module keyword key1 key2 ...", um alle verfügbaren Module + anzuzeigen, die mindestens eines der Schlüsselworte enthält. 
+ ``` + +???+ example "Loading and removing modules" + + A particular module or several modules are loaded into your environment using the `module load` + command. The counter part to remove a module or several modules is `module unload`. + + ```console + marie@compute$ module load Python/3.8.6 + Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies loaded. + ``` + +???+ example "Removing all modules" + + To remove all loaded modules from your environment with one keystroke, invoke + + ```console + marie@compute$ module purge + Die folgenden Module wurden nicht entladen: + (Benutzen Sie "module --force purge" um alle Module zu entladen): + + 1) modenv/scs5 + Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies unloaded. + ``` + +### Front-End ml + +There is a front end for the module command, which helps you to type less. It is `ml`. + Any module command can be given after `ml`: + +| ml Command | module Command | +|:------------------|:------------------------------------------| +| `ml` | `module list` | +| `ml foo bar` | `module load foo bar` | +| `ml -foo -bar baz`| `module unload foo bar; module load baz` | +| `ml purge` | `module purge` | +| `ml show foo` | `module show foo` | + +???+ example "Usage of front-end ml" + + ```console + marie@compute$ ml +Python/3.8.6 + Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies loaded. + marie@compute$ ml + + Derzeit geladene Module: + 1) modenv/scs5 (S) 5) bzip2/1.0.8-GCCcore-10.2.0 9) SQLite/3.33.0-GCCcore-10.2.0 13) Python/3.8.6-GCCcore-10.2.0 + 2) GCCcore/10.2.0 6) ncurses/6.2-GCCcore-10.2.0 10) XZ/5.2.5-GCCcore-10.2.0 + 3) zlib/1.2.11-GCCcore-10.2.0 7) libreadline/8.0-GCCcore-10.2.0 11) GMP/6.2.0-GCCcore-10.2.0 + 4) binutils/2.35-GCCcore-10.2.0 8) Tcl/8.6.10-GCCcore-10.2.0 12) libffi/3.3-GCCcore-10.2.0 + + Wo: + S: Das Modul ist angeheftet. Verwenden Sie "--force", um das Modul zu entladen. + + marie@compute$ ml -Python/3.8.6 +ANSYS/2020R2 + Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies unloaded. + Module ANSYS/2020R2 loaded. + ``` + +## Module Environments + +On ZIH systems, there exist different **module environments**, each containing a set of software modules. +They are activated via the meta module `modenv` which has different versions, one of which is loaded +by default. You can switch between them by simply loading the desired modenv-version, e.g. + +```console +marie@compute$ module load modenv/ml ``` -module load modenv/ml -``` -| modenv/scs5 | SCS5 software | default | -| | | | -| modenv/ml | HPC-DA software (for use on the "ml" partition) | | -| modenv/hiera | WIP hierarchical module tree | | -| modenv/classic | Manually built pre-SCS5 (AE4.0) software | default | -| | | | - -The old modules (pre-SCS5) are still available after loading the corresponding `modenv` version -(classic), however, due to changes in the libraries of the operating system, it is not guaranteed -that they still work under SCS5. Please don't use modenv/classic if you do not absolutely have to. -Most software is available under modenv/scs5, too, just be aware of the possibly different spelling -(case-sensitivity). - -The command `module spider <modname>` allows searching for specific software in all modenv -environments. 
It will also display information on how to load a found module when giving a precise +### modenv/scs5 (default) + +* SCS5 software +* usually optimized for Intel processors (Partitions: `haswell`, `broadwell`, `gpu2`, `julia`) + +### modenv/ml + +* data analytics software (for use on the partition ml) +* necessary to run most software on the partition ml +(The instruction set [Power ISA](https://en.wikipedia.org/wiki/Power_ISA#Power_ISA_v.3.0) +is different from the usual x86 instruction set. +Thus the 'machine code' of other modenvs breaks). + +### modenv/hiera + +* uses a hierarchical module load scheme +* optimized software for AMD processors (Partitions: romeo, alpha) + +### modenv/classic + +* deprecated, old software. Is not being curated. +* may break due to library inconsistencies with the operating system. +* please don't use software from that modenv + +### Searching for Software + +The command `module spider <modname>` allows searching for a specific software across all modenv +environments. It will also display information on how to load a particular module when giving a precise module (with version) as the parameter. -## Per-architecture builds +??? example + + ```console + marie@login$ module spider p7zip + + ---------------------------------------------------------------------------------------------------------------------------------------------------------- + p7zip: + ---------------------------------------------------------------------------------------------------------------------------------------------------------- + Beschreibung: + p7zip is a quick port of 7z.exe and 7za.exe (command line version of 7zip) for Unix. 7-Zip is a file archiver with highest compression ratio. + + Versionen: + p7zip/9.38.1 + p7zip/17.03-GCCcore-10.2.0 + p7zip/17.03 + + ---------------------------------------------------------------------------------------------------------------------------------------------------------- + Um detaillierte Informationen über ein bestimmtes "p7zip"-Modul zu erhalten (auch wie das Modul zu laden ist), verwenden sie den vollständigen Namen des Moduls. + Zum Beispiel: + $ module spider p7zip/17.03 + ---------------------------------------------------------------------------------------------------------------------------------------------------------- + ``` + +## Per-Architecture Builds Since we have a heterogeneous cluster, we do individual builds of some of the software for each architecture present. This ensures that, no matter what partition the software runs on, a build -optimized for the host architecture is used automatically. This is achieved by having -'/sw/installed' symlinked to different directories on the compute nodes. +optimized for the host architecture is used automatically. +For that purpose we have created symbolic links on the compute nodes, +at the system path `/sw/installed`. However, not every module will be available for each node type or partition. Especially when introducing new hardware to the cluster, we do not want to rebuild all of the older module versions and in some cases cannot fall-back to a more generic build either. That's why we provide the script: `ml_arch_avail` that displays the availability of modules for the different node architectures. 
+### Example Invocation of ml_arch_avail + +```console +marie@compute$ ml_arch_avail TensorFlow/2.4.1 +TensorFlow/2.4.1: haswell, rome +TensorFlow/2.4.1: haswell, rome ``` -ml_arch_avail CP2K -Example output: +The command shows all modules that match on `TensorFlow/2.4.1`, and their respective availability. +Note that this will not work for meta-modules that do not have an installation directory +(like some tool chain modules). -#CP2K/6.1-foss-2019a: haswell, rome -#CP2K/5.1-intel-2018a: haswell -#CP2K/6.1-foss-2019a-spglib: haswell, rome -#CP2K/6.1-intel-2018a: haswell -#CP2K/6.1-intel-2018a-spglib: haswell -``` +## Advanced Usage -The command shows all modules that match on CP2K, and their respective availability. Note that this -will not work for meta-modules that do not have an installation directory (like some toolchain -modules). +For writing your own Modulefiles please have a look at the [Guide for writing project and private Modulefiles](private_modules.md). -## Project and User Private Modules +## Troubleshooting -Private module files allow you to load your own installed software packages into your environment -and to handle different versions without getting into conflicts. Private modules can be setup for a -single user as well as all users of project group. The workflow and settings for user private module -files is described in the following. The [settings for project private -modules](#project-private-modules) differ only in details. +### When I log in, the wrong modules are loaded by default -The command +Reset your currently loaded modules with `module purge` +(or `module purge --force` if you also want to unload your basic `modenv` module). +Then run `module save` to overwrite the +list of modules you load by default when logging in. -``` -module use <path_to_module_files> -``` +### I can't load module TensorFlow -adds directory by user choice to the list of module directories that are searched by the `module` -command. Within directory `../privatemodules` user can add directories for every software user wish -to install and add also in this directory a module file for every version user have installed. -Further information about modules can be found [here](http://modules.sourceforge.net/). +Check the dependencies by e.g. calling `module spider TensorFlow/2.4.1` +it will list a number of modules that need to be loaded +before the TensorFlow module can be loaded. -This is an example of work a private module file: +??? example "Loading the dependencies" -- create a directory in your home directory: + ```console + marie@compute$ module load TensorFlow/2.4.1 + Lmod hat den folgenden Fehler erkannt: Diese Module existieren, aber + können nicht wie gewünscht geladen werden: "TensorFlow/2.4.1" + Versuchen Sie: "module spider TensorFlow/2.4.1" um anzuzeigen, wie die Module + geladen werden. -``` -cd -mkdir privatemodules && cd privatemodules -mkdir testsoftware && cd testsoftware -``` -- add the directory in the list of module directories: + marie@compute$ module spider TensorFlow/2.4.1 -``` -module use $HOME/privatemodules -``` + ---------------------------------------------------------------------------------- + TensorFlow: TensorFlow/2.4.1 + ---------------------------------------------------------------------------------- + Beschreibung: + An open-source software library for Machine Intelligence -- create a file with the name `1.0` with a test software in the `testsoftware` directory (use e.g. 
-echo, emacs, etc): -``` -#%Module###################################################################### -## -## testsoftware modulefile -## -proc ModulesHelp { } { - puts stderr "Loads testsoftware" -} - -set version 1.0 -set arch x86_64 -set path /home/<user>/opt/testsoftware/$version/$arch/ - -prepend-path PATH $path/bin -prepend-path LD_LIBRARY_PATH $path/lib - -if [ module-info mode load ] { - puts stderr "Load testsoftware version $version" -} -``` + Sie müssen alle Module in einer der nachfolgenden Zeilen laden bevor Sie das Modul "TensorFlow/2.4.1" laden können. -- check the availability of the module with `ml av`, the output should look like this: + modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 + This extension is provided by the following modules. To access the extension you must load one of the following modules. Note that any module names in parentheses show the module location in the software hierarchy. -``` ---------------------- /home/masterman/privatemodules --------------------- - testsoftware/1.0 -``` -- load the test module with `module load testsoftware`, the output: + TensorFlow/2.4.1 (modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5) -``` -Load testsoftware version 1.0 -Module testsoftware/1.0 loaded. -``` -### Project Private Modules + This module provides the following extensions: -Private module files allow you to load project- or group-wide installed software into your -environment and to handle different versions without getting into conflicts. + absl-py/0.10.0 (E), astunparse/1.6.3 (E), cachetools/4.2.0 (E), dill/0.3.3 (E), gast/0.3.3 (E), google-auth-oauthlib/0.4.2 (E), google-auth/1.24.0 (E), google-pasta/0.2.0 (E), grpcio/1.32.0 (E), gviz-api/1.9.0 (E), h5py/2.10.0 (E), Keras-Preprocessing/1.1.2 (E), Markdown/3.3.3 (E), oauthlib/3.1.0 (E), opt-einsum/3.3.0 (E), portpicker/1.3.1 (E), pyasn1-modules/0.2.8 (E), requests-oauthlib/1.3.0 (E), rsa/4.7 (E), tblib/1.7.0 (E), tensorboard-plugin-profile/2.4.0 (E), tensorboard-plugin-wit/1.8.0 (E), tensorboard/2.4.1 (E), tensorflow-estimator/2.4.0 (E), TensorFlow/2.4.1 (E), termcolor/1.1.0 (E), Werkzeug/1.0.1 (E), wrapt/1.12.1 (E) -The module files have to be stored in your global projects directory -`/projects/p_projectname/privatemodules`. An example of a module file can be found in the section -above. To use a project-wide module file you have to add the path to the module file to the module -environment with the command + Help: + Description + =========== + An open-source software library for Machine Intelligence -``` -module use /projects/p_projectname/privatemodules -``` -After that, the modules are available in your module environment and you can load the modules with -the `module load` command. + More information + ================ + - Homepage: https://www.tensorflow.org/ + + + Included extensions + =================== + absl-py-0.10.0, astunparse-1.6.3, cachetools-4.2.0, dill-0.3.3, gast-0.3.3, + google-auth-1.24.0, google-auth-oauthlib-0.4.2, google-pasta-0.2.0, + grpcio-1.32.0, gviz-api-1.9.0, h5py-2.10.0, Keras-Preprocessing-1.1.2, + Markdown-3.3.3, oauthlib-3.1.0, opt-einsum-3.3.0, portpicker-1.3.1, + pyasn1-modules-0.2.8, requests-oauthlib-1.3.0, rsa-4.7, tblib-1.7.0, + tensorboard-2.4.1, tensorboard-plugin-profile-2.4.0, tensorboard-plugin- + wit-1.8.0, TensorFlow-2.4.1, tensorflow-estimator-2.4.0, termcolor-1.1.0, + Werkzeug-1.0.1, wrapt-1.12.1 + + + Names marked by a trailing (E) are extensions provided by another module. 
+ + + + marie@compute$ ml +modenv/hiera +GCC/10.2.0 +CUDA/11.1.1 +OpenMPI/4.0.5 +TensorFlow/2.4.1 + + Die folgenden Module wurden in einer anderen Version erneut geladen: + 1) GCC/7.3.0-2.30 => GCC/10.2.0 3) binutils/2.30-GCCcore-7.3.0 => binutils/2.35 + 2) GCCcore/7.3.0 => GCCcore/10.2.0 4) modenv/scs5 => modenv/hiera -## Using Private Modules and Programs in the $HOME Directory + Module GCCcore/7.3.0, binutils/2.30-GCCcore-7.3.0, GCC/7.3.0-2.30, GCC/7.3.0-2.30 and 3 dependencies unloaded. + Module GCCcore/7.3.0, GCC/7.3.0-2.30, GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, TensorFlow/2.4.1 and 50 dependencies loaded. + marie@compute$ module list -An automated backup system provides security for the HOME-directories on the cluster on a daily -basis. This is the reason why we urge users to store (large) temporary data (like checkpoint files) -on the /scratch -Filesystem or at local scratch disks. + Derzeit geladene Module: + 1) modenv/hiera (S) 28) Tcl/8.6.10 + 2) GCCcore/10.2.0 29) SQLite/3.33.0 + 3) zlib/1.2.11 30) GMP/6.2.0 + 4) binutils/2.35 31) libffi/3.3 + 5) GCC/10.2.0 32) Python/3.8.6 + 6) CUDAcore/11.1.1 33) pybind11/2.6.0 + 7) CUDA/11.1.1 34) SciPy-bundle/2020.11 + 8) numactl/2.0.13 35) Szip/2.1.1 + 9) XZ/5.2.5 36) HDF5/1.10.7 + 10) libxml2/2.9.10 37) cURL/7.72.0 + 11) libpciaccess/0.16 38) double-conversion/3.1.5 + 12) hwloc/2.2.0 39) flatbuffers/1.12.0 + 13) libevent/2.1.12 40) giflib/5.2.1 + 14) Check/0.15.2 41) ICU/67.1 + 15) GDRCopy/2.1-CUDA-11.1.1 42) JsonCpp/1.9.4 + 16) UCX/1.9.0-CUDA-11.1.1 43) NASM/2.15.05 + 17) libfabric/1.11.0 44) libjpeg-turbo/2.0.5 + 18) PMIx/3.1.5 45) LMDB/0.9.24 + 19) OpenMPI/4.0.5 46) nsync/1.24.0 + 20) OpenBLAS/0.3.12 47) PCRE/8.44 + 21) FFTW/3.3.8 48) protobuf/3.14.0 + 22) ScaLAPACK/2.1.0 49) protobuf-python/3.14.0 + 23) cuDNN/8.0.4.30-CUDA-11.1.1 50) flatbuffers-python/1.12 + 24) NCCL/2.8.3-CUDA-11.1.1 51) typing-extensions/3.7.4.3 + 25) bzip2/1.0.8 52) libpng/1.6.37 + 26) ncurses/6.2 53) snappy/1.1.8 + 27) libreadline/8.0 54) TensorFlow/2.4.1 -**Please note**: We have set `ulimit -c 0` as a default to prevent users from filling the disk with -the dump of a crashed program. bash -users can use `ulimit -Sc unlimited` to enable the debugging -via analyzing the core file (limit coredumpsize unlimited for tcsh). + Wo: + S: Das Modul ist angeheftet. Verwenden Sie "--force", um das Modul zu entladen. + ``` diff --git a/doc.zih.tu-dresden.de/docs/software/private_modules.md b/doc.zih.tu-dresden.de/docs/software/private_modules.md new file mode 100644 index 0000000000000000000000000000000000000000..4b79463f05988afd689b5fa18bddc758c16dfaa7 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/software/private_modules.md @@ -0,0 +1,105 @@ +# Project and User Private Modules + +Private module files allow you to load your own installed software packages into your environment +and to handle different versions without getting into conflicts. Private modules can be setup for a +single user as well as all users of project group. The workflow and settings for user private module +files is described in the following. The [settings for project private +modules](#project-private-modules) differ only in details. + +In order to use your own module files please use the command +`module use <path_to_module_files>`. It will add the path to the list of module directories +that are searched by lmod (i.e. the `module` command). You may use a directory `privatemodules` +within your home or project directory to setup your own module files. 
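+
+Note that `module use` only affects your current shell session. If you want the private module
+directory to be picked up automatically at every login, one common approach (a sketch, assuming
+`bash` as your login shell) is to append the command to your shell startup file:
+
+```console
+marie@login$ echo 'module use $HOME/privatemodules' >> ~/.bashrc
+```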
+ +Please see the [Environment Modules open source project's web page](http://modules.sourceforge.net/) +for further information on writing module files. + +## 1. Create Directories + +```console +marie@compute$ cd $HOME +marie@compute$ mkdir --verbose --parents privatemodules/testsoftware +marie@compute$ cd privatemodules/testsoftware +``` + +(create a directory in your home directory) + +## 2. Notify lmod + +```console +marie@compute$ module use $HOME/privatemodules +``` + +(add the directory in the list of module directories) + +## 3. Create Modulefile + +Create a file with the name `1.0` with a +test software in the `testsoftware` directory you created earlier +(using your favorite editor) and paste the following text into it: + +``` +#%Module###################################################################### +## +## testsoftware modulefile +## +proc ModulesHelp { } { + puts stderr "Loads testsoftware" +} + +set version 1.0 +set arch x86_64 +set path /home/<user>/opt/testsoftware/$version/$arch/ + +prepend-path PATH $path/bin +prepend-path LD_LIBRARY_PATH $path/lib + +if [ module-info mode load ] { + puts stderr "Load testsoftware version $version" +} +``` + +## 4. Check lmod + +Check the availability of the module with `ml av`, the output should look like this: + +``` +--------------------- /home/masterman/privatemodules --------------------- + testsoftware/1.0 +``` + +## 5. Load Module + +Load the test module with `module load testsoftware`, the output should look like this: + +```console +Load testsoftware version 1.0 +Module testsoftware/1.0 loaded. +``` + +## Project Private Modules + +Private module files allow you to load project- or group-wide installed software into your +environment and to handle different versions without getting into conflicts. + +The module files have to be stored in your global projects directory +`/projects/p_projectname/privatemodules`. An example of a module file can be found in the section +above. To use a project-wide module file you have to add the path to the module file to the module +environment with the command + +```console +marie@compute$ module use /projects/p_projectname/privatemodules +``` + +After that, the modules are available in your module environment and you can load the modules with +the `module load` command. + +## Using Private Modules and Programs in the $HOME Directory + +An automated backup system provides security for the HOME-directories on the cluster on a daily +basis. This is the reason why we urge users to store (large) temporary data (like checkpoint files) +on the /scratch filesystem or at local scratch disks. + +**Please note**: We have set `ulimit -c 0` as a default to prevent users from filling the disk with +the dump of crashed programs. `bash` users can use `ulimit -Sc unlimited` to enable the debugging +via analyzing the core file. diff --git a/doc.zih.tu-dresden.de/docs/software/pytorch.md b/doc.zih.tu-dresden.de/docs/software/pytorch.md index 3c2e88a6c9fc209c246ede0e50410771be541c3f..e84f3aac54a88e0984b0da17e3e3527fe37e7b46 100644 --- a/doc.zih.tu-dresden.de/docs/software/pytorch.md +++ b/doc.zih.tu-dresden.de/docs/software/pytorch.md @@ -1,11 +1,11 @@ # PyTorch -[PyTorch](https://pytorch.org/){:target="_blank"} is an open-source machine learning framework. +[PyTorch](https://pytorch.org/) is an open-source machine learning framework. It is an optimized tensor library for deep learning using GPUs and CPUs. 
-PyTorch is a machine learning tool developed by Facebooks AI division to process large-scale
+PyTorch is a machine learning tool developed by Facebook's AI division to process large-scale
 object detection, segmentation, classification, etc. PyTorch provides a core data structure, the
 tensor, a multi-dimensional array that shares many
-similarities with Numpy arrays.
+similarities with NumPy arrays.
 
 Please check the software modules list via
@@ -13,9 +13,9 @@ Please check the software modules list via
 marie@login$ module spider pytorch
 ```
 
-to find out, which PyTorch modules are available on your partition.
+to find out which PyTorch modules are available.
 
-We recommend using partitions alpha and/or ml when working with machine learning workflows
+We recommend using partitions `alpha` and/or `ml` when working with machine learning workflows
 and the PyTorch library.
 You can find detailed hardware specification in our
 [hardware documentation](../jobs_and_resources/hardware_overview.md).
@@ -25,7 +25,8 @@ You can find detailed hardware specification in our
 On the partition `alpha`, load the module environment:
 
 ```console
-marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission on alpha nodes with 1 gpu on 1 node with 800 Mb per CPU
+# Job submission on alpha nodes with 1 GPU on 1 node with 800 MB per CPU
+marie@login$ srun -p alpha --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash
 marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0
 Die folgenden Module wurden in einer anderen Version erneut geladen:
   1) modenv/scs5 => modenv/hiera
@@ -34,6 +35,7 @@ Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies
 ```
 
 ??? hint "Torchvision on partition `alpha`"
+
     On the partition `alpha`, the module torchvision is not yet available within the module
     system. (19.08.2021)
     Torchvision can be made available by using a virtual environment:
@@ -45,12 +47,13 @@ Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies
     ```
 
     Using the **--no-deps** option for "pip install" is necessary here as otherwise the PyTorch
-    version might be replaced and you will run into trouble with the cuda drivers.
+    version might be replaced and you will run into trouble with the CUDA drivers.
 
 On the partition `ml`:
 
 ```console
-marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash #Job submission in ml nodes with 1 gpu on 1 node with 800 Mb per CPU
+# Job submission on ml nodes with 1 GPU on 1 node with 800 MB per CPU
+marie@login$ srun -p ml --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash
 ```
 
 After calling
@@ -62,8 +65,8 @@ marie@login$ module spider pytorch
 
 we know that we can load PyTorch (including torchvision) with
 
 ```console
 marie@ml$ module load modenv/ml torchvision/0.7.0-fosscuda-2019b-Python-3.7.4-PyTorch-1.6.0
 Module torchvision/0.7.0-fosscuda-2019b-Python-3.7.4-PyTorch-1.6.0 and 55 dependencies loaded.
 ```
 
 Now, we check that we can access PyTorch:
 
@@ -75,19 +78,24 @@ marie@{ml,alpha}$ python -c "import torch; print(torch.__version__)"
 
 The following example shows how to create a python virtual environment and import PyTorch.
 ```console
-marie@ml$ mkdir python-environments #create folder
-marie@ml$ which python #check which python are you using
+# Create folder
+marie@ml$ mkdir python-environments
+# Check which Python you are using
+marie@ml$ which python
 /sw/installed/Python/3.7.4-GCCcore-8.3.0/bin/python
-marie@ml$ virtualenv --system-site-packages python-environments/env #create virtual environment "env" which inheriting with global site packages
+# Create virtual environment "env" which inherits the global site-packages
+marie@ml$ virtualenv --system-site-packages python-environments/env
 [...]
-marie@ml$ source python-environments/env/bin/activate #activate virtual environment "env". Example output: (env) bash-4.2$
+# Activate virtual environment "env". Example output: (env) bash-4.2$
+marie@ml$ source python-environments/env/bin/activate
 marie@ml$ python -c "import torch; print(torch.__version__)"
 ```
 
 ## PyTorch in JupyterHub
 
-In addition to using interactive and batch jobs, it is possible to work with PyTorch using JupyterHub.
-The production and test environments of JupyterHub contain Python kernels, that come with a PyTorch support.
+In addition to using interactive and batch jobs, it is possible to work with PyTorch using
+JupyterHub. The production and test environments of JupyterHub contain Python kernels that come
+with PyTorch support.
 
 ![PyTorch module](misc/Pytorch_jupyter_module.png)
 {: align="center"}
 
@@ -96,3 +104,62 @@ The production and test environments of JupyterHub contain Python kernels, that
 
 For details on how to run PyTorch with multiple GPUs and/or multiple nodes, see
 [distributed training](distributed_training.md).
+
+## Migrate PyTorch Script from CPU to GPU
+
+It is recommended to use GPUs when using large training data sets. While TensorFlow automatically
+uses GPUs if they are available, in PyTorch you have to move your tensors manually.
+
+First, you need to import `torch`, which also provides the `torch.cuda` submodule:
+
+```python3
+import torch
+```
+
+Then you define a `device` variable, which is set to 'cuda' automatically when a GPU is available,
+with this code:
+
+```python3
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+```
+
+You then have to move all of your tensors to the selected device. This looks like this:
+
+```python3
+x_train = torch.FloatTensor(x_train).to(device)
+y_train = torch.FloatTensor(y_train).to(device)
+```
+
+Remember that this does not break backward compatibility when you port the script back to a computer
+without a GPU, because without a GPU, `device` is set to 'cpu'.
+
+### Caveats
+
+#### Moving Data Back to the CPU-Memory
+
+The CPU cannot directly access variables stored on the GPU. If you want to use the variables, e.g.,
+in a `print` statement or when editing with NumPy or anything that is not PyTorch, you have to move
+them back to the CPU-memory again. This may look like this:
+
+```python3
+cpu_x_train = x_train.cpu()
+print(cpu_x_train)
+...
+error_train = np.sqrt(metrics.mean_squared_error(y_train[:,1].cpu(), y_prediction_train[:,1]))
+```
+
+Remember that, without `.detach()` before `.cpu()`, if you change `cpu_x_train`, `x_train` will also
+be changed. If you want to treat them independently, use
+
+```python3
+cpu_x_train = x_train.detach().cpu()
+```
+
+Now you can change `cpu_x_train` without `x_train` being affected.
+
+#### Speed Improvements and Batch Size
+
+When you have a lot of very small data points, the speed may actually decrease when you try to train
+them on the GPU. This is because moving data from the CPU-memory to the GPU-memory takes time. 
If +this occurs, please try using a very large batch size. This way, copying back and forth only takes +places a few times and the bottleneck may be reduced. diff --git a/doc.zih.tu-dresden.de/docs/software/runtime_environment.md b/doc.zih.tu-dresden.de/docs/software/runtime_environment.md deleted file mode 100644 index 1bca8daa7cfa08f3b58b19e5608c2e333b9055f9..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/runtime_environment.md +++ /dev/null @@ -1,181 +0,0 @@ -# Runtime Environment - -Make sure you know how to work with a Linux system. Documentations and tutorials can be easily -found on the internet or in your library. - -## Modules - -To allow the user to switch between different versions of installed programs and libraries we use a -*module concept*. A module is a user interface that provides utilities for the dynamic modification -of a user's environment, i.e., users do not have to manually modify their environment variables ( -`PATH` , `LD_LIBRARY_PATH`, ...) to access the compilers, loader, libraries, and utilities. - -For all applications, tools, libraries etc. the correct environment can be easily set by e.g. -`module load Mathematica`. If several versions are installed they can be chosen like `module load -MATLAB/2019b`. A list of all modules shows `module avail`. Other important commands are: - -| Command | Description | -|:------------------------------|:-----------------------------------------------------------------| -| `module help` | show all module options | -| `module list` | list all user-installed modules | -| `module purge` | remove all user-installed modules | -| `module avail` | list all available modules | -| `module spider` | search for modules across all environments, can take a parameter | -| `module load <modname>` | load module `modname` | -| `module unload <modname>` | unloads module `modname` | -| `module switch <mod1> <mod2>` | unload module `mod1` ; load module `mod2` | - -Module files are ordered by their topic on our HPC systems. By default, with `module av` you will -see all available module files and topics. If you just wish to see the installed versions of a -certain module, you can use `module av softwarename` and it will display the available versions of -`softwarename` only. - -### Lmod: An Alternative Module Implementation - -Historically, the module command on our HPC systems has been provided by the rather dated -*Environment Modules* software which was first introduced in 1991. As of late 2016, we also offer -the new and improved [LMOD](https://www.tacc.utexas.edu/research-development/tacc-projects/lmod) as -an alternative. It has a handful of advantages over the old Modules implementation: - -- all modulefiles are cached, which especially speeds up tab - completion with bash -- sane version ordering (9.0 \< 10.0) -- advanced version requirement functions (atleast, between, latest) -- auto-swapping of modules (if a different version was already loaded) -- save/auto-restore of loaded module sets (module save) -- multiple language support -- properties, hooks, ... -- depends_on() function for automatic dependency resolution with - reference counting - -### Module Environments - -On Taurus, there exist different module environments, each containing a set of software modules. -They are activated via the meta module **modenv** which has different versions, one of which is -loaded by default. 
You can switch between them by simply loading the desired modenv-version, e.g.: - -```Bash -module load modenv/ml -``` - -| | | | -|--------------|------------------------------------------------------------------------|---------| -| modenv/scs5 | SCS5 software | default | -| modenv/ml | HPC-DA software (for use on the "ml" partition) | | -| modenv/hiera | Hierarchical module tree (for use on the "romeo" and "gpu3" partition) | | - -The old modules (pre-SCS5) are still available after loading **modenv**/**classic**, however, due to -changes in the libraries of the operating system, it is not guaranteed that they still work under -SCS5. Please don't use modenv/classic if you do not absolutely have to. Most software is available -under modenv/scs5, too, just be aware of the possibly different spelling (case-sensitivity). - -You can use `module spider \<modname>` to search for a specific -software in all modenv environments. It will also display information on -how to load a found module when giving a precise module (with version) -as the parameter. - -Also see the information under [SCS5 software](../software/scs5_software.md). - -### Per-Architecture Builds - -Since we have a heterogenous cluster, we do individual builds of some of the software for each -architecture present. This ensures that, no matter what partition the software runs on, a build -optimized for the host architecture is used automatically. This is achieved by having -`/sw/installed` symlinked to different directories on the compute nodes. - -However, not every module will be available for each node type or partition. Especially when -introducing new hardware to the cluster, we do not want to rebuild all of the older module versions -and in some cases cannot fall-back to a more generic build either. That's why we provide the script: -`ml_arch_avail` that displays the availability of modules for the different node architectures. - -E.g.: - -```Bash -$ ml_arch_avail CP2K -CP2K/6.1-foss-2019a: haswell, rome -CP2K/5.1-intel-2018a: sandy, haswell -CP2K/6.1-foss-2019a-spglib: haswell, rome -CP2K/6.1-intel-2018a: sandy, haswell -CP2K/6.1-intel-2018a-spglib: haswell -``` - -shows all modules that match on CP2K, and their respective availability. Note that this will not -work for meta-modules that do not have an installation directory (like some toolchain modules). - -### Private User Module Files - -Private module files allow you to load your own installed software into your environment and to -handle different versions without getting into conflicts. - -You only have to call `module use <path to your module files>`, which adds your directory to the -list of module directories that are searched by the `module` command. Within the privatemodules -directory you can add directories for each software you wish to install and add - also in this -directory - a module file for each version you have installed. Further information about modules can -be found at <https://lmod.readthedocs.io> . 
- -**todo** quite old - -This is an example of a private module file: - -```Bash -dolescha@venus:~/module use $HOME/privatemodules - -dolescha@venus:~/privatemodules> ls -null testsoftware - -dolescha@venus:~/privatemodules/testsoftware> ls -1.0 - -dolescha@venus:~> module av -------------------------------- /work/home0/dolescha/privatemodules --------------------------- -null testsoftware/1.0 - -dolescha@venus:~> module load testsoftware -Load testsoftware version 1.0 - -dolescha@venus:~/privatemodules/testsoftware> cat 1.0 -#%Module###################################################################### -## -## testsoftware modulefile -## -proc ModulesHelp { } { - puts stderr "Loads testsoftware" -} - -set version 1.0 -set arch x86_64 -set path /home/<user>/opt/testsoftware/$version/$arch/ - -prepend-path PATH $path/bin -prepend-path LD_LIBRARY_PATH $path/lib - -if [ module-info mode load ] { - puts stderr "Load testsoftware version $version" -} -``` - -### Private Project Module Files - -Private module files allow you to load your group-wide installed software into your environment and -to handle different versions without getting into conflicts. - -The module files have to be stored in your global projects directory, e.g. -`/projects/p_projectname/privatemodules`. An example for a module file can be found in the section -above. - -To use a project-wide module file you have to add the path to the module file to the module -environment with following command `module use /projects/p_projectname/privatemodules`. - -After that, the modules are available in your module environment and you -can load the modules with `module load` . - -## Misc - -An automated [backup](../data_lifecycle/file_systems.md#backup-and-snapshots-of-the-file-system) -system provides security for the HOME-directories on `Taurus` and `Venus` on a daily basis. This is -the reason why we urge our users to store (large) temporary data (like checkpoint files) on the -/scratch -Filesystem or at local scratch disks. - -`Please note`: We have set `ulimit -c 0` as a default to prevent you from filling the disk with the -dump of a crashed program. `bash` -users can use `ulimit -Sc unlimited` to enable the debugging via -analyzing the core file (limit coredumpsize unlimited for tcsh). diff --git a/doc.zih.tu-dresden.de/docs/software/scorep.md b/doc.zih.tu-dresden.de/docs/software/scorep.md index eeea99ad110477282ec3897d69d65e800885cda8..0e2dc6c2358c95f47373a2f046f3fe4d643ae643 100644 --- a/doc.zih.tu-dresden.de/docs/software/scorep.md +++ b/doc.zih.tu-dresden.de/docs/software/scorep.md @@ -144,7 +144,7 @@ After the application run, you will find an experiment directory in your current which contains all recorded data. In general, you can record a profile and/or a event trace. Whether a profile and/or a trace is recorded, is specified by the environment variables `SCOREP_ENABLE_PROFILING` and `SCOREP_ENABLE_TRACING` (see -[documentation](https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/latest/html/measurement.html)). +[official Score-P documentation](https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/latest/html/measurement.html)). If the value of this variables is zero or false, profiling/tracing is disabled. Otherwise Score-P will record a profile and/or trace. By default, profiling is enabled and tracing is disabled. 
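+
+For example, to record both a profile and an event trace in your next measurement run, you could
+export the two variables before starting the instrumented application (a minimal sketch, assuming
+a `bash`-like shell):
+
+```console
+marie@login$ export SCOREP_ENABLE_PROFILING=true
+marie@login$ export SCOREP_ENABLE_TRACING=true
+```
+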
For more information please see the list of Score-P measurement diff --git a/doc.zih.tu-dresden.de/docs/software/scs5_software.md b/doc.zih.tu-dresden.de/docs/software/scs5_software.md index f1606236729c0354e5129b71c5c93c14325cb097..b5a1bef60d20cdc9989c8db82f766d31a96d3cdc 100644 --- a/doc.zih.tu-dresden.de/docs/software/scs5_software.md +++ b/doc.zih.tu-dresden.de/docs/software/scs5_software.md @@ -21,7 +21,7 @@ remove it and accept the new one after comparing its fingerprint with those list ## Using Software Modules Starting with SCS5, we only provide -[Lmod](../software/runtime_environment.md#lmod-an-alternative-module-implementation) as the +[Lmod](../software/modules.md#lmod-an-alternative-module-implementation) as the environment module tool of choice. As usual, you can get a list of the available software modules via: @@ -38,7 +38,7 @@ There is a special module that is always loaded (sticky) called | | | | |----------------|-------------------------------------------------|---------| | modenv/scs5 | SCS5 software | default | -| modenv/ml | HPC-DA software (for use on the "ml" partition) | | +| modenv/ml | software for data analytics (partition ml) | | | modenv/classic | Manually built pre-SCS5 (AE4.0) software | hidden | The old modules (pre-SCS5) are still available after loading the diff --git a/doc.zih.tu-dresden.de/docs/software/vampir.md b/doc.zih.tu-dresden.de/docs/software/vampir.md index 465d28925302091bf0e0d66156753452c3608912..24a22c35acda9afcfa6e1e56bdd553da716ec245 100644 --- a/doc.zih.tu-dresden.de/docs/software/vampir.md +++ b/doc.zih.tu-dresden.de/docs/software/vampir.md @@ -146,8 +146,8 @@ marie@local$ ssh -L 30000:taurusi1253:30055 taurus.hrsk.tu-dresden.de ``` Now, the port 30000 on your desktop is connected to the VampirServer port 30055 at the compute node -taurusi1253 of Taurus. Finally, start your local Vampir client and establish a remote connection to -`localhost`, port 30000 as described in the manual. +taurusi1253 of the ZIH system. Finally, start your local Vampir client and establish a remote +connection to `localhost`, port 30000 as described in the manual. 
```console marie@local$ vampir diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index 45778f44b6ee876cfaa701d8438282215b10d103..79057c1d6770f69e13f6df3bdbcff4a3693851ad 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -24,7 +24,7 @@ nav: - Overview: software/overview.md - Environment: - Modules: software/modules.md - - Runtime Environment: software/runtime_environment.md + - Private Modulefiles: software/private_modules.md - Custom EasyBuild Modules: software/custom_easy_build_environment.md - Python Virtual Environments: software/python_virtual_environments.md - Containers: @@ -65,6 +65,7 @@ nav: - Debugging: software/debuggers.md - MPI Error Detection: software/mpi_usage_error_detection.md - Score-P: software/scorep.md + - lo2s: software/lo2s.md - PAPI Library: software/papi.md - Pika: software/pika.md - Perf Tools: software/perf_tools.md @@ -111,6 +112,7 @@ nav: - Phase2 Migration: archive/phase2_migration.md - Platform LSF: archive/platform_lsf.md - BeeGFS on Demand: archive/beegfs_on_demand.md + - Install JupyterHub: archive/install_jupyter.md - Switched-Off Systems: - Overview: archive/systems_switched_off.md - From Deimos to Atlas: archive/migrate_to_atlas.md @@ -128,6 +130,7 @@ nav: - Contribute: - How-To: contrib/howto_contribute.md - Content Rules: contrib/content_rules.md + - Browser-based Editing: contrib/contribute_browser.md - Work Locally Using Containers: contrib/contribute_container.md # Project Information diff --git a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css index a3a992501bff7f7b153a1beb0779e7f3e576f9e6..5505ff954a79532a27f55f2b0ad0d82eecd095de 100644 --- a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css +++ b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css @@ -137,23 +137,6 @@ strong { width: 125px; } -.md-header__button.md-icon { - display: flex; - justify-content: center; - align-items: center; -} - -@media screen and (min-width: 76.25rem) { - .md-header__button.md-icon { - display: none; - } -} - -@media screen and (min-width: 60rem) { - .md-header__button.md-icon { - display: none; - } -} /* toc */ /* operation-status */ .operation-status-logo { diff --git a/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh b/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh similarity index 55% rename from doc.zih.tu-dresden.de/util/grep-forbidden-words.sh rename to doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh index cfb2b91b57457b701c5b80e76c6346d460cf4602..280e4003dc951164c86b44560d6c81e3a5dc640c 100755 --- a/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh +++ b/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh @@ -12,22 +12,34 @@ basedir=`dirname "$basedir"` #Further fields represent patterns with exceptions. #For example, the first rule says: # The pattern \<io\> should not be present in any file (case-insensitive match), except when it appears as ".io". -ruleset="i \<io\> \.io +ruleset="The word \"IO\" should not be used, use \"I/O\" instead. +i \<io\> \.io +\"SLURM\" (only capital letters) should not be used, use \"Slurm\" instead. s \<SLURM\> +\"File system\" should be written as \"filesystem\", except when used as part of a proper name. i file \+system HDFS -i \<taurus\> taurus\.hrsk /taurus /TAURUS +Use \"ZIH systems\" or \"ZIH system\" instead of \"Taurus\". \"taurus\" is only allowed when used in ssh commands and other very specific situations. 
+i \<taurus\> taurus\.hrsk /taurus /TAURUS ssh ^[0-9]\+:Host taurus$ +\"HRSKII\" should be avoided, use \"ZIH system\" instead. i \<hrskii\> +The term \"HPC-DA\" should be avoided. Depending on the situation, use \"data analytics\" or similar. i hpc[ -]\+da\> +\"ATTACHURL\" was a keyword in the old wiki, don't use it. i attachurl +Replace \"todo\" with real content. i \<todo\> <!--.*todo.*--> +Avoid spaces at end of lines. i [[:space:]]$ +When referencing partitions, put keyword \"partition\" in front of partition name, e. g. \"partition ml\", not \"ml partition\". i \(alpha\|ml\|haswell\|romeo\|gpu\|smp\|julia\|hpdlf\|scs5\)-\?\(interactive\)\?[^a-z]*partition +Give hints in the link text. Words such as \"here\" or \"this link\" are meaningless. i \[\s\?\(documentation\|here\|this \(link\|page\|subsection\)\|slides\?\|manpage\)\s\?\] +Use \"workspace\" instead of \"work space\" or \"work-space\". i work[ -]\+space" # Whitelisted files will be ignored # Whitespace separated list with full path -whitelist=(doc.zih.tu-dresden.de/README.md doc.zih.tu-dresden.de/docs/contrib/content_rules.md doc.zih.tu-dresden.de/docs/access/ssh_login.md) +whitelist=(doc.zih.tu-dresden.de/README.md doc.zih.tu-dresden.de/docs/contrib/content_rules.md) function grepExceptions () { if [ $# -gt 0 ]; then @@ -39,6 +51,32 @@ function grepExceptions () { fi } +function checkFile(){ + f=$1 + echo "Check wording in file $f" + while read message; do + IFS=$'\t' read -r flags pattern exceptionPatterns + while IFS=$'\t' read -r -a exceptionPatternsArray; do + if [ $silent = false ]; then + echo " Pattern: $pattern" + fi + grepflag= + case "$flags" in + "i") + grepflag=-i + ;; + esac + if grep -n $grepflag $color "$pattern" "$f" | grepExceptions "${exceptionPatternsArray[@]}" ; then + number_of_matches=`grep -n $grepflag $color "$pattern" "$f" | grepExceptions "${exceptionPatternsArray[@]}" | wc -l` + ((cnt=cnt+$number_of_matches)) + if [ $silent = false ]; then + echo " $message" + fi + fi + done <<< $exceptionPatterns + done <<< $ruleset +} + function usage () { echo "$0 [options]" echo "Search forbidden patterns in markdown files." @@ -95,31 +133,19 @@ fi echo "... $files ..." cnt=0 -for f in $files; do - if [ "${f: -3}" == ".md" -a -f "$f" ]; then - if (printf '%s\n' "${whitelist[@]}" | grep -xq $f); then - echo "Skip whitelisted file $f" - continue +if [[ ! -z $file ]]; then + checkFile $file +else + for f in $files; do + if [ "${f: -3}" == ".md" -a -f "$f" ]; then + if (printf '%s\n' "${whitelist[@]}" | grep -xq $f); then + echo "Skip whitelisted file $f" + continue + fi + checkFile $f fi - echo "Check wording in file $f" - while IFS=$'\t' read -r flags pattern exceptionPatterns; do - while IFS=$'\t' read -r -a exceptionPatternsArray; do - if [ $silent = false ]; then - echo " Pattern: $pattern" - fi - grepflag= - case "$flags" in - "i") - grepflag=-i - ;; - esac - if grep -n $grepflag $color "$pattern" "$f" | grepExceptions "${exceptionPatternsArray[@]}" ; then - ((cnt=cnt+1)) - fi - done <<< $exceptionPatterns - done <<< $ruleset - fi -done + done +fi echo "" case $cnt in diff --git a/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.testdoc b/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.testdoc new file mode 100644 index 0000000000000000000000000000000000000000..2b674702cd81304662b439a61d2fe15246ef8215 --- /dev/null +++ b/doc.zih.tu-dresden.de/util/grep-forbidden-patterns.testdoc @@ -0,0 +1,46 @@ +# Diese Datei versucht alles falsch zu machen, worauf grep-forbidden-words.sh checkt. 
+ +`i \[\s\?\(documentation\|here\|this \(link\|page\|subsection\)\|slides\?\|manpage\)\s\?\]` + +Man kann Workspace schreiben oder aber auch +work-Space, beides sollte auffallen. + +Die ML-Partition, +die Alpha-Partition, +die Haswell-Partition, +die Romeo-Partition, +die GPU-Partition, +die SMP-Partition, +die Julia-Partition, +die HPDLF-Partition, +die scs5-Partition (was ist das überhaupt?), +alle gibt es auch in interaktiv: +Die ML-interactive partition, +die Alpha-interactive partition, +die Haswell-interactive Partition, +die Romeo-interactive partition, +die GPU-interactive partition, +die SMP-interactive partition, +die Julia-interactive partition, +die HPDLF-interactive partition, +die scs5-interactive partition (was ist das überhaupt?), +alle diese Partitionen existieren, aber man darf sie nicht benennen. +``` +Denn sonst kommt das Leerzeichenmonster und packt Leerzeichen ans Ende der Zeile. +``` + +TODO: io sollte mit SLURM laufen. + +Das HDFS ist ein sehr gutes +file system auf taurus. + +Taurus ist erreichbar per +taurus.hrsk oder per +/taurus oder per +/TAURUS + +Was ist hrskii? Keine Ahnung! + +Was ist HPC-DA? Ist es ein attachurl? See (this page). +Or (here). +Or (manpage). diff --git a/doc.zih.tu-dresden.de/util/pre-commit b/doc.zih.tu-dresden.de/util/pre-commit index b86b75d9a07870a68118aa500ee80781e216c56b..eb63bbea24052eb1dff4ec16a17b8b5aba275e18 100755 --- a/doc.zih.tu-dresden.de/util/pre-commit +++ b/doc.zih.tu-dresden.de/util/pre-commit @@ -69,7 +69,7 @@ then fi echo "Forbidden words checking..." -docker run --name=hpc-compendium --rm -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/grep-forbidden-words.sh +docker run --name=hpc-compendium --rm -w /docs --mount src="$(pwd)",target=/docs,type=bind hpc-compendium ./doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh if [ $? 
-ne 0 ] then exit_ok=no diff --git a/doc.zih.tu-dresden.de/util/test-grep-forbidden-patterns.sh b/doc.zih.tu-dresden.de/util/test-grep-forbidden-patterns.sh new file mode 100755 index 0000000000000000000000000000000000000000..1e98caf528d9b1a1d640e9dd3e5c7dc23ec937ea --- /dev/null +++ b/doc.zih.tu-dresden.de/util/test-grep-forbidden-patterns.sh @@ -0,0 +1,13 @@ +#!/bin/bash + +expected_match_count=32 + +number_of_matches=$(bash ./doc.zih.tu-dresden.de/util/grep-forbidden-patterns.sh -f doc.zih.tu-dresden.de/util/grep-forbidden-patterns.testdoc -c -c | grep "Forbidden Patterns:" | sed -e 's/.*: //' | sed -e 's/ matches.*//') + +if [ $number_of_matches -eq $expected_match_count ]; then + echo "Test OK" + exit 0 +else + echo "Test failed: $expected_match_count matches expected, but only $number_of_matches found" + exit 1 +fi diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell index 2cc8f1197350bfb759fc76721c1ab63055b3bcf1..c4133b92a64287d657209a0b05ecb8789984b87e 100644 --- a/doc.zih.tu-dresden.de/wordlist.aspell +++ b/doc.zih.tu-dresden.de/wordlist.aspell @@ -1,4 +1,4 @@ -personal_ws-1.1 en 203 +personal_ws-1.1 en 203 Abaqus ALLREDUCE Altix @@ -27,6 +27,7 @@ checkpointing Chemnitz citable CLI +CMake COMSOL conda config @@ -54,7 +55,6 @@ ddl DDP DDR DFG -dir distr DistributedDataParallel DMTCP @@ -116,7 +116,9 @@ Horovod horovodrun hostname Hostnames +hpc HPC +hpcsupport HPE HPL html @@ -134,6 +136,7 @@ init inode IOPS IPs +ISA Itanium jobqueue jpg @@ -149,8 +152,9 @@ LAPACK lapply Leichtbau LINPACK -linter Linter +linter +lmod LoadLeveler localhost lsf @@ -163,12 +167,15 @@ matlab MEGWARE mem MiB +Microarchitecture MIMD Miniconda mkdocs MKL MNIST modenv +modenvs +modulefile Montecito mountpoint mpi @@ -184,7 +191,6 @@ multiphysics Multiphysics multithreaded Multithreading -MultiThreading NAMD natively nbsp @@ -213,18 +219,20 @@ OpenBLAS OpenCL OpenGL OpenMP -openmpi OpenMPI +openmpi OpenSSH Opteron +OTF overfitting -pandarallel Pandarallel +pandarallel PAPI parallelization parallelize parfor pdf +perf Perf performant PESSL @@ -236,8 +244,8 @@ PMI png PowerAI ppc -pre Pre +pre Preload preloaded preloading @@ -267,8 +275,9 @@ RSS RStudio Rsync runnable -runtime Runtime +runtime +sacct salloc Sandybridge Saxonid @@ -315,12 +324,15 @@ TensorFlow TFLOPS Theano tmp -todo ToDo +todo toolchain toolchains +torchvision +Torchvision tracefile tracefiles +tracepoints transferability Trition undistinguishable @@ -350,6 +362,6 @@ XLC XLF Xming yaml -zih ZIH +zih ZIH's
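
To try the new wording checks locally, the added wrapper script can be run from the root of the
repository clone (a usage sketch; it assumes `bash` and the renamed `grep-forbidden-patterns.sh`
introduced above are in place):

```console
marie@local$ bash doc.zih.tu-dresden.de/util/test-grep-forbidden-patterns.sh
```

It compares the number of matches found in `grep-forbidden-patterns.testdoc` with the expected
count and reports either `Test OK` or a short failure message.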