diff --git a/doc.zih.tu-dresden.de/docs/access/jupyterhub.md b/doc.zih.tu-dresden.de/docs/access/jupyterhub.md index d3cdc8f582c663a2b5d27dcd4f59a6c2e7dc659b..dcdd9363c8d406d7227b97abce91ad67298e9a67 100644 --- a/doc.zih.tu-dresden.de/docs/access/jupyterhub.md +++ b/doc.zih.tu-dresden.de/docs/access/jupyterhub.md @@ -137,8 +137,8 @@ This message appears instantly if your batch system parameters are not valid. Please check those settings against the available hardware. Useful pages for valid batch system parameters: -- [Slurm batch system (Taurus)](../jobs_and_resources/system_taurus.md#batch-system) - [General information how to use Slurm](../jobs_and_resources/slurm.md) +- [Partitions and limits](../jobs_and_resources/partitions_and_limits.md) ### Error Message in JupyterLab diff --git a/doc.zih.tu-dresden.de/docs/access/security_restrictions.md b/doc.zih.tu-dresden.de/docs/access/security_restrictions.md index 25f6270410c4e35cee150019298fac6dd33cd01e..ae93ca28662cfc8e0fe9d3f76e2819690af53276 100644 --- a/doc.zih.tu-dresden.de/docs/access/security_restrictions.md +++ b/doc.zih.tu-dresden.de/docs/access/security_restrictions.md @@ -1,27 +1,27 @@ -# Security Restrictions on Taurus +# Security Restrictions -As a result of the security incident the German HPC sites in Gau Alliance are now adjusting their -measurements to prevent infection and spreading of the malware. +As a result of a security incident the German HPC sites in Gauß Alliance have adjusted their +measurements to prevent infection and spreading of malware. -The most important items for HPC systems at ZIH are: +The most important items for ZIH systems are: -- All users (who haven't done so recently) have to +* All users (who haven't done so recently) have to [change their ZIH password](https://selfservice.zih.tu-dresden.de/l/index.php/pswd/change_zih_password). - **Login to Taurus is denied with an old password.** -- All old (private and public) keys have been moved away. 
-- All public ssh keys for Taurus have to - - be re-generated using only the ED25519 algorithm (`ssh-keygen -t ed25519`) - - **passphrase for the private key must not be empty** -- Ideally, there should be no private key on Taurus except for local use. -- Keys to other systems must be passphrase-protected! -- **ssh to Taurus** is only possible from inside TU Dresden Campus - (login\[1,2\].zih.tu-dresden.de will be blacklisted). Users from outside can use VPN (see + * **Login to ZIH systems is denied with an old password.** +* All old (private and public) keys have been moved away. +* All public ssh keys for ZIH systems have to + * be re-generated using only the ED25519 algorithm (`ssh-keygen -t ed25519`) + * **passphrase for the private key must not be empty** +* Ideally, there should be no private key on ZIH systems except for local use. +* Keys to other systems must be passphrase-protected! +* **ssh to ZIH systems** is only possible from inside TU Dresden campus + (`login[1,2].zih.tu-dresden.de` will be blacklisted). Users from outside can use VPN (see [here](https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/zugang_datennetz/vpn)). -- **ssh from Taurus** is only possible inside TU Dresden Campus. - (Direct ssh access to other computing centers was the spreading vector of the recent incident.) +* **ssh from ZIH systems** is only possible inside TU Dresden campus. + (Direct SSH access to other computing centers was the spreading vector of the recent incident.) -Data transfer is possible via the taurusexport nodes. We are working on a bandwidth-friendly -solution. +Data transfer is possible via the [export nodes](../data_transfer/export_nodes.md). We are working +on a bandwidth-friendly solution. We understand that all this will change convenient workflows. If the measurements would render your -work on Taurus completely impossible, please contact the HPC support. 
+work on ZIH systems completely impossible, please [contact the HPC support](../support.md). diff --git a/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md b/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md index ce009ace4bdcfc58fc20009eafbc6faf6c4fd553..29274c54a77ce954a478325f7d43110557a325e5 100644 --- a/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md +++ b/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md @@ -62,7 +62,7 @@ Check the status of the job with `squeue -u \<username>`. ## Mount BeeGFS Filesystem You can mount BeeGFS filesystem on the ML partition (PowerPC architecture) or on the Haswell -[partition](../jobs_and_resources/system_taurus.md) (x86_64 architecture) +[partition](../jobs_and_resources/partitions_and_limits.md) (x86_64 architecture) ### Mount BeeGFS Filesystem on the Partition `ml` diff --git a/doc.zih.tu-dresden.de/docs/archive/cxfs_end_of_support.md b/doc.zih.tu-dresden.de/docs/archive/cxfs_end_of_support.md index 84e018b655f958ecb2d0a8d35982aad47a66adb2..2854bb2aeccb7d016e91dda4d9de6d717521bf46 100644 --- a/doc.zih.tu-dresden.de/docs/archive/cxfs_end_of_support.md +++ b/doc.zih.tu-dresden.de/docs/archive/cxfs_end_of_support.md @@ -1,44 +1,45 @@ -# Changes in the CXFS File System +# Changes in the CXFS Filesystem -With the ending support from SGI, the CXFS file system will be seperated -from its tape library by the end of March, 2013. +!!! warning -This file system is currently mounted at + This page is outdated! -- SGI Altix: `/fastfs/` -- Atlas: `/hpc_fastfs/` +With the ending support from SGI, the CXFS filesystem will be separated from its tape library by +the end of March, 2013. -We kindly ask our users to remove their large data from the file system. +This filesystem is currently mounted at + +* SGI Altix: `/fastfs/` +* Atlas: `/hpc_fastfs/` + +We kindly ask our users to remove their large data from the filesystem. 
Files worth keeping can be moved -- to the new [Intermediate Archive](../data_lifecycle/intermediate_archive.md) (max storage +* to the new [Intermediate Archive](../data_lifecycle/intermediate_archive.md) (max storage duration: 3 years) - see [MigrationHints](#migration-from-cxfs-to-the-intermediate-archive) below, -- or to the [Log-term Archive](../data_lifecycle/preservation_research_data.md) (tagged with +* or to the [Long-term Archive](../data_lifecycle/preservation_research_data.md) (tagged with metadata). -To run the file system without support comes with the risk of losing -data. So, please store away your results into the Intermediate Archive. -`/fastfs` might on only be used for really temporary data, since we are -not sure if we can fully guarantee the availability and the integrity of -this file system, from then on. +To run the filesystem without support comes with the risk of losing data. So, please store away +your results into the Intermediate Archive. `/fastfs` might only be used for really temporary +data, since we are not sure if we can fully guarantee the availability and the integrity of this +filesystem, from then on. -With the new HRSK-II system comes a large scratch file system with appr. -800 TB disk space. It will be made available for all running HPC systems -in due time. +With the new HRSK-II system comes a large scratch filesystem with approximately 800 TB disk space. +It will be made available for all running HPC systems in due time. ## Migration from CXFS to the Intermediate Archive Data worth keeping shall be moved by the users to the directory `archive_migration`, which can be found in your project's and your -personal `/fastfs` directories. (`/fastfs/my_login/archive_migration`, -`/fastfs/my_project/archive_migration` ) +personal `/fastfs` directories: -\<u>Attention:\</u> Exclusively use the command `mv`. Do **not** use -`cp` or `rsync`, for they will store a second version of your files in -the system. 
+* `/fastfs/my_login/archive_migration` +* `/fastfs/my_project/archive_migration` -Please finish this by the end of January. Starting on Feb/18/2013, we -will step by step transfer these directories to the new hardware. +**Attention:** Exclusively use the command `mv`. Do **not** use `cp` or `rsync`, for they will store +a second version of your files in the system. -- Set DENYTOPICVIEW = WikiGuest +Please finish this by the end of January. Starting on Feb/18/2013, we will step by step transfer +these directories to the new hardware. diff --git a/doc.zih.tu-dresden.de/docs/archive/unicore_rest_api.md b/doc.zih.tu-dresden.de/docs/archive/unicore_rest_api.md index 3cc59e7beb48a69a2b939542b14fef28cf4047fc..721aca76357bc1b004178d06c30ae0e8f0362b64 100644 --- a/doc.zih.tu-dresden.de/docs/archive/unicore_rest_api.md +++ b/doc.zih.tu-dresden.de/docs/archive/unicore_rest_api.md @@ -1,18 +1,15 @@ # UNICORE access via REST API -**%RED%The UNICORE support has been abandoned and so this way of access -is no longer available.%ENDCOLOR%** +!!! warning -Most of the UNICORE features are also available using its REST API. - -This API is documented here: - -<https://sourceforge.net/p/unicore/wiki/REST_API/> + This page is outdated! The UNICORE support has been abandoned and so this way of access is no + longer available. -Some useful examples of job submission via REST are available at: - -<https://sourceforge.net/p/unicore/wiki/REST_API_Examples/> - -The base address for the Taurus system at the ZIH is: +Most of the UNICORE features are also available using its REST API. 
-unicore.zih.tu-dresden.de:8080/TAURUS/rest/core +* This API is documented here: + * [https://sourceforge.net/p/unicore/wiki/REST_API/](https://sourceforge.net/p/unicore/wiki/REST_API/) +* Some useful examples of job submission via REST are available at: + * [https://sourceforge.net/p/unicore/wiki/REST_API_Examples/](https://sourceforge.net/p/unicore/wiki/REST_API_Examples/) +* The base address for the Taurus system at the ZIH is: + * *unicore.zih.tu-dresden.de:8080/TAURUS/rest/core* diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md index cc174e052a72bf6258ce4844749690ae28d7a46c..515188ce538e03c1937aa255cc45754186e67a2b 100644 --- a/doc.zih.tu-dresden.de/docs/index.md +++ b/doc.zih.tu-dresden.de/docs/index.md @@ -1,48 +1,32 @@ -# ZIH HPC Compendium +# ZIH HPC Documentation -Dear HPC users, +This is the documentation of the HPC systems and services provided at +[TU Dresden/ZIH](https://tu-dresden.de/zih/). This documentation is work in progress, since we try +to incorporate more information with increasing experience and with every question you ask us. The +HPC team invites you to take part in the improvement of these pages by correcting or adding useful +information. -due to restrictions coming from data security and software incompatibilities the old -"HPC Compendium" is now reachable only from inside TU Dresden campus (or via VPN). +## Starting Point -Internal users should be redirected automatically. +Besides this documentation, the slides from the [HPC introduction](misc/HPC-Introduction.pdf) are +a good starting point for new users of ZIH systems as well as for HPC in general. -We apologize for this severe action, but we are in the middle of the preparation for a wiki -relaunch, so we do not want to redirect resources to fix technical/security issues for a system -that will last only a few weeks. 
+## Contribution -Thank you for your understanding, +Issues concerning this documentation can be reported via the GitLab +[issue tracking system](https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium/-/issues). +Please check for any already existing issue before submitting your issue in order to avoid duplicate +issues. -your HPC Support Team ZIH +Contributions from users are highly welcome. Please refer to +the detailed documentation to get started. -## What is new? +**Reminder:** Non-documentation issues and requests need to be sent as a ticket to +[hpcsupport@zih.tu-dresden.de](mailto:hpcsupport@zih.tu-dresden.de). -The desire for a new technical documentation is driven by two major aspects: +## Licenses -1. Clear and user-oriented structure of the content -1. Usage of modern tools for technical documentation +The documentation and the repository have two licenses: -The HPC Compendium provided knowledge and help for many years. It grew with every new hardware -installation and ZIH stuff tried its best to keep it up to date. But, to be honest, it has become -quite messy, and housekeeping it was a nightmare. - -The new structure is designed with the schedule for an HPC project in mind. This will ease the start -for new HPC users, as well speedup searching information w.r.t. a specific topic for advanced users. - -We decided against a classical wiki software. Instead, we write the documentation in markdown and -make use of the static site generator [mkdocs](https://www.mkdocs.org/) to create static html files -from this markdown files. All configuration, layout and content files are managed within a git -repository. The generated static html files, i.e, the documentation you are now reading, is deployed -to a web server. - -The workflow is flexible, allows a high level of automation, and is quite easy to maintain. 
- -From a technical point, our new documentation system is highly inspired by -[OLFC User Documentation](https://docs.olcf.ornl.gov/) as well as -[NERSC Technical Documentation](https://nersc.gitlab.io/). - -## Contribute - -Contributions are highly welcome. Please refere to -[README.md](https://gitlab.hrz.tu-chemnitz.de/zih/hpc-compendium/hpc-compendium/-/blob/main/doc.zih.tu-dresden.de/README.md) -file of this project. +* All documentation is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). +* All software components are licensed under MIT license. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md index 5324f550e30e66b6ec6830cf7fddbb921b0dbdbf..67c3168d23b7edb4a0128ead8ac3fcf0bf96520a 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md @@ -1,13 +1,13 @@ -# Alpha Centauri - Multi-GPU sub-cluster +# Alpha Centauri - Multi-GPU Sub-Cluster -The sub-cluster "AlphaCentauri" had been installed for AI-related computations (ScaDS.AI). +The sub-cluster "Alpha Centauri" had been installed for AI-related computations (ScaDS.AI). It has 34 nodes, each with: -- 8 x NVIDIA A100-SXM4 (40 GB RAM) -- 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz with multithreading enabled -- 1 TB RAM 3.5 TB `/tmp` local NVMe device -- Hostnames: `taurusi[8001-8034]` -- Slurm partition `alpha` for batch jobs and `alpha-interactive` for interactive jobs +* 8 x NVIDIA A100-SXM4 (40 GB RAM) +* 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz with multi-threading enabled +* 1 TB RAM 3.5 TB `/tmp` local NVMe device +* Hostnames: `taurusi[8001-8034]` +* Slurm partition `alpha` for batch jobs and `alpha-interactive` for interactive jobs !!! 
note @@ -23,8 +23,8 @@ The software for the `alpha` partition is available in `modenv/hiera` module env To check the available modules for `modenv/hiera`, use the command -```bash -module spider <module_name> +```console +marie@alpha$ module spider <module_name> ``` For example, to check whether PyTorch is available in version 1.7.1: @@ -95,11 +95,11 @@ Successfully installed torchvision-0.10.0 ### JupyterHub -[JupyterHub](../access/jupyterhub.md) can be used to run Jupyter notebooks on AlphaCentauri +[JupyterHub](../access/jupyterhub.md) can be used to run Jupyter notebooks on Alpha Centauri sub-cluster. As a starting configuration, a "GPU (NVIDIA Ampere A100)" preset can be used in the advanced form. In order to use latest software, it is recommended to choose `fosscuda-2020b` as a standard environment. Already installed modules from `modenv/hiera` -can be pre-loaded in "Preload modules (modules load):" field. +can be preloaded in "Preload modules (modules load):" field. ### Containers @@ -109,6 +109,6 @@ Detailed information about containers can be found [here](../software/containers Nvidia [NGC](https://developer.nvidia.com/blog/how-to-run-ngc-deep-learning-containers-with-singularity/) containers can be used as an effective solution for machine learning related tasks. (Downloading -containers requires registration). Nvidia-prepared containers with software solutions for specific +containers requires registration). Nvidia-prepared containers with software solutions for specific scientific problems can simplify the deployment of deep learning workloads on HPC. NGC containers have shown consistent performance compared to directly run code. 
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/batch_systems.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/batch_systems.md deleted file mode 100644 index 06e9be7e7a8ab5efa0ae1272ba6159ac50310e0b..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/batch_systems.md +++ /dev/null @@ -1,56 +0,0 @@ -# Batch Systems - -Applications on an HPC system can not be run on the login node. They have to be submitted to compute -nodes with dedicated resources for user jobs. Normally a job can be submitted with these data: - -- number of CPU cores, -- requested CPU cores have to belong on one node (OpenMP programs) or - can distributed (MPI), -- memory per process, -- maximum wall clock time (after reaching this limit the process is - killed automatically), -- files for redirection of output and error messages, -- executable and command line parameters. - -Depending on the batch system the syntax differs slightly: - -- [Slurm](../jobs_and_resources/slurm.md) (taurus, venus) - -If you are confused by the different batch systems, you may want to enjoy this [batch system -commands translation table](http://slurm.schedmd.com/rosetta.pdf). - -**Comment:** Please keep in mind that for a large runtime a computation may not reach its end. Try -to create shorter runs (4...8 hours) and use checkpointing. Here is an extreme example from -literature for the waste of large computing resources due to missing checkpoints: - -*Earth was a supercomputer constructed to find the question to the answer to the Life, the Universe, -and Everything by a race of hyper-intelligent pan-dimensional beings. Unfortunately 10 million years -later, and five minutes before the program had run to completion, the Earth was destroyed by -Vogons.* (Adams, D. 
The Hitchhikers Guide Through the Galaxy) - -## Exclusive Reservation of Hardware - -If you need for some special reasons, e.g., for benchmarking, a project or paper deadline, parts of -our machines exclusively, we offer the opportunity to request and reserve these parts for your -project. - -Please send your request **7 working days** before the reservation should start (as that's our -maximum time limit for jobs and it is therefore not guaranteed that resources are available on -shorter notice) with the following information to the [HPC -support](mailto:hpcsupport@zih.tu-dresden.de?subject=Request%20for%20a%20exclusive%20reservation%20of%20hardware&body=Dear%20HPC%20support%2C%0A%0AI%20have%20the%20following%20request%20for%20a%20exclusive%20reservation%20of%20hardware%3A%0A%0AProject%3A%0AReservation%20owner%3A%0ASystem%3A%0AHardware%20requirements%3A%0ATime%20window%3A%20%3C%5Byear%5D%3Amonth%3Aday%3Ahour%3Aminute%20-%20%5Byear%5D%3Amonth%3Aday%3Ahour%3Aminute%3E%0AReason%3A): - -- `Project:` *\<Which project will be credited for the reservation?>* -- `Reservation owner:` *\<Who should be able to run jobs on the - reservation? I.e., name of an individual user or a group of users - within the specified project.>* -- `System:` *\<Which machine should be used?>* -- `Hardware requirements:` *\<How many nodes and cores do you need? Do - you have special requirements, e.g., minimum on main memory, - equipped with a graphic card, special placement within the network - topology?>* -- `Time window:` *\<Begin and end of the reservation in the form - year:month:dayThour:minute:second e.g.: 2020-05-21T09:00:00>* -- `Reason:` *\<Reason for the reservation.>* - -**Please note** that your project CPU hour budget will be credited for the reserved hardware even if -you don't use it. 
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/binding_and_distribution_of_tasks.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/binding_and_distribution_of_tasks.md index 4e8bde8c6e43ab765135f3199525a09820abf8d1..4677a625300c59a04160389f4cf9a3bf975018c8 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/binding_and_distribution_of_tasks.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/binding_and_distribution_of_tasks.md @@ -1,45 +1,76 @@ # Binding and Distribution of Tasks +Slurm provides several binding strategies to place and bind the tasks and/or threads of your job +to cores, sockets and nodes. + +!!! note + + Keep in mind that the distribution method might have a direct impact on the execution time of + your application. The manipulation of the distribution can either speed up or slow down your + application. + ## General -To specify a pattern the commands `--cpu_bind=<cores|sockets>` and -`--distribution=<block | cyclic>` are needed. cpu_bind defines the resolution in which the tasks -will be allocated. While --distribution determinates the order in which the tasks will be allocated -to the cpus. Keep in mind that the allocation pattern also depends on your specification. +To specify a pattern, the options `--cpu_bind=<cores|sockets>` and `--distribution=<block|cyclic>` +are needed. The option `--cpu_bind` defines the resolution in which the tasks will be allocated, +while `--distribution` determines the order in which the tasks will be allocated to the CPUs. Keep +in mind that the allocation pattern also depends on your specification. -```Bash -#!/bin/bash -#SBATCH --nodes=2 # request 2 nodes -#SBATCH --cpus-per-task=4 # use 4 cores per task -#SBATCH --tasks-per-node=4 # allocate 4 tasks per node - 2 per socket +!!! 
example "Explicitly specify binding and distribution" -srun --ntasks 8 --cpus-per-task 4 --cpu_bind=cores --distribution=block:block ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 # request 2 nodes + #SBATCH --cpus-per-task=4 # use 4 cores per task + #SBATCH --tasks-per-node=4 # allocate 4 tasks per node - 2 per socket + + srun --ntasks 8 --cpus-per-task 4 --cpu_bind=cores --distribution=block:block ./application + ``` In the following sections there are some selected examples of the combinations between `--cpu_bind` and `--distribution` for different job types. +## OpenMP Strategies + +The illustration below shows the default binding of a pure OpenMP-job on a single node with 16 CPUs +on which 16 threads are allocated. + +```Bash +#!/bin/bash +#SBATCH --nodes=1 +#SBATCH --tasks-per-node=1 +#SBATCH --cpus-per-task=16 + +export OMP_NUM_THREADS=16 + +srun --ntasks 1 --cpus-per-task $OMP_NUM_THREADS ./application +``` + + +{: align=center} + ## MPI Strategies -### Default Binding and Dsitribution Pattern +### Default Binding and Distribution Pattern -The default binding uses --cpu_bind=cores in combination with --distribution=block:cyclic. The -default (as well as block:cyclic) allocation method will fill up one node after another, while +The default binding uses `--cpu_bind=cores` in combination with `--distribution=block:cyclic`. The +default (as well as `block:cyclic`) allocation method will fill up one node after another, while filling socket one and two in alternation. Resulting in only even ranks on the first socket of each node and odd on each second socket of each node. -\<img alt="" -src="data:;base64," -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +!!! 
example "Default binding and default distribution" -srun --ntasks 32 ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 + + srun --ntasks 32 ./application + ``` ### Core Bound @@ -50,18 +81,19 @@ application. This method allocates the tasks linearly to the cores. -\<img alt="" -src="data:;base64," -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +!!! example "Binding to cores and block:block distribution" -srun --ntasks 32 --cpu_bind=cores --distribution=block:block ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 + + srun --ntasks 32 --cpu_bind=cores --distribution=block:block ./application + ``` #### Distribution: cyclic:cyclic @@ -71,18 +103,19 @@ then the first socket of the second node until one task is placed on every first socket of every node. After that it will place a task on every second socket of every node and so on. -\<img alt="" -src="<data:;base64,>" -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +!!! example "Binding to cores and cyclic:cyclic distribution" -srun --ntasks 32 --cpu_bind=cores --distribution=cyclic:cyclic -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 + + srun --ntasks 32 --cpu_bind=cores --distribution=cyclic:cyclic ./application + ``` #### Distribution: cyclic:block @@ -90,104 +123,108 @@ The cyclic:block distribution will allocate the tasks of your job in alternation on node level, starting with first node filling the sockets linearly. -\<img alt="" -src="<data:;base64,>" -/> + +{: align="center"} + +!!! 
example "Binding to cores and cyclic:block distribution" + ```bash #!/bin/bash #SBATCH --nodes=2 #SBATCH --tasks-per-node=16 #SBATCH --cpus-per-task=1 srun --ntasks 32 --cpu_bind=cores --distribution=cyclic:block ./application + ``` ### Socket Bound -Note: The general distribution onto the nodes and sockets stays the -same. The mayor difference between socket and cpu bound lies within the -ability of the tasks to "jump" from one core to another inside a socket -while executing the application. These jumps can slow down the execution -time of your application. +The general distribution onto the nodes and sockets stays the same. The major difference between +socket and core binding lies in the ability of the OS to move tasks from one core to another +inside a socket while executing the application. These moves can slow down the execution time of +your application. #### Default Distribution -The default distribution uses --cpu_bind=sockets with ---distribution=block:cyclic. The default allocation method (as well as -block:cyclic) will fill up one node after another, while filling socket -one and two in alternation. Resulting in only even ranks on the first -socket of each node and odd on each second socket of each node. +The default distribution uses `--cpu_bind=sockets` with `--distribution=block:cyclic`. The default +allocation method (as well as `block:cyclic`) will fill up one node after another, while filling +socket one and two in alternation, resulting in only even ranks on the first socket of each node +and odd ranks on the second socket of each node. -\<img alt="" -src="data:;base64," -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +!!! 
example "Binding to sockets and block:cyclic distribution" -srun --ntasks 32 -cpu_bind=sockets ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 + + srun --ntasks 32 --cpu_bind=sockets ./application + ``` #### Distribution: block:block This method allocates the tasks linearly to the cores. -\<img alt="" -src="data:;base64," -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +!!! example "Binding to sockets and block:block distribution" -srun --ntasks 32 --cpu_bind=sockets --distribution=block:block ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=16 + #SBATCH --cpus-per-task=1 + + srun --ntasks 32 --cpu_bind=sockets --distribution=block:block ./application + ``` #### Distribution: block:cyclic -The block:cyclic distribution will allocate the tasks of your job in +The `block:cyclic` distribution will allocate the tasks of your job in alternation between the first node and the second node while filling the sockets linearly. -\<img alt="" -src="data:;base64," -/> + +{: align="center"} + +!!! example "Binding to sockets and block:cyclic distribution" + ```bash #!/bin/bash #SBATCH --nodes=2 #SBATCh --tasks-per-node=16 #SBATCH --cpus-per-task=1 srun --ntasks 32 --cpu_bind=sockets --distribution=block:cyclic ./application + ``` ## Hybrid Strategies ### Default Binding and Distribution Pattern -The default binding pattern of hybrid jobs will split the cores -allocated to a rank between the sockets of a node. The example shows -that Rank 0 has 4 cores at its disposal. Two of them on first socket -inside the first node and two on the second socket inside the first -node. +The default binding pattern of hybrid jobs will split the cores allocated to a rank between the +sockets of a node. The example shows that Rank 0 has 4 cores at its disposal. 
Two of them on first +socket inside the first node and two on the second socket inside the first node. -\<img alt="" -src="data:;base64," -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=4 +!!! example "Binding to sockets and block:block distribution" -export OMP_NUM_THREADS=4 + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=4 + #SBATCH --cpus-per-task=4 -srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application -``` + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application + ``` ### Core Bound @@ -195,36 +232,37 @@ srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application This method allocates the tasks linearly to the cores. -\<img alt="" -src="<data:;base64,>" -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=4 +!!! example "Binding to cores and block:block distribution" -export OMP_NUM_THREADS=4 + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=4 + #SBATCH --cpus-per-task=4 -srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS --cpu_bind=cores --distribution=block:block ./application -``` + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS --cpu_bind=cores --distribution=block:block ./application + ``` #### Distribution: cyclic:block -The cyclic:block distribution will allocate the tasks of your job in -alternation between the first node and the second node while filling the -sockets linearly. +The `cyclic:block` distribution will allocate the tasks of your job in alternation between the first +node and the second node while filling the sockets linearly. -\<img alt="" -src="data:;base64," -/> + +{: align="center"} -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=4 +!!! 
example "Binding to cores and cyclic:block distribution" -export OMP_NUM_THREADS=4<br /><br />srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS --cpu_bind=cores --distribution=cyclic:block ./application -``` + ```bash + #!/bin/bash + #SBATCH --nodes=2 + #SBATCH --tasks-per-node=4 + #SBATCH --cpus-per-task=4 + + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS --cpu_bind=cores --distribution=cyclic:block ./application + ``` diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md index ea3343fe1a5d21a296207fc374aa181e3ccc0855..619e9bfa84413b6f37e99ae8838abd354f059ba5 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md @@ -12,6 +12,15 @@ from the very beginning, you should be familiar with the concept of checkpointin Another motivation is to use checkpoint/restart to split long running jobs into several shorter ones. This might improve the overall job throughput, since shorter jobs can "fill holes" in the job queue. +Here is an extreme example from literature for the waste of large computing resources due to missing +checkpoints: + +!!! cite "Adams, D. The Hitchhiker's Guide to the Galaxy" + + Earth was a supercomputer constructed to find the question to the answer to the Life, the Universe, + and Everything by a race of hyper-intelligent pan-dimensional beings. Unfortunately 10 million years + later, and five minutes before the program had run to completion, the Earth was destroyed by + Vogons. If you wish to do checkpointing, your first step should always be to check if your application already has such capabilities built-in, as that is the most stable and safe way of doing it. 
@@ -21,7 +30,7 @@ Abaqus, Amber, Gaussian, GROMACS, LAMMPS, NAMD, NWChem, Quantum Espresso, STAR-C In case your program does not natively support checkpointing, there are attempts at creating generic checkpoint/restart solutions that should work application-agnostic. One such project which we -recommend is [Distributed MultiThreaded CheckPointing](http://dmtcp.sourceforge.net) (DMTCP). +recommend is [Distributed Multi-Threaded Check-Pointing](http://dmtcp.sourceforge.net) (DMTCP). DMTCP is available on ZIH systems after having loaded the `dmtcp` module @@ -47,7 +56,7 @@ checkpoint/restart bits transparently to your batch script. You just have to spe total runtime of your calculation and the interval in which you wish to do checkpoints. The latter (plus the time it takes to write the checkpoint) will then be the runtime of the individual jobs. This should be targeted at below 24 hours in order to be able to run on all -[haswell64 partitions](../jobs_and_resources/system_taurus.md#run-time-limits). For increased +[haswell64 partitions](../jobs_and_resources/partitions_and_limits.md#runtime-limits). For increased fault-tolerance, it can be chosen even shorter. To use it, first add a `dmtcp_launch` before your application call in your batch script. In the case @@ -85,7 +94,7 @@ about 2 days in total. !!! Hints - - If you see your first job running into the timelimit, that probably + - If you see your first job running into the time limit, that probably means the timeout for writing out checkpoint files does not suffice and should be increased. Our tests have shown that it takes approximately 5 minutes to write out the memory content of a fully @@ -95,7 +104,7 @@ about 2 days in total. 
content is rather incompressible, it might be a good idea to disable the checkpoint file compression by setting: `export DMTCP_GZIP=0` - Note that all jobs the script deems necessary for your chosen - timelimit/interval values are submitted right when first calling the + time limit/interval values are submitted right when first calling the script. If your applications take considerably less time than what you specified, some of the individual jobs will be unnecessary. As soon as one job does not find a checkpoint to resume from, it will @@ -115,7 +124,7 @@ What happens in your work directory? If you wish to restart manually from one of your checkpoints (e.g., if something went wrong in your later jobs or the jobs vanished from the queue for some reason), you have to call `dmtcp_sbatch` -with the `-r, --resume` parameter, specifying a cpkt\_\* directory to resume from. Then it will use +with the `-r, --resume` parameter, specifying a `cpkt_` directory to resume from. Then it will use the same parameters as in the initial run of this job chain. If you wish to adjust the time limit, for instance, because you realized that your original limit was too short, just use the `-t, --time` parameter again on resume. @@ -126,7 +135,7 @@ If for some reason our automatic chain job script is not suitable for your use c just use DMTCP on its own. In the following we will give you step-by-step instructions on how to checkpoint your job manually: -* Load the dmtcp module: `module load dmtcp` +* Load the DMTCP module: `module load dmtcp` * DMTCP usually runs an additional process that manages the creation of checkpoints and such, the so-called `coordinator`. It must be started in your batch script before the actual start of your application. To help you with this process, we @@ -138,9 +147,9 @@ first checkpoint has been created, which can be useful if you wish to implement chaining on your own. * In front of your program call, you have to add the wrapper script `dmtcp_launch`. 
This will create a checkpoint automatically after 40 seconds and then -terminate your application and with it the job. If the job runs into its timelimit (here: 60 +terminate your application and with it the job. If the job runs into its time limit (here: 60 seconds), the time to write out the checkpoint was probably not long enough. If all went well, you -should find cpkt\* files in your work directory together with a script called +should find `cpkt` files in your work directory together with a script called `./dmtcp_restart_script.sh` that can be used to resume from the checkpoint. ???+ example diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md new file mode 100644 index 0000000000000000000000000000000000000000..8a6e3f5904582a8316fdcde363144a174b59e4d3 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md @@ -0,0 +1,127 @@ +# ZIH Systems + +ZIH systems comprises the *High Performance Computing and Storage Complex* (HRSK-II) and its +extension *High Performance Computing – Data Analytics* (HPC-DA). In total it offers scientists +about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating point operations +per second. The architecture specifically tailored to data-intensive computing, Big Data analytics, +and artificial intelligence methods with extensive capabilities for energy measurement and +performance monitoring provides ideal conditions to achieve the ambitious research goals of the +users and the ZIH.
+ +## Login Nodes + +- Login-Nodes (`tauruslogin[3-6].hrsk.tu-dresden.de`) + - each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 each with 12 cores + @ 2.50GHz, MultiThreading Disabled, 64 GB RAM, 128 GB SSD local disk + - IPs: 141.30.73.\[102-105\] +- Transfer-Nodes (`taurusexport3/4.hrsk.tu-dresden.de`, DNS Alias + `taurusexport.hrsk.tu-dresden.de`) + - 2 Servers without interactive login, only available via file transfer protocols (`rsync`, `ftp`) + - IPs: 141.30.73.82/83 +- Direct access to these nodes is granted via IP whitelisting (contact + hpcsupport@zih.tu-dresden.de) - otherwise use TU Dresden VPN. + +## AMD Rome CPUs + NVIDIA A100 + +- 32 nodes, each with + - 8 x NVIDIA A100-SXM4 + - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, MultiThreading disabled + - 1 TB RAM + - 3.5 TB local memory at NVMe device at `/tmp` +- Hostnames: `taurusi[8001-8034]` +- Slurm partition `alpha` +- Dedicated mostly for ScaDS-AI + +## Island 7 - AMD Rome CPUs + +- 192 nodes, each with + - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, MultiThreading + enabled, + - 512 GB RAM + - 200 GB /tmp on local SSD local disk +- Hostnames: taurusi\[7001-7192\] +- Slurm partition `romeo` +- More information under [Rome Nodes](rome_nodes.md) + +## Large SMP System HPE Superdome Flex + +- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) +- 47 TB RAM +- Currently configured as one single node + - Hostname: `taurussmp8` +- Slurm partition `julia` +- More information under [HPE SD Flex](sd_flex.md) + +## IBM Power9 Nodes for Machine Learning + +For machine learning, we have 32 IBM AC922 nodes installed with this configuration: + +- 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores) +- 256 GB RAM DDR4 2666MHz +- 6x NVIDIA VOLTA V100 with 32GB HBM2 +- NVLINK bandwidth 150 GB/s between GPUs and host +- Slurm partition `ml` +- Hostnames: `taurusml[1-32]` + +## Island 4 to 6 - Intel Haswell CPUs + +- 1456 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) + @ 2.50GHz, 
MultiThreading disabled, 128 GB SSD local disk +- Hostname: `taurusi4[001-232]`, `taurusi5[001-612]`, + `taurusi6[001-612]` +- Varying amounts of main memory (selected automatically by the batch + system for you according to your job requirements) + - 1328 nodes with 2.67 GB RAM per core (64 GB total): + `taurusi[4001-4104,5001-5612,6001-6612]` + - 84 nodes with 5.34 GB RAM per core (128 GB total): + `taurusi[4105-4188]` + - 44 nodes with 10.67 GB RAM per core (256 GB total): + `taurusi[4189-4232]` +- Slurm Partition `haswell` + +??? hint "Node topology" + +  + {: align=center} + +### Extension of Island 4 with Broadwell CPUs + +* 32 nodes, each with 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz + (**14 cores**), MultiThreading disabled, 64 GB RAM, 256 GB SSD local disk +* from the users' perspective: Broadwell is like Haswell +* Hostname: `taurusi[4233-4264]` +* Slurm partition `broadwell` + +## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs + +* 64 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) + @ 2.50GHz, MultiThreading Disabled, 64 GB RAM (2.67 GB per core), + 128 GB SSD local disk, 4x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs +* Hostname: `taurusi2[045-108]` +* Slurm Partition `gpu` +* Node topology, same as [island 4 - 6](#island-4-to-6-intel-haswell-cpus) + +## SMP Nodes - up to 2 TB RAM + +- 5 Nodes each with 4x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ + 2.20GHz, MultiThreading Disabled, 2 TB RAM + - Hostname: `taurussmp[3-7]` + - Slurm partition `smp2` + +??? hint "Node topology" + +  + {: align=center} + +## Island 2 Phase 1 - Intel Sandybridge CPUs + NVIDIA K20x GPUs + +- 44 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2450 (8 cores) @ + 2.10GHz, MultiThreading Disabled, 48 GB RAM (3 GB per core), 128 GB + SSD local disk, 2x NVIDIA Tesla K20x (6 GB GDDR RAM) GPUs +- Hostname: `taurusi2[001-044]` +- Slurm partition `gpu1` + +??? 
hint "Node topology" + +  + {: align=center} diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_taurus.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_taurus.md deleted file mode 100644 index ff28e9b69d95496f299b80b45179f3787ad996cb..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_taurus.md +++ /dev/null @@ -1,110 +0,0 @@ -# Central Components - -- Login-Nodes (`tauruslogin[3-6].hrsk.tu-dresden.de`) - - each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 each with 12 cores - @ 2.50GHz, MultiThreading Disabled, 64 GB RAM, 128 GB SSD local - disk - - IPs: 141.30.73.\[102-105\] -- Transfer-Nodes (`taurusexport3/4.hrsk.tu-dresden.de`, DNS Alias - `taurusexport.hrsk.tu-dresden.de`) - - 2 Servers without interactive login, only available via file - transfer protocols (rsync, ftp) - - IPs: 141.30.73.82/83 -- Direct access to these nodes is granted via IP whitelisting (contact - <hpcsupport@zih.tu-dresden.de>) - otherwise use TU Dresden VPN. 
- -## AMD Rome CPUs + NVIDIA A100 - -- 32 nodes, each with - - 8 x NVIDIA A100-SXM4 - - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, MultiThreading - disabled - - 1 TB RAM - - 3.5 TB /tmp local NVMe device -- Hostnames: taurusi\[8001-8034\] -- SLURM partition `alpha` -- dedicated mostly for ScaDS-AI - -## Island 7 - AMD Rome CPUs - -- 192 nodes, each with - - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, MultiThreading - enabled, - - 512 GB RAM - - 200 GB /tmp on local SSD local disk -- Hostnames: taurusi\[7001-7192\] -- SLURM partition `romeo` -- more information under [RomeNodes](rome_nodes.md) - -## Large SMP System HPE Superdome Flex - -- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) -- 47 TB RAM -- currently configured as one single node - - Hostname: taurussmp8 -- SLURM partition `julia` -- more information under [HPE SD Flex](sd_flex.md) - -## IBM Power9 Nodes for Machine Learning - -For machine learning, we have 32 IBM AC922 nodes installed with this -configuration: - -- 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores) -- 256 GB RAM DDR4 2666MHz -- 6x NVIDIA VOLTA V100 with 32GB HBM2 -- NVLINK bandwidth 150 GB/s between GPUs and host -- SLURM partition `ml` -- Hostnames: taurusml\[1-32\] - -## Island 4 to 6 - Intel Haswell CPUs - -- 1456 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) - @ 2.50GHz, MultiThreading disabled, 128 GB SSD local disk -- Hostname: taurusi4\[001-232\], taurusi5\[001-612\], - taurusi6\[001-612\] -- varying amounts of main memory (selected automatically by the batch - system for you according to your job requirements) - - 1328 nodes with 2.67 GB RAM per core (64 GB total): - taurusi\[4001-4104,5001-5612,6001-6612\] - - 84 nodes with 5.34 GB RAM per core (128 GB total): - taurusi\[4105-4188\] - - 44 nodes with 10.67 GB RAM per core (256 GB total): - taurusi\[4189-4232\] -- SLURM Partition `haswell` -- [Node topology] **todo** %ATTACHURL%/i4000.png - -### Extension of Island 4 with Broadwell CPUs - -- 32 
nodes, eachs witch 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz - (**14 cores**) , MultiThreading disabled, 64 GB RAM, 256 GB SSD - local disk -- from the users' perspective: Broadwell is like Haswell -- Hostname: taurusi\[4233-4264\] -- SLURM partition `broadwell` - -## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs - -- 64 nodes, each with 2x Intel(R) Xeon(R) CPU E5-E5-2680 v3 (12 cores) - @ 2.50GHz, MultiThreading Disabled, 64 GB RAM (2.67 GB per core), - 128 GB SSD local disk, 4x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs -- Hostname: taurusi2\[045-108\] -- SLURM Partition `gpu` -- [Node topology] **todo %ATTACHURL%/i4000.png** (without GPUs) - -## SMP Nodes - up to 2 TB RAM - -- 5 Nodes each with 4x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ - 2.20GHz, MultiThreading Disabled, 2 TB RAM - - Hostname: `taurussmp[3-7]` - - SLURM Partition `smp2` - - [Node topology] **todo** %ATTACHURL%/smp2.png - -## Island 2 Phase 1 - Intel Sandybridge CPUs + NVIDIA K20x GPUs - -- 44 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2450 (8 cores) @ - 2.10GHz, MultiThreading Disabled, 48 GB RAM (3 GB per core), 128 GB - SSD local disk, 2x NVIDIA Tesla K20x (6 GB GDDR RAM) GPUs -- Hostname: `taurusi2[001-044]` -- SLURM Partition `gpu1` -- [Node topology] **todo** %ATTACHURL%/i2000.png (without GPUs) diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/index.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/index.md deleted file mode 100644 index 911449758f01a2fce79f5179b5d81f51c79abe84..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/index.md +++ /dev/null @@ -1,65 +0,0 @@ -# Batch System - -Applications on an HPC system can not be run on the login node. They have to be submitted to compute -nodes with dedicated resources for user jobs. 
Normally a job can be submitted with these data: - -* number of CPU cores, -* requested CPU cores have to belong on one node (OpenMP programs) or can distributed (MPI), -* memory per process, -* maximum wall clock time (after reaching this limit the process is killed automatically), -* files for redirection of output and error messages, -* executable and command line parameters. - -*Comment:* Please keep in mind that for a large runtime a computation may not reach its end. Try to -create shorter runs (4...8 hours) and use checkpointing. Here is an extreme example from literature -for the waste of large computing resources due to missing checkpoints: - ->Earth was a supercomputer constructed to find the question to the answer to the Life, the Universe, ->and Everything by a race of hyper-intelligent pan-dimensional beings. Unfortunately 10 million years ->later, and five minutes before the program had run to completion, the Earth was destroyed by ->Vogons. - -(Adams, D. The Hitchhikers Guide Through the Galaxy) - -## Slurm - -The HRSK-II systems are operated with the batch system [Slurm](https://slurm.schedmd.com). Just -specify the resources you need in terms of cores, memory, and time and your job will be placed on -the system. - -### Job Submission - -Job submission can be done with the command: `srun [options] <command>` - -However, using `srun` directly on the shell will be blocking and launch an interactive job. Apart -from short test runs, it is recommended to launch your jobs into the background by using batch jobs. -For that, you can conveniently put the parameters directly in a job file which you can submit using -`sbatch [options] <job file>` - -Some options of srun/sbatch are: - -| Slurm Option | Description | -|------------|-------| -| `-n <N>` or `--ntasks <N>` | set a number of tasks to N(default=1). This determines how many processes will be spawned by srun (for MPI jobs). 
| -| `-N <N>` or `--nodes <N>` | set number of nodes that will be part of a job, on each node there will be --ntasks-per-node processes started, if the option --ntasks-per-node is not given, 1 process per node will be started | -| `--ntasks-per-node <N>` | how many tasks per allocated node to start, as stated in the line before | -| `-c <N>` or `--cpus-per-task <N>` | this option is needed for multithreaded (e.g. OpenMP) jobs, it tells SLURM to allocate N cores per task allocated; typically N should be equal to the number of threads you program spawns, e.g. it should be set to the same number as OMP_NUM_THREADS | -| `-p <name>` or `--partition <name>`| select the type of nodes where you want to execute your job, on Taurus we currently have haswell, smp, sandy, west, ml and gpu available | -| `--mem-per-cpu <name>` | specify the memory need per allocated CPU in MB | -| `--time <HH:MM:SS>` | specify the maximum runtime of your job, if you just put a single number in, it will be interpreted as minutes | -| `--mail-user <your email>` | tell the batch system your email address to get updates about the status of the jobs | -| `--mail-type ALL` | specify for what type of events you want to get a mail; valid options beside ALL are: BEGIN, END, FAIL, REQUEUE | -| `-J <name> or --job-name <name>` | give your job a name which is shown in the queue, the name will also be included in job emails (but cut after 24 chars within emails) | -| `--exclusive` | tell SLURM that only your job is allowed on the nodes allocated to this job; please be aware that you will be charged for all CPUs/cores on the node | -| `-A <project>` | Charge resources used by this job to the specified project, useful if a user belongs to multiple projects. 
| -| `-o <filename>` or `--output <filename>` | specify a file name that will be used to store all normal output (stdout), you can use %j (job id) and %N (name of first node) to automatically adopt the file name to the job, per default stdout goes to "slurm-%j.out" | - -<!--NOTE: the target path of this parameter must be writeable on the compute nodes, i.e. it may not point to a read-only mounted file system like /projects.--> -<!---e <filename> or --error <filename>--> - -<!--specify a file name that will be used to store all error output (stderr), you can use %j (job id) and %N (name of first node) to automatically adopt the file name to the job, per default stderr goes to "slurm-%j.out" as well--> - -<!--NOTE: the target path of this parameter must be writeable on the compute nodes, i.e. it may not point to a read-only mounted file system like /projects.--> -<!---a or --array submit an array job, see the extra section below--> -<!---w <node1>,<node2>,... restrict job to run on specific nodes only--> -<!---x <node1>,<node2>,... 
exclude specific nodes from job--> diff --git a/Compendium_attachments/Slurm/hdfview_memory.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hdfview_memory.png similarity index 100% rename from Compendium_attachments/Slurm/hdfview_memory.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hdfview_memory.png diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid.png new file mode 100644 index 0000000000000000000000000000000000000000..116e03dd0785492be3f896cda69959a025f5ac49 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_block_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_block_block.png new file mode 100644 index 0000000000000000000000000000000000000000..4c196df91b2fe410609a8e76505eca95f283ce29 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_block_block.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_cyclic_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_cyclic_block.png new file mode 100644 index 0000000000000000000000000000000000000000..dfccaf451553c710fcddd648ae9721866668f9e8 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/hybrid_cores_cyclic_block.png differ diff --git a/Compendium_attachments/HardwareTaurus/i2000.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/i2000.png similarity index 100% rename from Compendium_attachments/HardwareTaurus/i2000.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/i2000.png diff --git a/Compendium_attachments/HardwareTaurus/i4000.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/i4000.png similarity index 100% rename from Compendium_attachments/HardwareTaurus/i4000.png rename to 
doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/i4000.png diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi.png new file mode 100644 index 0000000000000000000000000000000000000000..82087209059e535401724c493fff74d743da58e4 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_block_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_block_block.png new file mode 100644 index 0000000000000000000000000000000000000000..0c6e9bbfa0e7f0614ede7e89f292e2d5f1a74316 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_block_block.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_block.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_block.png new file mode 100644 index 0000000000000000000000000000000000000000..dab17e83ed4930b253818e15bc42ef1b1b2c9918 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_block.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_cyclic.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_cyclic.png new file mode 100644 index 0000000000000000000000000000000000000000..8b9361dd1f0a2b76b063ad64652844c425aacbdf Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_cyclic_cyclic.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_default.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_default.png new file mode 100644 index 0000000000000000000000000000000000000000..82087209059e535401724c493fff74d743da58e4 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_default.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_block.png 
b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_block.png new file mode 100644 index 0000000000000000000000000000000000000000..be12c78d1a85297cd60161a1808462941def94fb Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_block.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_cyclic.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_cyclic.png new file mode 100644 index 0000000000000000000000000000000000000000..08f2a90100ed88175f7ef6fa3d867a70ad0880d7 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/mpi_socket_block_cyclic.png differ diff --git a/Compendium_attachments/NvmeStorage/nvme.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/nvme.png similarity index 100% rename from Compendium_attachments/NvmeStorage/nvme.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/nvme.png diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/openmp.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/openmp.png new file mode 100644 index 0000000000000000000000000000000000000000..0cf284368f10bdd8c4a3b4c97530151e0142aad6 Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/openmp.png differ diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/part.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/part.png new file mode 100644 index 0000000000000000000000000000000000000000..e2b5418f622d3fa32ba2c6ce44889e84e4d1cddd Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/part.png differ diff --git a/Compendium_attachments/HardwareTaurus/smp2.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/smp2.png similarity index 100% rename from Compendium_attachments/HardwareTaurus/smp2.png rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/smp2.png diff --git 
a/doc.zih.tu-dresden.de/docs/jobs_and_resources/nvme_storage.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/nvme_storage.md index 40a0d6af3e6f62fe69a76fc01e806b63fa8dc9df..78b8175ccbba3fb0eee8be7b946ebe2bee31219b 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/nvme_storage.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/nvme_storage.md @@ -1,6 +1,5 @@ # NVMe Storage -**TODO image nvme.png** 90 NVMe storage nodes, each with - 8x Intel NVMe Datacenter SSD P4610, 3.2 TB @@ -11,3 +10,6 @@ - 64 GB RAM NVMe cards can saturate the HCAs + + +{: align=center} diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md index 67cc21a6cd4a4c68cbaec377151106bf63428b75..eb4aae34fc7850dbfa5d7d8f22628d569aa48f8c 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md @@ -1,38 +1,21 @@ # HPC Resources and Jobs -When log in to ZIH systems, you are placed on a *login node* **TODO** link to login nodes section -where you can [manage data life cycle](../data_lifecycle/overview.md), -[setup experiments](../data_lifecycle/experiments.md), execute short tests and compile moderate -projects. The login nodes cannot be used for real experiments and computations. Long and extensive -computational work and experiments have to be encapsulated into so called **jobs** and scheduled to -the compute nodes. - -<!--Login nodes which are using for login can not be used for your computations.--> -<!--To run software, do calculations and experiments, or compile your code compute nodes have to be used.--> - -ZIH uses the batch system Slurm for resource management and job scheduling. -<!--[HPC Introduction]**todo link** is a good resource to get started with it.--> - -??? 
note "Batch Job" - - In order to allow the batch scheduler an efficient job placement it needs these - specifications: - - * **requirements:** cores, memory per core, (nodes), additional resources (GPU), - * maximum run-time, - * HPC project (normally use primary group which gives id), - * who gets an email on which occasion, - - The runtime environment (see [here](../software/overview.md)) as well as the executable and - certain command-line arguments have to be specified to run the computational work. - -??? note "Batch System" - - The batch system is the central organ of every HPC system users interact with its compute - resources. The batch system finds an adequate compute system (partition/island) for your compute - jobs. It organizes the queueing and messaging, if all resources are in use. If resources are - available for your job, the batch system allocates and connects to these resources, transfers - run-time environment, and starts the job. +ZIH operates a high performance computing (HPC) system with more than 60.000 cores, 720 GPUs, and a +flexible storage hierarchy with about 16 PB total capacity. The HPC system provides an optimal +research environment especially in the area of data analytics and machine learning as well as for +processing extremely large data sets. Moreover it is also a perfect platform for highly scalable, +data-intensive and compute-intensive applications. + +With shared [login nodes](#login-nodes) and [filesystems](../data_lifecycle/file_systems.md) our +HPC system enables users to easily switch between [the components](hardware_overview.md), each +specialized for different application scenarios. + +When log in to ZIH systems, you are placed on a login node where you can +[manage data life cycle](../data_lifecycle/overview.md), +[setup experiments](../data_lifecycle/experiments.md), +execute short tests and compile moderate projects. The login nodes cannot be used for real +experiments and computations. 
Long and extensive computational work and experiments have to be +encapsulated into so called **jobs** and scheduled to the compute nodes. Follow the page [Slurm](slurm.md) for comprehensive documentation using the batch system at ZIH systems. There is also a page with extensive set of [Slurm examples](slurm_examples.md). @@ -49,9 +32,9 @@ a single GPU's core can handle is small), GPUs are not as versatile as CPUs. ### Available Hardware ZIH provides a broad variety of compute resources ranging from normal server CPUs of different -manufactures, to large shared memory nodes, GPU-assisted nodes up to highly specialized resources for +manufactures, large shared memory nodes, GPU-assisted nodes up to highly specialized resources for [Machine Learning](../software/machine_learning.md) and AI. -The page [Hardware Taurus](hardware_taurus.md) holds a comprehensive overview. +The page [ZIH Systems](hardware_overview.md) holds a comprehensive overview. The desired hardware can be specified by the partition `-p, --partition` flag in Slurm. The majority of the basic tasks can be executed on the conventional nodes like a Haswell. Slurm will @@ -60,19 +43,19 @@ automatically select a suitable partition depending on your memory and GPU requi ### Parallel Jobs **MPI jobs:** For MPI jobs typically allocates one core per task. Several nodes could be allocated -if it is necessary. Slurm will automatically find suitable hardware. Normal compute nodes are -perfect for this task. +if it is necessary. The batch system [Slurm](slurm.md) will automatically find suitable hardware. +Normal compute nodes are perfect for this task. **OpenMP jobs:** SMP-parallel applications can only run **within a node**, so it is necessary to -include the options `-N 1` and `-n 1`. Using `--cpus-per-task N` Slurm will start one task and you -will have N CPUs. The maximum number of processors for an SMP-parallel program is 896 on Taurus -([SMP]**todo link** island). 
+include the [batch system](slurm.md) options `-N 1` and `-n 1`. Using `--cpus-per-task N` Slurm will +start one task and you will have `N` CPUs. The maximum number of processors for an SMP-parallel +program is 896 on [partition `julia`](partitions_and_limits.md). **GPUs** partitions are best suited for **repetitive** and **highly-parallel** computing tasks. If -you have a task with potential [data parallelism]**todo link** most likely that you need the GPUs. -Beyond video rendering, GPUs excel in tasks such as machine learning, financial simulations and risk -modeling. Use the gpu2 and ml partition only if you need GPUs! Otherwise using the x86 partitions -(e.g Haswell) most likely would be more beneficial. +you have a task with potential [data parallelism](../software/gpu_programming.md) most likely that +you need the GPUs. Beyond video rendering, GPUs excel in tasks such as machine learning, financial +simulations and risk modeling. Use the partitions `gpu2` and `ml` only if you need GPUs! Otherwise +using the x86-based partitions most likely would be more beneficial. **Interactive jobs:** Slurm can forward your X11 credentials to the first node (or even all) for a job with the `--x11` option. To use an interactive job you have to specify `-X` flag for the ssh login. @@ -91,5 +74,34 @@ projects. The quality of this work influence on the computations. However, pre- in many cases can be done completely or partially on a local system and then transferred to ZIH systems. Please use ZIH systems primarily for the computation-intensive tasks. -<!--Useful links: [Batch Systems]**todo link**, [Hardware Taurus]**todo link**, [HPC-DA]**todo link**,--> -<!--[Slurm]**todo link**--> +## Exclusive Reservation of Hardware + +If you need for some special reasons, e.g., for benchmarking, a project or paper deadline, parts of +our machines exclusively, we offer the opportunity to request and reserve these parts for your +project. 
+ +Please send your request **7 working days** before the reservation should start (as that's our +maximum time limit for jobs and it is therefore not guaranteed that resources are available on +shorter notice) with the following information to the +[HPC +support](mailto:hpcsupport@zih.tu-dresden.de?subject=Request%20for%20a%20exclusive%20reservation%20of%20hardware&body=Dear%20HPC%20support%2C%0A%0AI%20ha +ve%20the%20following%20request%20for%20a%20exclusive%20reservation%20of%20hardware%3A%0A%0AProject%3A%0AReservation%20owner%3A%0ASystem%3A%0AHardware%20r +equirements%3A%0ATime%20window%3A%20%3C%5Byear%5D%3Amonth%3Aday%3Ahour%3Aminute%20-%20%5Byear%5D%3Amonth%3Aday%3Ahour%3Aminute%3E%0AReason%3A): + +- `Project:` *Which project will be credited for the reservation?* +- `Reservation owner:` *Who should be able to run jobs on the + reservation? I.e., name of an individual user or a group of users + within the specified project.* +- `System:` *Which machine should be used?* +- `Hardware requirements:` *How many nodes and cores do you need? Do + you have special requirements, e.g., minimum on main memory, + equipped with a graphic card, special placement within the network + topology?* +- `Time window:` *Begin and end of the reservation in the form + year:month:dayThour:minute:second e.g.: 2020-05-21T09:00:00* +- `Reason:` *Reason for the reservation.* + +!!! hint + + Please note that your project CPU hour budget will be credited for the reserved hardware even if + you don't use it. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md new file mode 100644 index 0000000000000000000000000000000000000000..edf5bae8582cff37ba5dca68d70c70a35438f341 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md @@ -0,0 +1,78 @@ +# Partitions, Memory and Run Time Limits + +There is no such thing as free lunch at ZIH systems. 
Since compute nodes are operated in multi-user +mode by default, jobs of several users can run at the same time on the very same node sharing +resources, like memory (but not CPU). On the other hand, a higher throughput can be achieved by +smaller jobs. Thus, restrictions w.r.t. [memory](#memory-limits) and +[runtime limits](#runtime-limits) have to be respected when submitting jobs. + +## Runtime Limits + +!!! note "Runtime limits are enforced." + + This means that a job will be canceled as soon as it exceeds its requested limit. Currently, the + maximum run time is 7 days. + +Shorter jobs come with multiple advantages: + +- lower risk of loss of computing time, +- shorter waiting time for scheduling, +- higher job fluctuation; thus, jobs with high priorities may start faster. + +To bring down the percentage of long-running jobs we restrict the number of cores with jobs longer +than 2 days to approximately 50% and with jobs longer than 24 hours to 75% of the total number of cores. +(These numbers are subject to changes.) As best practice we advise a run time of about 8h. + +!!! hint "Please always try to make a good estimation of your needed time limit." + + For this, you can use a command line like this to compare the requested timelimit with the + elapsed time for your completed jobs that started after a given date: + + ```console + marie@login$ sacct -X -S 2021-01-01 -E now --format=start,JobID,jobname,elapsed,timelimit -s COMPLETED + ``` + +Instead of running one long job, you should split it up into a chain job. Even applications that are +not capable of checkpoint/restart can be adapted. Please refer to the section +[Checkpoint/Restart](../jobs_and_resources/checkpoint_restart.md) for further documentation. + + +{: align="center"} + +## Memory Limits + +!!! note "Memory limits are enforced." + + This means that jobs which exceed their per-node memory limit will be killed automatically by + the batch system. 
+ +Memory requirements for your job can be specified via the `sbatch/srun` parameters: + +`--mem-per-cpu=<MB>` or `--mem=<MB>` (which is "memory per node"). The **default limit** is quite +low at **300 MB** per CPU. + +ZIH systems comprises different sets of nodes with different amount of installed memory which affect +where your job may be run. To achieve the shortest possible waiting time for your jobs, you should +be aware of the limits shown in the following table. + +??? hint "Partitions and memory limits" + + | Partition | Nodes | # Nodes | Cores per Node | MB per Core | MB per Node | GPUs per Node | + |:-------------------|:-----------------------------------------|:--------|:----------------|:------------|:------------|:------------------| + | `haswell64` | `taurusi[4001-4104,5001-5612,6001-6612]` | `1328` | `24` | `2541` | `61000` | `-` | + | `haswell128` | `taurusi[4105-4188]` | `84` | `24` | `5250` | `126000` | `-` | + | `haswell256` | `taurusi[4189-4232]` | `44` | `24` | `10583` | `254000` | `-` | + | `broadwell` | `taurusi[4233-4264]` | `32` | `28` | `2214` | `62000` | `-` | + | `smp2` | `taurussmp[3-7]` | `5` | `56` | `36500` | `2044000` | `-` | + | `gpu2` | `taurusi[2045-2106]` | `62` | `24` | `2583` | `62000` | `4 (2 dual GPUs)` | + | `gpu2-interactive` | `taurusi[2045-2108]` | `64` | `24` | `2583` | `62000` | `4 (2 dual GPUs)` | + | `hpdlf` | `taurusa[3-16]` | `14` | `12` | `7916` | `95000` | `3` | + | `ml` | `taurusml[1-32]` | `32` | `44 (HT: 176)` | `1443*` | `254000` | `6` | + | `romeo` | `taurusi[7001-7192]` | `192` | `128 (HT: 256)` | `1972*` | `505000` | `-` | + | `julia` | `taurussmp8` | `1` | `896` | `27343*` | `49000000` | `-` | + +!!! note + + The ML nodes have 4way-SMT, so for every physical core allocated (,e.g., with + `SLURM_HINT=nomultithread`), you will always get 4*1443 MB because the memory of the other + threads is allocated implicitly, too. 
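Reviewer note on the `ml` memory figures above: the 4-way SMT rule can be checked with a little shell arithmetic. This is only a sketch using the numbers from the partition table (1443 MB per logical core, 176 logical cores, 254000 MB per node). The variable names are made up for illustration and are not part of any Slurm interface.

```bash
#!/bin/bash
# Values copied from the partition table (partition `ml`); names are illustrative.
mb_per_logical_core=1443
smt_threads_per_core=4
logical_cores_per_node=176
mb_per_node=254000

# Memory implicitly allocated for one physical core (all 4 hardware threads).
mb_per_physical_core=$(( mb_per_logical_core * smt_threads_per_core ))

# Memory requested when every logical core of a node is used.
mb_all_logical_cores=$(( mb_per_logical_core * logical_cores_per_node ))

echo "per physical core: ${mb_per_physical_core} MB"
echo "all logical cores: ${mb_all_logical_cores} MB (node limit: ${mb_per_node} MB)"
```

Under these assumptions a request of `--mem-per-cpu=1443` for all logical cores stays just below the per-node limit listed in the table.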
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md index a6cdfba8bd47659bc3a14473cad74c10b73089d0..57ab511938f3eb515b9e38ca831e91cede692418 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md @@ -2,50 +2,48 @@ ## Hardware -- Slurm partiton: romeo -- Module architecture: rome -- 192 nodes taurusi[7001-7192], each: - - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, MultiThreading +- Slurm partition: `romeo` +- Module architecture: `rome` +- 192 nodes `taurusi[7001-7192]`, each: + - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, Simultaneous Multithreading (SMT) - 512 GB RAM - - 200 GB SSD disk mounted on /tmp + - 200 GB SSD disk mounted on `/tmp` ## Usage -There is a total of 128 physical cores in each -node. SMT is also active, so in total, 256 logical cores are available -per node. +There is a total of 128 physical cores in each node. SMT is also active, so in total, 256 logical +cores are available per node. !!! note - Multithreading is disabled per default in a job. To make use of it - include the Slurm parameter `--hint=multithread` in your job script - or command line, or set - the environment variable `SLURM_HINT=multithread` before job submission. -Each node brings 512 GB of main memory, so you can request roughly -1972MB per logical core (using --mem-per-cpu). Note that you will always -get the memory for the logical core sibling too, even if you do not -intend to use SMT. + Multithreading is disabled per default in a job. To make use of it include the Slurm parameter + `--hint=multithread` in your job script or command line, or set the environment variable + `SLURM_HINT=multithread` before job submission. + +Each node brings 512 GB of main memory, so you can request roughly 1972 MB per logical core (using +`--mem-per-cpu`). 
Note that you will always get the memory for the logical core sibling too, even if +you do not intend to use SMT. !!! note - If you are running a job here with only ONE process (maybe - multiple cores), please explicitly set the option `-n 1` ! -Be aware that software built with Intel compilers and `-x*` optimization -flags will not run on those AMD processors! That's why most older -modules built with intel toolchains are not available on **romeo**. + If you are running a job here with only ONE process (maybe multiple cores), please explicitly + set the option `-n 1`! + +Be aware that software built with Intel compilers and `-x*` optimization flags will not run on those +AMD processors! That's why most older modules built with Intel toolchains are not available on +partition `romeo`. -We provide the script: `ml_arch_avail` that you can use to check if a -certain module is available on rome architecture. +We provide the script `ml_arch_avail` that can be used to check if a certain module is available on +`rome` architecture. ## Example, running CP2K on Rome First, check what CP2K modules are available in general: `module load spider CP2K` or `module avail CP2K`. -You will see that there are several different CP2K versions avail, built -with different toolchains. Now let's assume you have to decided you want -to run CP2K version 6 at least, so to check if those modules are built -for rome, use: +You will see that there are several different CP2K versions avail, built with different toolchains. +Now let's assume you have to decided you want to run CP2K version 6 at least, so to check if those +modules are built for rome, use: ```console marie@login$ ml_arch_avail CP2K/6 @@ -55,13 +53,11 @@ CP2K/6.1-intel-2018a: sandy, haswell CP2K/6.1-intel-2018a-spglib: haswell ``` -There you will see that only the modules built with **foss** toolchain -are available on architecture "rome", not the ones built with **intel**. -So you can load e.g. `ml CP2K/6.1-foss-2019a`. 
+There you will see that only the modules built with toolchain `foss` are available on architecture +`rome`, not the ones built with `intel`. So you can load, e.g. `ml CP2K/6.1-foss-2019a`. -Then, when writing your batch script, you have to specify the **romeo** -partition. Also, if e.g. you wanted to use an entire ROME node (no SMT) -and fill it with MPI ranks, it could look like this: +Then, when writing your batch script, you have to specify the partition `romeo`. Also, if e.g. you +wanted to use an entire ROME node (no SMT) and fill it with MPI ranks, it could look like this: ```bash #!/bin/bash @@ -73,27 +69,26 @@ and fill it with MPI ranks, it could look like this: srun cp2k.popt input.inp ``` -## Using the Intel toolchain on Rome +## Using the Intel Toolchain on Rome -Currently, we have only newer toolchains starting at `intel/2019b` -installed for the Rome nodes. Even though they have AMD CPUs, you can -still use the Intel compilers on there and they don't even create -bad-performing code. When using the MKL up to version 2019, though, -you should set the following environment variable to make sure that AVX2 -is used: +Currently, we have only newer toolchains starting at `intel/2019b` installed for the Rome nodes. +Even though they have AMD CPUs, you can still use the Intel compilers on there and they don't even +create bad-performing code. When using the Intel Math Kernel Library (MKL) up to version 2019, +though, you should set the following environment variable to make sure that AVX2 is used: ```bash export MKL_DEBUG_CPU_TYPE=5 ``` -Without it, the MKL does a CPUID check and disables AVX2/FMA on -non-Intel CPUs, leading to much worse performance. +Without it, the MKL does a CPUID check and disables AVX2/FMA on non-Intel CPUs, leading to much +worse performance. + !!! note - In version 2020, Intel has removed this environment variable and added separate Zen - codepaths to the library. 
However, they are still incomplete and do not - cover every BLAS function. Also, the Intel AVX2 codepaths still seem to - provide somewhat better performance, so a new workaround would be to - overwrite the `mkl_serv_intel_cpu_true` symbol with a custom function: + + In version 2020, Intel has removed this environment variable and added separate Zen codepaths to + the library. However, they are still incomplete and do not cover every BLAS function. Also, the + Intel AVX2 codepaths still seem to provide somewhat better performance, so a new workaround + would be to overwrite the `mkl_serv_intel_cpu_true` symbol with a custom function: ```c int mkl_serv_intel_cpu_true() { @@ -108,13 +103,11 @@ marie@login$ gcc -shared -fPIC -o libfakeintel.so fakeintel.c marie@login$ export LD_PRELOAD=libfakeintel.so ``` -As for compiler optimization flags, `-xHOST` does not seem to produce -best-performing code in every case on Rome. You might want to try -`-mavx2 -fma` instead. +As for compiler optimization flags, `-xHOST` does not seem to produce best-performing code in every +case on Rome. You might want to try `-mavx2 -fma` instead. ### Intel MPI -We have seen only half the theoretical peak bandwidth via Infiniband -between two nodes, whereas OpenMPI got close to the peak bandwidth, so -you might want to avoid using Intel MPI on romeo if your application -heavily relies on MPI communication until this issue is resolved. +We have seen only half the theoretical peak bandwidth via Infiniband between two nodes, whereas +OpenMPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition +`romeo` if your application heavily relies on MPI communication until this issue is resolved. 
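Reviewer note on the `romeo` memory statement in this file: the "roughly 1972 MB per logical core" figure follows from the partition table (505000 MB usable per node, 256 logical cores with SMT). A minimal sketch of that arithmetic is given below. The variable names are illustrative, and the doubling for a physical core assumes the 2-way SMT described in the hardware section.

```bash
#!/bin/bash
# Values from the partition table (partition `romeo`); names are illustrative.
mb_per_node=505000
logical_cores_per_node=256
smt_threads_per_core=2

# Integer division gives the largest safe value for --mem-per-cpu.
mb_per_logical_core=$(( mb_per_node / logical_cores_per_node ))

# Without SMT you still get the memory of the sibling thread.
mb_per_physical_core=$(( mb_per_logical_core * smt_threads_per_core ))

echo "--mem-per-cpu=${mb_per_logical_core}"
echo "effective memory per physical core: ${mb_per_physical_core} MB"
```

The second value illustrates why a job that does not use SMT still accounts for the memory of both hardware threads of each core.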
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md index 04624da4e55fe3a32e3d41842622b38b3e176315..6816f97581303b41bd9247969f4e4c398932bfb1 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md @@ -1,24 +1,23 @@ -# Large shared-memory node - HPE Superdome Flex +# Large Shared-Memory Node - HPE Superdome Flex -- Hostname: taurussmp8 -- Access to all shared file systems -- Slurm partition `julia` -- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) -- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence - protocols) -- 370 TB of fast NVME storage available at `/nvme/<projectname>` +- Hostname: `taurussmp8` +- Access to all shared file systems +- Slurm partition `julia` +- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) +- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols) +- 370 TB of fast NVME storage available at `/nvme/<projectname>` -## Local temporary NVMe storage +## Local Temporary NVMe Storage There are 370 TB of NVMe devices installed. For immediate access for all projects, a volume of 87 TB -of fast NVMe storage is available at `/nvme/1/<projectname>`. For testing, we have set a quota of 100 -GB per project on this NVMe storage.This is +of fast NVMe storage is available at `/nvme/1/<projectname>`. For testing, we have set a quota of +100 GB per project on this NVMe storage. With a more detailed proposal on how this unique system (large shared memory + NVMe storage) can speed up their computations, a project's quota can be increased or dedicated volumes of up to the full capacity can be set up. -## Hints for usage +## Hints for Usage - granularity should be a socket (28 cores) - can be used for OpenMP applications with large memory demands @@ -35,5 +34,5 @@ full capacity can be set up. 
this unique system (large shared memory + NVMe storage) can speed up their computations, we will gladly increase this limit, for selected projects. -- Test users might have to clean-up their /nvme storage within 4 weeks +- Test users might have to clean-up their `/nvme` storage within 4 weeks to make room for large projects. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md index 0c4d3d92a25de40aa7ec887feeb08086081a5af3..d7c3530fad85643c4f814a02c6e3250df427af38 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md @@ -1,589 +1,405 @@ -# Slurm +# Batch System Slurm -The HRSK-II systems are operated with the batch system Slurm. Just specify the resources you need -in terms of cores, memory, and time and your job will be placed on the system. +When log in to ZIH systems, you are placed on a login node. There you can manage your +[data life cycle](../data_lifecycle/overview.md), +[setup experiments](../data_lifecycle/experiments.md), and +edit and prepare jobs. The login nodes are not suited for computational work! From the login nodes, +you can interact with the batch system, e.g., submit and monitor your jobs. -## Job Submission +??? note "Batch System" -Job submission can be done with the command: `srun [options] <command>` - -However, using srun directly on the shell will be blocking and launch an interactive job. Apart from -short test runs, it is recommended to launch your jobs into the background by using batch jobs. 
For -that, you can conveniently put the parameters directly in a job file which you can submit using -`sbatch [options] <job file>` - -Some options of `srun/sbatch` are: - -| slurm option | Description | -|:---------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| -n \<N> or --ntasks \<N> | set a number of tasks to N(default=1). This determines how many processes will be spawned by srun (for MPI jobs). | -| -N \<N> or --nodes \<N> | set number of nodes that will be part of a job, on each node there will be --ntasks-per-node processes started, if the option --ntasks-per-node is not given, 1 process per node will be started | -| --ntasks-per-node \<N> | how many tasks per allocated node to start, as stated in the line before | -| -c \<N> or --cpus-per-task \<N> | this option is needed for multithreaded (e.g. OpenMP) jobs, it tells Slurm to allocate N cores per task allocated; typically N should be equal to the number of threads you program spawns, e.g. 
it should be set to the same number as OMP_NUM_THREADS | -| -p \<name> or --partition \<name> | select the type of nodes where you want to execute your job, on Taurus we currently have haswell, `smp`, `sandy`, `west`, ml and `gpu` available | -| --mem-per-cpu \<name> | specify the memory need per allocated CPU in MB | -| --time \<HH:MM:SS> | specify the maximum runtime of your job, if you just put a single number in, it will be interpreted as minutes | -| --mail-user \<your email> | tell the batch system your email address to get updates about the status of the jobs | -| --mail-type ALL | specify for what type of events you want to get a mail; valid options beside ALL are: BEGIN, END, FAIL, REQUEUE | -| -J \<name> or --job-name \<name> | give your job a name which is shown in the queue, the name will also be included in job emails (but cut after 24 chars within emails) | -| --no-requeue | At node failure, jobs are requeued automatically per default. Use this flag to disable requeueing. | -| --exclusive | tell Slurm that only your job is allowed on the nodes allocated to this job; please be aware that you will be charged for all CPUs/cores on the node | -| -A \<project> | Charge resources used by this job to the specified project, useful if a user belongs to multiple projects. | -| -o \<filename> or --output \<filename> | \<p>specify a file name that will be used to store all normal output (stdout), you can use %j (job id) and %N (name of first node) to automatically adopt the file name to the job, per default stdout goes to "slurm-%j.out"\</p> \<p>%RED%NOTE:<span class="twiki-macro ENDCOLOR"></span> the target path of this parameter must be writeable on the compute nodes, i.e. 
it may not point to a read-only mounted file system like /projects.\</p> | -| -e \<filename> or --error \<filename> | \<p>specify a file name that will be used to store all error output (stderr), you can use %j (job id) and %N (name of first node) to automatically adopt the file name to the job, per default stderr goes to "slurm-%j.out" as well\</p> \<p>%RED%NOTE:<span class="twiki-macro ENDCOLOR"></span> the target path of this parameter must be writeable on the compute nodes, i.e. it may not point to a read-only mounted file system like /projects.\</p> | -| -a or --array | submit an array job, see the extra section below | -| -w \<node1>,\<node2>,... | restrict job to run on specific nodes only | -| -x \<node1>,\<node2>,... | exclude specific nodes from job | - -The following example job file shows how you can make use of sbatch - -```Bash -#!/bin/bash -#SBATCH --time=01:00:00 -#SBATCH --output=simulation-m-%j.out -#SBATCH --error=simulation-m-%j.err -#SBATCH --ntasks=512 -#SBATCH -A myproject - -echo Starting Program -``` + The batch system is the central organ of every HPC system users interact with its compute + resources. The batch system finds an adequate compute system (partition) for your compute jobs. + It organizes the queueing and messaging, if all resources are in use. If resources are available + for your job, the batch system allocates and connects to these resources, transfers runtime + environment, and starts the job. -During runtime, the environment variable SLURM_JOB_ID will be set to the id of your job. +??? note "Batch Job" -You can also use our [Slurm Batch File Generator]**todo** Slurmgenerator, which could help you create -basic Slurm job scripts. + At HPC systems, computational work and resource requirements are encapsulated into so-called + jobs. 
In order to allow the batch system an efficient job placement it needs these + specifications: -Detailed information on [memory limits on Taurus]**todo** + * requirements: number of nodes and cores, memory per core, additional resources (GPU) + * maximum run-time + * HPC project for accounting + * who gets an email on which occasion -### Interactive Jobs + Moreover, the [runtime environment](../software/overview.md) as well as the executable and + certain command-line arguments have to be specified to run the computational work. -Interactive activities like editing, compiling etc. are normally limited to the login nodes. For -longer interactive sessions you can allocate cores on the compute node with the command "salloc". It -takes the same options like `sbatch` to specify the required resources. +ZIH uses the batch system Slurm for resource management and job scheduling. +Just specify the resources you need in terms +of cores, memory, and time and Slurm will place your job on the system. -The difference to LSF is, that `salloc` returns a new shell on the node, where you submitted the -job. You need to use the command `srun` in front of the following commands to have these commands -executed on the allocated resources. If you allocate more than one task, please be aware that srun -will run the command on each allocated task! +This page provides a brief overview of -An example of an interactive session looks like: +* [Slurm options](#options) to specify resource requirements, +* how to submit [interactive](#interactive-jobs) and [batch jobs](#batch-jobs), +* how to [write job files](#job-files), +* how to [manage and control your jobs](#manage-and-control-jobs). -```Shell Session -tauruslogin3 /home/mark; srun --pty -n 1 -c 4 --time=1:00:00 --mem-per-cpu=1700 bash<br />srun: job 13598400 queued and waiting for resources<br />srun: job 13598400 has been allocated resources -taurusi1262 /home/mark; # start interactive work with e.g. 4 cores. 
-``` +If you are already familiar with Slurm, you might be more interested in our collection of +[job examples](slurm_examples.md). +There is also a ton of external resources regarding Slurm. We recommend these links for detailed +information: -**Note:** A dedicated partition `interactive` is reserved for short jobs (< 8h) with not more than -one job per user. Please check the availability of nodes there with `sinfo -p interactive` . +- [slurm.schedmd.com](https://slurm.schedmd.com/) provides the official documentation comprising + manual pages, tutorials, examples, etc. +- [Comparison with other batch systems](https://www.schedmd.com/slurmdocs/rosetta.html) -### Interactive X11/GUI Jobs +## Job Submission -Slurm will forward your X11 credentials to the first (or even all) node -for a job with the (undocumented) --x11 option. For example, an -interactive session for 1 hour with Matlab using eight cores can be -started with: +There are three basic Slurm commands for job submission and execution: -```Shell Session -module load matlab -srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first matlab -``` +1. `srun`: Submit a job for execution or initiate job steps in real time. +1. `sbatch`: Submit a batch script to Slurm for later execution. +1. `salloc`: Obtain a Slurm job allocation (a set of nodes), execute a command, and then release the + allocation when the command is finished. -**Note:** If you are getting the error: +Using `srun` directly on the shell will be blocking and launch an +[interactive job](#interactive-jobs). Apart from short test runs, it is recommended to submit your +jobs to Slurm for later execution by using [batch jobs](#batch-jobs). For that, you can conveniently +put the parameters directly in a [job file](#job-files) which you can submit using `sbatch [options] +<job file>`. 
-```Bash -srun: error: x11: unable to connect node taurusiXXXX -``` +During runtime, the environment variable `SLURM_JOB_ID` will be set to the id of your job. The job +id is unique. The id allows you to [manage and control](#manage-and-control-jobs) your jobs. -that probably means you still have an old host key for the target node in your `\~/.ssh/known_hosts` -file (e.g. from pre-SCS5). This can be solved either by removing the entry from your known_hosts or -by simply deleting the known_hosts file altogether if you don't have important other entries in it. - -### Requesting an Nvidia K20X / K80 / A100 - -Slurm will allocate one or many GPUs for your job if requested. Please note that GPUs are only -available in certain partitions, like `gpu2`, `gpu3` or `gpu2-interactive`. The option -for sbatch/srun in this case is `--gres=gpu:[NUM_PER_NODE]` (where `NUM_PER_NODE` can be `1`, 2 or -4, meaning that one, two or four of the GPUs per node will be used for the job). A sample job file -could look like this - -```Bash -#!/bin/bash -#SBATCH -A Project1 # account CPU time to Project1 -#SBATCH --nodes=2 # request 2 nodes<br />#SBATCH --mincpus=1 # allocate one task per node...<br />#SBATCH --ntasks=2 # ...which means 2 tasks in total (see note below) -#SBATCH --cpus-per-task=6 # use 6 threads per task -#SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task) -#SBATCH --time=01:00:00 # run for 1 hour -srun ./your/cuda/application # start you application (probably requires MPI to use both nodes) -``` +## Options -Please be aware that the partitions `gpu`, `gpu1` and `gpu2` can only be used for non-interactive -jobs which are submitted by `sbatch`. Interactive jobs (`salloc`, `srun`) will have to use the -partition `gpu-interactive`. Slurm will automatically select the right partition if the partition -parameter (-p) is omitted. 
+The following table holds the most important options for `srun/sbatch/salloc` to specify resource +requirements and control communication. -**Note:** Due to an unresolved issue concerning the Slurm job scheduling behavior, it is currently -not practical to use `--ntasks-per-node` together with GPU jobs. If you want to use multiple nodes, -please use the parameters `--ntasks` and `--mincpus` instead. The values of mincpus \* nodes has to -equal ntasks in this case. +??? tip "Options Table" -### Limitations of GPU job allocations + | Slurm Option | Description | + |:---------------------------|:------------| + | `-n, --ntasks=<N>` | number of (MPI) tasks (default: 1) | + | `-N, --nodes=<N>` | number of nodes; there will be `--ntasks-per-node` processes started on each node | + | `--ntasks-per-node=<N>` | number of tasks per allocated node to start (default: 1) | + | `-c, --cpus-per-task=<N>` | number of CPUs per task; needed for multithreaded (e.g. OpenMP) jobs; typically `N` should be equal to `OMP_NUM_THREADS` | + | `-p, --partition=<name>` | type of nodes where you want to execute your job (refer to [partitions](partitions_and_limits.md)) | + | `--mem-per-cpu=<size>` | memory need per allocated CPU in MB | + | `-t, --time=<HH:MM:SS>` | maximum runtime of the job | + | `--mail-user=<your email>` | get updates about the status of the jobs | + | `--mail-type=ALL` | for what type of events you want to get a mail; valid options: `ALL`, `BEGIN`, `END`, `FAIL`, `REQUEUE` | + | `-J, --job-name=<name>` | name of the job shown in the queue and in mails (cut after 24 chars) | + | `--no-requeue` | disable requeueing of the job in case of node failure (default: enabled) | + | `--exclusive` | exclusive usage of compute nodes; you will be charged for all CPUs/cores on the node | + | `-A, --account=<project>` | charge resources used by this job to the specified project | + | `-o, --output=<filename>` | file to save all normal output (stdout) (default: `slurm-%j.out`) | + | `-e, 
--error=<filename>` | file to save all error output (stderr) (default: `slurm-%j.out`) | + | `-a, --array=<arg>` | submit an array job ([examples](slurm_examples.md#array-jobs)) | + | `-w <node1>,<node2>,...` | restrict job to run on specific nodes only | + | `-x <node1>,<node2>,...` | exclude specific nodes from job | -The number of cores per node that are currently allowed to be allocated for GPU jobs is limited -depending on how many GPUs are being requested. On the K80 nodes, you may only request up to 6 -cores per requested GPU (8 per on the K20 nodes). This is because we do not wish that GPUs remain -unusable due to all cores on a node being used by a single job which does not, at the same time, -request all GPUs. +!!! note "Output and Error Files" -E.g., if you specify `--gres=gpu:2`, your total number of cores per node (meaning: ntasks \* -cpus-per-task) may not exceed 12 (on the K80 nodes) + When redirecting stdout and stderr into a file using `--output=<filename>` and + `--error=<filename>`, make sure the target path is writeable on the + compute nodes, i.e., it may not point to a read-only mounted + [filesystem](../data_lifecycle/overview.md) like `/projects`. -Note that this also has implications for the use of the --exclusive parameter. Since this sets the -number of allocated cores to 24 (or 16 on the K20X nodes), you also **must** request all four GPUs -by specifying --gres=gpu:4, otherwise your job will not start. In the case of --exclusive, it won't -be denied on submission, because this is evaluated in a later scheduling step. Jobs that directly -request too many cores per GPU will be denied with the error message: +!!! note "No free lunch" -```Shell Session -Batch job submission failed: Requested node configuration is not available -``` + Runtime and memory limits are enforced. Please refer to the section on [partitions and + limits](partitions_and_limits.md) for a detailed overview. 
-### Parallel Jobs +### Host List -For submitting parallel jobs, a few rules have to be understood and followed. In general, they -depend on the type of parallelization and architecture. - -#### OpenMP Jobs - -An SMP-parallel job can only run within a node, so it is necessary to include the options `-N 1` and -`-n 1`. The maximum number of processors for an SMP-parallel program is 488 on Venus and 56 on -taurus (smp island). Using --cpus-per-task N Slurm will start one task and you will have N CPUs -available for your job. An example job file would look like: - -```Bash -#!/bin/bash -#SBATCH -J Science1 -#SBATCH --nodes=1 -#SBATCH --tasks-per-node=1 -#SBATCH --cpus-per-task=8 -#SBATCH --mail-type=end -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=08:00:00 +If you want to place your job onto specific nodes, there are two options for doing this. Either use +`-p, --partition=<name>` to specify a host group aka [partition](partitions_and_limits.md) that fits +your needs. Or, use `-w, --nodelist=<host1,host2,..>` with a list of hosts that will work for you. -export OMP_NUM_THREADS=8 -./path/to/binary -``` +## Interactive Jobs -#### MPI Jobs +Interactive activities like editing, compiling, preparing experiments etc. are normally limited to +the login nodes. For longer interactive sessions you can allocate cores on the compute node with the +command `salloc`. It takes the same options as `sbatch` to specify the required resources. -For MPI jobs one typically allocates one core per task that has to be started. **Please note:** -There are different MPI libraries on Taurus and Venus, so you have to compile the binaries -specifically for their target. +`salloc` returns a new shell on the node where you submitted the job. You need to use the command +`srun` in front of the following commands to have these commands executed on the allocated +resources. If you allocate more than one task, please be aware that `srun` will run the command on +each allocated task! 
-```Bash -#!/bin/bash -#SBATCH -J Science1 -#SBATCH --ntasks=864 -#SBATCH --mail-type=end -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=08:00:00 +The syntax for submitting a job is -srun ./path/to/binary +``` +marie@login$ srun [options] <command> ``` -#### Multiple Programs Running Simultaneously in a Job - -In this short example, our goal is to run four instances of a program concurrently in a **single** -batch script. Of course we could also start a batch script four times with sbatch but this is not -what we want to do here. Please have a look at [Running Multiple GPU Applications Simultaneously in -a Batch Job] todo Compendium.RunningNxGpuAppsInOneJob in case you intend to run GPU programs -simultaneously in a **single** job. - -```Bash -#!/bin/bash -#SBATCH -J PseudoParallelJobs -#SBATCH --ntasks=4 -#SBATCH --cpus-per-task=1 -#SBATCH --mail-type=end -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=01:00:00 +An example of an interactive session looks like: -# The following sleep command was reported to fix warnings/errors with srun by users (feel free to uncomment). -#sleep 5 -srun --exclusive --ntasks=1 ./path/to/binary & +```console +marie@login$ srun --pty -n 1 -c 4 --time=1:00:00 --mem-per-cpu=1700 bash +marie@login$ srun: job 13598400 queued and waiting for resources +marie@login$ srun: job 13598400 has been allocated resources +marie@compute$ # Now, you can start interactive work with e.g. 4 cores +``` -#sleep 5 -srun --exclusive --ntasks=1 ./path/to/binary & +!!! note "Partition `interactive`" -#sleep 5 -srun --exclusive --ntasks=1 ./path/to/binary & + A dedicated partition `interactive` is reserved for short jobs (< 8h) with not more than one job + per user. Please check the availability of nodes there with `sinfo -p interactive`. -#sleep 5 -srun --exclusive --ntasks=1 ./path/to/binary & +### Interactive X11/GUI Jobs -echo "Waiting for parallel job steps to complete..." -wait -echo "All parallel job steps completed!" 
-``` +Slurm will forward your X11 credentials to the first (or even all) node for a job with the +(undocumented) `--x11` option. For example, an interactive session for one hour with Matlab using +eight cores can be started with: -### Exclusive Jobs for Benchmarking - -Jobs on taurus run, by default, in shared-mode, meaning that multiple jobs can run on the same -compute nodes. Sometimes, this behaviour is not desired (e.g. for benchmarking purposes), in which -case it can be turned off by specifying the Slurm parameter: `--exclusive` . - -Setting `--exclusive` **only** makes sure that there will be **no other jobs running on your nodes**. -It does not, however, mean that you automatically get access to all the resources which the node -might provide without explicitly requesting them, e.g. you still have to request a GPU via the -generic resources parameter (gres) to run on the GPU partitions, or you still have to request all -cores of a node if you need them. CPU cores can either to be used for a task (`--ntasks`) or for -multi-threading within the same task (--cpus-per-task). Since those two options are semantically -different (e.g., the former will influence how many MPI processes will be spawned by 'srun' whereas -the latter does not), Slurm cannot determine automatically which of the two you might want to use. -Since we use cgroups for separation of jobs, your job is not allowed to use more resources than -requested.* - -If you just want to use all available cores in a node, you have to -specify how Slurm should organize them, like with \<span>"-p haswell -c -24\</span>" or "\<span>-p haswell --ntasks-per-node=24". 
\</span> - -Here is a short example to ensure that a benchmark is not spoiled by -other jobs, even if it doesn't use up all resources in the nodes: - -```Bash -#!/bin/bash -#SBATCH -J Benchmark -#SBATCH -p haswell -#SBATCH --nodes=2 -#SBATCH --ntasks-per-node=2 -#SBATCH --cpus-per-task=8 -#SBATCH --exclusive # ensure that nobody spoils my measurement on 2 x 2 x 8 cores -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=00:10:00 - -srun ./my_benchmark +```console +marie@login$ module load matlab +marie@login$ srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first matlab ``` -### Array Jobs - -Array jobs can be used to create a sequence of jobs that share the same executable and resource -requirements, but have different input files, to be submitted, controlled, and monitored as a single -unit. The arguments `-a` or `--array` take an additional parameter that specify the array indices. -Within the job you can read the environment variables `SLURM_ARRAY_JOB_ID`, which will be set to the -first job ID of the array, and `SLURM_ARRAY_TASK_ID`, which will be set individually for each step. - -Within an array job, you can use %a and %A in addition to %j and %N -(described above) to make the output file name specific to the job. %A -will be replaced by the value of SLURM_ARRAY_JOB_ID and %a will be -replaced by the value of SLURM_ARRAY_TASK_ID. - -Here is an example of how an array job can look like: - -```Bash -#!/bin/bash -#SBATCH -J Science1 -#SBATCH --array 0-9 -#SBATCH -o arraytest-%A_%a.out -#SBATCH -e arraytest-%A_%a.err -#SBATCH --ntasks=864 -#SBATCH --mail-type=end -#SBATCH --mail-user=your.name@tu-dresden.de -#SBATCH --time=08:00:00 - -echo "Hi, I am step $SLURM_ARRAY_TASK_ID in this array job $SLURM_ARRAY_JOB_ID" -``` +!!! 
hint "X11 error" -**Note:** If you submit a large number of jobs doing heavy I/O in the Lustre file systems you should -limit the number of your simultaneously running job with a second parameter like: + If you are getting the error: -```Bash -#SBATCH --array=1-100000%100 -``` + ```Bash + srun: error: x11: unable to connect node taurusiXXXX + ``` -For further details please read the Slurm documentation at -(https://slurm.schedmd.com/sbatch.html) - -### Chain Jobs - -You can use chain jobs to create dependencies between jobs. This is often the case if a job relies -on the result of one or more preceding jobs. Chain jobs can also be used if the runtime limit of the -batch queues is not sufficient for your job. Slurm has an option `-d` or "--dependency" that allows -to specify that a job is only allowed to start if another job finished. - -Here is an example of how a chain job can look like, the example submits 4 jobs (described in a job -file) that will be executed one after each other with different CPU numbers: - -```Bash -#!/bin/bash -TASK_NUMBERS="1 2 4 8" -DEPENDENCY="" -JOB_FILE="myjob.slurm" - -for TASKS in $TASK_NUMBERS ; do - JOB_CMD="sbatch --ntasks=$TASKS" - if [ -n "$DEPENDENCY" ] ; then - JOB_CMD="$JOB_CMD --dependency afterany:$DEPENDENCY" - fi - JOB_CMD="$JOB_CMD $JOB_FILE" - echo -n "Running command: $JOB_CMD " - OUT=`$JOB_CMD` - echo "Result: $OUT" - DEPENDENCY=`echo $OUT | awk '{print $4}'` -done -``` + that probably means you still have an old host key for the target node in your + `~/.ssh/known_hosts` file (e.g. from pre-SCS5). This can be solved either by removing the entry + from your `known_hosts` file or by simply deleting the `known_hosts` file altogether if you don't have + other important entries in it. -### Binding and Distribution of Tasks +## Batch Jobs -The Slurm provides several binding strategies to place and bind the tasks and/or threads of your job -to cores, sockets and nodes.
Note: Keep in mind that the distribution method has a direct impact on -the execution time of your application. The manipulation of the distribution can either speed up or -slow down your application. More detailed information about the binding can be found -[here](binding_and_distribution_of_tasks.md). +Working interactively using `srun` and `salloc` is a good starting point for testing and compiling. +But, as soon as you leave the testing stage, we highly recommend using batch jobs. +Batch jobs are encapsulated within [job files](#job-files) and submitted to the batch system using +`sbatch` for later execution. A job file is basically a script holding the resource requirements, +environment settings and the commands for executing the application. Using batch jobs and job files +has multiple advantages: -The default allocation of the tasks/threads for OpenMP, MPI and Hybrid (MPI and OpenMP) are as -follows. +* You can reproduce your experiments and work, because all steps are saved in a file. +* You can easily share your settings and experimental setup with colleagues. +* You can submit your job file to the scheduling system for later execution. In the meantime, you can grab + a coffee and proceed with other work (e.g., start writing a paper). -#### OpenMP +!!! hint "The syntax for submitting a job file to Slurm is" -The illustration below shows the default binding of a pure OpenMP-job on 1 node with 16 cpus on -which 16 threads are allocated. + ```console + marie@login$ sbatch [options] <job_file> + ``` -```Bash -#!/bin/bash -#SBATCH --nodes=1 -#SBATCH --tasks-per-node=1 -#SBATCH --cpus-per-task=16 +### Job Files -export OMP_NUM_THREADS=16 +Job files have to be written with the following structure.
-srun --ntasks 1 --cpus-per-task $OMP_NUM_THREADS ./application -``` +```bash +#!/bin/bash # Batch script starts with shebang line -\<img alt="" -src="data:;base64," -/> +#SBATCH --ntasks=24 # All #SBATCH lines have to follow uninterrupted +#SBATCH --time=01:00:00 # after the shebang line +#SBATCH --account=<KTR> # Comments start with # and do not count as interruptions +#SBATCH --job-name=fancyExp +#SBATCH --output=simulation-%j.out +#SBATCH --error=simulation-%j.err -#### MPI +module purge # Set up environment, e.g., clean modules environment +module load <modules> # and load necessary modules -The illustration below shows the default binding of a pure MPI-job. In -which 32 global ranks are distributed onto 2 nodes with 16 cores each. -Each rank has 1 core assigned to it. +srun ./application [options] # Execute parallel application with srun +``` -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=16 -#SBATCH --cpus-per-task=1 +The following two examples show the basic resource specifications for a pure OpenMP application and +a pure MPI application, respectively. Within the section [Job Examples](slurm_examples.md) we +provide a comprehensive collection of job examples. -srun --ntasks 32 ./application -``` +???
example "Job file OpenMP" -\<img alt="" -src="data:;base64," -/> + ```bash + #!/bin/bash -#### Hybrid (MPI and OpenMP) + #SBATCH --nodes=1 + #SBATCH --tasks-per-node=1 + #SBATCH --cpus-per-task=64 + #SBATCH --time=01:00:00 + #SBATCH --account=<account> -In the illustration below the default binding of a Hybrid-job is shown. -In which 8 global ranks are distributed onto 2 nodes with 16 cores each. -Each rank has 4 cores assigned to it. + module purge + module load <modules> -```Bash -#!/bin/bash -#SBATCH --nodes=2 -#SBATCH --tasks-per-node=4 -#SBATCH --cpus-per-task=4 + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + srun ./path/to/openmp_application + ``` -export OMP_NUM_THREADS=4 + * Submission: `marie@login$ sbatch batch_script.sh` + * Run with fewer CPUs: `marie@login$ sbatch -c 14 batch_script.sh` -srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application -``` +??? example "Job file MPI" -\<img alt="" -src="data:;base64," -/> + ```bash + #!/bin/bash -### Node Features for Selective Job Submission + #SBATCH --ntasks=64 + #SBATCH --time=01:00:00 + #SBATCH --account=<account> -The nodes in our HPC system are becoming more diverse in multiple aspects: hardware, mounted -storage, software. The system administrators can describe the set of properties and it is up to the -user to specify her/his requirements. These features should be thought of as changing over time -(e.g. a file system get stuck on a certain node). + module purge + module load <modules> -A feature can be used with the Slurm option `--constrain` or `-C` like -`srun -C fs_lustre_scratch2 ...` with `srun` or `sbatch`. Combinations like -`--constraint="fs_beegfs_global0`are allowed. For a detailed description of the possible -constraints, please refer to the Slurm documentation (<https://slurm.schedmd.com/srun.html>). + srun ./path/to/mpi_application + ``` -**Remark:** A feature is checked only for scheduling. Running jobs are not affected by changing -features.
+ * Submission: `marie@login$ sbatch batch_script.sh` + * Run with fewer MPI tasks: `marie@login$ sbatch --ntasks 14 batch_script.sh` -### Available features on Taurus +## Manage and Control Jobs -| Feature | Description | -|:--------|:-------------------------------------------------------------------------| -| DA | subset of Haswell nodes with a high bandwidth to NVMe storage (island 6) | +### Job and Slurm Monitoring -#### File system features +On the command line, use `squeue` to watch the scheduling queue. This command shows the reason +why a job is not running (job status in the last column of the output). More information about job +parameters can also be determined with `scontrol -d show job <jobid>`. The following table holds +detailed descriptions of the possible reasons: + +??? tip "Reason Table" + + | Reason | Long Description | + |:-------------------|:------------------| + | `Dependency` | This job is waiting for a dependent job to complete. | + | `None` | No reason is set for this job. | + | `PartitionDown` | The partition required by this job is in a down state. | + | `PartitionNodeLimit` | The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are down or drained. | + | `PartitionTimeLimit` | The job's time limit exceeds its partition's current time limit. | + | `Priority` | One or more higher priority jobs exist for this partition. | + | `Resources` | The job is waiting for resources to become available. | + | `NodeDown` | A node required by the job is down. | + | `BadConstraints` | The job's constraints cannot be satisfied. | + | `SystemFailure` | Failure of the Slurm system, a filesystem, the network, etc. | + | `JobLaunchFailure` | The job could not be launched. This may be due to a filesystem problem, invalid program name, etc. | + | `NonZeroExitCode` | The job terminated with a non-zero exit code. | + | `TimeLimit` | The job exhausted its time limit.
| + | `InactiveLimit` | The job reached the system inactive limit. | -A feature `fs_*` is active if a certain file system is mounted and available on a node. Access to -these file systems are tested every few minutes on each node and the Slurm features set accordingly. +In addition, the `sinfo` command gives you a quick status overview. -| Feature | Description | -|:-------------------|:---------------------------------------------------------------------| -| fs_lustre_scratch2 | `/scratch` mounted read-write (the OS mount point is `/lustre/scratch2)` | -| fs_lustre_ssd | `/lustre/ssd` mounted read-write | -| fs_warm_archive_ws | `/warm_archive/ws` mounted read-only | -| fs_beegfs_global0 | `/beegfs/global0` mounted read-write | +For detailed information on why your submitted job has not started yet, you can use the command -For certain projects, specific file systems are provided. For those, -additional features are available, like `fs_beegfs_<projectname>`. +```console +marie@login$ whypending <jobid> +``` -## Editing Jobs +### Editing Jobs Jobs that have not yet started can be altered. Using `scontrol update timelimit=4:00:00 -jobid=<jobid>` is is for example possible to modify the maximum runtime. scontrol understands many -different options, please take a look at the man page for more details. +jobid=<jobid>` it is for example possible to modify the maximum runtime. `scontrol` understands many +different options, please take a look at the [man page](https://slurm.schedmd.com/scontrol.html) for +more details. -## Job and Slurm Monitoring +### Canceling Jobs -On the command line, use `squeue` to watch the scheduling queue. This command will tell the reason, -why a job is not running (job status in the last column of the output). 
More information about job -parameters can also be determined with `scontrol -d show job <jobid>` Here are detailed descriptions -of the possible job status: - -| Reason | Long description | -|:-------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------| -| Dependency | This job is waiting for a dependent job to complete. | -| None | No reason is set for this job. | -| PartitionDown | The partition required by this job is in a DOWN state. | -| PartitionNodeLimit | The number of nodes required by this job is outside of its partitions current limits. Can also indicate that required nodes are DOWN or DRAINED. | -| PartitionTimeLimit | The jobs time limit exceeds its partitions current time limit. | -| Priority | One or higher priority jobs exist for this partition. | -| Resources | The job is waiting for resources to become available. | -| NodeDown | A node required by the job is down. | -| BadConstraints | The jobs constraints can not be satisfied. | -| SystemFailure | Failure of the Slurm system, a file system, the network, etc. | -| JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. | -| NonZeroExitCode | The job terminated with a non-zero exit code. | -| TimeLimit | The job exhausted its time limit. | -| InactiveLimit | The job reached the system InactiveLimit. | +The command `scancel <jobid>` kills a single job and removes it from the queue. By using `scancel -u +<username>` you can send a canceling signal to all of your jobs at once. -In addition, the `sinfo` command gives you a quick status overview. +### Accounting + +The Slurm command `sacct` provides job statistics like memory usage, CPU time, energy usage etc. -For detailed information on why your submitted job has not started yet, you can use: `whypending -<jobid>`. +!!! 
hint "Learn from old jobs" -## Accounting + We highly encourage you to use `sacct` to learn from your previous jobs in order to better + estimate the requirements, e.g., runtime, for future jobs. -The Slurm command `sacct` provides job statistics like memory usage, CPU -time, energy usage etc. Examples: +`sacct` outputs the following fields by default. -```Shell Session +```console # show all own jobs contained in the accounting database -sacct -# show specific job -sacct -j <JOBID> -# specify fields -sacct -j <JOBID> -o JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy -# show all fields -sacct -j <JOBID> -o ALL +marie@login$ sacct + JobID JobName Partition Account AllocCPUS State ExitCode +------------ ---------- ---------- ---------- ---------- ---------- -------- +[...] ``` -Read the manpage (`man sacct`) for information on the provided fields. +We would like to draw your attention to the following options to gain insight into your jobs. -Note that sacct by default only shows data of the last day. If you want -to look further into the past without specifying an explicit job id, you -need to provide a startdate via the **-S** or **--starttime** parameter, -e.g +??? example "Show specific job" -```Shell Session -# show all jobs since the beginning of year 2020: -sacct -S 2020-01-01 -``` + ```console + marie@login$ sacct -j <JOBID> + ``` -## Killing jobs +??? example "Show all fields for a specific job" -The command `scancel <jobid>` kills a single job and removes it from the queue. By using `scancel -u -<username>` you are able to kill all of your jobs at once. + ```console + marie@login$ sacct -j <JOBID> -o All + ``` -## Host List +??? example "Show specific fields" -If you want to place your job onto specific nodes, there are two options for doing this. Either use -`-p` to specify a host group that fits your needs. Or, use `-w` or (`--nodelist`) with a name node -nodes that will work for you.
+ ```console + marie@login$ sacct -j <JOBID> -o JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy + ``` -## Job Profiling +The manual page (`man sacct`) and the [online reference](https://slurm.schedmd.com/sacct.html) +provide a comprehensive documentation regarding available fields and formats. -\<a href="%ATTACHURL%/hdfview_memory.png"> \<img alt="" height="272" -src="%ATTACHURL%/hdfview_memory.png" style="float: right; margin-left: -10px;" title="hdfview" width="324" /> \</a> +!!! hint "Time span" -Slurm offers the option to gather profiling data from every task/node of the job. Following data can -be gathered: + By default, `sacct` only shows data of the last day. If you want to look further into the past + without specifying an explicit job id, you need to provide a start date via the `-S` option. + A certain end date is also possible via `-E`. -- Task data, such as CPU frequency, CPU utilization, memory - consumption (RSS and VMSize), I/O -- Energy consumption of the nodes -- Infiniband data (currently deactivated) -- Lustre filesystem data (currently deactivated) +??? example "Show all jobs since the beginning of year 2021" -The data is sampled at a fixed rate (i.e. every 5 seconds) and is stored in a HDF5 file. + ```console + marie@login$ sacct -S 2021-01-01 [-E now] + ``` -**CAUTION**: Please be aware that the profiling data may be quiet large, depending on job size, -runtime, and sampling rate. Always remove the local profiles from -`/lustre/scratch2/profiling/${USER}`, either by running sh5util as shown above or by simply removing -those files. +## Jobs at Reservations -Usage examples: +How to ask for a reservation is described in the section +[reservations](overview.md#exclusive-reservation-of-hardware). +After we agreed with your requirements, we will send you an e-mail with your reservation name. 
Then +you can see more information about your reservation with the following command: -```Shell Session -# create energy and task profiling data (--acctg-freq is the sampling rate in seconds) -srun --profile=All --acctg-freq=5,energy=5 -n 32 ./a.out -# create task profiling data only -srun --profile=All --acctg-freq=5 -n 32 ./a.out +```console +marie@login$ scontrol show res=<reservation name> +# e.g. scontrol show res=hpcsupport_123 +``` -# merge the node local files in /lustre/scratch2/profiling/${USER} to single file -# (without -o option output file defaults to job_<JOBID>.h5) -sh5util -j <JOBID> -o profile.h5 -# in jobscripts or in interactive sessions (via salloc): -sh5util -j ${SLURM_JOBID} -o profile.h5 +If you want to use your reservation, you have to add the parameter +`--reservation=<reservation name>` either in your `sbatch` script or to your `srun` or `salloc` command. -# view data: -module load HDFView -hdfview.sh profile.h5 -``` +## Node Features for Selective Job Submission -More information about profiling with Slurm: +The nodes in our HPC system are becoming more diverse in multiple aspects: hardware, mounted +storage, software. The system administrators can describe the set of properties and it is up to the +user to specify her/his requirements. These features should be thought of as changing over time +(e.g., a filesystem gets stuck on a certain node). -- [Slurm Profiling](http://slurm.schedmd.com/hdf5_profile_user_guide.html) -- [sh5util](http://slurm.schedmd.com/sh5util.html) +A feature can be used with the Slurm option `--constraint` or `-C`, like +`srun -C fs_lustre_scratch2 ...`, with `srun` or `sbatch`. Combinations like +`--constraint="fs_beegfs_global0"` are allowed. For a detailed description of the possible +constraints, please refer to the [Slurm documentation](https://slurm.schedmd.com/srun.html). -## Reservations +!!!
hint -If you want to run jobs, which specifications are out of our job limitations, you could -[ask for a reservation](mailto:hpcsupport@zih.tu-dresden.de). Please add the following information -to your request mail: + A feature is checked only for scheduling. Running jobs are not affected by changing features. -- start time (please note, that the start time have to be later than - the day of the request plus 7 days, better more, because the longest - jobs run 7 days) -- duration or end time -- account -- node count or cpu count -- partition +### Available Features -After we agreed with your requirements, we will send you an e-mail with your reservation name. Then -you could see more information about your reservation with the following command: +| Feature | Description | +|:--------|:-------------------------------------------------------------------------| +| DA | subset of Haswell nodes with a high bandwidth to NVMe storage (island 6) | -```Shell Session -scontrol show res=<reservation name> -# e.g. scontrol show res=hpcsupport_123 -``` +#### Filesystem Features -If you want to use your reservation, you have to add the parameter `--reservation=<reservation -name>` either in your sbatch script or to your `srun` or `salloc` command. +A feature `fs_*` is active if a certain filesystem is mounted and available on a node. Access to +these filesystems is tested every few minutes on each node and the Slurm features are set accordingly.
-## Slurm External Links +| Feature | Description | +|:-------------------|:---------------------------------------------------------------------| +| `fs_lustre_scratch2` | `/scratch` mounted read-write (mount point is `/lustre/scratch2`) | +| `fs_lustre_ssd` | `/lustre/ssd` mounted read-write | +| `fs_warm_archive_ws` | `/warm_archive/ws` mounted read-only | +| `fs_beegfs_global0` | `/beegfs/global0` mounted read-write | -- Manpages, tutorials, examples, etc: (http://slurm.schedmd.com/) -- Comparison with other batch systems: (http://www.schedmd.com/slurmdocs/rosetta.html) +For certain projects, specific filesystems are provided. For those, +additional features are available, like `fs_beegfs_<projectname>`. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md index 187bd7cf82651718fb0b188edfa0c95f33621b20..5f4c9b50c6b7d64474609d3f1753571bb9ec0999 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md @@ -1,5 +1,358 @@ -# SlurmExamples +# Job Examples -## Array-Job with Afterok-Dependency and DataMover Usage +## Parallel Jobs -TODO +For submitting parallel jobs, a few rules have to be understood and followed. In general, they +depend on the type of parallelization and architecture. + +### OpenMP Jobs + +An SMP-parallel job can only run within a node, so it is necessary to include the options `-N 1` and +`-n 1`. The maximum number of processors for an SMP-parallel program is 896 and 56 on partitions +`taurussmp8` and `smp2`, respectively. Please refer to the +[partitions section](partitions_and_limits.md#memory-limits) for up-to-date information. Using the +option `--cpus-per-task=<N>` Slurm will start one task and you will have `N` CPUs available for your +job. An example job file would look like: + +!!!
example "Job file for OpenMP application" + + ```Bash + #!/bin/bash + #SBATCH --nodes=1 + #SBATCH --tasks-per-node=1 + #SBATCH --cpus-per-task=8 + #SBATCH --time=08:00:00 + #SBATCH -J Science1 + #SBATCH --mail-type=end + #SBATCH --mail-user=your.name@tu-dresden.de + + export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK + ./path/to/binary + ``` + +### MPI Jobs + +For MPI-parallel jobs one typically allocates one core per task that has to be started. + +!!! warning "MPI libraries" + + There are different MPI libraries on ZIH systems for the different microarchitectures. Thus, + you have to compile the binaries specifically for the target architecture and partition. Please + refer to the sections [building software](../software/building_software.md) and + [module environments](../software/runtime_environment.md#module-environments) for detailed + information. + +!!! example "Job file for MPI application" + + ```Bash + #!/bin/bash + #SBATCH --ntasks=864 + #SBATCH --time=08:00:00 + #SBATCH -J Science1 + #SBATCH --mail-type=end + #SBATCH --mail-user=your.name@tu-dresden.de + + srun ./path/to/binary + ``` + +### Multiple Programs Running Simultaneously in a Job + +In this short example, our goal is to run four instances of a program concurrently in a **single** +batch script. Of course, we could also start a batch script four times with `sbatch`, but this is not +what we want to do here. Please have a look at +[this subsection](#running-multiple-gpu-applications-simultaneously-in-a-batch-job) +in case you intend to run GPU programs simultaneously in a **single** job. + +!!! example " " + + ```Bash + #!/bin/bash + #SBATCH --ntasks=4 + #SBATCH --cpus-per-task=1 + #SBATCH --time=01:00:00 + #SBATCH -J PseudoParallelJobs + #SBATCH --mail-type=end + #SBATCH --mail-user=your.name@tu-dresden.de + + # The following sleep command was reported to fix warnings/errors with srun by users (feel free to uncomment).
+ #sleep 5 + srun --exclusive --ntasks=1 ./path/to/binary & + + #sleep 5 + srun --exclusive --ntasks=1 ./path/to/binary & + + #sleep 5 + srun --exclusive --ntasks=1 ./path/to/binary & + + #sleep 5 + srun --exclusive --ntasks=1 ./path/to/binary & + + echo "Waiting for parallel job steps to complete..." + wait + echo "All parallel job steps completed!" + ``` + +## Requesting GPUs + +Slurm will allocate one or more GPUs for your job if requested. Please note that GPUs are only +available in certain partitions, like `gpu2`, `gpu3` or `gpu2-interactive`. The option +for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NODE]` (where `NUM_PER_NODE` can be `1`, `2` or +`4`, meaning that one, two or four of the GPUs per node will be used for the job). + +!!! example "Job file to request a GPU" + + ```Bash + #!/bin/bash + #SBATCH --nodes=2 # request 2 nodes + #SBATCH --mincpus=1 # allocate one task per node... + #SBATCH --ntasks=2 # ...which means 2 tasks in total (see note below) + #SBATCH --cpus-per-task=6 # use 6 threads per task + #SBATCH --gres=gpu:1 # use 1 GPU per node (i.e. use one GPU per task) + #SBATCH --time=01:00:00 # run for 1 hour + #SBATCH -A Project1 # account CPU time to Project1 + + srun ./your/cuda/application # start your application (probably requires MPI to use both nodes) + ``` + +Please be aware that the partitions `gpu`, `gpu1` and `gpu2` can only be used for non-interactive +jobs which are submitted by `sbatch`. Interactive jobs (`salloc`, `srun`) will have to use the +partition `gpu-interactive`. Slurm will automatically select the right partition if the partition +parameter `-p, --partition` is omitted. + +!!! note + + Due to an unresolved issue concerning the Slurm job scheduling behavior, it is currently not + practical to use `--ntasks-per-node` together with GPU jobs. If you want to use multiple nodes, + please use the parameters `--ntasks` and `--mincpus` instead. The product `mincpus`*`nodes` + has to equal `ntasks` in this case.
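The rule from the note above can be sketched as a small helper, assuming you want to derive the option triple from the node count and the number of tasks per node. The function name `gpu_job_options` is hypothetical and not part of Slurm; it only prints options that already appear in the job file above.

```bash
#!/bin/bash
# Hypothetical helper (not a Slurm command): prints --nodes, --mincpus and
# --ntasks so that ntasks = mincpus * nodes always holds.
gpu_job_options() {
    local nodes=$1            # number of nodes to request
    local tasks_per_node=$2   # passed to --mincpus, as in the job file above
    local ntasks=$(( nodes * tasks_per_node ))
    echo "--nodes=${nodes} --mincpus=${tasks_per_node} --ntasks=${ntasks}"
}

gpu_job_options 2 1   # prints: --nodes=2 --mincpus=1 --ntasks=2
```

The printed options could then be passed on the command line, e.g. `sbatch $(gpu_job_options 2 1) --gres=gpu:1 <job file>`, instead of hard-coding the three values in the job file.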
+ +### Limitations of GPU Job Allocations + +The number of cores per node that are currently allowed to be allocated for GPU jobs is limited +depending on how many GPUs are being requested. On the K80 nodes, you may only request up to 6 +cores per requested GPU (8 per GPU on the K20X nodes). This is because we do not wish that GPUs remain +unusable due to all cores on a node being used by a single job which does not, at the same time, +request all GPUs. + +E.g., if you specify `--gres=gpu:2`, your total number of cores per node (meaning: +`ntasks`*`cpus-per-task`) may not exceed 12 (on the K80 nodes). + +Note that this also has implications for the use of the `--exclusive` parameter. Since this sets the +number of allocated cores to 24 (or 16 on the K20X nodes), you also **must** request all four GPUs +by specifying `--gres=gpu:4`, otherwise your job will not start. In the case of `--exclusive`, it won't +be denied on submission, because this is evaluated in a later scheduling step. Jobs that directly +request too many cores per GPU will be denied with the error message: + +```console +Batch job submission failed: Requested node configuration is not available +``` + +### Running Multiple GPU Applications Simultaneously in a Batch Job + +Our starting point is a (serial) program that needs a single GPU and four CPU cores to perform its +task (e.g. TensorFlow). The following batch script shows how to run such a job on the partition `ml`. + +!!! example + + ```bash + #!/bin/bash + #SBATCH --ntasks=1 + #SBATCH --cpus-per-task=4 + #SBATCH --gres=gpu:1 + #SBATCH --gpus-per-task=1 + #SBATCH --time=01:00:00 + #SBATCH --mem-per-cpu=1443 + #SBATCH --partition=ml + + srun some-gpu-application + ``` + +When `srun` is used within a submission script, it inherits parameters from `sbatch`, including +`--ntasks=1`, `--cpus-per-task=4`, etc. So we actually implicitly run the following: + +```bash +srun --ntasks=1 --cpus-per-task=4 ...
--partition=ml some-gpu-application +``` + +Now, our goal is to run four instances of this program concurrently in a single batch script. Of +course we could also start the above script multiple times with `sbatch`, but this is not what we want +to do here. + +#### Solution + +In order to run multiple programs concurrently in a single batch script/allocation we have to do +three things: + +1. Allocate enough resources to accommodate multiple instances of our program. This can be achieved + with an appropriate batch script header (see below). +1. Start job steps with srun as background processes. This is achieved by adding an ampersand at the + end of the `srun` command +1. Make sure that each background process gets its private resources. We need to set the resource + fraction needed for a single run in the corresponding srun command. The total aggregated + resources of all job steps must fit in the allocation specified in the batch script header. + Additionally, the option `--exclusive` is needed to make sure that each job step is provided with + its private set of CPU and GPU resources. The following example shows how four independent + instances of the same program can be run concurrently from a single batch script. Each instance + (task) is equipped with 4 CPUs (cores) and one GPU. + +!!! 
example "Job file simultaneously executing four independent instances of the same program" + + ```Bash + #!/bin/bash + #SBATCH --ntasks=4 + #SBATCH --cpus-per-task=4 + #SBATCH --gres=gpu:4 + #SBATCH --gpus-per-task=1 + #SBATCH --time=01:00:00 + #SBATCH --mem-per-cpu=1443 + #SBATCH --partition=ml + + srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & + srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & + srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & + srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & + + echo "Waiting for all job steps to complete..." + wait + echo "All jobs completed!" + ``` + +In practice, it is possible to leave out resource options in `srun` that do not differ from the ones +inherited from the surrounding `sbatch` context. The following line would be sufficient to do the +job in this example: + +```bash +srun --exclusive --gres=gpu:1 --ntasks=1 some-gpu-application & +``` + +Yet, it adds some extra safety to leave them in, enabling the Slurm batch system to complain if not +enough resources in total were specified in the header of the batch script. + +## Exclusive Jobs for Benchmarking + +Jobs on ZIH systems run, by default, in shared mode, meaning that multiple jobs (from different users) +can run at the same time on the same compute node. Sometimes, this behavior is not desired (e.g. +for benchmarking purposes). For such cases, the Slurm parameter `--exclusive` requests exclusive usage of +resources. + +Setting `--exclusive` **only** makes sure that there will be **no other jobs running on your nodes**. +It does not, however, mean that you automatically get access to all the resources which the node +might provide without explicitly requesting them, e.g.
you still have to request a GPU via the +generic resources parameter (`gres`) to run on the GPU partitions, or you still have to request all +cores of a node if you need them. CPU cores can either be used for a task (`--ntasks`) or for +multi-threading within the same task (`--cpus-per-task`). Since those two options are semantically +different (e.g., the former will influence how many MPI processes will be spawned by `srun` whereas +the latter does not), Slurm cannot determine automatically which of the two you might want to use. +Since we use cgroups for separation of jobs, your job is not allowed to use more resources than +requested. + +If you just want to use all available cores in a node, you have to specify how Slurm should organize +them, like with `-p haswell -c 24` or `-p haswell --ntasks-per-node=24`. + +Here is a short example to ensure that a benchmark is not spoiled by other jobs, even if it doesn't +use up all resources in the nodes: + +!!! example "Exclusive resources" + + ```Bash + #!/bin/bash + #SBATCH -p haswell + #SBATCH --nodes=2 + #SBATCH --ntasks-per-node=2 + #SBATCH --cpus-per-task=8 + #SBATCH --exclusive # ensure that nobody spoils my measurement on 2 x 2 x 8 cores + #SBATCH --time=00:10:00 + #SBATCH -J Benchmark + #SBATCH --mail-user=your.name@tu-dresden.de + + srun ./my_benchmark + ``` + +## Array Jobs + +Array jobs can be used to create a sequence of jobs that share the same executable and resource +requirements, but have different input files, to be submitted, controlled, and monitored as a single +unit. The option is `-a, --array=<indexes>` where the parameter `indexes` specifies the array +indices. The following specifications are possible: + +* comma separated list, e.g., `--array=0,1,2,17`, +* range based, e.g., `--array=0-42`, +* step based, e.g., `--array=0-15:4`, +* mix of comma separated and range based, e.g., `--array=0,1,2,16-42`.
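To check which task indices a given specification would create before submitting, the index formats listed above can be expanded locally. The following is only a sketch: `expand_array_spec` is a hypothetical helper, not a Slurm command, and it covers just the formats from the list (plus an optional trailing `%<limit>`, which it ignores).

```bash
#!/bin/bash
# Hypothetical helper (not a Slurm command): expands an --array index
# specification into the task indices it describes, one per line.
expand_array_spec() {
    local spec=${1%%\%*}      # drop an optional "%<limit>" suffix
    local part range step first last
    local IFS=,               # split the specification at commas
    for part in $spec; do
        step=1
        range=$part
        if [[ $part == *:* ]]; then   # step based, e.g. 0-15:4
            step=${part##*:}
            range=${part%%:*}
        fi
        first=${range%%-*}    # a single index yields first == last
        last=${range##*-}
        seq "$first" "$step" "$last"
    done
}

expand_array_spec "0-15:4"    # prints 0, 4, 8 and 12 on separate lines
```

Piping the output through `wc -l` gives the number of array tasks, which may help to estimate the load a job array puts on the batch system.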
+ +A maximum number of simultaneously running tasks from the job array may be specified using the `%` +separator. The specification `--array=0-23%8` limits the number of simultaneously running tasks from +this job array to 8. + +Within the job you can read the environment variables `SLURM_ARRAY_JOB_ID` and +`SLURM_ARRAY_TASK_ID` which are set to the first job ID of the array and individually for each +step, respectively. + +Within an array job, you can use `%a` and `%A` in addition to `%j` and `%N` to make the output file +name specific to the job: + +* `%A` will be replaced by the value of `SLURM_ARRAY_JOB_ID` +* `%a` will be replaced by the value of `SLURM_ARRAY_TASK_ID` + +!!! example "Job file using job arrays" + + ```Bash + #!/bin/bash + #SBATCH --array=0-9 + #SBATCH -o arraytest-%A_%a.out + #SBATCH -e arraytest-%A_%a.err + #SBATCH --ntasks=864 + #SBATCH --time=08:00:00 + #SBATCH -J Science1 + #SBATCH --mail-type=end + #SBATCH --mail-user=your.name@tu-dresden.de + + echo "Hi, I am step $SLURM_ARRAY_TASK_ID in this array job $SLURM_ARRAY_JOB_ID" + ``` + +!!! note + + If you submit a large number of jobs doing heavy I/O in the Lustre filesystems, you should limit + the number of your simultaneously running jobs with a second parameter like: + + ```Bash + #SBATCH --array=1-100000%100 + ``` + +Please read the Slurm documentation at https://slurm.schedmd.com/sbatch.html for further details. + +## Chain Jobs + +You can use chain jobs to create dependencies between jobs. This is often the case if a job relies +on the result of one or more preceding jobs. Chain jobs can also be used if the runtime limit of the +batch queues is not sufficient for your job. Slurm has an option +`-d, --dependency=<dependency_list>` that allows you to specify that a job may only start after +another job has finished.
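If a job relies on more than one preceding job, the job IDs can be joined with colons in the dependency list. A minimal sketch, assuming the dependency types `afterok` or `afterany`; the helper `dependency_option` and the job IDs are made up for illustration:

```bash
#!/bin/bash
# Hypothetical helper (not a Slurm command): builds a --dependency option
# from a dependency type followed by one or more job IDs.
dependency_option() {
    local type=$1
    shift
    local IFS=:               # "$*" joins the remaining arguments with ":"
    echo "--dependency=${type}:$*"
}

dependency_option afterok 1001 1002   # prints: --dependency=afterok:1001:1002
```

The result could be used as `sbatch $(dependency_option afterok 1001 1002) <job file>`, so that the new job only starts once both listed jobs have completed successfully.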
+ +Here is an example of what a chain job can look like. The example submits 4 jobs (described in a job +file) that will be executed one after another with different numbers of tasks: + +!!! example "Script to submit jobs with dependencies" + + ```Bash + #!/bin/bash + TASK_NUMBERS="1 2 4 8" + DEPENDENCY="" + JOB_FILE="myjob.slurm" + + for TASKS in $TASK_NUMBERS ; do + JOB_CMD="sbatch --ntasks=$TASKS" + if [ -n "$DEPENDENCY" ] ; then + JOB_CMD="$JOB_CMD --dependency=afterany:$DEPENDENCY" + fi + JOB_CMD="$JOB_CMD $JOB_FILE" + echo -n "Running command: $JOB_CMD " + OUT=$($JOB_CMD) + echo "Result: $OUT" + DEPENDENCY=$(echo "$OUT" | awk '{print $4}') + done + ``` + +## Array-Job with Afterok-Dependency and Datamover Usage + +This is a *todo* diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_profiling.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_profiling.md new file mode 100644 index 0000000000000000000000000000000000000000..0c3dc773898143ca25086a97370b079a6c74cebe --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_profiling.md @@ -0,0 +1,62 @@ +# Job Profiling + +Slurm offers the option to gather profiling data from every task/node of the job. Analyzing this +data allows for a better understanding of your jobs in terms of elapsed time, runtime and IO +behavior, and more. + +The following data can be gathered: + +* Task data, such as CPU frequency, CPU utilization, memory consumption (RSS and VMSize), I/O +* Energy consumption of the nodes +* Infiniband data (currently deactivated) +* Lustre filesystem data (currently deactivated) + +The data is sampled at a fixed rate (e.g. every 5 seconds) and is stored in an HDF5 file. + +!!! note "Data hygiene" + + Please be aware that the profiling data may be quite large, depending on job size, runtime, and + sampling rate. Always remove the local profiles from `/lustre/scratch2/profiling/${USER}`, + either by running `sh5util` as shown below or by simply removing those files.
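Removing the local profiles of a single job by hand could be sketched as follows. This assumes that the node-local files are named `<jobid>_<stepid>_<nodename>.h5`, which is not stated on this page; please check the actual file names in your profiling directory first. The function name `remove_job_profiles` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: removes the node-local profile files of one job.
# Assumption: files are named <jobid>_<stepid>_<nodename>.h5.
remove_job_profiles() {
    local jobid=$1
    # Default directory taken from the note above; a second argument overrides it.
    local dir=${2:-/lustre/scratch2/profiling/${USER}}
    # -f: stay quiet if no file matches the pattern
    rm -f -- "${dir}/${jobid}"_*.h5
}

# Usage (the job ID is made up):
#   remove_job_profiles 1234567
```

Only files whose names start with the given job ID followed by an underscore are removed, so profiles of other jobs in the same directory stay untouched.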
+ +## Examples + +The following examples of `srun` profiling command lines are meant to replace the current `srun` +line within your job file. + +??? example "Create profiling data" + + (--acctg-freq is the sampling rate in seconds) + + ```console + # Energy and task profiling + srun --profile=All --acctg-freq=5,energy=5 -n 32 ./a.out + # Task profiling data only + srun --profile=All --acctg-freq=5 -n 32 ./a.out + ``` + +??? example "Merge the node local files" + + ... in `/lustre/scratch2/profiling/${USER}` to single file. + + ```console + # (without -o option output file defaults to job_$JOBID.h5) + sh5util -j <JOBID> -o profile.h5 + # in jobscripts or in interactive sessions (via salloc): + sh5util -j ${SLURM_JOBID} -o profile.h5 + ``` + +??? example "View data" + + ```console + marie@login$ module load HDFView + marie@login$ hdfview.sh profile.h5 + ``` + + +{: align="center"} + +More information about profiling with Slurm: + +- [Slurm Profiling](http://slurm.schedmd.com/hdf5_profile_user_guide.html) +- [`sh5util`](http://slurm.schedmd.com/sh5util.html) diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/system_taurus.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/system_taurus.md deleted file mode 100644 index 3625bf4503d4b41d73fc7a9de6c02dabc3d3feec..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/system_taurus.md +++ /dev/null @@ -1,210 +0,0 @@ -# Taurus - -## Information about the Hardware - -Detailed information on the current HPC hardware can be found -[here.](../jobs_and_resources/hardware_taurus.md) - -## Applying for Access to the System - -Project and login application forms for taurus are available -[here](../access/overview.md). - -## Login to the System - -Login to the system is available via ssh at taurus.hrsk.tu-dresden.de. -There are several login nodes (internally called tauruslogin3 to -tauruslogin6). Currently, if you use taurus.hrsk.tu-dresden.de, you will -be placed on tauruslogin5. 
It might be a good idea to give the other -login nodes a try if the load on tauruslogin5 is rather high (there will -once again be load balancer soon, but at the moment, there is none). - -Please note that if you store data on the local disk (e.g. under /tmp), -it will be on only one of the three nodes. If you relogin and the data -is not there, you are probably on another node. - -You can find an list of fingerprints [here](../access/key_fingerprints.md). - -## Transferring Data from/to Taurus - -taurus has two specialized data transfer nodes. Both nodes are -accessible via `taurusexport.hrsk.tu-dresden.de`. Currently, only rsync, -scp and sftp to these nodes will work. A login via SSH is not possible -as these nodes are dedicated to data transfers. - -These nodes are located behind a firewall. By default, they are only -accessible from IP addresses from with the Campus of the TU Dresden. -External IP addresses can be enabled upon request. These requests should -be send via eMail to `servicedesk@tu-dresden.de` and mention the IP -address range (or node names), the desired protocol and the time frame -that the firewall needs to be open. - -We are open to discuss options to export the data in the scratch file -system via CIFS or other protocols. If you have a need for this, please -contact the Service Desk as well. - -**Phase 2:** The nodes taurusexport\[3,4\] provide access to the -`/scratch` file system of the second phase. - -## Compiling Parallel Applications - -You have to explicitly load a compiler module and an MPI module on -Taurus. Eg. with `module load GCC OpenMPI`. ( [read more about -Modules](../software/runtime_environment.md), **todo link** (read more about -Compilers)(Compendium.Compilers)) - -Use the wrapper commands like e.g. `mpicc` (`mpiicc` for intel), -`mpicxx` (`mpiicpc`) or `mpif90` (`mpiifort`) to compile MPI source -code. To reveal the command lines behind the wrappers, use the option -`-show`. 
- -For running your code, you have to load the same compiler and MPI module -as for compiling the program. Please follow the following guiedlines to -run your parallel program using the batch system. - -## Batch System - -Applications on an HPC system can not be run on the login node. They -have to be submitted to compute nodes with dedicated resources for the -user's job. Normally a job can be submitted with these data: - -- number of CPU cores, -- requested CPU cores have to belong on one node (OpenMP programs) or - can distributed (MPI), -- memory per process, -- maximum wall clock time (after reaching this limit the process is - killed automatically), -- files for redirection of output and error messages, -- executable and command line parameters. - -The batch system on Taurus is Slurm. If you are migrating from LSF -(deimos, mars, atlas), the biggest difference is that Slurm has no -notion of batch queues any more. - -- [General information on the Slurm batch system](slurm.md) -- Slurm also provides process-level and node-level [profiling of - jobs](slurm.md#Job_Profiling) - -### Partitions - -Please note that the islands are also present as partitions for the -batch systems. They are called - -- romeo (Island 7 - AMD Rome CPUs) -- julia (large SMP machine) -- haswell (Islands 4 to 6 - Haswell CPUs) -- gpu (Island 2 - GPUs) - - gpu2 (K80X) -- smp2 (SMP Nodes) - -**Note:** usually you don't have to specify a partition explicitly with -the parameter -p, because SLURM will automatically select a suitable -partition depending on your memory and gres requirements. - -### Run-time Limits - -**Run-time limits are enforced**. This means, a job will be canceled as -soon as it exceeds its requested limit. At Taurus, the maximum run time -is 7 days. 
- -Shorter jobs come with multiple advantages:\<img alt="part.png" -height="117" src="%ATTACHURL%/part.png" style="float: right;" -title="part.png" width="284" /> - -- lower risk of loss of computing time, -- shorter waiting time for reservations, -- higher job fluctuation; thus, jobs with high priorities may start - faster. - -To bring down the percentage of long running jobs we restrict the number -of cores with jobs longer than 2 days to approximately 50% and with jobs -longer than 24 to 75% of the total number of cores. (These numbers are -subject to changes.) As best practice we advise a run time of about 8h. - -Please always try to make a good estimation of your needed time limit. -For this, you can use a command line like this to compare the requested -timelimit with the elapsed time for your completed jobs that started -after a given date: - - sacct -X -S 2021-01-01 -E now --format=start,JobID,jobname,elapsed,timelimit -s COMPLETED - -Instead of running one long job, you should split it up into a chain -job. Even applications that are not capable of chreckpoint/restart can -be adapted. The HOWTO can be found [here](../jobs_and_resources/checkpoint_restart.md), - -### Memory Limits - -**Memory limits are enforced.** This means that jobs which exceed their -per-node memory limit will be killed automatically by the batch system. -Memory requirements for your job can be specified via the *sbatch/srun* -parameters: **--mem-per-cpu=\<MB>** or **--mem=\<MB>** (which is "memory -per node"). The **default limit** is **300 MB** per cpu. - -Taurus has sets of nodes with a different amount of installed memory -which affect where your job may be run. To achieve the shortest possible -waiting time for your jobs, you should be aware of the limits shown in -the following table. - -| Partition | Nodes | # Nodes | Cores per Node | Avail. Memory per Core | Avail. 
Memory per Node | GPUs per node | -|:-------------------|:-----------------------------------------|:--------|:----------------|:-----------------------|:-----------------------|:------------------| -| `haswell64` | `taurusi[4001-4104,5001-5612,6001-6612]` | `1328` | `24` | `2541 MB` | `61000 MB` | `-` | -| `haswell128` | `taurusi[4105-4188]` | `84` | `24` | `5250 MB` | `126000 MB` | `-` | -| `haswell256` | `taurusi[4189-4232]` | `44` | `24` | `10583 MB` | `254000 MB` | `-` | -| `broadwell` | `taurusi[4233-4264]` | `32` | `28` | `2214 MB` | `62000 MB` | `-` | -| `smp2` | `taurussmp[3-7]` | `5` | `56` | `36500 MB` | `2044000 MB` | `-` | -| `gpu2` | `taurusi[2045-2106]` | `62` | `24` | `2583 MB` | `62000 MB` | `4 (2 dual GPUs)` | -| `gpu2-interactive` | `taurusi[2045-2108]` | `64` | `24` | `2583 MB` | `62000 MB` | `4 (2 dual GPUs)` | -| `hpdlf` | `taurusa[3-16]` | `14` | `12` | `7916 MB` | `95000 MB` | `3` | -| `ml` | `taurusml[1-32]` | `32` | `44 (HT: 176)` | `1443 MB*` | `254000 MB` | `6` | -| `romeo` | `taurusi[7001-7192]` | `192` | `128 (HT: 256)` | `1972 MB*` | `505000 MB` | `-` | -| `julia` | `taurussmp8` | `1` | `896` | `27343 MB*` | `49000000 MB` | `-` | - -\* note that the ML nodes have 4way-SMT, so for every physical core -allocated (e.g., with SLURM_HINT=nomultithread), you will always get -4\*1443MB because the memory of the other threads is allocated -implicitly, too. - -### Submission of Parallel Jobs - -To run MPI jobs ensure that the same MPI module is loaded as during -compile-time. In doubt, check you loaded modules with `module list`. If -your code has been compiled with the standard `bullxmpi` installation, -you can load the module via `module load bullxmpi`. Alternative MPI -libraries (`intelmpi`, `openmpi`) are also available. - -Please pay attention to the messages you get loading the module. They -are more up-to-date than this manual. - -## GPUs - -Island 2 of taurus contains a total of 128 NVIDIA Tesla K80 (dual) GPUs -in 64 nodes. 
- -More information on how to program applications for GPUs can be found -[GPU Programming](GPU Programming). - -The following software modules on taurus offer GPU support: - -- `CUDA` : The NVIDIA CUDA compilers -- `PGI` : The PGI compilers with OpenACC support - -## Hardware for Deep Learning (HPDLF) - -The partition hpdlf contains 14 servers. Each of them has: - -- 2 sockets CPU E5-2603 v4 (1.70GHz) with 6 cores each, -- 3 consumer GPU cards NVIDIA GTX1080, -- 96 GB RAM. - -## Energy Measurement - -Taurus contains sophisticated energy measurement instrumentation. -Especially HDEEM is available on the haswell nodes of Phase II. More -detailed information can be found at -**todo link** (EnergyMeasurement)(EnergyMeasurement). - -## Low level optimizations - -x86 processsors provide registers that can be used for optimizations and -performance monitoring. Taurus provides you access to such features via -the **todo link** (X86Adapt)(X86Adapt) software infrastructure. diff --git a/doc.zih.tu-dresden.de/docs/misc/HPC-Introduction.pdf b/doc.zih.tu-dresden.de/docs/misc/HPC-Introduction.pdf new file mode 100644 index 0000000000000000000000000000000000000000..71d47f04b75004fad2b9fd7181051c2beae4e2fe Binary files /dev/null and b/doc.zih.tu-dresden.de/docs/misc/HPC-Introduction.pdf differ diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md index 21966e1f3f03416e1a080a391894f370f9f1a5a8..72224113fdf8a9c6f4727d47771283dc1d0c1baa 100644 --- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md @@ -7,7 +7,7 @@ algorithms and graphical techniques. R is an integrated suite of software facil manipulation, calculation and graphing. We recommend using the partitions Haswell and/or Romeo to work with R. For more details -see our [hardware documentation](../jobs_and_resources/hardware_taurus.md). 
+see our [hardware documentation](../jobs_and_resources/hardware_overview.md). ## R Console @@ -256,7 +256,7 @@ code to use `mclapply` function. Check out an example below. The disadvantages of using shared-memory parallelism approach are, that the number of parallel tasks is limited to the number of cores on a single node. The maximum number of cores on a single node can -be found in our [hardware documentation](../jobs_and_resources/hardware_taurus.md). +be found in our [hardware documentation](../jobs_and_resources/hardware_overview.md). Submitting a multicore R job to Slurm is very similar to submitting an [OpenMP Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks), diff --git a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md index 92786013f0382c841eed253c71e4a39cbc1a9b62..38190764e6c9efedb275ec9ff4324d916c851566 100644 --- a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md +++ b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md @@ -190,7 +190,7 @@ There are the following script preparation steps for OmniOpt: ``` 1. Testing script functionality and determine software requirements for the chosen - [partition](../jobs_and_resources/system_taurus.md#partitions). In the following, the alpha + [partition](../jobs_and_resources/partitions_and_limits.md). In the following, the alpha partition is used. Please note the parameters `--out-layer1`, `--batchsize`, `--epochs` when calling the Python script. Additionally, note the `RESULT` string with the output for OmniOpt. 
diff --git a/doc.zih.tu-dresden.de/docs/software/mathematics.md b/doc.zih.tu-dresden.de/docs/software/mathematics.md index 3ae820eda962a63a1ff59c55536865f1437d582a..33f8ffb773301131138e19d6254f7b55fd3e96ff 100644 --- a/doc.zih.tu-dresden.de/docs/software/mathematics.md +++ b/doc.zih.tu-dresden.de/docs/software/mathematics.md @@ -1,11 +1,9 @@ # Mathematics Applications -!!! cite +!!! cite "Galileo Galilei" Nature is written in mathematical language. - (Galileo Galilei) - <!--*Please do not run expensive interactive sessions on the login nodes. Instead, use* `srun --pty--> <!--...` *to let the batch system place it on a compute node.*--> diff --git a/doc.zih.tu-dresden.de/docs/software/pytorch.md b/doc.zih.tu-dresden.de/docs/software/pytorch.md index e8e2c4d5ecc7d123527a15140910005204a3d5ef..63d3eb91e516d24559a85d80c70a30b73e9af73c 100644 --- a/doc.zih.tu-dresden.de/docs/software/pytorch.md +++ b/doc.zih.tu-dresden.de/docs/software/pytorch.md @@ -18,7 +18,7 @@ to find out, which PyTorch modules are available on your partition. We recommend using **Alpha** and/or **ML** partitions when working with machine learning workflows and the PyTorch library. You can find detailed hardware specification in our -[hardware documentation](../jobs_and_resources/hardware_taurus.md). +[hardware documentation](../jobs_and_resources/hardware_overview.md). ## PyTorch Console @@ -44,7 +44,7 @@ Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies marie@alpha$ pip install torchvision --no-deps ``` - Using the **--no-deps** option for "pip install" is necessary here as otherwise the PyTorch + Using the **--no-deps** option for "pip install" is necessary here as otherwise the PyTorch version might be replaced and you will run into trouble with the cuda drivers.
On the **ML** partition: diff --git a/doc.zih.tu-dresden.de/docs/software/tensorboard.md b/doc.zih.tu-dresden.de/docs/software/tensorboard.md index a1fab030bfbca20b1a8f69cf302e95957b565185..d2c838d3961d8f48794e544ce1ca7846d24e7325 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensorboard.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorboard.md @@ -81,4 +81,4 @@ marie@local$ ssh -N -f -L 6006:taurusi8034.taurus.hrsk.tu-dresden.de:6006 <zih-l Now, you can see the TensorBoard in your browser at `http://localhost:6006/`. -Note that you can also use TensorBoard in an [sbatch file](../jobs_and_resources/batch_systems.md). +Note that you can also use TensorBoard in an [sbatch file](../jobs_and_resources/slurm.md). diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow.md b/doc.zih.tu-dresden.de/docs/software/tensorflow.md index d8ad85c3b1a5f870f5ced0848274fb866bd14dff..09a8352a32648178f3634a4099eee52ad6c0ccd0 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow.md @@ -19,7 +19,7 @@ TensorFlow 2 and TensorFlow 1, see the corresponding [section below](#compatibil We recommend using partitions **Alpha** and/or **ML** when working with machine learning workflows and the TensorFlow library. You can find detailed hardware specification in our -[Hardware](../jobs_and_resources/hardware_taurus.md) documentation. +[Hardware](../jobs_and_resources/hardware_overview.md) documentation. 
## TensorFlow Console diff --git a/doc.zih.tu-dresden.de/docs/support/news_archive.md b/doc.zih.tu-dresden.de/docs/support/news_archive.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/doc.zih.tu-dresden.de/docs/support/support.md b/doc.zih.tu-dresden.de/docs/support/support.md new file mode 100644 index 0000000000000000000000000000000000000000..c2c9fbda8bbb70c1dddb82fb384b69a8201e6fb8 --- /dev/null +++ b/doc.zih.tu-dresden.de/docs/support/support.md @@ -0,0 +1,31 @@ +# How to Ask for Support + +## Create a Ticket + +The best way to ask for help send a message to +[hpcsupport@zih.tu-dresden.de](mailto:hpcsupport@zih.tu-dresden.de) with a +detailed description of your problem. + +It should include: + +- Who is reporting? (login name) +- Where have you seen the problem? (name of the HPC system and/or of the node) +- When has the issue occurred? Maybe, when did it work last? +- What exactly happened? + +If possible include + +- job ID, +- batch script, +- filesystem path, +- loaded modules and environment, +- output and error logs, +- steps to reproduce the error. + +This email automatically opens a trouble ticket which will be tracked by the HPC team. Please +always keep the ticket number in the subject on your answers so that our system can keep track +on our communication. + +For a new request, please simply send a new email (without any ticket number). + +!!! hint "Please try to find an answer in this documentation first." 
diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index 6cf922ce238b8a8f6f82bffffb0a35f05cbda232..7af3fcc88da4cce6acddedfddf4b74efa2d7dbcb 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -1,5 +1,4 @@ nav: - - Home: index.md - Application for Login and Resources: - Overview: application/overview.md @@ -85,41 +84,24 @@ nav: - Structuring Experiments: data_lifecycle/experiments.md - HPC Resources and Jobs: - Overview: jobs_and_resources/overview.md - - Batch Systems: jobs_and_resources/batch_systems.md - HPC Resources: - - Hardware Taurus: jobs_and_resources/hardware_taurus.md + - Overview: jobs_and_resources/hardware_overview.md - AMD Rome Nodes: jobs_and_resources/rome_nodes.md - IBM Power9 Nodes: jobs_and_resources/power9.md - NVMe Storage: jobs_and_resources/nvme_storage.md - Alpha Centauri: jobs_and_resources/alpha_centauri.md - HPE Superdome Flex: jobs_and_resources/sd_flex.md - - Checkpoint/Restart: jobs_and_resources/checkpoint_restart.md - - Overview2: jobs_and_resources/index.md - - Taurus: jobs_and_resources/system_taurus.md - - Slurm Examples: jobs_and_resources/slurm_examples.md - - Slurm: jobs_and_resources/slurm.md - - Binding And Distribution Of Tasks: jobs_and_resources/binding_and_distribution_of_tasks.md - - # - Queue Policy: jobs/policy.md - - # - Examples: jobs/examples/index.md - - # - Affinity: jobs/affinity/index.md - - # - Interactive: jobs/interactive.md - - # - Best Practices: jobs/best-practices.md - - # - Reservations: jobs/reservations.md - - # - Monitoring: jobs/monitoring.md - - # - FAQs: jobs/jobs-faq.md - - #- Tests: tests.md - - - Support: support.md - - Archive: + - Running Jobs: + - Batch System Slurm: jobs_and_resources/slurm.md + - Job Examples: jobs_and_resources/slurm_examples.md + - Partitions and Limits : jobs_and_resources/partitions_and_limits.md + - Checkpoint/Restart: jobs_and_resources/checkpoint_restart.md + - Job Profiling: 
jobs_and_resources/slurm_profiling.md + - Binding And Distribution Of Tasks: jobs_and_resources/binding_and_distribution_of_tasks.md + - Support: + - How to Ask for Support: support/support.md + - News Archive: support/news_archive.md + - Archive of the Old Wiki: - Overview: archive/overview.md - Bio Informatics: archive/bioinformatics.md - CXFS End of Support: archive/cxfs_end_of_support.md diff --git a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css index 0fb1a3d46afe20b02e3fd9a03daf5b716819ad61..a3a992501bff7f7b153a1beb0779e7f3e576f9e6 100644 --- a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css +++ b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css @@ -28,19 +28,24 @@ .md-typeset h5 { font-family: 'Open Sans Semibold'; line-height: 130%; + margin: 0.2em; } .md-typeset h1 { font-family: 'Open Sans Regular'; - font-size: 1.6rem; + font-size: 1.6rem; + margin-bottom: 0.5em; } .md-typeset h2 { - font-size: 1.4rem; + font-size: 1.2rem; + margin: 0.5em; + border-bottom-style: solid; + border-bottom-width: 1px; } .md-typeset h3 { - font-size: 1.2rem; + font-size: 1.1rem; } .md-typeset h4 { @@ -48,8 +53,8 @@ } .md-typeset h5 { - font-size: 0.9rem; - line-height: 120%; + font-size: 0.8rem; + text-transform: initial; } strong { @@ -161,6 +166,7 @@ hr.solid { p { padding: 0 0.6rem; + margin: 0.2em; } /* main */ diff --git a/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh b/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh index 950579f5356d4efd06006386e2cec84381592882..e52633160a5844c6aad92069347d2b3dc8eb5988 100755 --- a/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh +++ b/doc.zih.tu-dresden.de/util/grep-forbidden-words.sh @@ -17,7 +17,6 @@ s \<SLURM\> i file \+system HDFS i \<taurus\> taurus\.hrsk /taurus i \<hrskii\> -i hpc \+system i hpc[ -]\+da\> i work[ -]\+space" diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell index 
b0e926dcd8d6be6a578d35ad1d7244b154e03968..d5588c6dce5782ee4eb73fd2018047a7a2d3a1a2 100644 --- a/doc.zih.tu-dresden.de/wordlist.aspell +++ b/doc.zih.tu-dresden.de/wordlist.aspell @@ -1,22 +1,29 @@ personal_ws-1.1 en 203 +Abaqus Altix +Amber Amdahl's analytics anonymized APIs +AVX BeeGFS benchmarking BLAS broadwell bsub bullx +CCM ccNUMA centauri CentOS +cgroups +checkpointing Chemnitz citable conda CPU +CPUID CPUs css CSV @@ -32,6 +39,7 @@ DDP DDR DFG DistributedDataParallel +DMTCP Dockerfile Dockerfiles DockerHub @@ -41,6 +49,8 @@ ecryptfs engl english env +EPYC +Espresso ESSL fastfs FFT @@ -48,8 +58,11 @@ FFTW filesystem filesystems Flink +FMA foreach Fortran +Gauss +Gaussian GBit GFLOPS gfortran @@ -62,12 +75,17 @@ glibc gnuplot GPU GPUs +GROMACS hadoop haswell +HDF HDFS +HDFView Horovod hostname +Hostnames HPC +HPE HPL html hyperparameter @@ -87,6 +105,7 @@ JupyterHub JupyterLab Keras KNL +LAMMPS LAPACK lapply LINPACK @@ -116,15 +135,22 @@ mpifort mpirun multicore multithreaded +Multithreading +NAMD +natively nbsp NCCL Neptun NFS +NGC NRINGS NUMA NUMAlink NumPy Nutzungsbedingungen +Nvidia +NVMe +NWChem OME OmniOpt OPARI @@ -154,19 +180,25 @@ PMI png PowerAI ppc +Preload +preloaded +preloading PSOCK Pthreads pymdownx +Quantum queue randint reachability README reproducibility +requeueing RHEL Rmpi rome romeo RSA +RSS RStudio Rsync runnable @@ -196,9 +228,14 @@ squeue srun ssd SSHFS +STAR stderr stdout +subdirectories +subdirectory +Superdome SUSE +SXM TBB TCP TensorBoard @@ -208,6 +245,8 @@ Theano tmp todo ToDo +toolchain +toolchains tracefile tracefiles transferability @@ -216,11 +255,13 @@ uplink Vampir VampirTrace VampirTrace's +VASP vectorization venv virtualenv VirtualGL VMs +VMSize VPN WebVNC WinSCP