diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md index a097655547cb08b079882f5037540a098e353838..7405eec766fa216e5eccdab2b4a1856ca5f98b2b 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md @@ -11,13 +11,13 @@ users and the ZIH. ## Login Nodes - Login-Nodes (`tauruslogin[3-6].hrsk.tu-dresden.de`) - - each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 each with 12 cores - @ 2.50GHz, Multithreading Disabled, 64 GB RAM, 128 GB SSD local disk - - IPs: 141.30.73.\[102-105\] + - each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 each with 12 cores + @ 2.50GHz, Multithreading Disabled, 64 GB RAM, 128 GB SSD local disk + - IPs: 141.30.73.\[102-105\] - Transfer-Nodes (`taurusexport[3-4].hrsk.tu-dresden.de`, DNS Alias `taurusexport.hrsk.tu-dresden.de`) - - 2 Servers without interactive login, only available via file transfer protocols (`rsync`, `ftp`) - - IPs: 141.30.73.82/83 + - 2 Servers without interactive login, only available via file transfer protocols (`rsync`, `ftp`) + - IPs: 141.30.73.82/83 - Direct access to these nodes is granted via IP whitelisting (contact hpcsupport@zih.tu-dresden.de) - otherwise use TU Dresden VPN. @@ -27,11 +27,11 @@ users and the ZIH. ## AMD Rome CPUs + NVIDIA A100 -- 32 nodes, each with - - 8 x NVIDIA A100-SXM4 - - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading disabled - - 1 TB RAM - - 3.5 TB local memory at NVMe device at `/tmp` +- 34 nodes, each with + - 8 x NVIDIA A100-SXM4 + - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading disabled + - 1 TB RAM + - 3.5 TB local memory at NVMe device at `/tmp` - Hostnames: `taurusi[8001-8034]` - Slurm partition `alpha` - Dedicated mostly for ScaDS-AI @@ -39,20 +39,21 @@ users and the ZIH. 
## Island 7 - AMD Rome CPUs - 192 nodes, each with - - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, Multithreading - enabled, - - 512 GB RAM - - 200 GB /tmp on local SSD local disk + - 2x AMD EPYC CPU 7702 (64 cores) @ 2.0GHz, Multithreading + enabled, + - 512 GB RAM + - 200 GB `/tmp` on local SSD - Hostnames: `taurusi[7001-7192]` - Slurm partition `romeo` - More information under [Rome Nodes](rome_nodes.md) ## Large SMP System HPE Superdome Flex -- 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) -- 47 TB RAM +- 1 node, with + - 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20GHz (28 cores) + - 47 TB RAM - Currently configured as one single node - - Hostname: `taurussmp8` +- Hostname: `taurussmp8` - Slurm partition `julia` - More information under [HPE SD Flex](sd_flex.md) @@ -60,27 +61,26 @@ users and the ZIH. For machine learning, we have 32 IBM AC922 nodes installed with this configuration: -- 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores) -- 256 GB RAM DDR4 2666MHz -- 6x NVIDIA VOLTA V100 with 32GB HBM2 -- NVLINK bandwidth 150 GB/s between GPUs and host -- Slurm partition `ml` +- 32 nodes, each with + - 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores) + - 256 GB RAM DDR4 2666MHz + - 6x NVIDIA VOLTA V100 with 32GB HBM2 + - NVLINK bandwidth 150 GB/s between GPUs and host - Hostnames: `taurusml[1-32]` +- Slurm partition `ml` -## Island 4 to 6 - Intel Haswell CPUs +## Island 6 - Intel Haswell CPUs -- 1456 nodes, each with 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) - @ 2.50GHz, Multithreading disabled, 128 GB SSD local disk -- Hostname: `taurusi[4001-4232]`, `taurusi[5001-5612]`, - `taurusi[6001-6612]` +- 612 nodes, each with + - 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) + @ 2.50GHz, Multithreading disabled, 128 GB SSD local disk - Varying amounts of main memory (selected automatically by the batch system for you according to your job requirements) - - 1328 nodes with 2.67 GB RAM per core (64 GB total): - 
`taurusi[4001-4104,5001-5612,6001-6612]` - - 84 nodes with 5.34 GB RAM per core (128 GB total): - `taurusi[4105-4188]` - - 44 nodes with 10.67 GB RAM per core (256 GB total): - `taurusi[4189-4232]` + - 594 nodes with 2.67 GB RAM per core (64 GB total): + `taurusi[6001-6540,6559-6612]` + - 18 nodes with 10.67 GB RAM per core (256 GB total): + `taurusi[6541-6558]` +- Hostnames: `taurusi[6001-6612]` - Slurm Partition `haswell` ??? hint "Node topology"  {: align=center} -### Extension of Island 4 with Broadwell CPUs - -* 32 nodes, each with 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz - (**14 cores**), Multithreading disabled, 64 GB RAM, 256 GB SSD local disk -* from the users' perspective: Broadwell is like Haswell -* Hostname: `taurusi[4233-4264]` -* Slurm partition `broadwell` - ## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs -* 64 nodes, each with 2x Intel(R) Xeon(R) CPU E5-E5-2680 v3 (12 cores) - @ 2.50GHz, Multithreading Disabled, 64 GB RAM (2.67 GB per core), - 128 GB SSD local disk, 4x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs -* Hostname: `taurusi[2045-2108]` -* Slurm Partition `gpu2` -* Node topology, same as [island 4 - 6](#island-4-to-6-intel-haswell-cpus) +- 64 nodes, each with + - 2x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) + @ 2.50GHz, Multithreading Disabled + - 64 GB RAM (2.67 GB per core) + - 128 GB SSD local disk + - 4x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs +- Hostnames: `taurusi[2045-2108]` +- Slurm Partition `gpu2` +- Node topology, same as [island 6](#island-6-intel-haswell-cpus) ## SMP Nodes - up to 2 TB RAM -- 5 Nodes each with 4x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ - 2.20GHz, Multithreading Disabled, 2 TB RAM - - Hostname: `taurussmp[3-7]` - - Slurm partition `smp2` +- 5 Nodes, each with + - 4x Intel(R) Xeon(R) CPU E7-4850 v3 (14 cores) @ + 2.20GHz, Multithreading Disabled + - 2 TB RAM +- Hostnames: `taurussmp[3-7]` 
+- Slurm partition `smp2` ??? hint "Node topology" diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md index ae450f8b46a75b3457428deaee8a24caff355e85..3374d016e336b31abb473395fd1a2e82d480389d 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/partitions_and_limits.md @@ -10,19 +10,21 @@ smaller jobs. Thus, restrictions w.r.t. [memory](#memory-limits) and !!! warning "Runtime limits on login nodes" - There is a time limit set for processes on login nodes. If you run applications outside of a - compute job, it will be stopped automatically after 5 minutes with + There is a time limit of 600 seconds set for processes on login nodes. The login nodes are + shared resources among all users of the ZIH systems and thus need to remain available; they + cannot be used for production runs. Each process running longer than this time limit is + automatically killed with the message ``` CPU time limit exceeded ``` - Please start a job using the [batch system](slurm.md). + Please submit extensive application runs to the compute nodes using the [batch system](slurm.md). !!! note "Runtime limits are enforced." - A job is canceled as soon as it exceeds its requested limit. Currently, the maximum run time is - 7 days. + A job is canceled as soon as it exceeds its requested limit. Currently, the maximum run time + limit is 7 days. Shorter jobs come with multiple advantages: @@ -47,9 +49,6 @@ Instead of running one long job, you should split it up into a chain job. Even a not capable of checkpoint/restart can be adapted. Please refer to the section [Checkpoint/Restart](../jobs_and_resources/checkpoint_restart.md) for further documentation. - -{: align="center" summary="Partitions image"} - ## Memory Limits !!! note "Memory limits are enforced." @@ -64,36 +63,44 @@ 
ZIH systems comprise different sets of nodes with different amount of installed memory which affect where your job may be run. To achieve the shortest possible waiting time for your jobs, you should -be aware of the limits shown in the following table. - -???+ hint "Partitions and memory limits" - - | Partition | Nodes | # Nodes | Cores per Node | MB per Core | MB per Node | GPUs per Node | - |:-------------------|:-----------------------------------------|:--------|:----------------|:------------|:------------|:------------------| - | `interactive` | `taurusi[6605-6612]` | `8` | `24` | `2541` | `61000` | `-` | - | `haswell64` | `taurusi[4037-4104,5001-5612,6001-6604]` | `1284` | `24` | `2541` | `61000` | `-` | - | `haswell64ht` | `taurusi[4018-4036]` | `18` | `24 (HT: 48)` | `1270*` | `61000` | `-` | - | `haswell128` | `taurusi[4105-4188]` | `84` | `24` | `5250` | `126000` | `-` | - | `haswell256` | `taurusi[4189-4232]` | `44` | `24` | `10583` | `254000` | `-` | - | `broadwell` | `taurusi[4233-4264]` | `32` | `28` | `2214` | `62000` | `-` | - | `smp2` | `taurussmp[3-7]` | `5` | `56` | `36500` | `2044000` | `-` | - | `gpu2`** | `taurusi[2045-2103]` | `59` | `24` | `2583` | `62000` | `4 (2 dual GPUs)` | - | `hpdlf` | `taurusa[3-16]` | `14` | `12` | `7916` | `95000` | `3` | - | `ml`** | `taurusml[1-32]` | `32` | `44 (HT: 176)` | `1443*` | `254000` | `6` | - | `romeo`** | `taurusi[7001-7192]` | `192` | `128 (HT: 256)` | `1972*` | `505000` | `-` | - | `julia` | `taurussmp8` | `1` | `896` | `54006` | `48390000` | `-` | - | `alpha`** | `taurusi[8001-8034]` | `34` | `48 (HT: 96)` | `10312*` | `990000` | `8` | - {: summary="Partitions and limits table" align="bottom"} - -!!! note - - Some nodes have multithreading (SMT) enabled, so for every physical core allocated - (e.g., with `SLURM_HINT=nomultithread`), you will always get `MB per Core`*`number of threads`, - because the memory of the other threads is allocated implicitly, too. 
- Those nodes are marked with an asterisk. - Some of the partitions, denoted with a double asterisk, have a counterpart for interactive - jobs. These partitions have a `-interactive` suffix (e.g. `ml-interactive`) and have the same - configuration. - There is also a meta partition `haswell`, which contain partition `haswell64`, `haswell128`, `haswell256` and `smp2`and this is also the default partition. - If you specify no partition or partition `haswell` a Slurm plugin will choose the partition which fits to your memory requirements. - There are some other partitions, which are not specified in the table above, but those partitions should not be used directly. +be aware of the limits shown in the [partitions and limits table](#slurm-partitions) below. + +## Slurm Partitions + +The available compute nodes are grouped into logical (possibly overlapping) sets, the so-called +**partitions**. You can submit your job to a certain partition using the Slurm option +`--partition=<partition-name>`. + +Some nodes have Multithreading (SMT) enabled, so for every physical core allocated +(e.g., with `SLURM_HINT=nomultithread`), you will always get `MB per Core`*`number of threads`, +because the memory of the other threads is allocated implicitly, too. + +Some partitions have an *interactive* counterpart for interactive jobs. The corresponding partition +is suffixed with `-interactive` (e.g. `ml-interactive`) and has the same configuration. + +There is also a meta partition `haswell`, which contains the partitions `haswell64`, +`haswell256`, and `smp2`; it is also the default partition. If you specify no partition or the +partition `haswell`, a Slurm plugin will choose the partition that fits your memory requirements. +There are some other partitions that are not listed in the table below; those partitions +should not be used directly. 
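+
+For example, you can list the available partitions and open a short interactive shell on one of
+them (a sketch only; the resource values below are placeholders that you should adapt to your
+project and job requirements):
+
+```console
+marie@login$ sinfo --summarize
+marie@login$ srun --partition=ml-interactive --ntasks=1 --cpus-per-task=4 --time=00:30:00 --pty bash
+```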
+ +<!-- partitions_and_limits_table --> | Partition | Nodes | # Nodes | Cores per Node (SMT) | MB per Core (SMT) | MB per Node | GPUs per Node | |:--------|:------|--------:|---------------:|------------:|------------:|--------------:| | gpu2 | taurusi[2045-2103] | 59 | 24 | 2583 | 62000 | gpu:4 | | gpu2-interactive | taurusi[2045-2103] | 59 | 24 | 2583 | 62000 | gpu:4 | | haswell | taurusi[6001-6604],taurussmp[3-7] | 609 | 56 | 36500 | 2044000 | none | | haswell64 | taurusi[6001-6540,6559-6604] | 586 | 24 | 2541 | 61000 | none | | haswell256 | taurusi[6541-6558] | 18 | 24 | 10583 | 254000 | none | | interactive | taurusi[6605-6612] | 8 | 24 | 2541 | 61000 | none | | smp2 | taurussmp[3-7] | 5 | 56 | 36500 | 2044000 | none | | ifm | taurusa2 | 1 | 16 (SMT: 32) | 12000 | 384000 | gpu:1 | | hpdlf | taurusa[3-16] | 14 | 12 | 7916 | 95000 | gpu:3 | | ml | taurusml[3-32] | 30 | 44 (SMT: 176) | 1443 | 254000 | gpu:6 | | ml-interactive | taurusml[1-2] | 2 | 44 (SMT: 176) | 1443 | 254000 | gpu:6 | | romeo | taurusi[7003-7192] | 190 | 128 (SMT: 256) | 1972 | 505000 | none | | romeo-interactive | taurusi[7001-7002] | 2 | 128 (SMT: 256) | 1972 | 505000 | none | | julia | taurussmp8 | 1 | 896 | 54006 | 48390000 | none | | alpha | taurusi[8003-8034] | 32 | 48 (SMT: 96) | 10312 | 990000 | gpu:8 | | alpha-interactive | taurusi[8001-8002] | 2 | 48 (SMT: 96) | 10312 | 990000 | gpu:8 | {: summary="Partitions and limits table" align="bottom"} diff --git a/doc.zih.tu-dresden.de/docs/software/fem_software.md b/doc.zih.tu-dresden.de/docs/software/fem_software.md index 04facd9d816da492f1675f4ea6408027771c6b86..3c45319486d0bf3dca7a64f57b044c1837fe2548 100644 --- a/doc.zih.tu-dresden.de/docs/software/fem_software.md +++ b/doc.zih.tu-dresden.de/docs/software/fem_software.md @@ -188,6 +188,15 @@ that you can simply change to something like 16 or 24. For now, you should stay boundaries, because multi-node calculations require additional parameters. 
The number you choose should match your used `--cpus-per-task` parameter in your job file. +### Running MAPDL in Interactive Mode + +ANSYS Mechanical APDL (sometimes called ANSYS Classic) is the older MAPDL scripted environment. +You can start an interactive MAPDL session on a compute node as follows: + +```console +marie@login$ srun --partition=haswell --ntasks=1 --cpus-per-task=4 --time=1:00:00 --mem-per-cpu=1700 --pty bash +marie@node$ mapdl -smp +``` + ## COMSOL Multiphysics [COMSOL Multiphysics](http://www.comsol.com) (formerly FEMLAB) is a finite element analysis, solver