Commit f2d9a374 authored by Alexander Grund

Update references to old GPU clusters

Remove references to "K80 nodes" and the "gpu"/"gpu2"/"gpu-interactive" partitions

Fixes #564
parent 9c1bf3ad
2 merge requests: !1008 Automated merge from preview to main, !998 Update references to old GPU clusters
@@ -58,7 +58,7 @@ For MPI-parallel jobs one typically allocates one core per task that has to be s
### Multiple Programs Running Simultaneously in a Job

In this short example, our goal is to run four instances of a program concurrently in a **single**
batch script. Of course, we could also start a batch script four times with `sbatch`, but this is not
what we want to do here. However, you can also find an example about
[how to run GPU programs simultaneously in a single job](#running-multiple-gpu-applications-simultaneously-in-a-batch-job)
below.
@@ -109,10 +109,12 @@ But, do you need to request tasks or CPUs from Slurm in order to provide resourc
## Requesting GPUs

Slurm will allocate one or many GPUs for your job if requested.
Please note that GPUs are only available in the GPU clusters, like
[Alpha Centauri](hardware_overview.md#alpha-centauri) or
[Power9](hardware_overview.md#power9).
The option for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NODE]`,
where `NUM_PER_NODE` is the number of GPUs **per node** that will be used for the job.
!!! example "Job file to request a GPU"
@@ -129,10 +131,9 @@ for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NO
    srun ./your/cuda/application  # start your application (probably requires MPI to use both nodes)
    ```
With the transition to the sub-clusters, it is no longer required to specify the partition with `-p, --partition`.
It can still be used, but will cause a failure when the job is submitted on the wrong cluster.
This is useful to document the intended cluster or to catch an accidentally used wrong SBATCH script.
!!! note
@@ -144,24 +145,33 @@ parameter `-p, --partition` is omitted.
### Limitations of GPU Job Allocations
The number of cores per node that are currently allowed to be allocated for GPU jobs is limited
depending on how many GPUs are being requested.
On Alpha Centauri you may only request up to 6 cores per requested GPU.
This is because we want to avoid GPUs remaining unusable because all cores on a node are occupied
by a single job which does not, at the same time, request all GPUs.
E.g., if you specify `--gres=gpu:2`, your total number of cores per node (meaning:
`ntasks`*`cpus-per-task`) may not exceed 12 (on Alpha Centauri).

Note that this also has implications for the use of the `--exclusive` parameter.
Since this sets the number of allocated cores to 48, you also **must** request all eight GPUs
by specifying `--gres=gpu:8`, otherwise your job will not start.
In the case of `--exclusive`, the job won't be denied on submission,
because this is evaluated in a later scheduling step.
Jobs that directly request too many cores per GPU will be denied with the error message:
```console
Batch job submission failed: Requested node configuration is not available
```
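For illustration, a job file that stays within this limit could look as follows (a sketch only; the account name and application path are placeholders, not taken from the original page):

```shell
#!/bin/bash
#SBATCH --gres=gpu:2                # two GPUs per node
#SBATCH --ntasks=2                  # one task per GPU
#SBATCH --cpus-per-task=6           # 2 GPUs x 6 cores/GPU = 12 cores, within the limit
#SBATCH --time=01:00:00
#SBATCH --account=p_number_crunch   # placeholder project

srun ./your/cuda/application        # placeholder application
```

With `--gres=gpu:2` and 12 cores in total, the remaining GPUs and cores of the node stay available to other jobs.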
Similarly, it is not allowed to start CPU-only jobs on the GPU clusters.
I.e., you must request at least one GPU there, or you will get this error message:
```console
srun: error: QOSMinGRES
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
### Running Multiple GPU Applications Simultaneously in a Batch Job

Our starting point is a (serial) program that needs a single GPU and four CPU cores to perform its
@@ -186,7 +196,7 @@ When `srun` is used within a submission script, it inherits parameters from `sba
`--ntasks=1`, `--cpus-per-task=4`, etc. So we actually implicitly run the following
```bash
srun --ntasks=1 --cpus-per-task=4 [...] --partition=ml some-gpu-application
```
Now, our goal is to run four instances of this program concurrently in a single batch script. Of
@@ -232,12 +242,12 @@ three things:
echo "All jobs completed!"
```
In practice, it is possible to leave out resource options in `srun` that do not differ from the ones
inherited from the surrounding `sbatch` context. The following line would be sufficient to do the
job in this example:
```bash
srun --exclusive --gres=gpu:1 --ntasks=1 some-gpu-application &
```
Yet, it adds some extra safety to leave them in, enabling the Slurm batch system to complain if not
...
@@ -329,31 +329,18 @@ along with sgen. If not, see <http://www.gnu.org/licenses/>.
<script>
// dictionary containing the limits for the different partitions
const limitsPartition = {
    'alpha' : alpha = {
        'MaxTime': 'INFINITE',
        'DefaultTime': 480,
        'Sockets': 2,
        'CoresPerSocket': 24,
        'ThreadsPerCore': 2,
        'nodes': 37,
        'GPU': 8,
        'HTCores': 96,
        'Cores': 48,
        'MemoryPerNode': 990000,
        'MemoryPerCore': 10312
    },
    'haswell' : haswell = {
        'MaxTime': 'INFINITE',
...
@@ -76,9 +76,6 @@ following table depicts the resource limits for [all our HPC systems](hardware_o
| HPC System | Nodes | # Nodes | Cores per Node | Threads per Core | Memory per Node [in MB] | Memory per (SMT) Core [in MB] | GPUs per Node | Job Max Time |
|:-----------|:------|--------:|---------------:|-----------------:|------------------------:|------------------------------:|--------------:|-------------:|
| [`Barnard`](barnard.md) | `n[1001-1630].barnard` | 630 | 104 | 2 | 515,000 | 4,951 | - | unlimited |
| [`Power9`](power9.md) | `ml[1-29].power9` | 29 | 44 | 4 | 254,000 | 1,443 | 6 | unlimited |
| [`Romeo`](romeo.md) | `i[8001-8190].romeo` | 190 | 128 | 2 | 505,000 | 1,972 | - | unlimited |
...
@@ -209,8 +209,8 @@ parameter `--ntasks-per-node=<N>` equals the number of GPUs you use per node.
Also, it can be useful to increase `memory/cpu` parameters if you run larger models.
Memory can be set up to:

- `--mem=250G` and `--cpus-per-task=7` for the partition `power9`.
- `--mem=900G` and `--cpus-per-task=6` for the partition `alpha`.

Keep in mind that only one memory parameter (`--mem-per-cpu=<MB>` or `--mem=<MB>`) can be specified.
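For instance, an interactive single-GPU request on `alpha` combining these limits could look like this sketch (the login host name and time limit are illustrative, not from the original page):

```console
marie@login.alpha$ srun --gres=gpu:1 --cpus-per-task=6 --mem=900G --time=01:00:00 --pty bash
```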
@@ -338,7 +338,7 @@ Hello from: 0
module load modenv/ml
module load Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python-3.7.4

srun python your_program.py
```
Do not forget to specify the total number of tasks `--ntasks` and the number of tasks per node
...
@@ -23,14 +23,14 @@ When allocating resources on a GPU-node, you must specify the number of requeste
#SBATCH --ntasks=1                  # All #SBATCH lines have to follow uninterrupted
#SBATCH --time=01:00:00             # after the shebang line
#SBATCH --account=p_number_crunch   # Comments start with # and do not count as interruptions
#SBATCH --job-name=fancyExp
#SBATCH --output=simulation-%j.out
#SBATCH --error=simulation-%j.err
#SBATCH --gres=gpu:1                # request GPU(s) from Slurm

module purge                        # Set up environment, e.g., clean modules environment
module load module/version module2  # and load necessary modules

srun ./application [options]        # Execute parallel application with srun
```
@@ -39,7 +39,7 @@ Alternatively, you can work on the clusters interactively:
```bash
marie@login.<cluster_name>$ srun --nodes=1 --gres=gpu:<N> --time=00:30:00 --pty bash
marie@compute$ module purge; module switch release/<env>
```
## Directive Based GPU Programming
@@ -60,10 +60,6 @@ Please use the following information as a start on OpenACC:
OpenACC can be used with the PGI and NVIDIA HPC compilers. The NVIDIA HPC compiler, as part of the
[NVIDIA HPC SDK](https://docs.nvidia.com/hpc-sdk/index.html), supersedes the PGI compiler.

The `nvc` compiler (NOT the `nvcc` compiler, which is used for CUDA) is available for the NVIDIA
Tesla V100 and NVIDIA A100 nodes.
@@ -74,7 +70,7 @@ Tesla V100 and Nvidia A100 nodes.
* For compilation, please add the compiler flag `-acc` to enable OpenACC interpreting by the
  compiler
* `-Minfo` tells you what the compiler is actually doing to your code
* Add `-ta=nvidia:ampere` to enable optimizations for the A100 GPUs
* You may find further information on the PGI compiler in the
  [user guide](https://docs.nvidia.com/hpc-sdk/pgi-compilers/20.4/x86/pgi-user-guide/index.htm)
  and in the [reference guide](https://docs.nvidia.com/hpc-sdk/pgi-compilers/20.4/x86/pgi-ref-guide/index.htm),
@@ -149,7 +145,6 @@ provided as well. The [toolkit documentation page](https://docs.nvidia.com/cuda/
the [programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) and the
[best practice guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html).
Optimization guides for supported NVIDIA architectures are available, including for
[Volta (V100)](https://docs.nvidia.com/cuda/volta-tuning-guide/index.html) and
[Ampere (A100)](https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html).
@@ -175,7 +170,6 @@ is used as the host compiler. The following flags may be useful:
* `--generate-code` (`-gencode`): generate optimized code for a target GPU (caution: these binaries
  cannot be used with GPUs of other generations).
    * For Volta (V100): `--generate-code arch=compute_70,code=sm_70`,
    * For Ampere (A100): `--generate-code arch=compute_80,code=sm_80`
* `-Xcompiler`: pass flags to the host compiler. E.g., generate OpenMP-parallel host code:
@@ -189,15 +183,9 @@ and the [performance guidelines](https://docs.nvidia.com/cuda/cuda-c-programming
for possible steps to take for the performance analysis and optimization.

Multiple tools can be used for the performance analysis.
For the analysis of applications on the newer GPUs (V100 and A100),
we recommend the use of the newer NVIDIA Nsight tools: [Nsight Systems](https://developer.nvidia.com/nsight-systems)
for system-wide sampling and tracing and [Nsight Compute](https://developer.nvidia.com/nsight-compute)
for a detailed analysis of individual kernels.
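As a sketch of how these tools are typically invoked on a compute node (`nsys` and `ncu` are the command-line front ends of Nsight Systems and Nsight Compute; the report file names are illustrative):

```console
marie@compute$ nsys profile -o timeline ./your/cuda/application   # system-wide sampling and tracing
marie@compute$ ncu -o kernels ./your/cuda/application             # detailed per-kernel analysis
```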
### NVIDIA nvprof & Visual Profiler
...