diff --git a/doc.zih.tu-dresden.de/docs/application/request_for_resources.md b/doc.zih.tu-dresden.de/docs/application/request_for_resources.md index 09c5161147505b4ac79814cbf8d5aec654beb15c..51da946726c926a401d7f74bb7d5a069bd653860 100644 --- a/doc.zih.tu-dresden.de/docs/application/request_for_resources.md +++ b/doc.zih.tu-dresden.de/docs/application/request_for_resources.md @@ -14,9 +14,6 @@ Think in advance about the parallelization strategy for your project and how to ## Available Software Pre-installed software on our HPC systems is managed via [modules](../software/modules.md). -You can see the -[list of software that's already installed and accessible via modules](https://gauss-allianz.de/de/application?organizations%5B0%5D=1200). -However, there are many -different variants of these modules available. We have divided these into two different software -environments: `scs5` (for regular partitions) and `ml` (for the Machine Learning partition). Within -each environment there are further dependencies and variants. +There are many variants of these modules available. +We have divided these into different software environments, such as `release/23.04`. +Within each environment there are further dependencies and variants. diff --git a/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md b/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md deleted file mode 100644 index d44116c0b97ff204bafc3cf5240340627769bec5..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/archive/beegfs_on_demand.md +++ /dev/null @@ -1,163 +0,0 @@ ---- -search: - boost: 0.00001 ---- - -# BeeGFS Filesystem on Demand (Outdated) - -!!! warning - - This documentation page is outdated. - Please see the [new BeeGFS page](../data_lifecycle/beegfs.md). - -**Prerequisites:** To work with TensorFlow you obviously need a [login](../application/overview.md) -to the ZIH systems and basic knowledge about Linux, mounting, and batch system Slurm. - -**Aim** of this page is to introduce -users how to start working with the BeeGFS filesystem - a high-performance parallel filesystem. - -## Mount Point - -Understanding of mounting and the concept of the mount point is important for using filesystems and -object storage. A mount point is a directory (typically an empty one) in the currently accessible -filesystem on which an additional filesystem is mounted (i.e., logically attached). The default -mount points for a system are the directories in which filesystems will be automatically mounted -unless told by the user to do otherwise. All partitions are attached to the system via a mount -point. The mount point defines the place of a particular data set in the filesystem. Usually, all -partitions are connected through the root partition. On this partition, which is indicated with the -slash (/), directories are created. - -## BeeGFS Introduction - -[BeeGFS](https://www.beegfs.io/content/) is the parallel cluster filesystem. BeeGFS spreads data -across multiple servers to aggregate capacity and performance of all servers to provide a highly -scalable shared network filesystem with striped file contents. This is made possible by the -separation of metadata and file contents. - -BeeGFS is fast, flexible, and easy to manage storage if for your issue -filesystem plays an important role use BeeGFS. It addresses everyone, -who needs large and/or fast file storage.
- -## Create BeeGFS Filesystem - -To reserve nodes for creating BeeGFS filesystem you need to create a -[batch](../jobs_and_resources/slurm.md) job - -```Bash -#!/bin/bash -#SBATCH -p nvme -#SBATCH -N 4 -#SBATCH --exclusive -#SBATCH --time=1-00:00:00 -#SBATCH --beegfs-create=yes - -srun sleep 1d # sleep for one day - -## when finished writing, submit with: sbatch <script_name> -``` - -Example output with job id: - -```Bash -Submitted batch job 11047414 #Job id n.1 -``` - -Check the status of the job with `squeue -u \<username>`. - -## Mount BeeGFS Filesystem - -You can mount BeeGFS filesystem on the partition ml (PowerPC architecture) or on the -partition haswell (x86_64 architecture). - -### Mount BeeGFS Filesystem on the Partition `ml` - -Job submission can be done with the command (use job id (n.1) from batch job used for creating -BeeGFS system): - -```console -srun -p ml --beegfs-mount=yes --beegfs-jobid=11047414 --pty bash #Job submission on ml nodes -```console - -Example output: - -```console -srun: job 11054579 queued and waiting for resources #Job id n.2 -srun: job 11054579 has been allocated resources -``` - -### Mount BeeGFS Filesystem on the Haswell Nodes (x86_64) - -Job submission can be done with the command (use job id (n.1) from batch -job used for creating BeeGFS system): - -```console -srun --constrain=DA --beegfs-mount=yes --beegfs-jobid=11047414 --pty bash #Job submission on the Haswell nodes -``` - -Example output: - -```console -srun: job 11054580 queued and waiting for resources #Job id n.2 -srun: job 11054580 has been allocated resources -``` - -## Working with BeeGFS files for both types of nodes - -Show contents of the previously created file, for example, -`beegfs_11054579` (where 11054579 - job id **n.2** of srun job): - -```console -cat .beegfs_11054579 -``` - -Note: don't forget to go over to your home directory where the file located - -Example output: - -```Bash -#!/bin/bash - -export BEEGFS_USER_DIR="/mnt/beegfs/<your_id>_<name_of_your_job>/<your_id>" -export BEEGFS_PROJECT_DIR="/mnt/beegfs/<your_id>_<name_of_your_job>/<name of your project>" -``` - -Execute the content of the file: - -```console -source .beegfs_11054579 -``` - -Show content of user's BeeGFS directory with the command: - -```console -ls -la ${BEEGFS_USER_DIR} -``` - -Example output: - -```console -total 0 -drwx--S--- 2 <username> swtest 6 21. Jun 10:54 . -drwxr-xr-x 4 root root 36 21. Jun 10:54 .. -``` - -Show content of the user's project BeeGFS directory with the command: - -```console -ls -la ${BEEGFS_PROJECT_DIR} -``` - -Example output: - -```console -total 0 -drwxrws--T 2 root swtest 6 21. Jun 10:54 . -drwxr-xr-x 4 root root 36 21. Jun 10:54 .. -``` - -!!! note - - If you want to mount the BeeGFS filesystem on an x86 instead of an ML (power) node, you can - either choose the partition "interactive" or the partition `haswell64`, but for the partition - `haswell64` you have to add the parameter `--exclude=taurusi[4001-4104,5001-5612]` to your job. - This is necessary because the BeeGFS client is only installed on the 6000 island. diff --git a/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md b/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md index b5687bc08b658648a1e3d9896843ad30e15e87ec..859aa52c4a9b648cbdbe96398239093407b20bcf 100644 --- a/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md +++ b/doc.zih.tu-dresden.de/docs/archive/install_jupyter.md @@ -50,11 +50,11 @@ one is to download Anaconda in your home directory. 1. 
Load Anaconda module (recommended): ```console -marie@compute$ module load modenv/scs5 +marie@compute$ module load release/23.04 marie@compute$ module load Anaconda3 ``` -1. Download latest Anaconda release (see example below) and change the rights to make it an +1. Download the latest Anaconda release (see example below) and change the rights to make it an executable script and run the installation script: ```console diff --git a/doc.zih.tu-dresden.de/docs/archive/scs5_software.md b/doc.zih.tu-dresden.de/docs/archive/scs5_software.md index 79c7ba16d393f0f31963e9d6fe4e69dfcefcffd3..2f2b99d823da6fff15b71ba9ad31c6605e9f6728 100644 --- a/doc.zih.tu-dresden.de/docs/archive/scs5_software.md +++ b/doc.zih.tu-dresden.de/docs/archive/scs5_software.md @@ -42,44 +42,13 @@ module available ml av ``` -There is a special module that is always loaded (sticky) called -**modenv**. It determines the module environment you can see. +There is a special module that is always loaded (sticky) called **release**. +It determines the modules you can use. | Module Environment | Description | Status | |--------------------|---------------------------------------------|---------| -| `modenv/scs5` | SCS5 software | default | -| `modenv/ml` | Software for data analytics (partition ml) | | -| `modenv/classic` | Manually built pre-SCS5 (AE4.0) software | hidden | - -The old modules (pre-SCS5) are still available after loading the -corresponding **modenv** version (**classic**), however, due to changes -in the libraries of the operating system, it is not guaranteed that they -still work under SCS5. That's why those modenv versions are hidden. - -Example: - -```Bash -marie@compute$ ml modenv/classic ansys/19.0 - -The following have been reloaded with a version change: - 1) modenv/scs5 => modenv/classic - -Module ansys/19.0 loaded. -``` - -**modenv/scs5** will be loaded by default and contains all the software -that was built especially for SCS5. - -### Which modules should I use? - -If possible, please use the modules from **modenv/scs5**. In case there is a certain software -missing, you can write an [email to hpcsupport](mailto:hpc-support@tu-dresden.de) and we will try -to install the latest version of this particular software for you. - -However, if you still need *older* versions of some software, you have to resort to using the -modules in the old module environment (**modenv/classic** most probably). We won't keep those around -forever though, so in the long-term, it is advisable to migrate your workflow to up-to-date versions -of the software used. +| `release/23.04` | Release of April 2023 | default | +| `release/23.10` | Release of October 2023 | | ### Compilers, MPI-Libraries and Toolchains diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md index eea01a7ce3540c33e0cca0a0d7fdf1a205ff6c28..4be982b9ed1d41b11ea1ae98ea1456195d3105e9 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md @@ -98,7 +98,7 @@ can find it via `squeue --me`. The job ID allows you to !!! warning "srun vs. mpirun" On ZIH systems, `srun` is used to run your parallel application. The use of `mpirun` is provenly - broken on partitions `ml` and `alpha` for jobs requiring more than one node. Especially when + broken on clusters `Power9` and `Alpha` for jobs requiring more than one node. 
Especially when using code from github projects, double-check its configuration by looking for a line like 'submit command mpirun -n $ranks ./app' and replace it with 'srun ./app'. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md index d705c86314b6d06a11b1ca1a061f802ed7c59905..1dcc8f1d995ac0b467d3ecce67eea9879e38215c 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_examples.md @@ -58,7 +58,7 @@ For MPI-parallel jobs one typically allocates one core per task that has to be s ### Multiple Programs Running Simultaneously in a Job In this short example, our goal is to run four instances of a program concurrently in a **single** -batch script. Of course, we could also start a batch script four times with `sbatch` but this is not +batch script. Of course, we could also start a batch script four times with `sbatch`, but this is not what we want to do here. However, you can also find an example about [how to run GPU programs simultaneously in a single job](#running-multiple-gpu-applications-simultaneously-in-a-batch-job) below. @@ -109,10 +109,12 @@ But, do you need to request tasks or CPUs from Slurm in order to provide resourc ## Requesting GPUs -Slurm will allocate one or many GPUs for your job if requested. Please note that GPUs are only -available in certain partitions, like `gpu2`, `gpu3` or `gpu2-interactive`. The option -for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NODE]` (where `NUM_PER_NODE` can be `1`, `2` or -`4`, meaning that one, two or four of the GPUs per node will be used for the job). +Slurm will allocate one or more GPUs for your job if requested. +Please note that GPUs are only available in the GPU clusters, like +[Alpha Centauri](hardware_overview.md#alpha-centauri) and +[Power9](hardware_overview.md#power9). +The option for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NODE]`, +where `NUM_PER_NODE` is the number of GPUs **per node** that will be used for the job. !!! example "Job file to request a GPU" @@ -129,10 +131,9 @@ for `sbatch/srun` in this case is `--gres=gpu:[NUM_PER_NO srun ./your/cuda/application # start you application (probably requires MPI to use both nodes) ``` -Please be aware that the partitions `gpu`, `gpu1` and `gpu2` can only be used for non-interactive -jobs which are submitted by `sbatch`. Interactive jobs (`salloc`, `srun`) will have to use the -partition `gpu-interactive`. Slurm will automatically select the right partition if the partition -parameter `-p, --partition` is omitted. +With the transition to the sub-clusters, it is no longer required to specify the partition with `-p, --partition`. +The option can still be used; a job submitted with a partition that does not match the cluster will fail. +This is useful to document which cluster a job script is intended for and to avoid accidentally submitting it to the wrong cluster. !!! note @@ -144,28 +145,37 @@ parameter `-p, --partition` is omitted. ### Limitations of GPU Job Allocations The number of cores per node that are currently allowed to be allocated for GPU jobs is limited -depending on how many GPUs are being requested. On the K80 nodes, you may only request up to 6 -cores per requested GPU (8 per on the K20 nodes). This is because we do not wish that GPUs remain -unusable due to all cores on a node being used by a single job which does not, at the same time, -request all GPUs.
+depending on how many GPUs are being requested. +On Alpha Centauri you may only request up to 6 cores per requested GPU. +This is to prevent GPUs from becoming unusable when all cores on a node are taken by +a single job which does not, at the same time, request all GPUs. E.g., if you specify `--gres=gpu:2`, your total number of cores per node (meaning: -`ntasks`*`cpus-per-task`) may not exceed 12 (on the K80 nodes) +`ntasks`*`cpus-per-task`) may not exceed 12 (on Alpha Centauri). -Note that this also has implications for the use of the `--exclusive` parameter. Since this sets the -number of allocated cores to 24 (or 16 on the K20X nodes), you also **must** request all four GPUs -by specifying `--gres=gpu:4`, otherwise your job will not start. In the case of `--exclusive`, it won't -be denied on submission, because this is evaluated in a later scheduling step. Jobs that directly -request too many cores per GPU will be denied with the error message: +Note that this also has implications for the use of the `--exclusive` parameter. +Since this sets the number of allocated cores to 48, you also **must** request all eight GPUs +by specifying `--gres=gpu:8`, otherwise your job will not start. +In the case of `--exclusive`, it won't be denied on submission, +because this is evaluated in a later scheduling step. +Jobs that directly request too many cores per GPU will be denied with the error message: ```console Batch job submission failed: Requested node configuration is not available ``` +Similarly, it is not allowed to start CPU-only jobs on the GPU cluster. +I.e., you must request at least one GPU there, or you will get this error message: + +```console +srun: error: QOSMinGRES +srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits) +``` + ### Running Multiple GPU Applications Simultaneously in a Batch Job Our starting point is a (serial) program that needs a single GPU and four CPU cores to perform its -task (e.g. TensorFlow). The following batch script shows how to run such a job on the partition `ml`. +task (e.g. TensorFlow). The following batch script shows how to run such a job on the cluster `Power9`. !!! example @@ -177,7 +187,7 @@ task (e.g. TensorFlow). The following batch script shows how to run such a job o #SBATCH --gpus-per-task=1 #SBATCH --time=01:00:00 #SBATCH --mem-per-cpu=1443 - #SBATCH --partition=ml + #SBATCH --partition=power9 srun some-gpu-application ``` @@ -186,7 +196,7 @@ When `srun` is used within a submission script, it inherits parameters from `sba `--ntasks=1`, `--cpus-per-task=4`, etc. So we actually implicitly run the following ```bash -srun --ntasks=1 --cpus-per-task=4 [...] --partition=ml <some-gpu-application> +srun --ntasks=1 --cpus-per-task=4 [...] some-gpu-application ``` Now, our goal is to run four instances of this program concurrently in a single batch script. Of @@ -220,7 +230,7 @@ three things: #SBATCH --gpus-per-task=1 #SBATCH --time=01:00:00 #SBATCH --mem-per-cpu=1443 - #SBATCH --partition=ml + #SBATCH --partition=power9 srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & srun --exclusive --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --gpus-per-task=1 --mem-per-cpu=1443 some-gpu-application & @@ -232,12 +242,12 @@ three things: echo "All jobs completed!"
``` -In practice it is possible to leave out resource options in `srun` that do not differ from the ones +In practice, it is possible to leave out resource options in `srun` that do not differ from the ones inherited from the surrounding `sbatch` context. The following line would be sufficient to do the job in this example: ```bash -srun --exclusive --gres=gpu:1 --ntasks=1 <some-gpu-application> & +srun --exclusive --gres=gpu:1 --ntasks=1 some-gpu-application & ``` Yet, it adds some extra safety to leave them in, enabling the Slurm batch system to complain if not diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_generator.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_generator.md index 2de9a37786fbf3cc7c5ce1f4e604234b29597124..963139b6d58b72c4c481383d65e8c9475865a36e 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_generator.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_generator.md @@ -329,31 +329,18 @@ along with sgen. If not, see <http://www.gnu.org/licenses/>. <script> // dictionary containing the limits for the different partitions const limitsPartition = { - 'gpu2' : gpu2 = { + 'alpha' : alpha = { 'MaxTime': 'INFINITE', 'DefaultTime': 480, 'Sockets': 2, - 'CoresPerSocket': 12, - 'ThreadsPerCore': 1, - 'nodes': 59, - 'GPU': 4, - 'HTCores': 24, - 'Cores': 24, - 'MemoryPerNode': 62000, - 'MemoryPerCore': 2583 - }, - 'gpu2-interactive': { - 'MaxTime': 480, - 'DefaultTime': 10, - 'Sockets': 2, - 'CoresPerSocket': 12, - 'ThreadsPerCore': 1, - 'nodes': 59, - 'GPU': 4, - 'HTCores': 24, - 'Cores': 24, - 'MemoryPerNode': 62000, - 'MemoryPerCore': 2583 + 'CoresPerSocket': 24, + 'ThreadsPerCore': 2, + 'nodes': 37, + 'GPU': 8, + 'HTCores': 96, + 'Cores': 48, + 'MemoryPerNode': 990000, + 'MemoryPerCore': 10312 }, 'haswell' : haswell = { 'MaxTime': 'INFINITE', @@ -433,7 +420,7 @@ along with sgen. If not, see <http://www.gnu.org/licenses/>. 'MemoryPerNode': 95000, 'MemoryPerCore': 7916 }, - 'ml' : ml = { + 'power9' : power9 = { 'MaxTime': 'INFINITE', 'DefaultTime': 60, 'Sockets': 2, @@ -446,19 +433,6 @@ along with sgen. If not, see <http://www.gnu.org/licenses/>. 
'MemoryPerNode': 254000, 'MemoryPerCore': 1443 }, - 'ml-interactive': { - 'MaxTime': 480, - 'DefaultTime': 10, - 'Sockets': 2, - 'CoresPerSocket': 22, - 'ThreadsPerCore': 4, - 'nodes': 2, - 'GPU': 6, - 'HTCores': 176, - 'Cores': 44, - 'MemoryPerNode': 254000, - 'MemoryPerCore': 1443 - }, 'romeo' : romeo = { 'MaxTime': 'INFINITE', 'DefaultTime': 480, diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_limits.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_limits.md index c7018d096bf9998fa961aa02747b66943598e3d9..717505309502ead9d80419c0b510dcfbd815899f 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_limits.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm_limits.md @@ -76,9 +76,6 @@ following table depicts the resource limits for [all our HPC systems](hardware_o | HPC System | Nodes | # Nodes | Cores per Node | Threads per Core | Memory per Node [in MB] | Memory per (SMT) Core [in MB] | GPUs per Node | Job Max Time | |:-----------|:------|--------:|---------------:|-----------------:|------------------------:|------------------------------:|--------------:|-------------:| -| gpu2 | taurusi[2045-2103] | 59 | 24 | 1 | 62,000 | 2,583 | 4 | | -| gpu2-interactive | taurusi[2045-2103] | 59 | 24 | 1 | 62,000 | 2,583 | 4 | | -| hpdlf | taurusa[3-16] | 14 | 12 | 1 | 95,000 | 7,916 | 3 | | | [`Barnard`](barnard.md) | `n[1001-1630].barnard` | 630 | 104 | 2 | 515,000 | 4,951 | - | unlimited | | [`Power9`](power9.md) | `ml[1-29].power9` | 29 | 44 | 4 | 254,000 | 1,443 | 6 | unlimited | | [`Romeo`](romeo.md) | `i[8001-8190].romeo` | 190 | 128 | 2 | 505,000 | 1,972 | - | unlimited | diff --git a/doc.zih.tu-dresden.de/docs/quickstart/getting_started.md b/doc.zih.tu-dresden.de/docs/quickstart/getting_started.md index ffa80d58f92b4f47da6b73851062e9704b194812..57d9f7c7d6f6ba293ebe8223c517dd3d443a5b0e 100644 --- a/doc.zih.tu-dresden.de/docs/quickstart/getting_started.md +++ b/doc.zih.tu-dresden.de/docs/quickstart/getting_started.md @@ -359,17 +359,14 @@ marie@login$ module spider Python/3.9.5 ``` In some cases it is required to load additional modules before loading the desired software. -In the example above, these are `modenv/hiera` and `GCCcore/10.3.0`. +In the example above, it is `GCCcore/11.3.0`. - Load prerequisites and the desired software: ```console -marie@login$ module load modenv/hiera GCCcore/10.3.0 # load prerequisites +marie@login$ module load GCCcore/11.3.0 # load prerequisites -The following have been reloaded with a version change: - 1) modenv/scs5 => modenv/hiera - -Module GCCcore/10.3.0 loaded. +Module GCCcore/11.3.0 loaded. marie@login$ module load Python/3.9.5 # load desired version of software Module Python/3.9.5 and 11 dependencies loaded. diff --git a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md index cfc5c6d82a1f60f5147f2e8a86673f3b66c27d86..171b416294e2939cc300a8622873081e4c90017d 100644 --- a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md +++ b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md @@ -2,8 +2,9 @@ [Apache Spark](https://spark.apache.org/), [Apache Flink](https://flink.apache.org/) and [Apache Hadoop](https://hadoop.apache.org/) are frameworks for processing and integrating -Big Data. These frameworks are also offered as software [modules](modules.md) in both `ml` and -`scs5` software environments. You can check module versions and availability with the command +Big Data. 
+These frameworks are also offered as software [modules](modules.md). +You can check module versions and availability with the command === "Spark" ```console diff --git a/doc.zih.tu-dresden.de/docs/software/cicd.md b/doc.zih.tu-dresden.de/docs/software/cicd.md index 7294622a292acb18d586ccee3108332f6e555272..f5eb3b42c05ee7b48e712ab5fe19fab4eca6899a 100644 --- a/doc.zih.tu-dresden.de/docs/software/cicd.md +++ b/doc.zih.tu-dresden.de/docs/software/cicd.md @@ -74,10 +74,10 @@ Use the variable `SCHEDULER_PARAMETERS` and define the same parameters you would !!! example The following YAML file defines a configuration section `.test-job`, and two jobs, - `test-job-haswell` and `test-job-ml`, extending from that. The two job share the + `test-job-haswell` and `test-job-power9`, extending from that. The two job share the `before_script`, `script`, and `after_script` configuration, but differ in the - `SCHEDULER_PARAMETERS`. The `test-job-haswell` and `test-job-ml` are scheduled on the partition - `haswell` and partition `ml`, respectively. + `SCHEDULER_PARAMETERS`. The `test-job-haswell` and `test-job-power9` are scheduled on the partition + `haswell` and partition `power9`, respectively. ``` yaml .test-job: @@ -100,10 +100,10 @@ Use the variable `SCHEDULER_PARAMETERS` and define the same parameters you would SCHEDULER_PARAMETERS: -p haswell - test-job-ml: + test-job-power9: extends: .test-job variables: - SCHEDULER_PARAMETERS: -p ml + SCHEDULER_PARAMETERS: -p power9 ``` ## Current limitations diff --git a/doc.zih.tu-dresden.de/docs/software/containers.md b/doc.zih.tu-dresden.de/docs/software/containers.md index 402dedb5e1b668d3c8d71398a78bb97a2fb0319e..f15caa5c8fbade1ec026c55d64a214da679cc6f1 100644 --- a/doc.zih.tu-dresden.de/docs/software/containers.md +++ b/doc.zih.tu-dresden.de/docs/software/containers.md @@ -9,11 +9,12 @@ opposed to Docker (the most famous container solution), Singularity is much more used in an HPC environment and more efficient in many cases. Docker images can easily be used in Singularity. Information about the use of Singularity on ZIH systems can be found on this page. -In some cases using Singularity requires a Linux machine with root privileges (e.g. using the -partition `ml`), the same architecture and a compatible kernel. For many reasons, users on ZIH -systems cannot be granted root permissions. A solution is a Virtual Machine (VM) on the partition -`ml` which allows users to gain root permissions in an isolated environment. There are two main -options on how to work with Virtual Machines on ZIH systems: +In some cases using Singularity requires a Linux machine with root privileges +(e.g. using the cluster `Power9`), the same architecture and a compatible kernel. +For many reasons, users on ZIH systems cannot be granted root permissions. +A solution is a Virtual Machine (VM) on the cluster `Power9` which allows users to gain +root permissions in an isolated environment. +There are two main options on how to work with Virtual Machines on ZIH systems: 1. [VM tools](singularity_power9.md): Automated algorithms for using virtual machines; 1. 
[Manual method](virtual_machines.md): It requires more operations but gives you more flexibility diff --git a/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md b/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md index 913665cc3f464626c42b18b3c219aeb0b63997c0..c009e532ca6fe55acfe09eabb917a8d2fe64ecec 100644 --- a/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md +++ b/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md @@ -76,24 +76,15 @@ marie@login$ srun --nodes=1 --cpus-per-task=4 --time=08:00:00 --pty /bin/bash -l **Step 3:** Specify the workspace. The rest of the guide is based on it. Please create an environment variable called `WORKSPACE` with the path to your workspace: -_The module environments /hiera, /scs5, /classic and /ml originated from the taurus system are -momentarily under construction. The script will be updated after completion of the redesign -accordingly_ - ```console marie@compute$ export WORKSPACE=/data/horse/ws/marie-EasyBuild #see output of ws_list above ``` -**Step 4:** Load the correct module environment `modenv` according to your current or target -architecture: +**Step 4:** Load the correct module environment `release` according to your needs: -=== "x86 (default, e. g. partition `haswell`)" - ```console - marie@compute$ module load modenv/scs5 - ``` -=== "Power9 (partition `ml`)" +=== "23.04" ```console - marie@ml$ module load modenv/ml + marie@compute$ module load release/23.04 ``` **Step 5:** Load module `EasyBuild` diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md index b319fc3ee43ea151a8bc781697ae1ebdc08af258..90f82449fff8175fd15ee31ba3d90830d860da16 100644 --- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md @@ -434,10 +434,6 @@ comm = MPI.COMM_WORLD print("%d of %d" % (comm.Get_rank(), comm.Get_size())) ``` -_The module environments /hiera, /scs5, /classic and /ml originated from the taurus system are -momentarily under construction. The script will be updated after completion of the redesign -accordingly_ - For the multi-node case, use a script similar to this: ```bash diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md index 266f145ca9783978e517db6aac66ac58acbd65fb..bd994d6fa510c313c9c103095668b2bf47d0e433 100644 --- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md @@ -14,10 +14,6 @@ see our [hardware documentation](../jobs_and_resources/hardware_overview.md). In the following example, the `srun` command is used to start an interactive job, so that the output is visible to the user. Please check the [Slurm page](../jobs_and_resources/slurm.md) for details. -_The module environments /hiera, /scs5, /classic and /ml originated from the taurus system are -momentarily under construction. 
The script will be updated after completion of the redesign -accordingly_ - ```console marie@login.barnard$ srun --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash [marie@barnard ]$ module load release/23.10 GCC/11.3.0 OpenMPI/4.1.4 R/4.2.1 @@ -265,10 +261,6 @@ Submitting a multicore R job to Slurm is very similar to submitting an [OpenMP Job](../jobs_and_resources/binding_and_distribution_of_tasks.md), since both are running multicore jobs on a **single** node. Below is an example: -_The module environments /hiera, /scs5, /classic and /ml originated from the taurus system are -momentarily under construction. The script will be updated after completion of the redesign -accordingly_ - ```Bash #!/bin/bash #SBATCH --nodes=1 diff --git a/doc.zih.tu-dresden.de/docs/software/distributed_training.md b/doc.zih.tu-dresden.de/docs/software/distributed_training.md index 096281640d192b434714cfecb30628026137d1a3..a1d1f0ee5797003cae77c4433c9d6037276e7439 100644 --- a/doc.zih.tu-dresden.de/docs/software/distributed_training.md +++ b/doc.zih.tu-dresden.de/docs/software/distributed_training.md @@ -123,8 +123,7 @@ Each worker runs the training loop independently. IP_1=$(dig +short ${NODE_1}.alpha.hpc.tu-dresden.de) IP_2=$(dig +short ${NODE_2}.alpha.hpc.tu-dresden.de) - module load modenv/hiera - module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 TensorFlow/2.4.1 + module load release/23.04 GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 TensorFlow/2.4.1 # On the first node TF_CONFIG='{"cluster": {"worker": ["'"${NODE_1}"':33562", "'"${NODE_2}"':33561"]}, "task": {"index": 0, "type": "worker"}}' srun --nodelist=${NODE_1} --nodes=1 --ntasks=1 --gres=gpu:1 python main_ddl.py & @@ -209,8 +208,8 @@ parameter `--ntasks-per-node=<N>` equals the number of GPUs you use per node. Also, it can be useful to increase `memory/cpu` parameters if you run larger models. Memory can be set up to: -- `--mem=250G` and `--cpus-per-task=7` for the partition `ml`. -- `--mem=60G` and `--cpus-per-task=6` for the partition `gpu2`. +- `--mem=250G` and `--cpus-per-task=7` for the `Power9` cluster. +- `--mem=900G` and `--cpus-per-task=6` for the `Alpha` cluster. Keep in mind that only one memory parameter (`--mem-per-cpu=<MB>` or `--mem=<MB>`) can be specified. 
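A minimal `sbatch` header sketch that combines the recommendations above for the `Alpha` cluster (illustrative only, not taken from the repository): one task per GPU via `--ntasks-per-node`, at most 6 cores per requested GPU, and exactly one memory parameter. The node count, time limit, and script name are assumptions.

```bash
#!/bin/bash
#SBATCH --nodes=2              # number of nodes is an assumption for illustration
#SBATCH --ntasks-per-node=8    # one task per GPU; Alpha nodes provide 8 GPUs
#SBATCH --gres=gpu:8           # request all GPUs of each node
#SBATCH --cpus-per-task=6      # at most 6 cores per requested GPU
#SBATCH --mem=900G             # only one memory parameter may be given
#SBATCH --time=01:00:00

srun python your_distributed_training.py   # placeholder for the actual training script
```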
@@ -260,7 +259,7 @@ Or if you want to use Horovod on the cluster `alpha`, you can load it with the d ```console marie@alpha$ module spider Horovod #Check available modules -marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 Horovod/0.21.1-TensorFlow-2.4.1 +marie@alpha$ module load release/23.04 GCC/11.3.0 OpenMPI/4.1.4 Horovod/0.28.1-CUDA-11.7.0-TensorFlow-2.11.0 ``` #### Horovod Installation @@ -335,10 +334,10 @@ Hello from: 0 #SBATCH --time=01:00:00 #SBATCH --output=run_horovod.out - module load modenv/ml + module load release/23.04 module load Horovod/0.19.5-fosscuda-2019b-TensorFlow-2.2.0-Python-3.7.4 - srun python <your_program.py> + srun python your_program.py ``` Do not forget to specify the total number of tasks `--ntasks` and the number of tasks per node diff --git a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md index 5ec0a83332a4178b89ed68f7399c47035b4d8a2f..5de6a12d965c768ab9bc1edc4cecfb7294f5a717 100644 --- a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md +++ b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md @@ -6,11 +6,6 @@ The full hardware specifications of the GPU-compute nodes may be found in the [HPC Resources](../jobs_and_resources/hardware_overview.md#hpc-resources) page. Each node uses a different modules(modules.md#module-environments): -* [NVIDIA A100 nodes](../jobs_and_resources/hardware_overview.md#amd-rome-cpus-nvidia-a100) -(cluster `alpha`): use the `hiera` module environment (`module switch modenv/hiera`) -* [NVIDIA Tesla V100 nodes](../jobs_and_resources/hardware_overview.md#ibm-power9-nodes-for-machine-learning) -(cluster `power9`): use the `module spider <module name>` - ## Using GPUs with Slurm For general information on how to use Slurm, read the respective [page in this compendium](../jobs_and_resources/slurm.md). @@ -23,14 +18,14 @@ When allocating resources on a GPU-node, you must specify the number of requeste #SBATCH --ntasks=1 # All #SBATCH lines have to follow uninterrupted #SBATCH --time=01:00:00 # after the shebang line - #SBATCH --account=<KTR> # Comments start with # and do not count as interruptions + #SBATCH --account=p_number_crunch # Comments start with # and do not count as interruptions #SBATCH --job-name=fancyExp #SBATCH --output=simulation-%j.out #SBATCH --error=simulation-%j.err #SBATCH --gres=gpu:1 # request GPU(s) from Slurm module purge # Set up environment, e.g., clean modules environment - module load <modules> # and load necessary modules + module load module/version module2 # and load necessary modules srun ./application [options] # Execute parallel application with srun ``` Alternatively, you can work on the clusters interactively: ```bash marie@login.<cluster_name>$ srun --nodes=1 --gres=gpu:<N> --runtime=00:30:00 --pty bash -marie@compute$ module purge; module switch modenv/<env> +marie@compute$ module purge; module switch release/<env> ``` ## Directive Based GPU Programming @@ -60,10 +55,6 @@ Please use the following information as a start on OpenACC: OpenACC can be used with the PGI and NVIDIA HPC compilers. The NVIDIA HPC compiler, as part of the [NVIDIA HPC SDK](https://docs.nvidia.com/hpc-sdk/index.html), supersedes the PGI compiler. -Various versions of the PGI compiler are available on the -[NVIDIA Tesla K80 GPUs nodes](../jobs_and_resources/hardware_overview.md#island-2-phase-2-intel-haswell-cpus-nvidia-k80-gpus) -(partition `gpu2`).
- The `nvc` compiler (NOT the `nvcc` compiler, which is used for CUDA) is available for the NVIDIA Tesla V100 and Nvidia A100 nodes. @@ -74,7 +65,7 @@ Tesla V100 and Nvidia A100 nodes. * For compilation, please add the compiler flag `-acc` to enable OpenACC interpreting by the compiler * `-Minfo` tells you what the compiler is actually doing to your code -* Add `-ta=nvidia:keple` to enable optimizations for the K80 GPUs +* Add `-ta=nvidia:ampere` to enable optimizations for the A100 GPUs * You may find further information on the PGI compiler in the [user guide](https://docs.nvidia.com/hpc-sdk/pgi-compilers/20.4/x86/pgi-user-guide/index.htm) and in the [reference guide](https://docs.nvidia.com/hpc-sdk/pgi-compilers/20.4/x86/pgi-ref-guide/index.htm), @@ -108,7 +99,7 @@ Furthermore, some compilers, such as GCC, have basic support for target offloadi enable these features by default and/or achieve poor performance. On the ZIH system, compilers with OpenMP target offloading support are provided on the clusters -`ml` and `alpha`. Two compilers with good performance can be used: the NVIDIA HPC compiler and the +`power9` and `alpha`. Two compilers with good performance can be used: the NVIDIA HPC compiler and the IBM XL compiler. #### Using OpenMP target offloading with NVIDIA HPC compilers @@ -125,8 +116,7 @@ available for OpenMP, including the `-gpu=ccXY` flag as mentioned above. #### Using OpenMP target offloading with the IBM XL compilers The IBM XL compilers (`xlc` for C, `xlc++` for C++ and `xlf` for Fortran (with sub-version for -different versions of Fortran)) are only available on the cluster `ml` with NVIDIA Tesla V100 GPUs. -They are available by default when switching to `modenv/ml`. +different versions of Fortran)) are only available on the cluster `power9` with NVIDIA Tesla V100 GPUs. * The `-qsmp -qoffload` combination of flags enables OpenMP target offloading support * Optimizations specific to the V100 GPUs can be enabled by using the @@ -149,7 +139,6 @@ provided as well. The [toolkit documentation page](https://docs.nvidia.com/cuda/ the [programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) and the [best practice guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html). Optimization guides for supported NVIDIA architectures are available, including for -[Kepler (K80)](https://docs.nvidia.com/cuda/kepler-tuning-guide/index.html), [Volta (V100)](https://docs.nvidia.com/cuda/volta-tuning-guide/index.html) and [Ampere (A100)](https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html). @@ -175,7 +164,6 @@ is used as the host compiler. The following flags may be useful: * `--generate-code` (`-gencode`): generate optimized code for a target GPU (caution: these binaries cannot be used with GPUs of other generations). - * For Kepler (K80): `--generate-code arch=compute_37,code=sm_37`, * For Volta (V100): `--generate-code arch=compute_70,code=sm_70`, * For Ampere (A100): `--generate-code arch=compute_80,code=sm_80` * `-Xcompiler`: pass flags to the host compiler. E.g., generate OpenMP-parallel host code: @@ -189,15 +177,9 @@ and the [performance guidelines](https://docs.nvidia.com/cuda/cuda-c-programming for possible steps to take for the performance analysis and optimization. Multiple tools can be used for the performance analysis. 
-For the analysis of applications on the older K80 GPUs, we recommend two -[profiler tools](https://docs.nvidia.com/cuda/profiler-users-guide/index.html): -the NVIDIA [nvprof](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview) -command line profiler and the -[NVIDIA Visual Profiler](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual) -as the accompanying graphical profiler. These tools will be deprecated in future CUDA releases but -are still available in CUDA <= 11. On the newer GPUs (V100 and A100), we recommend the use of of the -newer NVIDIA Nsight tools, [Nsight Systems](https://developer.nvidia.com/nsight-systems) for a -system wide sampling and tracing and [Nsight Compute](https://developer.nvidia.com/nsight-compute) +For the analysis of applications on the newer GPUs (V100 and A100), +we recommend the use of the newer NVIDIA Nsight tools, [Nsight Systems](https://developer.nvidia.com/nsight-systems) +for a system-wide sampling and tracing and [Nsight Compute](https://developer.nvidia.com/nsight-compute) for a detailed analysis of individual kernels. ### NVIDIA nvprof & Visual Profiler diff --git a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md index 5b79aaa47c0861871034fe6fc16e9305c3bc0e83..4288eb08a8dab9eecc79bb0f3436226a3a656ff1 100644 --- a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md +++ b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md @@ -202,7 +202,7 @@ There are the following script preparation steps for OmniOpt: [workspace](../data_lifecycle/workspaces.md). ```console - marie@login$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0 + marie@login$ module load release/23.04 GCC/11.3.0 OpenMPI/4.1.4 PyTorch/1.12.1 marie@login$ mkdir </path/to/workspace/python-environments> #create folder marie@login$ virtualenv --system-site-packages </path/to/workspace/python-environments/torchvision_env> marie@login$ source </path/to/workspace/python-environments/torchvision_env>/bin/activate #activate virtual environment @@ -212,13 +212,10 @@ There are the following script preparation steps for OmniOpt: ```console # Job submission on alpha nodes with 1 GPU on 1 node with 800 MB per CPU marie@login$ srun --gres=gpu:1 -n 1 -c 7 --pty --mem-per-cpu=800 bash - marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.9.0 + marie@alpha$ module load release/23.04 GCC/11.3.0 OpenMPI/4.1.4 PyTorch/1.12.1 # Activate virtual environment marie@alpha$ source </path/to/workspace/python-environments/torchvision_env>/bin/activate - The following have been reloaded with a version change: - 1) modenv/scs5 => modenv/hiera - Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, PyTorch/1.9.0 and 54 dependencies loaded. marie@alpha$ python </path/to/your/script/mnistFashion.py> --out-layer1=200 --batchsize=10 --epochs=3 [...] Epoch 3 @@ -252,7 +249,7 @@ environment. The recommended way is to wrap the necessary calls in a shell scrip # srun bash -l run.sh # Load modules your program needs, always specify versions! - module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.7.1 + module load release/23.04 GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.7.1 source </path/to/workspace/python-environments/torchvision_env>/bin/activate #activate virtual environment # Load your script. $@ is all the parameters that are given to this shell file. 
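A sketch of how the complete wrapper script referenced above could look (illustrative only, not repository content): the module versions and virtual environment path are taken from the hunk, while the shebang and the final call to `mnistFashion.py` follow the earlier example on that page and are assumptions.

```bash
#!/bin/bash -l
# Wrapper called by OmniOpt, e.g. via: srun bash -l run.sh <parameters>

# Load modules your program needs, always specify versions!
module load release/23.04 GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.7.1

# Activate the prepared virtual environment (path is a placeholder)
source /path/to/workspace/python-environments/torchvision_env/bin/activate

# Forward all parameters that are given to this shell file to the training script
python /path/to/your/script/mnistFashion.py "$@"
```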
diff --git a/doc.zih.tu-dresden.de/docs/software/machine_learning.md b/doc.zih.tu-dresden.de/docs/software/machine_learning.md index 4dea3aa3089ffdfb109d95c0c3959f2fdc947a94..37b326705aa59b7abfe2db5de65e15d03c3f9597 100644 --- a/doc.zih.tu-dresden.de/docs/software/machine_learning.md +++ b/doc.zih.tu-dresden.de/docs/software/machine_learning.md @@ -18,17 +18,7 @@ cluster `power` has 6x Tesla V-100 GPUs. You can find a detailed specification o !!! note The cluster `power` is based on the Power9 architecture, which means that the software built - for x86_64 will not work on this cluster. Also, users need to use the modules which are - specially build for this architecture (from `modenv/ml`). - -### Modules - -On the cluster `power` load the module environment: - -```console -marie@power$ module load modenv/ml -The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml -``` + for x86_64 will not work on this cluster. ### Power AI diff --git a/doc.zih.tu-dresden.de/docs/software/modules.md b/doc.zih.tu-dresden.de/docs/software/modules.md index aa3c21c08d3a528a84b012b496f9affff61779a4..1f26e889631030fa0bfae9b2a22f9ad421379e0c 100644 --- a/doc.zih.tu-dresden.de/docs/software/modules.md +++ b/doc.zih.tu-dresden.de/docs/software/modules.md @@ -114,7 +114,7 @@ certain module, you can use `module avail softwarename` and it will display the Die folgenden Module wurden nicht entladen: (Benutzen Sie "module --force purge" um alle Module zu entladen): - 1) modenv/scs5 + 1) release/23.04 Module Python/3.8.6-GCCcore-10.2.0 and 11 dependencies unloaded. ``` @@ -168,7 +168,7 @@ There is a front end for the module command, which helps you to type less. It is marie@compute$ ml Derzeit geladene Module: - 1) modenv/scs5 (S) 5) bzip2/1.0.8-GCCcore-10.2.0 9) SQLite/3.33.0-GCCcore-10.2.0 13) Python/3.8.6-GCCcore-10.2.0 + 1) release/23.04 (S) 5) bzip2/1.0.8-GCCcore-10.2.0 9) SQLite/3.33.0-GCCcore-10.2.0 13) Python/3.8.6-GCCcore-10.2.0 2) GCCcore/10.2.0 6) ncurses/6.2-GCCcore-10.2.0 10) XZ/5.2.5-GCCcore-10.2.0 3) zlib/1.2.11-GCCcore-10.2.0 7) libreadline/8.0-GCCcore-10.2.0 11) GMP/6.2.0-GCCcore-10.2.0 4) binutils/2.35-GCCcore-10.2.0 8) Tcl/8.6.10-GCCcore-10.2.0 12) libffi/3.3-GCCcore-10.2.0 @@ -184,41 +184,20 @@ There is a front end for the module command, which helps you to type less. It is ## Module Environments On ZIH systems, there exist different **module environments**, each containing a set of software -modules. They are activated via the meta module `modenv` which has different versions, one of which -is loaded by default. You can switch between them by simply loading the desired modenv-version, e.g. +modules. +They are activated via the meta module `release` which has different versions, +one of which is loaded by default. +You can switch between them by simply loading the desired version, e.g. ```console -marie@compute$ module load modenv/ml +marie@compute$ module load release/23.10 ``` -### modenv/scs5 (default) - -* SCS5 software -* usually optimized for Intel processors (Cluster `Barnard`, `Julia`) - -### modenv/ml - -* data analytics software (for use on the Cluster `ml`) -* necessary to run most software on the cluster `ml` -(The instruction set [Power ISA](https://en.wikipedia.org/wiki/Power_ISA#Power_ISA_v.3.0) -is different from the usual x86 instruction set. -Thus the 'machine code' of other modenvs breaks). 
- -### modenv/hiera - -* uses a hierarchical module load scheme -* optimized software for AMD processors (Cluster `Romeo` and `Alpha`) - -### modenv/classic - -* deprecated, old software. Is not being curated. -* may break due to library inconsistencies with the operating system. -* please don't use software from that modenv - ### Searching for Software -The command `module spider <modname>` allows searching for a specific software across all modenv -environments. It will also display information on how to load a particular module when giving a +The command `module spider <modname>` allows searching for a specific software across all module +environments. +It will also display information on how to load a particular module when giving a precise module (with version) as the parameter. ??? example "Spider command" @@ -270,9 +249,7 @@ In some cases a desired software is available as an extension of a module. -------------------------------------------------------------------------------------------------------------------------------- This extension is provided by the following modules. To access the extension you must load one of the following modules. Note that any module names in parentheses show the module location in the software hierarchy. - TensorFlow/2.4.1 (modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5) - TensorFlow/2.4.1-fosscuda-2019b-Python-3.7.4 (modenv/ml) - TensorFlow/2.4.1-foss-2020b (modenv/scs5) + TensorFlow/2.4.1 (release/23.04 GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5) Names marked by a trailing (E) are extensions provided by another module. ``` @@ -280,22 +257,24 @@ In some cases a desired software is available as an extension of a module. Finaly, you can load the dependencies and `tensorboard/2.4.1` and check the version. ```console - marie@login$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 + marie@login$ module load release/23.04 GCC/11.3.0 OpenMPI/4.1.4 + + Modules GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5 and 15 dependencies loaded. + marie@login$ module load TensorFlow/2.11.0-CUDA-11.7.0 - The following have been reloaded with a version change: - 1) modenv/scs5 => modenv/hiera + Aktiviere Module: + 1) CUDA/11.7.0 2) GDRCopy/2.3 + + Module TensorFlow/2.11.0-CUDA-11.7.0 and 39 dependencies loaded. - Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5 and 15 dependencies loaded. - marie@login$ module load TensorFlow/2.4.1 - Module TensorFlow/2.4.1 and 34 dependencies loaded. marie@login$ tensorboard --version - 2.4.1 + 2.11.1 ``` ## Toolchains A program or library may break in various ways (e.g. not starting, crashing or producing wrong -results) when it is used with a software of a different version than it expects. So each module +results) when it is used with a software of a different version than it expects.So each module specifies the exact other modules it depends on. They get loaded automatically when the dependent module is loaded. @@ -308,14 +287,9 @@ means they now have a wrong dependency (version) which can be a problem (see abo To avoid this there are (versioned) toolchains and for each toolchain there is (usually) at most one version of each software. A "toolchain" is a set of modules used to build the software for other modules. -The most common one is the `foss`-toolchain comprising of `GCC`, `OpenMPI`, `OpenBLAS` & `FFTW`. - -!!! info +The most common one is the `foss`-toolchain consisting of `GCC`, `OpenMPI`, `OpenBLAS` & `FFTW`. 
- Modules are named like `<Softwarename>/<Version>-<Toolchain>` so `Python/3.6.6-foss-2019a` - uses the `foss-2019a` toolchain. - -This toolchain can be broken down into a sub-toolchain called `gompi` comprising of only +This toolchain can be broken down into a sub-toolchain called `gompi` consisting of only `GCC` & `OpenMPI`, or further to `GCC` (the compiler and linker) and even further to `GCCcore` which is only the runtime libraries required to run programs built with the GCC standard library. @@ -338,7 +312,7 @@ Examples: | `iimpi` | `intel-compilers` `impi` | | `intel-compilers` | `GCCcore` `binutils` | -As you can see `GCC` and `intel-compilers` are on the same level, as are `gompi` and `iimpi` +As you can see `GCC` and `intel-compilers` are on the same level, as are `gompi` and `iimpi`, although they are one level higher than the former. You can load and use modules from a lower toolchain with modules from @@ -352,58 +326,24 @@ However `LLVM/7.0.1-GCCcore-8.2.0` can be used with either `QuantumESPRESSO/6.5-intel-2019a` or `Python/3.6.6-foss-2019a` because `GCCcore-8.2.0` is a sub-toolchain of `intel-2019a` and `foss-2019a`. -For [modenv/hiera](#modenvhiera) it is much easier to avoid loading incompatible -modules as modules from other toolchains cannot be directly loaded -and don't show up in `module av`. +With the hierarchical module scheme we use at ZIH modules from other toolchains cannot be directly +loaded and don't show up in `module av` which avoids loading incompatible modules. So the concept if this hierarchical toolchains is already built into this module environment. -In the other module environments it is up to you to make sure the modules you load are compatible. - -So watch the output when you load another module as a message will be shown when loading a module -causes other modules to be loaded in a different version: - -??? example "Module reload" - - ```console - marie@login$ ml OpenFOAM/8-foss-2020a - Module OpenFOAM/8-foss-2020a and 72 dependencies loaded. - - marie@login$ ml Biopython/1.78-foss-2020b - The following have been reloaded with a version change: - 1) FFTW/3.3.8-gompi-2020a => FFTW/3.3.8-gompi-2020b 15) binutils/2.34-GCCcore-9.3.0 => binutils/2.35-GCCcore-10.2.0 - 2) GCC/9.3.0 => GCC/10.2.0 16) bzip2/1.0.8-GCCcore-9.3.0 => bzip2/1.0.8-GCCcore-10.2.0 - 3) GCCcore/9.3.0 => GCCcore/10.2.0 17) foss/2020a => foss/2020b - [...] - ``` !!! info - The higher toolchains have a year and letter as their version corresponding to their release. + The toolchains usually have a year and letter as their version corresponding to their release. So `2019a` and `2020b` refer to the first half of 2019 and the 2nd half of 2020 respectively. ## Per-Architecture Builds -Since we have a heterogeneous cluster, we do individual builds of some of the software for each -architecture present. This ensures that, no matter what partition the software runs on, a build +Since we have a heterogeneous cluster, we do individual builds of the software for each +architecture present. +This ensures that, no matter what partition/cluster the software runs on, a build optimized for the host architecture is used automatically. -For that purpose we have created symbolic links on the compute nodes, -at the system path `/sw/installed`. - -However, not every module will be available for each node type or partition. Especially when -introducing new hardware to the cluster, we do not want to rebuild all of the older module versions -and in some cases cannot fall-back to a more generic build either. 
That's why we provide the script: -`ml_arch_avail` that displays the availability of modules for the different node architectures. - -### Example Invocation of ml_arch_avail - -```console -marie@compute$ ml_arch_avail TensorFlow/2.4.1 -TensorFlow/2.4.1: haswell, rome -TensorFlow/2.4.1: haswell, rome -``` -The command shows all modules that match on `TensorFlow/2.4.1`, and their respective availability. -Note that this will not work for meta-modules that do not have an installation directory -(like some tool chain modules). +However, not every module will be available on all clusters. +Use `ml av` or `ml spider` to search for modules available on the sub-cluster you are on. ## Advanced Usage @@ -414,8 +354,7 @@ For writing your own module files please have a look at the ### When I log in, the wrong modules are loaded by default -Reset your currently loaded modules with `module purge` -(or `module purge --force` if you also want to unload your basic `modenv` module). +Reset your currently loaded modules with `module purge`. Then run `module save` to overwrite the list of modules you load by default when logging in. @@ -446,11 +385,11 @@ before the TensorFlow module can be loaded. Sie müssen alle Module in einer der nachfolgenden Zeilen laden bevor Sie das Modul "TensorFlow/2.4.1" laden können. - modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 + release/23.04 GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 This extension is provided by the following modules. To access the extension you must load one of the following modules. Note that any module names in parentheses show the module location in the software hierarchy. - TensorFlow/2.4.1 (modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5) + TensorFlow/2.4.1 (release/23.04 GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5) This module provides the following extensions: @@ -484,18 +423,18 @@ before the TensorFlow module can be loaded. - marie@compute$ ml +modenv/hiera +GCC/10.2.0 +CUDA/11.1.1 +OpenMPI/4.0.5 +TensorFlow/2.4.1 + marie@compute$ ml +GCC/10.2.0 +CUDA/11.1.1 +OpenMPI/4.0.5 +TensorFlow/2.4.1 Die folgenden Module wurden in einer anderen Version erneut geladen: 1) GCC/7.3.0-2.30 => GCC/10.2.0 3) binutils/2.30-GCCcore-7.3.0 => binutils/2.35 - 2) GCCcore/7.3.0 => GCCcore/10.2.0 4) modenv/scs5 => modenv/hiera + 2) GCCcore/7.3.0 => GCCcore/10.2.0 Module GCCcore/7.3.0, binutils/2.30-GCCcore-7.3.0, GCC/7.3.0-2.30, GCC/7.3.0-2.30 and 3 dependencies unloaded. Module GCCcore/7.3.0, GCC/7.3.0-2.30, GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5, TensorFlow/2.4.1 and 50 dependencies loaded. marie@compute$ module list Derzeit geladene Module: - 1) modenv/hiera (S) 28) Tcl/8.6.10 + 1) release/23.04 (S) 28) Tcl/8.6.10 2) GCCcore/10.2.0 29) SQLite/3.33.0 3) zlib/1.2.11 30) GMP/6.2.0 4) binutils/2.35 31) libffi/3.3 diff --git a/doc.zih.tu-dresden.de/docs/software/nanoscale_simulations.md b/doc.zih.tu-dresden.de/docs/software/nanoscale_simulations.md index aac64bfc25aa6f8945f5a486fc4d1b9face26579..bf60df228179d0ffe6c08289af391ea79fbdf0b3 100644 --- a/doc.zih.tu-dresden.de/docs/software/nanoscale_simulations.md +++ b/doc.zih.tu-dresden.de/docs/software/nanoscale_simulations.md @@ -64,13 +64,7 @@ please look at the [GAMESS home page](https://www.msg.chem.iastate.edu/gamess/in GAMESS is available as [modules](modules.md) within the classic environment. Available packages can be listed and loaded with the following commands: -_The module environments /hiera, /scs5, /classic and /ml originated from the taurus system are -momentarily under construction. 
The script will be updated after completion of the redesign -accordingly_ - ```console -marie@login$ module load modenv/classic -[...] marie@login$:~> module avail gamess ----------------------- /sw/modules/taurus/applications ------------------------ gamess/2013 @@ -91,7 +85,6 @@ For runs with [Slurm](../jobs_and_resources/slurm.md), please use a script like ## you have to make sure that an even number of tasks runs on each node !! #SBATCH --mem-per-cpu=1900 -module load modenv/classic module load gamess rungms.slurm cTT_M_025.inp /data/horse/ws/marie-gamess # the third parameter is the location of your horse directory diff --git a/doc.zih.tu-dresden.de/docs/software/power_ai.md b/doc.zih.tu-dresden.de/docs/software/power_ai.md index 1488d85f9f6b4a749b5535dea59474a7c2cf36a4..a4fc430fff59b646530b974a98254513e02fe645 100644 --- a/doc.zih.tu-dresden.de/docs/software/power_ai.md +++ b/doc.zih.tu-dresden.de/docs/software/power_ai.md @@ -5,7 +5,7 @@ the PowerAI Framework for Machine Learning. In the following the links are valid for PowerAI version 1.5.4. !!! warning - The information provided here is available from IBM and can be used on partition `ml` only! + The information provided here is available from IBM and can be used on the `Power9` cluster only! ## General Overview @@ -47,7 +47,7 @@ are valid for PowerAI version 1.5.4. (Open Neural Network Exchange) provides support for moving models between those frameworks. - [Distributed Deep Learning](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.5.4/navigation/pai_getstarted_ddl.html?view=kc) - Distributed Deep Learning (DDL). Works on up to 4 nodes on partition `ml`. + Distributed Deep Learning (DDL). Works on up to 4 nodes on cluster `Power9`. ## PowerAI Container diff --git a/doc.zih.tu-dresden.de/docs/software/pytorch.md b/doc.zih.tu-dresden.de/docs/software/pytorch.md index 574df2caba8e5d66cf96c1470766dc7208752572..efbce5ac4265fbd5f85eb113ac22891662be8362 100644 --- a/doc.zih.tu-dresden.de/docs/software/pytorch.md +++ b/doc.zih.tu-dresden.de/docs/software/pytorch.md @@ -20,10 +20,6 @@ and the PyTorch library. You can find detailed hardware specification in our [hardware documentation](../jobs_and_resources/hardware_overview.md). -_The module environments /hiera, /scs5, /classic and /ml originated from the taurus system are -momentarily under construction. The script will be updated after completion of the redesign -accordingly_ - ## PyTorch Console On the cluster `alpha`, load the module environment: @@ -69,8 +65,8 @@ marie@login.power$ module spider pytorch we know that we can load PyTorch (including torchvision) with ```console -marie@power$ module load modenv/ml torchvision/0.7.0-fossCUDA-2019b-Python-3.7.4-PyTorch-1.6.0 -Module torchvision/0.7.0-fossCUDA-2019b-Python-3.7.4-PyTorch-1.6.0 and 55 dependencies loaded. +marie@power$ module load release/23.04 GCC/11.3.0 OpenMPI/4.1.4 torchvision/0.13.1 +Modules GCC/11.3.0, OpenMPI/4.1.4, torchvision/0.13.1 and 62 dependencies loaded. ``` Now, we check that we can access PyTorch: diff --git a/doc.zih.tu-dresden.de/docs/software/singularity_power9.md b/doc.zih.tu-dresden.de/docs/software/singularity_power9.md index 5abb5c5019a8b0505ad24b947fb72551e4d35aed..609e52801a68148589d832f5d53a1fff221afc9e 100644 --- a/doc.zih.tu-dresden.de/docs/software/singularity_power9.md +++ b/doc.zih.tu-dresden.de/docs/software/singularity_power9.md @@ -67,7 +67,7 @@ have reasonable defaults. The most important ones are: * Various Singularity options are passed through. E.g. `--notest, --force, --update`. 
See, e.g., `singularity --help` for details. -For **advanced users**, it is also possible to manually request a job with a VM (`srun -p ml +For **advanced users**, it is also possible to manually request a job with a VM (`srun -p power9 --cloud=kvm ...`) and then use this script to build a Singularity container from within the job. In this case, the `--arch` and other Slurm related parameters are not required. The advantage of using this script is that it automates the waiting for the VM and mounting of host directories into it diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow.md b/doc.zih.tu-dresden.de/docs/software/tensorflow.md index f95b936871c3b57b752fb55a3f2e18a8e0538269..eb644266350860ee34df6bc29ff0ccd7077a5cab 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow.md @@ -23,10 +23,6 @@ and the TensorFlow library. You can find detailed hardware specification in our ## TensorFlow Console -_The module environments /hiera, /scs5, /classic and /ml originated from the old taurus system are -momentarily under construction. The script will be updated after completion of the redesign -accordingly_ - On the cluster `alpha`, load the module environment: ```console @@ -53,13 +49,6 @@ Module TensorFlow/2.9.1 and 35 dependencies loaded. >Module: Recommended toolchain version, load to access other modules that depend on it ``` -On the cluster `power` load the module environment: - -```console -marie@power$ module load modenv/ml -The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml -``` - This example shows how to install and start working with TensorFlow using the modules system. ```console diff --git a/doc.zih.tu-dresden.de/docs/software/virtual_machines.md b/doc.zih.tu-dresden.de/docs/software/virtual_machines.md index 0738f4fb4da398091cd373901dc94e8fa3fe5d94..d8391dc2d0eed5e5ff71ac4f2d25f8e85c2cbd67 100644 --- a/doc.zih.tu-dresden.de/docs/software/virtual_machines.md +++ b/doc.zih.tu-dresden.de/docs/software/virtual_machines.md @@ -47,7 +47,7 @@ times till it succeeds. bash-4.2$ cat /tmp/marie_2759627/activate #!/bin/bash -if ! grep -q -- "Key for the VM on the partition ml" "/home/marie/.ssh/authorized_keys" > /dev/null; then +if ! grep -q -- "Key for the VM on the cluster power" "/home/marie/.ssh/authorized_keys" > /dev/null; then cat "/tmp/marie_2759627/kvm.pub" >> "/home/marie/.ssh/authorized_keys" else sed -i "s|.*Key for the VM on the cluster power.*|ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3siZfQ6vQ6PtXPG0RPZwtJXYYFY73TwGYgM6mhKoWHvg+ZzclbBWVU0OoU42B3Ddofld7TFE8sqkHM6M+9jh8u+pYH4rPZte0irw5/27yM73M93q1FyQLQ8Rbi2hurYl5gihCEqomda7NQVQUjdUNVc6fDAvF72giaoOxNYfvqAkw8lFyStpqTHSpcOIL7pm6f76Jx+DJg98sXAXkuf9QK8MurezYVj1qFMho570tY+83ukA04qQSMEY5QeZ+MJDhF0gh8NXjX/6+YQrdh8TklPgOCmcIOI8lwnPTUUieK109ndLsUFB5H0vKL27dA2LZ3ZK+XRCENdUbpdoG2Czz Key for the VM on the cluster power|" "/home/marie/.ssh/authorized_keys" diff --git a/doc.zih.tu-dresden.de/docs/software/visualization.md b/doc.zih.tu-dresden.de/docs/software/visualization.md index 00a49f26345837eb049dfb96ea7427740d0593d2..427bf746840383e69c9a4a85a84997d618cc9b15 100644 --- a/doc.zih.tu-dresden.de/docs/software/visualization.md +++ b/doc.zih.tu-dresden.de/docs/software/visualization.md @@ -11,10 +11,6 @@ batch and in-situ workflows. ParaView is available on ZIH systems from the [modules system](modules.md#module-environments). 
The following command lists the available versions -_The module environments /hiera, /scs5, /classic and /ml originated from the taurus system are -momentarily under construction. The script will be updated after completion of the redesign -accordingly_ - ```console marie@login$ module avail ParaView diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index 1df4ca71d8af8a00d660b11791f1e25b4459a0c8..fce102bcdcfabd7f24b240b96e578cc011345df9 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -128,7 +128,6 @@ nav: - Jobs without InfiniBand: archive/no_ib_jobs.md - Migration towards Phase 2: archive/phase2_migration.md - Platform LSF: archive/platform_lsf.md - - BeeGFS Filesystem on Demand: archive/beegfs_on_demand.md - Jupyter Installation: archive/install_jupyter.md - Profile Jobs with Slurm: archive/slurm_profiling.md - Switched-Off Systems: