Commit 4622418b authored by Guilherme Calandrini

fixes for using clusters instead of partitions

parent 04120b8c
2 merge requests: !990 Automated merge from preview to main, !980 fixes for using clusters instead of partitions
@@ -212,16 +212,16 @@ there is a list of conventions w.r.t. spelling and technical wording.
| ZIH system(s) | Taurus, HRSK II, our HPC systems, etc. |
| workspace | work space |
| | HPC-DA |
| cluster `romeo` | ROMEO cluster, romeo cluster, `romeo` cluster, "romeo" cluster, etc. |

### Code Blocks and Command Prompts
* Use ticks to mark code blocks and commands, not an italic font.
* Specify language for code blocks ([see below](#code-blocks-and-syntax-highlighting)).
* All code blocks and commands should be runnable from a login node or a node within a specific
  cluster (e.g., `alpha`).
* It should be clear from the [prompt](#list-of-prompts), where the command is run (e.g., local
  machine, login node, or specific cluster); see the sketch below.
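
A minimal sketch of these conventions (the `module list` command serves as a placeholder; the
canonical prompt markers are defined in the [list of prompts](#list-of-prompts)):

````markdown
```bash
marie@login$ module list
```
````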

#### Code Blocks and Syntax Highlighting
@@ -4,14 +4,12 @@
The full hardware specifications of the GPU-compute nodes may be found in the
[HPC Resources](../jobs_and_resources/hardware_overview.md#hpc-resources) page.
Each node uses a different [module environment](modules.md#module-environments):
* [NVIDIA A100 nodes](../jobs_and_resources/hardware_overview.md#amd-rome-cpus-nvidia-a100)
  (cluster `alpha`): use the `hiera` module environment (`module switch modenv/hiera`)
* [NVIDIA Tesla V100 nodes](../jobs_and_resources/hardware_overview.md#ibm-power9-nodes-for-machine-learning)
  (cluster `power9`): use `module spider <module name>` to search for available modules

## Using GPUs with Slurm
@@ -19,45 +17,7 @@ For general information on how to use Slurm, read the respective [page in this c
When allocating resources on a GPU-node, you must specify the number of requested GPUs by using the
`--gres=gpu:<N>` option, like this:
=== "partition `gpu2`" === "cluster `alpha`"
```bash
#!/bin/bash # Batch script starts with shebang line
@@ -67,20 +27,18 @@ When allocating resources on a GPU-node, you must specify the number of requeste
#SBATCH --job-name=fancyExp
#SBATCH --output=simulation-%j.out
#SBATCH --error=simulation-%j.err
#SBATCH --gres=gpu:1 # request GPU(s) from Slurm
module purge # Set up environment, e.g., clean modules environment
module load <modules> # and load necessary modules
srun ./application [options] # Execute parallel application with srun
```

Alternatively, you can work on the clusters interactively:
```bash
marie@login.<cluster_name>$ srun --nodes=1 --gres=gpu:<N> --time=00:30:00 --pty bash
marie@compute$ module purge; module switch modenv/<env>
```
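
To check that the requested GPUs are actually available inside the interactive session, you can
query them (a minimal sketch; it assumes `nvidia-smi` is in the default path on the GPU node):

```bash
marie@compute$ nvidia-smi --query-gpu=name,memory.total --format=csv
```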
@@ -149,7 +107,7 @@ the [official list](https://www.openmp.org/resources/openmp-compilers-tools/) fo
Furthermore, some compilers, such as GCC, have basic support for target offloading, but do not
enable these features by default and/or achieve poor performance.

On the ZIH system, compilers with OpenMP target offloading support are provided on the clusters
`power9` and `alpha`. Two compilers with good performance can be used: the NVIDIA HPC compiler and
the IBM XL compiler.
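
For illustration, a minimal compile line with the NVIDIA HPC compiler could look as follows (a
sketch; the source file name `saxpy.c` is a placeholder and `cc80` matches the A100 GPUs of the
cluster `alpha`):

```bash
# -mp=gpu enables OpenMP target offloading, -gpu=cc80 selects the GPU architecture
marie@compute$ nvc -mp=gpu -gpu=cc80 -o saxpy saxpy.c
```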
@@ -167,7 +125,7 @@ available for OpenMP, including the `-gpu=ccXY` flag as mentioned above.

#### Using OpenMP target offloading with the IBM XL compilers
The IBM XL compilers (`xlc` for C, `xlc++` for C++ and `xlf` for Fortran, with sub-versions for
different versions of Fortran) are only available on the cluster `power9` with NVIDIA Tesla V100
GPUs. They are available by default when switching to `modenv/ml`.
* The `-qsmp -qoffload` combination of flags enables OpenMP target offloading support, as sketched
  below
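
A corresponding compile line for the XL compilers could look like this (a sketch; the source file
name `app.c` is a placeholder):

```bash
# -qsmp enables OpenMP support, -qoffload enables offloading to the GPU
marie@compute$ xlc -qsmp -qoffload -o app app.c
```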
@@ -328,8 +286,7 @@ documentation for a

### NVIDIA Nsight Compute
Nsight Compute is used for the analysis of individual GPU-kernels. It supports GPUs from the Volta
architecture onward (on the ZIH system: V100 and A100). If you are familiar with nvprof, you may
want to consult the
[Nvprof Transition Guide](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-guide),
as Nsight Compute uses a new scheme for metrics.
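
For example, a kernel profile can be collected with the Nsight Compute command-line interface (a
minimal sketch; `./application` is a placeholder, and the `ncu` binary with `-o` report export
assumes a recent Nsight Compute installation):

```bash
# profile only the first kernel launch and write a report file for the Nsight Compute GUI
marie@compute$ srun ncu --launch-count 1 -o profile ./application
```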

We recommend selecting those kernels as optimization targets that require a large portion of your
runtime,
@@ -283,7 +283,7 @@ Using OmniOpt for a first trial example, it is often sufficient to concentrate o
configuration parameters:
1. **Optimization run name:** A name for an OmniOpt run and its associated configuration.
1. **Partition:** Choose the cluster on the ZIH system that fits the program's needs.
1. **Enable GPU:** Decide whether a program could benefit from GPU usage or not.
1. **Workdir:** The directory where OmniOpt saves its necessary files and all results. Each
   configuration creates a single directory whose name is derived from the optimization run name.
@@ -151,11 +151,10 @@ It is also possible to run this command using a job file to retrieve the topolog
```bash
#!/bin/bash
#SBATCH --job-name=topo_node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=300m
#SBATCH --time=00:05:00
#SBATCH --output=get_topo.out
#SBATCH --error=get_topo.err
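# This job file can be submitted from a login node, e.g. with
# `sbatch get_topo.sbatch` (the file name is an assumption).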
@@ -110,9 +110,8 @@ cards (GPUs) specified by the device index. For that, make sure to use the modul
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
# Make sure to only use ParaView