diff --git a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
index 070176efcb2ab0f463da30675841ade0e0a585a3..3911c94f8f8b65d6ef9ce6090867da132d22414d 100644
--- a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
+++ b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
@@ -1,5 +1,89 @@
 # GPU Programming
 
+## Available GPUs
+
+The full hardware specifications of the GPU-compute nodes may be found in the
+[HPC Resources](../jobs_and_resources/hardware_overview.md#hpc-resources) page.
+Each node type uses a different [module environment](modules.md#module-environments):
+
+* [NVIDIA Tesla K80 GPU nodes](../jobs_and_resources/hardware_overview.md#island-2-phase-2-intel-haswell-cpus-nvidia-k80-gpus)
+(partition `gpu2`): use the default `scs5` module environment (`module switch modenv/scs5`).
+* [NVIDIA Tesla V100 nodes](../jobs_and_resources/hardware_overview.md#ibm-power9-nodes-for-machine-learning)
+(partition `ml`): use the `ml` module environment (`module switch modenv/ml`).
+* [NVIDIA A100 nodes](../jobs_and_resources/hardware_overview.md#amd-rome-cpus-nvidia-a100)
+(partition `alpha`): use the `hiera` module environment (`module switch modenv/hiera`).
+
+## Using GPUs with Slurm
+
+For general information on how to use Slurm, read the respective [page in this compendium](../jobs_and_resources/slurm.md).
+When allocating resources on a GPU node, you must specify the number of requested GPUs by using the
+`--gres=gpu:<N>` option, like this:
+
+=== "partition `gpu2`"
+    ```bash
+    #!/bin/bash                        # Batch script starts with shebang line
+
+    #SBATCH --ntasks=1                 # All #SBATCH lines have to follow uninterrupted
+    #SBATCH --time=01:00:00            # after the shebang line
+    #SBATCH --account=<KTR>            # Comments start with # and do not count as interruptions
+    #SBATCH --job-name=fancyExp
+    #SBATCH --output=simulation-%j.out
+    #SBATCH --error=simulation-%j.err
+    #SBATCH --partition=gpu2
+    #SBATCH --gres=gpu:1               # request GPU(s) from Slurm
+
+    module purge                       # Set up environment, e.g., clean modules environment
+    module switch modenv/scs5          # switch module environment
+    module load <modules>              # and load necessary modules
+
+    srun ./application [options]       # Execute parallel application with srun
+    ```
+=== "partition `ml`"
+    ```bash
+    #!/bin/bash                        # Batch script starts with shebang line
+
+    #SBATCH --ntasks=1                 # All #SBATCH lines have to follow uninterrupted
+    #SBATCH --time=01:00:00            # after the shebang line
+    #SBATCH --account=<KTR>            # Comments start with # and do not count as interruptions
+    #SBATCH --job-name=fancyExp
+    #SBATCH --output=simulation-%j.out
+    #SBATCH --error=simulation-%j.err
+    #SBATCH --partition=ml
+    #SBATCH --gres=gpu:1               # request GPU(s) from Slurm
+
+    module purge                       # Set up environment, e.g., clean modules environment
+    module switch modenv/ml            # switch module environment
+    module load <modules>              # and load necessary modules
+
+    srun ./application [options]       # Execute parallel application with srun
+    ```
=== "partition `alpha`"
+    ```bash
+    #!/bin/bash                        # Batch script starts with shebang line
+
+    #SBATCH --ntasks=1                 # All #SBATCH lines have to follow uninterrupted
+    #SBATCH --time=01:00:00            # after the shebang line
+    #SBATCH --account=<KTR>            # Comments start with # and do not count as interruptions
+    #SBATCH --job-name=fancyExp
+    #SBATCH --output=simulation-%j.out
+    #SBATCH --error=simulation-%j.err
+    #SBATCH --partition=alpha
+    #SBATCH --gres=gpu:1               # request GPU(s) from Slurm
+
+    module purge                       # Set up environment, e.g., clean modules environment
+    module switch modenv/hiera         # switch module environment
+    module load <modules>              # and load necessary modules
+
+    srun ./application [options]       # Execute parallel application with srun
+    ```
+
+Alternatively, you can work on the partitions interactively:
+
+```bash
+marie@login$ srun --partition=<partition>-interactive --gres=gpu:<N> --pty bash
+marie@compute$ module purge; module switch modenv/<env>
+```
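+
+To quickly check that the allocation works as intended, you can, for example, run `nvidia-smi`
+inside the job; it lists the GPUs that are visible to your processes. This is only a minimal
+sanity check and not required for production jobs:
+
+```bash
+marie@compute$ nvidia-smi
+```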
+
 ## Directive Based GPU Programming
 
 Directives are special compiler commands in your C/C++ or Fortran source code. They tell the
@@ -8,36 +92,281 @@ technique.
 ### OpenACC
 
-[OpenACC](http://www.openacc-standard.org) is a directive based GPU programming model. It currently
+[OpenACC](https://www.openacc.org) is a directive based GPU programming model. It currently
 only supports NVIDIA GPUs as a target. Please use the following information as a start on OpenACC:
 
-Introduction
+#### Introduction
 
-OpenACC can be used with the PGI and CAPS compilers. For PGI please be sure to load version 13.4 or
-newer for full support for the NVIDIA Tesla K20x GPUs at ZIH.
+OpenACC can be used with the PGI and NVIDIA HPC compilers. The NVIDIA HPC compiler, as part of the
+[NVIDIA HPC SDK](https://docs.nvidia.com/hpc-sdk/index.html), supersedes the PGI compiler.
+
+Various versions of the PGI compiler are available on the
+[NVIDIA Tesla K80 GPU nodes](../jobs_and_resources/hardware_overview.md#island-2-phase-2-intel-haswell-cpus-nvidia-k80-gpus)
+(partition `gpu2`).
+
+The `nvc` compiler (NOT the `nvcc` compiler, which is used for CUDA) is available for the NVIDIA
+Tesla V100 and NVIDIA A100 nodes.
 
 #### Using OpenACC with PGI compilers
 
+* Load the latest version via `module load PGI` or search for available versions with
+`module search PGI`
 * For compilation, please add the compiler flag `-acc` to enable OpenACC interpreting by the
-  compiler;
-* `-Minfo` tells you what the compiler is actually doing to your code;
-* If you only want to use the created binary at ZIH resources, please also add `-ta=nvidia:keple`;
-* OpenACC Tutorial: intro1.pdf, intro2.pdf.
+  compiler
+* `-Minfo` tells you what the compiler is actually doing to your code
+* Add `-ta=nvidia:kepler` to enable optimizations for the K80 GPUs
+* You may find further information on the PGI compiler in the
+[user guide](https://docs.nvidia.com/hpc-sdk/pgi-compilers/20.4/x86/pgi-user-guide/index.htm)
+and in the [reference guide](https://docs.nvidia.com/hpc-sdk/pgi-compilers/20.4/x86/pgi-ref-guide/index.htm),
+which includes descriptions of available
+[command line options](https://docs.nvidia.com/hpc-sdk/pgi-compilers/20.4/x86/pgi-ref-guide/index.htm#cmdln-options-ref)
+
+#### Using OpenACC with NVIDIA HPC compilers
+
+* Switch into the correct module environment for your selected compute nodes
+(see [list of available GPUs](#available-gpus))
+* Load the `NVHPC` module for the correct module environment.
+Either load the default (`module load NVHPC`) or search for a specific version.
+* Use the correct compiler for your code: `nvc` for C, `nvc++` for C++ and `nvfortran` for Fortran
+* Use the `-acc` and `-Minfo` flags as with the PGI compiler
+* To create optimized code for either the V100 or A100, use `-gpu=cc70` or `-gpu=cc80`,
+respectively (see the example below)
+* Further information on this compiler is provided in the
+[user guide](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html) and the
+[reference guide](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-ref-guide/index.html),
+which includes descriptions of available
+[command line options](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-ref-guide/index.html#cmdln-options-ref)
+* Information specific to the use of OpenACC with the NVIDIA HPC compiler is compiled in a
+[guide](https://docs.nvidia.com/hpc-sdk/compilers/openacc-gs/index.html)
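+
+As an illustration, a compile line for an OpenACC code on the partition `alpha` might look like
+this (the source and binary names are placeholders; use `-gpu=cc70` instead on the V100 nodes):
+
+```bash
+marie@compute$ nvc -acc -Minfo -gpu=cc80 my_openacc_code.c -o my_openacc_app
+```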
+
+### OpenMP target offloading
+
+[OpenMP](https://www.openmp.org/) supports target offloading as of version 4.0. A dedicated set of
+compiler directives can be used to annotate code sections that are intended for execution on the
+GPU (i.e., target offloading). Not all compilers that support OpenMP also support target
+offloading; refer to the [official list](https://www.openmp.org/resources/openmp-compilers-tools/)
+for details. Furthermore, some compilers, such as GCC, have basic support for target offloading,
+but do not enable these features by default or only achieve poor performance.
+
+On the ZIH system, compilers with OpenMP target offloading support are provided on the partitions
+`ml` and `alpha`. Two compilers with good performance can be used: the NVIDIA HPC compiler and the
+IBM XL compiler.
+
+#### Using OpenMP target offloading with NVIDIA HPC compilers
+
+* Load the module environments and the NVIDIA HPC SDK as described in the
+[OpenACC](#using-openacc-with-nvidia-hpc-compilers) section
+* Use the `-mp=gpu` flag to enable OpenMP with offloading
+* `-Minfo` tells you what the compiler is actually doing to your code
+* The same compiler options as mentioned [above](#using-openacc-with-nvidia-hpc-compilers) are
+available for OpenMP, including the `-gpu=ccXY` flag
+* OpenMP-specific advice may be found in the
+[respective section in the user guide](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/#openmp-use)
 
-### HMPP
+#### Using OpenMP target offloading with the IBM XL compilers
 
-HMPP is available from the CAPS compilers.
+The IBM XL compilers (`xlc` for C, `xlc++` for C++ and `xlf` for Fortran (with sub-versions for
+different versions of Fortran)) are only available on the partition `ml` with NVIDIA Tesla V100 GPUs.
+They are available by default when switching to `modenv/ml`.
+
+* The `-qsmp -qoffload` combination of flags enables OpenMP target offloading support
+(see the example below)
+* Optimizations specific to the V100 GPUs can be enabled by using the
+[`-qtgtarch=sm_70`](https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.1?topic=descriptions-qtgtarch)
+flag.
+* IBM provides [XL compiler documentation](https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.1)
+with a
+[list of supported OpenMP directives](https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.1?topic=reference-pragma-directives-openmp-parallelization)
+and information on
+[target-offloading specifics](https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.1?topic=gpus-programming-openmp-device-constructs)
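+
+As a sketch, compiling a C code with OpenMP target offloading for the V100 GPUs on the partition
+`ml` could look like this (the source and binary names are placeholders):
+
+```bash
+marie@compute$ xlc -qsmp -qoffload -qtgtarch=sm_70 my_offload_code.c -o my_offload_app
+```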
 
 ## Native GPU Programming
 
 ### CUDA
 
-Native [CUDA](http://www.nvidia.com/cuda) programs can sometimes offer a better performance. Please
-use the following slides as an introduction:
+Native [CUDA](http://www.nvidia.com/cuda) programs can sometimes offer better performance.
+NVIDIA provides some [introductory material and links](https://developer.nvidia.com/how-to-cuda-c-cpp).
+An [introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/) is
+provided as well. The [toolkit documentation page](https://docs.nvidia.com/cuda/index.html) links to
+the [programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) and the
+[best practice guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html).
+Optimization guides for supported NVIDIA architectures are available, including for
+[Kepler (K80)](https://docs.nvidia.com/cuda/kepler-tuning-guide/index.html),
+[Volta (V100)](https://docs.nvidia.com/cuda/volta-tuning-guide/index.html) and
+[Ampere (A100)](https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html).
+
+In order to compile an application with CUDA, use the `nvcc` compiler command, which is described in
+detail in the [nvcc documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html).
+This compiler is available via several `CUDA` packages; a default version can be loaded via
+`module load CUDA`. Additionally, the `NVHPC` modules provide CUDA tools as well.
+
+#### Usage of the CUDA compiler
+
+The simple invocation `nvcc <code.cu>` will compile a valid CUDA program. `nvcc` differentiates
+between the device and the host code, which will be compiled in separate phases. Therefore, compiler
+options can be defined specifically for the device as well as for the host code. By default, GCC
+is used as the host compiler. The following flags may be useful:
+
+* `--generate-code` (`-gencode`): generate optimized code for a target GPU (caution: these binaries
+cannot be used with GPUs of other generations).
+    * For Kepler (K80): `--generate-code arch=compute_37,code=sm_37`,
+    * For Volta (V100): `--generate-code arch=compute_70,code=sm_70`,
+    * For Ampere (A100): `--generate-code arch=compute_80,code=sm_80`
+* `-Xcompiler`: pass flags to the host compiler. E.g., generate OpenMP-parallel host code:
+`-Xcompiler -fopenmp`.
+The `-Xcompiler` flag has to be invoked for each host flag (see the example below).
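+
+Combining these flags, a possible compile line for the V100 nodes could look like this (the source
+and binary names are placeholders):
+
+```bash
+marie@compute$ nvcc --generate-code arch=compute_70,code=sm_70 -Xcompiler -fopenmp my_code.cu -o my_app
+```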
+
+## Performance Analysis
+
+Consult NVIDIA's [Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html)
+and the [performance guidelines](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#performance-guidelines)
+for possible steps to take for performance analysis and optimization.
+
+Multiple tools can be used for performance analysis.
+For the analysis of applications on the older K80 GPUs, we recommend two
+[profiler tools](https://docs.nvidia.com/cuda/profiler-users-guide/index.html):
+the NVIDIA [nvprof](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview)
+command line profiler and the
+[NVIDIA Visual Profiler](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual)
+as the accompanying graphical profiler. These tools will be deprecated in future CUDA releases but
+are still available in CUDA <= 11. On the newer GPUs (V100 and A100), we recommend the use of the
+newer NVIDIA Nsight tools, [Nsight Systems](https://developer.nvidia.com/nsight-systems) for
+system-wide sampling and tracing and [Nsight Compute](https://developer.nvidia.com/nsight-compute)
+for a detailed analysis of individual kernels.
+
+### NVIDIA nvprof & Visual Profiler
+
+The `nvprof` command line profiler and the Visual Profiler are available once a CUDA module has
+been loaded. For a simple analysis, you can call `nvprof` without any options, like this:
+
+```bash
+marie@compute$ nvprof ./application [options]
+```
+
+For a more in-depth analysis, we recommend you use the command line tool first to generate a report
+file, which you can later analyze in the Visual Profiler. In order to collect a set of general
+metrics for the analysis in the Visual Profiler, use the `--analysis-metrics` flag to collect
+metrics and `--export-profile` to generate a report file, like this:
+
+```bash
+marie@compute$ nvprof --analysis-metrics --export-profile <output>.nvvp ./application [options]
+```
+
+[Transfer the report file to your local system](../data_transfer/export_nodes.md) and analyze it in
+the Visual Profiler (`nvvp`) locally. This will give the smoothest user experience. Alternatively,
+you can use [X11-forwarding](../access/ssh_login.md). Refer to the documentation for details about
+the individual
+[features and views of the Visual Profiler](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual-views).
+
+Besides these generic analysis methods, you can profile specific aspects of your GPU kernels.
+`nvprof` can profile specific events. For this, use
+
+```bash
+marie@compute$ nvprof --query-events
+```
+
+to get a list of available events.
+Analyze one or more events by specifying them, separated by commas:
+
+```bash
+marie@compute$ nvprof --events <event_1>[,<event_2>[,...]] ./application [options]
+```
+
+Additionally, you can analyze specific metrics.
+Similar to the profiling of events, you can get a list of available metrics:
+
+```bash
+marie@compute$ nvprof --query-metrics
+```
+
+One or more metrics can be profiled at the same time:
+
+```bash
+marie@compute$ nvprof --metrics <metric_1>[,<metric_2>[,...]] ./application [options]
+```
+
+If you want to limit the profiler's scope to one or more kernels, you can use the
+`--kernels <kernel_1>[,<kernel_2>]` flag. For further command line options, refer to the
+[documentation on command line options](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-command-line-options).
+
+### NVIDIA Nsight Systems
+
+Use [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) for system-wide sampling
+of your code. Refer to the
+[NVIDIA Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) for
+details. With this, you can identify parts of your code that take a long time to run and are
+suitable optimization candidates.
+
+Use the command-line version to sample your code and create a report file for later analysis:
+
+```bash
+marie@compute$ nsys profile [--stats=true] ./application [options]
+```
+
+The `--stats=true` flag is optional and will create a summary on the command line. Depending on your
+needs, this analysis may be sufficient to identify optimization targets.
+
+The graphical user interface version can be used for a thorough analysis of your previously
+generated report file. For an optimal user experience, we recommend a local installation of NVIDIA
+Nsight Systems. In this case, you can
+[transfer the report file to your local system](../data_transfer/export_nodes.md).
+Alternatively, you can use [X11-forwarding](../access/ssh_login.md). The graphical user interface is
+usually available as `nsys-ui`.
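+
+For example, you can give the report file a name of your choice and open it in the graphical user
+interface later (`myreport` is a placeholder; depending on the installed version, the report file
+ends in `.nsys-rep` or `.qdrep`):
+
+```bash
+marie@compute$ nsys profile --output=myreport ./application [options]
+```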
+
+You can also use the command line interface for further analyses. Refer to the
+documentation for a
+[list of available command line options](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-options).
+
+### NVIDIA Nsight Compute
+
+Nsight Compute is used for the analysis of individual GPU kernels. It supports GPUs from the Volta
+architecture onward (on the ZIH system: V100 and A100). Therefore, you cannot use Nsight Compute on
+the partition `gpu2`. If you are familiar with nvprof, you may want to consult the
+[Nvprof Transition Guide](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-guide),
+as Nsight Compute uses a new scheme for metrics.
+As optimization targets, we recommend those kernels that, according to Nsight Systems, account for
+a large portion of your run time. Nsight Compute is particularly useful for CUDA code, as you have
+much greater control over your code compared to the directive based approaches.
+
+Nsight Compute comes in a
+[command line](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)
+and a [graphical version](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html).
+Refer to the
+[Kernel Profiling Guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html)
+to get an overview of the functionality of these tools.
+
+You can call the command line version (`ncu`) without further options to get a broad overview of
+your kernel's performance:
+
+```bash
+marie@compute$ ncu ./application [options]
+```
+
+As with the other profiling tools, the Nsight Compute profiler can generate report files like this:
+
+```bash
+marie@compute$ ncu --export <report> ./application [options]
+```
+
+The report file will automatically get the file ending `.ncu-rep`; you do not need to specify this
+manually.
+
+This report file can be analyzed in the graphical user interface profiler. Again, we recommend you
+generate a report file on a compute node and
+[transfer the report file to your local system](../data_transfer/export_nodes.md).
+Alternatively, you can use [X11-forwarding](../access/ssh_login.md). The graphical user interface is
+usually available as `ncu-ui` or `nv-nsight-cu`.
+
+Similar to the `nvprof` profiler, you can analyze specific metrics. NVIDIA provides a
+[Metrics Guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-guide). Use
+`--query-metrics` to get a list of available metrics, listing them by base name. Individual metrics
+can be collected by using
+
+```bash
+marie@compute$ ncu --metrics <metric_1>[,<metric_2>,...] ./application [options]
+```
 
-* Introduction to CUDA;
-* Advanced Tuning for NVIDIA Kepler GPUs.
+Collection of events is no longer possible with Nsight Compute. Instead, many nvprof events can be
+[measured with metrics](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-event-comparison).
 
-In order to compile an application with CUDA use the `nvcc` compiler command.
+You can collect metrics for individual kernels by specifying the `--kernel-name` flag.
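+
+For example, a possible invocation that collects selected metrics only for a specific kernel and
+writes them to a report file looks like this (all names in angle brackets are placeholders):
+
+```bash
+marie@compute$ ncu --kernel-name <kernel> --metrics <metric_1>[,<metric_2>,...] --export <report> ./application [options]
+```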
diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell
index c82255fabf3d4e27a07d8aef285ee79c0fb6cc01..5e05a10aa5d8528189070fcd6b3cf99721ace069 100644
--- a/doc.zih.tu-dresden.de/wordlist.aspell
+++ b/doc.zih.tu-dresden.de/wordlist.aspell
@@ -1,4 +1,4 @@
-personal_ws-1.1 en 203
+personal_ws-1.1 en 406
 Abaqus
 Addon
 Addons
@@ -67,6 +67,7 @@ Dockerfile
 Dockerfiles
 DockerHub
 dockerized
+DOI
 dotfile
 dotfiles
 downtime
@@ -224,18 +225,24 @@ NGC
 nodelist
 NODELIST
 NRINGS
+Nsight
 ntasks
 NUM
 NUMA
 NUMAlink
 NumPy
 Nutzungsbedingungen
+nvcc
 Nvidia
+NVIDIA
 NVLINK
 NVMe
+nvprof
+Nvprof
 NWChem
 OME
 OmniOpt
+OpARA
 OPARI
 OpenACC
 OpenBLAS
@@ -246,6 +253,7 @@ openmpi
 OpenMPI
 OpenSSH
 Opteron
+OST
 OTF
 overfitting
 pandarallel
@@ -276,6 +284,9 @@ preloaded
 preloading
 prepend
 preprocessing
+profiler
+Profiler
+profiler's
 PSOCK
 Pthread
 Pthreads
@@ -323,6 +334,7 @@ SciPy
 scontrol
 scp
 scs
+SDK
 SFTP
 SGEMM
 SGI
@@ -339,6 +351,8 @@ spython
 squeue
 srun
 ssd
+ssh
+SSH
 SSHFS
 STAR
 stderr