diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md
index 57e6d4b82cb9e1a7344b94c3bc583b5ff36e3fc9..f76dddd4ff8cb73162890b39f9f43e4301208bcb 100644
--- a/doc.zih.tu-dresden.de/docs/index.md
+++ b/doc.zih.tu-dresden.de/docs/index.md
@@ -31,7 +31,7 @@ Please also find out the other ways you could contribute in our

 ## News

-* **2023-10-16** [OpenMPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#openmpi-v41x-performance-loss-with-mpi-io-module-ompio)
+* **2023-10-16** [Open MPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#performance-loss-with-mpi-io-module-ompio)
 * **2023-10-04** [User tests on Barnard](jobs_and_resources/barnard_test.md)
 * **2023-06-01** [New hardware and complete re-design](jobs_and_resources/architecture_2023.md)
 * **2023-01-04** [New hardware: NVIDIA Arm HPC Developer Kit](jobs_and_resources/arm_hpc_devkit.md)
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
index 95f6eb58990233e85c5dfa535e0c1bde0c29ade6..006331a4438176bfc1b762ea803a3457a6efa271 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
@@ -2,15 +2,17 @@

 This page holds known issues observed with MPI and concrete MPI implementations.

-## OpenMPI v4.1.x - Performance Loss with MPI-IO-Module OMPIO
+## Open MPI

-OpenMPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
+### Performance Loss with MPI-IO-Module OMPIO
+
+Open MPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
 default module for MPI-IO on **all** filesystems incl. Lustre (cf.
-[NEWS file in OpenMPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
+[NEWS file in Open MPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
 Prior to this, `ROMIO` was the default MPI-IO module for Lustre.

 Colleagues of ZIH have found that some MPI-IO access patterns suffer a significant performance loss
-using `OMPIO` as MPI-IO module with OpenMPI/4.1.x modules on ZIH systems. At the moment, the root
+using `OMPIO` as MPI-IO module with `OpenMPI/4.1.x` modules on ZIH systems. At the moment, the root
 cause is unclear and needs further investigation.

 **A workaround** for this performance loss is to use the "old" `ROMIO` MPI-IO module. This
@@ -18,17 +20,17 @@ is achieved by setting the environment variable `OMPI_MCA_io` before executing t
 follows

 ```console
-export OMPI_MCA_io=^ompio
-srun ...
+marie@login$ export OMPI_MCA_io=^ompio
+marie@login$ srun ...
 ```

 or by setting the option as an argument, in case you invoke `mpirun` directly

 ```console
-mpirun --mca io ^ompio ...
+marie@login$ mpirun --mca io ^ompio ...
 ```

-## Mpirun on partition `alpha` and `ml`
+### Mpirun on Partitions `alpha` and `ml`

 Using `mpirun` on partitions `alpha` and `ml` leads to a wrong resource distribution when more than
 one node is involved. This yields a strange distribution, e.g., `SLURM_NTASKS_PER_NODE=15,1`
@@ -39,23 +41,22 @@ Another issue arises when using the Intel toolchain: mpirun calls a different MP
 8-9x slowdown in the PALM app in comparison to using srun or the GCC-compiled version of the app
 (which uses the correct MPI).
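+As the paragraphs above suggest, launching via `srun` instead of `mpirun` avoids the wrong task
+distribution. A minimal sketch (the task counts and the binary name `./my_mpi_app` are
+placeholders, adapt them to your job):
+
+```console
+marie@login$ srun --ntasks=16 --ntasks-per-node=8 ./my_mpi_app
+```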
-## R Parallel Library on Multiple Nodes
+### R Parallel Library on Multiple Nodes

 Using the R parallel library on MPI clusters has shown problems when using more than a few compute
-nodes. The error messages indicate that there are buggy interactions of R/Rmpi/OpenMPI and UCX.
+nodes. The error messages indicate that there are buggy interactions of R/Rmpi/Open MPI and UCX.
 Disabling UCX has solved these problems in our experiments.

 We invoked the R script successfully with the following command:

 ```console
-mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript
---vanilla the-script.R
+marie@login$ mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript --vanilla the-script.R
 ```

 where the arguments `-mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx` disable usage of
 UCX.

-## MPI Function `MPI_Win_allocate`
+### MPI Function `MPI_Win_allocate`

 The function `MPI_Win_allocate` is a one-sided MPI call that allocates memory and returns a window
 object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_allocate.3.php)).
@@ -65,6 +66,6 @@ object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/ma
 It was observed at least for the `OpenMPI/4.0.5` module that using `MPI_Win_allocate` instead of
 `MPI_Alloc_mem` in conjunction with `MPI_Win_create` leads to segmentation faults in the calling
-application . To be precise, the segfaults occurred at partition `romeo` when about 200 GB per node
+application. To be precise, the segfaults occurred on partition `romeo` when about 200 GB per node
 were allocated. In contrast, the segmentation faults vanished when the implementation was
-refactored to call the `MPI_Alloc_mem + MPI_Win_create` functions.
+refactored to call the `MPI_Alloc_mem` + `MPI_Win_create` functions.
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
index 4347dd6b0e64005a67f4c60627a2002138a00631..f270f8f1da6100ab3989c0358a473c09a9cf3194 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
@@ -103,5 +103,5 @@ case on Rome. You might want to try `-mavx2 -fma` instead.
 ### Intel MPI

 We have seen only half the theoretical peak bandwidth via Infiniband between two nodes, whereas
-OpenMPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
+Open MPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
 `rome` if your application heavily relies on MPI communication until this issue is resolved.
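+Until this issue is resolved, a sketch of the suggested switch to an Open MPI module (the module
+name below is an example only; check the actual output of `module avail` on the system):
+
+```console
+marie@login$ module avail OpenMPI
+marie@login$ module load OpenMPI
+```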
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
index f4788c1cf8b4b0cda39f891eeac73f3c5c60cf5e..946cca8bc4b988cd311b635e7fe78d569b6f15d0 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
@@ -21,8 +21,8 @@ project's quota can be increased or dedicated volumes of up to the full capacity
 - Granularity should be a socket (28 cores)
 - Can be used for OpenMP applications with large memory demands
-- To use OpenMPI it is necessary to export the following environment
-  variables, so that OpenMPI uses shared-memory instead of Infiniband
+- To use Open MPI, it is necessary to export the following environment
+  variables, so that Open MPI uses shared memory instead of Infiniband
   for message transport:

 ```
diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
index c7334d918cdba1ff5e97c2f4cd0ea3b788b2c26d..41670d12bdb41df2e096b47c6c9db9c75217ecc7 100644
--- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
+++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
@@ -391,7 +391,7 @@ Another example:
     list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)

     # shut down the cluster
-    #snow::stopCluster(cl) # usually it hangs over here with OpenMPI > 2.0. In this case this command may be avoided, Slurm will clean up after the job finishes
+    #snow::stopCluster(cl) # it usually hangs here with Open MPI > 2.0; in this case, the command may be omitted and Slurm will clean up after the job finishes
 ```

 To use Rmpi and MPI please use one of these partitions: `haswell`, `broadwell` or `rome`.
diff --git a/doc.zih.tu-dresden.de/docs/software/distributed_training.md b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
index 4e8fc427e71bd28ad1a3b663aba82d11bad088e6..094b6f8dc4e79ee3732234f26ba950c2ef8188d9 100644
--- a/doc.zih.tu-dresden.de/docs/software/distributed_training.md
+++ b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
@@ -300,8 +300,8 @@ Available Tensor Operations:
     [ ] Gloo

-If you want to use OpenMPI then specify `HOROVOD_GPU_ALLREDUCE=MPI`.
-To have better performance it is recommended to use NCCL instead of OpenMPI.
+If you want to use Open MPI, specify `HOROVOD_GPU_ALLREDUCE=MPI`.
+For better performance, it is recommended to use NCCL instead of Open MPI.

 ##### Verify Horovod Works
diff --git a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
index 2e5b57422a0472650a6fc64c5c4bfeac433e5801..28d220fed90d75afcc2341a897df0abae2a160e4 100644
--- a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
+++ b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
@@ -200,12 +200,12 @@ detail in [nvcc documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver
 This compiler is available via several `CUDA` packages, a default version can be loaded via
 `module load CUDA`. Additionally, the `NVHPC` modules provide CUDA tools as well.

-For using CUDA with OpenMPI at multiple nodes, the OpenMPI module loaded shall have be compiled with
+To use CUDA with Open MPI on multiple nodes, the loaded `OpenMPI` module must have been compiled with
 CUDA support.
 If you are not sure whether the module you are using has CUDA support, you can check it as follows:

 ```console
-ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "OpenMPI supports CUDA:",$7}'
+ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "Open MPI supports CUDA:",$7}'
 ```

 #### Usage of the CUDA Compiler
diff --git a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
index 18743ba58e98e299227fc0273cf301b52330ed4c..a0a1d8670b942ffdba5dc7fd3a3a2f6c5e0048c5 100644
--- a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
+++ b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
@@ -172,7 +172,7 @@ preEnv_MPICH_GPU_EAGER_DEVICE_MEM=0
 %endif

 %ifdef %{ucx}
-# if using OpenMPI with UCX support, these settings are needed with use of CUDA Aware MPI
+# if using Open MPI with UCX support, these settings are needed when using CUDA-aware MPI
 # without these flags, LBM is known to hang when using OpenACC and OpenMP Target to GPUs
 preENV_UCX_MEMTYPE_CACHE=n
 preENV_UCX_TLS=self,shm,cuda_copy
diff --git a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
index 06b9e1b85549892df1880e9ae2c461276ac95a2d..6e6112b1a8f81e01836541fe8f2257c215eb2fa7 100644
--- a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
+++ b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
@@ -217,7 +217,7 @@ preEnv_MPICH_GPU_EAGER_DEVICE_MEM=0
 %endif

 %ifdef %{ucx}
-# if using OpenMPI with UCX support, these settings are needed with use of CUDA Aware MPI
+# if using Open MPI with UCX support, these settings are needed when using CUDA-aware MPI
 # without these flags, LBM is known to hang when using OpenACC and OpenMP Target to GPUs
 preENV_UCX_MEMTYPE_CACHE=n
 preENV_UCX_TLS=self,shm,cuda_copy
diff --git a/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md b/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
index 9fd398d76521ca608ae738f91182367fb59f4ac7..c1f570c9360a4521b0588e30224c6444f9f02c06 100644
--- a/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
+++ b/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
@@ -123,12 +123,12 @@ At the HPC system run as following:
 marie@login$ srun -n 4 --ntasks-per-node 2 --time=00:10:00 singularity exec ubuntu_mpich.sif /opt/mpitest
 ```

-### CUDA + CuDNN + OpenMPI
+### CUDA + CuDNN + Open MPI

 * Chosen CUDA version depends on installed driver of host
-* OpenMPI needs PMI for Slurm integration
-* OpenMPI needs CUDA for GPU copy-support
-* OpenMPI needs `ibverbs` library for Infiniband
+* Open MPI needs PMI for Slurm integration
+* Open MPI needs CUDA for GPU copy support
+* Open MPI needs the `ibverbs` library for Infiniband
 * `openmpi-mca-params.conf` required to avoid warnings on fork (OK on ZIH systems)
 * Environment variables `SLURM_VERSION` and `OPENMPI_VERSION` can be set to choose a different
   version when building the container
diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell
index e50cbf260de330ac63a967fa82511888f7cb3871..78bf541c6dfea01baedbb3735f94dfe05b7d4341 100644
--- a/doc.zih.tu-dresden.de/wordlist.aspell
+++ b/doc.zih.tu-dresden.de/wordlist.aspell
@@ -245,6 +245,7 @@ Mortem
 mountpoint
 mpi
 Mpi
+MPI
 mpicc
 mpiCC
 mpicxx
@@ -295,8 +296,6 @@ OpenBLAS
 OpenCL
 OpenGL
 OpenMP
-openmpi
-OpenMPI
 OpenSSH
 Opteron
 ORCA