From 03c9da5cbe6672261726f0976f4659b707bf0e4a Mon Sep 17 00:00:00 2001
From: Bert Wesarg <bert.wesarg@tu-dresden.de>
Date: Thu, 26 Oct 2023 09:20:09 +0200
Subject: [PATCH] It's "Open MPI"

---
 doc.zih.tu-dresden.de/docs/index.md               |  2 +-
 .../docs/jobs_and_resources/mpi_issues.md         | 31 ++++++++++---------
 .../docs/jobs_and_resources/rome_nodes.md         |  2 +-
 .../docs/jobs_and_resources/sd_flex.md            |  4 +--
 .../docs/software/data_analytics_with_r.md        |  2 +-
 .../docs/software/distributed_training.md         |  4 +--
 .../docs/software/gpu_programming.md              |  4 +--
 .../docs/software/misc/spec_nvhpc-alpha.cfg       |  2 +-
 .../docs/software/misc/spec_nvhpc-ppc.cfg         |  2 +-
 .../docs/software/singularity_recipe_hints.md     |  8 ++---
 doc.zih.tu-dresden.de/wordlist.aspell             |  3 +-
 11 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md
index 57e6d4b82..f76dddd4f 100644
--- a/doc.zih.tu-dresden.de/docs/index.md
+++ b/doc.zih.tu-dresden.de/docs/index.md
@@ -31,7 +31,7 @@ Please also find out the other ways you could contribute in our
 
 ## News
 
-* **2023-10-16** [OpenMPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#openmpi-v41x-performance-loss-with-mpi-io-module-ompio)
+* **2023-10-16** [Open MPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#performance-loss-with-mpi-io-module-ompio)
 * **2023-10-04** [User tests on Barnard](jobs_and_resources/barnard_test.md)
 * **2023-06-01** [New hardware and complete re-design](jobs_and_resources/architecture_2023.md)
 * **2023-01-04** [New hardware: NVIDIA Arm HPC Developer Kit](jobs_and_resources/arm_hpc_devkit.md)
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
index 95f6eb589..006331a44 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
@@ -2,15 +2,17 @@
 
 This page holds known issues observed with MPI and concrete MPI implementations.
 
-## OpenMPI v4.1.x - Performance Loss with MPI-IO-Module OMPIO
+## Open MPI
 
-OpenMPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
+### Performance Loss with MPI-IO-Module OMPIO
+
+Open MPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
 default module for MPI-IO on **all** filesystems incl. Lustre (cf.
-[NEWS file in OpenMPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
+[NEWS file in Open MPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
 Prior to this, `ROMIO` was the default MPI-IO module for Lustre.
 
 Colleagues of ZIH have found that some MPI-IO access patterns suffer a significant performance loss
-using `OMPIO` as MPI-IO module with OpenMPI/4.1.x modules on ZIH systems. At the moment, the root
+using `OMPIO` as MPI-IO module with `OpenMPI/4.1.x` modules on ZIH systems. At the moment, the root
 cause is unclear and needs further investigation.
 
 **A workaround** for this performance loss is to use the "old" MPI-IO module, i.e., `ROMIO`. This
@@ -18,17 +20,17 @@ is achieved by setting the environment variable `OMPI_MCA_io` before executing t
 follows
 
 ```console
-export OMPI_MCA_io=^ompio
-srun ...
+marie@login$ export OMPI_MCA_io=^ompio
+marie@login$ srun ...
 ```
 
 or setting the option as argument, in case you invoke `mpirun` directly
 
 ```console
-mpirun --mca io ^ompio ...
+marie@login$ mpirun --mca io ^ompio ...
 ```
 
-## Mpirun on partition `alpha` and `ml`
+### Mpirun on partition `alpha` and `ml`
 
 Using `mpirun` on partitions `alpha` and `ml` leads to wrong resource distribution when more than
 one node is involved. This yields a strange distribution, e.g., `SLURM_NTASKS_PER_NODE=15,1`
@@ -39,23 +41,22 @@ Another issue arises when using the Intel toolchain: mpirun calls a different MP
 8-9x slowdown in the PALM app in comparison to using srun or the GCC-compiled version of the app
 (which uses the correct MPI).
 
-## R Parallel Library on Multiple Nodes
+### R Parallel Library on Multiple Nodes
 
 Using the R parallel library on MPI clusters has shown problems when using more than a few compute
-nodes. The error messages indicate that there are buggy interactions of R/Rmpi/OpenMPI and UCX.
+nodes. The error messages indicate that there are buggy interactions of R/Rmpi/Open MPI and UCX.
 Disabling UCX has solved these problems in our experiments. We invoked the R script successfully
 with the following command:
 
 ```console
-mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript
---vanilla the-script.R
+marie@login$ mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript --vanilla the-script.R
 ```
 
 where the arguments `-mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx` disable usage of
 UCX.
 
-## MPI Function `MPI_Win_allocate`
+### MPI Function `MPI_Win_allocate`
 
 The function `MPI_Win_allocate` is a one-sided MPI call that allocates memory and returns a window
 object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_allocate.3.php)).
@@ -65,6 +66,6 @@ object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/ma
 
 It was observed at least for the `OpenMPI/4.0.5` module that using `MPI_Win_allocate` instead
 of `MPI_Alloc_mem` in conjunction with `MPI_Win_create` leads to segmentation faults in the calling
-application . To be precise, the segfaults occurred at partition `romeo` when about 200 GB per node
+application. To be precise, the segfaults occurred at partition `romeo` when about 200 GB per node
 were allocated. In contrast, the segmentation faults vanished when the implementation was
-refactored to call the `MPI_Alloc_mem + MPI_Win_create` functions.
+refactored to call the `MPI_Alloc_mem` + `MPI_Win_create` functions.
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
index 4347dd6b0..f270f8f1d 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
@@ -103,5 +103,5 @@ case on Rome. You might want to try `-mavx2 -fma` instead.
 ### Intel MPI
 
 We have seen only half the theoretical peak bandwidth via Infiniband between two nodes, whereas
-OpenMPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
+Open MPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
 `rome` if your application heavily relies on MPI communication until this issue is resolved.
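The refactoring described in the `MPI_Win_allocate` section of `mpi_issues.md` above is easiest to see in code. Below is a minimal sketch in C, illustrative only and not taken from the patched files: window size, displacement unit, and communicator are placeholders, and error handling is omitted.

```c
/* Sketch of the workaround: allocate memory with MPI_Alloc_mem and
 * expose it with MPI_Win_create instead of calling MPI_Win_allocate.
 * Placeholder size and communicator; no error handling. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Aint size = (MPI_Aint)1 << 30;  /* placeholder: 1 GiB per process */
    void    *base = NULL;
    MPI_Win  win;

    /* Variant reported to segfault for large windows:
     *   MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
     * Workaround: allocate and attach the memory in two steps. */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &base);
    MPI_Win_create(base, size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* ... MPI_Put/MPI_Get access epochs on win ... */

    MPI_Win_free(&win);
    MPI_Free_mem(base);
    MPI_Finalize();
    return 0;
}
```

Both variants provide the same one-sided semantics; only the allocation path differs, which is why the two-step version can serve as a drop-in replacement.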
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
index f4788c1cf..946cca8bc 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
@@ -21,8 +21,8 @@ project's quota can be increased or dedicated volumes of up to the full capacity
 
 - Granularity should be a socket (28 cores)
 - Can be used for OpenMP applications with large memory demands
-- To use OpenMPI it is necessary to export the following environment
-  variables, so that OpenMPI uses shared-memory instead of Infiniband
+- To use Open MPI it is necessary to export the following environment
+  variables, so that Open MPI uses shared-memory instead of Infiniband
   for message transport:
 
 ```
diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
index c7334d918..41670d12b 100644
--- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
+++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
@@ -391,7 +391,7 @@ Another example:
     list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)
 
     # shut down the cluster
-    #snow::stopCluster(cl) # usually it hangs over here with OpenMPI > 2.0. In this case this command may be avoided, Slurm will clean up after the job finishes
+    #snow::stopCluster(cl) # usually it hangs here with Open MPI > 2.0. In this case this command may be omitted; Slurm will clean up after the job finishes
 ```
 
 To use Rmpi and MPI please use one of these partitions: `haswell`, `broadwell` or `rome`.
diff --git a/doc.zih.tu-dresden.de/docs/software/distributed_training.md b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
index 4e8fc427e..094b6f8dc 100644
--- a/doc.zih.tu-dresden.de/docs/software/distributed_training.md
+++ b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
@@ -300,8 +300,8 @@ Available Tensor Operations:
     [ ] Gloo
 ```
 
-If you want to use OpenMPI then specify `HOROVOD_GPU_ALLREDUCE=MPI`.
-To have better performance it is recommended to use NCCL instead of OpenMPI.
+If you want to use Open MPI, specify `HOROVOD_GPU_ALLREDUCE=MPI`.
+For better performance, it is recommended to use NCCL instead of Open MPI.
 
 ##### Verify Horovod Works
diff --git a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
index 2e5b57422..28d220fed 100644
--- a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
+++ b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
@@ -200,12 +200,12 @@ detail in [nvcc documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver
 This compiler is available via several `CUDA` packages; a default version can be loaded via
 `module load CUDA`. Additionally, the `NVHPC` modules provide CUDA tools as well.
 
-For using CUDA with OpenMPI at multiple nodes, the OpenMPI module loaded shall have be compiled with
+For using CUDA with Open MPI at multiple nodes, the `OpenMPI` module loaded must have been compiled with
 CUDA support.
 If you aren't sure whether the module you are using has support for it, you can check as follows:
 
 ```console
-ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "OpenMPI supports CUDA:",$7}'
+ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "Open MPI supports CUDA:",$7}'
 ```
 
 #### Usage of the CUDA Compiler
diff --git a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
index 18743ba58..a0a1d8670 100644
--- a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
+++ b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
@@ -172,7 +172,7 @@ preEnv_MPICH_GPU_EAGER_DEVICE_MEM=0
 %endif
 
 %ifdef %{ucx}
-# if using OpenMPI with UCX support, these settings are needed with use of CUDA Aware MPI
+# if using Open MPI with UCX support, these settings are needed with use of CUDA Aware MPI
 # without these flags, LBM is known to hang when using OpenACC and OpenMP Target to GPUs
 preENV_UCX_MEMTYPE_CACHE=n
 preENV_UCX_TLS=self,shm,cuda_copy
diff --git a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
index 06b9e1b85..6e6112b1a 100644
--- a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
+++ b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
@@ -217,7 +217,7 @@ preEnv_MPICH_GPU_EAGER_DEVICE_MEM=0
 %endif
 
 %ifdef %{ucx}
-# if using OpenMPI with UCX support, these settings are needed with use of CUDA Aware MPI
+# if using Open MPI with UCX support, these settings are needed with use of CUDA Aware MPI
 # without these flags, LBM is known to hang when using OpenACC and OpenMP Target to GPUs
 preENV_UCX_MEMTYPE_CACHE=n
 preENV_UCX_TLS=self,shm,cuda_copy
diff --git a/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md b/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
index 9fd398d76..c1f570c93 100644
--- a/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
+++ b/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
@@ -123,12 +123,12 @@ At the HPC system run as following:
 marie@login$ srun -n 4 --ntasks-per-node 2 --time=00:10:00 singularity exec ubuntu_mpich.sif /opt/mpitest
 ```
 
-### CUDA + CuDNN + OpenMPI
+### CUDA + CuDNN + Open MPI
 
 * Chosen CUDA version depends on installed driver of host
-* OpenMPI needs PMI for Slurm integration
-* OpenMPI needs CUDA for GPU copy-support
-* OpenMPI needs `ibverbs` library for Infiniband
+* Open MPI needs PMI for Slurm integration
+* Open MPI needs CUDA for GPU copy-support
+* Open MPI needs `ibverbs` library for Infiniband
 * `openmpi-mca-params.conf` required to avoid warnings on fork (OK on ZIH systems)
 * Environment variables `SLURM_VERSION` and `OPENMPI_VERSION` can be set to choose a different version
   when building the container
diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell
index e50cbf260..78bf541c6 100644
--- a/doc.zih.tu-dresden.de/wordlist.aspell
+++ b/doc.zih.tu-dresden.de/wordlist.aspell
@@ -245,6 +245,7 @@ Mortem
 mountpoint
 mpi
 Mpi
+MPI
 mpicc
 mpiCC
 mpicxx
@@ -295,8 +296,6 @@ OpenBLAS
 OpenCL
 OpenGL
 OpenMP
-openmpi
-OpenMPI
 OpenSSH
 Opteron
 ORCA
-- 
GitLab
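As a runtime counterpart to the `ompi_info` check in the `gpu_programming.md` hunk above, Open MPI also exposes the extension `MPIX_Query_cuda_support()` in `<mpi-ext.h>`. Below is a minimal sketch in C, assuming an Open MPI module is loaded; the preprocessor guards keep it compilable against MPI implementations without this extension:

```c
/* Query CUDA awareness of the MPI library at run time. The macro
 * OPEN_MPI is defined by Open MPI's mpi.h; MPIX_CUDA_AWARE_SUPPORT
 * comes from mpi-ext.h and is 0 when CUDA support was not built in. */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Open MPI supports CUDA: %s\n",
           MPIX_Query_cuda_support() ? "true" : "false");
#else
    printf("CUDA awareness unknown at compile time.\n");
#endif

    MPI_Finalize();
    return 0;
}
```

Build with `mpicc` and run a single task, e.g. with `srun -n 1`, to see what the loaded module reports; the output mirrors the `awk` line above.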