diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
index 95f6eb58990233e85c5dfa535e0c1bde0c29ade6..93d0cf42392f3ede58dbe08bc97fb12db8c9ab22 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
@@ -1,16 +1,18 @@
-# Known Issues when Using MPI
+# Known Issues with MPI
 
-This pages holds known issues observed with MPI and concrete MPI implementations.
+This page holds known issues observed with MPI and concrete MPI implementations.
 
-## OpenMPI v4.1.x - Performance Loss with MPI-IO-Module OMPIO
+## Open MPI
 
-OpenMPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
+### Version v4.1.x - Performance Loss with MPI-IO-Module OMPIO
+
+Open MPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
 default module for MPI-IO on **all** filesystems incl. Lustre (cf.
-[NEWS file in OpenMPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
+[NEWS file in Open MPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
 Prior to this, `ROMIO` was the default MPI-IO module for Lustre.
 
 Colleagues of ZIH have found that some MPI-IO access patterns suffer a significant performance loss
-using `OMPIO` as MPI-IO module with OpenMPI/4.1.x modules on ZIH systems. At the moment, the root
+using `OMPIO` as MPI-IO module with `OpenMPI/4.1.x` modules on ZIH systems. At the moment, the root
 cause is unclear and needs further investigation.
 
 **A workaround** for this performance loss is to use "old", i.e., `ROMIO` MPI-IO-module. This
@@ -18,17 +20,17 @@ is achieved by setting the environment variable `OMPI_MCA_io` before executing t
 follows
 
 ```console
-export OMPI_MCA_io=^ompio
-srun ...
+marie@login$ export OMPI_MCA_io=^ompio
+marie@login$ srun …
 ```
 
-or setting the option as argument, in case you invoke `mpirun` directly
+or setting the option as an argument if you invoke `mpirun` directly
 
 ```console
-mpirun --mca io ^ompio ...
+marie@login$ mpirun --mca io ^ompio …
 ```
 
-## Mpirun on partition `alpha` and `ml`
+### Mpirun on partitions `alpha` and `ml`
 
 Using `mpirun` on partitions `alpha` and `ml` leads to wrong resource distribution when more than
 one node is involved. This yields a strange distribution like e.g. `SLURM_NTASKS_PER_NODE=15,1`
@@ -39,23 +41,32 @@ Another issue arises when using the Intel toolchain: mpirun calls a different MP
 8-9x slowdown in the PALM app in comparison to using srun or the GCC-compiled version of the app
 (which uses the correct MPI).
 
-## R Parallel Library on Multiple Nodes
+### R Parallel Library on Multiple Nodes
 
 Using the R parallel library on MPI clusters has shown problems when using more than a few compute
-nodes. The error messages indicate that there are buggy interactions of R/Rmpi/OpenMPI and UCX.
+nodes. The error messages indicate that there are buggy interactions of R/Rmpi/Open MPI and UCX.
 Disabling UCX has solved these problems in our experiments.
 
 We invoked the R script successfully with the following command:
 
 ```console
-mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript
---vanilla the-script.R
+marie@login$ mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript --vanilla the-script.R
 ```
 
 where the arguments `-mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx` disable usage
 of UCX.
 
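+The same MCA parameters can also be set as environment variables before launching the application,
+analogous to the `OMPI_MCA_io` workaround above. This variant is an untested sketch that simply
+mirrors the `mpirun` options given above:
+
+```console
+marie@login$ export OMPI_MCA_btl_openib_allow_ib=true
+marie@login$ export OMPI_MCA_pml=^ucx
+marie@login$ export OMPI_MCA_osc=^ucx
+```
+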
-## MPI Function `MPI_Win_allocate`
+### MPI Function `MPI_Win_allocate`
 
 The function `MPI_Win_allocate` is a one-sided MPI call that allocates memory and returns a window
 object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_allocate.3.php)).
@@ -65,6 +76,36 @@
 
-It was observed for at least for the `OpenMPI/4.0.5` module that using `MPI_Win_Allocate` instead
+It was observed, at least for the `OpenMPI/4.0.5` module, that using `MPI_Win_allocate` instead
 of `MPI_Alloc_mem` in conjunction with `MPI_Win_create` leads to segmentation faults in the calling
-application . To be precise, the segfaults occurred at partition `romeo` when about 200 GB per node
-where allocated. In contrast, the segmentation faults vanished when the implementation was
-refactored to call the `MPI_Alloc_mem + MPI_Win_create` functions.
+application. To be precise, the segfaults occurred on partition `romeo` when about 200 GB per node
+were allocated. In contrast, the segmentation faults vanished when the implementation was
+refactored to call the `MPI_Alloc_mem` + `MPI_Win_create` functions.
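+
+The following minimal sketch illustrates such a refactoring. It is not taken from the affected
+application; the window size, the info arguments, and the omitted one-sided operations are
+placeholders only.
+
+```c
+#include <mpi.h>
+
+int main(int argc, char **argv) {
+    MPI_Init(&argc, &argv);
+
+    MPI_Aint win_size = (MPI_Aint)1 << 30;  /* placeholder: window size per process */
+    void *base = NULL;
+    MPI_Win win;
+
+    /* Variant that showed segmentation faults with the OpenMPI/4.0.5 module for large windows:
+       MPI_Win_allocate(win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win); */
+
+    /* Workaround: allocate the memory explicitly, then expose it as an RMA window. */
+    MPI_Alloc_mem(win_size, MPI_INFO_NULL, &base);
+    MPI_Win_create(base, win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
+
+    /* ... one-sided communication, e.g., MPI_Put/MPI_Get ... */
+
+    MPI_Win_free(&win);
+    MPI_Free_mem(base);
+    MPI_Finalize();
+    return 0;
+}
+```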