# Known Issues with MPI

This page holds known issues observed with MPI in general and with concrete MPI implementations.

## Open MPI
### Performance Loss with MPI-IO-Module OMPIO

Open MPI v4.1.x introduced a couple of major enhancements; for instance, the `OMPIO` module is now
the default MPI-IO module on **all** filesystems, including Lustre (cf.
[NEWS file in Open MPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
Prior to this, `ROMIO` was the default MPI-IO module for Lustre.

Colleagues of ZIH have found that some MPI-IO access patterns suffer a significant performance loss
when using `OMPIO` as the MPI-IO module with `OpenMPI/4.1.x` modules on ZIH systems. At the moment,
the root cause is unclear and needs further investigation.
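
For orientation, "MPI-IO" refers to the `MPI_File_*` interface of the MPI standard. The following
minimal C sketch issues a collective write; whether `OMPIO` or `ROMIO` services these calls is
decided by Open MPI at runtime. File name, buffer size, and the lack of error handling are
illustrative choices and not taken from the affected applications.

```c
#include <mpi.h>

/* Minimal collective MPI-IO write: each rank writes one block of integers
 * to a shared file. Which I/O module (OMPIO or ROMIO) handles these calls
 * can be steered via the OMPI_MCA_io setting described below. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1024 };
    int buffer[N];
    for (int i = 0; i < N; ++i) {
        buffer[i] = rank;
    }

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write at a rank-dependent offset. */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(int);
    MPI_File_write_at_all(fh, offset, buffer, N, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```
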
**A workaround** for this performance loss is to use the "old" MPI-IO module, i.e., `ROMIO`. This
is achieved by setting the environment variable `OMPI_MCA_io` before executing the application:

```console
marie@login$ export OMPI_MCA_io=^ompio
```

or by passing the option as an argument, in case you invoke `mpirun` directly:

```console
marie@login$ mpirun --mca io ^ompio [...]
```

### Mpirun on Clusters `alpha` and `power9`

<!-- According to Max, it is possible that this issue is no longer relevant after the update of `alpha` and `power9`. -->

Using `mpirun` on the clusters `alpha` and `power9` leads to a wrong resource distribution when
more than one node is involved. This yields a strange distribution like e.g.
`SLURM_NTASKS_PER_NODE=15,1` even though `--tasks-per-node=8` was specified. Unless you really know
what you are doing (e.g., rank pinning via a Perl script), avoid using `mpirun`.

Another issue arises when using the Intel toolchain: `mpirun` calls a different MPI and caused an
8-9x slowdown in the PALM application compared to using `srun` or the GCC-compiled version of the
application (which uses the correct MPI).

### R Parallel Library on Multiple Nodes

Using the R parallel library on MPI clusters has shown problems when using more than a few compute
nodes. The error messages indicate that there are buggy interactions of R/Rmpi/Open MPI and UCX.
Disabling UCX has solved these problems in our experiments.

We invoked the R script successfully with the following command:

```console
marie@login$ mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript --vanilla the-script.R
```

where the arguments `-mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx` disable usage of
UCX.

### MPI Function `MPI_Win_allocate`

The function `MPI_Win_allocate` is a one-sided MPI call that allocates memory and returns a window
object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_allocate.3.php)).

> Using MPI_Win_allocate rather than separate MPI_Alloc_mem + MPI_Win_create may allow the MPI
> implementation to optimize the memory allocation. (Using Advanced MPI)

It was observed, at least for the `OpenMPI/4.0.5` module, that using `MPI_Win_allocate` instead of
`MPI_Alloc_mem` in conjunction with `MPI_Win_create` leads to segmentation faults in the calling
application. To be precise, the segmentation faults occurred on the partition `romeo` when about
200 GB per node were allocated. In contrast, the segmentation faults vanished when the
implementation was refactored to call `MPI_Alloc_mem` + `MPI_Win_create` instead.
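
A minimal sketch of the workaround, assuming a plain one-sided communication setup (window size and
displacement unit are placeholders): memory is obtained with `MPI_Alloc_mem` and exposed with
`MPI_Win_create`, instead of a single `MPI_Win_allocate` call.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Aint win_size = 1 << 20;          /* placeholder: 1 MiB per process */
    int disp_unit = sizeof(double);

    /* Workaround: allocate the memory explicitly ... */
    double *base = NULL;
    MPI_Alloc_mem(win_size, MPI_INFO_NULL, &base);

    /* ... and expose it as an RMA window, instead of calling
     * MPI_Win_allocate(win_size, disp_unit, MPI_INFO_NULL,
     *                  MPI_COMM_WORLD, &base, &win) directly. */
    MPI_Win win;
    MPI_Win_create(base, win_size, disp_unit, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    /* ... one-sided communication (MPI_Put/MPI_Get) as usual ... */

    MPI_Win_free(&win);
    MPI_Free_mem(base);
    MPI_Finalize();
    return 0;
}
```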