Commit 5c6b4ea2 authored by Bert Wesarg

Its "Open MPI"

parent 2472bcd0
2 merge requests: !919 Automated merge from preview to main, !911 It's "Open MPI"
Showing 37 additions and 37 deletions
@@ -31,7 +31,7 @@ Please also find out the other ways you could contribute in our
## News
* **2023-10-16** [OpenMPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#openmpi-v41x-performance-loss-with-mpi-io-module-ompio)
* **2023-10-16** [Open MPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#performance-loss-with-mpi-io-module-ompio)
* **2023-10-04** [User tests on Barnard](jobs_and_resources/barnard_test.md)
* **2023-06-01** [New hardware and complete re-design](jobs_and_resources/architecture_2023.md)
* **2023-01-04** [New hardware: NVIDIA Arm HPC Developer Kit](jobs_and_resources/arm_hpc_devkit.md)
@@ -2,15 +2,17 @@
This page holds known issues observed with MPI and concrete MPI implementations.
## OpenMPI v4.1.x - Performance Loss with MPI-IO-Module OMPIO
## Open MPI
OpenMPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
### Performance Loss with MPI-IO-Module OMPIO
Open MPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
default module for MPI-IO on **all** filesystems incl. Lustre (cf.
[NEWS file in OpenMPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
[NEWS file in Open MPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
Prior to this, `ROMIO` was the default MPI-IO module for Lustre.
Colleagues of ZIH have found that some MPI-IO access patterns suffer a significant performance loss
using `OMPIO` as MPI-IO module with OpenMPI/4.1.x modules on ZIH systems. At the moment, the root
using `OMPIO` as MPI-IO module with `OpenMPI/4.1.x` modules on ZIH systems. At the moment, the root
cause is unclear and needs further investigation.
**A workaround** for this performance loss is to use the "old" MPI-IO module, i.e., `ROMIO`. This
@@ -18,17 +20,17 @@ is achieved by setting the environment variable `OMPI_MCA_io` before executing t
follows
```console
export OMPI_MCA_io=^ompio
srun ...
marie@login$ export OMPI_MCA_io=^ompio
marie@login$ srun ...
```
or by setting the option as an argument, in case you invoke `mpirun` directly
```console
mpirun --mca io ^ompio ...
marie@login$ mpirun --mca io ^ompio ...
```
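For batch jobs, the workaround can be set directly in the job script. A minimal sketch, assuming a generic MPI-IO application (`./my_mpiio_app` and the resource values are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00

# disable the OMPIO module so that Open MPI falls back to ROMIO for MPI-IO
export OMPI_MCA_io=^ompio

srun ./my_mpiio_app  # placeholder for your MPI-IO application
```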
## Mpirun on partition `alpha` and `ml`
### Mpirun on partition `alpha` and `ml`
Using `mpirun` on partitions `alpha` and `ml` leads to wrong resource distribution when more than
one node is involved. This yields a strange distribution, e.g., `SLURM_NTASKS_PER_NODE=15,1`
@@ -39,23 +41,22 @@ Another issue arises when using the Intel toolchain: mpirun calls a different MP
8-9x slowdown in the PALM app in comparison to using `srun` or the GCC-compiled version of the app
(which uses the correct MPI).
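In both cases, launching the application with `srun` instead of `mpirun` avoids the problem. A minimal sketch (application name and resource values are placeholders):

```console
marie@login$ srun --nodes=2 --ntasks-per-node=8 ./my_mpi_app
```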
## R Parallel Library on Multiple Nodes
### R Parallel Library on Multiple Nodes
Using the R parallel library on MPI clusters has shown problems when using more than a few compute
nodes. The error messages indicate that there are buggy interactions of R/Rmpi/OpenMPI and UCX.
nodes. The error messages indicate that there are buggy interactions of R/Rmpi/Open MPI and UCX.
Disabling UCX has solved these problems in our experiments.
We invoked the R script successfully with the following command:
```console
mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript
--vanilla the-script.R
marie@login$ mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript --vanilla the-script.R
```
where the arguments `-mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx` disable usage of
UCX.
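For batch submission, the same call can be wrapped in a job script. A minimal sketch, assuming the resource values fit your use case:

```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00

# disable UCX via the MCA options shown above; -np 1 as in the interactive example
mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 \
    Rscript --vanilla the-script.R
```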
## MPI Function `MPI_Win_allocate`
### MPI Function `MPI_Win_allocate`
The function `MPI_Win_allocate` is a one-sided MPI call that allocates memory and returns a window
object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_allocate.3.php)).
@@ -65,6 +66,6 @@ object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/ma
It was observed at least for the `OpenMPI/4.0.5` module that using `MPI_Win_allocate` instead of
`MPI_Alloc_mem` in conjunction with `MPI_Win_create` leads to segmentation faults in the calling
application . To be precise, the segfaults occurred at partition `romeo` when about 200 GB per node
application. To be precise, the segfaults occurred at partition `romeo` when about 200 GB per node
were allocated. In contrast, the segmentation faults vanished when the implementation was
refactored to call the `MPI_Alloc_mem + MPI_Win_create` functions.
refactored to call the `MPI_Alloc_mem` + `MPI_Win_create` functions.
@@ -103,5 +103,5 @@ case on Rome. You might want to try `-mavx2 -fma` instead.
### Intel MPI
We have seen only half the theoretical peak bandwidth via Infiniband between two nodes, whereas
OpenMPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
Open MPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
`rome` if your application heavily relies on MPI communication until this issue is resolved.
@@ -21,8 +21,8 @@ project's quota can be increased or dedicated volumes of up to the full capacity
- Granularity should be a socket (28 cores)
- Can be used for OpenMP applications with large memory demands
- To use OpenMPI it is necessary to export the following environment
variables, so that OpenMPI uses shared-memory instead of Infiniband
- To use Open MPI it is necessary to export the following environment
variables, so that Open MPI uses shared memory instead of Infiniband
for message transport:
@@ -391,7 +391,7 @@ Another example:
list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)
# shut down the cluster
#snow::stopCluster(cl) # usually it hangs over here with OpenMPI > 2.0. In this case this command may be avoided, Slurm will clean up after the job finishes
#snow::stopCluster(cl) # usually it hangs here with Open MPI > 2.0; in this case the command may be omitted, Slurm will clean up after the job finishes
```
To use Rmpi and MPI please use one of these partitions: `haswell`, `broadwell` or `rome`.
@@ -300,8 +300,8 @@ Available Tensor Operations:
[ ] Gloo
```
If you want to use OpenMPI then specify `HOROVOD_GPU_ALLREDUCE=MPI`.
To have better performance it is recommended to use NCCL instead of OpenMPI.
If you want to use Open MPI then specify `HOROVOD_GPU_ALLREDUCE=MPI`.
For better performance it is recommended to use NCCL instead of Open MPI.
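The variable takes effect when Horovod is (re)built. A minimal sketch, assuming Horovod is installed from source via pip in an already prepared environment:

```console
marie@login$ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
```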
##### Verify Horovod Works
@@ -200,12 +200,12 @@ detail in [nvcc documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver
This compiler is available via several `CUDA` packages; a default version can be loaded via
`module load CUDA`. Additionally, the `NVHPC` modules provide CUDA tools as well.
For using CUDA with OpenMPI at multiple nodes, the OpenMPI module loaded shall have be compiled with
For using CUDA with Open MPI at multiple nodes, the loaded `OpenMPI` module must have been compiled with
CUDA support. If you are not sure whether the module you are using has CUDA support, you can check it as
follows:
```console
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "OpenMPI supports CUDA:",$7}'
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "Open MPI supports CUDA:",$7}'
```
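The check prints `true` or `false` depending on the loaded module. A sample session, where the `true` output is only an assumption for a CUDA-enabled build:

```console
marie@login$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "Open MPI supports CUDA:",$7}'
Open MPI supports CUDA: true
```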
#### Usage of the CUDA Compiler
@@ -172,7 +172,7 @@ preEnv_MPICH_GPU_EAGER_DEVICE_MEM=0
%endif
%ifdef %{ucx}
# if using OpenMPI with UCX support, these settings are needed with use of CUDA Aware MPI
# if using Open MPI with UCX support, these settings are needed with use of CUDA Aware MPI
# without these flags, LBM is known to hang when using OpenACC and OpenMP Target to GPUs
preENV_UCX_MEMTYPE_CACHE=n
preENV_UCX_TLS=self,shm,cuda_copy
@@ -217,7 +217,7 @@ preEnv_MPICH_GPU_EAGER_DEVICE_MEM=0
%endif
%ifdef %{ucx}
# if using OpenMPI with UCX support, these settings are needed with use of CUDA Aware MPI
# if using Open MPI with UCX support, these settings are needed with use of CUDA Aware MPI
# without these flags, LBM is known to hang when using OpenACC and OpenMP Target to GPUs
preENV_UCX_MEMTYPE_CACHE=n
preENV_UCX_TLS=self,shm,cuda_copy
@@ -93,18 +93,18 @@ From: ubuntu:20.04
mkdir -p /opt
# Download
cd /tmp/mpich && wget -O mpich-$MPICH_VERSION.tar.gz $MPICH_URL && tar -xf mpich-$MPICH_VERSION.tar.gz
# Configure and compile/install
cd /tmp/mpich/mpich-$MPICH_VERSION
./configure --prefix=$MPICH_DIR && make install
# Set env variables so we can compile our application
export PATH=$MPICH_DIR/bin:$PATH
export LD_LIBRARY_PATH=$MPICH_DIR/lib:$LD_LIBRARY_PATH
export MANPATH=$MPICH_DIR/share/man:$MANPATH
echo "Compiling the MPI application..."
cd /opt && mpicc -o mpitest mpitest.c
```
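The definition file is then built into an image on a machine where you have root rights. A minimal sketch, where the file name `ubuntu_mpich.def` is an assumption derived from the image name used below:

```console
marie@local$ sudo singularity build ubuntu_mpich.sif ubuntu_mpich.def
```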
@@ -123,12 +123,12 @@ At the HPC system run as following:
marie@login$ srun -n 4 --ntasks-per-node 2 --time=00:10:00 singularity exec ubuntu_mpich.sif /opt/mpitest
```
### CUDA + CuDNN + OpenMPI
### CUDA + CuDNN + Open MPI
* Chosen CUDA version depends on the installed driver of the host
* OpenMPI needs PMI for Slurm integration
* OpenMPI needs CUDA for GPU copy-support
* OpenMPI needs `ibverbs` library for Infiniband
* Open MPI needs PMI for Slurm integration
* Open MPI needs CUDA for GPU copy support
* Open MPI needs the `ibverbs` library for Infiniband
* `openmpi-mca-params.conf` required to avoid warnings on fork (OK on ZIH systems)
* Environment variables `SLURM_VERSION` and `OPENMPI_VERSION` can be set to choose a different
  version when building the container (see the sketch after this list)
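A possible build invocation, assuming the recipe picks up both variables at build time (file names and version numbers are placeholders):

```console
marie@local$ sudo SLURM_VERSION=20.11 OPENMPI_VERSION=4.1 singularity build ubuntu_cuda_mpi.sif ubuntu_cuda_mpi.def
```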
@@ -245,6 +245,7 @@ Mortem
mountpoint
mpi
Mpi
MPI
mpicc
mpiCC
mpicxx
@@ -295,8 +296,6 @@ OpenBLAS
OpenCL
OpenGL
OpenMP
openmpi
OpenMPI
OpenSSH
Opteron
ORCA