From 03c9da5cbe6672261726f0976f4659b707bf0e4a Mon Sep 17 00:00:00 2001
From: Bert Wesarg <bert.wesarg@tu-dresden.de>
Date: Thu, 26 Oct 2023 09:20:09 +0200
Subject: [PATCH] It's "Open MPI"

---
 doc.zih.tu-dresden.de/docs/index.md               |  2 +-
 .../docs/jobs_and_resources/mpi_issues.md         | 31 ++++++++++---------
 .../docs/jobs_and_resources/rome_nodes.md         |  2 +-
 .../docs/jobs_and_resources/sd_flex.md            |  4 +--
 .../docs/software/data_analytics_with_r.md        |  2 +-
 .../docs/software/distributed_training.md         |  4 +--
 .../docs/software/gpu_programming.md              |  4 +--
 .../docs/software/misc/spec_nvhpc-alpha.cfg       |  2 +-
 .../docs/software/misc/spec_nvhpc-ppc.cfg         |  2 +-
 .../docs/software/singularity_recipe_hints.md     |  8 ++---
 doc.zih.tu-dresden.de/wordlist.aspell             |  3 +-
 11 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md
index 57e6d4b82..f76dddd4f 100644
--- a/doc.zih.tu-dresden.de/docs/index.md
+++ b/doc.zih.tu-dresden.de/docs/index.md
@@ -31,7 +31,7 @@ Please also find out the other ways you could contribute in our
 
 ## News
 
-* **2023-10-16** [OpenMPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#openmpi-v41x-performance-loss-with-mpi-io-module-ompio)
+* **2023-10-16** [Open MPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#performance-loss-with-mpi-io-module-ompio)
 * **2023-10-04** [User tests on Barnard](jobs_and_resources/barnard_test.md)
 * **2023-06-01** [New hardware and complete re-design](jobs_and_resources/architecture_2023.md)
 * **2023-01-04** [New hardware: NVIDIA Arm HPC Developer Kit](jobs_and_resources/arm_hpc_devkit.md)
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
index 95f6eb589..006331a44 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
@@ -2,15 +2,17 @@
 
 This page holds known issues observed with MPI and concrete MPI implementations.
 
-## OpenMPI v4.1.x - Performance Loss with MPI-IO-Module OMPIO
+## Open MPI
 
-OpenMPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
+### Performance Loss with MPI-IO-Module OMPIO
+
+Open MPI v4.1.x introduced a couple of major enhancements, e.g., the `OMPIO` module is now the
 default module for MPI-IO on **all** filesystems incl. Lustre (cf.
-[NEWS file in OpenMPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
+[NEWS file in Open MPI source code](https://raw.githubusercontent.com/open-mpi/ompi/v4.1.x/NEWS)).
 Prior to this, `ROMIO` was the default MPI-IO module for Lustre.
 
 Colleagues of ZIH have found that some MPI-IO access patterns suffer a significant performance loss
-using `OMPIO` as MPI-IO module with OpenMPI/4.1.x modules on ZIH systems. At the moment, the root
+using `OMPIO` as MPI-IO module with `OpenMPI/4.1.x` modules on ZIH systems. At the moment, the root
 cause is unclear and needs further investigation.
 
 **A workaround** for this performance loss is to use the "old" MPI-IO module, i.e., `ROMIO`. This
@@ -18,17 +20,17 @@ is achieved by setting the environment variable `OMPI_MCA_io` before executing t
 follows
 
 ```console
-export OMPI_MCA_io=^ompio
-srun ...
+marie@login$ export OMPI_MCA_io=^ompio
+marie@login$ srun ...
 ```
 
 or setting the option as argument, in case you invoke `mpirun` directly
 
 ```console
-mpirun --mca io ^ompio ...
+marie@login$ mpirun --mca io ^ompio ...
 ```
 
-## Mpirun on partition `alpha` and `ml`
+### Mpirun on partition `alpha` and `ml`
 
 Using `mpirun` on partitions `alpha` and `ml` leads to wrong resource distribution when more than
 one node is involved. This yields a strange distribution, e.g., `SLURM_NTASKS_PER_NODE=15,1`
@@ -39,23 +41,22 @@ Another issue arises when using the Intel toolchain: mpirun calls a different MP
 8-9x slowdown in the PALM app in comparison to using srun or the GCC-compiled version of the app
 (which uses the correct MPI).
 
-## R Parallel Library on Multiple Nodes
+### R Parallel Library on Multiple Nodes
 
 Using the R parallel library on MPI clusters has shown problems when using more than a few compute
-nodes. The error messages indicate that there are buggy interactions of R/Rmpi/OpenMPI and UCX.
+nodes. The error messages indicate that there are buggy interactions of R/Rmpi/Open MPI and UCX.
 Disabling UCX has solved these problems in our experiments. We invoked the R script successfully
 with the following command:
 
 ```console
-mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript
---vanilla the-script.R
+marie@login$ mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript --vanilla the-script.R
 ```
 
 where the arguments `-mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx` disable usage of
 UCX.
 
-## MPI Function `MPI_Win_allocate`
+### MPI Function `MPI_Win_allocate`
 
 The function `MPI_Win_allocate` is a one-sided MPI call that allocates memory and returns a window
 object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_allocate.3.php)).
@@ -65,6 +66,6 @@ object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/ma
 
 It was observed at least for the `OpenMPI/4.0.5` module that using `MPI_Win_allocate` instead
 of `MPI_Alloc_mem` in conjunction with `MPI_Win_create` leads to segmentation faults in the calling
-application . To be precise, the segfaults occurred at partition `romeo` when about 200 GB per node
+application. To be precise, the segfaults occurred at partition `romeo` when about 200 GB per node
 were allocated. In contrast, the segmentation faults vanished when the implementation was
-refactored to call the `MPI_Alloc_mem + MPI_Win_create` functions.
+refactored to call the `MPI_Alloc_mem` + `MPI_Win_create` functions.
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
index 4347dd6b0..f270f8f1d 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/rome_nodes.md
@@ -103,5 +103,5 @@ case on Rome. You might want to try `-mavx2 -fma` instead.
 ### Intel MPI
 
 We have seen only half the theoretical peak bandwidth via Infiniband between two nodes, whereas
-OpenMPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
+Open MPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
 `rome` if your application heavily relies on MPI communication until this issue is resolved.
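The refactoring described in the `MPI_Win_allocate` section of `mpi_issues.md` above is easiest to see in code. Below is a minimal sketch in C, illustrative only and not taken from the patched files: window size, displacement unit, and communicator are placeholders, and error handling is omitted.

```c
/* Sketch of the workaround: allocate memory with MPI_Alloc_mem and
 * expose it with MPI_Win_create instead of calling MPI_Win_allocate.
 * Placeholder size and communicator; no error handling. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Aint size = (MPI_Aint)1 << 30;  /* placeholder: 1 GiB per process */
    void    *base = NULL;
    MPI_Win  win;

    /* Variant reported to segfault for large windows:
     *   MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
     * Workaround: allocate and attach the memory in two steps. */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &base);
    MPI_Win_create(base, size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* ... MPI_Put/MPI_Get access epochs on win ... */

    MPI_Win_free(&win);
    MPI_Free_mem(base);
    MPI_Finalize();
    return 0;
}
```

Both variants provide the same one-sided semantics; only the allocation path differs, which is why the two-step version can serve as a drop-in replacement.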
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
index f4788c1cf..946cca8bc 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/sd_flex.md
@@ -21,8 +21,8 @@ project's quota can be increased or dedicated volumes of up to the full capacity
 
 - Granularity should be a socket (28 cores)
 - Can be used for OpenMP applications with large memory demands
-- To use OpenMPI it is necessary to export the following environment
-  variables, so that OpenMPI uses shared-memory instead of Infiniband
+- To use Open MPI it is necessary to export the following environment
+  variables, so that Open MPI uses shared-memory instead of Infiniband
   for message transport:
 
 ```
diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
index c7334d918..41670d12b 100644
--- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
+++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md
@@ -391,7 +391,7 @@ Another example:
     list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl)
 
     # shut down the cluster
-    #snow::stopCluster(cl) # usually it hangs over here with OpenMPI > 2.0. In this case this command may be avoided, Slurm will clean up after the job finishes
+    #snow::stopCluster(cl) # usually it hangs here with Open MPI > 2.0. In this case this command may be omitted; Slurm will clean up after the job finishes
 ```
 
 To use Rmpi and MPI please use one of these partitions: `haswell`, `broadwell` or `rome`.
diff --git a/doc.zih.tu-dresden.de/docs/software/distributed_training.md b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
index 4e8fc427e..094b6f8dc 100644
--- a/doc.zih.tu-dresden.de/docs/software/distributed_training.md
+++ b/doc.zih.tu-dresden.de/docs/software/distributed_training.md
@@ -300,8 +300,8 @@ Available Tensor Operations:
     [ ] Gloo
 ```
 
-If you want to use OpenMPI then specify `HOROVOD_GPU_ALLREDUCE=MPI`.
-To have better performance it is recommended to use NCCL instead of OpenMPI.
+If you want to use Open MPI, specify `HOROVOD_GPU_ALLREDUCE=MPI`.
+For better performance, it is recommended to use NCCL instead of Open MPI.
 
 ##### Verify Horovod Works
diff --git a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
index 2e5b57422..28d220fed 100644
--- a/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
+++ b/doc.zih.tu-dresden.de/docs/software/gpu_programming.md
@@ -200,12 +200,12 @@ detail in [nvcc documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver
 This compiler is available via several `CUDA` packages; a default version can be loaded via
 `module load CUDA`. Additionally, the `NVHPC` modules provide CUDA tools as well.
 
-For using CUDA with OpenMPI at multiple nodes, the OpenMPI module loaded shall have be compiled with
+For using CUDA with Open MPI at multiple nodes, the `OpenMPI` module loaded must have been compiled with
 CUDA support.
 If you aren't sure whether the module you are using has support for it, you can check as follows:
 
 ```console
-ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "OpenMPI supports CUDA:",$7}'
+ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "Open MPI supports CUDA:",$7}'
 ```
 
 #### Usage of the CUDA Compiler
diff --git a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
index 18743ba58..a0a1d8670 100644
--- a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
+++ b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-alpha.cfg
@@ -172,7 +172,7 @@ preEnv_MPICH_GPU_EAGER_DEVICE_MEM=0
 %endif
 
 %ifdef %{ucx}
-# if using OpenMPI with UCX support, these settings are needed with use of CUDA Aware MPI
+# if using Open MPI with UCX support, these settings are needed with use of CUDA Aware MPI
 # without these flags, LBM is known to hang when using OpenACC and OpenMP Target to GPUs
 preENV_UCX_MEMTYPE_CACHE=n
 preENV_UCX_TLS=self,shm,cuda_copy
diff --git a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
index 06b9e1b85..6e6112b1a 100644
--- a/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
+++ b/doc.zih.tu-dresden.de/docs/software/misc/spec_nvhpc-ppc.cfg
@@ -217,7 +217,7 @@ preEnv_MPICH_GPU_EAGER_DEVICE_MEM=0
 %endif
 
 %ifdef %{ucx}
-# if using OpenMPI with UCX support, these settings are needed with use of CUDA Aware MPI
+# if using Open MPI with UCX support, these settings are needed with use of CUDA Aware MPI
 # without these flags, LBM is known to hang when using OpenACC and OpenMP Target to GPUs
 preENV_UCX_MEMTYPE_CACHE=n
 preENV_UCX_TLS=self,shm,cuda_copy
diff --git a/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md b/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
index 9fd398d76..c1f570c93 100644
--- a/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
+++ b/doc.zih.tu-dresden.de/docs/software/singularity_recipe_hints.md
@@ -123,12 +123,12 @@ At the HPC system run as following:
 marie@login$ srun -n 4 --ntasks-per-node 2 --time=00:10:00 singularity exec ubuntu_mpich.sif /opt/mpitest
 ```
 
-### CUDA + CuDNN + OpenMPI
+### CUDA + CuDNN + Open MPI
 
 * Chosen CUDA version depends on installed driver of host
-* OpenMPI needs PMI for Slurm integration
-* OpenMPI needs CUDA for GPU copy-support
-* OpenMPI needs `ibverbs` library for Infiniband
+* Open MPI needs PMI for Slurm integration
+* Open MPI needs CUDA for GPU copy-support
+* Open MPI needs `ibverbs` library for Infiniband
 * `openmpi-mca-params.conf` required to avoid warnings on fork (OK on ZIH systems)
 * Environment variables `SLURM_VERSION` and `OPENMPI_VERSION` can be set to choose a different version
   when building the container
diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell
index e50cbf260..78bf541c6 100644
--- a/doc.zih.tu-dresden.de/wordlist.aspell
+++ b/doc.zih.tu-dresden.de/wordlist.aspell
@@ -245,6 +245,7 @@ Mortem
 mountpoint
 mpi
 Mpi
+MPI
 mpicc
 mpiCC
 mpicxx
@@ -295,8 +296,6 @@ OpenBLAS
 OpenCL
 OpenGL
 OpenMP
-openmpi
-OpenMPI
 OpenSSH
 Opteron
 ORCA
-- 
GitLab
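As a runtime counterpart to the `ompi_info` check in the `gpu_programming.md` hunk above, Open MPI also exposes the extension `MPIX_Query_cuda_support()` in `<mpi-ext.h>`. Below is a minimal sketch in C, assuming an Open MPI module is loaded; the preprocessor guards keep it compilable against MPI implementations without this extension:

```c
/* Query CUDA awareness of the MPI library at run time. The macro
 * OPEN_MPI is defined by Open MPI's mpi.h; MPIX_CUDA_AWARE_SUPPORT
 * comes from mpi-ext.h and is 0 when CUDA support was not built in. */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Open MPI supports CUDA: %s\n",
           MPIX_Query_cuda_support() ? "true" : "false");
#else
    printf("CUDA awareness unknown at compile time.\n");
#endif

    MPI_Finalize();
    return 0;
}
```

Build with `mpicc` and run a single task, e.g. with `srun -n 1`, to see what the loaded module reports; the output mirrors the `awk` line above.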