Commit c103c94c authored by Martin Schroschk

Merge branch 'preview' into 'issue-507'

# Conflicts:
#   doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_2023.md
parents bee7546a 31dc582a
2 merge requests: !919 Automated merge from preview to main, !914 Issue 507
Showing 18 additions and 17 deletions
@@ -20,7 +20,7 @@ search:
At the moment when parts of the IB stop we will start batch system plugins to parse for this batch
system option: `--comment=NO_IB`. Jobs with this option set can run on nodes without
-Infiniband access if (and only if) they have set the `--tmp`-option as well:
+InfiniBand access if (and only if) they have set the `--tmp`-option as well:
*From the Slurm documentation:*
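A minimal submission sketch combining the two options described in this hunk; the job script name and the `--tmp` size are placeholders, not taken from the documentation:

```
# Hypothetical example: job script name and --tmp size are placeholders.
marie@login$ sbatch --comment=NO_IB --tmp=5G my_jobscript.sh
```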
@@ -17,7 +17,7 @@ Here are the major changes from the user's perspective:
| Red Hat Enterprise Linux (RHEL) | 6.x | 7.x |
| Linux kernel | 2.26 | 3.10 |
| glibc | 2.12 | 2.17 |
-| Infiniband stack | OpenIB | Mellanox |
+| InfiniBand stack | OpenIB | Mellanox |
| Lustre client | 2.5 | 2.10 |
## Host Keys
@@ -14,7 +14,7 @@ The following data can be gathered:
* Task data, such as CPU frequency, CPU utilization, memory consumption (RSS and VMSize), I/O
* Energy consumption of the nodes
-* Infiniband data (currently deactivated)
+* InfiniBand data (currently deactivated)
* Lustre filesystem data (currently deactivated)
The data is sampled at a fixed rate (i.e. every 5 seconds) and is stored in a HDF5 file.
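As a sketch only (the file name is a placeholder), the standard HDF5 command line tools can be used to inspect such a file:

```
# Placeholder file name; h5ls lists the groups and datasets stored in the HDF5 file.
marie@login$ h5ls -r job_monitoring.h5
```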
@@ -32,7 +32,7 @@ node has 180 GB local disk space for scratch mounted on `/tmp`. The jobs for the
scheduled by the [Platform LSF](platform_lsf.md) batch system from the login nodes
`atlas.hrsk.tu-dresden.de` .
-A QDR Infiniband interconnect provides the communication and I/O infrastructure for low latency /
+A QDR InfiniBand interconnect provides the communication and I/O infrastructure for low latency /
high throughput data traffic.
Users with a login on the [SGI Altix](system_altix.md) can access their home directory via NFS
@@ -29,7 +29,7 @@ mounted on `/tmp`. The jobs for the compute nodes are scheduled by the
[Platform LSF](platform_lsf.md)
batch system from the login nodes `deimos.hrsk.tu-dresden.de` .
-Two separate Infiniband networks (10 Gb/s) with low cascading switches provide the communication and
+Two separate InfiniBand networks (10 Gb/s) with low cascading switches provide the communication and
I/O infrastructure for low latency / high throughput data traffic. An additional gigabit Ethernet
network is used for control and service purposes.
@@ -25,7 +25,7 @@ All nodes share a 4.4 TB SAN. Each node has additional local disk space mounted
jobs for the compute nodes are scheduled by a [Platform LSF](platform_lsf.md) batch system running on
the login node `phobos.hrsk.tu-dresden.de`.
-Two separate Infiniband networks (10 Gb/s) with low cascading switches provide the infrastructure
+Two separate InfiniBand networks (10 Gb/s) with low cascading switches provide the infrastructure
for low latency / high throughput data traffic. An additional GB/Ethernetwork is used for control
and service purposes.
@@ -2,7 +2,7 @@
With the replacement of the Taurus system by the cluster `Barnard` in 2023,
the rest of the installed hardware had to be re-connected, both with
-Infiniband and with Ethernet.
+InfiniBand and with Ethernet.
![Architecture overview 2023](../jobs_and_resources/misc/architecture_2023.png)
{: align=center}
@@ -12,7 +12,7 @@ This Arm HPC Developer kit offers:
* 512G DDR4 memory (8x 64G)
* 6TB SAS/ SATA 3.5″
* 2x NVIDIA A100 GPU
-* 2x NVIDIA BlueField-2 E-Series DPU: 200GbE/HDR single-port, both connected to the Infiniband network
+* 2x NVIDIA BlueField-2 E-Series DPU: 200GbE/HDR single-port, both connected to the InfiniBand network
## Further Information
@@ -10,7 +10,7 @@ The new HPC system "Barnard" from Bull comes with these main properties:
* 630 compute nodes based on Intel Sapphire Rapids
* new Lustre-based storage systems
-* HDR Infiniband network large enough to integrate existing and near-future non-Bull hardware
+* HDR InfiniBand network large enough to integrate existing and near-future non-Bull hardware
* To help our users to find the best location for their data we now use the name of
animals (size, speed) as mnemonics.
@@ -24,8 +24,9 @@ To lower this hurdle we now create homogeneous clusters with their own Slurm ins
cluster specific login nodes running on the same CPU. Job submission is possible only
from within the cluster (compute or login node).
-All clusters will be integrated to the new Infiniband fabric and have then the same access to
-the shared filesystems. This re-cabling requires a brief downtime of a few days.
+All clusters will be integrated to the new InfiniBand fabric and have then the same access to
+the shared filesystems. This recabling requires a brief downtime of a few days.
[Details on architecture](/jobs_and_resources/architecture_2023).
### New Software
@@ -4,7 +4,7 @@
- 8x Intel NVMe Datacenter SSD P4610, 3.2 TB
- 3.2 GB/s (8x 3.2 =25.6 GB/s)
-- 2 Infiniband EDR links, Mellanox MT27800, ConnectX-5, PCIe x16, 100
+- 2 InfiniBand EDR links, Mellanox MT27800, ConnectX-5, PCIe x16, 100
Gbit/s
- 2 sockets Intel Xeon E5-2620 v4 (16 cores, 2.10GHz)
- 64 GB RAM
@@ -102,6 +102,6 @@ case on Rome. You might want to try `-mavx2 -fma` instead.
### Intel MPI
-We have seen only half the theoretical peak bandwidth via Infiniband between two nodes, whereas
+We have seen only half the theoretical peak bandwidth via InfiniBand between two nodes, whereas
Open MPI got close to the peak bandwidth, so you might want to avoid using Intel MPI on partition
`rome` if your application heavily relies on MPI communication until this issue is resolved.
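A hedged way to check the bandwidth difference yourself; the `OpenMPI` module name and the OSU `osu_bw` benchmark binary are assumptions, not taken from the documentation:

```
# Assumptions: an Open MPI module is available and the OSU micro-benchmarks are built as ./osu_bw.
marie@login$ module load OpenMPI
marie@login$ srun --partition=rome --nodes=2 --ntasks-per-node=1 ./osu_bw
```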
@@ -22,7 +22,7 @@ project's quota can be increased or dedicated volumes of up to the full capacity
- Granularity should be a socket (28 cores)
- Can be used for OpenMP applications with large memory demands
- To use Open MPI it is necessary to export the following environment
-  variables, so that Open MPI uses shared-memory instead of Infiniband
+  variables, so that Open MPI uses shared-memory instead of InfiniBand
for message transport:
```
@@ -31,4 +31,4 @@ project's quota can be increased or dedicated volumes of up to the full capacity
```
- Use `I_MPI_FABRICS=shm` so that Intel MPI doesn't even consider
-  using Infiniband devices itself, but only shared-memory instead
+  using InfiniBand devices itself, but only shared-memory instead
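The concrete `export` lines sit outside the visible hunk; as a hedged sketch only, a common way to pin Open MPI (2.x-4.x) to its shared-memory transport, together with the Intel MPI setting named above, could look like this:

```
# Sketch only: the variables actually listed in the documentation are not shown in this diff.
export OMPI_MCA_pml=ob1          # Open MPI: use the BTL-based ob1 point-to-point layer
export OMPI_MCA_btl=self,vader   # Open MPI: "vader" is the shared-memory transport
export I_MPI_FABRICS=shm         # Intel MPI: shared memory only, as stated above
```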
@@ -128,7 +128,7 @@ marie@login$ srun -n 4 --ntasks-per-node 2 --time=00:10:00 singularity exec ubun
* Chosen CUDA version depends on installed driver of host
* Open MPI needs PMI for Slurm integration
* Open MPI needs CUDA for GPU copy-support
-* Open MPI needs `ibverbs` library for Infiniband
+* Open MPI needs `ibverbs` library for InfiniBand
* `openmpi-mca-params.conf` required to avoid warnings on fork (OK on ZIH systems)
* Environment variables `SLURM_VERSION` and `OPENMPI_VERSION` can be set to choose different
version when building the container
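For the last bullet in this hunk, a hypothetical example; the version numbers are placeholders, not taken from the documentation:

```
# Placeholders: choose versions that match the host installation before building the container.
marie@login$ export SLURM_VERSION=20.11.9
marie@login$ export OPENMPI_VERSION=4.0.5
```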
@@ -175,7 +175,7 @@ iDataPlex
ifort
ImageNet
img
-Infiniband
+InfiniBand
InfluxDB
init
inode