Commit be2c4a4f authored by Ulf Markwardt

Merge branch 'barnard-cleanup' of gitlab.hrz.tu-chemnitz.de:zih/hpcsupport/hpc-compendium into barnard-cleanup
parents 8421241d 46761f3c
Merge requests: !920 Automated merge from preview to main, !918 Barnard cleanup
@@ -31,8 +31,8 @@ Please also find out the other ways you could contribute in our
## News
* **2023-11-06** [Substantial update on "How-To: Migration to Barnard"](jobs_and_resources/migration_to_barnard.md)
* **2023-10-16** [Open MPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#performance-loss-with-mpi-io-module-ompio)
* **2023-10-04** [User tests on Barnard](jobs_and_resources/barnard_test.md)
* **2023-06-01** [New hardware and complete re-design](jobs_and_resources/architecture_2023.md)
* **2023-01-04** [New hardware: NVIDIA Arm HPC Developer Kit](jobs_and_resources/arm_hpc_devkit.md)
# Architectural Re-Design 2023
With the replacement of the Taurus system by the cluster [Barnard](hardware_overview_2023.md#barnard-intel-sapphire-rapids-cpus)
in 2023, the rest of the installed hardware had to be re-connected, both with
InfiniBand and with Ethernet.
Over the last decade we have been running our highly heterogeneous HPC system with a single
Slurm batch system. This made things very complicated, especially for inexperienced users.
Therefore, we **now create homogeneous clusters with their own Slurm instances and with
cluster-specific login nodes** running on the same CPU. Job submission will be possible only
from within the cluster (compute or login node).
All clusters will be integrated into the new InfiniBand fabric and will then have the same access
to the shared filesystems. This recabling will require a brief downtime of a few days.
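The following sketch illustrates what this means in practice: you log in to a login node of the
cluster you want to use and submit your job from there. The hostname and resource values are
illustrative examples only, not a definitive reference.

```
# Log in to a cluster-specific login node (hostname is an example):
marie@local$ ssh login1.barnard.hpc.tu-dresden.de

# Submit a job from within that cluster; each cluster runs its own Slurm
# instance, so no partition has to be selected:
marie@login1$ sbatch --nodes=1 --ntasks=4 --time=01:00:00 my_jobfile.sh
```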
![Architecture overview 2023](../jobs_and_resources/misc/architecture_2023.png)
{: align=center}
@@ -54,5 +61,11 @@ storages.
## Migration Phase
For about one month, the new cluster Barnard and the old cluster Taurus
will run side-by-side - both with their respective filesystems. We provide a comprehensive
[description of the migration to Barnard](migration_to_barnard.md).
The following figure provides a graphical overview of the overall process (red: user action
required):
![Migration timeline 2023](../jobs_and_resources/misc/migration_2023.png)
{: align=center}
# HPC Resources Overview 2023
With the installation and start of operation of the [new HPC system Barnard](#barnard-intel-sapphire-rapids-cpus),
significant changes to the HPC system landscape at ZIH follow. The former HPC system Taurus is
@@ -49,7 +49,8 @@ All clusters will have access to these shared parallel filesystems:
- 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
- 512 GB RAM
- 200 GB local memory on SSD at `/tmp`
- Hostnames: `taurusi[7001-7192]` -> `i[7001-7190].romeo.hpc.tu-dresden.de` (after
  [recabling phase](architecture_2023.md#migration-phase))
- Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
- Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
@@ -61,7 +62,8 @@ All clusters will have access to these shared parallel filesystems:
- Configured as one single node
- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
- 370 TB of fast NVME storage available at `/nvme/<projectname>`
- Hostname: `taurussmp8` -> `smp8.julia.hpc.tu-dresden.de` (after
  [recabling phase](architecture_2023.md#migration-phase))
- Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)
## IBM Power9 Nodes for Machine Learning
@@ -73,5 +75,6 @@ For machine learning, we have IBM AC922 nodes installed with this configuration:
- 256 GB RAM DDR4 2666 MHz
- 6 x NVIDIA VOLTA V100 with 32 GB HBM2
- NVLINK bandwidth 150 GB/s between GPUs and host
- Hostnames: `taurusml[1-32]` -> `ml[1-29].power9.hpc.tu-dresden.de` (after
  [recabling phase](architecture_2023.md#migration-phase))
- Login nodes: `login[1-2].power9.hpc.tu-dresden.de`
# Migration 2023
## Brief Overview of Coming Changes
All components of Taurus will be dismantled step by step.
### New Hardware
The new HPC system [Barnard](hardware_overview_2023.md#barnard-intel-sapphire-rapids-cpus) from Bull
comes with these main properties:
* 630 compute nodes based on Intel Sapphire Rapids
* new Lustre-based storage systems
* HDR InfiniBand network large enough to integrate existing and near-future non-Bull hardware
* To help our users find the best location for their data, we now use animal names
  (reflecting size and speed) as mnemonics for the new storage systems.
### New Architecture
Over the last decade we have been running our highly heterogeneous HPC system with a single
Slurm batch system. This made things very complicated, especially for inexperienced users.
To lower this hurdle, we **now create homogeneous clusters with their own Slurm instances and with
cluster-specific login nodes** running on the same CPU. Job submission is possible only
from within the cluster (compute or login node).
All clusters will be integrated into the new InfiniBand fabric and will then have the same access
to the shared filesystems. This recabling requires a brief downtime of a few days.
Please refer to the overview page [Architectural Re-Design 2023](architecture_2023.md)
for details on the new architecture.
### New Software
The new nodes run Linux RHEL 8.7. For a seamless integration of other compute hardware,
all systems will be updated to the same versions of the operating system, Mellanox, and Lustre
drivers. As a consequence, all application software has been re-built using Git and CI/CD
pipelines to handle the multitude of versions.
We start with `release/23.10`, which is based on software requests from our HPC users.
Most major software versions exist on all hardware platforms.
## Migration Path
Please make sure to have read the details on the [Architectural Re-Design 2023](architecture_2023.md)
before further reading.
!!! note
The migration can only be successful as a joint effort of HPC team and users.
Here is a description of the action items.
|When?|TODO ZIH |TODO users |Remark |
|---|---|---|---|
| done (May 2023) |first sync `/scratch` to `/data/horse/old_scratch2`| |copied 4 PB in about 3 weeks|
| done (June 2023) |enable access to Barnard| |initialized LDAP tree with Taurus users|
| done (July 2023) | |install new software stack|tedious work |
| ASAP | |adapt scripts (see the example below)|new Slurm version, new resources, no partitions|
| August 2023 | |test new software stack on Barnard|new versions sometimes require different prerequisites|
| August 2023| |test new software stack on other clusters|a few nodes will be made available with the new software stack, but with the old filesystems|
| ASAP | |prepare data migration|The small filesystems `/beegfs` and `/lustre/ssd`, and `/home` are mounted on the old systems "until the end". They will *not* be migrated to the new system.|
| July 2023 | sync `/warm_archive` to new hardware| |using datamover nodes with Slurm jobs |
| September 2023 |prepare re-cabling of older hardware (Bull)| |integrate other clusters in the IB infrastructure |
| Autumn 2023 |finalize integration of other clusters (Bull)| |**~2 days downtime**, final rsync and migration of `/projects`, `/warm_archive`|
| Autumn 2023 ||transfer last data from old filesystems | `/beegfs`, `/lustre/scratch`, `/lustre/ssd` are no longer available on the new systems|
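As a sketch of the "adapt scripts" item above: on the new clusters, partitions are no longer
selected in job scripts, since each cluster runs its own Slurm instance. The resource values below
are placeholders and need to be adapted to your application.

```
#!/bin/bash
# Former Taurus lines such as "#SBATCH --partition=..." can be dropped;
# instead, submit the script on the cluster whose resources you need.
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2000M

srun ./my_application
```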
### Data Migration
Why do users need to copy their data? Why only some? How to do it best?
* The synchronization of hundreds of terabytes (`/scratch`, `/warm_archive`, `/projects`) can only
  be done in a planned and careful manner. The HPC team will use multiple syncs so as not to miss
  the last bytes. During the downtime, `/projects` will be migrated.
* User homes (`/home`) are relatively small and can be copied by the scientists themselves.
  Keep in mind that deleting or archiving data may be the better choice.
* For this, datamover nodes are available to run transfer jobs under Slurm (see the example below).
  Please refer to the section
  [Transfer Data to New Home Directory](../barnard_test#transfer-data-to-new-home-directory)
  for more detailed instructions.
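For illustration, a transfer via the datamover wrapper commands (`dtcp`, `dtrsync`) could look like
the following. All paths and workspace names are placeholders and must be adapted to your setup.

```
# Copy a directory from an old filesystem to a workspace on the new one
# (runs as a transfer job on the datamover nodes; paths are examples):
marie@barnard$ dtcp -r /data/old/scratch/ws/marie-input /data/horse/ws/marie-input

# Alternatively, an incremental sync that can be repeated safely:
marie@barnard$ dtrsync -a /data/old/scratch/ws/marie-input/ /data/horse/ws/marie-input/
```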
### A Graphical Overview
(red: user action required):
![Migration timeline 2023](../jobs_and_resources/misc/migration_2023.png)
{: align=center}
@@ -3,21 +3,25 @@
All HPC users are cordially invited to migrate to our new HPC system **Barnard** and to prepare
their software and workflows for production there.
!!! note "Migration Phase"

    Please make sure to have read the details on the overall
    [Architectural Re-Design 2023](architecture_2023.md#migration-phase) before further reading.

!!! warning

    All data in the `/home` directory or in workspaces on the BeeGFS or Lustre SSD file
    systems will be deleted by the end of 2023, since these filesystems will be decommissioned.
    Existing Taurus users who would like to keep some of their data need to copy them to the new
    system manually, using the [steps described below](#data-management-and-data-transfer).

The migration from Taurus to Barnard comprises the following steps:

* [Prepare login to Barnard](#login-to-barnard)
* [Data management and data transfer to new filesystems](#data-management-and-data-transfer)
* [Update job scripts and workflow to new software](#software)
* [Update job scripts and workflow w.r.t. Slurm](#slurm)

!!! note

    We highly recommend to first read the entire page carefully, and then execute the steps.
    The migration can only be successful as a joint effort of HPC team and users.
We value your feedback. Please provide it directly via our ticket system. For better processing,
please add "Barnard:" as a prefix to the subject of the [support ticket](../support/support).
please add "Barnard:" as a prefix to the subject of the [support ticket](../support/support.md).
## Login to Barnard
@@ -184,7 +188,7 @@ target filesystems.
[workspaces](../data_lifecycle/workspaces.md). Before you invoke any data transfer from the old
working filesystems to the new ones, you need to allocate a workspace first.
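For example, a workspace allocation could look like this; the filesystem name (`horse`), workspace
name, and duration are illustrative examples, please check the available filesystems with
`ws_list --list` as shown below.

```
# Allocate a workspace named "data_transfer" for 30 days on the horse filesystem:
marie@barnard$ ws_allocate -F horse data_transfer 30
```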
The command `ws_list --list` lists the available and the default filesystem for workspaces.
```
marie@barnard$ ws_list --list
@@ -310,6 +314,14 @@ Please use `module spider` to identify the software modules you need to load.
The default release version is 23.10.
The new nodes run Linux RHEL 8.7. For a seamless integration of other compute hardware,
all systems will be updated to the same versions of the operating system, Mellanox, and Lustre
drivers. As a consequence, all application software has been re-built using Git and CI/CD
pipelines to handle the multitude of versions.
We start with `release/23.10`, which is based on software requests from our HPC users.
Most major software versions exist on all hardware platforms.
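A brief sketch of the typical module workflow; the package name and version below are placeholders,
please use `module spider` to find the modules you actually need.

```
# Search for a package across all releases:
marie@barnard$ module spider GROMACS

# Show details and prerequisites for a specific version (version is an example):
marie@barnard$ module spider GROMACS/2023.2

# Load the module after loading the reported prerequisites:
marie@barnard$ module load GROMACS/2023.2
```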
## Slurm
* We are running the most recent Slurm version.
@@ -103,9 +103,8 @@ nav:
- Overview: jobs_and_resources/hardware_overview.md
- New Systems 2023:
- Architectural Re-Design 2023: jobs_and_resources/architecture_2023.md
- HPC Resources Overview 2023: jobs_and_resources/hardware_overview_2023.md
- "How-To: Migration to Barnard": jobs_and_resources/migration_to_barnard.md
- AMD Rome Nodes: jobs_and_resources/rome_nodes.md
- NVMe Storage: jobs_and_resources/nvme_storage.md
- Alpha Centauri: jobs_and_resources/alpha_centauri.md