## News
* **2023-11-06** [Substantial update on "How-To: Migration to Barnard"](jobs_and_resources/migration_to_barnard.md)
* **2023-10-16** [Open MPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#performance-loss-with-mpi-io-module-ompio)
* **2023-10-04** [User tests on Barnard](jobs_and_resources/barnard_test.md)
* **2023-06-01** [New hardware and complete re-design](jobs_and_resources/architecture_2023.md)
* **2023-01-04** [New hardware: NVIDIA Arm HPC Developer Kit](jobs_and_resources/arm_hpc_devkit.md)
# Architectural Re-Design 2023
With the replacement of the Taurus system by the cluster [Barnard](hardware_overview_2023.md#barnard-intel-sapphire-rapids-cpus)
in 2023, the rest of the installed hardware had to be re-connected, both with
InfiniBand and with Ethernet.
Over the last decade we have been running our highly heterogeneous HPC system with a single
Slurm batch system. This made things very complicated, especially for inexperienced users.
To lower this hurdle, we **now create homogeneous clusters with their own Slurm instances and with
cluster-specific login nodes** running on the same CPU. Job submission will be possible only from
within the cluster (compute or login node).
All clusters will be integrated into the new InfiniBand fabric and will then have the same access to
the shared filesystems. This recabling will require a brief downtime of a few days.
![Architecture overview 2023](../jobs_and_resources/misc/architecture_2023.png)
{: align=center}
## Migration Phase
For about one month, the new cluster Barnard and the old cluster Taurus
will run side-by-side - both with their respective filesystems. We provide a comprehensive
[description of the migration to Barnard](migration_to_barnard.md).
The following figure provides a graphical overview of the overall process (red: user action
required):
![Migration timeline 2023](../jobs_and_resources/misc/migration_2023.png)
{: align=center}
# Migration 2023
## Brief Overview of Coming Changes
All components of Taurus will be dismantled step by step.
### New Hardware
The new HPC system [Barnard](hardware_overview_2023.md#barnard-intel-sapphire-rapids-cpus) from Bull
comes with these main properties:
* 630 compute nodes based on Intel Sapphire Rapids
* new Lustre-based storage systems
* HDR InfiniBand network large enough to integrate existing and near-future non-Bull hardware
* To help our users find the best location for their data, we now use the names of
animals (size, speed) as mnemonics.
### New Architecture
Over the last decade we have been running our highly heterogeneous HPC system with a single
Slurm batch system. This made things very complicated, especially for inexperienced users.
To lower this hurdle, we **now create homogeneous clusters with their own Slurm instances and with
cluster-specific login nodes** running on the same CPU. Job submission is possible only
from within the cluster (compute or login node).
All clusters will be integrated into the new InfiniBand fabric and will then have the same access to
the shared filesystems. This recabling requires a brief downtime of a few days.
Please refer to the overview page [Architectural Re-Design 2023](architecture_2023.md)
for details on the new architecture.
### New Software
The new nodes run Linux RHEL 8.7. For a seamless integration of other compute hardware, all systems
will be updated to the same versions of operating system, Mellanox and Lustre drivers. Consequently,
all application software was re-built using Git and CI/CD pipelines to handle the multitude of
versions.
We start with `release/23.10`, which is based on software requests from our HPC users. Most major
software versions exist on all hardware platforms.
## Migration Path
Please make sure you have read the details on the [Architectural Re-Design 2023](architecture_2023.md)
before reading further.
!!! note
The migration can only be successful as a joint effort of the HPC team and users.
Here is a description of the action items.
|When?|TODO ZIH |TODO users |Remark |
|---|---|---|---|
| done (May 2023) |first sync `/scratch` to `/data/horse/old_scratch2`| |copied 4 PB in about 3 weeks|
| done (June 2023) |enable access to Barnard| |initialized LDAP tree with Taurus users|
| done (July 2023) | |install new software stack|tedious work |
| ASAP | |adapt scripts|new Slurm version, new resources, no partitions|
| August 2023 | |test new software stack on Barnard|new versions sometimes require different prerequisites|
| August 2023| |test new software stack on other clusters|a few nodes will be made available with the new software stack, but with the old filesystems|
| ASAP | |prepare data migration|The small filesystems `/beegfs` and `/lustre/ssd`, and `/home` are mounted on the old systems "until the end". They will *not* be migrated to the new system.|
| July 2023 | sync `/warm_archive` to new hardware| |using datamover nodes with Slurm jobs |
| September 2023 |prepare re-cabling of older hardware (Bull)| |integrate other clusters in the IB infrastructure |
| Autumn 2023 |finalize integration of other clusters (Bull)| |**~2 days downtime**, final rsync and migration of `/projects`, `/warm_archive`|
| Autumn 2023 ||transfer last data from old filesystems | `/beegfs`, `/lustre/scratch`, `/lustre/ssd` are no longer available on the new systems|
### Data Migration
Why do users need to copy their data? Why only some? How to do it best?
* The sync of hundreds of terabytes (`/scratch`, `/warm_archive`, `/projects`) can only be done in a
planned and careful manner. The HPC team will use multiple syncs so as not to miss the last bytes.
During the downtime, `/projects` will be migrated.
* User homes (`/home`) are relatively small and can be copied by the scientists themselves.
Keep in mind that deleting or archiving data might be the better choice.
* For this, datamover nodes are available to run transfer jobs under Slurm (see the sketch below).
Please refer to the
section [Transfer Data to New Home Directory](../barnard_test#transfer-data-to-new-home-directory)
for more detailed instructions.
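A minimal sketch of such a transfer job, using the generic login `marie` and a hypothetical
source directory `results` (please adapt both to your needs):
```console
marie@barnard$ dtcp --recursive /data/old/home/marie/results /home/marie/
```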
### A Graphical Overview
The following figure provides a graphical overview of the overall process (red: user action
required):
![Migration timeline 2023](../jobs_and_resources/misc/migration_2023.png)
{: align=center}
# How-To: Migration to Barnard
All HPC users are cordially invited to migrate to our new HPC system **Barnard** and to prepare
their software and workflows for production there.
!!! note "Migration Phase"
Please make sure you have read the details on the overall
[Architectural Re-Design 2023](architecture_2023.md#migration-phase) before reading further.
The migration from Taurus to Barnard comprises the following steps:
* [Prepare login to Barnard](#login-to-barnard)
* [Data management and data transfer to new filesystems](#data-management-and-data-transfer)
* [Update job scripts and workflow to new software](#software)
* [Update job scripts and workflow w.r.t. Slurm](#slurm)
!!! note
We highly recommend that you first read the entire page carefully and then execute the steps.
The migration can only be successful as a joint effort of the HPC team and users.
We value your feedback. Please provide it directly via our ticket system. For better processing,
please add "Barnard:" as a prefix to the subject of the [support ticket](../support/support.md).
## Login to Barnard
!!! hint
All users and projects from Taurus can now work on Barnard.
Use `login[1-4].barnard.hpc.tu-dresden.de` to access the system
from campus (or VPN). In order to verify the SSH fingerprints of the login nodes, please refer to
the page [Fingerprints](/access/key_fingerprints/#barnard).
All users have a **new, empty HOME** filesystem. This means you first have to ...
??? "... install your public SSH key on Barnard"
- Please create a new SSH keypair of type ed25519, secured with
a passphrase. Please refer to this
[page for instructions](../../access/ssh_login#before-your-first-connection).
- After login, add the public key to your `.ssh/authorized_keys` file on Barnard.
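A minimal sketch of these two steps, assuming that password-based login works for your very first
connection and using the hypothetical key file name `id_ed25519_barnard`:
```console
marie@local$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_barnard
marie@local$ ssh-copy-id -i ~/.ssh/id_ed25519_barnard.pub marie@login1.barnard.hpc.tu-dresden.de
```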
## Data Management and Data Transfer
!!! warning
All data in the `/home` directory or in workspaces on the BeeGFS or Lustre SSD
filesystems will be deleted by the end of 2023, since these filesystems will be decommissioned.
Existing Taurus users who would like to keep some of their data need to copy it to the new system
manually, using the [steps described below](#data-management-and-data-transfer).
### Filesystems on Barnard
Our new HPC system Barnard also comes with **two new Lustre filesystems**, namely `/data/horse` and
`/data/walrus`. Both have a capacity of 20 PB, but differ in performance and intended usage, see
below. In order to support the data life cycle management, the well-known
[workspace concept](#workspaces-on-barnard) is applied.
* The `/projects` filesystem is the same on Taurus and Barnard
(mounted read-only on the compute nodes).
* The new work filesystem is `/data/horse`.
* The slower `/data/walrus` can be considered a substitute for the old
`/warm_archive`. It is mounted **read-only** on the compute nodes
and can be used to store e.g. results.
!!! Warning
All old filesystems, i.e., `ssd`, `beegfs`, and `scratch`, will be shut down by the end of 2023.
To work with your data from Taurus, you might have to move/copy it to the new storage systems.
Please carefully read the following documentation and instructions.
### Workspaces on Barnard
The filesystems `/data/horse` and `/data/walrus` can only be accessed via workspaces. Please refer
to the [workspace page](../../data_lifecycle/workspaces/), if you are not familiar with the
workspace concept and the corresponding commands. The following table provides the settings for
workspaces on these two filesystems.
| Filesystem (use with parameter `--filesystem=<filesystem>`) | Max. Duration in Days | Extensions | Keeptime in Days |
|:-------------------------------------|---------------:|-----------:|--------:|
| `/data/horse` (default) | 100 | 10 | 30 |
| `/data/walrus` | 365 | 2 | 60 |
{: summary="Settings for Workspace Filesystem `/data/horse` and `/data/walrus`."}
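For example, allocating a workspace on the non-default `/data/walrus` filesystem might look as
follows (generic workspace name `numbercrunch`, maximum duration taken from the table above):
```console
marie@barnard$ ws_allocate --filesystem=walrus numbercrunch 365
```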
### Data Migration to New Filesystems
Since all old filesystems of Taurus will be shut down by the end of 2023, your data needs to be
migrated to the new filesystems on Barnard. This migration comprises
* your personal `/home` directory,
* your workspaces on `/ssd`, `/beegfs` and `/scratch`.
!!! note "It's your turn"
**You are responsible for the migration of your data**. With the shutdown of the old
filesystems, all data will be deleted.
!!! note "Make a plan"
We highly recommend to **take some minutes to plan the transfer process**. Do not act
hastily.
Please **do not copy all your data** from the old to the new filesystems; instead, consider this
opportunity for **cleaning up your data**. E.g., it might make sense to delete outdated scripts,
old log files, etc., and to move other files, e.g., results, to the `/data/walrus` filesystem.
!!! hint "Generic login"
In the following we will use the generic login `marie` and workspace `numbercrunch`
([cf. content rules on generic names](../contrib/content_rules.md#data-privacy-and-generic-names)).
**Please make sure to replace it with your personal login.**
We have four new [datamover nodes](/data_transfer/datamover) that have all storages
of the old Taurus and the new Barnard system mounted. Do not use the datamovers from Taurus, i.e.,
all data transfers need to be invoked from Barnard! Thus, the very first step is to
[login to Barnard](#login-to-barnard).
The command `dtinfo` will provide you with the mountpoints of the old filesystems:
```console
marie@barnard$ dtinfo
[...]
directory on datamover mounting clusters directory on cluster
/data/old/home Taurus /home
/data/old/lustre/scratch2 Taurus /scratch
/data/old/lustre/ssd Taurus /lustre/ssd
[...]
```
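Before invoking any transfer, you can inspect your data at these old mountpoints with `dtls`, e.g.
for a workspace `numbercrunch` of the generic user `marie` on the old `ssd` filesystem:
```console
marie@barnard$ dtls /data/old/lustre/ssd/ws/marie-numbercrunch
```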
In the following, we provide instructions and comprehensive examples for transferring your data to
the new `/home` filesystem, as well as to the working filesystems `/data/horse` and
`/data/walrus`.
??? "Migration of Your Home Directory"
Your personal (old) home directory at Taurus will not be automatically transferred to the new
Barnard system. Please do not copy your entire home, but clean up your data. E.g., it might
make sense to delete outdated scripts, old log files, etc., and move other files to an archive
filesystem. Thus, please transfer only selected directories and files that you need on the new
system.
The steps are as follows:
1. Login to Barnard, i.e.,
```
ssh login[1-4].barnard.hpc.tu-dresden.de
```
1. The command `dtinfo` will provide you with the mountpoint
```console
marie@barnard$ dtinfo
[...]
directory on datamover mounting clusters directory on cluster
/data/old/home Taurus /home
[...]
```
1. Use the `dtls` command to list the files in your old home directory
```
marie@barnard$ dtls /data/old/home/marie
[...]
```
1. Use the `dtcp` command to invoke a transfer job, e.g.,
```console
marie@barnard$ dtcp --recursive /data/old/home/marie/<useful data> /home/marie/
```
**Note:** Please adapt the source and target paths to your needs. All available options of
`dtcp` can be queried via `dtcp --help`.
!!! warning
Please be aware that there is **no synchronisation process** between your home directories
at Taurus and Barnard. Thus, after the very first transfer, they will become divergent.
Please follow these instructions for transferring your data from `ssd`, `beegfs` and `scratch` to
the new filesystems. The instructions and examples are divided by the target, not the source,
filesystem.
This migration task requires a preliminary step: You need to allocate workspaces on the
target filesystems.
??? Note "Preliminary Step: Allocate a workspace"
Both `/data/horse/` and `/data/walrus` can only be used with
[workspaces](../data_lifecycle/workspaces.md). Before you invoke any data transfer from the old
working filesystems to the new ones, you need to allocate a workspace first.
The command `ws_list --list` lists the available filesystems and the default filesystem for workspaces.
```
marie@barnard$ ws_list --list
available filesystems:
horse (default)
walrus
```
As you can see, `/data/horse` is the default workspace filesystem at Barnard. I.e., if you
want to allocate, extend or release a workspace on `/data/walrus`, you need to pass the
option `--filesystem=walrus` explicitly to the corresponding workspace commands. Please
refer to our [workspace documentation](../data_lifecycle/workspaces.md), if you need to refresh
your knowledge.
The simplest command to allocate a workspace is as follows:
```
marie@barnard$ ws_allocate numbercrunch 90
```
Please refer to the table holding the settings
(cf. [subsection Workspaces on Barnard](#workspaces-on-barnard)) for the max. duration and to
`ws_allocate --help` for all available options.
??? "Migration to work filesystem `/data/horse`"
=== "Source: old `/scratch`"
If you transfer data from the old `/scratch` to `/data/horse`, it is sufficient to use
`dtmv` instead of `dtcp` since this data has already been copied to a special directory on
the new `horse` filesystem. Thus, you just need to move it to the right place (the Lustre
metadata system will update the corresponding entries).
```console
marie@barnard$ dtmv /data/horse/lustre/scratch2/0/marie-numbercrunch /data/horse/ws/marie-numbercrunch
```
=== "Source: old `/ssd`"
The old `ssd` filesystem is mounted at `/data/old/lustre/ssd` on the datamover nodes and the
workspaces are within the subdirectory `ws/`. A corresponding data transfer using `dtcp`
looks like:
```console
marie@barnard$ dtcp --recursive /data/old/lustre/ssd/ws/marie-numbercrunch /data/horse/ws/marie-numbercrunch
```
=== "Source: old `/beegfs`"
The old `beegfs` filesystem is mounted at `/data/old/beegfs` on the datamover nodes and the
workspaces are within the subdirectories `ws/0` and `ws/1`, respectively. A corresponding
data transfer using `dtcp` looks like
```console
marie@barnard$ dtcp --recursive /data/old/beegfs/ws/0/marie-numbercrunch /data/horse/ws/marie-numbercrunch
```
??? "Migration to `/data/walrus`"
=== "Source: old `/scratch`"
The old `scratch` filesystem is mounted at `/data/old/lustre/scratch2` on the datamover
nodes and the workspaces are within the subdirectories `ws/0` and `ws/1`, respectively. A
corresponding data transfer using `dtcp` looks like:
```console
marie@barnard$ dtcp --recursive /data/old/lustre/scratch2/ws/0/marie-numbercrunch /data/walrus/ws/marie-numbercrunch
```
=== "Source: old `/ssd`"
The old `ssd` filesystem is mounted at `/data/old/lustre/ssd` on the datamover nodes and the
workspaces are within the subdirectory `ws/`. A corresponding data transfer using `dtcp`
looks like:
```console
marie@barnard$ dtcp --recursive /data/old/lustre/ssd/ws/marie-numbercrunch /data/walrus/ws/marie-numbercrunch
```
=== "Source: old `/beegfs`"
The old `beegfs` filesystem is mounted at `/data/old/beegfs` on the datamover nodes and the
workspaces are within the subdirectories `ws/0` and `ws/1`, respectively. A corresponding
data transfer using `dtcp` looks like:
```console
marie@barnard$ dtcp --recursive /data/old/beegfs/ws/0/marie-numbercrunch /data/walrus/ws/marie-numbercrunch
```
??? "Migration from `/lustre/ssd` or `/beegfs`"
**You** are entirely responsible for the transfer of this data to the new location.
Start the `dtrsync` process as soon as possible (and maybe repeat it at a later time).
??? "Migration from `/lustre/scratch2` aka `/scratch`"
We are synchronizing this (**last: October 18**) to `/data/horse/lustre/scratch2/`.
Please do **NOT** copy these data yourself. Instead, check if they have already been synchronized
to `/data/horse/lustre/scratch2/ws`.
In case you need to update data (Gigabytes, not Terabytes!), please run `dtrsync` as in
`dtrsync -a /data/old/lustre/scratch2/ws/0/my-workspace/newest/ /data/horse/lustre/scratch2/ws/0/my-workspace/newest/`
??? "Migration from `/warm_archive`"
The process of syncing data from `/warm_archive` to `/data/walrus/warm_archive` is still ongoing.
Please do **NOT** copy these data yourself. Instead, check if they have already been synchronized
to `/data/walrus/warm_archive/ws`.
In case you need to update data (Gigabytes, not Terabytes!), please run `dtrsync` as in
`dtrsync -a /data/old/warm_archive/ws/my-workspace/newest/ /data/walrus/warm_archive/ws/my-workspace/newest/`
When the last compute system has been migrated, the old filesystems will be set to write-protected
and we will start a final synchronization (scratch + walrus).
The target directories for synchronization, `/data/horse/lustre/scratch2/ws` and
`/data/walrus/warm_archive/ws/`, will not be deleted automatically in the meantime.
## Software
Please use `module spider` to identify the software modules you need to load.
The default release version is 23.10.
The new nodes run Linux RHEL 8.7. For a seamless integration of other compute hardware, all systems
will be updated to the same versions of operating system, Mellanox and Lustre drivers. Consequently,
all application software was re-built using Git and CI/CD pipelines to handle the multitude of
versions.
We start with `release/23.10`, which is based on software requests from our HPC users. Most major
software versions exist on all hardware platforms.
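A minimal sketch of how to search for a package with `module spider` (the package name `GROMACS`
and the placeholder `<version>` are only examples, please adapt them to the software you need):
```console
marie@barnard$ module spider GROMACS            # list all available versions across releases
marie@barnard$ module spider GROMACS/<version>  # show how to load a specific version
```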
## Slurm
* We are running the most recent Slurm version.
* You must not use the old partition names (see the job script sketch below).
* Not everything has been tested yet.
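A minimal sketch of a job script without any partition option (the project name `p_number_crunch`,
the resource values, and the application are placeholders, please adapt them to your needs):
```bash
#!/bin/bash
#SBATCH --job-name=numbercrunch
#SBATCH --account=p_number_crunch   # placeholder project name
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
# Note: no --partition option; each cluster now runs its own Slurm instance.

srun ./my_application
```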
## Updates after your feedback (state: October 19)
* A **second synchronization** from `/scratch` started on **October 18** and is
now nearly done.
* A first, incomplete synchronization from `/warm_archive` has been done (see above).
With support from NEC, we are transferring the rest over the next weeks.
* The **data transfer tools** now work fine.
* After fixing overly tight security restrictions, **all users can log in** now.
* **ANSYS** now starts: please check if your specific use case works.
* **login1** is under construction, do not use it at the moment. Workspace creation does
not work there.
nav:
- Overview: jobs_and_resources/hardware_overview.md
- New Systems 2023:
- Architectural Re-Design 2023: jobs_and_resources/architecture_2023.md
- HPC Resources Overview 2023: jobs_and_resources/hardware_overview_2023.md
- "How-To: Migration to Barnard": jobs_and_resources/migration_to_barnard.md
- AMD Rome Nodes: jobs_and_resources/rome_nodes.md
- NVMe Storage: jobs_and_resources/nvme_storage.md
- Alpha Centauri: jobs_and_resources/alpha_centauri.md