diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md
index f76dddd4ff8cb73162890b39f9f43e4301208bcb..da67d905d731c8b11a731174875e2dc65f0b3ad9 100644
--- a/doc.zih.tu-dresden.de/docs/index.md
+++ b/doc.zih.tu-dresden.de/docs/index.md
@@ -31,8 +31,8 @@ Please also find out the other ways you could contribute in our
 ## News

+* **2023-11-06** [Substantial update on "How-To: Migration to Barnard"](jobs_and_resources/migration_to_barnard.md)
 * **2023-10-16** [Open MPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#performance-loss-with-mpi-io-module-ompio)
-* **2023-10-04** [User tests on Barnard](jobs_and_resources/barnard_test.md)
 * **2023-06-01** [New hardware and complete re-design](jobs_and_resources/architecture_2023.md)
 * **2023-01-04** [New hardware: NVIDIA Arm HPC Developer Kit](jobs_and_resources/arm_hpc_devkit.md)

diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md
index f7c25a6d1b8a5fce09d6cc0b657f42d20b92b4cd..23721a0a9808c56a3681c0114cc14e90f9c66ef2 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md
@@ -1,8 +1,15 @@
 # Architectural Re-Design 2023

-With the replacement of the Taurus system by the cluster `Barnard` in 2023,
-the rest of the installed hardware had to be re-connected, both with
-InfiniBand and with Ethernet.
+Over the last decade we have been running our highly heterogeneous HPC system with a single
+Slurm batch system. This made things very complicated, especially for inexperienced users.
+With the replacement of the Taurus system by the cluster
+[Barnard](hardware_overview_2023.md#barnard-intel-sapphire-rapids-cpus)
+we **now create homogeneous clusters with their own Slurm instances and with cluster-specific login
+nodes** running on the same CPU. Job submission will be possible only from within the respective
+cluster (compute or login node).
+
+All clusters will be integrated into the new InfiniBand fabric and will then have the same access
+to the shared filesystems. This recabling will require a brief downtime of a few days.

 
 {: align=center}
@@ -21,26 +28,26 @@ computations, please use interactive jobs.

 ## Storage Systems

+For an easier grasp of the major categories (size, speed), the
+work filesystems now carry the names of animals.
+
 ### Permanent Filesystems

-We now have `/home`, `/projects` and `/software` in a Lustre filesystem. Snapshots
-and tape backup are configured. For convenience, we will make the old home available
-read-only as `/home_old` on the data mover nodes for the data migration period.
+We now have `/home` and `/software` in a Lustre filesystem. Snapshots
+and tape backup are configured. (`/projects` remains the same until the recabling.)

-`/warm_archive` is mounted on the data movers, only.
+The Lustre filesystem `/data/walrus` is meant for larger data with slower
+access. It is installed to replace `/warm_archive`.

 ### Work Filesystems

-With new players with new software in the filesystem market it is getting more and more
-complicated to identify the best suited filesystem for temporary data. In many cases,
-only tests can provide the right answer, for a short time.
-
-For an easier grasp on the major categories (size, speed), the work filesystems now come
-with the names of animals:
+With new players entering the filesystem market, it is getting more and more
+complicated to identify the best-suited filesystem for a specific use case. Often,
+only tests can find the best setup for a specific workload.

 * `/data/horse` - 20 PB - high bandwidth (Lustre)
-* `/data/octopus` - 0.5 PB - for interactive usage (Lustre)
-* `/data/weasel` - 1 PB - for high IOPS (WEKA) - coming soon
+* `/data/octopus` - 0.5 PB - for interactive usage (Lustre) - to be mounted on Alpha Centauri
+* `/data/weasel` - 1 PB - for high IOPS (WEKA) - coming 2024

 ### Difference Between "Work" And "Permanent"

@@ -48,11 +55,22 @@ A large number of changing files is a challenge for any backup system. To protec
 our snapshots and backup from work data, `/projects` cannot be used for temporary data on the
 compute nodes - it is mounted read-only.

+For `/home`, we create snapshots and tape backups. That's why working there
+with a high frequency of changing files is a bad idea.
+
 Please use our data mover mechanisms to transfer worthy data to permanent
-storages.
+storage or long-term archives.

 ## Migration Phase

 For about one month, the new cluster Barnard, and the old cluster Taurus
-will run side-by-side - both with their respective filesystems. You can find a comprehensive
-[description of the migration phase here](migration_2023.md).
+will run side-by-side - both with their respective filesystems. We provide a comprehensive
+[description of the migration to Barnard](migration_to_barnard.md).
+
+<!--
+The following figure provides a graphical overview of the overall process (red: user action
+required):
+
+
+{: align=center}
+-->
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/barnard_test.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/barnard_test.md
deleted file mode 100644
index 5603936791a9fa117c539dde95b59693dd119052..0000000000000000000000000000000000000000
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/barnard_test.md
+++ /dev/null
@@ -1,172 +0,0 @@
-# Barnard Migration How-To
-
-All HPC users are invited to test our new HPC system Barnard and prepare your software
-and workflows for production there.
-
-**Furthermore, all data in the `/home` directory or in workspaces on the BeeGFS or Lustre SSD file systems
-will be deleted by the end of 2023!**
-
-Existing Taurus users that would like to keep some of their need to copy some of their data on these
-file systems need
-to copy them the new system manually, using the steps described below.
-
-For general hints regarding the migration please refer to these sites:
-
-* [Details on architecture](/jobs_and_resources/architecture_2023),
-* [Description of the migration](migration_2023.md).
-
-We value your feedback. Please provide it directly via our ticket system. For better processing,
-please add "Barnard:" as a prefix to the subject of the [support ticket](../support/support).
-
-Here, you can find few hints which might help you with the first steps.
-
-## Login to Barnard
-
-All users and projects from Taurus now can work on Barnard.
-
-They can use `login[2-4].barnard.hpc.tu-dresden.de` to access the system
-from campus (or VPN). [Fingerprints](/access/key_fingerprints/#barnard)
-
-All users have **new empty HOME** file systems, this means you have first to...
-
-??? "... install your public ssh key on the system"
-
-    - Please create a new SSH keypair with ed25519 encryption, secured with
-    a passphrase.
Please refer to this - [page for instructions](../../access/ssh_login#before-your-first-connection). - - After login, add the public key to your `.ssh/authorized_keys` file - on Barnard. - -## Data Management - -* The `/project` filesystem is the same on Taurus and Barnard -(mounted read-only on the compute nodes). -* The new work filesystem is `/data/horse`. -* The slower `/data/walrus` can be considered as a substitute for the old - `/warm_archive`- mounted **read-only** on the compute nodes. - It can be used to store e.g. results. - -These `/data/horse` and `/data/walrus` can be accesed via workspaces. Please refer to the -[workspace page](../../data_lifecycle/workspaces/), if you are not familiar with workspaces. - -??? "Tips on workspaces" - * To list all available workspace filessystem, invoke the command `ws_list -l`." - * Please use the command `dtinfo` to get the current mount points: - ``` - marie@login1> dtinfo - [...] - directory on datamover mounting clusters directory on cluster - - /data/old/home Taurus /home - /data/old/lustre/scratch2 Taurus /scratch - /data/old/lustre/ssd Taurus /lustre/ssd - [...] - ``` - -!!! Warning - - All old filesystems fill be shutdown by the end of 2023. - - To work with your data from Taurus you might have to move/copy them to the new storages. - -For this, we have four new [datamover nodes](/data_transfer/datamover) that have mounted all storages -of the old and new system. (Do not use the datamovers from Taurus!) - -??? "Migration from Home Directory" - - Your personal (old) home directory at Taurus will not be automatically transferred to the new Barnard - system. **You are responsible for this task.** Please do not copy your entire home, but consider - this opportunity for cleaning up you data. E.g., it might make sense to delete outdated scripts, old - log files, etc., and move other files to an archive filesystem. Thus, please transfer only selected - directories and files that you need on the new system. - - The well-known [datamover tools](../../data_transfer/datamover/) are available to run such transfer - jobs under Slurm. The steps are as follows: - - 1. Login to Barnard: `ssh login[1-4].barnard.tu-dresden.de` - 1. The command `dtinfo` will provide you the mountpoints - - ```console - marie@barnard$ dtinfo - [...] - directory on datamover mounting clusters directory on cluster - - /data/old/home Taurus /home - /data/old/lustre/scratch2 Taurus /scratch - /data/old/lustre/ssd Taurus /lustre/ssd - [...] - ``` - - 1. Use the `dtls` command to list your files on the old home directory: `marie@barnard$ dtls - /data/old/home/marie` - 1. Use `dtcp` command to invoke a transfer job, e.g., - - ```console - marie@barnard$ dtcp --recursive /data/old/home/marie/<useful data> /home/marie/ - ``` - - **Note**, please adopt the source and target paths to your needs. All available options can be - queried via `dtinfo --help`. - - !!! warning - - Please be aware that there is **no synchronisation process** between your home directories at - Taurus and Barnard. Thus, after the very first transfer, they will become divergent. - - We recommand to **take some minutes for planing the transfer process**. Do not act with - precipitation. - -??? "Migration from `/lustre/ssd` or `/beegfs`" - - **You** are entirely responsible for the transfer of these data to the new location. - Start the dtrsync process as soon as possible. (And maybe repeat it at a later time.) - -??? 
"Migration from `/lustre/scratch2` aka `/scratch`" - - We are synchronizing this (**last: October 18**) to `/data/horse/lustre/scratch2/`. - - Please do **NOT** copy those data yourself. Instead check if it is already sychronized - to `/data/walrus/warm_archive/ws`. - - In case you need to update this (Gigabytes, not Terabytes!) please run `dtrsync` like in - `dtrsync -a /data/old/lustre/scratch2/ws/0/my-workspace/newest/ /data/horse/lustre/scratch2/ws/0/my-workspace/newest/` - -??? "Migration from `/warm_archive`" - - The process of syncing data from `/warm_archive` to `/data/walrus/warm_archive` is still ongoing. - - Please do **NOT** copy those data yourself. Instead check if it is already sychronized - to `/data/walrus/warm_archive/ws`. - - In case you need to update this (Gigabytes, not Terabytes!) please run `dtrsync` like in - `dtrsync -a /data/old/warm_archive/ws/my-workspace/newest/ /data/walrus/warm_archive/ws/my-workspace/newest/` - -When the last compute system will have been migrated the old file systems will be -set write-protected and we start a final synchronization (sratch+walrus). -The target directories for synchronization `/data/horse/lustre/scratch2/ws` and -`/data/walrus/warm_archive/ws/` will not be deleted automatically in the mean time. - -## Software - -Please use `module spider` to identify the software modules you need to load.Like -on Taurus. - - The default release version is 23.10. - -## Slurm - -* We are running the most recent Slurm version. -* You must not use the old partition names. -* Not all things are tested. - -## Updates after your feedback (state: October 19) - -* A **second synchronization** from `/scratch` has started on **October, 18**, and is - now nearly done. -* A first, and incomplete synchronization from `/warm_archive` has been done (see above). - With support from NEC we are transferring the rest in the next weeks. -* The **data transfer tools** now work fine. -* After fixing too tight security restrictions, **all users can login** now. -* **ANSYS** now starts: please check if your specific use case works. -* **login1** is under construction, do not use it at the moment. Workspace creation does - not work there. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md index c888857b47414e2c068cac78f9ca9804efb056b5..68cc848159e612f867ff84e254a153e0159efd29 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md @@ -1,24 +1,20 @@ -# HPC Resources +# HPC Resources Overview 2023 -The architecture specifically tailored to data-intensive computing, Big Data -analytics, and artificial intelligence methods with extensive capabilities -for performance monitoring provides ideal conditions to achieve the ambitious -research goals of the users and the ZIH. - -## Overview - -From the users' perspective, there are separate clusters, all of them with their subdomains: +With the installation and start of operation of the [new HPC system Barnard](#barnard-intel-sapphire-rapids-cpus), +quite significant changes w.r.t. HPC system landscape at ZIH follow. The former HPC system Taurus is +partly switched-off and partly split up into separate clusters. 
+perspective, there will be **five separate clusters**:

 | Name | Description | Year| DNS |
 | --- | --- | --- | --- |
-| **Barnard** | CPU cluster |2023| n[1001-1630].barnard.hpc.tu-dresden.de |
-| **Romeo** | CPU cluster |2020|i[8001-8190].romeo.hpc.tu-dresden.de |
-| **Alpha Centauri** | GPU cluster |2021|i[8001-8037].alpha.hpc.tu-dresden.de |
-| **Julia** | single SMP system |2021|smp8.julia.hpc.tu-dresden.de |
-| **Power** | IBM Power/GPU system |2018|ml[1-29].power9.hpc.tu-dresden.de |
+| **Barnard** | CPU cluster |2023| `n[1001-1630].barnard.hpc.tu-dresden.de` |
+| **Romeo** | CPU cluster |2020| `i[8001-8190].romeo.hpc.tu-dresden.de` |
+| **Alpha Centauri** | GPU cluster |2021| `i[8001-8037].alpha.hpc.tu-dresden.de` |
+| **Julia** | single SMP system |2021| `smp8.julia.hpc.tu-dresden.de` |
+| **Power** | IBM Power/GPU system |2018| `ml[1-29].power9.hpc.tu-dresden.de` |

-They run with their own Slurm batch system. Job submission is possible only from
-their respective login nodes.
+All clusters will run with their own [Slurm batch system](slurm.md), and job submission is possible
+only from their respective login nodes.

 All clusters will have access to these shared parallel filesystems:

@@ -36,7 +32,7 @@ All clusters will have access to these shared parallel filesystems:

 - Hostnames: `n[1001-1630].barnard.hpc.tu-dresden.de`
 - Login nodes: `login[1-4].barnard.hpc.tu-dresden.de`

-## AMD Rome CPUs + NVIDIA A100
+## Alpha Centauri - AMD Rome CPUs + NVIDIA A100

 - 34 nodes, each with
   - 8 x NVIDIA A100-SXM4 Tensor Core-GPUs
@@ -47,17 +43,18 @@ All clusters will have access to these shared parallel filesystems:
 - Login nodes: `login[1-2].alpha.hpc.tu-dresden.de`
 - Further information on the usage is documented on the site [Alpha Centauri Nodes](alpha_centauri.md)

-## Island 7 - AMD Rome CPUs
+## Romeo - AMD Rome CPUs

 - 192 nodes, each with
   - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
   - 512 GB RAM
   - 200 GB local memory on SSD at `/tmp`
-- Hostnames: `taurusi[7001-7192]` -> `i[7001-7190].romeo.hpc.tu-dresden.de`
+- Hostnames: `taurusi[7001-7192]` -> `i[7001-7190].romeo.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
 - Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
 - Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)

-## Large SMP System HPE Superdome Flex
+## Julia - Large SMP System HPE Superdome Flex

 - 1 node, with
   - 32 x Intel Xeon Platinum 8276M CPU @ 2.20 GHz (28 cores)
   - Configured as one single node
   - 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
   - 370 TB of fast NVME storage available at `/nvme/<projectname>`
-- Hostname: `taurussmp8` -> `smp8.julia.hpc.tu-dresden.de`
+- Hostname: `taurussmp8` -> `smp8.julia.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
 - Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)

 ## IBM Power9 Nodes for Machine Learning

 For machine learning, we have IBM AC922 nodes installed with this configuration:

 - 256 GB RAM DDR4 2666 MHz
 - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
 - NVLINK bandwidth 150 GB/s between GPUs and host
-- Hostnames: `taurusml[1-32]` -> `ml[1-29].power9.hpc.tu-dresden.de`
+- Hostnames: `taurusml[1-32]` -> `ml[1-29].power9.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
 - 
Login nodes: `login[1-2].power9.hpc.tu-dresden.de` diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_2023.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_2023.md deleted file mode 100644 index 1d93d1d038796928b1b596c6acd619f3d8d67ba1..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_2023.md +++ /dev/null @@ -1,82 +0,0 @@ -# Migration 2023 - -## Brief Overview over Coming Changes - -All components of Taurus will be dismantled step by step. - -### New Hardware - -The new HPC system "Barnard" from Bull comes with these main properties: - -* 630 compute nodes based on Intel Sapphire Rapids -* new Lustre-based storage systems -* HDR InfiniBand network large enough to integrate existing and near-future non-Bull hardware -* To help our users to find the best location for their data we now use the name of -animals (size, speed) as mnemonics. - -More details can be found in the [overview](/jobs_and_resources/hardware_overview_2023). - -### New Architecture - -Over the last decade we have been running our HPC system of high heterogeneity with a single -Slurm batch system. This made things very complicated, especially to inexperienced users. -To lower this hurdle we now create homogeneous clusters with their own Slurm instances and with -cluster specific login nodes running on the same CPU. Job submission is possible only -from within the cluster (compute or login node). - -All clusters will be integrated to the new InfiniBand fabric and have then the same access to -the shared filesystems. This recabling requires a brief downtime of a few days. - -[Details on architecture](/jobs_and_resources/architecture_2023). - -### New Software - -The new nodes run on Linux RHEL 8.7. For a seamless integration of other compute hardware, -all operating system will be updated to the same versions of OS, Mellanox and Lustre drivers. -With this all application software was re-built consequently using GIT and CI for handling -the multitude of versions. - -We start with `release/23.10` which is based on software requests from user feedbacks of our -HPC users. Most major software versions exist on all hardware platforms. - -## Migration Path - -Please make sure to have read [Details on architecture](/jobs_and_resources/architecture_2023) before -further reading. - -The migration can only be successful as a joint effort of HPC team and users. Here is a description -of the action items. - -|When?|TODO ZIH |TODO users |Remark | -|---|---|---|---| -| done (May 2023) |first sync `/scratch` to `/data/horse/old_scratch2`| |copied 4 PB in about 3 weeks| -| done (June 2023) |enable access to Barnard| |initialized LDAP tree with Taurus users| -| done (July 2023) | |install new software stack|tedious work | -| ASAP | |adapt scripts|new Slurm version, new resources, no partitions| -| August 2023 | |test new software stack on Barnard|new versions sometimes require different prerequisites| -| August 2023| |test new software stack on other clusters|a few nodes will be made available with the new software stack, but with the old filesystems| -| ASAP | |prepare data migration|The small filesystems `/beegfs` and `/lustre/ssd`, and `/home` are mounted on the old systems "until the end". 
They will *not* be migrated to the new system.|
-| July 2023 | sync `/warm_archive` to new hardware| |using datamover nodes with Slurm jobs |
-| September 2023 |prepare re-cabling of older hardware (Bull)| |integrate other clusters in the IB infrastructure |
-| Autumn 2023 |finalize integration of other clusters (Bull)| |**~2 days downtime**, final rsync and migration of `/projects`, `/warm_archive`|
-| Autumn 2023 ||transfer last data from old filesystems | `/beegfs`, `/lustre/scratch`, `/lustre/ssd` are no longer available on the new systems|
-
-### Data Migration
-
-Why do users need to copy their data? Why only some? How to do it best?
-
-* The sync of hundreds of terabytes can only be done planned and carefully.
-(`/scratch`, `/warm_archive`, `/projects`). The HPC team will use multiple syncs
-to not forget the last bytes. During the downtime, `/projects` will be migrated.
-* User homes (`/home`) are relatively small and can be copied by the scientists.
-Keeping in mind that maybe deleting and archiving is a better choice.
-* For this, datamover nodes are available to run transfer jobs under Slurm. Please refer to the
-section [Transfer Data to New Home Directory](../barnard_test#transfer-data-to-new-home-directory)
-for more detailed instructions.
-
-### A Graphical Overview
-
-(red: user action required):
-
-
-{: align=center}
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_to_barnard.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_to_barnard.md
new file mode 100644
index 0000000000000000000000000000000000000000..7fd7ed4333aeee74afec4da6af94c7f4d301dde3
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_to_barnard.md
@@ -0,0 +1,344 @@
+# How-To: Migration to Barnard
+
+All HPC users are cordially invited to migrate to our new HPC system **Barnard** and prepare your
+software and workflows for production there.
+
+!!! note "Migration Phase"
+
+    Please make sure to have read the details on the overall
+    [Architectural Re-Design 2023](architecture_2023.md#migration-phase) before further reading.
+
+The migration from Taurus to Barnard comprises the following steps:
+
+* [Prepare login to Barnard](#login-to-barnard)
+* [Data management and data transfer to new filesystems](#data-management-and-data-transfer)
+* [Update job scripts and workflow to new software](#software)
+* [Update job scripts and workflow w.r.t. Slurm](#slurm)
+
+!!! note
+
+    We highly recommend to first read the entire page carefully and then execute the steps.
+
+The migration can only be successful as a joint effort of the HPC team and the users.
+We value your feedback. Please provide it directly via our ticket system. For better processing,
+please add "Barnard:" as a prefix to the subject of the [support ticket](../support/support.md).
+
+## Login to Barnard
+
+!!! hint
+
+    All users and projects from Taurus can now work on Barnard.
+
+Use `login[1-4].barnard.hpc.tu-dresden.de` to access the system
+from campus (or via VPN). In order to verify the SSH fingerprints of the login nodes, please refer
+to the page [Fingerprints](/access/key_fingerprints/#barnard).
+
+All users have a **new, empty HOME** filesystem. This means you first have to ...
+
+??? "... install your public SSH key on Barnard"
+
+    - Please create a new SSH keypair with ed25519 encryption, secured with
+      a passphrase. Please refer to this
+      [page for instructions](../../access/ssh_login#before-your-first-connection).
+    - After login, add the public key to your `.ssh/authorized_keys` file on Barnard.
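+
+    A minimal sketch of these two steps, assuming an OpenSSH client on your local machine (the
+    key file name `id_ed25519_barnard` and the generic login `marie` are only examples):
+
+    ```console
+    marie@local$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_barnard
+    marie@local$ ssh-copy-id -i ~/.ssh/id_ed25519_barnard.pub marie@login1.barnard.hpc.tu-dresden.de
+    ```
+
+    `ssh-keygen` creates the passphrase-protected keypair, and `ssh-copy-id` appends the public
+    key to `~/.ssh/authorized_keys` on Barnard.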
+
+## Data Management and Data Transfer
+
+### Filesystems on Barnard
+
+Our new HPC system Barnard also comes with **two new Lustre filesystems**, namely `/data/horse` and
+`/data/walrus`. Both have a capacity of 20 PB, but differ in performance and intended usage, see
+below. In order to support the data life cycle management, the well-known
+[workspace concept](#workspaces-on-barnard) is applied.
+
+* The `/project` filesystem is the same on Taurus and Barnard
+(mounted read-only on the compute nodes).
+* The new work filesystem is `/data/horse`.
+* The slower `/data/walrus` can be considered as a substitute for the old
+  `/warm_archive` - mounted **read-only** on the compute nodes.
+  It can be used to store e.g. results.
+
+!!! Warning
+
+    All old filesystems, i.e., `ssd`, `beegfs`, and `scratch`, will be shut down by the end of 2023.
+    To work with your data from Taurus, you might have to move/copy it to the new storage systems.
+
+    Please carefully read the following documentation and instructions.
+
+### Workspaces on Barnard
+
+The filesystems `/data/horse` and `/data/walrus` can only be accessed via workspaces. Please refer
+to the [workspace page](../../data_lifecycle/workspaces/), if you are not familiar with the
+workspace concept and the corresponding commands. The following table provides the settings for
+workspaces on these two filesystems.
+
+| Filesystem (use with parameter `--filesystem=<filesystem>`) | Max. Duration in Days | Extensions | Keeptime in Days |
+|:-------------------------------------|---------------:|-----------:|--------:|
+| `/data/horse` (default) | 100 | 10 | 30 |
+| `/data/walrus` | 365 | 2 | 60 |
+{: summary="Settings for Workspace Filesystem `/data/horse` and `/data/walrus`."}
+
+### Data Migration to New Filesystems
+
+Since all old filesystems of Taurus will be shut down by the end of 2023, your data needs to be
+migrated to the new filesystems on Barnard. This migration comprises
+
+* your personal `/home` directory,
+* your workspaces on `/ssd`, `/beegfs` and `/scratch`.
+
+!!! note "It's your turn"
+
+    **You are responsible for the migration of your data**. With the shutdown of the old
+    filesystems, all data will be deleted.
+
+!!! note "Make a plan"
+
+    We highly recommend to **take some minutes for planning the transfer process**. Do not act
+    hastily.
+
+    Please **do not copy your entire data** from the old to the new filesystems, but consider this
+    opportunity for **cleaning up your data**. E.g., it might make sense to delete outdated scripts,
+    old log files, etc., and move other files, e.g., results, to the `/data/walrus` filesystem.
+
+!!! hint "Generic login"
+
+    In the following we will use the generic login `marie` and workspace `numbercrunch`
+    ([cf. content rules on generic names](../contrib/content_rules.md#data-privacy-and-generic-names)).
+    **Please make sure to replace them with your personal login and workspace name.**
+
+We have four new [datamover nodes](/data_transfer/datamover) that have mounted all storages
+of the old Taurus and the new Barnard system. Do not use the datamovers from Taurus, i.e., all data
+transfers need to be invoked from Barnard! Thus, the very first step is to
+[login to Barnard](#login-to-barnard).
+
+The command `dtinfo` will provide you with the mountpoints of the old filesystems:
+
+```console
+marie@barnard$ dtinfo
+[...]
+directory on datamover      mounting clusters   directory on cluster
+
+/data/old/home              Taurus              /home
+/data/old/lustre/scratch2   Taurus              /scratch
+/data/old/lustre/ssd        Taurus              /lustre/ssd
+[...]
+```
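+
+Before you plan any transfer, it can help to get an overview of what is still lying around on the
+old filesystems. A simple way to do this is listing the old mountpoints with `dtls` (a sketch;
+replace `marie` with your personal login):
+
+```console
+marie@barnard$ dtls /data/old/home/marie
+marie@barnard$ dtls /data/old/lustre/ssd/ws/
+```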
+
+In the following, we will provide instructions with comprehensive examples for the data transfer of
+your data to the new `/home` filesystem, as well as the working filesystems `/data/horse` and
+`/data/walrus`.
+
+??? "Migration of Your Home Directory"
+
+    Your personal (old) home directory at Taurus will not be automatically transferred to the new
+    Barnard system. Please do not copy your entire home, but clean up your data. E.g., it might
+    make sense to delete outdated scripts, old log files, etc., and move other files to an archive
+    filesystem. Thus, please transfer only selected directories and files that you need on the new
+    system.
+
+    The steps are as follows:
+
+    1. Login to Barnard, i.e.,
+
+        ```
+        ssh login[1-4].barnard.hpc.tu-dresden.de
+        ```
+
+    1. The command `dtinfo` will provide you with the mountpoint
+
+        ```console
+        marie@barnard$ dtinfo
+        [...]
+        directory on datamover      mounting clusters   directory on cluster
+
+        /data/old/home              Taurus              /home
+        [...]
+        ```
+
+    1. Use the `dtls` command to list your files on the old home directory
+
+        ```
+        marie@barnard$ dtls /data/old/home/marie
+        [...]
+        ```
+
+    1. Use the `dtcp` command to invoke a transfer job, e.g.,
+
+        ```console
+        marie@barnard$ dtcp --recursive /data/old/home/marie/<useful data> /home/marie/
+        ```
+
+    **Note**: please adapt the source and target paths to your needs. All available options can be
+    queried via `dtinfo --help`.
+
+    !!! warning
+
+        Please be aware that there is **no synchronisation process** between your home directories
+        at Taurus and Barnard. Thus, after the very first transfer, they will become divergent.
+
+Please follow these instructions for transferring your data from `ssd`, `beegfs` and `scratch` to
+the new filesystems. The instructions and examples are divided by the target, not the source
+filesystem.
+
+This migration task requires a preliminary step: You need to allocate workspaces on the
+target filesystems.
+
+??? Note "Preliminary Step: Allocate a workspace"
+
+    Both `/data/horse/` and `/data/walrus` can only be used with
+    [workspaces](../data_lifecycle/workspaces.md). Before you invoke any data transfer from the old
+    working filesystems to the new ones, you need to allocate a workspace first.
+
+    The command `ws_list --list` lists the available filesystems for workspaces and the default one.
+
+    ```
+    marie@barnard$ ws_list --list
+    available filesystems:
+    horse (default)
+    walrus
+    ```
+
+    As you can see, `/data/horse` is the default workspace filesystem at Barnard. I.e., if you
+    want to allocate, extend or release a workspace on `/data/walrus`, you need to pass the
+    option `--filesystem=walrus` explicitly to the corresponding workspace commands. Please
+    refer to our [workspace documentation](../data_lifecycle/workspaces.md), if you need to refresh
+    your knowledge.
+
+    The simplest command to allocate a workspace is as follows:
+
+    ```
+    marie@barnard$ ws_allocate numbercrunch 90
+    ```
+
+    Please refer to the table holding the settings
+    (cf. [subsection workspaces on Barnard](#workspaces-on-barnard)) for the max. duration and
+    `ws_allocate --help` for all available options.
+
+??? "Migration to work filesystem `/data/horse`"
+
+    === "Source: old `/scratch`"
+
+        We are synchronizing the old `/scratch` to `/data/horse/lustre/scratch2/` (**last: October
+        18**).
+        If you transfer data from the old `/scratch` to `/data/horse`, it is sufficient to use
+        `dtmv` instead of `dtcp` since this data has already been copied to a special directory on
+        the new `horse` filesystem. Thus, you just need to move it to the right place (the Lustre
+        metadata system will update the corresponding entries).
+
+        The workspaces are within the subdirectories `ws/0` and `ws/1`, respectively. A
+        corresponding data transfer using `dtmv` looks like
+
+        ```console
+        marie@barnard$ dtmv /data/horse/lustre/scratch2/ws/0/marie-numbercrunch/<useful data> /data/horse/ws/marie-numbercrunch/
+        ```
+
+        Please do **NOT** copy those data yourself. Instead, check if it is already synchronized
+        to `/data/horse/lustre/scratch2/ws/0/marie-numbercrunch`.
+
+        In case you need to update this (Gigabytes, not Terabytes!) please run `dtrsync` like in
+
+        ```
+        marie@barnard$ dtrsync -a /data/old/lustre/scratch2/ws/0/marie-numbercrunch/<useful data> /data/horse/ws/marie-numbercrunch/
+        ```
+
+    === "Source: old `/ssd`"
+
+        The old `ssd` filesystem is mounted at `/data/old/lustre/ssd` on the datamover nodes and the
+        workspaces are within the subdirectory `ws/`. A corresponding data transfer using `dtcp`
+        looks like
+
+        ```console
+        marie@barnard$ dtcp --recursive /data/old/lustre/ssd/ws/marie-numbercrunch/<useful data> /data/horse/ws/marie-numbercrunch/
+        ```
+
+    === "Source: old `/beegfs`"
+
+        The old `beegfs` filesystem is mounted at `/data/old/beegfs` on the datamover nodes and the
+        workspaces are within the subdirectories `ws/0` and `ws/1`, respectively. A corresponding
+        data transfer using `dtcp` looks like
+
+        ```console
+        marie@barnard$ dtcp --recursive /data/old/beegfs/ws/0/marie-numbercrunch/<useful data> /data/horse/ws/marie-numbercrunch/
+        ```
+
+??? "Migration to `/data/walrus`"
+
+    === "Source: old `/scratch`"
+
+        We are synchronizing the old `/scratch` to `/data/horse/lustre/scratch2/` (**last: October
+        18**), i.e., the old `scratch` data is already available under `/data/horse/lustre/scratch2`.
+        The workspaces are within the subdirectories `ws/0` and `ws/1`, respectively. A
+        corresponding data transfer using `dtcp` looks like
+
+        ```console
+        marie@barnard$ dtcp --recursive /data/horse/lustre/scratch2/ws/0/marie-numbercrunch/<useful data> /data/walrus/ws/marie-numbercrunch/
+        ```
+
+        Please do **NOT** copy those data yourself. Instead, check if it is already synchronized
+        to `/data/horse/lustre/scratch2/ws/0/marie-numbercrunch`.
+
+        In case you need to update this (Gigabytes, not Terabytes!) please run `dtrsync` like in
+
+        ```
+        marie@barnard$ dtrsync -a /data/old/lustre/scratch2/ws/0/marie-numbercrunch/<useful data> /data/walrus/ws/marie-numbercrunch/
+        ```
+
+    === "Source: old `/ssd`"
+
+        The old `ssd` filesystem is mounted at `/data/old/lustre/ssd` on the datamover nodes and the
+        workspaces are within the subdirectory `ws/`. A corresponding data transfer using `dtcp`
+        looks like
+
+        ```console
+        marie@barnard$ dtcp --recursive /data/old/lustre/ssd/ws/marie-numbercrunch/<useful data> /data/walrus/ws/marie-numbercrunch/
+        ```
+
+    === "Source: old `/beegfs`"
+
+        The old `beegfs` filesystem is mounted at `/data/old/beegfs` on the datamover nodes and the
+        workspaces are within the subdirectories `ws/0` and `ws/1`, respectively. A corresponding
+        data transfer using `dtcp` looks like
+
+        ```console
+        marie@barnard$ dtcp --recursive /data/old/beegfs/ws/0/marie-numbercrunch/<useful data> /data/walrus/ws/marie-numbercrunch/
+        ```
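+
+After a transfer job has finished, it is worth checking that the data has actually arrived in the
+target workspace. A simple check is to list the target directory with `dtls` (a sketch;
+`marie-numbercrunch` is the generic workspace from the examples above):
+
+```console
+marie@barnard$ dtls /data/horse/ws/marie-numbercrunch/
+marie@barnard$ dtls /data/walrus/ws/marie-numbercrunch/
+```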
"Migration from `/warm_archive`" + + We are synchronizing the old `/warm_archive` to `/data/walrus/warm_archive/`. Therefor, it can + be sufficient to use `dtmv` instead of `dtcp` (No data will be copied, but the Lustre system + will update the correspoding metadata entries). A corresponding data transfer using `dtmv` looks + like + + ```console + marie@barnard$ dtmv /data/walrus/warm_archive/ws/marie-numbercrunch/<useful data> /data/walrus/ws/marie-numbercrunch/ + ``` + + Please do **NOT** copy those data yourself. Instead check if it is already sychronized + to `/data/walrus/warm_archive/ws`. + + In case you need to update this (Gigabytes, not Terabytes!) please run `dtrsync` like in + + ``` + marie@barnard$ dtrsync -a /data/old/warm_archive/ws/marie-numbercrunch/<useful data> /data/walrus/ws/marie-numbercrunch/ + ``` + +When the last compute system will have been migrated the old file systems will be +set write-protected and we start a final synchronization (sratch+walrus). +The target directories for synchronization `/data/horse/lustre/scratch2/ws` and +`/data/walrus/warm_archive/ws/` will not be deleted automatically in the mean time. + +## Software + +Barnard is running on Linux RHEL 8.7. All application software was re-built consequently using Git +and CI/CD pipelines for handling the multitude of versions. + +We start with `release/23.10` which is based on software requests from user feedbacks of our +HPC users. Most major software versions exist on all hardware platforms. + +Please use `module spider` to identify the software modules you need to load. + +## Slurm + +* We are running the most recent Slurm version. +* You must not use the old partition names. +* Not all things are tested. diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/architecture_2023.png b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/architecture_2023.png index bc1083880f5172240dd78f57dd8b1a7bac39dab5..bf5235a6e75b516cd096877e59787e5e3c5c1c0b 100644 Binary files a/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/architecture_2023.png and b/doc.zih.tu-dresden.de/docs/jobs_and_resources/misc/architecture_2023.png differ diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index de58b22700e88293c005777a3f31460342419b92..e803b237847a6797ab7ee73f075a873172aa654a 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -103,9 +103,8 @@ nav: - Overview: jobs_and_resources/hardware_overview.md - New Systems 2023: - Architectural Re-Design 2023: jobs_and_resources/architecture_2023.md - - Overview 2023: jobs_and_resources/hardware_overview_2023.md - - Migration 2023: jobs_and_resources/migration_2023.md - - "How-To: Migration 2023": jobs_and_resources/barnard_test.md + - HPC Resources Overview 2023: jobs_and_resources/hardware_overview_2023.md + - "How-To: Migration to Barnard": jobs_and_resources/migration_to_barnard.md - AMD Rome Nodes: jobs_and_resources/rome_nodes.md - NVMe Storage: jobs_and_resources/nvme_storage.md - Alpha Centauri: jobs_and_resources/alpha_centauri.md