diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md
index f76dddd4ff8cb73162890b39f9f43e4301208bcb..da67d905d731c8b11a731174875e2dc65f0b3ad9 100644
--- a/doc.zih.tu-dresden.de/docs/index.md
+++ b/doc.zih.tu-dresden.de/docs/index.md
@@ -31,8 +31,8 @@ Please also find out the other ways you could contribute in our
 ## News

+* **2023-11-06** [Substantial update on "How-To: Migration to Barnard"](jobs_and_resources/migration_to_barnard.md)
 * **2023-10-16** [Open MPI 4.1.x - Workaround for MPI-IO Performance Loss](jobs_and_resources/mpi_issues/#performance-loss-with-mpi-io-module-ompio)
-* **2023-10-04** [User tests on Barnard](jobs_and_resources/barnard_test.md)
 * **2023-06-01** [New hardware and complete re-design](jobs_and_resources/architecture_2023.md)
 * **2023-01-04** [New hardware: NVIDIA Arm HPC Developer Kit](jobs_and_resources/arm_hpc_devkit.md)
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md
index edcf705997db07e2358f5eff08145889104df519..8c2c5078ecc1a5bdf8ee3d8596436f635dd33f10 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md
@@ -1,8 +1,15 @@
 # Architectural Re-Design 2023

-With the replacement of the Taurus system by the cluster [Barnard](hardware_overview_2023.md#barnard-intel-sapphire-rapids-cpus)
-in 2023, the rest of the installed hardware had to be re-connected, both with
-InfiniBand and with Ethernet.
+Over the last decade we have been running our HPC system of high heterogeneity with a single
+Slurm batch system. This made things very complicated, especially for inexperienced users.
+With the replacement of the Taurus system by the cluster
+[Barnard](hardware_overview_2023.md#barnard-intel-sapphire-rapids-cpus)
+we **now create homogeneous clusters with their own Slurm instances and with cluster-specific login
+nodes** running on the same CPU. Job submission will be possible only from within the cluster
+(compute or login node).
+
+All clusters will be integrated into the new InfiniBand fabric and will then have the same access
+to the shared filesystems. This recabling will require a brief downtime of a few days.

 {: align=center}
@@ -54,5 +61,11 @@ storages.

 ## Migration Phase

 For about one month, the new cluster Barnard, and the old cluster Taurus
-will run side-by-side - both with their respective filesystems. You can find a comprehensive
-[description of the migration phase here](migration_2023.md).
+will run side-by-side - both with their respective filesystems. We provide a comprehensive
+[description of the migration to Barnard](migration_to_barnard.md).
+
+The following figure provides a graphical overview of the overall process (red: user action
+required):
+
+
+{: align=center}
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md
index cc09c236cd8ae47fb1f24f3011251b011a2e8fdd..68cc848159e612f867ff84e254a153e0159efd29 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md
@@ -1,4 +1,4 @@
-# Overview 2023
+# HPC Resources Overview 2023

 With the installation and start of operation of the [new HPC system Barnard](#barnard-intel-sapphire-rapids-cpus),
 quite significant changes w.r.t. HPC system landscape at ZIH follow.
 The former HPC system Taurus is
@@ -49,7 +49,8 @@ All clusters will have access to these shared parallel filesystems:
 - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
 - 512 GB RAM
 - 200 GB local memory on SSD at `/tmp`
-- Hostnames: `taurusi[7001-7192]` -> `i[7001-7190].romeo.hpc.tu-dresden.de`
+- Hostnames: `taurusi[7001-7192]` -> `i[7001-7190].romeo.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
 - Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
 - Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
@@ -61,7 +62,8 @@ All clusters will have access to these shared parallel filesystems:
 - Configured as one single node
 - 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
 - 370 TB of fast NVME storage available at `/nvme/<projectname>`
-- Hostname: `taurussmp8` -> `smp8.julia.hpc.tu-dresden.de`
+- Hostname: `taurussmp8` -> `smp8.julia.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
 - Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)

 ## IBM Power9 Nodes for Machine Learning
@@ -73,5 +75,6 @@ For machine learning, we have IBM AC922 nodes installed with this configuration:
 - 256 GB RAM DDR4 2666 MHz
 - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
 - NVLINK bandwidth 150 GB/s between GPUs and host
-- Hostnames: `taurusml[1-32]` -> `ml[1-29].power9.hpc.tu-dresden.de`
+- Hostnames: `taurusml[1-32]` -> `ml[1-29].power9.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
 - Login nodes: `login[1-2].power9.hpc.tu-dresden.de`
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_2023.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_2023.md
deleted file mode 100644
index 54637c407476297d30de46a4bf233359a6053b2b..0000000000000000000000000000000000000000
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_2023.md
+++ /dev/null
@@ -1,85 +0,0 @@
-# Migration 2023
-
-## Brief Overview over Coming Changes
-
-All components of Taurus will be dismantled step by step.
-
-### New Hardware
-
-The new HPC system [Barnard](hardware_overview_2023.md#barnard-intel-sapphire-rapids-cpus) from Bull
-comes with these main properties:
-
-* 630 compute nodes based on Intel Sapphire Rapids
-* new Lustre-based storage systems
-* HDR InfiniBand network large enough to integrate existing and near-future non-Bull hardware
-* To help our users to find the best location for their data we now use the name of
-animals (size, speed) as mnemonics.
-
-### New Architecture
-
-Over the last decade we have been running our HPC system of high heterogeneity with a single
-Slurm batch system. This made things very complicated, especially to inexperienced users.
-To lower this hurdle we **now create homogeneous clusters with their own Slurm instances and with
-cluster specific login nodes** running on the same CPU. Job submission is possible only
-from within the cluster (compute or login node).
-
-All clusters will be integrated to the new InfiniBand fabric and have then the same access to
-the shared filesystems. This recabling requires a brief downtime of a few days.
-
-Please refer to the overview page [Architectural Re-Design 2023](architecture_2023.md)
-for details on the new architecture.
-
-### New Software
-
-The new nodes run on Linux RHEL 8.7.
-For a seamless integration of other compute hardware,
-all operating system will be updated to the same versions of operating system, Mellanox and Lustre
-drivers. With this all application software was re-built consequently using Git and CI/CD pipelines
-for handling the multitude of versions.
-
-We start with `release/23.10` which is based on software requests from user feedbacks of our
-HPC users. Most major software versions exist on all hardware platforms.
-
-## Migration Path
-
-Please make sure to have read the details on the [Architectural Re-Design 2023](architecture_2023.md)
-before further reading.
-
-!!! note
-
-    The migration can only be successful as a joint effort of HPC team and users.
-
-Here is a description of the action items.
-
-|When?|TODO ZIH |TODO users |Remark |
-|---|---|---|---|
-| done (May 2023) |first sync `/scratch` to `/data/horse/old_scratch2`| |copied 4 PB in about 3 weeks|
-| done (June 2023) |enable access to Barnard| |initialized LDAP tree with Taurus users|
-| done (July 2023) | |install new software stack|tedious work |
-| ASAP | |adapt scripts|new Slurm version, new resources, no partitions|
-| August 2023 | |test new software stack on Barnard|new versions sometimes require different prerequisites|
-| August 2023| |test new software stack on other clusters|a few nodes will be made available with the new software stack, but with the old filesystems|
-| ASAP | |prepare data migration|The small filesystems `/beegfs` and `/lustre/ssd`, and `/home` are mounted on the old systems "until the end". They will *not* be migrated to the new system.|
-| July 2023 | sync `/warm_archive` to new hardware| |using datamover nodes with Slurm jobs |
-| September 2023 |prepare re-cabling of older hardware (Bull)| |integrate other clusters in the IB infrastructure |
-| Autumn 2023 |finalize integration of other clusters (Bull)| |**~2 days downtime**, final rsync and migration of `/projects`, `/warm_archive`|
-| Autumn 2023 ||transfer last data from old filesystems | `/beegfs`, `/lustre/scratch`, `/lustre/ssd` are no longer available on the new systems|
-
-### Data Migration
-
-Why do users need to copy their data? Why only some? How to do it best?
-
-* The sync of hundreds of terabytes can only be done planned and carefully.
-(`/scratch`, `/warm_archive`, `/projects`). The HPC team will use multiple syncs
-to not forget the last bytes. During the downtime, `/projects` will be migrated.
-* User homes (`/home`) are relatively small and can be copied by the scientists.
-Keeping in mind that maybe deleting and archiving is a better choice.
-* For this, datamover nodes are available to run transfer jobs under Slurm. Please refer to the
-section [Transfer Data to New Home Directory](../barnard_test#transfer-data-to-new-home-directory)
-for more detailed instructions.
-
-### A Graphical Overview
-
-(red: user action required):
-
-
-{: align=center}
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/barnard_test.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_to_barnard.md
similarity index 90%
rename from doc.zih.tu-dresden.de/docs/jobs_and_resources/barnard_test.md
rename to doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_to_barnard.md
index 4aab108f1fb12490bec84c7ce30f489843658a66..152cd79df0fbe6d849c1f5b089bc04fea053bca4 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/barnard_test.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/migration_to_barnard.md
@@ -3,21 +3,25 @@
 All HPC users are cordially invited to migrate to our new HPC system **Barnard**
 and prepare your software and workflows for production there.

-!!! warning
+!!! note "Migration Phase"

-    All data in the `/home` directory or in workspaces on the BeeGFS or Lustre SSD file
-    systems will be deleted by the end of 2023, since these filesystems will be decommissioned.
+    Please make sure you have read the details on the overall
+    [Architectural Re-Design 2023](architecture_2023.md#migration-phase) before further reading.

-Existing Taurus users that would like to keep some of their data need to copy them to the new system
-manually, using the [steps described below](#data-management-and-data-transfer).
+The migration from Taurus to Barnard comprises the following steps:

-For general hints regarding the migration please refer to these sites:
+* [Prepare login to Barnard](#login-to-barnard)
+* [Data management and data transfer to new filesystems](#data-management-and-data-transfer)
+* [Update job scripts and workflow to the new software](#software)
+* [Update job scripts and workflow w.r.t. Slurm](#slurm)

-* [Details on architecture](/jobs_and_resources/architecture_2023),
-* [Description of the migration](migration_2023.md).
+!!! note
+    We highly recommend that you first read the entire page carefully and then execute the steps.
+
+The migration can only be successful as a joint effort of HPC team and users.

 We value your feedback. Please provide it directly via our ticket system. For better processing,
-please add "Barnard:" as a prefix to the subject of the [support ticket](../support/support).
+please add "Barnard:" as a prefix to the subject of the [support ticket](../support/support.md).

 ## Login to Barnard
@@ -184,7 +188,7 @@ target filesystems.
    [workspaces](../data_lifecycle/workspaces.md). Before you invoke any data transer from the old
    working filesystems to the new ones, you need to allocate a workspace first.
-   The command `ws_list -l` lists the available and the default filesystem for workspaces.
+   The command `ws_list --list` lists the available and the default filesystem for workspaces.

    ```
    marie@barnard$ ws_list --list
@@ -310,6 +314,14 @@ Please use `module spider` to identify the software modules you need to load.
 The default release version is 23.10.

+The new nodes run on Linux RHEL 8.7. For a seamless integration of other compute hardware, all
+operating systems will be updated to the same versions of the operating system, Mellanox and
+Lustre drivers. Consequently, all application software was re-built using Git and CI/CD pipelines
+to handle the multitude of versions.
+
+We start with `release/23.10`, which is based on software requests and feedback from our
+HPC users. Most major software versions exist on all hardware platforms.
+
 ## Slurm

 * We are running the most recent Slurm version.
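The hunk above keeps the hint to use `module spider` for finding software under the new stack. As a minimal sketch of that lookup on a Barnard login node (the package name and version are placeholders, not a statement about which modules are actually installed), the sequence could look like this:

```console
marie@barnard$ module spider <package>            # list the versions of <package> known to the current release
marie@barnard$ module load <package>/<version>    # load exactly the version reported by module spider
```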
diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml
index 2f46fed3da8ef98c096e2b64c9d8fa0611749912..6f106d108cfbdef1594d1c85b72e8715f7fb15cf 100644
--- a/doc.zih.tu-dresden.de/mkdocs.yml
+++ b/doc.zih.tu-dresden.de/mkdocs.yml
@@ -103,9 +103,8 @@ nav:
       - Overview: jobs_and_resources/hardware_overview.md
       - New Systems 2023:
         - Architectural Re-Design 2023: jobs_and_resources/architecture_2023.md
-        - Overview 2023: jobs_and_resources/hardware_overview_2023.md
-        - Migration 2023: jobs_and_resources/migration_2023.md
-        - "How-To: Migration to Barnard": jobs_and_resources/barnard_test.md
+        - HPC Resources Overview 2023: jobs_and_resources/hardware_overview_2023.md
+        - "How-To: Migration to Barnard": jobs_and_resources/migration_to_barnard.md
      - AMD Rome Nodes: jobs_and_resources/rome_nodes.md
       - NVMe Storage: jobs_and_resources/nvme_storage.md
       - Alpha Centauri: jobs_and_resources/alpha_centauri.md
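With `migration_2023.md` removed and `barnard_test.md` renamed to `migration_to_barnard.md`, any leftover links to the old paths would break. A quick sanity check from the repository root, assuming a POSIX shell and a local MkDocs installation (a sketch only, not part of the official build pipeline), could be:

```console
marie@local$ grep -rn "barnard_test\|migration_2023" doc.zih.tu-dresden.de/docs/   # find stale references to the removed or renamed pages
marie@local$ mkdocs build --strict -f doc.zih.tu-dresden.de/mkdocs.yml             # --strict turns warnings such as broken nav links into errors
```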