diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md
deleted file mode 100644
index c0e992d5b308384b3849e324dd1d6577cfd61e22..0000000000000000000000000000000000000000
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/architecture_2023.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# Architectural Re-Design 2023
-
-Over the last decade we have been running our HPC system of high heterogeneity with a single
-Slurm batch system. This made things very complicated, especially to inexperienced users.
-With the replacement of the Taurus system by the cluster
-[Barnard](hardware_overview.md#barnard)
-we **now create homogeneous clusters with their own Slurm instances and with cluster specific login
-nodes** running on the same CPU. Job submission will be possible only from within the cluster
-(compute or login node).
-
-All clusters will be integrated to the new InfiniBand fabric and have then the same access to
-the shared filesystems. This recabling will require a brief downtime of a few days.
-
-
-{: align=center}
-
-## Compute Systems
-
-All compute clusters now act as separate entities having their own
-login nodes of the same hardware and their very own Slurm batch systems. The different hardware,
-e.g. Romeo and Alpha Centauri, is no longer managed via a single Slurm instance with
-corresponding partitions. Instead, you as user now chose the hardware by the choice of the
-correct login node.
-
-The login nodes can be used for smaller interactive jobs on the clusters. There are
-restrictions in place, though, wrt. usable resources and time per user. For larger
-computations, please use interactive jobs.
-
-## Storage Systems
-
-For an easier grasp on the major categories (size, speed), the
-work filesystems now come with the names of animals.
-
-### Permanent Filesystems
-
-We now have `/home` and `/software` in a Lustre filesystem. Snapshots
-and tape backup are configured. (`/projects` remains the same until a recabling.)
-
-The Lustre filesystem `/data/walrus` is meant for larger data with a slow
-access. It is installed to replace `/warm_archive`.
-
-### Work Filesystems
-
-In the filesystem market with new players it is getting more and more
-complicated to identify the best suited filesystem for a specific use case. Often,
-only tests can find the best setup for a specific workload.
-
-* `/data/horse` - 20 PB - high bandwidth (Lustre)
-* `/data/octopus` - 0.5 PB - for interactive usage (Lustre) - to be mounted on Alpha Centauri
-* `/data/weasel` - 1 PB - for high IOPS (WEKA) - coming 2024.
-
-### Difference Between "Work" And "Permanent"
-
-A large number of changing files is a challenge for any backup system. To protect
-our snapshots and backup from work data,
-`/projects` cannot be used for temporary data on the compute nodes - it is mounted read-only.
-
-For `/home`, we create snapshots and tape backups. That's why working there,
-with a high frequency of changing files is a bad idea.
-
-Please use our data mover mechanisms to transfer worthy data to permanent
-storages or long-term archives.
-
-## Migration Phase
-
-For about one month, the new cluster Barnard, and the old cluster Taurus
-will run side-by-side - both with their respective filesystems. We provide a comprehensive
-[description of the migration to Barnard](migration_to_barnard.md).
-
-<!--
-The follwing figure provides a graphical overview of the overall process (red: user action
-required):
-
-
-{: align=center}
--->
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
index 9b3f3e8f96e5fadc35a6eacf2ce832d46ce71cb0..09660f035434ab0a1d588d01029ad1f2ac904454 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
@@ -34,6 +34,85 @@ perspective, there will be **five separate clusters**:
 
 All clusters will run with their own [Slurm batch system](slurm.md) and job submission is
 possible only from their respective login nodes.
+
+??? note "Architectural Re-Design 2023"
+
+    Over the last decade we have been running our highly heterogeneous HPC system with a single
+    Slurm batch system. This made things very complicated, especially for inexperienced users.
+    With the replacement of the Taurus system by the cluster
+    [Barnard](hardware_overview.md#barnard)
+    we **now create homogeneous clusters with their own Slurm instances and with
+    cluster-specific login nodes** running on the same CPU. Job submission will be possible
+    only from within the cluster (compute or login node).
+
+    All clusters will be integrated into the new InfiniBand fabric and will then have the same
+    access to the shared filesystems. This recabling will require a brief downtime of a few days.
+
+
+    {: align=center}
+
+    ## Compute Systems
+
+    All compute clusters now act as separate entities, each with its own login nodes of the same
+    hardware and its very own Slurm batch system. The different hardware, e.g. Romeo and
+    Alpha Centauri, is no longer managed via a single Slurm instance with corresponding
+    partitions. Instead, you as a user now choose the hardware by logging in to the
+    corresponding login node.
+
+    The login nodes can be used for smaller interactive tasks on the clusters. There are
+    restrictions in place, though, regarding usable resources and time per user. For larger
+    computations, please use interactive jobs on the compute nodes.
+
+    ## Storage Systems
+
+    For an easier grasp of the major categories (size, speed), the
+    work filesystems are now named after animals.
+
+    ### Permanent Filesystems
+
+    We now have `/home` and `/software` in a Lustre filesystem. Snapshots
+    and tape backup are configured. (`/projects` remains the same until the recabling.)
+
+    The Lustre filesystem `/data/walrus` is meant for larger data with slower
+    access. It is installed to replace `/warm_archive`.
+
+    ### Work Filesystems
+
+    With new players entering the filesystem market, it is getting more and more
+    complicated to identify the filesystem best suited for a specific use case. Often,
+    only tests can find the best setup for a specific workload.
+
+    * `/data/horse` - 20 PB - high bandwidth (Lustre)
+    * `/data/octopus` - 0.5 PB - for interactive usage (Lustre) - to be mounted on Alpha Centauri
+    * `/data/weasel` - 1 PB - for high IOPS (WEKA) - coming 2024.
+
+    ### Difference Between "Work" And "Permanent"
+
+    A large number of changing files is a challenge for any backup system. To protect
+    our snapshots and backups from work data,
+    `/projects` cannot be used for temporary data on the compute nodes - it is mounted read-only.
+
+    For `/home`, we create snapshots and tape backups. That is why working there
+    with a high frequency of changing files is a bad idea.
+
+    Please use our data mover mechanisms to transfer valuable data to permanent
+    storage or long-term archives.
+
+    ## Migration Phase
+
+    For about one month, the new cluster Barnard and the old cluster Taurus
+    will run side-by-side - both with their respective filesystems. We provide a comprehensive
+    [description of the migration to Barnard](migration_to_barnard.md).
+
+    <!--
+    The following figure provides a graphical overview of the overall process (red: user action
+    required):
+
+
+    {: align=center}
+    -->
+
+
 ## Login and Export Nodes
 
 !!! Note " **On December 11 2023 Taurus will be decommissioned for good**."
diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml
index 728a9fbd71808a96f37f6c8d989770dccd45c849..4bd3b402e47a839183443cb5d4f65d6bec8db358 100644
--- a/doc.zih.tu-dresden.de/mkdocs.yml
+++ b/doc.zih.tu-dresden.de/mkdocs.yml
@@ -102,7 +102,6 @@ nav:
     - HPC Resources:
       - Overview: jobs_and_resources/hardware_overview.md
       - New Systems 2023:
-        - Architectural Re-Design 2023: jobs_and_resources/architecture_2023.md
         - "How-To: Migration to Barnard": jobs_and_resources/migration_to_barnard.md
         - CPU Custer Romeo: jobs_and_resources/romeo.md
       - NVMe Storage: jobs_and_resources/nvme_storage.md
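
The re-design note moved above states that jobs can only be submitted from a cluster's own login nodes and that larger computations belong in interactive jobs on the compute nodes rather than on the login nodes. A minimal sketch of that workflow, assuming Barnard login host names of the form `login[1-4].barnard.hpc.tu-dresden.de`, the placeholder user `marie`, and placeholder resource values; check the cluster-specific pages and `man srun` for the authoritative options:

```bash
# Log in to a login node of the target cluster (host name and user are placeholders).
ssh marie@login1.barnard.hpc.tu-dresden.de

# Small interactive tasks (editing, compiling, short scripts) are fine on the login node.

# Larger computations: request an interactive job on a compute node via Slurm.
# All resource values below are placeholders; adjust them to your workload and project.
srun --ntasks=1 --cpus-per-task=8 --mem=16G --time=01:00:00 --pty bash -l

# The resulting shell runs on a compute node of the cluster you logged in to.
```

Batch jobs submitted with `sbatch` follow the same pattern; the key point is only that submission happens from within the cluster whose hardware you want to use.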
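The storage section asks users to move valuable results from the work filesystems to permanent storage or the long-term archive via the data mover mechanisms. A sketch of such a transfer, assuming the `dt*` wrapper commands (`dtls`, `dtcp`, `dtrsync`) described in the Datamover section of the compendium are available on the login nodes; all paths are placeholders:

```bash
# Inspect finished results on the fast work filesystem (path is a placeholder).
dtls /data/horse/myproject/results

# Copy them to the slower but larger /data/walrus filesystem.
# The dt* wrappers run the transfer as a job on dedicated datamover nodes,
# so large copies do not load the login node.
dtcp -r /data/horse/myproject/results /data/walrus/myproject/results_2023

# For repeated transfers, an rsync-style call keeps the target in sync.
dtrsync -a /data/horse/myproject/results/ /data/walrus/myproject/results_2023/
```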