diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
index a29bed873913112a37103304af6e96dbe936eb43..4f98ef6ff17c3d4909282a21f9ad68ea1a918328 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview.md
@@ -25,6 +25,10 @@ users and the ZIH.
 
 ## Login and Export Nodes
 
+!!! note "**On December 11, 2023, Taurus will be decommissioned for good.**"
+
+    Do not use Taurus for production anymore.
+
 - 4 Login-Nodes `tauruslogin[3-6].hrsk.tu-dresden.de`
   - Each login node is equipped with 2x Intel(R) Xeon(R) CPU E5-2680 v3 with 24 cores in
     total @ 2.50 GHz, Multithreading disabled, 64 GB RAM, 128 GB SSD local disk
@@ -37,28 +41,83 @@ users and the ZIH.
 - Further information on the usage is documented on the site
   [Export Nodes](../data_transfer/export_nodes.md)
 
-## AMD Rome CPUs + NVIDIA A100
+## Barnard - Intel Sapphire Rapids CPUs
+
+- 630 diskless nodes, each with
+  - 2 x Intel Xeon Platinum 8470 (52 cores) @ 2.00 GHz, Multithreading enabled
+  - 512 GB RAM
+- Hostnames: `n[1001-1630].barnard.hpc.tu-dresden.de`
+- Login nodes: `login[1-4].barnard.hpc.tu-dresden.de`
+
+## Alpha Centauri - AMD Rome CPUs + NVIDIA A100
 
 - 34 nodes, each with
   - 8 x NVIDIA A100-SXM4 Tensor Core-GPUs
   - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available
   - 1 TB RAM
   - 3.5 TB local memory on NVMe device at `/tmp`
-- Hostnames: `taurusi[8001-8034]`
-- Slurm partition: `alpha`
+- Hostnames: `i[8001-8037].alpha.hpc.tu-dresden.de`
+- Login nodes: `login[1-2].alpha.hpc.tu-dresden.de`
 - Further information on the usage is documented on the site [Alpha Centauri Nodes](alpha_centauri.md)
 
-## Island 7 - AMD Rome CPUs
+??? note "Maintenance from November 27 to December 12"
+
+    The GPU cluster Alpha Centauri (partition `alpha`) will be shut down from November 27 to
+    December 10 for work on the power supply infrastructure. These additional changes are planned:
+
+    * update the software stack (OS and applications, firmware, etc.),
+    * change the Ethernet access (new VLANs),
+    * integrate 5 new (identical) nodes.
+
+    After this maintenance, Alpha Centauri will reappear as a stand-alone cluster that can be
+    reached via `login[1,2].alpha.hpc.tu-dresden.de`.
+
+    **Changes w.r.t. filesystems:**
+    Your new `/home` directory (from Barnard) will become your `/home` on Romeo, Julia, *Alpha
+    Centauri* and the Power9 system. Thus, please [migrate your `/home` from Taurus to your **new**
+    `/home` on Barnard](migration_to_barnard.md#data-management-and-data-transfer).
+
+    For now, Alpha Centauri will not be integrated into the InfiniBand fabric of Barnard.
+    This comes with a severe
+    restriction: **the only work filesystems for Alpha Centauri** will be the `/beegfs` filesystems.
+    `/scratch` and `/lustre/ssd` are not usable any longer. Please prepare your stage-in/stage-out
+    workflows using our [datamovers](../data_transfer/datamover.md) to enable working with larger
+    datasets that might be stored on Barnard’s new capacity filesystem `/data/walrus`.
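+
+    A stage-in before your jobs and a stage-out afterwards could look like the following sketch
+    (the paths are placeholders, and we assume the `dtcp` wrapper described on the
+    [datamover](../data_transfer/datamover.md) page):
+
+    ```console
+    marie@login$ # stage-in: copy input data from the capacity filesystem to a BeeGFS workspace
+    marie@login$ dtcp -r /data/walrus/projects/p_number_crunch/input /beegfs/<your-workspace>/input
+    marie@login$ # ... run your jobs on /beegfs ...
+    marie@login$ # stage-out: copy results back to the capacity filesystem
+    marie@login$ dtcp -r /beegfs/<your-workspace>/results /data/walrus/projects/p_number_crunch/
+    ```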
+
+## Romeo - AMD Rome CPUs
 
 - 192 nodes, each with
   - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
   - 512 GB RAM
   - 200 GB local memory on SSD at `/tmp`
-- Hostnames: `taurusi[7001-7192]`
-- Slurm partition: `romeo`
+- Hostnames: `i[7001-7190].romeo.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
+- Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
 - Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
 
-## Large SMP System HPE Superdome Flex
+??? note "Maintenance from November 27 to December 12"
+
+    The recabling will take place from November 27 to December 12. These works are planned:
+
+    * update the software stack (OS, firmware, software),
+    * change the Ethernet access (new VLANs),
+    * complete integration of Romeo and Julia into the Barnard InfiniBand network to get full
+      bandwidth access to all Barnard filesystems,
+    * configure and deploy stand-alone Slurm batch systems.
+
+    After the maintenance, the Rome nodes reappear as a stand-alone cluster that can be reached via
+    `login[1,2].romeo.hpc.tu-dresden.de`.
+
+    **Changes w.r.t. filesystems:**
+    Your new `/home` directory (from Barnard) will become your `/home` on *Romeo*, Julia, Alpha
+    Centauri and the Power9 system. Thus, please [migrate your `/home` from Taurus to your **new**
+    `/home` on Barnard](migration_to_barnard.md#data-management-and-data-transfer).
+
+    The old work filesystems `/lustre/scratch` and `/lustre/ssd` will be turned off on January 1,
+    2024 for good (no data access afterwards!). The new work filesystem available on Romeo will be
+    `/horse`. Please [migrate your working data to `/horse`](migration_to_barnard.md#data-migration-to-new-filesystems).
+
+## Julia - Large SMP System HPE Superdome Flex
 
 - 1 node, with
   - 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20 GHz (28 cores)
@@ -66,10 +125,33 @@ users and the ZIH.
 - Configured as one single node
 - 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
 - 370 TB of fast NVME storage available at `/nvme/<projectname>`
-- Hostname: `taurussmp8`
-- Slurm partition: `julia`
+- Hostname: `smp8.julia.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
 - Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)
 
+??? note "Maintenance from November 27 to December 12"
+
+    The recabling will take place from November 27 to December 12. These works are planned:
+
+    * update the software stack (OS, firmware, software),
+    * change the Ethernet access (new VLANs),
+    * complete integration of Romeo and Julia into the Barnard InfiniBand network to get full
+      bandwidth access to all Barnard filesystems,
+    * configure and deploy stand-alone Slurm batch systems.
+
+    After the maintenance, the Julia system reappears as a stand-alone cluster that can be reached
+    via `smp8.julia.hpc.tu-dresden.de`.
+
+    **Changes w.r.t. filesystems:**
+    Your new `/home` directory (from Barnard) will become your `/home` on Romeo, *Julia*, Alpha
+    Centauri and the Power9 system. Thus, please [migrate your `/home` from Taurus to your **new**
+    `/home` on Barnard](migration_to_barnard.md#data-management-and-data-transfer).
+
+    The old work filesystems `/lustre/scratch` and `/lustre/ssd` will be turned off on January 1,
+    2024 for good (no data access afterwards!). The new work filesystem available on the Julia
+    system will be `/horse`. Please
+    [migrate your working data to `/horse`](migration_to_barnard.md#data-migration-to-new-filesystems).
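+
+    A first transfer of your working data could look like the following sketch (the paths are
+    placeholders, and we assume the `dtrsync` wrapper described on the
+    [datamover](../data_transfer/datamover.md) page):
+
+    ```console
+    marie@login$ # copy a workspace from the old Lustre filesystem to the new /horse filesystem
+    marie@login$ dtrsync -a /lustre/ssd/ws/marie-numbercrunch/ /horse/ws/marie-numbercrunch/
+    ```
+
+    In this sketch `dtrsync` is assumed to follow `rsync` semantics, so the command can be repeated
+    later to transfer only the changes made in the meantime.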
+
 ## IBM Power9 Nodes for Machine Learning
 
 For machine learning, we have IBM AC922 nodes installed with this configuration:
@@ -79,33 +161,22 @@ For machine learning, we have IBM AC922 nodes installed with this configuration:
   - 256 GB RAM DDR4 2666 MHz
   - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
   - NVLINK bandwidth 150 GB/s between GPUs and host
-- Hostnames: `taurusml[1-32]`
-- Slurm partition: `ml`
-
-## Island 6 - Intel Haswell CPUs
-
-- 612 nodes, each with
-  - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
-  - 128 GB local memory on SSD
-- Varying amounts of main memory (selected automatically by the batch system for you according to
-  your job requirements)
-  * 594 nodes with 2.67 GB RAM per core (64 GB in total): `taurusi[6001-6540,6559-6612]`
-  - 18 nodes with 10.67 GB RAM per core (256 GB in total): `taurusi[6541-6558]`
-- Hostnames: `taurusi[6001-6612]`
-- Slurm Partition: `haswell`
-
-??? hint "Node topology"
-
-    ![Node topology](...)
-    {: align=center}
-
-## Island 2 Phase 2 - Intel Haswell CPUs + NVIDIA K80 GPUs
-
-- 64 nodes, each with
-  - 2 x Intel(R) Xeon(R) CPU E5-E5-2680 v3 (12 cores) @ 2.50 GHz, Multithreading disabled
-  - 64 GB RAM (2.67 GB per core)
-  - 128 GB local memory on SSD
-  - 4 x NVIDIA Tesla K80 (12 GB GDDR RAM) GPUs
-- Hostnames: `taurusi[2045-2108]`
-- Slurm Partition: `gpu2`
-- Node topology, same as [island 4 - 6](#island-6-intel-haswell-cpus)
+- Hostnames: `ml[1-29].power9.hpc.tu-dresden.de` (after
+  [recabling phase](architecture_2023.md#migration-phase))
+- Login nodes: `login[1-2].power9.hpc.tu-dresden.de`
+
+??? note "Maintenance from November 27 to December 12"
+
+    The recabling will take place from November 27 to December 12. After the maintenance, the Power9
+    system reappears as a stand-alone cluster that can be reached via
+    `ml[1-29].power9.hpc.tu-dresden.de`.
+
+    **Changes w.r.t. filesystems:**
+    Your new `/home` directory (from Barnard) will become your `/home` on Romeo, Julia, Alpha
+    Centauri and the *Power9* system. Thus, please [migrate your `/home` from Taurus to your **new**
+    `/home` on Barnard](migration_to_barnard.md#data-management-and-data-transfer).
+
+    The old work filesystems `/lustre/scratch` and `/lustre/ssd` will be turned off on January 1,
+    2024 for good (no data access afterwards!). The new work filesystem available on the Power9
+    system will be `/horse`. Please
+    [migrate your working data to `/horse`](migration_to_barnard.md#data-migration-to-new-filesystems).
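+
+    Once the cluster is back, an illustrative session could look like this (the username `marie`
+    is a placeholder):
+
+    ```console
+    marie@local$ ssh marie@login1.power9.hpc.tu-dresden.de
+    marie@login$ ls /horse
+    ```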
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md
index 5c03db6d3609ea4d9d5178c6d82136b4ed63dea0..075d753eb441e9ab1d293046f6bc1f83c511ee07 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hardware_overview_2023.md
@@ -1,5 +1,7 @@
 # HPC Resources Overview 2023
 
+TODO Move to other page
+
 With the installation and start of operation of the [new HPC system Barnard](#barnard-intel-sapphire-rapids-cpus),
 quite significant changes w.r.t. HPC system landscape at ZIH follow. The former HPC system Taurus is
 partly switched-off and partly split up into separate clusters. In the end, from the users'
@@ -24,144 +26,3 @@ All clusters will have access to these shared parallel filesystems:
 | Project | `/projects` | Lustre | quota per project | permanent project data |
 | Scratch for large data / streaming | `/data/horse` | Lustre | 20 PB | |
 
-## Barnard - Intel Sapphire Rapids CPUs
-
-- 630 diskless nodes, each with
-  - 2 x Intel Xeon Platinum 8470 (52 cores) @ 2.00 GHz, Multithreading enabled
-  - 512 GB RAM
-- Hostnames: `n[1001-1630].barnard.hpc.tu-dresden.de`
-- Login nodes: `login[1-4].barnard.hpc.tu-dresden.de`
-
-## Alpha Centauri - AMD Rome CPUs + NVIDIA A100
-
-- 34 nodes, each with
-  - 8 x NVIDIA A100-SXM4 Tensor Core-GPUs
-  - 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available
-  - 1 TB RAM
-  - 3.5 TB local memory on NVMe device at `/tmp`
-- Hostnames: `taurusi[8001-8034]` -> `i[8001-8037].alpha.hpc.tu-dresden.de`
-- Login nodes: `login[1-2].alpha.hpc.tu-dresden.de`
-- Further information on the usage is documented on the site [Alpha Centauri Nodes](alpha_centauri.md)
-
-??? note "Maintenance from November 27 to December 12"
-
-    The GPU cluster Alpha Centauri (Partition `alpha`) will be shut down from November 27 to
-    December 10 for works on the power supply infrastructure. These additional changes are planned:
-
-    * update the software stack (OS and applications, firmware, etc)
-    * change the ethernet access (new VLANs),
-    * integrate 5 new (identical) nodes.
-
-    After this maintenance, Alpha Centauri will then reappear as stand-alone cluster that can be
-    reached via `login[1,2].alpha.hpc.tu-dresden.de`.
-
-    **Changes w.r.t. filesystems:**
-    Your new `/home` directory (from Barnard) will become your `/home` on Romeo, Julia, *Alpha
-    Centauri* and the Power9 system. Thus, please [migrate your `/home` from Taurus to your **new**
-    `/home` on Barnard](migration_to_barnard.md#data-management-and-data-transfer).
-
-    For now, Alpha Centauri will not be integrated in the InfiniBand fabric of Barnard.
-    With this comes a dire
-    restriction: **the only work filesystems for Alpha Centauri** will be the `/beegfs` filesystems.
-    `/scratch` and `/lustre/ssd` are not usable any longer. Please, prepare your stage-in/stage-out
-    workflows using our [datamovers](../data_transfer/datamover.md) to enable the work with larger
-    datasets that might be stored on Barnard’s new capacity filesystem `/data/walrus`.
-
-## Romeo - AMD Rome CPUs
-
-- 192 nodes, each with
-  - 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
-  - 512 GB RAM
-  - 200 GB local memory on SSD at `/tmp`
-- Hostnames: `taurusi[7001-7192]` -> `i[7001-7190].romeo.hpc.tu-dresden.de` (after
-  [recabling phase](architecture_2023.md#migration-phase)])
-- Login nodes: `login[1-2].romeo.hpc.tu-dresden.de`
-- Further information on the usage is documented on the site [AMD Rome Nodes](rome_nodes.md)
-
-??? note "Maintenance from November 27 to December 12"
-
-    The recabling will take place from November 27 to December 12. These works are planned:
-
-    * update the software stack (OS, firmware, software),
-    * change the ethernet access (new VLANs),
-    * complete integration of Romeo and Julia into the Barnard Infiniband network to get full
-      bandwidth access to all Barnard filesystems,
-    * configure and deploy stand-alone Slurm batch systems.
-
-    After the maintenance, the Rome nodes reappear as a stand-alone cluster that can be reached via
-    `login[1,2].romeo.hpc.tu-dresden.de`.
-
-    **Changes w.r.t. filesystems:**
-    Your new `/home` directory (from Barnard) will become your `/home` on *Romeo*, Julia, Alpha
-    Centauri and the Power9 system. Thus, please [migrate your `/home` from Taurus to your **new**
-    `/home` on Barnard](migration_to_barnard.md#data-management-and-data-transfer).
-
-    The old work filesystems `/lustre/scratch` and `/lustre/ssd will` be turned off on January 1
-    2024 for good (no data access afterwards!). The new work filesystem available on Romeo will be
-    `/horse`. Please [migrate your workding data to `/horse`](migration_to_barnard.md#data-migration-to-new-filesystems).
-
-## Julia - Large SMP System HPE Superdome Flex
-
-- 1 node, with
-  - 32 x Intel Xeon Platinum 8276M CPU @ 2.20 GHz (28 cores)
-  - 47 TB RAM
-- Configured as one single node
-- 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
-- 370 TB of fast NVME storage available at `/nvme/<projectname>`
-- Hostname: `taurussmp8` -> `smp8.julia.hpc.tu-dresden.de` (after
-  [recabling phase](architecture_2023.md#migration-phase)])
-- Further information on the usage is documented on the site [HPE Superdome Flex](sd_flex.md)
-
-??? note "Maintenance from November 27 to December 12"
-
-    The recabling will take place from November 27 to December 12. These works are planned:
-
-    * update the software stack (OS, firmware, software),
-    * change the ethernet access (new VLANs),
-    * complete integration of Romeo and Julia into the Barnard Infiniband network to get full
-      bandwidth access to all Barnard filesystems,
-    * configure and deploy stand-alone Slurm batch systems.
-
-    After the maintenance, the Julia system reappears as a stand-alone cluster that can be reached
-    via `smp8.julia.hpc.tu-dresden.de`.
-
-    **Changes w.r.t. filesystems:**
-    Your new `/home` directory (from Barnard) will become your `/home` on Romeo, *Julia*, Alpha
-    Centauri and the Power9 system. Thus, please [migrate your `/home` from Taurus to your **new**
-    `/home` on Barnard](migration_to_barnard.md#data-management-and-data-transfer).
-
-    The old work filesystems `/lustre/scratch` and `/lustre/ssd will` be turned off on January 1
-    2024 for good (no data access afterwards!). The new work filesystem available on the Julia
-    system will be `/horse`. Please
-    [migrate your workding data to `/horse`](migration_to_barnard.md#data-migration-to-new-filesystems).
-
-## IBM Power9 Nodes for Machine Learning
-
-For machine learning, we have IBM AC922 nodes installed with this configuration:
-
-- 32 nodes, each with
-  - 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores)
-  - 256 GB RAM DDR4 2666 MHz
-  - 6 x NVIDIA VOLTA V100 with 32 GB HBM2
-  - NVLINK bandwidth 150 GB/s between GPUs and host
-- Hostnames: `taurusml[1-32]` -> `ml[1-29].power9.hpc.tu-dresden.de` (after
-  [recabling phase](architecture_2023.md#migration-phase)])
-- Login nodes: `login[1-2].power9.hpc.tu-dresden.de`
-
-??? note "Maintenance from November 27 to December 12"
-
-    The recabling will take place from November 27 to December 12. After the maintenance, the Power9
-    system reappears as a stand-alone cluster that can be reached via
-    `ml[1-29].power9.hpc.tu-dresden.de`.
-
-    **Changes w.r.t. filesystems:**
-    Your new `/home` directory (from Barnard) will become your `/home` on Romeo, Julia, Alpha
-    Centauri and the *Power9* system. Thus, please [migrate your `/home` from Taurus to your **new**
-    `/home` on Barnard](migration_to_barnard.md#data-management-and-data-transfer).
-
-    The old work filesystems `/lustre/scratch` and `/lustre/ssd will` be turned off on January 1
-    2024 for good (no data access afterwards!). The new work filesystem available on the Power9
-    system will be `/horse`. Please
-    [migrate your workding data to `/horse`](migration_to_barnard.md#data-migration-to-new-filesystems).
-
-    The recabling will take in the weeks after November 27. These works are planned: