From f15b6e221d0f382aeb824c600332211bff1ed174 Mon Sep 17 00:00:00 2001
From: Martin Schroschk <martin.schroschk@tu-dresden.de>
Date: Wed, 6 Dec 2023 18:19:38 +0100
Subject: [PATCH] Review Alpha page: state of stand-alone cluster and module load example

---
 .../docs/jobs_and_resources/alpha_centauri.md | 116 +++++++++++++-----
 1 file changed, 86 insertions(+), 30 deletions(-)

diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md
index 7dacca7f5..1e4d6cced 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md
@@ -1,10 +1,46 @@
 # GPU Cluster Alpha Centauri
 
-The multi-GPU sub-cluster "Alpha Centauri" has been installed for AI-related computations (ScaDS.AI).
+The multi-GPU cluster `Alpha Centauri` has been installed for AI-related computations (ScaDS.AI).
 
 The hardware specification is documented on the page
 [HPC Resources](hardware_overview.md#alpha-centauri).
 
+## Becoming a Stand-Alone Cluster
+
+The former HPC system Taurus is partly switched-off and partly split up into separate clusters
+until the end of 2023. One such upcoming separate cluster is what you have known as partition
+`alpha` so far. With the end of the maintenance on November 30, 2023, `Alpha Centauri` is now a
+stand-alone cluster with
+
+* homogeneous hardware resources incl. two login nodes `login[1-2].alpha.hpc.tu-dresden.de`,
+* and its own Slurm batch system.
+
+### Filesystems
+
+Your new `/home` directory (from `Barnard`) is also your `/home` on `Alpha Centauri`.
+If you have not
+[migrated your `/home` from Taurus to your **new** `/home` on Barnard](barnard.md#data-management-and-data-transfer)
+, please do so as soon as possible!
+
+!!! warning "Current limitations w.r.t. filesystems"
+
+    For now, `Alpha Centauri` will not be integrated in the InfiniBand fabric of Barnard. With this
+    comes a dire restriction: **the only work filesystems for Alpha Centauri** will be the `/beegfs`
+    filesystems. (`/scratch` and `/lustre/ssd` are not usable any longer.)
+
+    Please prepare your
+    stage-in/stage-out workflows using our [datamovers](../data_transfer/datamover.md) to enable the
+    work with larger datasets that might be stored on Barnard’s new capacity filesystem
+    `/data/walrus`. The datamover commands are not yet running. Thus, you need to use them from
+    Barnard!
+
+    The new Lustre filesystems, namely `horse` and `walrus`, will be mounted as soon as `Alpha` is
+    recabled (planned for May 2024).
+
+!!! warning "Current limitations w.r.t. workspace management"
+
+    Workspace management commands do not work for `beegfs` yet. (Use them from Taurus!)
+
 ## Usage
 
 !!! note
@@ -23,48 +59,68 @@ cores are available per node.
 ### Modules
 
 The easiest way is using the [module system](../software/modules.md).
-The software for the cluster `alpha` is available in module environment `modenv/hiera`.
+All software available from the module system has been specifically built for the cluster `Alpha`
+i.e., with optimization for Zen2 microarchitecture and CUDA support enabled.
 
-To check the available modules for `modenv/hiera`, use the command
+To check the available modules for `Alpha`, use the command
 
 ```console
-marie@alpha$ module spider <module_name>
+marie@login.alpha$ module spider <module_name>
 ```
 
-For example, to check whether PyTorch is available in version 1.7.1:
+??? example "Searching and loading PyTorch"
 
-```console
-marie@alpha$ module spider PyTorch/1.7.1
+    For example, to check which `PyTorch` versions are available, you can invoke
 
------------------------------------------------------------------------------------------------------------------------------------------
-  PyTorch: PyTorch/1.7.1
------------------------------------------------------------------------------------------------------------------------------------------
-    Description:
-      Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework that puts Python
-      first.
+    ```console
+    marie@login.alpha$ module spider PyTorch
+    -------------------------------------------------------------------------------------------------------------------------
+      PyTorch:
+    -------------------------------------------------------------------------------------------------------------------------
+        Description:
+          Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
+          that puts Python first.
 
+         Versions:
+            PyTorch/1.12.0
+            PyTorch/1.12.1-CUDA-11.7.0
+            PyTorch/1.12.1
+    [...]
+    ```
 
-    You will need to load all module(s) on any one of the lines below before the "PyTorch/1.7.1" module is available to load.
+    Not all modules can be loaded directly. Most modules are built with a certain compiler or toolchain
+    that needs to be loaded beforehand.
+    Luckily, the module system can tell us what we need to do for a specific module or software version
 
-      modenv/hiera  GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5
+    ```console
+    marie@login.alpha$ module spider PyTorch/1.12.1-CUDA-11.7.0
 
-[...]
-```
+    -------------------------------------------------------------------------------------------------------------------------
+      PyTorch: PyTorch/1.12.1-CUDA-11.7.0
+    -------------------------------------------------------------------------------------------------------------------------
+        Description:
+          Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
+          that puts Python first.
 
-The output of `module spider <module_name>` provides hints which dependencies should be loaded beforehand:
 
-```console
-marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
-Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5 and 15 dependencies loaded.
-marie@alpha$ module avail PyTorch
--------------------------------------- /sw/modules/hiera/all/MPI/GCC-CUDA/10.2.0-11.1.1/OpenMPI/4.0.5 ---------------------------------------
-   PyTorch/1.7.1 (L)    PyTorch/1.9.0 (D)
-marie@alpha$ module load PyTorch/1.7.1
-Module PyTorch/1.7.1 and 39 dependencies loaded.
-marie@alpha$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
-1.7.1
-True
-```
+        You will need to load all module(s) on any one of the lines below before the "PyTorch/1.12.1" module is available to load.
+
+          release/23.04  GCC/11.3.0  OpenMPI/4.1.4
+    [...]
+    ```
+
+    Finally, the command line to load the `PyTorch/1.12.1-CUDA-11.7.0` module is
+
+    ```console
+    marie@login.alpha$ module load release/23.04 GCC/11.3.0 OpenMPI/4.1.4 PyTorch/1.12.1-CUDA-11.7.0
+    Module GCC/11.3.0, OpenMPI/4.1.4, PyTorch/1.12.1-CUDA-11.7.0 and 64 dependencies loaded.
+    ```
+
+    ```console
+    marie@login.alpha$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
+    1.12.1
+    True
+    ```
 
 ### Python Virtual Environments
 
-- 
GitLab