Commit f15b6e22 authored by Martin Schroschk

Review Alpha page: state of stand-alone cluster and module load example

parent 1ce3e8b0
2 merge requests: !938 Automated merge from preview to main, !936 Update to Five-Cluster-Operation
# GPU Cluster Alpha Centauri
The multi-GPU sub-cluster "Alpha Centauri" has been installed for AI-related computations (ScaDS.AI).
The multi-GPU cluster `Alpha Centauri` has been installed for AI-related computations (ScaDS.AI).
The hardware specification is documented on the page
[HPC Resources](hardware_overview.md#alpha-centauri).
## Becoming a Stand-Alone Cluster
The former HPC system Taurus is being partly switched off and partly split into separate clusters
by the end of 2023. One of these separate clusters is what you have known so far as partition
`alpha`. With the end of the maintenance on November 30, 2023, `Alpha Centauri` is now a
stand-alone cluster with
* homogeneous hardware resources, including two login nodes `login[1-2].alpha.hpc.tu-dresden.de`,
* and its own Slurm batch system.
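The shorthand `login[1-2]` above stands for two separate hostnames. As a minimal sketch, the following snippet expands it and prints one `ssh` command per login node. The user name `marie` is the placeholder used throughout this documentation, and the helper name `login_nodes` is purely illustrative.

```bash
#!/usr/bin/env bash
# Expand the shorthand login[1-2].alpha.hpc.tu-dresden.de into full hostnames.
# "login_nodes" is a hypothetical helper name, not part of any cluster tooling.
login_nodes() {
    for i in 1 2; do
        printf 'login%d.alpha.hpc.tu-dresden.de\n' "$i"
    done
}

# "marie" is a placeholder; replace it with your own ZIH login.
user="marie"
for host in $(login_nodes); do
    # Only print the command; nothing is executed here.
    echo "ssh ${user}@${host}"
done
```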
### Filesystems
Your new `/home` directory (from `Barnard`) is also your `/home` on `Alpha Centauri`.
If you have not yet
[migrated your `/home` from Taurus to your **new** `/home` on Barnard](barnard.md#data-management-and-data-transfer),
please do so as soon as possible!
!!! warning "Current limitations w.r.t. filesystems"
For now, `Alpha Centauri` will not be integrated into the InfiniBand fabric of Barnard. This
comes with a severe restriction: **the only work filesystems for Alpha Centauri** will be the
`/beegfs` filesystems. (`/scratch` and `/lustre/ssd` are no longer usable.)
Please prepare your
stage-in/stage-out workflows using our [datamovers](../data_transfer/datamover.md) to enable
work with larger datasets that might be stored on Barnard’s new capacity filesystem
`/data/walrus`. The datamover commands are not yet available on `Alpha Centauri`, so you need to
run them from Barnard!
The new Lustre filesystems, namely `horse` and `walrus`, will be mounted as soon as `Alpha` is
recabled (planned for May 2024).
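A stage-in step as described above could be sketched as follows. This is a dry run that only prints the transfer command. It assumes that `dtrsync` is the datamover wrapper around `rsync` (see the linked datamover page for the authoritative command list), and both workspace paths are hypothetical examples. The resulting command would have to be run from Barnard.

```bash
#!/usr/bin/env bash
# Dry-run sketch: builds and prints a datamover command instead of executing it.
# Assumptions: `dtrsync` is the datamover counterpart of rsync; "stage_in_cmd"
# is a hypothetical helper name; both paths below are made-up examples.
stage_in_cmd() {
    src="$1"
    dst="$2"
    # Trailing slashes copy the contents of src into dst (rsync semantics).
    printf 'dtrsync -a %s/ %s/\n' "$src" "$dst"
}

# Copy a dataset from the capacity filesystem to a BeeGFS work directory.
stage_in_cmd "/data/walrus/ws/marie-dataset" "/beegfs/ws/marie-work"
```

The same helper with swapped arguments gives the stage-out direction once results need to be moved back to `/data/walrus`.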
!!! warning "Current limitations w.r.t. workspace management"
Workspace management commands do not work for `beegfs` yet. (Use them from Taurus!)
## Usage
!!! note
......@@ -23,48 +59,68 @@ cores are available per node.
### Modules
The easiest way is using the [module system](../software/modules.md).
The software for the cluster `alpha` is available in module environment `modenv/hiera`.
All software available from the module system has been built specifically for the cluster `Alpha`,
i.e., with optimization for the Zen2 microarchitecture and CUDA support enabled.
To check the available modules for `modenv/hiera`, use the command
To check the available modules for `Alpha`, use the command
```console
marie@alpha$ module spider <module_name>
marie@login.alpha$ module spider <module_name>
```
For example, to check whether PyTorch is available in version 1.7.1:
??? example "Searching and loading PyTorch"
```console
marie@alpha$ module spider PyTorch/1.7.1
For example, to check which `PyTorch` versions are available, you can invoke
-----------------------------------------------------------------------------------------------------------------------------------------
PyTorch: PyTorch/1.7.1
-----------------------------------------------------------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework that puts Python
first.
```console
marie@login.alpha$ module spider PyTorch
-------------------------------------------------------------------------------------------------------------------------
PyTorch:
-------------------------------------------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
that puts Python first.
Versions:
PyTorch/1.12.0
PyTorch/1.12.1-CUDA-11.7.0
PyTorch/1.12.1
[...]
```
You will need to load all module(s) on any one of the lines below before the "PyTorch/1.7.1" module is available to load.
Not all modules can be loaded directly. Most modules are built with a certain compiler or toolchain
that needs to be loaded beforehand.
Luckily, the module system can tell us what we need to do for a specific module or software version:
modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
```console
marie@login.alpha$ module spider PyTorch/1.12.1-CUDA-11.7.0
[...]
```
-------------------------------------------------------------------------------------------------------------------------
PyTorch: PyTorch/1.12.1-CUDA-11.7.0
-------------------------------------------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
that puts Python first.
The output of `module spider <module_name>` provides hints as to which dependencies must be loaded beforehand:
```console
marie@alpha$ module load modenv/hiera GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
Module GCC/10.2.0, CUDA/11.1.1, OpenMPI/4.0.5 and 15 dependencies loaded.
marie@alpha$ module avail PyTorch
-------------------------------------- /sw/modules/hiera/all/MPI/GCC-CUDA/10.2.0-11.1.1/OpenMPI/4.0.5 ---------------------------------------
PyTorch/1.7.1 (L) PyTorch/1.9.0 (D)
marie@alpha$ module load PyTorch/1.7.1
Module PyTorch/1.7.1 and 39 dependencies loaded.
marie@alpha$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
1.7.1
True
```
You will need to load all module(s) on any one of the lines below before the "PyTorch/1.12.1" module is available to load.
release/23.04 GCC/11.3.0 OpenMPI/4.1.4
[...]
```
Finally, the command line to load the `PyTorch/1.12.1-CUDA-11.7.0` module is
```console
marie@login.alpha$ module load release/23.04 GCC/11.3.0 OpenMPI/4.1.4 PyTorch/1.12.1-CUDA-11.7.0
Module GCC/11.3.0, OpenMPI/4.1.4, PyTorch/1.12.1-CUDA-11.7.0 and 64 dependencies loaded.
```
```console
marie@login.alpha$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
1.12.1
True
```
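The load and verification steps can also be combined in a small script, for example at the top of a job file. The following is a minimal sketch, assuming the module names reported by `module spider` above are still current. It checks that the `module` command exists before using it, so it fails gracefully when run outside the cluster.

```bash
#!/usr/bin/env bash
# Toolchain and target module, in the order reported by `module spider`.
MODULES="release/23.04 GCC/11.3.0 OpenMPI/4.1.4 PyTorch/1.12.1-CUDA-11.7.0"

if command -v module >/dev/null 2>&1; then
    # Word splitting of $MODULES is intentional: one argument per module.
    # The Python check only runs if all modules loaded successfully.
    module load $MODULES &&
        python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
else
    echo "module command not found; run this on an Alpha Centauri node" >&2
fi
```

Keeping the module list in one variable makes it easy to switch to another PyTorch version later without touching the rest of the script.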
### Python Virtual Environments
......