Commit f15b6e22, authored 1 year ago by Martin Schroschk
Review Alpha page: state of stand-alone cluster and module load example
parent 1ce3e8b0
2 merge requests: !938 Automated merge from preview to main, !936 Update to Five-Cluster-Operation
1 changed file: doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md (+86 −30)

# GPU Cluster Alpha Centauri
The multi-GPU cluster `Alpha Centauri` has been installed for AI-related computations (ScaDS.AI).
The hardware specification is documented on the page
[HPC Resources](hardware_overview.md#alpha-centauri).
## Becoming a Stand-Alone Cluster
The former HPC system Taurus is being partly switched off and partly split up into separate clusters
until the end of 2023. One of these separate clusters is what you have known as the partition `alpha`
so far. With the end of the maintenance on November 30, 2023, `Alpha Centauri` is now a
stand-alone cluster with

* homogeneous hardware resources including two login nodes `login[1-2].alpha.hpc.tu-dresden.de`
  (see the login example below),
* and its own Slurm batch system.
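For example, you can reach one of the new login nodes directly via SSH. This is a minimal sketch, with
`marie` standing for your ZIH login and `marie@local$` denoting the prompt of your local machine:

```console
marie@local$ ssh marie@login2.alpha.hpc.tu-dresden.de
```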
### Filesystems
Your new `/home` directory (from `Barnard`) is also your `/home` on `Alpha Centauri`.
If you have not
[migrated your `/home` from Taurus to your **new** `/home` on Barnard](barnard.md#data-management-and-data-transfer),
please do so as soon as possible!
!!! warning "Current limitations w.r.t. filesystems"

    For now, `Alpha Centauri` will not be integrated into the InfiniBand fabric of Barnard. With this
    comes a dire restriction: **the only work filesystems for Alpha Centauri** will be the `/beegfs`
    filesystems. (`/scratch` and `/lustre/ssd` are not usable any longer.)

    Please prepare your stage-in/stage-out workflows using our [datamovers](../data_transfer/datamover.md)
    to enable working with larger datasets that might be stored on Barnard's new capacity filesystem
    `/data/walrus`. The datamover commands are not yet running on `Alpha Centauri`. Thus, you need to
    use them from Barnard (see the sketch after this note)!

    The new Lustre filesystems, namely `horse` and `walrus`, will be mounted as soon as `Alpha` is
    recabled (planned for May 2024).
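A typical stage-in from Barnard's capacity filesystem into a BeeGFS workspace could then look like the
following sketch. It uses the `dtcp` command of the datamover tools linked above and has to be issued on a
Barnard login node; the prompt and both workspace paths are only placeholders for your own directories:

```console
marie@barnard$ dtcp -r /data/walrus/ws/marie-input-data /beegfs/ws/marie-training-run/
```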
!!! warning "Current limitations w.r.t. workspace management"

    Workspace management commands do not work for `beegfs` yet. (Use them from Taurus, as sketched below!)
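For the time being, a workspace on the BeeGFS filesystem would thus be allocated from a Taurus login node,
for example along these lines (a sketch; the prompt, the workspace name `my_workspace`, and the duration of
30 days are placeholders):

```console
marie@login.taurus$ ws_allocate -F beegfs my_workspace 30
```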
## Usage
!!! note
...
...
### Modules
The easiest way is using the [module system](../software/modules.md).
All software available from the module system has been specifically built for the cluster `Alpha`,
i.e., with optimization for the Zen2 microarchitecture and CUDA support enabled.

To check the available modules for `Alpha`, use the command
```console
marie@login.alpha$ module spider <module_name>
```
??? example "Searching and loading PyTorch"

    For example, to check which `PyTorch` versions are available you can invoke
    ```console
    marie@login.alpha$ module spider PyTorch
    -------------------------------------------------------------------------------------------------------------------------
      PyTorch:
    -------------------------------------------------------------------------------------------------------------------------
        Description:
          Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
          that puts Python first.

        Versions:
            PyTorch/1.12.0
            PyTorch/1.12.1-CUDA-11.7.0
            PyTorch/1.12.1
    [...]
    ```
    Not all modules can be loaded directly. Most modules are built with a certain compiler or toolchain
    that needs to be loaded beforehand. Luckily, the module system can tell us what we need to do for a
    specific module or software version:

    ```console
    marie@login.alpha$ module spider PyTorch/1.12.1-CUDA-11.7.0
    [...]
    -------------------------------------------------------------------------------------------------------------------------
      PyTorch: PyTorch/1.12.1-CUDA-11.7.0
    -------------------------------------------------------------------------------------------------------------------------
        Description:
          Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
          that puts Python first.

        You will need to load all module(s) on any one of the lines below before the "PyTorch/1.12.1" module is available to load.

          release/23.04  GCC/11.3.0  OpenMPI/4.1.4
    [...]
    ```
    Finally, the command line to load the `PyTorch/1.12.1-CUDA-11.7.0` module is
    ```console
    marie@login.alpha$ module load release/23.04 GCC/11.3.0 OpenMPI/4.1.4 PyTorch/1.12.1-CUDA-11.7.0
    Module GCC/11.3.0, OpenMPI/4.1.4, PyTorch/1.12.1-CUDA-11.7.0 and 64 dependencies loaded.
    ```
    ```console
    marie@login.alpha$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
    1.12.1
    True
    ```
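To run the same check within a Slurm job on one of the compute nodes, you can wrap it into an interactive
`srun` call. The following is only a sketch; the requested resources are placeholders, and the options that
actually apply to `Alpha Centauri` are described on the Slurm pages of this compendium:

```console
marie@login.alpha$ srun --nodes=1 --ntasks=1 --gres=gpu:1 --time=00:10:00 --pty \
  python -c "import torch; print(torch.cuda.is_available())"
True
```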
### Python Virtual Environments
...
...