Commit ccc25389 authored by Martin Schroschk

Review

- Remove outdated content
- Update content
- Change tense to present
@@ -22,68 +22,22 @@ HPC resources at ZIH comprise a total of **six systems**:
All clusters will run with their own [Slurm batch system](slurm.md) and job submission is possible
only from their respective login nodes.
## Architectural Design
Over the last decade we have been running our highly heterogeneous HPC system with a single
Slurm batch system. This made things very complicated, especially for inexperienced users. With
the replacement of the Taurus system by the cluster [Barnard](#barnard) in 2023, we have a new
architectural design comprising **six homogeneous clusters with their own Slurm instances and with
cluster-specific login nodes** running on the same CPU. Job submission is possible only from
within the corresponding cluster (compute or login node).
All clusters are integrated into the new InfiniBand fabric and have the same access to
the shared filesystems. Comprehensive documentation on the available working and
permanent filesystems can be found on the page [Filesystems](data_lifecycle/file_systems.md).
![Architecture overview 2023](../jobs_and_resources/misc/architecture_2024.png)
{: align=center}
### Compute Systems
All compute clusters now act as separate entities with their own
login nodes of the same hardware and their very own Slurm batch systems. The different hardware,
e.g. Romeo and Alpha Centauri, is no longer managed via a single Slurm instance with
corresponding partitions. Instead, you as a user now select the hardware by choosing the
corresponding login node.
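
For example, to compute on Barnard you would log in to one of its login nodes and submit your
job from there. This is only a sketch: `marie` is a placeholder username, the hostname is an
example, and `my_job.sh` is a hypothetical job script; please take the exact login node names
from the cluster-specific pages.

```console
# Log in to a login node of the target cluster (example hostname)
ssh marie@login1.barnard.hpc.tu-dresden.de

# Submit a batch job to this cluster's own Slurm instance
sbatch my_job.sh

# List your jobs on this cluster only
squeue --me
```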
The login nodes can be used for smaller interactive tasks on the clusters. There are
restrictions in place, though, w.r.t. usable resources and time per user. For larger
computations, please use interactive jobs on the compute nodes.
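
A minimal sketch of such an interactive job, started from a login node of the respective cluster
(the requested resources are examples and subject to the cluster's limits):

```console
# Request an interactive shell on a compute node of the current cluster
srun --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash
```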
### Storage Systems
For an easier grasp of the major categories (size, speed), the
work filesystems now come with the names of animals.
#### Permanent Filesystems
We now have `/home` and `/software` in a Lustre filesystem. Snapshots
and tape backup are configured. (`/projects` remains the same until a recabling.)
The Lustre filesystem `/data/walrus` is meant for larger data with slow
access. It is installed to replace `/warm_archive`.
#### Work Filesystems
With new players entering the filesystem market, it is getting more and more
complicated to identify the filesystem best suited for a specific use case. Often,
only tests can find the best setup for a specific workload.
* `/data/horse` - 20 PB - high bandwidth (Lustre)
* `/data/octopus` - 0.5 PB - for interactive usage (Lustre) - to be mounted on Alpha Centauri
* `/data/weasel` - 1 PB - for high IOPS (WEKA) - coming 2024.
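
Whether a particular work filesystem is available on a cluster can be checked directly on its
login nodes. The following sketch assumes that the hpc-workspace tools known from the previous
system are also configured for the new filesystems; the filesystem and workspace names are
examples.

```console
# Check whether the work filesystem is mounted and how full it is
df -h /data/horse

# Allocate a workspace on it for 30 days (assuming hpc-workspace tooling)
ws_allocate -F horse my_experiment 30
```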
#### Difference Between "Work" And "Permanent"
A large number of changing files is a challenge for any backup system. To protect
our snapshots and backups from work data,
`/projects` cannot be used for temporary data on the compute nodes - it is mounted read-only.
For `/home`, we create snapshots and tape backups. That's why working there
with a high frequency of changing files is a bad idea.
Please use our data mover mechanisms to transfer valuable data to permanent
storage or long-term archives.
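
A sketch of such a transfer, assuming the data mover wrappers known from the previous system
(e.g. `dtcp`, `dtrsync`) are also available on the new clusters and that the shown project and
workspace paths exist:

```console
# Copy results from a work filesystem to the project filesystem
dtcp -r /data/horse/ws/my_experiment/results /projects/p_number_crunch/

# Synchronize a larger directory tree to the slower /data/walrus filesystem
dtrsync -a /data/horse/ws/my_experiment/ /data/walrus/ws/my_experiment/
```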
## Login and Dataport Nodes
- Login-Nodes
@@ -95,11 +49,6 @@ storages or long-term archives.
- IPs: 141.30.73.\[4,5\]
- Further information on the usage is documented on the page
[dataport Nodes](../data_transfer/dataport_nodes.md)
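
As an illustration, a transfer from a local machine through a dataport node could look like
this; the hostname and target path are placeholders, the authoritative names are given on the
[dataport Nodes](../data_transfer/dataport_nodes.md) page.

```console
# Copy a local directory to a work filesystem via a dataport node
# (hostname and target path are placeholders)
rsync -avP ./input_data/ marie@dataport1.hpc.tu-dresden.de:/data/horse/ws/my_experiment/
```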
## Barnard