Commit 966daab0, authored 3 years ago by Martin Schroschk

Merge branch 'batch_system_overview' into 'preview'

Review overview page. See merge request !222

Parents: 975c9528, a68e1737

Part of 3 merge requests: !322 (Merge preview into main), !319 (Merge preview into main), !222 (Review overview page)

Showing 1 changed file: doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md (+61 additions, −29 deletions)
# Jobs and Resources

When logging in to ZIH systems, you are placed on a *login node* (**TODO** link to login nodes
section) where you can [manage data life cycle](../data_lifecycle/overview.md),
[setup experiments](../data_lifecycle/experiments.md), execute short tests, and compile moderate
projects. The login nodes cannot be used for real experiments and computations. Long and extensive
computational work and experiments have to be encapsulated into so-called **jobs** and scheduled to
the compute nodes.
<!--Login nodes which are used for login can not be used for your computations.-->
<!--To run software, do calculations and experiments, or compile your code, compute nodes have to be used.-->

ZIH uses the batch system Slurm for resource management and job scheduling.
<!--[HPC Introduction]**todo link** is a good resource to get started with it.-->
??? note "Batch Job"

    In order to allow the batch scheduler an efficient job placement, it needs these
    specifications:

    * **requirements:** cores, memory per core, (nodes), additional resources (GPU),
    * maximum run-time,
    * HPC project (normally the primary group, which gives the id),
    * who gets an email on which occasion.

    The runtime environment (see the [software overview](../software/overview.md)) as well as the
    executable and certain command-line arguments have to be specified to run the computational
    work.
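These specifications map directly onto `#SBATCH` directives in a job file. A minimal sketch; the partition behavior is left to Slurm's defaults, and the project name, e-mail address, and application name are placeholders, not real ZIH values:

```shell
#!/bin/bash
# Requirements: nodes, cores (tasks), memory per core
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=2000M

# Maximum run-time (walltime limit)
#SBATCH --time=01:00:00

# HPC project (account; normally your primary group)
#SBATCH --account=p_myproject

# Who gets an e-mail on which occasion
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.com

# Executable and command-line arguments
srun ./my_application --input data.in
```

Saved as, e.g., `my_job.sh`, this file is submitted with `sbatch my_job.sh`.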
??? note "Batch System"

    The batch system is the central component through which users interact with an HPC system's
    compute resources. It finds an adequate compute system (partition/island) for your compute
    jobs, and it organizes queueing and messaging while all resources are in use. Once resources
    become available for your job, the batch system allocates them, connects to these resources,
    transfers the run-time environment, and starts the job.
Follow the page [Slurm](slurm.md) for comprehensive documentation on using the batch system at
ZIH systems. There is also a page with an extensive set of [Slurm examples](slurm_examples.md).
## Selection of Suitable Hardware

### What do I need, a CPU or GPU?

The main difference between CPU and GPU architecture is that a CPU is designed to handle a wide
range of tasks quickly, but is limited in the concurrency of tasks that can be running. While GPUs
can process data much faster than a CPU due to massive parallelism (but the amount of data which
a single GPU core can handle is small), GPUs are not as versatile as CPUs.
ZIH provides a broad variety of compute resources, ranging from normal server CPUs of different
manufacturers, to large shared memory nodes and GPU-assisted nodes, up to highly specialised
resources for [Machine Learning](../software/machine_learning.md) and AI.
The page [Hardware Taurus](hardware_taurus.md) holds a comprehensive overview.

The desired hardware can be specified by the partition flag `-p, --partition` in Slurm.
The majority of the basic tasks can be executed on the conventional nodes like Haswell. Slurm will
automatically select a suitable partition depending on your memory and GPU requirements.
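Partition selection can be sketched as follows, using the partition names mentioned on this page (`haswell`, `gpu2`) as examples:

```shell
# Explicitly request a conventional CPU partition for an interactive shell
srun --partition=haswell --ntasks=1 --time=00:30:00 --pty bash

# Or pin the partition (and a GPU) inside a job file instead:
#SBATCH --partition=gpu2
#SBATCH --gres=gpu:1    # additionally request one GPU per node
```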
### Parallel Jobs

...

if it is necessary. Slurm will automatically find suitable hardware. Normal compute nodes are
perfect for this task.
**OpenMP jobs:** SMP-parallel applications can only run **within a node**, so it is necessary to
include the options `-N 1` and `-n 1`. Using `--cpus-per-task N`, Slurm will start one task and you
will have N CPUs. The maximum number of processors for an SMP-parallel program is 896 on Taurus
([SMP]**todo link** island).
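A sketch of an OpenMP job file following the options above (the application name is a placeholder):

```shell
#!/bin/bash
#SBATCH --nodes=1            # -N 1: SMP-parallel code runs within one node
#SBATCH --ntasks=1           # -n 1: a single task...
#SBATCH --cpus-per-task=8    # ...with 8 CPUs for its threads
#SBATCH --time=00:30:00

# Tell the OpenMP runtime how many threads it may use
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_openmp_app
```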
**GPU** partitions are best suited for **repetitive** and **highly-parallel** computing tasks. If
you have a task with potential [data parallelism]**todo link**, you most likely need the GPUs.
...

Beyond video rendering, GPUs excel in tasks such as machine learning, financial
modeling. Use the gpu2 and ml partitions only if you need GPUs! Otherwise, using the x86 partitions
(e.g. Haswell) would most likely be more beneficial.
**Interactive jobs:** Slurm can forward your X11 credentials to the first node (or even all nodes)
of a job with the `--x11` option. To use an interactive job, you have to specify the `-X` flag for
the ssh login.
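The X11 workflow described above can be sketched in two steps (treat the login host name as an example):

```shell
# 1. Log in with X11 forwarding enabled (-X)
ssh -X user@taurus.hrsk.tu-dresden.de

# 2. Start an interactive job; --x11 forwards the display to the
#    allocated node, --pty attaches an interactive shell
srun --x11 --ntasks=1 --time=00:30:00 --pty bash
```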
## Interactive vs. Batch Mode

However, using `srun` directly on the shell will block and launch an interactive job. Apart
from short test runs, it is recommended to encapsulate your experiments and computational
tasks into batch jobs and submit them to the batch system. For that, you can conveniently put the
parameters directly into the job file, which you can submit using `sbatch [options] <job file>`.
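The batch workflow then boils down to a few commands (the job file name is arbitrary):

```shell
# Submit the job file; Slurm answers with a job id and returns immediately
sbatch my_job.sh

# Check the state of your pending and running jobs
squeue -u "$USER"

# By default, stdout/stderr of the job end up in slurm-<jobid>.out
```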
## Processing of Data for Input and Output

Pre-processing and post-processing of the data is a crucial part of the majority of data-dependent
projects. The quality of this work influences the computations. However, pre- and post-processing
can in many cases be done completely or partially on a local system and then transferred to ZIH
systems. Please use ZIH systems primarily for the computation-intensive tasks.
<!--
Useful links: [Batch Systems]**todo link**, [Hardware Taurus]**todo link**, [HPC-DA]**todo link**,
-->
<!--
[Slurm]**todo link**
-->