---
search:
  boost: 2.0
---

# Batch System Slurm

ZIH uses the batch system Slurm for resource management and job scheduling. Compute nodes are not accessed directly, but addressed through Slurm. You specify the needed resources (cores, memory, GPU, time, ...) and Slurm will schedule your job for execution.

When logging in to ZIH systems, you are placed on a login node. There, you can manage your data life cycle, set up experiments, and edit and prepare jobs. The login nodes are not suited for computational work! From the login nodes, you can interact with the batch system, e.g., submit and monitor your jobs.
??? note "Batch System"
The batch system is the central organ of every HPC system users interact with its compute
resources. The batch system finds an adequate compute system (partition) for your compute jobs.
It organizes the queueing and messaging, if all resources are in use. If resources are available
for your job, the batch system allocates and connects to these resources, transfers runtime
environment, and starts the job.

    A workflow could look like this:

    ```mermaid
    sequenceDiagram
        user ->>+ login node: run program
        login node ->> login node: kill after 5 min
        login node ->>- user: Killed!
        user ->> login node: salloc [...]
        login node ->> Slurm: Request resources
        Slurm ->> user: resources
        user ->>+ allocated resources: srun [options] [command]
        allocated resources ->> allocated resources: run command (on allocated nodes)
        allocated resources ->>- user: program finished
        user ->>+ allocated resources: srun [options] [further_command]
        allocated resources ->> allocated resources: run further command
        allocated resources ->>- user: program finished
        user ->>+ allocated resources: srun [options] [further_command]
        allocated resources ->> allocated resources: run further command
        Slurm ->> allocated resources: Job limit reached/exceeded
        allocated resources ->>- user: Job limit reached
    ```
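
As a rough sketch, such an interactive workflow could look like this on the command line; the
resource values and the executable `./my_program` are placeholders, not recommendations:

```bash
# Request an interactive allocation (placeholder resource values)
salloc --nodes=1 --ntasks=4 --time=00:30:00

# Within the allocation, launch work on the allocated resources
srun ./my_program

# Further srun calls reuse the same allocation until the time limit is reached
srun ./my_program --other-options

# Release the allocation when finished
exit
```
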
??? note "Batch Job"
At HPC systems, computational work and resource requirements are encapsulated into so-called
jobs. In order to allow the batch system an efficient job placement it needs these
specifications:
* requirements: number of nodes and cores, memory per core, additional resources (GPU)
* maximum run-time
* HPC project for accounting
* who gets an email on which occasion
Moreover, the [runtime environment](../software/overview.md) as well as the executable and
certain command-line arguments have to be specified to run the computational work.
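
As a sketch, such job specifications (nodes, cores, memory, run-time, project, email
notifications) typically map to `#SBATCH` directives in a job file; the project name, email
address, resource values, module, and executable below are placeholders:

```bash
#!/bin/bash
#SBATCH --nodes=1                        # number of nodes
#SBATCH --ntasks=4                       # number of tasks (cores)
#SBATCH --mem-per-cpu=2000M              # memory per core
#SBATCH --gres=gpu:1                     # additional resources, e.g., one GPU
#SBATCH --time=01:00:00                  # maximum run-time (hh:mm:ss)
#SBATCH --account=p_number_crunch        # HPC project for accounting (placeholder)
#SBATCH --mail-type=END,FAIL             # occasions for email notification
#SBATCH --mail-user=marie@tu-dresden.de  # placeholder email address

# Load the runtime environment and start the executable (placeholders)
module load my_software
srun ./my_executable --some-argument
```
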
This page provides a brief overview of

- Slurm options to specify resource requirements,
- how to submit interactive and batch jobs,
- how to write job files,
- how to manage and control your jobs.

If you are already familiar with Slurm, you might be more interested in our collection of job examples. There is also a ton of external resources regarding Slurm. We recommend these links for detailed information:

- [slurm.schedmd.com](https://slurm.schedmd.com) provides the official documentation comprising manual pages, tutorials, examples, etc.
- Comparison with other batch systems

## Job Submission

There are three basic Slurm commands for job submission and execution:

* `srun`: Run a parallel application (and, if necessary, allocate resources first).
* `sbatch`: Submit a batch script to Slurm for later execution.
* `salloc`: Obtain a Slurm job allocation (i.e., resources like CPUs, nodes and GPUs) for
  interactive use. Release the allocation when finished.

Executing a program with `srun` directly on the shell will be blocking and launch an
interactive job. Apart from short test runs, it is recommended to submit your jobs to Slurm
for later execution by using batch jobs. For that, you can conveniently put the parameters in
a job file, which you can submit using `sbatch [options] <job file>`.

After submission, your job gets a unique job ID, which is stored in the environment variable
`SLURM_JOB_ID` at job runtime. The command `sbatch` outputs the job ID to stderr. Furthermore,
you can find it via `squeue --me`. The job ID allows you to manage and control your jobs.
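
For illustration, a minimal submission sequence could look like this; the job file name
`my_job_file.sh` is a placeholder:

```bash
# Submit the job file; Slurm reports the assigned job ID
sbatch my_job_file.sh

# List your own jobs together with their job IDs and states
squeue --me

# Inside the job script, the job ID is available at runtime
echo "Running as job ${SLURM_JOB_ID}"
```
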
!!! warning "srun vs. mpirun"

    On ZIH systems, `srun` is used to run your parallel application. The use of `mpirun` is known
    to be broken on the clusters `Power9` and `Alpha` for jobs requiring more than one node.
    Especially when using code from GitHub projects, double-check its configuration by looking for
    a line like `submit command mpirun -n $ranks ./app` and replace it with `srun ./app`.
    Otherwise, this may lead to wrong resource distribution and thus job failure, or tremendous
    slowdowns of your application.
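
For example, if a job file or a third-party submit template contains an `mpirun` line, it can be
adapted as sketched below; the executable `./app` is a placeholder:

```bash
# As often found in third-party submit templates (do not use on ZIH systems):
# mpirun -n $ranks ./app

# Use srun instead; it takes task count and placement from the Slurm allocation:
srun ./app
```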