# Batch System Slurm
ZIH uses the batch system Slurm for resource management and job scheduling. Compute nodes are not accessed directly, but addressed through Slurm. You specify the needed resources (cores, memory, GPU, time, ...) and Slurm will schedule your job for execution.
When logging in to ZIH systems, you are placed on a login node. There, you can manage your data life cycle, setup experiments, and edit and prepare jobs. The login nodes are not suited for computational work! From the login nodes, you can interact with the batch system, e.g., submit and monitor your jobs.
??? note "Batch System"
The batch system is the central organ of every HPC system; it is the interface through which users
interact with the compute resources. The batch system finds an adequate compute system (partition)
for your compute jobs. It organizes the queueing and messaging if all resources are in use. If
resources are available for your job, the batch system allocates and connects to these resources,
transfers the runtime environment, and starts the job.
A workflow could look like this:
```mermaid
sequenceDiagram
user ->>+ login node: run program
login node ->> login node: kill after 5 min
login node ->>- user: Killed!
user ->> login node: salloc [...]
login node ->> Slurm: Request resources
Slurm ->> user: resources
user ->>+ allocated resources: srun [options] [command]
allocated resources ->> allocated resources: run command (on allocated nodes)
allocated resources ->>- user: program finished
user ->>+ allocated resources: srun [options] [further_command]
allocated resources ->> allocated resources: run further command
allocated resources ->>- user: program finished
user ->>+ allocated resources: srun [options] [further_command]
allocated resources ->> allocated resources: run further command
Slurm ->> allocated resources: Job limit reached/exceeded
allocated resources ->>- user: Job limit reached
```
??? note "Batch Job"
On HPC systems, computational work and resource requirements are encapsulated into so-called
jobs. In order to allow the batch system to place your jobs efficiently, it needs these
specifications:
* requirements: number of nodes and cores, memory per core, additional resources (GPU)
* maximum run-time
* HPC project for accounting
* who gets an email on which occasion
Moreover, the [runtime environment](../software/overview.md) as well as the executable and
certain command-line arguments have to be specified to run the computational work.
This page provides a brief overview of
- Slurm options to specify resource requirements,
- how to submit interactive and batch jobs,
- how to write job files,
- how to manage and control your jobs.
If you are already familiar with Slurm, you might be more interested in our collection of
[job examples](slurm_examples.md). There is also a ton of external resources regarding Slurm. We recommend these links for detailed information:

- [slurm.schedmd.com](https://slurm.schedmd.com/) provides the official documentation comprising manual pages, tutorials, examples, etc.
- Comparison with other batch systems
## Job Submission
There are three basic Slurm commands for job submission and execution:

- `srun`: Run a parallel application (and, if necessary, allocate resources first).
- `sbatch`: Submit a batch script to Slurm for later execution.
- `salloc`: Obtain a Slurm job allocation (i.e., resources like CPUs, nodes and GPUs) for interactive use. Release the allocation when finished.

Executing a program with `srun` directly on the shell will be blocking and launch an
interactive job. Apart from short test runs, it is recommended to submit your
jobs to Slurm for later execution by using batch jobs. For that, you can conveniently
put the parameters in a job file, which you can submit using `sbatch [options] <job file>`.

After submission, your job gets a unique job ID, which is stored in the environment variable
`SLURM_JOB_ID` at job runtime. The command `sbatch` outputs the job ID to stderr. Furthermore, you
can find it via `squeue --me`. The job ID allows you to manage and control your jobs.
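
For illustration, a submission and the subsequent lookup might look like this (the job file name and the job ID are placeholders):

```console
marie@login$ sbatch my_job_file.sh
Submitted batch job 27410655
marie@login$ squeue --me
```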
!!! warning "srun vs. mpirun"
On ZIH systems, `srun` is used to run your parallel application. The use of `mpirun` has proven
to be broken on partitions `ml` and `alpha` for jobs requiring more than one node. Especially when
using code from GitHub projects, double-check its configuration by looking for a line like
'submit command mpirun -n $ranks ./app' and replace it with 'srun ./app'.
Otherwise, this may lead to a wrong resource distribution and thus job failure, or tremendous
slowdowns of your application.
### Options
The following table contains the most important options for `srun`, `sbatch`, and `salloc` to
specify resource requirements and control communication.

??? tip "Options Table (see `man sbatch`)"

| Slurm Option | Description |
|:---------------------------|:------------|
| `-n, --ntasks=<N>` | Total number of (MPI) tasks (default: 1) |
| `-N, --nodes=<N>` | Number of compute nodes |
| `--ntasks-per-node=<N>` | Number of tasks per allocated node to start (default: 1) |
| `-c, --cpus-per-task=<N>` | Number of CPUs per task; needed for multithreaded (e.g. OpenMP) jobs; typically `N` should be equal to `OMP_NUM_THREADS` |
| `-p, --partition=<name>` | Type of nodes where you want to execute your job (refer to [partitions](partitions_and_limits.md)) |
| `--mem-per-cpu=<size>`     | Memory required per allocated CPU in MB |
| `-t, --time=<HH:MM:SS>` | Maximum runtime of the job |
| `--mail-user=<your email>` | Get updates about the status of the jobs |
| `--mail-type=ALL` | For what type of events you want to get a mail; valid options: `ALL`, `BEGIN`, `END`, `FAIL`, `REQUEUE` |
| `-J, --job-name=<name>` | Name of the job shown in the queue and in mails (cut after 24 chars) |
| `--no-requeue` | Disable requeueing of the job in case of node failure (default: enabled) |
| `--exclusive` | Exclusive usage of compute nodes; you will be charged for all CPUs/cores on the node |
| `-A, --account=<project>` | Charge resources used by this job to the specified project |
| `-o, --output=<filename>` | File to save all normal output (stdout) (default: `slurm-%j.out`) |
| `-e, --error=<filename>` | File to save all error output (stderr) (default: `slurm-%j.out`) |
| `-a, --array=<arg>` | Submit an array job ([examples](slurm_examples.md#array-jobs)) |
| `-w <node1>,<node2>,...` | Restrict job to run on specific nodes only |
| `-x <node1>,<node2>,...` | Exclude specific nodes from job |
| `--test-only` | Retrieve estimated start time of a job considering the job queue; does not actually submit the job nor run the application |
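
For illustration, several of the options from the table above can also be combined directly on the command line when submitting a job file (values and the file name are placeholders):

```console
marie@login$ sbatch --ntasks=4 --mem-per-cpu=1700 --time=02:00:00 --account=<project> my_job_file.sh
```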
!!! note "Output and Error Files"
When redirecting stdout and stderr into a file using `--output=<filename>` and
`--error=<filename>`, make sure the target path is writable on the
compute nodes, i.e., it may not point to a read-only mounted
[filesystem](../data_lifecycle/overview.md) like `/projects`.
!!! note "No free lunch"
Runtime and memory limits are enforced. Please refer to the section on [partitions and
limits](partitions_and_limits.md) for a detailed overview.
### Host List
If you want to place your job onto specific nodes, there are two options for doing this. Either
use `-p, --partition=<name>` to specify a host group aka partition that fits your needs, or use
`-w, --nodelist=<host1,host2,..>` with a list of hosts that will work for you.
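
For illustration, both variants on the command line (the node names are made up; use hostnames that actually exist on the system):

```console
# restrict the job to two specific nodes
marie@login$ sbatch --nodelist=taurusi6605,taurusi6606 my_job_file.sh
# or: exclude a node you want to avoid
marie@login$ sbatch --exclude=taurusi6605 my_job_file.sh
```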
### Interactive Jobs
Interactive activities like editing, compiling, preparing experiments etc. are normally limited to
the login nodes. For longer interactive sessions, you can allocate cores on the compute nodes with
the command `salloc`. It takes the same options as `sbatch` to specify the required resources.

`salloc` returns a new shell on the node where you submitted the job. You need to use the command
`srun` in front of the following commands to have these commands executed on the allocated
resources. If you allocate more than one task, please be aware that `srun` will run the command on
each allocated task by default! To release the allocated resources, invoke the command `exit` or
`scancel <jobid>`.
```console
marie@login$ salloc --nodes=2
salloc: Pending job allocation 27410653
salloc: job 27410653 queued and waiting for resources
salloc: job 27410653 has been allocated resources
salloc: Granted job allocation 27410653
salloc: Waiting for resource configuration
salloc: Nodes taurusi[6603-6604] are ready for job

marie@login$ hostname
tauruslogin5.taurus.hrsk.tu-dresden.de

marie@login$ srun hostname
taurusi6604.taurus.hrsk.tu-dresden.de
taurusi6603.taurus.hrsk.tu-dresden.de

marie@login$ exit # ending the resource allocation
```
The command `srun` also creates an allocation, if it is running outside any `sbatch` or `salloc`
allocation.
```console
marie@login$ srun --pty --ntasks=1 --cpus-per-task=4 --time=1:00:00 --mem-per-cpu=1700 bash -l
srun: job 13598400 queued and waiting for resources
srun: job 13598400 has been allocated resources
marie@compute$ # Now, you can start interactive work with e.g. 4 cores
```
Since Slurm 20.11, `--exclusive` is the default for `srun` as a job step, which means you have to
use `--overlap` if you want to run `srun` within an `srun` allocation.
```console
marie@login$ srun --pty bash -l
srun: job 27410688 queued and waiting for resources
srun: job 27410688 has been allocated resources
marie@compute$ srun --overlap hostname
taurusi6604.taurus.hrsk.tu-dresden.de
```
!!! note "Using `module` commands in interactive mode"
The [module commands](../software/modules.md) are made available by sourcing the files
`/etc/profile` and `~/.bashrc`. This is done automatically by passing the parameter `-l` to your
shell, as shown in the example above. If you missed adding `-l` when submitting the interactive
session, no worries, you can also source these files manually later on (`source /etc/profile`).
!!! note "Partition `interactive`"
A dedicated partition `interactive` is reserved for short jobs (< 8h) with no more than one job
per user. An interactive partition is available for every regular partition, e.g.
`alpha-interactive` for `alpha`. Please check the availability of nodes there with
`sinfo | grep 'interactive\|AVAIL' | less`.
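
For illustration, an interactive session on such a partition could be requested as follows (partition name as in the example above; adjust the resources to your needs):

```console
marie@login$ srun --partition=alpha-interactive --ntasks=1 --time=01:00:00 --pty bash -l
```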
#### Interactive X11/GUI Jobs
Slurm will forward your X11 credentials to the first (or even all) node for a job with the
(undocumented) `--x11` option.

```console
marie@login$ srun --ntasks=1 --pty --x11=first xeyes
```
!!! hint "X11 error"
If you are getting the error:
```Bash
srun: error: x11: unable to connect node taurusiXXXX
```
that probably means you still have an old host key for the target node in your
`~/.ssh/known_hosts` file (e.g. from pre-SCS5). This can be solved either by removing the entry
from your `known_hosts` or by simply deleting the `known_hosts` file altogether if you don't have
other important entries in it.
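
One way to remove only the stale entry is `ssh-keygen -R` (a sketch; insert the node name from the error message):

```console
marie@login$ ssh-keygen -R <node hostname>
```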
### Batch Jobs
Working interactively using `srun` and `salloc` is a good starting point for testing and compiling.
But, as soon as you leave the testing stage, we highly recommend using batch jobs.
Batch jobs are encapsulated within job files and submitted to the batch system using
`sbatch` for later execution. A job file is basically a script holding the resource requirements,
environment settings and the commands for executing the application. Using batch jobs and job files
has multiple advantages*:

- You can reproduce your experiments and work, because all steps are saved in a file.
- You can easily share your settings and experimental setup with colleagues.

*) If job files are version controlled or the environment (e.g., the output of `env`) is saved
along with the Slurm output.
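
A minimal sketch of how this could be done from within a job file (the file names are arbitrary):

```bash
# record the loaded modules and the full environment next to the Slurm output
module list > modules-$SLURM_JOB_ID.txt 2>&1
env > environment-$SLURM_JOB_ID.txt
```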
!!! hint "Syntax: Submitting a batch job"
```console
marie@login$ sbatch [options] <job_file>
```
#### Job Files
Job files have to be written with the following structure.
```bash
#!/bin/bash
# ^ Batch script starts with shebang line

#SBATCH --ntasks=24                  # #SBATCH lines request resources and
#SBATCH --time=01:00:00              # specify Slurm options
#SBATCH --account=<KTR>              #
#SBATCH --job-name=fancyExp          # All #SBATCH lines have to follow uninterrupted
#SBATCH --output=simulation-%j.out   # after the shebang line
#SBATCH --error=simulation-%j.err    # Comments start with # and do not count as interruptions

module purge                         # Set up environment, e.g., clean/switch modules environment
module load <module1 module2>        # and load necessary modules

srun ./application [options]         # Execute parallel application with srun
```
The following two examples show the basic resource specifications for a pure OpenMP application and a pure MPI application, respectively. Within the section Job Examples, we provide a comprehensive collection of job examples.
??? example "Job file OpenMP"
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=01:00:00
#SBATCH --account=<account>
module purge
module load <modules>
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./path/to/openmp_application
```
* Submission: `marie@login$ sbatch batch_script.sh`
* Run with fewer CPUs: `marie@login$ sbatch --cpus-per-task=14 batch_script.sh`
??? example "Job file MPI"
```bash
#!/bin/bash
#SBATCH --ntasks=64
#SBATCH --time=01:00:00
#SBATCH --account=<account>
module purge
module load <modules>
srun ./path/to/mpi_application
```
* Submission: `marie@login$ sbatch batch_script.sh`
* Run with fewer MPI tasks: `marie@login$ sbatch --ntasks=14 batch_script.sh`
### Heterogeneous Jobs
A heterogeneous job consists of several job components, all of which can have individual job options. In particular, different components can use resources from different Slurm partitions. One example for this setting is an MPI application consisting of a master process with a huge memory footprint and worker processes requiring GPU support.
The `salloc`, `sbatch`, and `srun` commands can all be used to submit heterogeneous jobs. Resource
specifications for each component of the heterogeneous job should be separated with the ":"
character. Running a job step on a specific component is supported by the option `--het-group`.
```console
marie@login$ salloc --ntasks=1 --cpus-per-task=4 --partition <partition> --mem=200G : \
             --ntasks=8 --cpus-per-task=1 --gres=gpu:8 --mem=80G --partition <partition>
[...]
marie@login$ srun ./my_application <args for master tasks> : ./my_application <args for worker tasks>
```
Heterogeneous jobs can also be defined in job files. There, it is required to separate multiple
components by a line containing the directive `#SBATCH hetjob`.
```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --partition=<partition>
#SBATCH --mem=200G
#SBATCH hetjob          # required to separate groups
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:8
#SBATCH --mem=80G
#SBATCH --partition=<partition>

srun ./my_application <args for master tasks> : ./my_application <args for worker tasks>

# or as an alternative
srun ./my_application <args for master tasks> &
srun --het-group=1 ./my_application <args for worker tasks> &
wait
```
#### Limitations
Due to the way the scheduling algorithm works, each component has to be allocated on a different node. Furthermore, job arrays of heterogeneous jobs are not supported.
## Manage and Control Jobs
### Job and Slurm Monitoring
On the command line, use `squeue` to watch the scheduling queue.
!!! tip "Show your jobs"
Invoke `squeue --me` to list only your jobs.
In its last column, the `squeue` command will also tell why a job is not running.
Possible reasons and their detailed descriptions are listed in the following table.
More information about job parameters can be obtained with `scontrol -d show job <jobid>`.
??? tip "Reason Table"
| Reason | Long Description |
|:-------------------|:------------------|
| `Dependency` | This job is waiting for a dependent job to complete. |
| `None` | No reason is set for this job. |
| `PartitionDown` | The partition required by this job is in a down state. |
| `PartitionNodeLimit` | The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are down or drained. |
| `PartitionTimeLimit` | The job's time limit exceeds its partition's current time limit. |
| `Priority` | One or more higher priority jobs exist for this partition. |
| `Resources` | The job is waiting for resources to become available. |
| `NodeDown` | A node required by the job is down. |
| `BadConstraints` | The job's constraints cannot be satisfied. |
| `SystemFailure` | Failure of the Slurm system, a filesystem, the network, etc. |
| `JobLaunchFailure` | The job could not be launched. This may be due to a filesystem problem, invalid program name, etc. |
| `NonZeroExitCode` | The job terminated with a non-zero exit code. |
| `TimeLimit` | The job exhausted its time limit. |
| `InactiveLimit` | The job reached the system inactive limit. |
For detailed information on why your submitted job has not started yet, you can use the command:

```console
marie@login$ whypending <jobid>
```
### Editing Jobs
Jobs that have not yet started can be altered. By using
`scontrol update timelimit=4:00:00 jobid=<jobid>`, it is for example possible to modify the
maximum runtime. `scontrol` understands many different options, please take a look at the
[scontrol documentation](https://slurm.schedmd.com/scontrol.html) for more details.
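
For illustration, a few `scontrol` calls for pending jobs (the job ID is a placeholder; `hold` and `release` are standard Slurm subcommands):

```console
# extend the maximum runtime of a pending job
marie@login$ scontrol update timelimit=4:00:00 jobid=<jobid>
# temporarily prevent a pending job from starting, and release it again
marie@login$ scontrol hold <jobid>
marie@login$ scontrol release <jobid>
```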
### Canceling Jobs
The command `scancel <jobid>` kills a single job and removes it from the queue. By using
`scancel -u <username>`, you can send a canceling signal to all of your jobs at once.
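
A short sketch of common invocations (job ID and username are placeholders; `--state` is a standard Slurm filter):

```console
# cancel one specific job
marie@login$ scancel <jobid>
# cancel only your pending jobs
marie@login$ scancel --user=<username> --state=PENDING
```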
### Evaluating Jobs
The Slurm command `sacct` provides job statistics like memory usage, CPU time, energy usage etc.
as table-formatted output on the command line.

The job monitor PIKA provides web-based graphical performance statistics at no extra cost.
!!! hint "Learn from old jobs"
We highly encourage you to inspect your previous jobs in order to better
estimate the requirements, e.g., runtime, for future jobs.
With PIKA, it is e.g. easy to check whether a job is hanging, idling,
or making good use of the resources.
??? tip "Using sacct (see also `man sacct`)"
`sacct` outputs the following fields by default.
```console
# show all own jobs contained in the accounting database
marie@login$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[...]
```
We'd like to point your attention to the following options to gain insight into your jobs.
??? example "Show specific job"
```console
marie@login$ sacct --jobs=<JOBID>
```
??? example "Show all fields for a specific job"
```console
marie@login$ sacct --jobs=<JOBID> --format=All
```
??? example "Show specific fields"
```console
marie@login$ sacct --jobs=<JOBID> --format=JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy
```
The manual page (`man sacct`) and the [sacct online reference](https://slurm.schedmd.com/sacct.html)
provide a comprehensive documentation regarding available fields and formats.
!!! hint "Time span"
By default, `sacct` only shows data of the last day. If you want to look further into the past
without specifying an explicit job id, you need to provide a start date via the option
`--starttime` (or short: `-S`). A certain end date is also possible via `--endtime` (or `-E`).
??? example "Show all jobs since the beginning of year 2021"
```console
marie@login$ sacct --starttime 2021-01-01 [--endtime now]
```
## Jobs at Reservations
Within a reservation, you have privileged access to HPC resources. How to ask for a reservation is described in the section reservations. After we have agreed on your requirements, we will send you an e-mail with your reservation name. Then, you can see more information about your reservation with the following command:
```console
marie@login$ scontrol show res=<reservation name>
# e.g. scontrol show res=hpcsupport_123
```
If you want to use your reservation, you have to add the parameter
`--reservation=<reservation name>` either in your job script or to your `srun` or `salloc` command.
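
For illustration, the corresponding line in a job script could read (use the reservation name from the e-mail):

```bash
#SBATCH --reservation=<reservation name>
```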
## Node Features for Selective Job Submission
The nodes in our HPC system are becoming more diverse in multiple aspects, e.g., hardware, mounted storage, software. The system administrators can describe the set of properties, and it is up to you as a user to specify the requirements. These features should be thought of as changing over time (e.g., a filesystem gets stuck on a certain node).
A feature can be used with the Slurm option `-C, --constraint=<ARG>`, e.g.,
`srun --constraint="fs_lustre_scratch2" [...]`, with `srun` or `sbatch`.
Multiple features can also be combined using AND, OR, matching OR, resource count etc.
E.g., `--constraint="fs_beegfs|fs_lustre_ssd"` requests nodes with at least one of the
features `fs_beegfs` and `fs_lustre_ssd`. For a detailed description of the possible
constraints, please refer to the Slurm documentation.
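
A short sketch of both variants as job file directives (the features are taken from the table in the next section; `&` is the standard Slurm AND operator):

```bash
# require nodes that provide both filesystems
#SBATCH --constraint="fs_lustre_scratch2&fs_beegfs_global0"
# alternative: require at least one of the two filesystems
##SBATCH --constraint="fs_beegfs|fs_lustre_ssd"
```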
!!! hint
A feature is checked only for scheduling. Running jobs are not affected by changing features.
### Filesystem Features
A feature `fs_*` is active if a certain filesystem is mounted and available on a node. Access to
these filesystems is tested every few minutes on each node and the Slurm features are set
accordingly.
| Feature              | Description                                                        | Workspace Name   |
|:---------------------|:-------------------------------------------------------------------|:-----------------|
| `fs_lustre_scratch2` | `/scratch` mounted read-write (mount point is `/lustre/scratch2`)   | `scratch`        |
| `fs_lustre_ssd`      | `/ssd` mounted read-write (mount point is `/lustre/ssd`)            | `ssd`            |
| `fs_warm_archive_ws` | `/warm_archive/ws` mounted read-only                                | `warm_archive`   |
| `fs_beegfs_global0`  | `/beegfs/global0` mounted read-write                                | `beegfs_global0` |
| `fs_beegfs`          | `/beegfs` mounted read-write                                        | `beegfs`         |
For certain projects, specific filesystems are provided. For those,
additional features are available, like `fs_beegfs_<projectname>`.