From 82bbae6600dd8b5973ca1bdb9c75f9d1bfc4895b Mon Sep 17 00:00:00 2001
From: Ben Roberts <ben@schedmd.com>
Date: Thu, 8 Apr 2021 15:42:35 -0500
Subject: [PATCH] Docs - Clarify different OOM behavior for cgroups vs polling

Bug 11318
---
 doc/man/man5/cgroup.conf.5 | 8 +++++++-
 doc/man/man5/slurm.conf.5  | 7 +++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/doc/man/man5/cgroup.conf.5 b/doc/man/man5/cgroup.conf.5
index 2bb2846e452..f9d30e367dc 100644
--- a/doc/man/man5/cgroup.conf.5
+++ b/doc/man/man5/cgroup.conf.5
@@ -116,7 +116,13 @@ which case the job's RAM limit will be set to its swap space limit if
 \fBConstrainSwapSpace\fR is set to "yes".
 Also see \fBAllowedSwapSpace\fR, \fBAllowedRAMSpace\fR and
 \fBConstrainSwapSpace\fR.
-NOTE: When enabled, ConstrainRAMSpace can lead to a noticeable decline in
+
+\fBNOTE\fR: When using \fBConstrainRAMSpace\fR, if a process tries to consume
+more memory than is available, the step that process is running in will be
+killed. This differs from the behavior when using \fBOverMemoryKill\fR,
+where just the offending process will be killed.
+
+\fBNOTE\fR: When enabled, ConstrainRAMSpace can lead to a noticeable decline in
 per-node job throughout. Sites with high-throughput requirements should
 carefully weigh the tradeoff between per-node throughput, versus potential
 problems that can arise from unconstrained memory usage on the node. See
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index 29dc090cacc..0c8a846f022 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1207,8 +1207,11 @@ allocation may affect other processes and/or machine health.
 task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the
 cgroup.conf instead of using this JobAcctGather mechanism for memory
 enforcement. With OverMemoryKill, memory limit is applied against each process
-individually and is not applied to the step as a whole as it is with
-ConstrainRAMSpace=yes. Using JobAcctGather is polling based and there is a
+individually and is not applied to the step as a whole. This means that when
+jobs have a process that consumes too much memory, the process will be killed
+but the step will continue to run. When using cgroups with
+ConstrainRAMSpace=yes, a process that consumes too much memory will result in
+the job step being killed. Using JobAcctGather is polling based and there is a
 delay before a job is killed, which could lead to system Out of Memory events.
 .RE
 
-- 
GitLab