Docs - Clarify different OOM behavior for cgroups vs polling

Bug 11318

Commit 82bbae66 authored 3 years ago by Ben Roberts; committed by Danny Auble 3 years ago
--- a/doc/man/man5/cgroup.conf.5
+++ b/doc/man/man5/cgroup.conf.5
@@ -116,7 +116,13 @@ which case the job's RAM limit will be set to its swap space limit if
 \fBConstrainSwapSpace\fR is set to "yes".
 Also see \fBAllowedSwapSpace\fR, \fBAllowedRAMSpace\fR and
 \fBConstrainSwapSpace\fR.
-NOTE: When enabled, ConstrainRAMSpace can lead to a noticeable decline in
+
+\fBNOTE\fR: When using \fBConstrainRAMSpace\fR, if a process tries to consume
+more memory than is available, the step that process is running in will be
+killed. This differs from the behavior when using \fBOverMemoryKill\fR,
+where just the offending process will be killed.
+
+\fBNOTE\fR: When enabled, ConstrainRAMSpace can lead to a noticeable decline in
 per-node job throughout. Sites with high-throughput requirements should
 carefully weigh the tradeoff between per-node throughput, versus potential
 problems that can arise from unconstrained memory usage on the node. See

--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1207,8 +1207,11 @@ allocation may affect other processes and/or machine health.
 task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the
 cgroup.conf instead of using this JobAcctGather mechanism for memory
 enforcement. With OverMemoryKill, memory limit is applied against each process
-individually and is not applied to the step as a whole as it is with
-ConstrainRAMSpace=yes. Using JobAcctGather is polling based and there is a
+individually and is not applied to the step as a whole. This means that when
+jobs have a process that consumes too much memory, the process will be killed
+but the step will continue to run. When using cgroups with
+ConstrainRAMSpace=yes, a process that consumes too much memory will result in
+the job step being killed. Using JobAcctGather is polling based and there is a
 delay before a job is killed, which could lead to system Out of Memory events.
 .RE
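
For reference, the two enforcement mechanisms contrasted in this change could be configured roughly as follows. This is a minimal sketch rather than a recommended setup: the option names come from the man pages touched above, while the file layout and the pairing with jobacct_gather/linux are assumptions for illustration.

```
# Sketch 1: cgroup-based enforcement.
# A process exceeding the memory limit results in the job step being killed.

# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainRAMSpace=yes

# Sketch 2: polling-based enforcement (assumed alternative to Sketch 1).
# The limit is applied to each process individually; the offending process
# is killed and the step continues to run.

# slurm.conf
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=OverMemoryKill
```

Because the second mechanism is polling based, there is a delay before the kill occurs, which is why the slurm.conf text recommends the cgroup approach for memory enforcement.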