From 82bbae6600dd8b5973ca1bdb9c75f9d1bfc4895b Mon Sep 17 00:00:00 2001
From: Ben Roberts <ben@schedmd.com>
Date: Thu, 8 Apr 2021 15:42:35 -0500
Subject: [PATCH] Docs - Clarify different OOM behavior for cgroups vs polling

Bug 11318
---
 doc/man/man5/cgroup.conf.5 | 8 +++++++-
 doc/man/man5/slurm.conf.5  | 7 +++++--
 2 files changed, 12 insertions(+), 3 deletions(-)
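
The difference this patch documents can be illustrated with a small toy
model. This is not Slurm code: a step is assumed to be a mapping of
hypothetical PIDs to RSS values in MB with a single memory limit, and
polling delay is ignored. OverMemoryKill checks each process against the
limit on its own. ConstrainRAMSpace=yes applies the limit to the step as a
whole and kills the step when that limit is exceeded.

```python
"""Toy model of the OOM behaviours described in this patch (illustrative only)."""
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StepResult:
    step_killed: bool          # True if the whole step was killed as a unit
    killed_pids: list[int]     # processes that were killed
    surviving_pids: list[int]  # processes still running afterwards


def over_memory_kill(rss_by_pid: dict[int, int], limit_mb: int) -> StepResult:
    # JobAcctGather polling with OverMemoryKill: the limit is applied to each
    # process individually, so only the offending processes are killed and
    # the step itself keeps running.
    killed = [pid for pid, rss in rss_by_pid.items() if rss > limit_mb]
    survivors = [pid for pid in rss_by_pid if pid not in killed]
    return StepResult(step_killed=False, killed_pids=killed, surviving_pids=survivors)


def constrain_ram_space(rss_by_pid: dict[int, int], limit_mb: int) -> StepResult:
    # task/cgroup with ConstrainRAMSpace=yes: the limit covers the step as a
    # whole, so the total usage is compared and exceeding it kills the step.
    if sum(rss_by_pid.values()) > limit_mb:
        return StepResult(step_killed=True, killed_pids=list(rss_by_pid), surviving_pids=[])
    return StepResult(step_killed=False, killed_pids=[], surviving_pids=list(rss_by_pid))


if __name__ == "__main__":
    step = {101: 300, 102: 900, 103: 200}  # hypothetical PID -> RSS in MB
    print(over_memory_kill(step, 800))      # only PID 102 is over the limit
    print(constrain_ram_space(step, 800))   # step total of 1400 MB exceeds the limit
```

With an 800 MB limit, the polling model kills only PID 102 and leaves the
step running, while the cgroup model kills the whole step because the
combined usage is over the limit.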

diff --git a/doc/man/man5/cgroup.conf.5 b/doc/man/man5/cgroup.conf.5
index 2bb2846e452..f9d30e367dc 100644
--- a/doc/man/man5/cgroup.conf.5
+++ b/doc/man/man5/cgroup.conf.5
@@ -116,7 +116,13 @@ which case the job's RAM limit will be set to its swap space limit if
 \fBConstrainSwapSpace\fR is set to "yes".
 Also see \fBAllowedSwapSpace\fR, \fBAllowedRAMSpace\fR and
 \fBConstrainSwapSpace\fR.
-NOTE: When enabled, ConstrainRAMSpace can lead to a noticeable decline in
+
+\fBNOTE\fR: When using \fBConstrainRAMSpace\fR, if a process tries to consume
+more memory than is available, the step that process is running in will be
+killed. This differs from the behavior when using \fBOverMemoryKill\fR,
+where just the offending process will be killed.
+
+\fBNOTE\fR: When enabled, ConstrainRAMSpace can lead to a noticeable decline in
 per-node job throughout. Sites with high-throughput requirements should
 carefully weigh the tradeoff between per-node throughput, versus potential
 problems that can arise from unconstrained memory usage on the node. See
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index 29dc090cacc..0c8a846f022 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1207,8 +1207,11 @@ allocation may affect other processes and/or machine health.
 task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the
 cgroup.conf instead of using this JobAcctGather mechanism for memory
 enforcement. With OverMemoryKill, memory limit is applied against each process
-individually and is not applied to the step as a whole as it is with
-ConstrainRAMSpace=yes. Using JobAcctGather is polling based and there is a
+individually and is not applied to the step as a whole. This means that when
+jobs have a process that consumes too much memory, the process will be killed
+but the step will continue to run. When using cgroups with
+ConstrainRAMSpace=yes, a process that consumes too much memory will result in
+the job step being killed. Using JobAcctGather is polling based and there is a
 delay before a job is killed, which could lead to system Out of Memory events.
 .RE
 
-- 
GitLab