Alejandro Sanchez authored
Commit 818a09e8 introduced a new state JOB_OOM and a new state reason
FAIL_OOM (OutOfMemory). The problem was that it based the decision on
any of the memory.[*].failcnt counters being > 0.

That led to "false positive" situations when the usage hit the limit
but the kernel was able to reclaim pages and the process managed to finish
successfully. When this happens there might not necessarily be any
OOM_KILL events.

This patch makes it so the JOB_OOM state is set based upon OOM_KILL events
detected instead of usage hitting the limit. The usage hit will still
be logged as an info() message, and further work will be needed in the
master branch to better discern both types of events, maybe changing
the API and getting rid of the current SIG_OOM and a potential new
SIG_OOM_KILL.
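
The decision described above can be sketched as follows. This is only an
illustration and not the code from this patch: the enum, its values and the
function name are hypothetical, and the two counters are assumed to have been
collected elsewhere (failcnt from the memory.[*].failcnt files, the kill count
from the OOM_KILL notifications).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of a step's memory accounting. */
enum mem_outcome {
	MEM_OK,          /* limit never reached */
	MEM_LIMIT_HIT,   /* limit reached, but no process was killed */
	MEM_OOM_KILLED   /* at least one OOM_KILL event was seen */
};

/*
 * Classify using both counters. Only MEM_OOM_KILLED would map to the
 * JOB_OOM state; MEM_LIMIT_HIT would just be logged with info().
 */
enum mem_outcome classify_memory_outcome(uint64_t failcnt,
					 uint64_t oom_kill_events)
{
	if (oom_kill_events > 0)
		return MEM_OOM_KILLED;
	if (failcnt > 0)
		return MEM_LIMIT_HIT;
	return MEM_OK;
}
```

With this split, a step whose usage hit the limit while the kernel reclaimed
pages in time ends up as MEM_LIMIT_HIT rather than being reported as OOM.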

The OOM_KILL event is detected using the eventfd notification mechanism
on the cgroup v1 control/event files:
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
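
A minimal sketch of that registration, following the steps in the kernel
document linked above: create an eventfd, open memory.oom_control, and write
"<event_fd> <fd of memory.oom_control>" to cgroup.event_control. The function
names and the cgroup directory argument are hypothetical, and this is not
the code from this patch. The sketch closes the control file descriptors
after registering, on the assumption that the kernel keeps its own reference
to the eventfd; keeping them open is equally valid.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Register an eventfd for OOM notifications on a cgroup v1 memory
 * cgroup directory (hypothetical path, e.g. under /sys/fs/cgroup/memory).
 * Returns the eventfd on success, -1 on error.
 */
int oom_eventfd_register(const char *cgroup_dir)
{
	char path[4096], line[64];
	int efd = -1, ofd = -1, cfd = -1, len;

	assert(cgroup_dir != NULL);

	efd = eventfd(0, EFD_CLOEXEC);
	if (efd < 0)
		goto fail;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	ofd = open(path, O_RDONLY | O_CLOEXEC);
	if (ofd < 0)
		goto fail;

	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	cfd = open(path, O_WRONLY | O_CLOEXEC);
	if (cfd < 0)
		goto fail;

	/* Format required by the kernel: "<event_fd> <control_fd>" */
	len = snprintf(line, sizeof(line), "%d %d", efd, ofd);
	if (write(cfd, line, (size_t)len) != (ssize_t)len)
		goto fail;

	close(cfd);
	close(ofd);
	return efd;

fail:
	if (cfd >= 0)
		close(cfd);
	if (ofd >= 0)
		close(ofd);
	if (efd >= 0)
		close(efd);
	return -1;
}

/*
 * Block until the eventfd is signalled, then store the number of
 * notifications accumulated since the last read in *count.
 * Returns 0 on success, -1 on error.
 */
int oom_eventfd_wait(int efd, uint64_t *count)
{
	ssize_t n = read(efd, count, sizeof(*count));

	return (n == (ssize_t)sizeof(*count)) ? 0 : -1;
}
```

A monitoring thread would call oom_eventfd_register() once per step cgroup
and then loop on oom_eventfd_wait(), adding each returned count to its
running total of events.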

If we plan to support cgroup v2, we should monitor file-modified events on
the 'memory.events' file. Such a notification means that at least one of
the available entries changed its value.
Entries include: low, high, max, oom, oom_kill:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
https://patchwork.kernel.org/patch/9737381
However, since this is a fairly recent change, many sites might still be
running kernels that do not support this feature.
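
Since a notification on 'memory.events' only says that some entry changed,
a cgroup v2 implementation would have to re-read the file and compare the
oom_kill counter with its previous value. A rough sketch of that comparison
follows, assuming the flat "key value" layout described in the cgroup v2
document; the function names are hypothetical, and how the notification is
received is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one entry in the text of a memory.events file, which has
 * one "key value" pair per line. Returns 0 and stores the counter in
 * *value if the key is present, -1 otherwise.
 */
int memory_events_get(const char *text, const char *key,
		      unsigned long long *value)
{
	const char *line = text;
	size_t klen;

	assert(text != NULL && key != NULL && value != NULL);
	klen = strlen(key);

	while (*line) {
		/* Require a space after the key so "oom" != "oom_kill" */
		if (!strncmp(line, key, klen) && line[klen] == ' ') {
			const char *num = line + klen + 1;
			char *end;
			unsigned long long v;

			errno = 0;
			v = strtoull(num, &end, 10);
			if (errno || end == num)
				return -1;
			*value = v;
			return 0;
		}
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return -1;
}

/*
 * Compare two snapshots of memory.events.
 * Returns 1 if oom_kill increased, 0 if it did not, and -1 if either
 * snapshot lacks the oom_kill entry (e.g. a kernel without support).
 */
int oom_kill_occurred(const char *before, const char *after)
{
	unsigned long long prev, cur;

	if (memory_events_get(before, "oom_kill", &prev) ||
	    memory_events_get(after, "oom_kill", &cur))
		return -1;
	return cur > prev;
}
```

The -1 return lets the caller detect kernels that lack the oom_kill entry
and decide on a fallback there.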

Bug 3820.
943c4a13