Alejandro Sanchez
authored
Commit 818a09e8 introduced a new state JOB_OOM and a new state reason FAIL_OOM (OutOfMemory). The problem was that it based the decision on the value of the different memory.[*].failcnt counters being > 0. That led to "false positive" situations where the usage hit the limit but the kernel was able to reclaim pages and the process managed to finish successfully. When this happens, there are not necessarily any OOM_KILL events.

This patch makes it so the JOB_OOM state is set based on detected OOM_KILL events instead of on usage hitting the limit. The usage hit will still be logged as an info() message, and further work will be needed in the master branch to better discern both types of events, maybe changing the API and getting rid of the current SIG_OOM and a potential new SIG_OOM_KILL.

The OOM_KILL event is detected using the eventfd notification mechanism on the cgroup v1 control/event files:

https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

If we plan to support cgroup v2, we should monitor file-modified events on the 'memory.events' file. Such a notification means that at least one of the available entries changed its value. Entries include: low, high, max, oom, oom_kill:

https://www.kernel.org/doc/Documentation/cgroup-v2.txt
https://patchwork.kernel.org/patch/9737381

But since this is a fairly recent change, many sites might be running kernels that do not yet support this feature.

Bug 3820.
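
For the cgroup v2 direction mentioned above, the following is a minimal sketch of how the oom_kill counter could be read from the text of a 'memory.events' file once a modification is noticed. It is not part of this patch or of Slurm: the helper name events_get() is hypothetical, and the sketch assumes the file holds one "key value" pair per line, using the entry names listed in the message.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only, not a Slurm function).
 * Looks up `key` in memory.events-style text ("key value" per line)
 * and returns its value, or -1 if the key is absent or malformed.
 */
int64_t events_get(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);

    size_t klen = strlen(key);
    const char *line = text;

    while (*line) {
        /*
         * Require a space right after the key so that "oom"
         * does not match the "oom_kill" line.
         */
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
            uint64_t value;
            if (sscanf(line + klen + 1, "%" SCNu64, &value) == 1)
                return (int64_t)value;
            return -1;
        }

        /* Advance to the start of the next line, if any. */
        const char *nl = strchr(line, '\n');
        if (!nl)
            break;
        line = nl + 1;
    }
    return -1;
}
```

A caller could record events_get(text, "oom_kill") when the step starts and compare it with the value read after each notification. Only an increase in that counter would indicate an actual kill, as opposed to the usage merely hitting the limit.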