From a93afcd17cfb6acab07db37aa5590f88ea4f82c6 Mon Sep 17 00:00:00 2001
From: "Mark A. Grondona" <mgrondona@llnl.gov>
Date: Sat, 17 Mar 2012 09:23:16 -0700
Subject: [PATCH] task/cgroup: delete job step memcg instead of using force_empty

The current task/cgroup memory code writes to force_empty at job step
completion and then waits for the release agent to be triggered to
remove the memcg. However, force_empty only causes clean cache pages
to be dropped from the memcg and does not actually move charges to
the parent [1].

This has two unfortunate side-effects. First, pages that can't be
dropped by force_empty are in-use and could stay that way indefinitely
(e.g. system library that is in-use until just after force_empty
completes). Thus, the step memcg never becomes 'empty' and the release
agent is not activated. Second, cached pages that can be freed are
likely associated with the job itself, and those files and libraries
will have to be paged in again for subsequent job steps.

In contrast, calling rmdir(2) on a memcg with no active tasks causes
*all* current charges to move to parent, which is really what we want
in this case. This allows cached libraries and binaries to stay
resident and be associated with the job, and also ensures that the
step memcg is removed immediately as the job step ends.

Thus, this patch replaces the write to force_empty with a call to
xcgroup_delete() on the step memcg, which in turn removes the memcg
with rmdir(2).

The functionality of this patch depends on the previous fix that uses
xcgroup_move_process() to move slurmstepd to the root memcg. Otherwise,
there will be leftover slurmstepd threads in the job step memcg, and
the rmdir will fail with EBUSY.

[1] Sec 4.3: http://www.kernel.org/doc/Documentation/cgroups/memory.txt
---
 src/plugins/task/cgroup/task_cgroup_memory.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/plugins/task/cgroup/task_cgroup_memory.c b/src/plugins/task/cgroup/task_cgroup_memory.c
index ad5ad3875e5..ef37a0685d6 100644
--- a/src/plugins/task/cgroup/task_cgroup_memory.c
+++ b/src/plugins/task/cgroup/task_cgroup_memory.c
@@ -175,19 +175,22 @@ extern int task_cgroup_memory_fini(slurm_cgroup_conf_t *slurm_cgroup_conf)
 		return SLURM_SUCCESS;
 
 	/*
-	 * Move the slurmstepd back to the root memory cg and force empty
+	 * Move the slurmstepd back to the root memory cg and remove[*]
 	 * the step cgroup to move its allocated pages to its parent.
-	 * The release_agent will asynchroneously be called for the step
-	 * cgroup. It will do the necessary cleanup.
-	 * It should be good if this force_empty mech could be done directly
-	 * by the memcg implementation at the end of the last task managed
-	 * by a cgroup. It is too difficult and near impossible to handle
-	 * that cleanup correctly with current memcg.
+	 *
+	 * [*] Calling rmdir(2) on an empty cgroup moves all resident charged
+	 * pages to the parent (i.e. the job cgroup). (If force_empty were
+	 * used instead, only clean pages would be flushed). This keeps
+	 * resident pagecache pages associated with the job. It is expected
+	 * that the job epilog will then optionally force_empty the
+	 * job cgroup (to flush pagecache), and then rmdir(2) the cgroup
+	 * or wait for release notification from kernel.
 	 */
 	if (xcgroup_create(&memory_ns,&memory_cg,"",0,0) == XCGROUP_SUCCESS) {
 		xcgroup_move_process(&memory_cg, getpid());
 		xcgroup_destroy(&memory_cg);
-		xcgroup_set_param(&step_memory_cg,"memory.force_empty","1");
+		if (xcgroup_delete(&step_memory_cg) != XCGROUP_SUCCESS)
+			error ("cgroup: rmdir step memcg failed: %m");
 	}
 
 	xcgroup_destroy(&user_memory_cg);
-- 
GitLab
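
The removal step described in the commit message can be sketched in isolation. This is a minimal illustration under stated assumptions, not Slurm's implementation: remove_step_dir() is a hypothetical helper, and the only grounded behaviour is that the step memcg is removed with rmdir(2) and that the call fails with EBUSY if tasks remain in the group.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Slurm): remove a step cgroup
 * directory with rmdir(2). Returns 0 on success, or the negated
 * errno value on failure.
 */
static int remove_step_dir(const char *path)
{
	if (rmdir(path) == 0)
		return 0;

	int err = errno;	/* save errno before stdio can change it */

	if (err == EBUSY)
		fprintf(stderr, "%s: group still has tasks or children\n",
			path);
	else
		fprintf(stderr, "%s: rmdir failed: %s\n", path,
			strerror(err));
	return -err;
}
```

A caller would pass the path of the step directory under the job cgroup. A return of -EBUSY would suggest that some task, such as a leftover slurmstepd thread, was not moved out of the group first.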