    task/cgroup: delete job step memcg instead of using force_empty · a93afcd1
    Mark A. Grondona authored
    The current task/cgroup memory code writes to force_empty at job step
    completion and then waits for the release agent to be triggered to
    remove the memcg. However, force_empty only causes clean cache pages
    to be dropped from the memcg and does not actually move charges to
    the parent [1].
    
    This has two unfortunate side-effects. First, pages that can't be
    dropped by force_empty are in use and could stay that way indefinitely
    (e.g. a system library that is in use until just after force_empty
    completes). Thus, the step memcg never becomes 'empty' and the release
    agent is not activated. Second, cached pages that can be freed are
    likely associated with the job itself, and those files and libraries
    will have to be paged in again for subsequent job steps.
    
    In contrast, calling rmdir(2) on a memcg with no active tasks
    causes *all* current charges to move to the parent, which is really what
    we want in this case. This allows cached libraries and binaries to
    stay resident and be associated with the job, and also ensures that
    the step memcg is removed immediately as the job step ends.
    
    Thus, this patch replaces the write to force_empty with a call
    to xcgroup_delete() on the step memcg, which in turn removes
    the memcg with rmdir(2).
    
    The functionality of this patch depends on the previous fix that
    uses xcgroup_move_process() to move slurmstepd to the root memcg.
    Otherwise, there will be leftover slurmstepd threads in the job
    step memcg, and the rmdir will fail with EBUSY.
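    
    The removal step and its EBUSY failure mode can be sketched roughly
    as follows. This is only an illustration, not the Slurm code:
    delete_step_memcg() is a hypothetical helper name, and the real
    xcgroup_delete() may differ in signature and error handling.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations such as rmdir(2) */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the Slurm API): remove a cgroup directory
 * with rmdir(2). Returns 0 on success and -1 on failure, leaving errno
 * set to the value reported by rmdir(2).
 */
int delete_step_memcg(const char *path)
{
    if (rmdir(path) == 0)
        return 0;

    /* Save errno, since fprintf() is allowed to modify it. */
    int saved = errno;

    if (saved == EBUSY)
        /* Expected when tasks (e.g. leftover slurmstepd threads)
         * are still attached to the memcg. */
        fprintf(stderr, "%s: cgroup still busy\n", path);
    else
        fprintf(stderr, "rmdir(%s): %s\n", path, strerror(saved));

    errno = saved;
    return -1;
}
```

    A caller would pass the step memcg directory once every task has
    been moved out of it; a return of -1 with errno set to EBUSY is the
    failure described above.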
    
     [1] Sec 4.3: http://www.kernel.org/doc/Documentation/cgroups/memory.txt