From fec5e03b900aebba21199a922cb75acf182a6abc Mon Sep 17 00:00:00 2001
From: Morris Jette <jette@schedmd.com>
Date: Wed, 9 Mar 2016 14:23:25 -0800
Subject: [PATCH] cray job requeue bug

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees the "cleaned" flag
already set, it then logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.
Bug 2384
---
 NEWS                                  | 2 ++
 src/plugins/select/cray/select_cray.c | 8 ++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 1135bab7843..1e8aa6eeb5a 100644
--- a/NEWS
+++ b/NEWS
@@ -34,6 +34,8 @@ documents those changes that are of interest to users and administrators.
  -- Fix display for RoutePlugin parameter to display the correct value.
  -- Fix route/topology plugin to prevent segfault in sbcast when in use.
  -- Fix Cray slurmconfgen_smw.py script to use nid as nid, not nic.
+ -- Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
+    allocated to a requeued job as non-usable on job termination.
 
 * Changes in Slurm 15.08.8
 ==========================
diff --git a/src/plugins/select/cray/select_cray.c b/src/plugins/select/cray/select_cray.c
index 2f9643743d3..f45703e7c35 100644
--- a/src/plugins/select/cray/select_cray.c
+++ b/src/plugins/select/cray/select_cray.c
@@ -1824,12 +1824,16 @@ extern int select_p_job_begin(struct job_record *job_ptr)
 
 	xassert(job_ptr->select_jobinfo->data);
 	jobinfo = job_ptr->select_jobinfo->data;
+	jobinfo->cleaning = CLEANING_INIT;	/* Reset needed if requeued */
 
 	slurm_mutex_lock(&blade_mutex);
 
-	if (!jobinfo->blade_map)
+	if (!jobinfo->blade_map) {
 		jobinfo->blade_map = bit_alloc(blade_cnt);
-
+	} else {	/* Clear vestigial bitmap in case job requeued */
+		bit_nclear(jobinfo->blade_map, 0,
+			   bit_size(jobinfo->blade_map) - 1);
+	}
 	_set_job_running(job_ptr);
 
 	/* char *tmp3 = bitmap2node_name(blade_nodes_running_npc); */
-- 
GitLab
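
Note for reviewers: the flag lifecycle described in the commit message can be sketched as a small stand-alone model. This is only an illustration under stated assumptions, not the plugin's code: `struct toy_job`, the `toy_*` functions, and the `reset_on_begin` switch are hypothetical, and the state names other than `CLEANING_INIT` are assumed rather than taken from this patch. The model completes the NHC synchronously, which the real plugin may not do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the plugin's cleaning states. Only
 * CLEANING_INIT appears in the patch; the other names are assumed. */
enum cleaning_state { CLEANING_INIT = 0, CLEANING_STARTED, CLEANING_COMPLETE };

/* Toy job record: just the fields needed to show the requeue problem. */
struct toy_job {
	enum cleaning_state cleaning;
	int nhc_runs;		/* number of NHC launches so far */
	bool resources_held;	/* CPUs, memory, GRES still allocated */
};

/* Job start. reset_on_begin == true models the patched behavior,
 * where the cleaning flag is reset each time the job begins. */
static void toy_job_begin(struct toy_job *job, bool reset_on_begin)
{
	assert(job);
	if (reset_on_begin)
		job->cleaning = CLEANING_INIT;
	job->resources_held = true;
}

/* Job termination. Returns true if the NHC was launched. */
static bool toy_job_fini(struct toy_job *job)
{
	assert(job);
	if (job->cleaning == CLEANING_COMPLETE)
		return false;	/* "Cleaned flag already set" path: no NHC */
	job->cleaning = CLEANING_STARTED;
	job->nhc_runs++;
	/* In this model the NHC finishes immediately; its completion is
	 * what releases the job's resources. */
	job->cleaning = CLEANING_COMPLETE;
	job->resources_held = false;
	return true;
}

/* Start, terminate, requeue, terminate again. Returns true if the
 * resources are free after the second termination. */
static bool toy_requeue_releases(bool reset_on_begin)
{
	struct toy_job job = { CLEANING_INIT, 0, false };

	toy_job_begin(&job, reset_on_begin);
	toy_job_fini(&job);
	toy_job_begin(&job, reset_on_begin);	/* requeued run */
	toy_job_fini(&job);
	return !job.resources_held;
}
```

Without the reset, the second `toy_job_fini()` call takes the early-return path and the resources stay held, which matches the symptom in Bug 2384. With the reset, the second termination launches the NHC again and the resources are released.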