Commit fec5e03b authored by Morris Jette

cray job requeue bug

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees the "cleaned" flag
already set, logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.
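
The flag lifecycle described above can be modeled with a small sketch. This is not the select/cray plugin code: the struct, the function names, and the state names other than CLEANING_INIT (which appears in the diff) are assumptions made for illustration, and the log text is a placeholder. The `reset_on_begin` switch models the behavior before and after this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the per-job cleaning states. Only
 * CLEANING_INIT appears in the diff; the other names are assumptions. */
enum cleaning_state { CLEANING_INIT, CLEANING_STARTED, CLEANING_COMPLETE };

/* Hypothetical per-job record: the flag plus a counter of NHC launches. */
struct jobinfo {
    enum cleaning_state cleaning;
    int nhc_runs;
};

/* Job start. reset_on_begin == false models the logic before the fix. */
static void job_begin(struct jobinfo *j, bool reset_on_begin)
{
    if (reset_on_begin)
        j->cleaning = CLEANING_INIT;    /* reset needed if requeued */
}

/* Job termination. Returns true if the NHC was launched, which is what
 * eventually releases the job's resources. */
static bool job_fini(struct jobinfo *j)
{
    if (j->cleaning == CLEANING_COMPLETE) {
        /* Stale flag from a previous run: bail out without launching. */
        fprintf(stderr, "cleaned flag already set, NHC not launched\n");
        return false;
    }
    j->cleaning = CLEANING_STARTED;     /* "cleaning" */
    j->nhc_runs++;                      /* NHC runs here in this model */
    j->cleaning = CLEANING_COMPLETE;    /* "cleaned" */
    return true;
}

/* Run a job twice (initial run plus one requeue) and report how many
 * times the NHC was launched. */
static int run_with_requeue(bool reset_on_begin)
{
    struct jobinfo j = { CLEANING_INIT, 0 };

    for (int run = 0; run < 2; run++) {
        job_begin(&j, reset_on_begin);
        job_fini(&j);
    }
    return j.nhc_runs;
}
```

In this model, `run_with_requeue(false)` launches the NHC only once, so the second termination never releases resources, while `run_with_requeue(true)` launches it on both terminations.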

Bug 2384
parent 88ccc111
@@ -34,6 +34,8 @@ documents those changes that are of interest to users and administrators.
  -- Fix display for RoutePlugin parameter to display the correct value.
  -- Fix route/topology plugin to prevent segfault in sbcast when in use.
  -- Fix Cray slurmconfgen_smw.py script to use nid as nid, not nic.
+ -- Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
+    allocated to a requeued job as non-usable on job termination.
 * Changes in Slurm 15.08.8
 ==========================
......
@@ -1824,12 +1824,16 @@ extern int select_p_job_begin(struct job_record *job_ptr)
     xassert(job_ptr->select_jobinfo->data);
     jobinfo = job_ptr->select_jobinfo->data;
+    jobinfo->cleaning = CLEANING_INIT;    /* Reset needed if requeued */
     slurm_mutex_lock(&blade_mutex);
-    if (!jobinfo->blade_map)
+    if (!jobinfo->blade_map) {
         jobinfo->blade_map = bit_alloc(blade_cnt);
+    } else {    /* Clear vestigial bitmap in case job requeued */
+        bit_nclear(jobinfo->blade_map, 0,
+                   bit_size(jobinfo->blade_map) - 1);
+    }
     _set_job_running(job_ptr);
     /* char *tmp3 = bitmap2node_name(blade_nodes_running_npc); */
......
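
The second change, allocating the blade bitmap on first start and clearing it on later starts, can be illustrated with a standalone sketch. The tiny bitmap type and every function name below are hypothetical; they are not Slurm's bitstring API (the real code uses `bit_alloc`, `bit_nclear`, and `bit_size` as shown in the diff). The sketch only shows why a bitmap reused across a requeue should be cleared first.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal bitmap; NOT Slurm's bitstring implementation. */
struct bitmap {
    size_t nbits;
    unsigned char *bytes;
};

static struct bitmap *bitmap_alloc(size_t nbits)
{
    struct bitmap *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    b->nbits = nbits;
    b->bytes = calloc((nbits + 7) / 8, 1);  /* all bits start cleared */
    if (!b->bytes) {
        free(b);
        return NULL;
    }
    return b;
}

static void bitmap_free(struct bitmap *b)
{
    if (b) {
        free(b->bytes);
        free(b);
    }
}

static void bitmap_set(struct bitmap *b, size_t i)
{
    b->bytes[i / 8] |= (unsigned char)(1u << (i % 8));
}

static int bitmap_test(const struct bitmap *b, size_t i)
{
    return (b->bytes[i / 8] >> (i % 8)) & 1;
}

/* Mirrors the shape of the patched branch: allocate when there is no
 * map yet, otherwise wipe the bits left over from the previous run. */
static struct bitmap *prepare_map(struct bitmap *map, size_t nbits)
{
    if (!map)
        return bitmap_alloc(nbits);
    memset(map->bytes, 0, (map->nbits + 7) / 8);
    return map;
}

/* Simulate a first run that marks blade 3, then a requeue. Returns 1 if
 * the stale bit is still set at the second start, 0 if not, -1 on
 * allocation failure. */
static int stale_bit_after_requeue(int clear_on_reuse)
{
    struct bitmap *map = prepare_map(NULL, 16);
    int stale;

    if (!map)
        return -1;
    bitmap_set(map, 3);                 /* first run used blade 3 */
    if (clear_on_reuse)
        map = prepare_map(map, 16);     /* patched behavior */
    stale = bitmap_test(map, 3);
    bitmap_free(map);
    return stale;
}
```

Without the clearing step the bit from the first run survives into the second, which is the "vestigial bitmap" the code comment in the diff refers to.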