Commit d72b13f2 authored by Morris Jette

Fix for backfill launch job with reboot

This bug was likely the root cause of bug 3366. When the backfill scheduler
  allocated resources for a batch job and a node reboot was required, the
  batch launch RPC was still sent to the agent. At that point there was a
  race between the agent and the job_time_limit() function, which tests
  for boot completion. If job_time_limit() ran first, it caused a second
  launch RPC to be sent to the agent.
bug 3366
parent f9804256
@@ -1856,8 +1856,7 @@ static int _start_job(struct job_record *job_ptr, bitstr_t *resv_bitmap)
 	power_g_job_start(job_ptr);
 	if (job_ptr->batch_flag == 0)
 		srun_allocate(job_ptr->job_id);
-	else if ((job_ptr->details == NULL) ||
-		 (job_ptr->details->prolog_running == 0))
+	else if (!IS_JOB_CONFIGURING(job_ptr))
 		launch_job(job_ptr);
 	slurmctld_diag_stats.backfilled_jobs++;
 	slurmctld_diag_stats.last_backfilled_jobs++;
......