From 5e9dca41b54c364ad3577b9759702443c66c5bf7 Mon Sep 17 00:00:00 2001
From: Don Lipari <lipari1@llnl.gov>
Date: Mon, 7 May 2012 09:49:14 -0700
Subject: [PATCH] Job priority reset bug on slurmctld restart

Commit 8b14f3880d0 (Jan 19, 2011) is causing problems on Moab
cluster-scheduled machines. On these machines, Moab immediately hands
every submitted job off to SLURM, where it receives a zero priority.
Once Moab schedules a job, it raises the job's priority to 10,000,000
and the job runs.

When slurmctld is restarted under these conditions, the
sync_job_priorities() function runs and attempts to raise job
priorities into a higher range when they get too close to zero. The
problem, as I see it, is that the "boost" is also applied to
zero-priority jobs. Hence, once slurmctld is restarted, a batch of
zero-priority jobs suddenly becomes eligible, and there is a
disconnect between the top-priority job Moab is trying to start and
the top-priority job SLURM sees.

I believe the fix is simple:

diff job_mgr.c~ job_mgr.c
6328,6329c6328,6331
< 	while ((job_ptr = (struct job_record *) list_next(job_iterator)))
< 		job_ptr->priority += prio_boost;
---
> 	while ((job_ptr = (struct job_record *) list_next(job_iterator))) {
> 		if (job_ptr->priority)
> 			job_ptr->priority += prio_boost;
> 	}

Do you agree?
Don
---
 src/slurmctld/job_mgr.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
index 09fbb4eeb28..f2cdc0eaccb 100644
--- a/src/slurmctld/job_mgr.c
+++ b/src/slurmctld/job_mgr.c
@@ -6325,8 +6325,10 @@ extern void sync_job_priorities(void)
 		return;
 
 	job_iterator = list_iterator_create(job_list);
-	while ((job_ptr = (struct job_record *) list_next(job_iterator)))
-		job_ptr->priority += prio_boost;
+	while ((job_ptr = (struct job_record *) list_next(job_iterator))) {
+		if (job_ptr->priority > 1)
+			job_ptr->priority += prio_boost;
+	}
	list_iterator_destroy(job_iterator);
	lowest_prio += prio_boost;
 }
-- 
GitLab
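
For readers who want to see the behavior in isolation, below is a
minimal standalone sketch of the guarded boost. It is not SLURM code:
job_rec, jobs[], and sync_priorities() are toy stand-ins for SLURM's
struct job_record, job_list, and sync_job_priorities(). It uses the
nonzero test proposed in the message above; note the committed fix
tests priority > 1 instead.

#include <stdio.h>
#include <stdint.h>

struct job_rec {
	uint32_t job_id;
	uint32_t priority;	/* 0 means held (e.g. awaiting Moab) */
};

/* Boost every eligible job's priority; skip held (zero-priority)
 * jobs so a restart does not suddenly make them runnable. */
static void sync_priorities(struct job_rec *jobs, int njobs,
			    uint32_t prio_boost)
{
	for (int i = 0; i < njobs; i++) {
		if (jobs[i].priority)	/* the fix: leave held jobs at 0 */
			jobs[i].priority += prio_boost;
	}
}

int main(void)
{
	struct job_rec jobs[] = {
		{ 101, 10000000 },	/* already scheduled by Moab */
		{ 102, 0 },		/* still held */
		{ 103, 5 },		/* eligible, close to zero */
	};

	sync_priorities(jobs, 3, 100000);
	for (int i = 0; i < 3; i++)
		printf("job %u: priority %u\n",
		       jobs[i].job_id, jobs[i].priority);
	/* job 102 stays at 0: it remains held after the restart */
	return 0;
}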