Commit 14763717 authored by Moe Jette

Add call to ping_nodes based upon health_check_interval.

Also call ping_nodes even when SlurmdTimeout=0 so that
non-responsive nodes can be returned to service.
parent 718fbb2d
@@ -1014,13 +1014,18 @@ static void *_slurmctld_background(void *no_data)
 	last_sched_time = last_checkpoint_time = last_group_time = now;
 	last_purge_job_time = last_trigger = now;
 	last_timelimit_time = last_assert_primary_time = now;
-	if (slurmctld_conf.slurmd_timeout) {
-		/* We ping nodes that haven't responded in SlurmdTimeout/2,
+	if (slurmctld_conf.slurmd_timeout ||
+	    slurmctld_conf.health_check_interval) {
+		/* We ping nodes that haven't responded in SlurmdTimeout/3,
 		 * but need to do the test at a higher frequency or we might
 		 * DOWN nodes with times that fall in the gap. */
-		ping_interval = slurmctld_conf.slurmd_timeout / 3;
-	} else
-		ping_interval = 60 * 60 * 24 * 356;	/* one year */
+		ping_interval = MIN((slurmctld_conf.slurmd_timeout/3),
+				    slurmctld_conf.health_check_interval);
+	} else {
+		/* This will just ping non-responding nodes
+		 * and restore them to service */
+		ping_interval = 100;	/* 100 seconds */
+	}
 	last_ping_node_time = now + (time_t)MIN_CHECKIN_TIME - ping_interval;
 	last_ping_srun_time = now;
 	last_node_acct = now;
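
The interval selection in this change can be exercised in isolation. The sketch below is not part of the commit: `struct ping_conf`, both function names, and the field widths are hypothetical stand-ins for the two `slurmctld_conf` fields the diff reads, and `MIN` is assumed to be the usual two-argument minimum macro. The first function mirrors the committed expression. The second is one possible variant that skips whichever setting is zero, since a plain `MIN` evaluates to 0 whenever only one of the two settings is enabled.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two slurmctld_conf fields used in the diff.
 * Field widths are an assumption. */
struct ping_conf {
	uint16_t slurmd_timeout;        /* SlurmdTimeout, seconds; 0 = disabled */
	uint16_t health_check_interval; /* HealthCheckInterval, seconds; 0 = disabled */
};

/* Assumed to match the MIN macro the diff relies on. */
#define MIN(a, b) (((a) < (b)) ? (a) : (b))

/* Mirrors the expression in the new code of the diff. */
static unsigned ping_interval_as_committed(const struct ping_conf *c)
{
	if (c->slurmd_timeout || c->health_check_interval)
		return MIN((c->slurmd_timeout / 3), c->health_check_interval);
	return 100; /* both disabled: only re-ping non-responding nodes */
}

/* Hypothetical variant: ignore a term that is zero instead of letting
 * it win the MIN(). Not taken from the commit. */
static unsigned ping_interval_skip_disabled(const struct ping_conf *c)
{
	unsigned t = c->slurmd_timeout / 3;
	unsigned h = c->health_check_interval;

	if (t && h)
		return MIN(t, h);
	if (t || h)
		return t ? t : h; /* exactly one term is non-zero */
	return 100;           /* same fallback as the diff */
}
```

With `SlurmdTimeout=300` and `HealthCheckInterval=0`, the committed expression gives 0 and the variant gives 100. With both set (300 and 60), both functions give 60. How the background loop treats a zero `ping_interval` is outside the hunk shown here, so whether the variant is needed depends on code not visible in this diff.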