Morris Jette authored
wait to retry or not.

I discovered this bug during regression testing. Some similar situations will
result in srun continuously issuing step create requests and the
launch_common_create_job_step() function not sleeping between RPCs.
Basically launch_common_create_job_step() sleeps for some error codes
and srun retries the step create on some error codes. The problem is
that those error codes do not match in both places, resulting in
constant retries without sleeps. This situation is very likely with
job preemption combined with salloc, but other conditions can trigger
the same event. The following errno values all trigger this situation:
EAGAIN, ESLURM_DISABLED, ESLURM_POWER_NOT_AVAIL, ESLURM_POWER_RESERVED,
ESLURM_PROLOG_RUNNING, ESLURM_INTERCONNECT_BUSY.
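
One way to keep the two lists from drifting apart again is to have both the
retry decision and the sleep decision consult a single predicate. The sketch
below is illustrative only and is not the code from this commit:
step_create_is_transient(), create_step_with_retry(), the callback types and
the demo_* helpers are hypothetical names, and the numeric values given to
the ESLURM_* codes are placeholders rather than the values in Slurm's headers.

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder values for illustration only. The real ESLURM_* codes are
 * defined in Slurm's headers; their numeric values are not reproduced here. */
enum {
    ESLURM_DISABLED = 9001,
    ESLURM_POWER_NOT_AVAIL,
    ESLURM_POWER_RESERVED,
    ESLURM_PROLOG_RUNNING,
    ESLURM_INTERCONNECT_BUSY
};

/* Hypothetical helper: the single definition of "transient" that both the
 * retry logic and the sleep logic use, so the two cannot disagree. */
bool step_create_is_transient(int rc)
{
    switch (rc) {
    case EAGAIN:
    case ESLURM_DISABLED:
    case ESLURM_POWER_NOT_AVAIL:
    case ESLURM_POWER_RESERVED:
    case ESLURM_PROLOG_RUNNING:
    case ESLURM_INTERCONNECT_BUSY:
        return true;
    default:
        return false;
    }
}

/* Hypothetical callback types: one step create attempt, one backoff sleep. */
typedef int (*create_fn)(void *ctx);
typedef void (*backoff_fn)(int attempt, void *ctx);

/* Retry only on transient errors, and always back off before retrying.
 * Returns 0 on success, otherwise the last error code seen. */
int create_step_with_retry(create_fn create, backoff_fn backoff,
                           void *ctx, int max_tries)
{
    int rc = -1;

    for (int attempt = 0; attempt < max_tries; attempt++) {
        rc = create(ctx);
        if (rc == 0 || !step_create_is_transient(rc))
            return rc;
        if (attempt + 1 < max_tries)
            backoff(attempt, ctx);  /* never retry without a pause */
    }
    return rc;
}

/* Demo stand-ins: fail a fixed number of times, then succeed, and count
 * how often the caller attempted the create and how often it backed off. */
struct demo_ctx {
    int failures_left;
    int calls;
    int sleeps;
};

int demo_create(void *p)
{
    struct demo_ctx *c = p;

    c->calls++;
    if (c->failures_left > 0) {
        c->failures_left--;
        return ESLURM_PROLOG_RUNNING;
    }
    return 0;
}

void demo_backoff(int attempt, void *p)
{
    struct demo_ctx *c = p;

    (void) attempt;
    c->sleeps++;  /* a real caller would sleep here */
}
```

With a context that fails twice before succeeding, the loop makes three
create attempts and backs off twice, so every retry is preceded by a pause.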

Bug 4786
10af7fbe