Morris Jette
authored
wait to retry or not. I discovered this bug during regression testing. In some similar situations, srun continuously issues step create requests while launch_common_create_job_step() does not sleep between RPCs. launch_common_create_job_step() sleeps for some error codes, and srun retries the step create for some error codes. The problem is that the two sets of error codes do not match, resulting in constant retries without sleeps. This situation is very likely with job preemption combined with salloc, but other conditions can trigger the same behavior. The following errno values all trigger this situation: EAGAIN, ESLURM_DISABLED, ESLURM_POWER_NOT_AVAIL, ESLURM_POWER_RESERVED, ESLURM_PROLOG_RUNNING, ESLURM_INTERCONNECT_BUSY. Bug 4786
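One way to picture the mismatch described above is a sketch in which a single predicate decides both whether to retry and whether to sleep, so the two lists cannot drift apart. This is an illustration only, not the code from this commit: the `DEMO_*` constants are hypothetical stand-ins for the real Slurm error codes, and `step_create_is_retryable()`, `create_step_with_retry()`, and the `demo_*` helpers are invented names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the Slurm error codes named in the commit
 * message. The numeric values are made up for this sketch. */
enum {
    DEMO_ESLURM_DISABLED = 3001,
    DEMO_ESLURM_POWER_NOT_AVAIL,
    DEMO_ESLURM_POWER_RESERVED,
    DEMO_ESLURM_PROLOG_RUNNING,
    DEMO_ESLURM_INTERCONNECT_BUSY,
    DEMO_ESLURM_OTHER /* an error that should not be retried */
};

/* Single source of truth: the same list answers "retry?" and "sleep?". */
static bool step_create_is_retryable(int err)
{
    switch (err) {
    case EAGAIN:
    case DEMO_ESLURM_DISABLED:
    case DEMO_ESLURM_POWER_NOT_AVAIL:
    case DEMO_ESLURM_POWER_RESERVED:
    case DEMO_ESLURM_PROLOG_RUNNING:
    case DEMO_ESLURM_INTERCONNECT_BUSY:
        return true;
    default:
        return false;
    }
}

typedef int (*create_fn)(void *ctx);  /* returns 0 on success, else an error code */
typedef void (*sleep_fn)(void *ctx);  /* delay between attempts */

/* Retry loop in which every retry is preceded by a sleep. */
static int create_step_with_retry(create_fn create, sleep_fn nap,
                                  void *ctx, int max_tries)
{
    int err = 0;

    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        err = create(ctx);
        if (err == 0 || !step_create_is_retryable(err))
            return err;
        if (i + 1 < max_tries)
            nap(ctx);
    }
    return err;
}

/* Fake RPC and sleep used only to exercise the loop. */
struct demo_state {
    int fail_count; /* number of attempts that fail before success */
    int err;        /* error code returned by the failing attempts */
    int creates;    /* attempts made so far */
    int sleeps;     /* sleeps taken so far */
};

static int demo_create(void *ctx)
{
    struct demo_state *s = ctx;

    s->creates++;
    return (s->creates <= s->fail_count) ? s->err : 0;
}

static void demo_sleep(void *ctx)
{
    struct demo_state *s = ctx;

    s->sleeps++;
}
```

With this shape, a step create that fails twice with a retryable code is attempted three times with two sleeps in between, while a non-retryable code returns after the first attempt with no sleep.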