Didier GAZEN authored
In your node_mgr fix to keep rebooted nodes down (commit 9cd15dfe), you forgot to consider the case of nodes that are powered up but respond only after ResumeTimeout seconds (the maximum time permitted). Such nodes are marked DOWN (because they did not respond within ResumeTimeout seconds) and should then silently become available when ReturnToService=1 (as stated in the slurm.conf manual).

With your modification, when such nodes finally respond, they are seen as rebooted nodes and remain in the DOWN state (with the new reason "Node unexpectedly rebooted"), even when ReturnToService=1!

My patch to obtain the correct behaviour: