    4e8545b6
    Fix for node reboot/down state · 4e8545b6
    Didier GAZEN authored
    In your node_mgr fix to keep rebooted nodes down (commit 9cd15dfe), you
    forgot to consider nodes that are powered up but respond only after
    ResumeTimeout seconds (the maximum time permitted). Such nodes are
    marked DOWN (because they didn't respond within ResumeTimeout seconds) and
    should then silently become available when ReturnToService=1 (as stated in the slurm.conf manual).
    
    With your modification, when such nodes finally respond, they are seen as
    rebooted nodes and remain in the DOWN state (with the new reason: Node
    unexpectedly rebooted) even when ReturnToService=1.
    
    Correction of commit 3c2b46af
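
    To illustrate the intended behaviour, here is a minimal Python model of the
    decision described above. It is only a sketch and not the node_mgr code:
    the `Node` class, the `resume_timed_out` flag, the `looks_rebooted` argument
    and the `handle_registration` function are all hypothetical names, and only
    the ReturnToService=1 case from this message is modelled.

```python
from dataclasses import dataclass

REBOOT_REASON = "Node unexpectedly rebooted"


@dataclass
class Node:
    # Hypothetical, simplified node record (not a Slurm structure).
    state: str = "DOWN"
    reason: str = "Not responding"
    # Assumed flag: the node was marked DOWN because it did not respond
    # within ResumeTimeout seconds after being powered up.
    resume_timed_out: bool = False


def handle_registration(node: Node, return_to_service: int, looks_rebooted: bool) -> Node:
    """Decide what happens when a node finally responds (illustrative only)."""
    if node.state != "DOWN":
        return node
    # Check the late power-up case first, so that a slow resume is not
    # mistaken for an unexpected reboot.
    if return_to_service == 1 and node.resume_timed_out:
        node.state = "IDLE"
        node.reason = ""
        node.resume_timed_out = False
    elif looks_rebooted:
        # The node stays DOWN and gets the new reason.
        node.reason = REBOOT_REASON
    return node
```

    In this model, the problem described above corresponds to testing
    `looks_rebooted` before the late power-up case: a node that merely answered
    after ResumeTimeout would then stay DOWN even with ReturnToService=1.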