Improve fault-tolerance for batch jobs.

If a node fails to respond to the batch_job_launch RPC, deallocate the job's
resources and requeue the job. If a node registers and does not report a batch
job whose script should be running there (node zero of the allocation),
consider the job complete.
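
The two rules above can be sketched roughly as follows. This is a minimal illustration only, not the code from this commit: the `struct job` layout, the `job_state` enum, and the handler names `on_launch_rpc_timeout` and `on_node_registration` are all hypothetical and do not correspond to slurmctld's real data structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified job states; not slurmctld's real enum. */
enum job_state { JOB_PENDING, JOB_RUNNING, JOB_COMPLETE };

/* Hypothetical job record, reduced to what the two rules need. */
struct job {
    int id;
    enum job_state state;
    bool is_batch;
    int batch_node;   /* node zero of the allocation, -1 if none */
};

/* Rule 1: the node never answered the batch launch RPC.
 * Release the allocation and put the job back in the queue. */
static void on_launch_rpc_timeout(struct job *job)
{
    assert(job != NULL);
    if (!job->is_batch || job->state != JOB_RUNNING)
        return;
    job->batch_node = -1;      /* deallocate resources */
    job->state = JOB_PENDING;  /* requeue */
}

/* Rule 2: a node registered and sent the list of job ids it is running.
 * If it is node zero of a batch job's allocation and the job is not in
 * that list, the script is gone, so treat the job as complete. */
static void on_node_registration(struct job *job, int node,
                                 const int *reported_ids, int n_reported)
{
    assert(job != NULL);
    if (!job->is_batch || job->state != JOB_RUNNING ||
        job->batch_node != node)
        return;
    for (int i = 0; i < n_reported; i++) {
        if (reported_ids[i] == job->id)
            return;            /* node still reports the job */
    }
    job->batch_node = -1;
    job->state = JOB_COMPLETE;
}
```

In this sketch only the registration of node zero can complete a batch job, since that is the only node expected to be running the script; registrations from other nodes leave the job untouched.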
Showing 4 changed files with 52 additions and 15 deletions
- src/slurmctld/agent.c: 9 additions, 1 deletion
- src/slurmctld/job_mgr.c: 33 additions, 2 deletions
- src/slurmctld/job_scheduler.c: 8 additions, 10 deletions
- src/slurmctld/node_scheduler.c: 2 additions, 2 deletions