fix: idle nodes cannot be allocated
avoid add/remove of a job's node resources if the node is lost by a resize

I found another case where an idle node cannot be allocated. It can be
reproduced as follows:

1. Run a job with the -k option:

   [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
   srun: error: Node failure on cn28
   srun: error: Node failure on cn28
   srun: error: cn28: task 10: Killed
   ^Csrun: interrupt (one more within 1 sec to abort)
   srun: tasks 0-9: running
   srun: task 10: exited abnormally
   ^Csrun: sending Ctrl-C to job 106120.0
   srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

2. Set the failed node down and then set it idle:

   [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
   [root@mn0 ~]# scontrol update nodename=cn28 state=idle

3. Restart slurmctld:

   [root@mn0 ~]# service slurm restart
   stopping slurmctld:                                        [  OK  ]
   slurmctld is stopped
   starting slurmctld:                                        [  OK  ]

4. Cancel the job.

The node that was set down is then left unavailable: sinfo reports it idle,
but the scheduler still treats it as in exclusive use, so new jobs requesting
it stay queued:

   [root@mn0 ~]# sinfo -n cn[18-28]
   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   work*        up   infinite     11   idle cn[18-28]
   [root@mn0 ~]# srun -w cn[18-28] hostname
   srun: job 106122 queued and waiting for resources
   [root@mn0 slurm]# grep cn28 slurmctld.log
   [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
   [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use

I made an attempt to fix this with the attached patch. Please review it.
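To illustrate the idea described above (skip adjusting a job's resource
accounting for a node that is no longer part of the job's allocation, e.g.
because it was dropped by a resize after a node failure), here is a minimal,
self-contained C sketch of that guard. It is a simplified model, not the
actual slurmctld or attached patch code; the names (struct job_alloc,
remove_node_resources, MAX_NODES) are hypothetical and a plain array stands
in for Slurm's bitstr_t node bitmap.

   /* Simplified model of the guard: only adjust a job's per-node resource
    * accounting if that node is still part of the job's allocation.
    * NOT the real slurmctld code; names and structures are hypothetical. */
   #include <stdbool.h>
   #include <stdio.h>

   #define MAX_NODES 32

   struct job_alloc {
       bool node_bitmap[MAX_NODES]; /* true if node i is allocated to the job */
       int  cpus_alloc[MAX_NODES];  /* CPUs the job holds on node i           */
       int  total_cpus;             /* running total of allocated CPUs        */
   };

   /* Deduct a node's resources from the job, but only if the node is still
    * in the job's allocation. If the node was already lost (e.g. dropped by
    * a resize after a failure with srun -k), touching the accounting again
    * would leave the controller's per-node state inconsistent, which is the
    * kind of mismatch that keeps a node looking "in exclusive use". */
   static void remove_node_resources(struct job_alloc *job, int node_inx)
   {
       if (node_inx < 0 || node_inx >= MAX_NODES)
           return;
       if (!job->node_bitmap[node_inx]) {
           /* Node already removed from the job: do nothing. */
           return;
       }
       job->total_cpus -= job->cpus_alloc[node_inx];
       job->cpus_alloc[node_inx] = 0;
       job->node_bitmap[node_inx] = false;
   }

   int main(void)
   {
       struct job_alloc job = { .total_cpus = 0 };

       /* The job initially holds 4 CPUs on each of nodes 0 and 1. */
       for (int i = 0; i < 2; i++) {
           job.node_bitmap[i] = true;
           job.cpus_alloc[i]  = 4;
           job.total_cpus    += 4;
       }

       remove_node_resources(&job, 1); /* node fails, resources released */
       remove_node_resources(&job, 1); /* second attempt is a no-op      */

       printf("total_cpus = %d\n", job.total_cpus); /* prints 4, not 0 */
       return 0;
   }

Without the bitmap check, the second call would deduct the node's CPUs a
second time, leaving the job's accounting (and the node's apparent usage)
out of sync with reality even after the job is cancelled.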