Morris Jette authored
There was a subtle bug in how tasks were bound to CPUs which could result in an "infinite loop" error. The problem was that various socket/core/thread calculations were based upon the resources allocated to a step rather than all resources on the node, so rounding errors could occur. Consider for example a node with 2 sockets, 6 cores per socket and 2 threads per core. On the idle node, a job requesting 14 CPUs is submitted. That job would be allocated 4 cores on the first socket and 3 cores on the second socket. The old logic would get the number of sockets for the job as 2 and the number of cores as 7, then calculate the number of cores per socket as 7/2, or 3 (rounding down to an integer). The logic laying out tasks would bind the first 3 cores on each socket to the job, then not find any remaining cores, report the "infinite loop" error to the user, and run the job without one of the expected cores. The problem gets even worse when some cores on the node are already allocated. In a more extreme case, a job might be allocated 6 cores on one socket and 1 core on a second socket. In that case, 3 of that job's cores would be unused. bug 2502
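The arithmetic above can be checked with a small sketch. This is not the Slurm code (which is written in C); the function names and the list-of-cores-per-socket representation are hypothetical, chosen only to reproduce the numbers in the message.

```python
def bound_cores_step_based(alloc_per_socket):
    """Cores that get bound when cores-per-socket is derived from the
    job's own allocation (the old, buggy style of calculation)."""
    sockets = sum(1 for cores in alloc_per_socket if cores > 0)
    if sockets == 0:
        return 0
    total_cores = sum(alloc_per_socket)
    cores_per_socket = total_cores // sockets  # integer division rounds down
    # Only the first cores_per_socket cores on each socket are considered.
    return sum(min(cores, cores_per_socket) for cores in alloc_per_socket)


def bound_cores_node_based(alloc_per_socket, node_cores_per_socket):
    """Cores that get bound when the node's real cores-per-socket is used."""
    return sum(min(cores, node_cores_per_socket) for cores in alloc_per_socket)


# Node from the example: 2 sockets, 6 cores per socket.
for alloc in ([4, 3], [6, 1]):
    total = sum(alloc)
    old = bound_cores_step_based(alloc)
    new = bound_cores_node_based(alloc, node_cores_per_socket=6)
    print(f"alloc={alloc}: step-based binds {old} of {total}, "
          f"node-based binds {new} of {total}")
```

With the 4+3 allocation the step-based estimate leaves one core unbound, and with the 6+1 allocation it leaves three unbound, matching the two cases described in the message.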