Morris Jette authored
From Chris Holmes, HP: After several days of brainstorming and debugging, I have identified a bug in SLURM 2.5.0rc2 related to the 'tree' topology. It occurs so early in the execution of the whole SLURM machinery that it took me some time to figure out (say, 100 or 200 jobs showing the issue, with debugging levels raised to varying degrees and extra instrumentation, and sometimes with uncertain reliability)... For every “switch”, a bitmap of the nodes seen below that switch is built as the topology is discovered through 'topology.conf'. There is code in read_config.c, executed when the SLURM control daemon starts, that reorders the nodes (by hostname by default) after the switch table (i.e. the bitmaps) has already been built. Reordering the nodes therefore makes the switches' bitmaps wrong.
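The failure mode described above can be illustrated with a small model. This is only a sketch, not SLURM's code: the node names, the two-switch topology, and the helper functions `build_switch_bitmaps` and `nodes_in_bitmap` are all hypothetical. It assumes that each switch bitmap stores positions in the node table, so the bitmap is only meaningful for the table order in effect when it was built.

```python
# Illustrative model only: names and data layout are hypothetical, not SLURM's.

def build_switch_bitmaps(node_table, topology):
    """Return {switch: bitmask}; bit i is set if node_table[i] is below the switch."""
    index = {name: i for i, name in enumerate(node_table)}
    bitmaps = {}
    for switch, members in topology.items():
        mask = 0
        for name in members:
            mask |= 1 << index[name]
        bitmaps[switch] = mask
    return bitmaps


def nodes_in_bitmap(node_table, mask):
    """Decode a bitmask into node names using the current node table order."""
    return {name for i, name in enumerate(node_table) if mask & (1 << i)}


# Node table in configuration order, plus a made-up two-switch topology.
node_table = ["n3", "n1", "n4", "n2"]
topology = {"s0": ["n1", "n2"], "s1": ["n3", "n4"]}

# Bitmaps are built against the original node order.
bitmaps = build_switch_bitmaps(node_table, topology)
before = nodes_in_bitmap(node_table, bitmaps["s0"])

# Reordering the node table (here by hostname) leaves the old bitmaps
# pointing at the wrong positions.
node_table.sort()
stale = nodes_in_bitmap(node_table, bitmaps["s0"])

# One possible remedy (an assumption, not necessarily the actual fix):
# rebuild the bitmaps after the reorder.
rebuilt = build_switch_bitmaps(node_table, topology)
fixed = nodes_in_bitmap(node_table, rebuilt["s0"])

print(before, stale, fixed)
```

In this model, switch `s0` decodes to the right nodes before the sort, to a different set afterwards, and to the right nodes again once the bitmaps are rebuilt against the new order. Reordering the nodes before the switch table is built at all would avoid the stale state in the same way.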