An error occurred while fetching folder content.
Marshall Garey
authored
Job steps that run on cloud nodes and use the alias_list - in other words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk directly back to the slurmctld. To make that happen, we set the parent tank of each stepd to -1. However, we also set the rank of each stepd to 0. this meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to the slurmctld, they would tell slurmctld to clean up node 0 in the step allocation. So, multi-node step allocations weren't cleaning up after the steps completed and would cause subsequent job steps to hang. The step allocations would only clean up properly at the end of the job. Ensure that each stepd uses the correct rank so that job steps are properly cleaned up after each step completes. Bug 6467.
Name | Last commit | Last update |
---|