Files · e7d4d5931058c93816f0e21f16d4b671d990cfe7 · tud-zih-energy / Slurm


Marshall Garey authored 5 years ago

Job steps that run on cloud nodes and use the alias_list - in other
words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
directly back to the slurmctld. To make that happen, we set the parent
rank of each stepd to -1. However, we also set the rank of each stepd to
0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to
the slurmctld, it would tell slurmctld to clean up node 0 in the step
allocation. So, multi-node step allocations weren't cleaning up after
the steps completed and would cause subsequent job steps to hang. The
step allocations would only clean up properly at the end of the job.

Ensure that each stepd uses the correct rank so that job steps are
properly cleaned up after each step completes.
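
The failure mode described above can be illustrated with a small toy model. This is only a sketch and not Slurm's actual code: every `toy_*` name, the fixed-size `node_done` array, and the single-node completion range are assumptions made for illustration. It shows why a step is never fully cleaned up when every stepd reports rank 0, and why reporting the real rank fixes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 64

/* Hypothetical stand-in for the controller's per-step bookkeeping. */
struct toy_step {
    size_t node_cnt;                 /* nodes in the step allocation */
    bool node_done[TOY_MAX_NODES];   /* true once a node is cleaned up */
};

/* Hypothetical stand-in for a step-complete message: the inclusive
 * range of node ranks that the sender reports as finished. */
struct toy_step_complete_msg {
    int range_first;
    int range_last;
};

void toy_step_init(struct toy_step *step, size_t node_cnt)
{
    assert(node_cnt <= TOY_MAX_NODES);
    step->node_cnt = node_cnt;
    for (size_t i = 0; i < node_cnt; i++)
        step->node_done[i] = false;
}

/* Controller side: mark every rank in the reported range as done. */
void toy_handle_step_complete(struct toy_step *step,
                              const struct toy_step_complete_msg *msg)
{
    assert(msg->range_first >= 0 && msg->range_first <= msg->range_last);
    assert((size_t)msg->range_last < step->node_cnt);
    for (int i = msg->range_first; i <= msg->range_last; i++)
        step->node_done[i] = true;
}

/* Number of nodes the controller still considers busy. */
size_t toy_nodes_remaining(const struct toy_step *step)
{
    size_t remaining = 0;
    for (size_t i = 0; i < step->node_cnt; i++)
        if (!step->node_done[i])
            remaining++;
    return remaining;
}

/* Simulate every stepd reporting directly to the controller.
 * use_real_rank == false reproduces the bug: each stepd claims rank 0.
 * use_real_rank == true models the fix: each stepd sends its own rank. */
size_t toy_run_step(size_t node_cnt, bool use_real_rank)
{
    struct toy_step step;
    toy_step_init(&step, node_cnt);
    for (size_t node = 0; node < node_cnt; node++) {
        int rank = use_real_rank ? (int)node : 0;
        struct toy_step_complete_msg msg = { rank, rank };
        toy_handle_step_complete(&step, &msg);
    }
    return toy_nodes_remaining(&step);
}
```

In this model, a four-node step with every stepd reporting rank 0 leaves three nodes marked as busy, while reporting the real ranks leaves none. A single-node step is unaffected either way, which is consistent with the problem showing up only for multi-node step allocations.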

Bug 6467.
