Fix for sstat on multi-node batch jobs (8589ff40) · Commits · tud-zih-energy / Slurm

Commit 8589ff40 authored 8 years ago by Jacek Budzowski Committed by Morris Jette 8 years ago

Fix for sstat on multi-node batch jobs

There is a problem with gathering batch step statistics for jobs which are allocated on more than one node.

sstat asks the wrong node for batch step stats. It requests info from the last node in the hostlist, whereas it should ask the first host in the hostlist (i.e. BatchHost), because the batch step actually executes only on the first node.

For example, suppose you have a job allocated on nodes n000[1-2] with BatchHost=n0001. You should be able to check its statistics by running sstat, with the -vv switch for more verbose output (e.g. sstat -j 1234.batch -vv). Instead, you see these lines:

sstat: debug: slurm_job_step_stat: getting pid information of job 1234.4294967294 on nodes n0002
sstat: debug: job step 1234.4294967294 has already completed

The problem lies in the sstat source code. For the batch step, the target host is taken from the hostlist_pop function, which returns the last host from the given hostlist. This should be replaced with the hostlist_shift function, which returns the first host from the given hostlist. Patch attached.
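
The difference between the two calls can be illustrated with a toy model. This is only a sketch: `toy_hostlist`, `toy_shift`, `toy_pop`, `toy_job_nodes` and `toy_demo` are hypothetical names invented for this example and are not part of Slurm; they only mimic the "take the first host" versus "take the last host" behaviour described above, using the node names from the example job.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_HOSTS 8

/* Hypothetical stand-in for a hostlist: a fixed array of names
 * plus a [head, tail) window of the hosts that are still in it. */
struct toy_hostlist {
    const char *hosts[TOY_MAX_HOSTS];
    int head;   /* index of the first remaining host */
    int tail;   /* one past the index of the last remaining host */
};

/* Remove and return the FIRST host (models the behaviour the
 * commit message attributes to hostlist_shift). NULL if empty. */
const char *toy_shift(struct toy_hostlist *hl)
{
    if (hl->head >= hl->tail)
        return NULL;
    return hl->hosts[hl->head++];
}

/* Remove and return the LAST host (models the behaviour the
 * commit message attributes to hostlist_pop). NULL if empty. */
const char *toy_pop(struct toy_hostlist *hl)
{
    if (hl->head >= hl->tail)
        return NULL;
    return hl->hosts[--hl->tail];
}

/* Node list of the example job: n000[1-2], batch step on n0001. */
struct toy_hostlist toy_job_nodes(void)
{
    struct toy_hostlist hl = { { "n0001", "n0002" }, 0, 2 };
    return hl;
}

/* Print which node each strategy would query for the batch step. */
void toy_demo(void)
{
    struct toy_hostlist before = toy_job_nodes();
    struct toy_hostlist after = toy_job_nodes();

    printf("pop   (old behaviour) queries %s\n", toy_pop(&before));
    printf("shift (new behaviour) queries %s\n", toy_shift(&after));
}
```

Calling `toy_demo()` from a `main` function shows the old strategy selecting n0002, the node named in the debug output above, while the new strategy selects n0001, the node where the batch step runs.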

bug 2975

parent 2c7c5459


Showing with 1 addition and 1 deletion
