- May 01, 2021
-
Albert Gil authored
-
Albert Gil authored
Use test_name to avoid test_id. Use get_config_parameter to simplify checking requirements. Use submit_job to simplify code, avoiding unnecessary spawn/expect. Use wait_for and a new auxiliary function to avoid sleeps. Use subtest/subpass to avoid exit_code. Use tolerance to simplify subtests. Bug 10315. Signed-off-by: Scott Jackson <scottmo@schedmd.com>
-
Albert Gil authored
Bug 10315. Signed-off-by: Scott Jackson <scottmo@schedmd.com>
-
Scott Jackson authored
By creating the file to read as a setup step instead of in one of the tasks. Bug 10315
-
Albert Gil authored
-
Scott Jackson authored
Bug 10604
-
- Apr 30, 2021
-
John Thiltges authored
print_job_select() included an unnecessary argument in a printf() call, causing a 'Redundant argument in printf' error to be shown. Bug 11491.
-
- Apr 29, 2021
-
Ben Roberts authored
For salloc, sbatch and srun. Bug 11461. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Ben Roberts authored
Bug 11430. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Ben Roberts authored
Bug 11206. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Scott Jackson authored
-
Danny Auble authored
-
Marcin Stolarek authored
This looks like a typo in the initial commit b71efa62: the lock should be released before return, not locked again. Bug 11480
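As a generic illustration of the corrected pattern (plain pthreads, not the actual Slurm code or the real function involved), the mutex is released on the way out rather than taken a second time:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static int do_work(void)
    {
        int rc = 0;

        pthread_mutex_lock(&lock);
        /* ... critical section ... */
        pthread_mutex_unlock(&lock);   /* release before return, */
        return rc;                     /* not pthread_mutex_lock() again */
    }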
-
Marshall Garey authored
In a multi-node job it was possible to be in a situation where there were more CPUs available for steps to use, but steps would not launch. For example, if a node has 2 cores and 1 thread per core and this job is submitted:

    sbatch -N2 --ntasks-per-node=2 --mem=1000 job.bash

and job.bash contains the following:

    for i in {1..4}
    do
        srun --exact --mem=100 -N1 -c1 -n1 sleep 60 &
    done
    wait

then two steps would run on the first node and one step would run on the second node, but the fourth step would not run until the first step completed, even though there is an available task and CPU on the second node in the allocation.

Why does this happen? If the step requests CPUs <= number of nodes, then when _pick_step_nodes() calls _pick_step_nodes_cpus():

    node_tmp = _pick_step_nodes_cpus(job_ptr, nodes_avail, nodes_needed,
                                     cpus_needed, usable_cpu_cnt);

it will simply return the first N nodes from the nodes_avail bitmap, where N is the number of nodes that the step requested. In this example job, all the CPUs on the first node are allocated, but the first node remains in the nodes_avail bitmap. _pick_step_nodes_cpus() therefore selects the first node and adds it to the nodes_picked bitmap. Right after that, _pick_step_nodes() gets the number of CPUs from the nodes in the nodes_picked bitmap, which is 0 CPUs.

The fix is to remove fully allocated nodes from the nodes_avail bitmap. But this alone creates another problem: once all the nodes are fully allocated and another valid step request comes in, the incorrect error ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE would be returned when the correct error is ESLURM_NODES_BUSY. So we also increment job_blocked_nodes when a node has no available CPUs. Bug 11357
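A minimal sketch of the fix as described above; bit_test() and bit_clear() are Slurm's bitstring helpers, but the loop, variable names and placement are illustrative rather than the actual _pick_step_nodes() code:

    /* Drop fully allocated nodes from nodes_avail so _pick_step_nodes_cpus()
     * cannot hand back a node with 0 usable CPUs, and count them as blocked
     * so an otherwise valid request fails with ESLURM_NODES_BUSY instead of
     * ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE. */
    for (int i = 0; i < node_record_count; i++) {
        if (!bit_test(nodes_avail, i))
            continue;
        if (usable_cpu_cnt[i] == 0) {
            bit_clear(nodes_avail, i);  /* node has no free CPUs left */
            job_blocked_nodes++;        /* remember why it was skipped */
        }
    }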
-
Carlos Tripiana Montes authored
Continuation of 8475ae9d. CID 221511. Bug 11401
-
Albert Gil authored
-
Scott Jackson authored
Fix job submission count message. Bug 10439
-
Scott Jackson authored
Bug 10439
-
- Apr 28, 2021
-
Marcin Stolarek authored
Bug 11059
-
Ben Roberts authored
Add note in cgroup.conf to match the note in slurm.conf. Bug 11209. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Carlos Tripiana Montes authored
Jobs requesting resources that could fit in 1 leaf switch are incorrectly spread across switches. Fixing this code also makes "--switches" work again. select/cons_res already works according to documentation. Bug 11401
-
Michael Hinton authored
Bug 10944
-
Ben Roberts authored
Fix extra newlines that were causing links to be created where they weren't needed. Remove an extra .TP that prevented a link from being added. Correctly close a bold tag that prevented a link. Bug 10944
-
Ben Roberts authored
Extra .LP tags and other white space created larger than normal gaps between certain paragraphs. Bug 10944
-
Tim Wickberg authored
-
Ben Roberts authored
Fix items with descriptions that started on the same line. Address cases where some descriptions were indented more than the rest of the descriptions in the list. Bug 10944
-
Ben Roberts authored
Fixed for salloc, sbatch and srun. Bug 10944
-
Ben Roberts authored
Bug 10944
-
- Apr 27, 2021
-
Tim McMullan authored
Send a 0 as the file length over the pipe to the parent to break out of the safe_read(), rather than hanging indefinitely since the child has already exited. Bug 11460
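A self-contained sketch of the pipe protocol this describes (an assumed shape, not the slurmd implementation): on its error path the child writes a zero length so the parent's blocking read returns instead of waiting for file data that will never arrive.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int pipefd[2];
        uint32_t len;

        if (pipe(pipefd) < 0)
            return 1;

        pid_t pid = fork();
        if (pid < 0)
            return 1;
        if (pid == 0) {                  /* child: error path */
            close(pipefd[0]);
            len = 0;                     /* nothing to send: report length 0 */
            (void) write(pipefd[1], &len, sizeof(len));
            close(pipefd[1]);
            _exit(1);
        }

        /* parent */
        close(pipefd[1]);
        if (read(pipefd[0], &len, sizeof(len)) == sizeof(len) && len == 0)
            printf("child reported no data, not waiting for file contents\n");
        close(pipefd[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }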
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec as well.
-
- Apr 26, 2021
-
Ben Roberts authored
Bug 11318
-
- Apr 23, 2021
-
Brian Christiansen authored
This reverts commit 69d5d94b. submit_dir != working dir. In 7092, the customer was seeing the interactive srun set SLURM_SUBMIT_DIR to the cwd after -D had been applied, which is wrong behavior. This will be fixed in 21.08. Bug 11407
-
Ben Roberts authored
Bug 11447
-
Marcin Stolarek authored
Bug 11396
-
- Apr 21, 2021
-
Ben Roberts authored
Bug 11363. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Danny Auble authored
Bug 11093
-
Carlos Tripiana Montes authored
It can happen that the child process fails at some point and returns an exit code different from 0, but the parent didn't check this properly and finished returning without error. Now, if the child fails, the parent fails too. This propagates the error back, so we know the job will have a problem. Bug 11093
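An illustrative pattern for the check described here (generic POSIX; _wait_child_rc() is a hypothetical helper name, not the actual plugin code):

    #include <sys/types.h>
    #include <sys/wait.h>

    /* Return 0 only if the child exited normally with status 0; otherwise
     * report an error so the caller also fails instead of hiding the
     * child's problem. */
    static int _wait_child_rc(pid_t cpid)
    {
        int status = 0;

        if (waitpid(cpid, &status, 0) < 0)
            return -1;                      /* wait itself failed */
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return -1;                      /* propagate child failure */
        return 0;                           /* child succeeded */
    }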
-
Carlos Tripiana Montes authored
container_p_restore now gets the list of running jobs from the spool dir with stepd_available. Then it iterates over the basepath entries and, for those which seem to have been a mount point (have a .ns file), tries to mount them again. If that succeeds (it must), and if the job for that mount point is dead, it releases resources and tries to delete the files. Remember that the removal can fail if a resource is leaked. These leaks would be fixed if slurmd starts after a HW reboot (no kernel leaks). Bug 11093
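A rough sketch of the restore walk described above; _remount_ns(), _job_is_dead() and _delete_ns() below are placeholders for the plugin's internal helpers (stubbed out here), not real Slurm signatures.

    #include <dirent.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Placeholder helpers standing in for the plugin internals. */
    static int  _remount_ns(const char *ns_file)  { (void) ns_file;  return 0; }
    static int  _job_is_dead(const char *job_dir) { (void) job_dir;  return 1; }
    static void _delete_ns(const char *job_dir)   { (void) job_dir; }

    static void _restore_basepath(const char *basepath)
    {
        DIR *dp = opendir(basepath);
        struct dirent *de;
        char ns_file[4096];

        if (!dp)
            return;
        while ((de = readdir(dp))) {
            if (de->d_name[0] == '.')
                continue;                 /* skip "." and ".." entries */
            snprintf(ns_file, sizeof(ns_file), "%s/%s/.ns",
                     basepath, de->d_name);
            if (access(ns_file, F_OK))
                continue;                 /* never was a mount point */
            if (_remount_ns(ns_file))
                continue;                 /* remount must succeed */
            if (_job_is_dead(de->d_name))
                _delete_ns(de->d_name);   /* release resources; may fail if leaked */
        }
        closedir(dp);
    }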
-
Carlos Tripiana Montes authored
Then, make the _rm_data call in _create_ns keep the old behavior: there the data will only be removed if something fails during creation, and no previous NS leak is possible, so force removal or fail. But _delete_ns could be called at the job's end after slurmd got killed and restarted, thus having leaked the NS. Even though slurmd can recreate the NS and mount it in the same place, the .ns file can't be removed at the end of the job because of EBUSY. Bug 11093
-