Commits · f88119ff934c4586a5ab77f9a6297cce63fed444 · tud-zih-energy / Slurm

Jul 19, 2016

Add routing queue info to Slurm FAQ web page · f88119ff
Morris Jette authored 8 years ago

f88119ff
Fix some typos in comments and logs · 5a45503c
Gennaro Oliva authored 8 years ago

5a45503c

Improve partition AllowGroups caching · 7e381982

Morris Jette authored 8 years ago

If the user is not allowed to use the partition,
    then do not check that user's group access again for 5 seconds.
bug 2913

7e381982
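
A minimal sketch of short-lived denial caching, on the assumption that the commit skips the group lookup for a user who was just refused access. The class name, the `check_access` callback and the injectable clock are hypothetical and are not taken from the Slurm source; only the 5 second interval comes from the message above.

```python
import time
from typing import Callable, Dict, Tuple

DENY_TTL_SECONDS = 5  # interval named in the commit message


class DenialCache:
    """Remember recent denials so the costly group check is not repeated."""

    def __init__(self, check_access: Callable[[int, str], bool],
                 clock: Callable[[], float] = time.monotonic):
        self._check_access = check_access   # hypothetical expensive lookup
        self._clock = clock
        self._denied_at: Dict[Tuple[int, str], float] = {}
        self.checks = 0                     # number of real lookups performed

    def is_allowed(self, uid: int, partition: str) -> bool:
        key = (uid, partition)
        now = self._clock()
        denied_at = self._denied_at.get(key)
        # A denial younger than the TTL is answered from the cache.
        if denied_at is not None and now - denied_at < DENY_TTL_SECONDS:
            return False
        self.checks += 1
        if self._check_access(uid, partition):
            self._denied_at.pop(key, None)
            return True
        self._denied_at[key] = now
        return False
```

A user who resubmits in a tight loop then triggers at most one group lookup per 5 seconds for a given partition.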

Improve partition AllowGroups caching · 98dc38b2

Morris Jette authored 8 years ago

Improve partition AllowGroups caching. Update the table of UIDs permitted to
    use a partition based upon its AllowGroups configuration parameter as new
    valid UIDs are found rather than looking up that user's group information
    for every job they submit, which can involve considerable overhead for
    some systems.
bug 2913

98dc38b2
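
The caching idea described in this message can be sketched as follows. This is an illustration only, not the slurmctld code: `PartitionAccessCache` and the `lookup_groups` callback are hypothetical names, and the real group lookup is replaced by a plain function.

```python
from typing import Callable, Set


class PartitionAccessCache:
    """Grow a table of permitted UIDs as valid users are discovered."""

    def __init__(self, lookup_groups: Callable[[int], Set[str]],
                 allow_groups: Set[str]):
        self._lookup_groups = lookup_groups  # hypothetical, possibly slow
        self._allow_groups = allow_groups    # the partition's AllowGroups
        self._allowed_uids: Set[int] = set()
        self.lookups = 0                     # number of group lookups done

    def is_allowed(self, uid: int) -> bool:
        # Fast path: this UID was already found to be permitted.
        if uid in self._allowed_uids:
            return True
        # Slow path: resolve the user's groups and compare with AllowGroups.
        self.lookups += 1
        if self._lookup_groups(uid) & self._allow_groups:
            self._allowed_uids.add(uid)
            return True
        return False
```

With this shape, a permitted user pays for the group lookup once, and later submissions are answered from the UID table.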

Minimize preempted jobs · b9f17b18

Morris Jette authored 8 years ago

Minimize preempted jobs for configurations with multiple jobs per node.
  Previous logic would preempt every job on node allocated to pending
  job.
bug 2906

b9f17b18
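
One way to picture the difference between preempting every job on a node and preempting only as many as needed is the sketch below. It is an assumption about the general approach, not the scheduler's actual algorithm: the function name, the CPU-only resource model and the largest-first ordering are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_jobs_to_preempt(running: Dict[str, int], free_cpus: int,
                         needed_cpus: int) -> Optional[List[str]]:
    """Return the jobs to preempt on one node, or None if it cannot fit.

    running maps job id -> CPUs held on this node (hypothetical model).
    """
    victims: List[str] = []
    available = free_cpus
    # Taking the largest jobs first reaches the target with the fewest jobs.
    by_size = sorted(running.items(), key=lambda kv: kv[1], reverse=True)
    for job_id, cpus in by_size:
        if available >= needed_cpus:
            break
        victims.append(job_id)
        available += cpus
    return victims if available >= needed_cpus else None
```

For a node running jobs holding 2, 8 and 4 CPUs, a pending job that needs 8 CPUs displaces only the 8 CPU job instead of all three.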

gres-flags=enforce-binding fix · 5df8509f

Morris Jette authored 8 years ago

Fix for core selection with job --gres-flags=enforce-binding option.
    Previous logic would in some cases allocate a job zero cores, resulting in
    slurmctld abort.
bug 2808

5df8509f

Jul 18, 2016

Improve GRES log format · b5e54e11

Morris Jette authored 8 years ago

Add some indentation so that GRES topology-specific information
  logged is more readable.

b5e54e11

Select/cons_res memory corruption fix · c06db0de

Morris Jette authored 8 years ago

A job allocation selecting nodes and no cores/CPUs could write
  off the end of arrays and corrupt memory. Now to figure out how
  the logic reached this point in the first place.
bug 2808

c06db0de

Add SLUGM16 dinner info · 6dc074c8
Morris Jette authored 8 years ago

6dc074c8

Jul 16, 2016

Add SLURM_PENDING_STEP id so it won't be confused with SLURM_EXTERN_CONT. · 0c7bd6d0

Danny Auble authored 8 years ago

In commit b8190e5d many places that were meant to be pending step ids
were changed to be extern_step id.  The main problem was when we came up
with the idea of the extern step we reused -1 (INFINITE) for the id.  So
pending steps also appeared to be extern steps as well.  Hopefully this
fixes the situation.

Bug 2907

0c7bd6d0
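
The collision described above can be shown with a small sketch. The numeric values and the `classify` helper are hypothetical and chosen only for illustration; they are not copied from the Slurm headers.

```python
# Illustrative values only; these are NOT taken from the Slurm headers.
INFINITE = -1

# Before: both special steps reuse the same sentinel.
OLD_EXTERN_STEP = INFINITE
OLD_PENDING_STEP = INFINITE

# After: the pending step gets its own (hypothetical) value.
NEW_EXTERN_STEP = INFINITE
NEW_PENDING_STEP = -2


def classify(step_id: int, extern_id: int, pending_id: int) -> str:
    """Name the kind of step a given id refers to."""
    if step_id == extern_id:
        return "extern"
    if step_id == pending_id:
        return "pending"
    return "regular"


# With a shared sentinel, a pending step is reported as the extern step.
print(classify(OLD_PENDING_STEP, OLD_EXTERN_STEP, OLD_PENDING_STEP))
# With distinct sentinels, the two can be told apart.
print(classify(NEW_PENDING_STEP, NEW_EXTERN_STEP, NEW_PENDING_STEP))
```

Whichever comparison runs first wins when the two ids are equal, which is why a dedicated id for pending steps removes the ambiguity.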

Remove vestigial comment · 71800937
Morris Jette authored 8 years ago

71800937

Move startup of power save thread · fb8e3558

Morris Jette authored 8 years ago

Start power save thread only after the partition information is read
  in order to avoid trying to interpret the SuspendExcParts configuration
  information before the partition information is available, which would
  result in a slurmctld abort.

fb8e3558

Prevent slurmctld race condition · c7cae55b

Morris Jette authored 8 years ago

Do not try to access part_list variable (partition list pointer)
  if not yet initialized. Return NULL pointer rather than aborting
  with NULL pointer.

c7cae55b
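
The guard described in this message follows a common pattern: treat "not yet initialized" as "not found" instead of dereferencing the missing list. A minimal sketch, with a hypothetical `PartitionTable` class standing in for the part_list variable:

```python
from typing import Dict, Optional


class PartitionTable:
    """Stand-in for a partition list that is filled in after startup."""

    def __init__(self) -> None:
        self._parts: Optional[Dict[str, dict]] = None  # not yet initialized

    def load(self, parts: Dict[str, dict]) -> None:
        self._parts = dict(parts)

    def find(self, name: str) -> Optional[dict]:
        # Callers racing with startup get None rather than a crash.
        if self._parts is None:
            return None
        return self._parts.get(name)
```

Callers then need to handle a missing result, which they already must do for unknown partition names.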

Jul 15, 2016
- Fix spelling of hierarchy in comments · 4f3a0a02
  Tim Wickberg authored 8 years ago
  
  4f3a0a02
- Do not schedule powered down nodes in FAILED state · 310de98d
  Jacek Budzowski authored 8 years ago
  
  bug 2900
  310de98d
- Remove unnecessary test for super user in regression test · 2a7d01a5
  Nicolas Joly authored 8 years ago
  
  2a7d01a5
- Cleanup generated files if test cannot run due to inappropriate conditions. · b9abe288
  Nicolas Joly authored 8 years ago
  
  b9abe288
- Fix user message in test1.32 to report correct signal USR2. · 7f98f056
  Nicolas Joly authored 8 years ago
  
  7f98f056
- Update LRZ site report in SLUG16 agenda · 48dc2bec
  Morris Jette authored 8 years ago
  
  48dc2bec
- Move commit 30f4f81c to be above code that could call · f2b1c35f
  Danny Auble authored 8 years ago
  
  delete_step_records which would delete the steps without the killing flag set.
  f2b1c35f
- More on the others dealing with the extern cleanup. · 7c831dd9
  Danny Auble authored 8 years ago
  
  This treats the extern step like a normal step on exit. It doesn't appear the original code is needed anymore and this simplifies the code. The select_cray change is relevant since the add is needed only when killing the step as that is the only place _internal_step_complete isn't used.
  7c831dd9
- Continuation of commit 667f1105 . Remove unneeded job_ptr variable from · d74bcf74
  Danny Auble authored 8 years ago
  
  functions.
  d74bcf74
- Continuation of commit 667f1105 to simplify the code more · 28745901
  Danny Auble authored 8 years ago
  
  28745901
- Slightly better debug when dealing with stepd_completions. · af4d7aa4
  Danny Auble authored 8 years ago
  
  af4d7aa4
- Make scontrol show steps show the extern step correctly. · a4c2649d
  Danny Auble authored 8 years ago
  
  Before it was showing it as TBD since pending steps and the extern step have the same stepid.
  a4c2649d
- Various cleanup needed for extern step. Continuation of commit 2fc0c860 · c79063b0
  Danny Auble authored 8 years ago
  
  What this does is set the state earlier to match a normal set. Remove the unneeded _send_pending_exit_msgs. There is only one task and we have the message for it, so don't worry about that one. Most important, wait for the other slurmstepds to send their message, otherwise they could be lost on the other end.
  c79063b0

Jul 14, 2016
- Move talks at SLUG16 · f27341f7
  Morris Jette authored 8 years ago
  
  f27341f7
- Fix gang scheduling and license release logic · 111e3b48
  Morris Jette authored 8 years ago
  
  Fix gang scheduling and license release logic if single node job killed on bad node. Notifying gang and releasing licenses is normally done when the epilog completion happens, but if the node(s) assigned to a job are all down, that does not happen. This results in the licenses being reserved indefinitely and the gang scheduler being left with a bad (old) job pointer that can result in various failure modes. bug 2867
  111e3b48
- Add SLUG16 agenda/hotel links · 5041174f
  Morris Jette authored 8 years ago
  
  5041174f
- SLUG16 agenda update · c3a7d302
  Morris Jette authored 8 years ago
  
  Add hotels. Other minor changes.
  c3a7d302
- Fix missing variable from commit 30f4f81c · da462dbf
  Danny Auble authored 8 years ago
  
  da462dbf
- CRAY - If trying to kill a step and you have NHC_NO_STEPS set run NHC · e956f297
  Danny Auble authored 8 years ago
  
  anyway to attempt to log the backtraces of the potential unkillable processes.
  e956f297
- Fix uninitialized variable which could cause a core dump from commit · 50f77062
  Danny Auble authored 8 years ago
  
  667f1105.
  50f77062
- Fix potential deadlock from commit b4dc9eea . · 30f4f81c
  Danny Auble authored 8 years ago
  
  30f4f81c

Jul 13, 2016

Continuation of last commit. · b4dc9eea

Danny Auble authored 8 years ago

We have decided to go back to the way 15.08 called NHC instead of calling
it first before sending a SIGKILL to the step's tasks. With this patch we
only start the NHC early when we have to resend the SIGKILL for unkillable
processes. This will hopefully get us the backtrace of the unkillable
processes which was the reason we did it this way in the first place :).

b4dc9eea
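
The ordering described in this message can be sketched as a small simulation. It assumes that the 15.08 behaviour runs NHC after the step's tasks are gone, which the message does not spell out; the function name, the `exits_after` parameter and the retry limit are hypothetical.

```python
from typing import List


def terminate_step(exits_after: int, max_attempts: int = 3) -> List[str]:
    """Return the ordered actions taken while killing a step.

    exits_after: number of SIGKILLs after which all tasks are gone
    (a hypothetical stand-in for real process state).
    """
    events: List[str] = []
    nhc_started = False
    for attempt in range(1, max_attempts + 1):
        # NHC starts early only when SIGKILL has to be sent again.
        if attempt > 1 and not nhc_started:
            events.append("nhc")
            nhc_started = True
        events.append("sigkill")
        if attempt >= exits_after:
            break
    # Normal case: NHC runs after the tasks have exited.
    if not nhc_started:
        events.append("nhc")
    return events
```

A step whose tasks die on the first signal produces SIGKILL followed by NHC, while a step with unkillable processes gets NHC started before the second SIGKILL so that a backtrace can be captured.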

CRAY - Simplify when a NHC is called on a step that has unkillable · 603ae198
Danny Auble authored 8 years ago

processes.

603ae198

Update SLUG agenda for 2016 · a9c3ea71
Morris Jette authored 8 years ago

a9c3ea71

Jul 12, 2016
- Fix test1.29 / 17.15 for limits above 32-bits. · 0b8bbc00
  Nicolas Joly authored 8 years ago
  
  Bug 2892.
  0b8bbc00
- CRAY - Fix for reporting steps lingering after they are already finished. · cd06d0f9
  Danny Auble authored 8 years ago
  
  Bug 2874. We will most likely redo this logic (as it appears to be duplicated) in a following patch.
  cd06d0f9
- Fix for burst_buffer/cray batch submit error · 7cdcc25c
  Morris Jette authored 8 years ago
  
  Don't generate an error when a batch job is submitted that must wait for stage-in before starting.
  7cdcc25c