Commits · 60b58b7020f89035f447045668d9d9f3c99c8288 · tud-zih-energy / Slurm

Mar 17, 2016

Prevent uid update from corrupting assoc_hash table. · 60b58b70

Tim Wickberg authored 9 years ago

The uid is used as part of the hash function, so the old reference must be
removed and the hash recalculated whenever the uid may change. Otherwise
_delete_assoc_hash will not find the entry again when the association is
removed, causing slurmctld to segfault.

Bug 2560.

60b58b70
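The failure mode described in this commit can be illustrated with a minimal sketch. It is not Slurm's actual assoc_hash code: the `assoc_t` struct, the table size, the hash formula, and every function name below are hypothetical and chosen only to show why a field that feeds the hash must be changed through a remove, update, re-insert sequence.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8 /* hypothetical bucket count */

/* Hypothetical, simplified association record (not Slurm's real struct). */
typedef struct assoc {
    unsigned id;
    unsigned uid;
    struct assoc *next; /* chain within one bucket */
} assoc_t;

static assoc_t *table[TABLE_SIZE];

/* The uid participates in the hash, so changing it changes the bucket. */
static size_t bucket_of(const assoc_t *a)
{
    return (a->id + a->uid) % TABLE_SIZE;
}

static void table_insert(assoc_t *a)
{
    size_t b = bucket_of(a);
    a->next = table[b];
    table[b] = a;
}

/* Returns 1 if the record was found and unlinked, 0 otherwise. */
static int table_remove(assoc_t *a)
{
    assoc_t **pp = &table[bucket_of(a)];
    for (; *pp; pp = &(*pp)->next) {
        if (*pp == a) {
            *pp = a->next;
            a->next = NULL;
            return 1;
        }
    }
    return 0;
}

/* Safe update: unlink under the old key, change the uid, re-insert. */
static void assoc_set_uid(assoc_t *a, unsigned new_uid)
{
    int found = table_remove(a);
    assert(found);
    (void)found;
    a->uid = new_uid;
    table_insert(a);
}
```

If the uid were instead assigned in place, a later `table_remove()` would search the bucket computed from the new uid and miss the record, leaving a stale pointer in the old bucket.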

Mar 11, 2016

Fix job array step function printout. · 03d29e24

Tim Wickberg authored 9 years ago

Return [0-100:2] formatting, rather than [0,2,4,6,8,...] when using
a step function.

Was inadvertently broken in 14.11 with commit 5ffdca92.

Bug 2535.

03d29e24
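The compact form mentioned in this commit can be sketched as follows. This is an illustrative helper, not the Slurm implementation: the name `format_task_ids` and its signature are assumptions, and the input is assumed to be sorted in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format ascending task ids as "[first-last:step]"
 * when they are evenly spaced, or as an explicit list otherwise.
 * Returns the untruncated length, like snprintf().
 */
static int format_task_ids(const unsigned *ids, size_t n,
                           char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    assert(ids != NULL || n == 0);

    if (n >= 3) {
        unsigned step = ids[1] - ids[0];
        size_t i = 2;
        while (i < n && ids[i] - ids[i - 1] == step)
            i++;
        if (i == n && step > 1)
            return snprintf(buf, size, "[%u-%u:%u]",
                            ids[0], ids[n - 1], step);
        if (i == n && step == 1)
            return snprintf(buf, size, "[%u-%u]", ids[0], ids[n - 1]);
    }

    /* Fallback: explicit comma-separated list, never writing past size. */
    int total = snprintf(buf, size, "[");
    for (size_t i = 0; i < n; i++) {
        size_t off = (size_t)total < size ? (size_t)total : size - 1;
        total += snprintf(buf + off, size - off, "%s%u",
                          i ? "," : "", ids[i]);
    }
    size_t off = (size_t)total < size ? (size_t)total : size - 1;
    total += snprintf(buf + off, size - off, "]");
    return total;
}
```

With this sketch, the ids 0, 2, 4, 6, 8 are rendered as `[0-8:2]`, while an irregular set such as 1, 2, 5 falls back to `[1,2,5]`.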

Feb 24, 2016
- BGQ - Tighter locks around structures when nodes/cables change state. · c5925f41
  Danny Auble authored 9 years ago
  
  c5925f41
- BGQ - Remove redeclaration of job_read_lock. · fd3dedda
  Danny Auble authored 9 years ago
  
  fd3dedda
Jan 14, 2016

fix AuthInfo with alternate munge socket location · f3d54f99

Morris Jette authored 9 years ago

Fix for configuration of "AuthType=munge" and "AuthInfo=socket=..." with
    alternate munge socket path.
bug 2348

f3d54f99

Jan 07, 2016
- Fix dependency string printing to include task id numbers if set. · da4f5684
  Tim Wickberg authored 9 years ago
  
  Bug 2314.
  da4f5684
Jan 05, 2016
- Start NEWS for v14.11.12 · a91c3069
  Morris Jette authored 9 years ago
  
  a91c3069
- Update META for v14.11.11 tag · 7b916f46
  Morris Jette authored 9 years ago
  
  7b916f46
Dec 31, 2015
- Fix buffer overflow caused by jobid2str() with an inadequately sized buffer · cb5046ca
  Tim Wickberg authored 9 years ago
  
  Later releases have switched over to snprintf to avoid this issue, but 14.11 did not get that patch. Bug 2295.
  cb5046ca
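The snprintf approach mentioned in this commit can be sketched as below. The helper is hypothetical and does not reproduce the real `jobid2str()` signature or output format. It only shows how a bounded write reports truncation instead of overrunning the caller's buffer.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not Slurm's jobid2str()): write a job/task id
 * pair into buf. Returns 0 on success, -1 if the text did not fit.
 * snprintf() never writes more than size bytes and always terminates
 * the string when size > 0.
 */
static int format_job_id(unsigned job_id, unsigned task_id,
                         char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, "JobID=%u_%u", job_id, task_id);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A caller with an 8-byte buffer gets a truncated but terminated string and a -1 return, rather than a write past the end of the buffer.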
Dec 15, 2015

Fix potential memory corruption in _slurm_rpc_epilog_complete as well as · 4f857787

Danny Auble authored 9 years ago

_slurm_rpc_complete_job_allocation.

This is a rewrite of 438365ec, which didn't catch that the job_ptr wasn't
in a lock, so the memory issue could still have existed.  This hopefully fixes
all the spots where the job_ptr wasn't in the lock.

Fixes bug 2146

4f857787

Revert "Remove jobid2str() based on bug #2146." · dcadbb7c
Danny Auble authored 9 years ago
```
This reverts commit 438365ec.
```
dcadbb7c

Nov 25, 2015
- Update NEWS · cd630a25
  David Bigagli authored 9 years ago
  
  cd630a25
- Remove jobid2str() based on bug #2146. · 438365ec
  David Bigagli authored 9 years ago
  
  438365ec
Nov 16, 2015
- Log the request to terminate a job at info level if DebugFlags includes · 8dcbe1bb
  David Bigagli authored 9 years ago
  
  the Steps keyword.
  8dcbe1bb
Nov 13, 2015
- Update NEWS · 00615dbf
  David Bigagli authored 9 years ago
  
  00615dbf
- Fix the qstat wrapper when user is removed from the system · 0ef37477
  Andrew Wettstein authored 9 years ago
  
  but still has running jobs.
  0ef37477
Nov 04, 2015
- Fix systemd's slurmd service from killing slurmstepds on shutdown. · 508f866e
  Brian Christiansen authored 9 years ago
  
  Bug 2095
  508f866e
Oct 22, 2015
- Update NEWS for start of v14.11.11 work · a5f7dce8
  Morris Jette authored 9 years ago
  
  a5f7dce8
- Update META for v14.11.10 tag · 752a17e1
  Morris Jette authored 9 years ago
  
  752a17e1
Oct 19, 2015
- Prevent slurmstepd from core dumping. · 52b7dd04
  David Bigagli authored 9 years ago
  
  52b7dd04
Oct 09, 2015
- Add more information in the slurmd log should stepd fail to send reply · 4b68f260
  David Bigagli authored 9 years ago
  
  about job setup.
  4b68f260
Oct 07, 2015
- Fix sacct -j, (nothing but a comma) to not return all jobs. · d5979ef6
  Danny Auble authored 9 years ago
  
  d5979ef6
- Fix issue with sacct, printing 0_0 for arrays that had finished in the · 75ea13a3
  Danny Auble authored 9 years ago
  
  database but the start record hadn't made it yet.
  75ea13a3
Oct 06, 2015
- Permit job_submit plugin to set a job's priority · 3b5f13fa
  Thomas Cadeau authored 9 years ago
  
  bug 2011
  3b5f13fa
- Fix sacct to not return all jobs if the -j option is given with a trailing · 2646e761
  Danny Auble authored 9 years ago
  
  ','.
  2646e761
- Propagate sbatch "--dist=plane=#" option to srun. · 6868906b
  Morris Jette authored 9 years ago
  
  bug 1999
  6868906b
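The two sacct -j fixes listed above (Oct 07 and Oct 06) share one idea: a job list consisting of a bare or trailing comma must not be treated as an empty filter that matches every job. A minimal sketch of such parsing follows. It is not sacct's actual code, and the function name `parse_job_list` is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical parser: split "10,20,30" into numeric job ids, skipping
 * empty tokens such as those produced by "," or "10,". Returns the
 * number of ids stored (at most max).
 */
static size_t parse_job_list(const char *arg, unsigned long *ids, size_t max)
{
    size_t n = 0;
    const char *p = arg;

    assert(arg != NULL && ids != NULL);
    while (*p && n < max) {
        size_t len = strcspn(p, ","); /* length of the current token */
        if (len > 0 && isdigit((unsigned char)*p)) {
            char *end;
            unsigned long v = strtoul(p, &end, 10);
            if (end == p + len) /* the whole token was numeric */
                ids[n++] = v;
        }
        p += len;
        if (*p == ',')
            p++;
    }
    return n;
}
```

A caller would then distinguish "no -j option given" (no filter) from "-j given but zero ids parsed" (match nothing), instead of letting both cases fall through to the unfiltered query.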
Oct 05, 2015
- Include header for clean BGQ/Cray build · 3d601061
  jette authored 9 years ago
  
  3d601061
Oct 03, 2015

Don't requeue RPCs from slurmctld to DOWN nodes · f4ea9dec

Morris Jette authored 9 years ago

Don't requeue RPC going out from slurmctld to DOWN nodes (can generate
    repeating communication errors).
bug 2002

f4ea9dec

Oct 02, 2015

Don't mark powered down node as not responding · 8c03a8bc

Morris Jette authored 9 years ago

This will only happen if a PING RPC for the node is already queued
  when the decision is made to power it down, then fails to get
  a response for the ping (since the node is already down).
bug 1995

8c03a8bc

Sep 30, 2015

Reset job CPU count if CPUs/task ratio increased for mem limit · 836912bf

Morris Jette authored 9 years ago

If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU,
then increase its allocated CPU count in order to enforce CPU limits.
Previous logic would increase/set the cpus_per_task as needed if a
job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT
increase the min_cpus or max_cpus variables. This resulted in allocating
the wrong CPU count.

836912bf

Enable srun -I to use pending step logic. · 0bf0e71f
Brian Christiansen authored 9 years ago
```
Continuation of 1252d1a1
Bug 1938
```
0bf0e71f

Don't start duplicate batch job · c1513956

Morris Jette authored 9 years ago

Requeue/hold batch job launch request if job already running. This is
  possible if node went to DOWN state, but jobs remained active.
In addition, if a prolog/epilog failed DRAIN the node rather than
  setting it down, which could kill jobs that could continue to
  run.
bug 1985

c1513956

Sep 29, 2015
- Improve job_completion logging · 4cebe297
  Morris Jette authored 9 years ago
  
  Previous logic would not report termination signal, only exit code, which could be meaningless.
  4cebe297
- Fix srun -I<timeout> from flooding the controller with step create requests. · 1252d1a1
  Brian Christiansen authored 9 years ago
  
  Bug 1938
  1252d1a1
- Fix updating job in db after extending job's timelimit past partition's timelimit. · 7a0836fc
  Brian Christiansen authored 9 years ago
  
  Bug 1984
  7a0836fc
Sep 28, 2015

Fix for node state when shrinking jobs · 6c9d4540

Morris Jette authored 9 years ago

When nodes have been allocated to a job and then released by the
  job while resizing, this patch prevents the nodes from continuing
  to appear allocated and unavailable to other jobs. Requires
  exclusive node allocation to trigger. This prevents the previously
  reported failure, but a proper fix will be quite complex and
  delayed to the next major release of Slurm (v 16.05).
bug 1851

6c9d4540

Sep 23, 2015
- For pending jobs have sacct print 0 for nnodes instead of the bogus 2. · 71287134
  Danny Auble authored 9 years ago
  
  The 2 came from the nodelist being "None assigned", which would be treated as 2 hosts when sent into hostlist.
  71287134
Sep 22, 2015

fix for group with split entries in /etc/group · 803a0b4c

Brian Gilmer authored 9 years ago

If user belongs to a group which has split entries in /etc/group
    search for its username in all groups.
Amendment to commit 93ead71a
bug 1738

803a0b4c
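The lookup described in this commit can be sketched against an in-memory table. The `group_entry_t` type and `user_in_group` function are hypothetical stand-ins for iterating over /etc/group records. The point is only that the search must continue past the first entry with a matching gid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one line of /etc/group. */
typedef struct {
    unsigned gid;
    const char *const *members; /* NULL-terminated list of user names */
} group_entry_t;

/*
 * Returns 1 if user appears in ANY entry with the given gid. A group
 * may be split across several entries, so the loop does not stop at
 * the first entry whose gid matches.
 */
static int user_in_group(const group_entry_t *entries, size_t n,
                         unsigned gid, const char *user)
{
    assert(entries != NULL && user != NULL);
    for (size_t i = 0; i < n; i++) {
        if (entries[i].gid != gid)
            continue;
        for (const char *const *m = entries[i].members; m && *m; m++) {
            if (strcmp(*m, user) == 0)
                return 1;
        }
    }
    return 0;
}
```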

Correct job count limit logic for job arrays · add3d8cd

Danny Auble authored 9 years ago

Correct counting for job array limits; job count limit underflow was possible
    when the master job record was cancelled.
bug 1952

add3d8cd

Sep 21, 2015

Addition to last commit to do the same thing for removing. · 16e9399d

Danny Auble authored 9 years ago

Also a very minor sanity check in job_mgr.c to make sure we at least have
a task count.  This shouldn't matter, but just to be as robust as possible.

16e9399d