- Mar 12, 2012
Danny Auble authored
Conflicts: src/plugins/select/bluegene/bg_dynamic_block.c
Danny Auble authored
the queue when trying to place a larger than midplane job.
Morris Jette authored
- Mar 11, 2012
Danny Auble authored
- Mar 09, 2012
Danny Auble authored
Danny Auble authored
being honored for QOS.
Danny Auble authored
Danny Auble authored
Danny Auble authored
- Mar 08, 2012
Danny Auble authored
midplane added. If we don't do that then we are hosed. So we just always add it first to avoid issues.
Danny Auble authored
Danny Auble authored
always right and much easier to come by.
Danny Auble authored
people are polling the system at the exact same time.
Danny Auble authored
Danny Auble authored
that just would be silly.
Danny Auble authored
Morris Jette authored
Morris Jette authored
Morris Jette authored
- Mar 07, 2012
Morris Jette authored
Morris Jette authored
Danny Auble authored
Danny Auble authored
an admin updates the node to idle/resume, the compute nodes will go instantly to idle instead of idle*, which means no response.
- Mar 06, 2012
Danny Auble authored
Danny Auble authored
gone. Previously it had a time limit, which has proven not to be the right thing.
Danny Auble authored
Danny Auble authored
Danny Auble authored
Danny Auble authored
Danny Auble authored
Morris Jette authored
Morris Jette authored
- Mar 02, 2012
Morris Jette authored
In SLURM version 2.4, we now schedule jobs at priority=1 and no longer treat it as a special case.
Morris Jette authored
Morris Jette authored
Morris Jette authored
In the cray/srun wrapper, only include the aprun "-q" option when the srun "--quiet" option is used.
Morris Jette authored
Morris Jette authored
Morris Jette authored
Here's what seems to have happened:
- A job was pending, waiting for resources.
- slurm.conf was changed to remove some nodes, and a scontrol reconfigure was done.
- As a result of the reconfigure, the pending job became non-runnable due to "Requested node configuration is not available". The scheduler set the job state to JOB_FAILED and called delete_job_details.
- scontrol reconfigure was done again.
- read_slurm_conf called _restore_job_dependencies.
- _restore_job_dependencies called build_feature_list for each job in the job list.
- When build_feature_list tried to reference the now-deleted job details for the failed job, it got a segmentation fault.
The problem was reported by a customer on Slurm 2.2.7. I have not been able to reproduce it on 2.4.0-pre3, although the relevant code looks the same. There may be a timing window. The attached patch attempts to fix the problem by adding a check to _restore_job_dependencies: if the job state is JOB_FAILED, the job is skipped.
Regards,
Martin
This is an alternative solution to bug316980fix.patch.
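A minimal sketch of the guard described above, using simplified stand-in types rather than the actual SLURM structures; the field names, the linked-list layout, and the stub build_feature_list are illustrative assumptions, not the verbatim patch:

    #include <stdio.h>

    /* Simplified stand-ins for SLURM's internal job types (assumed). */
    enum job_state { JOB_PENDING, JOB_RUNNING, JOB_FAILED };

    struct job_details { const char *features; };

    struct job_record {
        enum job_state state;
        struct job_details *details;   /* freed once delete_job_details() runs */
        struct job_record *next;
    };

    /* Stand-in for build_feature_list(): dereferences job->details,
     * which is exactly what segfaulted on the failed job. */
    static void build_feature_list(struct job_record *job)
    {
        printf("features: %s\n", job->details->features);
    }

    /* The fix described above: skip failed jobs whose details have
     * already been deleted, so the dangling pointer is never touched. */
    static void restore_job_dependencies(struct job_record *head)
    {
        for (struct job_record *job = head; job; job = job->next) {
            if (job->state == JOB_FAILED)
                continue;   /* details already freed; referencing them would crash */
            build_feature_list(job);
        }
    }

    int main(void)
    {
        struct job_details ok = { "gpu,bigmem" };
        struct job_record failed = { JOB_FAILED, NULL, NULL };   /* details deleted */
        struct job_record pending = { JOB_PENDING, &ok, &failed };

        restore_job_dependencies(&pending);   /* prints once, skips the failed job */
        return 0;
    }

Without the JOB_FAILED check, the loop would call build_feature_list on the failed job and dereference its freed details, reproducing the reported segmentation fault.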
- Mar 01, 2012
Morris Jette authored