Commits · a912fb3b88daf5e12c3eb3df410292ac93bd1c0d · tud-zih-energy / Slurm

Mar 11, 2016
- Merge branch 'slurm-14.11' into slurm-15.08 · a912fb3b
  Tim Wickberg authored 9 years ago
  
  a912fb3b
- Fix job array step function printout. · 03d29e24
  Tim Wickberg authored 9 years ago
  
  Return [0-100:2] formatting, rather than [0,2,4,6,8,...], when using a step function. Was inadvertently broken in 14.11 with commit 5ffdca92. Bug 2535.
  03d29e24
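
As an illustration of the formatting described in 03d29e24, here is a minimal Python sketch that collapses a constant-stride list of array task IDs into the compact `first-last:step` form. It is not the code from the commit; the function name `format_task_ids` and the fallback behaviour are assumptions made for this example.

```python
def format_task_ids(ids):
    """Render array task IDs in bracket notation (illustrative only)."""
    ids = sorted(set(ids))
    if len(ids) >= 3:
        step = ids[1] - ids[0]
        # A constant stride greater than 1 can be written as first-last:step.
        if step > 1 and all(b - a == step for a, b in zip(ids, ids[1:])):
            return f"[{ids[0]}-{ids[-1]}:{step}]"
    # Fallback: plain comma-separated list of IDs.
    return "[" + ",".join(str(i) for i in ids) + "]"


print(format_task_ids(range(0, 101, 2)))
print(format_task_ids([1, 2, 5]))
```

The first call yields the step-function form, while the irregular list falls back to the comma-separated form.
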
Mar 10, 2016

Add NEWS for commit 3bb2e602 · a0be0dc5
Morris Jette authored 9 years ago

a0be0dc5

Cray Datawarp job requeue bug fix · 3bb2e602

Morris Jette authored 9 years ago

burst_buffer/cray plugin: Prevent a requeued job from being restarted while
    file stage-out is still in progress. Previous logic could restart the job
    and not perform a new stage-in.
bug 2584, comment #45

3bb2e602

Mar 09, 2016

cray job requeue bug · fec5e03b

Morris Jette authored 9 years ago

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees that the "cleaned" flag is
already set, so it logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.

Bug 2384

fec5e03b
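
The failure mode described in fec5e03b can be modelled with a small state machine. The Python sketch below is a simplified illustration, not the select/cray plugin code: the class and function names are hypothetical, and clearing the flag on requeue is only one plausible remedy, assumed here for the example.

```python
from enum import Enum


class CleanState(Enum):
    NONE = 0
    CLEANING = 1
    CLEANED = 2


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.clean_state = CleanState.NONE
        self.nhc_runs = 0


def job_fini(job):
    """Called at job termination; returns True if the NHC was launched."""
    if job.clean_state is CleanState.CLEANED:
        # Stale flag from an earlier run: the NHC is skipped, so the
        # job's resources would never be released.
        return False
    job.clean_state = CleanState.CLEANING
    job.nhc_runs += 1                       # stand-in for running the NHC
    job.clean_state = CleanState.CLEANED    # NHC finished
    return True


def requeue(job, reset_flag):
    """Requeue the job; reset_flag=True models the assumed remedy."""
    if reset_flag:
        job.clean_state = CleanState.NONE


job = Job(1283858)
job_fini(job)                    # first termination launches the NHC
requeue(job, reset_flag=False)
print(job_fini(job))             # stale "cleaned" flag blocks the NHC
requeue(job, reset_flag=True)
print(job_fini(job))             # flag cleared, NHC runs again
```
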

Correctly parse nids in slurmconfgen_smw.py · 88ccc111

David Gloe authored 9 years ago

An error in slurmconfgen_smw.py caused it to parse the nic as the nid.
On some systems those values differ, causing the generated slurm.conf file to
be incorrect.

Bug 2532.

88ccc111

Mar 08, 2016
- Remove unneeded check introduced in 897c4b27 · ba7dfc75
  Tim Wickberg authored 9 years ago
  
  _set_collectors() already has a run_in_daemon("slurmd") that precludes this from being an issue.
  ba7dfc75
- Fix route/topology plugin to prevent segfault in sbcast. · 897c4b27
  Bill Brophy authored 9 years ago
  
  route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
  897c4b27
- Fix displayed value for RoutePlugin. · 14c51e65
  Tim Wickberg authored 9 years ago
  
  Was incorrectly displaying "(null)" even when loaded successfully.
  14c51e65
- Handle function error for Coverity · e6b7b2c2
  Morris Jette authored 9 years ago
  
  e6b7b2c2
- Fix link to kernel cgroup documentation. · 4c3fd194
  Janne Blomqvist authored 9 years ago
  
  4c3fd194
Mar 07, 2016

add additional tuning notes for mysql/mariadb · 49dc5d8d

Tim Wickberg authored 9 years ago

In particular, it seems that the MariaDB default for
innodb_lock_wait_timeout has been lowered, which can cause issues
for the various rollup processes on systems with high job counts.

49dc5d8d

Mar 05, 2016
- Continuation to commit b294f81b to do the right thing for jobs. · 35f7a262
  Danny Auble authored 9 years ago
  
  35f7a262
- Fixed double read lock on getting job's gres/tres. · b23a57cf
  Danny Auble authored 9 years ago
  
  b23a57cf
Mar 04, 2016
- Continuation of commit 7f0bdc84 · 55a678dd
  Danny Auble authored 9 years ago
  
  Step GRES value changed from type "int" to "int64_t" to support larger values. Signed-off-by: Danny Auble <da@schedmd.com>
  55a678dd
- Fix issue where steps weren't always getting the gres/tres involved. · b294f81b
  Danny Auble authored 9 years ago
  
  b294f81b
Mar 03, 2016

Fix issue with sbcast not doing a correct fanout. · 72f13426

Danny Auble authored 9 years ago

72f13426

Fix getting reservations to database when database is down. · 5c43d754

Brian Christiansen authored 9 years ago

Bug 2507

5c43d754

Increase step GRES variable size · 7f0bdc84

Morris Jette authored 9 years ago

Step GRES value changed from type "int" to "int64_t" to support larger
values. Previous logic could fail for step allocation values that exceed
32 bits. Other GRES values are already 64-bit.

7f0bdc84
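
To show why the wider type in 7f0bdc84 matters, the following Python sketch uses `ctypes` to mimic storing the same value in a 32-bit and a 64-bit signed integer. The requested value is hypothetical, and the sketch assumes a platform where C `int` is 32 bits wide, which is typical but not guaranteed.

```python
import ctypes

# Hypothetical step GRES request larger than 2**31 - 1.
requested = 5 * 1024**3

# ctypes integer types do no overflow checking, so the value is truncated
# the same way it would be when stored in a 32-bit C variable.
as_int32 = ctypes.c_int32(requested).value
as_int64 = ctypes.c_int64(requested).value

print("32-bit storage:", as_int32)
print("64-bit storage:", as_int64)
```

The 32-bit result no longer matches the request, while the 64-bit result preserves it.
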

Force close on exec on first 256 file descriptors when launching a · f502f1e5

Danny Auble authored 9 years ago

slurmstepd to close potential open ones.

It was pointed out that slurmd, when using acct_gather_energy/ipmi, links to
freeipmi, which could open /dev/ipmi0 as root without the close-on-exec
flag set while launching a step, leaving it open in the user's application.

This change sets the flag on the first 256 file descriptors to mitigate the concern.

Reported by Maksym Planeta.

Bug 2506

f502f1e5
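
The mitigation in f502f1e5 can be sketched in Python with the standard `fcntl` module (Unix only). This is an illustration of the technique, not the slurmd code: the function name is hypothetical, and starting at descriptor 3 so that stdin, stdout, and stderr are left alone is an assumption of this example rather than something the commit message states.

```python
import fcntl
import os


def set_cloexec_on_low_fds(limit=256, start=3):
    """Set FD_CLOEXEC on every open descriptor in [start, limit)."""
    marked = []
    for fd in range(start, limit):
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError:
            continue  # descriptor is not open
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        marked.append(fd)
    return marked


# Simulate a library that opened a descriptor without close-on-exec.
r, w = os.pipe()
os.set_inheritable(r, True)

set_cloexec_on_low_fds()
print(bool(fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC))
```

Sweeping a fixed range of descriptors is a blunt approach, but it needs no knowledge of which library opened what, which matches the "mitigate the concern" wording of the commit.
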

Mar 02, 2016
- Backfill scheduler to validate correct job partition · efd9d35e
  Gary B Skouson authored 9 years ago
  
  Previous logic tested whatever the job's partition pointer indicated rather than the partition we are trying to run the job in. This bug was introduced in Slurm version 15.08.5, Nov 16, 2015, commit 94f0e948. Bug 2499.
  efd9d35e
- Move definition to only place used to avoid confusion, continuation of · f257976a
  Danny Auble authored 9 years ago
  
  patch 2d5066e7
  f257976a
- Remove a duplicate xmalloc · 2d5066e7
  Thomas Cadeau authored 9 years ago
  
  2d5066e7
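
The backfill fix in efd9d35e concerns which partition a check is applied to. The Python sketch below illustrates that class of bug only; the data structures, the `max_nodes` limit, and the function name are hypothetical and do not reflect the scheduler's actual data model or the specific test the commit changed.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    name: str
    max_nodes: int  # hypothetical limit used only for this illustration


@dataclass
class Job:
    nodes: int
    part_ptr: Partition                             # partition the job currently points at
    candidates: list = field(default_factory=list)  # partitions the job may run in


def runnable_partitions(job, use_candidate):
    """Return names of candidate partitions that pass the limit check."""
    ok = []
    for part in job.candidates:
        # Buggy variant: always checks job.part_ptr.
        # Corrected variant: checks the partition being evaluated.
        checked = part if use_candidate else job.part_ptr
        if job.nodes <= checked.max_nodes:
            ok.append(part.name)
    return ok


small = Partition("small", max_nodes=4)
large = Partition("large", max_nodes=64)
job = Job(nodes=16, part_ptr=small, candidates=[small, large])

print(runnable_partitions(job, use_candidate=False))
print(runnable_partitions(job, use_candidate=True))
```
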
Mar 01, 2016
- slight alteration to include 127.0.0.1 to commit 65820eca · 7a8e543f
  Danny Auble authored 9 years ago
  
  7a8e543f
- Explain some node address issues · 65820eca
  Morris Jette authored 9 years ago
  
  bug 2496
  65820eca
- Update NEWS as well. · a058ff4a
  Tim Wickberg authored 9 years ago
  
  a058ff4a
- Remove old presentations and design notes (circa 2003). · fbf86521
  Tim Wickberg authored 9 years ago
  
  Distribute only the maintained doc/html and doc/man directories. Reports and notes are a historical artifact that can be found in git tags if necessary, but have no value for modern installations.
  fbf86521
- Simplify Makefile.am for doc/ and run autogen.sh · 0264cb75
  Tim Wickberg authored 9 years ago
  
  0264cb75
- run autogen.sh with automake 1.15 · 48f36224
  Tim Wickberg authored 9 years ago
  
  48f36224
- Defer suspend until launch completes · d2cd18d1
  Morris Jette authored 9 years ago
  
  This fixes a bug introduced in commit 52fe3de1 in the event the fork() call fails in slurmstepd.
  d2cd18d1
- Defer suspend until launch completes · 52fe3de1
  Morris Jette authored 9 years ago
  
  Ensure that a job is completely launched before trying to suspend it. Previous logic would start suspend logic early in the life of the slurmstepd process, after its listening socket was open but before the tasks were launched. This defers the suspend logic until after all prologs and setup completes and the tasks are launched. This is important in the case of gang scheduling, in which newly launched jobs can be immediately suspended. Bug 2494.
  52fe3de1
- Add "JobId=" to some log messages for better clarity · 1a7b4f62
  Morris Jette authored 9 years ago
  
  1a7b4f62
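
The ordering described in 52fe3de1 ("Defer suspend until launch completes") can be sketched with a simple synchronization primitive. The Python example below is an analogy built on `threading.Event`, not the slurmstepd implementation; the `Step` class and its methods are hypothetical.

```python
import threading


class Step:
    def __init__(self):
        self.launched = threading.Event()
        self.events = []

    def launch(self):
        self.events.append("prolog")
        self.events.append("tasks started")
        self.launched.set()  # signal that launch has fully completed

    def suspend(self, timeout=None):
        # Wait for launch to finish instead of suspending a half-started step.
        if not self.launched.wait(timeout):
            return False
        self.events.append("suspended")
        return True


step = Step()
# The suspend request arrives first, as can happen with gang scheduling.
suspender = threading.Thread(target=step.suspend)
suspender.start()
step.launch()
suspender.join()
print(step.events)
```

A suspend request that arrives before launch completes simply waits, so the suspend always happens after the tasks have started.
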
Feb 29, 2016
- Fix test21.21 to work when AccountingStorageEnforce=safe isn't set. · 9e2e2f15
  Danny Auble authored 9 years ago
  
  Bug 1976
  9e2e2f15
Feb 26, 2016
- Set correct reason when a QOS' MaxTresMins is violated. · 745568f2
  Danny Auble authored 9 years ago
  
  745568f2
- Add note to slurm.conf man page about SallocDefaultCommand and TaskPlugins. · b5b349b0
  Tim Wickberg authored 9 years ago
  
  Add note to slurm.conf man page about setting "--cpu_bind=no" as part of SallocDefaultCommand if a TaskPlugin is in use.
  b5b349b0
- Replace goto with break · e990c183
  Maksym Planeta authored 9 years ago
  
  e990c183
- fix limitation in test · ac6b1c34
  Bjørn-Helge Mevik authored 9 years ago
  
  Test 14.10 in the test suite (of slurm 15.08.8, at least) uses $sinfo -tidle -h -o%n to find idle nodes. This only works if NodeHostname == NodeName on the nodes. The following should work regardless of this: $scontrol show hostnames \$($sinfo -tidle -h -o%N)
  ac6b1c34
- Grammatical nit in srun(1). · 3c2676ec
  Tim Wickberg authored 9 years ago
  
  3c2676ec
Feb 25, 2016

Add missing definition for val_to_char() · 344c74fc

Tim Wickberg authored 9 years ago

Since the function is inlined, the single definition let GCC build everything
properly, but debug builds (which disable inlining) resulted in:
slurmstepd: [465.0]: symbol lookup error:
(trimmed path)/task_cgroup.so: undefined symbol: val_to_char
when running srun --cpu_bind=v.

task/affinity had this definition already, task/cgroup didn't.

344c74fc

Fix for uninitialized memory · c0509864

Morris Jette authored 9 years ago

Reported by valgrind running test7.2, but shouldn't cause any real problem.

c0509864