Commits · fec5e03b900aebba21199a922cb75acf182a6abc · tud-zih-energy / Slurm

Mar 09, 2016

Morris Jette authored 9 years ago

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees the "cleaned" flag
already set, so it logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.
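
The flag life cycle described above can be sketched as a small state machine. This is an illustrative model only: the type and function names (`clean_state`, `job_fini`, `nhc_complete`, `job_requeue`) are hypothetical and are not taken from the select/cray plugin, and resetting the flag on requeue is just one plausible way to avoid the problem, not necessarily what this commit does.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified model of the per-job cleaning/cleaned flag. */
enum clean_state { CLEAN_NONE, CLEAN_CLEANING, CLEAN_CLEANED };

struct job {
	unsigned int job_id;
	enum clean_state clean;
	int nhc_launches;	/* how many times the NHC was started */
};

/* Called on job termination. Mirrors the described behavior: if the
 * flag is already "cleaned", log an error and skip the NHC launch. */
bool job_fini(struct job *job)
{
	if (job->clean == CLEAN_CLEANED) {
		fprintf(stderr, "error: cleaned flag already set for job %u\n",
			job->job_id);
		return false;
	}
	job->clean = CLEAN_CLEANING;
	job->nhc_launches++;
	return true;
}

/* Called when the node health check finishes. */
void nhc_complete(struct job *job)
{
	job->clean = CLEAN_CLEANED;
}

/* Assumed remedy for this sketch: clear the flag when the job is requeued,
 * so the next termination launches the NHC again. */
void job_requeue(struct job *job)
{
	job->clean = CLEAN_NONE;
}
```

Without the reset in `job_requeue()`, the second call to `job_fini()` returns early and the NHC completion that would release the job's resources never happens.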

Bug 2384

fec5e03b

Mar 08, 2016
- Remove unneeded check introduced in 897c4b27 · ba7dfc75
  Tim Wickberg authored 9 years ago
  
  _set_collectors() already has a run_in_daemon("slurmd") that precludes this from being an issue.
  ba7dfc75
- Fix route/topology plugin to prevent segfault in sbcast. · 897c4b27
  Bill Brophy authored 9 years ago
  
  route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
  897c4b27
- Fix displayed value for RoutePlugin. · 14c51e65
  Tim Wickberg authored 9 years ago
  
  Was incorrectly displaying "(null)" even when loaded successfully.
  14c51e65
- Handle function error for Coverity · e6b7b2c2
  Morris Jette authored 9 years ago
  
  e6b7b2c2
Mar 05, 2016
- Continuation to commit b294f81b to do the right thing for jobs. · 35f7a262
  Danny Auble authored 9 years ago
  
  35f7a262
- Fixed double read lock on getting job's gres/tres. · b23a57cf
  Danny Auble authored 9 years ago
  
  b23a57cf
Mar 04, 2016
- Continuation of commit 7f0bdc84 · 55a678dd
  Danny Auble authored 9 years ago
  
  Step GRES value changed from type "int" to "int64_t" to support larger values. Signed-off-by: Danny Auble <da@schedmd.com>
  55a678dd
- Fix issue where steps weren't always getting the gres/tres involved. · b294f81b
  Danny Auble authored 9 years ago
  
  b294f81b
Mar 03, 2016

Fix issue with sbcast not doing a correct fanout. · 72f13426
Danny Auble authored 9 years ago

72f13426
Fix getting reservations to database when database is down. · 5c43d754
Brian Christiansen authored 9 years ago
Bug 2507
5c43d754

Increase step GRES variable size · 7f0bdc84

Morris Jette authored 9 years ago

Step GRES value changed from type "int" to "int64_t" to support larger
values. Previous logic could fail for step allocation values over 32 bits.
Other GRES values are 64-bit.
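
A minimal sketch of why the wider type matters, assuming a step's total GRES count is a per-node count multiplied by a node count. The helper names `step_gres_total` and `fits_in_int32` are hypothetical and do not appear in the Slurm source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: total GRES for a step, computed in 64 bits so that
 * large per-node counts (for example, a byte count) do not overflow. */
int64_t step_gres_total(int64_t gres_per_node, int64_t node_cnt)
{
	return gres_per_node * node_cnt;
}

/* True when a count could still be stored in a 32-bit signed int. */
bool fits_in_int32(int64_t gres_cnt)
{
	return gres_cnt >= INT32_MIN && gres_cnt <= INT32_MAX;
}
```

For instance, 1073741824 units on each of 4 nodes gives 4294967296, which `fits_in_int32()` rejects but an `int64_t` holds without loss.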

7f0bdc84

Force close on exec on first 256 file descriptors when launching a · f502f1e5

Danny Auble authored 9 years ago

slurmstepd to close potential open ones.

It was pointed out that slurmd, when using acct_gather_energy/ipmi, links to
freeipmi, which could open /dev/ipmi0 as root without the close-on-exec
flag set while launching a step, leaving it open in the user's application.

This sets the flag on the first 256 file descriptors to mitigate the concern.
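
The mitigation can be sketched with the POSIX `fcntl()` interface. This is a sketch under stated assumptions rather than the code from the commit: the function name `set_cloexec_range` and its range parameters are hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Set FD_CLOEXEC on every open descriptor in [first, limit).
 * Returns the number of descriptors that were flagged. */
int set_cloexec_range(int first, int limit)
{
	int flagged = 0;

	for (int fd = first; fd < limit; fd++) {
		int flags = fcntl(fd, F_GETFD);

		if (flags == -1)
			continue;	/* descriptor is not open */
		if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == 0)
			flagged++;
	}
	return flagged;
}
```

A caller about to `exec()` a child might use `set_cloexec_range(3, 256)` to leave stdin, stdout, and stderr alone. The starting point of 3 is an assumption of this sketch, and descriptors the child is meant to inherit would need the flag cleared again.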

Reported by Maksym Planeta.

Bug 2506

f502f1e5

Mar 02, 2016
- Backfill scheduler to validate correct job partition · efd9d35e
  Gary B Skouson authored 9 years ago
  
  Previous logic tested whatever the job's partition pointer indicated rather than the partition we are trying to run the job in. This bug was introduced in Slurm version 15.08.5, Nov 16, 2015, commit 94f0e948. Bug 2499.
  efd9d35e
- Move definition to only place used to avoid confusion, continuation of · f257976a
  Danny Auble authored 9 years ago
  
  patch 2d5066e7
  f257976a
- Remove a duplicate xmalloc · 2d5066e7
  Thomas Cadeau authored 9 years ago
  
  2d5066e7
Mar 01, 2016

run autogen.sh with automake 1.15 · 48f36224
Tim Wickberg authored 9 years ago

48f36224

Defer suspend until launch completes · d2cd18d1

Morris Jette authored 9 years ago

This fixes a bug introduced in commit 52fe3de1
in the event the fork() call fails in slurmstepd.

d2cd18d1

Defer suspend until launch completes · 52fe3de1

Morris Jette authored 9 years ago

Ensure that a job is completely launched before trying to suspend it.
Previous logic would start suspend logic early in the life of the
slurmstepd process, after its listening socket was open but before
the tasks were launched. This defers the suspend logic until after
all prologs and setup completes and the tasks are launched. This is
important in the case of gang scheduling, in which newly launched
jobs can be immediately suspended.
bug 2494

52fe3de1

Add "JobId=" to some log messages for better clarity · 1a7b4f62
Morris Jette authored 9 years ago

1a7b4f62

Feb 26, 2016
- Set correct reason when a QOS' MaxTresMins is violated. · 745568f2
  Danny Auble authored 9 years ago
  
  745568f2
- Replace goto with break · e990c183
  Maksym Planeta authored 9 years ago
  
  e990c183
Feb 25, 2016

Add missing definition for val_to_char() · 344c74fc

Tim Wickberg authored 9 years ago

Since the function is inlined, the single definition let GCC build everything
properly, but debug builds (which disable inlining) resulted in:
slurmstepd: [465.0]: symbol lookup error:
(trimmed path)/task_cgroup.so: undefined symbol: val_to_char
when running srun --cpu_bind=v.

task/affinity had this definition already, task/cgroup didn't.

344c74fc

Fix for uninitialized memory · c0509864
Morris Jette authored 9 years ago
Reported by valgrind running test7.2, but shouldn't cause any real problem
c0509864
Fix issue where SocketsPerBoard didn't translate to Sockets when CPUS= · fcae2193
Danny Auble authored 9 years ago
was also given.
fcae2193

Feb 24, 2016

Make it so scontrol update part qos= will take away a partition QOS from · 3a7470ae
Danny Auble authored 9 years ago
a partition.
3a7470ae

Make it possible to change CPUsPerTask with scontrol. · de28c13a

Danny Auble authored 9 years ago

This also reverts most of commit fa331e30 as well as commit bd9fa830
which would try to set the pn_min_cpus every time a job was updated.
If a job didn't request node counts then they were hosed.

This commit takes away the magic which was screwing things up.  Now the
person gets what they asked for without magic changing things.

Bug 2302
Bug 2742
Bug 2478

de28c13a

Fix issue where when updating a job the pn_min_cpus was updated · bd9fa830
Danny Auble authored 9 years ago
erroneously.
bd9fa830

Properly handle select_g_select_nodeinfo_get() error · 542ead89

Morris Jette authored 9 years ago

Failure has never been observed, but initialize the used variable
  before calling the function so we don't re-use old data if the
  function returns an error.
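
The pattern described here, initializing an output variable before a call that may fail without writing to it, can be sketched as follows. `get_alloc_cpus` is a hypothetical stand-in for the real getter, and the fixed value it returns is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical getter: on error it returns non-zero and leaves *out untouched. */
int get_alloc_cpus(int node_ok, uint16_t *out)
{
	if (!node_ok)
		return -1;
	*out = 8;	/* made-up value for illustration */
	return 0;
}

/* Sum the used CPUs over a set of nodes. */
uint32_t total_used_cpus(const int *node_ok, int node_cnt)
{
	uint32_t total = 0;

	for (int i = 0; i < node_cnt; i++) {
		/* Initialize before every call so a failed lookup adds 0
		 * rather than a stale or indeterminate value. */
		uint16_t used = 0;

		(void) get_alloc_cpus(node_ok[i], &used);
		total += used;
	}
	return total;
}
```

If `used` were declared once outside the loop and never reset, a failing node would silently contribute the previous node's value to the total.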

542ead89

Rename a variable, no change in logic · 31d67fb5

Morris Jette authored 9 years ago

Rename an improperly named variable in the logic scontrol uses to
  print node information ("total_used" was really "idle_cpus"), so
  the logic looks the same as that used in sinfo to determine node
  state.

31d67fb5

Improve some step allocation logs · a0e3e5de

Morris Jette authored 9 years ago

Include warning for Cray simulation as reminder for developers to
change code as needed.

a0e3e5de

BGQ - Tighter locks around structures when nodes/cables change state. · c5925f41
Danny Auble authored 9 years ago

c5925f41
BGQ - Remove redeclaration of job_read_lock. · fd3dedda
Danny Auble authored 9 years ago

fd3dedda

Feb 23, 2016

select/cray: Log NHC run times over 1 minute · eb58137b
Morris Jette authored 9 years ago

eb58137b

Fix issue with resizing jobs and limits not being tracked correctly. · 92ac0dcd

Danny Auble authored 9 years ago

This whole process could probably be done better by keeping track of
old values and new values and only calling one function instead of a
pre and post function, but that can probably wait for future generations
of the code as it works now and is probably adequate for the time being.

Bug 2352

92ac0dcd

Feb 19, 2016

Replace 'inexistent' with 'non-existent'. No functional change. · f75a90f5
Tim Wickberg authored 9 years ago

f75a90f5

Spelling corrections. No functional changes. · c276bf49

Gennaro Oliva authored 9 years ago

Consistently use American English for
existant -> existent
assocation -> association

Correct some typos, and one grammatical mistake.

c276bf49

BurstBuffer/cray pre-run race condition fix · e8959ae9

Morris Jette authored 9 years ago

BurstBuffer/cray - Defer job cancellation or time limit while "pre-run"
    operation in progress to avoid inconsistent state due to multiple calls
    to job termination functions.
bug 2454

e8959ae9

Move fclose into conditional. · 691b97b0

Tim Wickberg authored 9 years ago

Otherwise fclose(NULL) is called iff the ClusterName is not set and the
clustername file does not exist. Should not happen in production.
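
Passing a null pointer to `fclose()` is undefined behavior in C, so the close belongs inside the branch where `fopen()` succeeded. A minimal sketch of that shape, with a hypothetical helper name (`read_first_line`) that is not taken from the Slurm source:

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a file into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
int read_first_line(const char *path, char *buf, size_t len)
{
	FILE *fp = fopen(path, "r");
	int rc = -1;

	if (fp) {
		if (fgets(buf, (int) len, fp)) {
			buf[strcspn(buf, "\n")] = '\0';
			rc = 0;
		}
		fclose(fp);	/* only reached when fp is a valid stream */
	}
	return rc;
}
```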

Coverity #67041.

691b97b0

backport of commit aa5eb7ef · babfbef9
Morris Jette authored 9 years ago

babfbef9