Commits · b0f36e54f6dd046a31dc463b86425115096df290 · tud-zih-energy / Slurm

Mar 10, 2016

capmc_resume: add support for SPLIT MCDRAM mode · b0f36e54
Morris Jette authored 9 years ago

b0f36e54
node_features/knl_cray: add support for SPLIT MCDRAM mode · 52f7256d
Morris Jette authored 9 years ago

52f7256d
Merge branch 'slurm-15.08' · 87df5a43
Morris Jette authored 9 years ago
```
Conflicts:
	NEWS
```
87df5a43

Morris Jette authored 9 years ago

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees the "cleaned" flag
already set, it then logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.

Bug 2384

536c8451
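
The failure mode in this commit message can be illustrated with a small state model. This is only a sketch, not the select/cray plugin code: the `Job` class, the `CleanState` enum, and every function name are hypothetical. The commit message does not say how the fix was implemented, so resetting the flag on requeue (`reset_flag=True`) is an assumed remedy chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CleanState(Enum):
    NONE = 0      # job has not terminated since it was last (re)started
    CLEANING = 1  # job terminated, node health check (NHC) in progress
    CLEANED = 2   # NHC finished, resources released


@dataclass
class Job:
    job_id: int
    state: CleanState = CleanState.NONE
    resources_held: bool = True


def job_fini(job: Job) -> bool:
    """Model job termination; return True if the NHC was launched."""
    if job.state is CleanState.CLEANED:
        # Failure mode from the commit message: the flag is stale from the
        # previous run, so the NHC is never launched.
        return False
    job.state = CleanState.CLEANING
    return True


def nhc_complete(job: Job) -> None:
    """Model NHC completion: mark the job cleaned and release resources."""
    job.state = CleanState.CLEANED
    job.resources_held = False


def requeue(job: Job, reset_flag: bool) -> None:
    """Requeue the job; reset_flag=True models the assumed remedy."""
    job.resources_held = True
    if reset_flag:
        job.state = CleanState.NONE


def run_twice(reset_flag: bool) -> bool:
    """Terminate, requeue, terminate again; report if resources stay held."""
    job = Job(job_id=1)
    if job_fini(job):
        nhc_complete(job)
    requeue(job, reset_flag)
    if job_fini(job):
        nhc_complete(job)
    return job.resources_held
```

With `reset_flag=False` the second termination skips the NHC and the resources stay held, which matches the symptom described above. With `reset_flag=True` they are released.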

Correctly parse nids in slurmconfgen_smw.py · e050806e

David Gloe authored 9 years ago

An error in slurmconfgen_smw.py caused it to parse the nic as the nid.
On some systems those values differ, causing the generated slurm.conf file to
be incorrect.

Bug 2532.

e050806e

Remove unneeded check introduced in · 8072b2cb

Tim Wickberg authored 9 years ago

_set_collectors() already has a run_in_daemon("slurmd") that
precludes this from being an issue.

8072b2cb

Fix route/topology plugin to prevent segfault in sbcast. · 0dfc924c

Bill Brophy authored 9 years ago

route_p_split_hostlist was not thread-safe, and would cause
one of several segfaults depending on where in the initialization
code each thread was.

Bug 2495.

0dfc924c
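
A common way to make this kind of lazily initialized shared state safe is to guard the initialization with a lock. The sketch below assumes the problem was a race during one-time initialization, which the commit message suggests but does not spell out. `RouteTopology`, its methods, and the switch layout are hypothetical and are not the route/topology plugin's API.

```python
import threading


class RouteTopology:
    """Hypothetical stand-in for routing state that is built on first use."""

    def __init__(self, switches):
        self._switches = switches
        self._lock = threading.Lock()
        self._table = None
        self.build_count = 0

    def _build_table(self):
        # Map each node name to the switch it hangs off.
        self.build_count += 1
        return {node: sw for sw, nodes in self._switches.items() for node in nodes}

    def _get_table(self):
        # Check, take the lock, then check again so that only one thread
        # ever builds the table and the others wait for it to finish.
        if self._table is None:
            with self._lock:
                if self._table is None:
                    self._table = self._build_table()
        return self._table

    def split_hostlist(self, hosts):
        """Group hosts by switch using the shared table."""
        table = self._get_table()
        groups = {}
        for host in hosts:
            groups.setdefault(table[host], []).append(host)
        return groups


def demo():
    """Call split_hostlist from several threads at once."""
    topo = RouteTopology({"s0": ["n1", "n2"], "s1": ["n3"]})
    results = []

    def worker():
        results.append(topo.split_hostlist(["n1", "n2", "n3"]))

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return topo.build_count, results
```

However the threads interleave, the table is built once and every caller sees the fully built result.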

Fix displayed value for RoutePlugin. · db8491f1
Tim Wickberg authored 9 years ago
```
Was incorrectly displaying "(null)" even when loaded successfully.
```
db8491f1
Add NEWS for commit 3bb2e602 · a0be0dc5
Morris Jette authored 9 years ago

a0be0dc5

Cray Datawarp job requeue bug fix · 3bb2e602

Morris Jette authored 9 years ago

burst_buffer/cray plugin: Prevent a requeued job from being restarted while
    file stage-out is still in progress. Previous logic could restart the job
    and not perform a new stage-in.
bug 2584, comment #45

3bb2e602

Merge pull request #149 from supermanue/patch-1 · c54cffe5
Morris Jette authored 9 years ago
```
possible bug in smap Makefile
```
c54cffe5

possible bug in smap Makefile · ddeddbfb

Manuel Rodríguez-Pascual authored 9 years ago

LIBS can have a previous value, as depicted in ./configure --help

"Some influential environment variables:
(...)
 LIBS        libraries to pass to the linker, e.g. -l<library>
"
The original assignment to LIBS overwrites this value. With this change, both the user-defined flags and the NCURSES ones are passed to the linker.

ddeddbfb

Mar 09, 2016

cray job requeue bug · fec5e03b

Morris Jette authored 9 years ago

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees the "cleaned" flag
already set, it then logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.

Bug 2384

fec5e03b

Correctly parse nids in slurmconfgen_smw.py · 88ccc111

David Gloe authored 9 years ago

An error in slurmconfgen_smw.py caused it to parse the nic as the nid.
On some systems those values differ, causing the generated slurm.conf file to
be incorrect.

Bug 2532.

88ccc111

sbcast default buffer size set to 8MB · a06452f2
Morris Jette authored 9 years ago
```
This matches the documentation
```
a06452f2

Mar 08, 2016
- Remove unneeded check introduced in 897c4b27 · ba7dfc75
  Tim Wickberg authored 9 years ago
  
  _set_collectors() already has a run_in_daemon("slurmd") that precludes this from being an issue.
  ba7dfc75
- Fix route/topology plugin to prevent segfault in sbcast. · 897c4b27
  Bill Brophy authored 9 years ago
  
  route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
  897c4b27
- Fix displayed value for RoutePlugin. · 14c51e65
  Tim Wickberg authored 9 years ago
  
  Was incorrectly displaying "(null)" even when loaded successfully.
  14c51e65
- node_feature/knl_cray MCDRAM percentages · 73d5519b
  Morris Jette authored 9 years ago
  
  Capture MCDRAM percentages for various configurations from capmc. This assumes the percentages for various configurations will be identical for all nodes within a cluster.
  73d5519b
- Expand Cray KNL documentation · 99a98580
  Morris Jette authored 9 years ago
  
  99a98580
- Merge branch 'slurm-15.08' · 19ad2345
  Morris Jette authored 9 years ago
  
  19ad2345
- Handle function error for Coverity · e6b7b2c2
  Morris Jette authored 9 years ago
  
  e6b7b2c2
- Fix link to kernel cgroup documentation. · 4c3fd194
  Janne Blomqvist authored 9 years ago
  
  4c3fd194
Mar 07, 2016

Add test for --depend=aftercorr · 62f659a5
Morris Jette authored 9 years ago

62f659a5
Fix gres parsing of --gres=gpu:1k · f698c235
Brian Christiansen authored 9 years ago
```
clang found a null dereference issue, which led to finding the parsing error.
```
f698c235

Added per job array task dependencies · c8dd9790

Dominik Bartkiewicz authored 9 years ago

Added new job dependency type of "aftercorr" which will start a task of a
    job array after the corresponding task of another job array completes.
bug 2460

c8dd9790
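
The "corresponding task" rule can be modeled in a few lines. This is a sketch of the semantics as described in the commit message, not scheduler code. `runnable_tasks` is a hypothetical helper, and treating a dependent task with no matching upstream task as not runnable is an assumption.

```python
def runnable_tasks(dependent_tasks, upstream_done):
    """Return the dependent array task IDs that may start under aftercorr.

    dependent_tasks: task IDs of the dependent job array.
    upstream_done:   task IDs of the other job array that have completed.
    A dependent task is eligible only once the upstream task with the
    same ID has completed; other upstream tasks do not matter.
    """
    done = set(upstream_done)
    return sorted(t for t in dependent_tasks if t in done)
```

For example, if upstream tasks 0 and 2 have completed, `runnable_tasks(range(4), {0, 2})` yields tasks 0 and 2, while tasks 1 and 3 keep waiting for their counterparts.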

add additional tuning notes for mysql/mariadb · 49dc5d8d

Tim Wickberg authored 9 years ago

In particular, it seems that MariaDB has lowered the default for
innodb_lock_wait_timeout, which can cause issues
for the various rollup processes on systems with high job counts.

49dc5d8d

Mar 05, 2016
- Fix some node reboot timing · e7cd9c24
  Morris Jette authored 9 years ago
  
  Fix some timing issues with respect to rebooting a node, especially a KNL node needing a reboot to change configuration.
  e7cd9c24
- Make it so jobs/steps track ':' named gres/tres · 0cd69296
  Danny Auble authored 9 years ago
  
  Beforehand, gres/gpu:tesla would only track gres/gpu. Now it will track both gres/gpu and gres/gpu:tesla as separate gres if configured like AccountingStorageTRES=gres/gpu,gres/gpu:tesla
  0cd69296
- Merge remote-tracking branch 'origin/slurm-15.08' · 7c9cc617
  Danny Auble authored 9 years ago
  
  7c9cc617
- Continuation to commit b294f81b to do the right thing for jobs. · 35f7a262
  Danny Auble authored 9 years ago
  
  35f7a262
- Merge remote-tracking branch 'origin/slurm-15.08' · 0a8e2d43
  Danny Auble authored 9 years ago
  
  0a8e2d43
- Fixed double read lock on getting job's gres/tres. · b23a57cf
  Danny Auble authored 9 years ago
  
  b23a57cf
- Move common code into a single function. · 7153bbfd
  Danny Auble authored 9 years ago
  
  This also allows requests like --gres=gpu:tesla. Before, you needed to give a count (--gres=gpu:tesla:1); now both should work.
  7153bbfd
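
Two of the Mar 05 entries above concern typed GRES requests: accepting `--gres=gpu:tesla` without a count, and tracking `gres/gpu` and `gres/gpu:tesla` as separate TRES. The sketch below models both behaviors. It is not Slurm's parser: `parse_gres` and `tracked_tres` are hypothetical helpers, the default count of 1 is inferred from the commit message treating `gpu:tesla` and `gpu:tesla:1` alike, and count suffixes such as `1k` are deliberately not handled.

```python
def parse_gres(spec):
    """Parse 'name[:type][:count]' into (name, type, count).

    Assumption: an omitted count means 1. Suffixed counts are out of scope.
    """
    parts = spec.split(":")
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"unsupported gres spec: {spec!r}")
    name, gres_type, count = parts[0], None, 1
    if len(parts) == 2:
        # The second field is either a count ('gpu:2') or a type ('gpu:tesla').
        if parts[1].isdigit():
            count = int(parts[1])
        else:
            gres_type = parts[1]
    elif len(parts) == 3:
        gres_type, count = parts[1], int(parts[2])
    return name, gres_type, count


def tracked_tres(spec, configured):
    """Return {tres_name: count} for each configured TRES the request matches.

    A typed request counts toward both the generic and the typed TRES,
    provided each one appears in the configured set.
    """
    name, gres_type, count = parse_gres(spec)
    candidates = [f"gres/{name}"]
    if gres_type is not None:
        candidates.append(f"gres/{name}:{gres_type}")
    return {tres: count for tres in candidates if tres in configured}
```

With `configured = {"gres/gpu", "gres/gpu:tesla"}`, a request for `gpu:tesla:2` is recorded against both entries. With only `gres/gpu` configured, it is recorded against the generic entry alone.
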
Mar 04, 2016
- Merge remote-tracking branch 'origin/slurm-15.08' · 90935701
  Danny Auble authored 9 years ago
  
  90935701
- Continuation of commit 7f0bdc84 · 55a678dd
  Danny Auble authored 9 years ago
  
  Step GRES value changed from type "int" to "int64_t" to support larger values. Signed-off-by: Danny Auble <da@schedmd.com>
  55a678dd
- Merge remote-tracking branch 'origin/slurm-15.08' · cdb6ab5d
  Danny Auble authored 9 years ago
  
  cdb6ab5d
- Fix issue where steps weren't always getting the gres/tres involved. · b294f81b
  Danny Auble authored 9 years ago
  
  b294f81b
- parsing of scheduling parameters · 9beeb3a6
  Morris Jette authored 9 years ago
  
  These changes apply to both the main scheduling logic and the backfill scheduler. If a SchedulerParameters value was configured, slurmctld started, the value then removed completely, and slurmctld reconfigured, the value would not be reset to its default. Instead, the originally configured value would persist until slurmctld restarted.
  9beeb3a6
- Fix NEWS entry. · d2b913a2
  Brian Christiansen authored 9 years ago
  
  Continuation of 31225a82
  d2b913a2
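
The SchedulerParameters entry above (9beeb3a6) describes a stale-value problem on reconfiguration. A minimal sketch of that pattern follows. It is not slurmctld code: the function names are hypothetical, the default values are illustrative only, and starting each reconfiguration from the defaults is an assumed remedy, since the commit message does not describe the actual change.

```python
# Illustrative defaults only; not taken from the Slurm source.
DEFAULTS = {"bf_interval": 30, "default_queue_depth": 100}


def parse_params(params):
    """Parse 'key=value,key=value' into a dict; None or '' yields {}."""
    result = {}
    for item in (params or "").split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        result[key] = int(value)
    return result


def reconfigure_buggy(active, params):
    """Only overwrite keys that are present, so removed keys keep old values."""
    updated = dict(active)
    updated.update(parse_params(params))
    return updated


def reconfigure_fixed(params):
    """Start from the defaults every time, then apply what is configured."""
    updated = dict(DEFAULTS)
    updated.update(parse_params(params))
    return updated
```

Starting from `bf_interval=60` and then removing SchedulerParameters entirely, `reconfigure_buggy` keeps 60 while `reconfigure_fixed` returns to the default.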