Commits · 3309d164ca4d0de47b4d09f75160ba043f49defa · tud-zih-energy / Slurm

Jun 05, 2013

srun - Don't check for executable if --test-only flag is used. · 3309d164
Danny Auble authored 11 years ago

3309d164

priority/multifactor2 - Prevent possible divide by zero. · fc3997f9

Janne Blomqvist authored 11 years ago

Andy Wettstein (University of Chicago) reported privately to me that slurmctld
2.5.4 crashed with a division-by-zero error after he enabled the
priority/multifactor2 plugin.

I was able to reproduce the crash by creating an account hierarchy where all
the accounts and users had zero shares.
See bug 315.
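
The failure mode described above can be illustrated with a minimal sketch. This is not the plugin's actual code: the function name `sketch_normalized_share` and its parameters are hypothetical, and it only assumes that a share is normalized by dividing by the sum of the sibling shares, which is zero when every account and user has zero shares.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the Slurm sources.
 * Returns the fraction of the sibling total held by one association.
 * When all siblings have zero shares the total is zero, so the
 * division is skipped and 0.0 is returned instead.
 */
double sketch_normalized_share(uint32_t shares, uint32_t sibling_total)
{
    if (sibling_total == 0)
        return 0.0;    /* guard: avoid dividing by zero */
    return (double) shares / (double) sibling_total;
}
```

Returning 0.0 for an all-zero hierarchy is one possible policy chosen for the sketch; the actual fix may handle that case differently.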

fc3997f9

Jun 04, 2013
- Start NEWS for v2.5.8 · 6102d42c
  Morris Jette authored 11 years ago
  
  6102d42c
- Update META for v2.5.7 tag · cd30b339
  Morris Jette authored 11 years ago
  
  cd30b339
- launch/poe - Fix for hostlist file support with repeated host names. · 58c21140
  jette authored 11 years ago
  
  Without this change, POE appears to ignore the -procs argument, resulting in a job step request with multiple host names but only one ntask required.
  58c21140
Jun 03, 2013

Fix for job step allocation with required hostlist and exclusive option · 523b1992
jette authored 11 years ago

Previously, if the required node had no available CPUs left, other
nodes in the job allocation would be used.

523b1992

restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8

Hongjia Cao authored 11 years ago

We're having some trouble getting our slurm jobs to successfully
restart after a checkpoint.  For this test, I'm using sbatch and a
simple, single-threaded executable.  Slurm is 2.5.4, blcr is 0.8.5.
I'm submitting the job using sbatch:

$ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

$ scontrol checkpoint create 137
.... time passes ....
$ scontrol vacate 137

At that point, I see the checkpoint file from blcr in the current
directory and the checkpoint file from Slurm
in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
restart the job:

$ scontrol checkpoint restart 137
scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

[2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
[2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
[2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
[2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
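
One way to read the commit title together with the `max_nodes == 0` log line is sketched below. Everything here is an assumption made for illustration: the struct, the function names, and the `SKETCH_NO_VAL` sentinel are hypothetical stand-ins, not Slurm's definitions. The idea is that a restored job description carrying a node count of zero is rejected, while a "not specified" sentinel is accepted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a "value not specified" sentinel; the value is assumed. */
#define SKETCH_NO_VAL 0xfffffffeU

/* Hypothetical, heavily reduced job description. */
struct sketch_job_desc {
    uint32_t max_nodes;
};

/* Mirrors the check implied by the log: an explicit count of zero is invalid. */
bool sketch_node_count_valid(const struct sketch_job_desc *desc)
{
    return desc->max_nodes != 0;
}

/* Before re-creating the job, map a stored zero back to "not specified". */
void sketch_restore_max_nodes(struct sketch_job_desc *desc)
{
    if (desc->max_nodes == 0)
        desc->max_nodes = SKETCH_NO_VAL;
}
```

In this sketch, a description read back with `max_nodes == 0` fails validation until `sketch_restore_max_nodes()` has been applied to it.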

f82e0fb8

May 30, 2013
- Select/cons_res - Fix bug resulting in held job · b574168b
  Morris Jette authored 11 years ago
  
  Uninitialized variables resulted in the error "cons_res: sync loop not progressing, holding job #".
  b574168b
May 29, 2013

Fix job step allocation with --exclusive and --hostlist option · 85cab0cb

jette authored 11 years ago

The most notable problem case is on a Cray, where a job step
specifically requests one or more nodes that are not the first
nodes in the job allocation.

85cab0cb

May 23, 2013
- sched/backfill - Modify logic to reduce overhead under heavy load. · 941a5ac9
  Morris Jette authored 11 years ago
  
  The problem we observed is that the backfill scheduler temporarily gives up its locks (for one second) but then reclaims them before the backlog of work completes, which keeps the backfill scheduler running for a very long time under heavy load. See bug 297.
  941a5ac9
- switch/nrt - Correct network_id use logic. · 7faae23f
  Morris Jette authored 11 years ago
  
  7faae23f
- switch/nrt, enable support for --network=sn_all or sn_single options · 4e4013a1
  Morris Jette authored 11 years ago
  
  4e4013a1
- Move scheduling start timer (for sdiag) to remove lock time · 87df6f1a
  Morris Jette authored 11 years ago
  
  87df6f1a
- sdiag documentation update · 958db08c
  Morris Jette authored 11 years ago
  
  Fix minor bug in sdiag backfill scheduling time reported on Bluegene systems. Improve explanation of backfill scheduling cycle time calculation.
  958db08c
- Node reboot logic correction · fcc63508
  Morris Jette authored 11 years ago
  
  Defers (rather than forgets) a reboot request when a job is running on the node within a reservation.
  fcc63508
- CRAY - fix a missing transient print · 164ea1e9
  Danny Auble authored 11 years ago
  
  164ea1e9
- CRAY - Support CLE 4.2.0 · b7b4b7d5
  Danny Auble authored 11 years ago
  
  b7b4b7d5
May 22, 2013
- BGQ - remove unused variable. · b89ac514
  Danny Auble authored 11 years ago
  
  b89ac514
- BGQ - When --geo is requested do not impose the default conn_types. · 8f1d9c6b
  Danny Auble authored 11 years ago
  
  8f1d9c6b
- Expand explanation of MPICH2 build and use · c5f9abb5
  Morris Jette authored 11 years ago
  
  c5f9abb5
- switch/nrt - Validate dynamic window allocation size. · 922251e5
  jette authored 11 years ago
  
  922251e5
- switch/nrt: report window state information in more compact format · a7c45e54
  jette authored 11 years ago
  
  a7c45e54
May 21, 2013
- Clarify documentation for node reboot logic · 1b9e9e64
  Morris Jette authored 11 years ago
  
  1b9e9e64
May 18, 2013
- BGQ - Fix issue with preemption on sub-block jobs where a job would kill · 3a849f26
  Danny Auble authored 11 years ago
  
  all preemptable jobs on the midplane instead of just the ones it needed to.
  3a849f26
- BLUEGENE - remove duplicate definition · f6eaa251
  Danny Auble authored 11 years ago
  
  f6eaa251
May 16, 2013
- Prevent clearing reason field for pending jobs. · 1f8e47ba
  Morris Jette authored 11 years ago
  
  This bug was introduced in commit f1cf6d2d (fix for bug 290).
  1f8e47ba
- POE - pack missing variable to allow fanout (more than 32 nodes) · f45b7e9a
  Danny Auble authored 11 years ago
  
  f45b7e9a
May 14, 2013
- Change comparison to avoid redundant variable reset · bf9e7e44
  Morris Jette authored 11 years ago
  
  bf9e7e44
- Priority/multifactor - Avoid underflow in half-life calculation. · 5d70ccce
  Morris Jette authored 11 years ago
  
  5d70ccce
May 13, 2013
- Add comments to clarify logic · b94720b4
  Morris Jette authored 11 years ago
  
  b94720b4
- Drain node on prolog or epilog failure, rather than downing the node · e43239ae
  Morris Jette authored 11 years ago
  
  Downing the node kills all jobs allocated to it, which is very bad on something like a BlueGene system.
  e43239ae
May 11, 2013
- Update the manpage to better describe the I/O redirection on BGQ systems. · d510ead6
  David Bigagli authored 11 years ago
  
  d510ead6
May 10, 2013

correctly set alloc state of node in select/linear · 0ef764b5

Hongjia Cao authored 11 years ago

Fixes the following problem:
if a node is excised from a job and a reconfiguration (e.g., a partition
update) is done while the job is still running, the node is left
in the idle state but remains unavailable until the next
reconfiguration or restart of slurmctld after the job finishes.

0ef764b5

May 08, 2013
- Update NEWS file for Bug#284. · bae01305
  David Bigagli authored 11 years ago
  
  bae01305
- Bug#284 Fix invalid memory read. · 486e0233
  David Bigagli authored 11 years ago
  
  486e0233
- sview - Fix race condition where new information could have slipped past · 68f0f5db
  Danny Auble authored 11 years ago
  
  the node tab and we didn't notice.
  68f0f5db
May 07, 2013
- Fix the array index from i to node_inx to avoid reading past the array boundary. · ff2ee1b1
  David Bigagli authored 11 years ago
  
  ff2ee1b1
May 04, 2013
- Note the restricted use of the --gid option by the salloc command · b7d5e0ea
  Morris Jette authored 11 years ago
  
  Response to bug 274
  b7d5e0ea
May 03, 2013

Make test more robust · 2592eb5e

jette authored 11 years ago

Make test work if the current working directory is not in the search path
Check for appropriate task rank on POE-based systems
Disable the entire test on POE systems

2592eb5e

May 02, 2013

POE - Fix logic binding tasks to CPUs. · 48e164e0

jette authored 11 years ago

Without this change, pmdv12 was bound to one CPU and could not use
all of the resources allocated to the job step for the tasks that
it launches.

48e164e0