Aug 13, 2013
    • select/cons_res - Add test for zero node allocation · e180d341
      jette authored
      I don't see how this could happen, but it might explain something
      reported by Harvard University. In any case, this could prevent
      an infinite loop if the task distribution function is passed a
      job allocation with zero nodes.
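      A minimal sketch of the defensive check described above, assuming a
      hypothetical round-robin distribution routine (the names and types
      below are illustrative, not the actual select/cons_res code):

      /* zero_node_guard.c - illustrative only */
      #include <stdint.h>
      #include <stdio.h>

      /* Distribute task_cnt tasks round-robin over node_cnt nodes.
       * Returns 0 on success, -1 on a zero-node allocation. */
      static int distribute_tasks(uint32_t node_cnt, uint32_t task_cnt,
                                  uint32_t *tasks_per_node)
      {
          if (node_cnt == 0)      /* the added test: without it, the   */
              return -1;          /* while loop below would never exit */

          for (uint32_t n = 0; n < node_cnt; n++)
              tasks_per_node[n] = 0;

          uint32_t assigned = 0;
          while (assigned < task_cnt) {
              /* With node_cnt == 0 this inner loop would be a no-op,
               * "assigned" would never grow, and the outer loop would
               * spin forever - the infinite loop the test prevents. */
              for (uint32_t n = 0; n < node_cnt && assigned < task_cnt; n++) {
                  tasks_per_node[n]++;
                  assigned++;
              }
          }
          return 0;
      }

      int main(void)
      {
          uint32_t per_node[4];
          if (distribute_tasks(0, 8, per_node) < 0)
              printf("error: job allocation has zero nodes\n");
          return 0;
      }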
    • select/cons_res - Avoid extraneous "oversubscribe" error messages · 302d8b3f
      jette authored
      This problem was reported by Harvard University and could be
      reproduced with a command line of "srun -N1 --tasks-per-node=2 -O id".
      With other job types, the error message could be logged many times
      for each job. This change logs the error once per job and only if
      the job request does not include the -O/--overcommit option.
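      A hedged sketch of the once-per-job logging described above; the
      job_record fields and the helper below are hypothetical, not the
      actual cons_res code:

      /* oversubscribe_once.c - illustrative only */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct job_record {
          uint32_t job_id;
          bool overcommit;      /* true when -O/--overcommit was requested */
          bool oversub_logged;  /* hypothetical flag: warned once already  */
      };

      static void check_oversubscribe(struct job_record *job,
                                      int cpus_needed, int cpus_avail)
      {
          if (cpus_needed <= cpus_avail)
              return;
          /* Log once per job, and only when the user did not ask for
           * overcommit; previously this could fire on every pass. */
          if (!job->overcommit && !job->oversub_logged) {
              fprintf(stderr, "error: oversubscribe: job %u wants %d CPUs,"
                      " only %d available\n",
                      job->job_id, cpus_needed, cpus_avail);
              job->oversub_logged = true;
          }
      }

      int main(void)
      {
          struct job_record job = { 42, false, false };
          check_oversubscribe(&job, 2, 1);  /* logged            */
          check_oversubscribe(&job, 2, 1);  /* silent: seen once */
          return 0;
      }

      With overcommit set (the -O case in the reproducer), the message is
      suppressed entirely.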
Jun 03, 2013
    • Fix for job step allocation with required hostlist and exclusive option · 523b1992
      jette authored
      Previously, if the required node had no available CPUs left, other
      nodes in the job allocation would be used.
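      A rough sketch of the corrected selection rule, under hypothetical
      names (the real step-allocation code tracks far more state):

      /* required_nodes_first.c - illustrative only */
      #include <stdbool.h>
      #include <stdio.h>

      /* A step that names required nodes must get its CPUs from those
       * nodes. Return 0 if every required node can host the step, -1 if
       * a required node is out of CPUs - rather than silently taking
       * other nodes from the job allocation, as the old code did. */
      static int pick_step_nodes(const bool *required, const int *cpus_free,
                                 int node_cnt, int cpus_needed)
      {
          for (int n = 0; n < node_cnt; n++) {
              if (required[n] && cpus_free[n] < cpus_needed)
                  return -1;   /* fixed: defer/fail, do not substitute */
          }
          return 0;
      }

      int main(void)
      {
          bool required[2]  = { true, false };
          int  cpus_free[2] = { 0, 4 };
          if (pick_step_nodes(required, cpus_free, 2, 1) < 0)
              printf("step cannot start: required node has no free CPUs\n");
          return 0;
      }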
    • restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our Slurm jobs to restart
      successfully after a checkpoint.  For this test, I'm using sbatch and
      a simple, single-threaded executable.  Slurm is 2.5.4, BLCR is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from blcr in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
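      A minimal sketch of the fix named in the title, using a heavily
      simplified descriptor struct and a hypothetical helper; NO_VAL is
      SLURM's sentinel for an unset 32-bit field:

      /* restore_no_val.c - illustrative only */
      #include <stdint.h>
      #include <stdio.h>

      #define NO_VAL (0xfffffffe)   /* SLURM's "unset" marker */

      typedef struct {              /* simplified job descriptor */
          uint32_t min_nodes;
          uint32_t max_nodes;
      } job_desc_msg_t;

      /* When the job descriptor is written into the checkpoint file, an
       * unset max_nodes must be stored as NO_VAL again. If it leaks
       * through as 0, the restarted job fails validation with
       * "_job_create: max_nodes == 0" and the restart is rejected with
       * "Node count specification invalid", as in the log above. */
      static void ckpt_store_desc(job_desc_msg_t *desc)
      {
          if (desc->max_nodes == 0)
              desc->max_nodes = NO_VAL;
      }

      int main(void)
      {
          job_desc_msg_t desc = { .min_nodes = 1, .max_nodes = 0 };
          ckpt_store_desc(&desc);
          printf("max_nodes stored as 0x%x\n", (unsigned)desc.max_nodes);
          return 0;
      }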