- Nov 28, 2012
Danny Auble authored
you query against that with -N and -E you will get all jobs during that time instead of only the jobs that ran on the nodes given with -N.
Signed-off-by: Danny Auble <da@schedmd.com>
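For example, a time-windowed query such as the following (the node name and dates are illustrative) should now return only the jobs that ran on the given node in that window:

    sacct -N bgq0001 -S 2012-11-01 -E 2012-11-28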
- Nov 27, 2012
Danny Auble authored
Danny Auble authored
was already in error and isn't deallocating, and the underlying hardware goes bad, one could get overlapping blocks in error, making the code assert when a new job request comes in.
Danny Auble authored
overcommit.
- Nov 20, 2012
Danny Auble authored
slurmctld restart.
Morris Jette authored
- Nov 19, 2012
Danny Auble authored
allocation.
Morris Jette authored
NOTE: If you were setting the environment variable SLURMSTEPD_OOM_ADJ=-17, it should now be set to -1000 for Linux kernel 2.6.36 or later.
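For example (illustrative only; where this variable is set is site-specific, e.g. an init script or environment file used to launch the daemons):

    # Kernels before 2.6.36 used the oom_adj scale, where -17 disabled the OOM killer:
    #SLURMSTEPD_OOM_ADJ=-17
    # Kernel 2.6.36 and later use the oom_score_adj scale, where -1000 disables it:
    SLURMSTEPD_OOM_ADJ=-1000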
Danny Auble authored
- Nov 07, 2012
Danny Auble authored
Danny Auble authored
specifying the number of tasks and not the number of nodes.
- Nov 05, 2012
Morris Jette authored
On a job kill request, send SIGCONT and SIGTERM, wait KillWait seconds, and then send SIGKILL. Previously only SIGKILL was sent to the tasks.
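A minimal C sketch of that escalation sequence (a standalone illustration, not the actual slurmd code; the process group and KillWait value are assumed inputs):

    #include <signal.h>
    #include <unistd.h>

    /* Escalate signals to a job's process group: wake any stopped
     * tasks, ask them to exit, then force-kill the survivors once
     * the KillWait grace period expires. */
    static void terminate_tasks(pid_t pgid, unsigned int kill_wait)
    {
        killpg(pgid, SIGCONT);  /* resume stopped tasks so SIGTERM can be handled */
        killpg(pgid, SIGTERM);  /* request a clean exit */
        sleep(kill_wait);       /* KillWait seconds from slurm.conf */
        killpg(pgid, SIGKILL);  /* force-kill anything still running */
    }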
- Nov 02, 2012
Morris Jette authored
Morris Jette authored
- Oct 25, 2012
Morris Jette authored
Incorrect error codes were returned in some cases, especially if the slurmdbd was down.
Morris Jette authored
- Oct 24, 2012
Morris Jette authored
Previously, for Linux systems, all information was placed on a single line.
- Oct 23, 2012
Danny Auble authored
interface instead of through the normal method.
- Oct 22, 2012
Danny Auble authored
- Oct 18, 2012
Danny Auble authored
for passthrough gets removed on a dynamic system.
Danny Auble authored
in error for passthrough.
Danny Auble authored
Danny Auble authored
user's pending allocation was started with srun, and then for some reason the slurmctld was brought down, and while it was down the srun was removed.
Danny Auble authored
This is needed for when a free request is added to a block while jobs are still finishing up, so that we don't start new jobs on the block, since they would fail at start.
- Oct 17, 2012
Morris Jette authored
Previously the node count would change from c-node count to midplane count (but still be interpreted as a c-node count).
- Oct 16, 2012
Danny Auble authored
- Oct 02, 2012
Morris Jette authored
See bugzilla bug 132. When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD (2.5.0-pre3).

Conditions:
* SelectType=select/cons_res
* SelectTypeParameters=CR_Core_Memory
* Using threads, e.g. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=400"

Description:
In the cons_res plugin, _verify_node_state() in job_test.c checks if a node has sufficient memory for a job. However, the per-CPU memory limits appear to be scaled by the number of threads. This new value may exceed the available memory on the node. And, once a node is overcommitted on memory, future memory checks in _verify_node_state() will always succeed.

Scenario to reproduce:
With the example node linux0, we run a single-core job with 250MB/core:

    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit: ((real - alloc) >= job mem), i.e. ((400 - 0) >= 250), and the job starts. Then the memory requirement is doubled:

    "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for job X"
    "slurmd: scaling CPU count by factor of 2"

This job should not have started. While the first job is still running, we submit a second, identical job:

    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit: ((400 - 500) >= 250). The unsigned int wraps, the test passes, and the job starts. This second job also should not have started.
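The wraparound is easy to demonstrate in isolation. A minimal standalone C sketch of the failing test (illustrative variable names and values, not the actual cons_res code):

    #include <stdio.h>

    int main(void)
    {
        unsigned int real_mem  = 400; /* node's RealMemory (MB) */
        unsigned int alloc_mem = 500; /* allocated after CPU-count scaling */
        unsigned int job_mem   = 250; /* new job's request (MB) */

        /* (real - alloc) underflows to a huge unsigned value,
         * so the fit test wrongly passes. */
        if (real_mem - alloc_mem >= job_mem)
            printf("bug: job admitted, wrapped diff = %u\n",
                   real_mem - alloc_mem);

        /* Reordering the comparison avoids the underflow. */
        if (alloc_mem + job_mem <= real_mem)
            printf("job fits\n");
        else
            printf("job rejected\n");
        return 0;
    }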
Morris Jette authored
- Sep 27, 2012
Danny Auble authored
purged from the system if its front-end node goes down.
Danny Auble authored
database, and the job is running on a small block, make sure we free up the correct node count.
Bill Brophy authored
- Sep 24, 2012
Morris Jette authored
This addresses bug 130.
- Sep 21, 2012
Danny Auble authored
with a job running or trying to run on it.
- Sep 20, 2012
Danny Auble authored
are planning on using the block. Previously it would fail those jobs erroneously.
- Sep 19, 2012
Danny Auble authored
- Sep 18, 2012
Morris Jette authored
Danny Auble authored
- Sep 17, 2012
Danny Auble authored
Danny Auble authored
or previous piecemeal method.
- Sep 15, 2012
Danny Auble authored
Adapted from a patch from Stephen Trofinoff <trofinoff@cscs.ch>