Commits · e65638f0f67fbb2a990545f485a9626a7a744d81 · tud-zih-energy / Slurm

Oct 03, 2012
- Add Slurm User Group Meeting 2012 info to publication web page · e65638f0
  Morris Jette authored 12 years ago
  
- Change comment for better clarity · fee9ae4a
  Morris Jette authored 12 years ago
  
Oct 02, 2012

- Correct --mem-per-cpu logic for multiple threads per core · 6a103f2e
  Morris Jette authored 12 years ago

See bugzilla bug 132

When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be
overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD
(2.5.0-pre3).

Conditions:
-----------
* SelectType=select/cons_res
* SelectTypeParameters=CR_Core_Memory
* Using threads
  - Ex. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=400"

Description:
------------
In the cons_res plugin, _verify_node_state() in job_test.c checks whether a node
has sufficient memory for a job. However, the per-CPU memory limit appears to be
scaled by the number of threads after that check, and the scaled value may exceed
the available memory on the node. Once a node is overcommitted on memory, the
unsigned subtraction (real - alloc) wraps around, so future memory checks in
_verify_node_state() will always succeed.

Scenario to reproduce:
----------------------
With the example node linux0, we run a single-core job with 250MB/core
    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit: ((real - alloc) >= job mem)
    ((400 - 0) >= 250) and the job starts

Then, the memory requirement is doubled:
    "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for job X"
    "slurmd: scaling CPU count by factor of 2"

This job should not have started.

While the first job is still running, we submit a second, identical job
    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit:
    ((400 - 500) >= 250), the unsigned int wraps, the test passes, and the job starts

This second job also should not have started.
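
The arithmetic in the scenario above can be reproduced with a small C sketch. This is not the code from job_test.c: the function names (`scaled_job_mem`, `fits_wrapping`, `fits_guarded`) and the use of `uint32_t` are assumptions made for illustration, chosen only to mirror the `((real - alloc) >= job mem)` test and the "unsigned int wraps" behavior described in the report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: memory charged to the node once the CPU count
 * has been scaled by the number of threads per core. */
uint32_t scaled_job_mem(uint32_t mem_per_cpu, uint32_t cpus,
                        uint32_t threads_per_core)
{
    return mem_per_cpu * cpus * threads_per_core;
}

/* Mirrors the check quoted in the report: (real - alloc) >= job mem.
 * The subtraction is unsigned, so it wraps to a huge value when
 * alloc_mem is greater than real_mem. */
bool fits_wrapping(uint32_t real_mem, uint32_t alloc_mem, uint32_t job_mem)
{
    return (uint32_t)(real_mem - alloc_mem) >= job_mem;
}

/* One possible guard (an assumption, not the actual fix): reject the
 * node outright if it is already overallocated, then compare. */
bool fits_guarded(uint32_t real_mem, uint32_t alloc_mem, uint32_t job_mem)
{
    if (alloc_mem > real_mem)
        return false;
    return (real_mem - alloc_mem) >= job_mem;
}
```

With the numbers from the report, `fits_wrapping(400, 500, 250)` is true because 400 - 500 wraps around, while `fits_guarded(400, 500, 250)` is false. The guard alone would not have stopped the first job; for that, the check would need to use the scaled requirement, e.g. `fits_guarded(400, 0, scaled_job_mem(250, 1, 2))`, which is false because 500 exceeds 400.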


- Modify strigger so that a filter option of "--user=0" is supported · 7166976e
  Morris Jette authored 12 years ago
- one more fix · fb0269f3
  Danny Auble authored 12 years ago
- BGQ - make regression tests work correctly on real systems · 4585b4c0
  Danny Auble authored 12 years ago

Oct 01, 2012
- BGQ - Make it so bluegene test only runs on an emulated system. · 283c860b
  Danny Auble authored 12 years ago
  
Sep 28, 2012
- BGQ - Fixes to tests on a real BGQ system · 4488ae30
  Don Lipari authored 12 years ago
  
Sep 27, 2012
- BGQ - Logic added to make sure a job has finished on a block before it is purged from the system if its front-end node goes down. · 0badb119
  Danny Auble authored 12 years ago
- remove extra magic clear · dd3704ed
  Danny Auble authored 12 years ago
  
- BGQ - If a job goes away while still trying to free it up in the database, and the job is running on a small block, make sure we free up the correct node count. · 064ee393
  Danny Auble authored 12 years ago
- update various documentation · 7d2aa36e
  Danny Auble authored 12 years ago
  
- Fix for srun --test-only to work correctly with timelimits · 36e819e5
  Bill Brophy authored 12 years ago
  
Sep 25, 2012
- Fixed typo (backgroud -> background) · 527d7eb9
  Danny Auble authored 12 years ago
  
Sep 24, 2012
- Execute slurm_spank_job_epilog when there is no system Epilog configured. · c57ab123
  Morris Jette authored 12 years ago
  
  This addresses bug 130
Sep 21, 2012
- BGQ - Fix issue when a cnode goes into an error (not SoftwareError) state with a job running or trying to run on it. · 4b1aed73
  Danny Auble authored 12 years ago
Sep 20, 2012
- BGQ - Fix for the case where a large block goes into error and the next highest priority jobs are planning on using the block. Previously it would fail those jobs erroneously. · 7e53c48f
  Danny Auble authored 12 years ago
Sep 19, 2012
- BGQ - minor fix to make build work in emulated mode. · c9f14f80
  Danny Auble authored 12 years ago
  
Sep 18, 2012
- Updates for v2.4.3 tag · 1410b960
  Morris Jette authored 12 years ago
  
- Cosmetic changes, no changes in logic · 79ddd5dc
  Morris Jette authored 12 years ago
  
- fix for default partition having a '*' after it · fb9bd16f
  Danny Auble authored 12 years ago
  
- up memory constraint (I think something happened with glibc or the kernel) · a772fc04
  Danny Auble authored 12 years ago
  
  no big deal, but now we get just a bit more memory allocated (4152 over instead of the 4100 we were previously looking for)
- Added all available limits to the output of sacctmgr list qos · 2c500639
  Danny Auble authored 12 years ago
  
- Correct path in Cray web page · e6238a4f
  Morris Jette authored 12 years ago
  
- Minor updates to SLURM/Cray guide, mostly typos · 773b7121
  Morris Jette authored 12 years ago
  
Sep 17, 2012
- Fix sacct to work with QOS' that have previously been deleted. · 48bf06d8
  Danny Auble authored 12 years ago
  
- Remove unneeded variables. · 6ddd55bf
  Danny Auble authored 12 years ago
  
- DBD - if list is filling up remove step_complete messages as well, since they are a no-op when they get to the DBD and we would favor having job completion messages instead of step completion messages if possible. · a2fc224d
  Danny Auble authored 12 years ago
- CRAY - Update documentation to describe installation from rpm instead of the previous piecemeal method. · f7321e1a
  Danny Auble authored 12 years ago
Sep 15, 2012
- CRAY - Fix for sacct -N option to work correctly · a6ffef22
  Danny Auble authored 12 years ago
  
  Adapted from a patch from Stephen Trofinoff <trofinoff@cscs.ch>
- simplify duplicate code · 85eb6e69
  Danny Auble authored 12 years ago
- Correct cluster_dims function used · 7d0113d7
  Danny Auble authored 12 years ago
  
- CRAY: add the option --enable-salloc-background to the slurm.spec file · ea32460f
  Danny Auble authored 12 years ago
  
Sep 14, 2012
- Modify cray module file with correct logic to load perl directory · 43c2e41a
  Morris Jette authored 12 years ago
  
Sep 13, 2012
- put gethostname_short in file since it isn't exported in libslurm · 25ae1898
  Danny Auble authored 12 years ago
  
- Minor cosmetic changes to last patch, enforcement of ConstrainSwapSpace=no · 6bdd773e
  Morris Jette authored 12 years ago
- task/cgroup/memory - ensure that ConstrainSwapSpace=no is correctly handled · 5baf98ba
  Matthieu Hautreux authored 12 years ago
  
  Only set memory.memsw.limit_in_bytes to the computed amount of allowed RAM+Swap when ConstrainSwapSpace=yes in cgroup.conf. When ConstrainSwapSpace=yes and ConstrainRAMSpace=no, automatically set AllowedRAMSpace to 100% in order to compute the memsw.limit based on the allocated memory plus the allowed swap percent. Then use that limit for both mem and mem+sw limits.
- backport cray 2.5 docs to 2.4 · 2a2ce582
  Danny Auble authored 12 years ago
  
- include file from last checkin · 51c4f7c8
  Danny Auble authored 12 years ago
  
- CRAY - add module file that is based on configure to the slurm rpm · 9f48fb3f
  Danny Auble authored 12 years ago
  
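
The limit selection described in the task/cgroup/memory commit (5baf98ba) in the Sep 13 list can be sketched roughly as follows. This is an illustrative reading of the commit message, not the plugin's code: `struct cg_limits`, `compute_limits`, and the integer percent parameters are hypothetical, and the sketch assumes that AllowedRAMSpace and AllowedSwapSpace are both percentages of the job's allocated memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: which cgroup limits to write, and their values. */
struct cg_limits {
    bool     set_mem;      /* write memory.limit_in_bytes?       */
    uint64_t mem_bytes;
    bool     set_memsw;    /* write memory.memsw.limit_in_bytes? */
    uint64_t memsw_bytes;
};

/* Sketch of the behavior described in the commit message.
 * ram_pct and swap_pct stand in for AllowedRAMSpace and AllowedSwapSpace,
 * assumed here to be percentages of the allocated memory. */
struct cg_limits compute_limits(uint64_t alloc_bytes,
                                bool constrain_ram, bool constrain_swap,
                                uint32_t ram_pct, uint32_t swap_pct)
{
    struct cg_limits lim = { false, 0, false, 0 };

    if (constrain_swap && !constrain_ram) {
        /* RAM is unconstrained: treat AllowedRAMSpace as 100% and use the
         * resulting RAM+swap limit for both mem and mem+sw. */
        uint64_t memsw = alloc_bytes * (100u + swap_pct) / 100u;
        lim.set_mem = true;
        lim.mem_bytes = memsw;
        lim.set_memsw = true;
        lim.memsw_bytes = memsw;
        return lim;
    }

    if (constrain_ram) {
        lim.set_mem = true;
        lim.mem_bytes = alloc_bytes * ram_pct / 100u;
    }

    /* Only touch the mem+sw limit when ConstrainSwapSpace=yes. */
    if (constrain_swap) {
        lim.set_memsw = true;
        lim.memsw_bytes = alloc_bytes * (ram_pct + swap_pct) / 100u;
    }
    return lim;
}
```

For example, under these assumptions `compute_limits(1000, false, true, 50, 10)` sets both limits to 1100, while `compute_limits(1000, true, false, 50, 10)` sets only the RAM limit (500) and leaves the mem+sw limit untouched.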