Commits · 3d9033f39070585549cd1efeefd4c87a36fc10c4 · tud-zih-energy / Slurm

Oct 20, 2011
- Re-calculate rather than preserve a job's new prio if original prio <=1 · 3d9033f3
  Morris Jette authored 13 years ago
  
  3d9033f3
- BLUEGENE - Fixed issues with running on a sub-midplane system. · 6c335d6f
  Danny Auble authored 13 years ago
  
  6c335d6f
Oct 19, 2011
- Disable some SelectTypeParameters for select/linear that aren't compatible. · 78caf6c3
  Danny Auble authored 13 years ago
  
  78caf6c3
Oct 18, 2011
- Disable some SelectTypeParameters for select/linear · 7596ac02
  Morris Jette authored 13 years ago
  
  7596ac02
- Removed unused label · cb108d0f
  Morris Jette authored 13 years ago
  
  cb108d0f
- task/cgroup: move slurm_cgroup_conf definition back to task_cgroup.c · 23692e6f
  Matthieu Hautreux authored 13 years ago
  
  23692e6f
Oct 13, 2011

task/cgroup: correct a regression in cpuset management · 70c26991

The addition of the default slurm cg with the cpuset subsystem was
incomplete, which prevented it from working. The contents
of cpuset.cpus and cpuset.mems were not replicated from the parent,
resulting in "No space left on device" errors when trying to add
tasks to the step cg.
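
As a rough illustration of the replication step (written in Python rather than Slurm's C, and not the actual patch), the sketch below copies cpuset.cpus and cpuset.mems from a parent directory into a child whose values are still empty. The function names and the temporary directory standing in for a cgroup hierarchy are assumptions made for the example.

```python
import os
import tempfile

CPUSET_FILES = ("cpuset.cpus", "cpuset.mems")


def inherit_cpuset(parent_dir, child_dir):
    """Copy the parent's cpuset values into the child when the child's are empty."""
    for name in CPUSET_FILES:
        with open(os.path.join(parent_dir, name)) as src:
            parent_value = src.read().strip()
        child_path = os.path.join(child_dir, name)
        child_value = ""
        if os.path.exists(child_path):
            with open(child_path) as cur:
                child_value = cur.read().strip()
        # An empty cpuset cannot accept tasks, so fill it from the parent.
        if not child_value:
            with open(child_path, "w") as dst:
                dst.write(parent_value + "\n")


def demo():
    """Simulate a parent cg and an empty child cg in a temporary directory."""
    with tempfile.TemporaryDirectory() as parent:
        child = os.path.join(parent, "slurm")
        os.mkdir(child)
        for name, value in zip(CPUSET_FILES, ("0-7", "0-1")):
            with open(os.path.join(parent, name), "w") as f:
                f.write(value + "\n")
        inherit_cpuset(parent, child)
        result = {}
        for name in CPUSET_FILES:
            with open(os.path.join(child, name)) as f:
                result[name] = f.read().strip()
        return result
```

Calling demo() returns the child's values after replication, which should match the parent's.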

70c26991

cgroup: ensure that plugins' cg subsystems use a default slurm root cg · 5df2ad71

Matthieu Hautreux authored 13 years ago

In order to distinguish between slurm related cg and system related cg,
ensure that all slurm related cgroup directories are created under a
single directory. This directory is slurm or slurm_nodename in case of
multiple-slurmd usage.
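
The naming rule can be sketched as follows (Python, illustrative only). The mount-point/subsystem/root layout and the function name are assumptions; the commit message only states the directory name itself.

```python
import posixpath


def slurm_root_cgroup(mount_point, subsystem, node_name=None):
    """Return the directory under which all slurm-related cgroups are created.

    node_name is only given in the multiple-slurmd case (assumed calling
    convention for this sketch).
    """
    root = "slurm" if node_name is None else "slurm_" + node_name
    return posixpath.join(mount_point, subsystem, root)
```

For example, slurm_root_cgroup("/cgroup", "cpuset") gives a single slurm directory, while passing node_name="n1" gives slurm_n1.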

5df2ad71

Oct 12, 2011

task/cgroup: Expand debug message during memcg creation · abfdfcbe

Mark A. Grondona authored 13 years ago

Add the amount of memory allocated by slurm to the job or step
to the debug message in memcg_initialize(). Also, change the
message from debug to info, so that a user can see the information
by using --slurmd-debug=1.

abfdfcbe

task/cgroup: Add debug message after memory cgroup initialization · 25d51e90

Mark A. Grondona authored 13 years ago

For debugging purposes, add a debug level message with some values
of interest just after task_cgroup_memory has initialized.

25d51e90

cgroups: Add new config parameter MinRAMSpace · 6ce0e77b

Mark A. Grondona authored 13 years ago

Add a new configuration parameter MinRAMSpace which sets a lower bound on
memory.limit_in_bytes and memory.memsw.limit_in_bytes. This is required in
case an administrator or user sets an absurdly low value for memory limit,
potentially causing the slurmstepd to be terminated by the OOM killer.

MinRAMSpace is specified in MB of RAM and defaults to 30 (an arbitrarily
chosen value).
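
A minimal sketch of the lower bound, in Python rather than Slurm's C. The function name is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
MIB = 1024 * 1024  # assumption: "MB" here means 1024 * 1024 bytes


def clamp_memory_limit(requested_bytes, min_ram_space_mb=30):
    """Never let a memory limit fall below MinRAMSpace (30 MB by default)."""
    floor_bytes = min_ram_space_mb * MIB
    return max(requested_bytes, floor_bytes)
```

With the default, a 1 MB request is raised to 30 MB, while a 100 MB request passes through unchanged.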

6ce0e77b

cgroups: Allow percent values in cgroup.conf to be floating point · fa38c431

Mark A. Grondona authored 13 years ago

The use of whole percent values for cgroup.conf parameters such
as AllowedRAMSpace, MaxRAMPercent, AllowedSwapSpace and MaxSwapPercent
may be too coarse grained on systems with large amounts of memory.
(e.g. 1% of 64G is over 650MB).

This patch allows these percentage values to be arbitrary floating
point numbers to allow finer grained tuning of these limits and
parameters.
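
The arithmetic behind the example can be checked with a short sketch (Python, illustrative; the helper name is hypothetical and binary units are assumed):

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def percent_of(total_bytes, percent):
    """Return the given (possibly fractional) percentage of total_bytes, in bytes."""
    return int(total_bytes * (percent / 100.0))


one_percent = percent_of(64 * GIB, 1)         # roughly 655 MiB
quarter_percent = percent_of(64 * GIB, 0.25)  # roughly 163 MiB
```

A whole-percent step on a 64G node is therefore about 655 MiB, whereas a fractional value such as 0.25 allows steps of roughly 163 MiB.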

fa38c431

task/cgroup: Don't create memory cgroups with limit of 0 bytes · e1bb1689

Mark A. Grondona authored 13 years ago

Treat a 0 byte memory limit from SLURM as unlimited and instead use
MaxRAMPercent and MaxSwapPercent as RAM and Swap limits for the job/job
step. This avoids creating a memory cgroup with limit_in_bytes = 0,
which would end up causing the cgroup to OOM before slurmstepd could
even be started.

This also allows systems in which SLURM isn't explicitly allocating
memory to use the task/cgroup plugin with ConstrainRAMSpace=yes.
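
The rule can be sketched as follows (Python, not the plugin's C code). The function name is hypothetical, and capping non-zero requests at the same percentage ceiling is an assumption made so that both cases fit in one helper.

```python
def effective_ram_limit(requested_bytes, total_ram_bytes, max_ram_percent=100.0):
    """Pick the RAM limit to write into the memory cgroup.

    A request of 0 bytes is treated as "unlimited" and replaced by the
    MaxRAMPercent ceiling, so the cgroup is never created with a 0 byte limit.
    """
    ceiling = int(total_ram_bytes * max_ram_percent / 100.0)
    if requested_bytes == 0:
        return ceiling
    return min(requested_bytes, ceiling)
```

The same shape would apply to the swap limit with MaxSwapPercent in place of MaxRAMPercent.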

e1bb1689

task/cgroup: Apply MaxRAMPercent and MaxSwapPercent to memory cgroups · db99233d

Mark A. Grondona authored 13 years ago

Calculate the upper bound RAM in bytes and Swap in bytes that may
be used by any one cgroup and apply this limit in the task/cgroup
code.

db99233d

task/cgroup: Refactor task_cgroup_memory_create · 941262a3

Mark A. Grondona authored 13 years ago

There was some duplicated code in task_cgroup_memory_create. In order
to facilitate extending this code in the future, refactor it into
a common function memcg_initialize().

941262a3

cgroups: Allow cgroup mount point to be configurable · c9ea11b5

Mark A. Grondona authored 13 years ago

cgroups code currently assumes cgroup subsystems will be mounted
under /cgroup, which is not the ideal location for many situations.
Add a new cgroup.conf parameter to redefine the mount point to an
arbitrary location. (for example, some systems may already have
cgroupfs mounted under /dev/cgroup or /sys/fs/cgroup)
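
A small sketch of how such a setting might be read, with /cgroup as the fallback (Python, illustrative). The commit message does not name the new parameter; the key CgroupMountpoint used below is an assumption, as is the simple key=value parsing.

```python
DEFAULT_MOUNT_POINT = "/cgroup"


def cgroup_mount_point(conf_text, key="CgroupMountpoint"):
    """Return the configured mount point, or the /cgroup default if unset.

    The key name is an assumption; it is not given in the commit message.
    """
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip().lower() == key.lower():
            return value.strip()
    return DEFAULT_MOUNT_POINT
```

A configuration containing CgroupMountpoint=/sys/fs/cgroup would then yield /sys/fs/cgroup, and a configuration without the key would fall back to /cgroup.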

c9ea11b5

Oct 11, 2011

proctrack/cgroup: no longer rely on release agent to clean step cg · ef8cc0a7

Matthieu Hautreux authored 13 years ago

With release_agent notified at the step cgroup level, the step cgroup
can be removed while slurmstepd has not yet finished its internal
epilog mechanisms. Inhibiting the release agent at the step level and
ensuring the step cgroup's proper removal helps to guarantee that the
node will only be eligible for job execution once the resources are
completely available (no longer used by the job or the epilogs).

ef8cc0a7

Oct 05, 2011
- removed other unneeded variables. · 4f015589
  Danny Auble authored 13 years ago
  
  4f015589
- BLUEGENE - If removing blocks from system that once existed cleanup of old block happens correctly now. · 51edcafb
  Danny Auble authored 13 years ago
  
  51edcafb
Oct 03, 2011
- BGQ - fix to set up corner correctly for sub block jobs. · b836839f
  Danny Auble authored 13 years ago
  
  b836839f
Sep 30, 2011

Fix bugs in sched/backfill, time limits and QOS · 4df8a986

Morris Jette authored 13 years ago

Fix bugs in sched/backfill with respect to QOS reservation support and job
time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputing Center).

4df8a986

Sep 29, 2011
- BLUEGENE - Fix if running in Static/Overlap mode and full system block is in an error state, won't deny jobs. · 6b7d41b5
  Danny Auble authored 13 years ago
  
  6b7d41b5
- BLUEGENE - Fix minor potential memory leak when setting block error reason. · 7c25f668
  Danny Auble authored 13 years ago
  
  7c25f668
- BLUEGENE - handle reason of blocks in error more correctly between restarts of the slurmctld. · 01d49db4
  Danny Auble authored 13 years ago
  
  01d49db4
- BLUEGENE - Update correctly the state in the reason of a block if an admin sets the state to error. · 3a507bc2
  Danny Auble authored 13 years ago
  
  3a507bc2
Sep 26, 2011

Cosmetic mods for GCC v4.6 · 413b1c2c

Morris Jette authored 13 years ago

Many cosmetic modifications to eliminate warning messages from the GCC version
4.6 compiler, mostly due to unused variables.

413b1c2c

Sep 17, 2011
- BLUEGENE - Fix for if changing the defined blocks in the bluegene.conf and jobs happen to be running on blocks not in the new config. · 50cafcf7
  Danny Auble authored 13 years ago
  
  50cafcf7
Sep 16, 2011

Problem using salloc/mpirun with task affinity socket binding · 98b203d4

Morris Jette authored 13 years ago

salloc/mpirun does not work well with task affinity socket binding. The following example illustrates the problem.

[sulu] (slurm) mnp> salloc -p bones-only -N1-1 -n3 --cpu_bind=socket mpirun cat /proc/self/status | grep Cpus_allowed_list
salloc: Granted job allocation 387
--------------------------------------------------------------------------
An invalid physical processor id was returned ...

The problem is that with mpirun jobs Slurm launches only a single task, regardless of the value of -n. This confuses the socket binding logic in task affinity. The result is that task affinity binds the task to only a single CPU, instead of all the allocated CPUs on the socket. When MPI attempts to bind to any of the other allocated CPUs on the socket, it gets the "invalid physical processor id" error. Note that the problem may occur even if socket binding is not explicitly requested by the user. If task/affinity is configured and the allocated CPUs are a whole number of sockets, Slurm will use "implicit auto binding" to sockets, triggering the problem.
Patch from Martin Perry (Bull).

98b203d4

Sep 12, 2011
- BLUEGENE - fix to handle HTC mode when changing the size of a job. · 0e9a2d06
  Danny Auble authored 13 years ago
  
  0e9a2d06
- BGP - If a block is defined in the bluegene.conf file to be a HTC small type set it that way right off the start. · cce97899
  Danny Auble authored 13 years ago
  
  cce97899
- BLUEGENE - fix issue with BGL/P systems that don't have multi-dimensional conn_types. · 073e71b1
  Danny Auble authored 13 years ago
  
  073e71b1
- BLUEGENE - fixed typos that created bugs. · 04e0d256
  Danny Auble authored 13 years ago
  
  04e0d256
Sep 10, 2011
- BLUEGENE - better debug · 8da93435
  Danny Auble authored 13 years ago
  
  8da93435
- BLUEGENE - Make it possible for an admin to define multiple dimension conn_types in a block definition. · e8bdc8fd
  Danny Auble authored 13 years ago
  
  e8bdc8fd
- BLUEGENE - Fix deadlock issue if toggling between Dynamic and Static block allocation with jobs running on blocks that don't exist in the static setup. · d45af3e1
  Danny Auble authored 13 years ago
  
  d45af3e1
Sep 09, 2011

Improve performance of preemption logic · b5a8a742

Morris Jette authored 13 years ago

This modification improves the performance of SLURM's preemption logic
by reducing the execution time of the scheduling logic and doing a better
job of minimizing the number of jobs preempted to initiate a new job.
Based largely upon work by Phil Eckert, LLNL.

b5a8a742

Sep 08, 2011
- BLUEGENE - Fix for creating full system static block on a BGQ system. · 8793ca21
  Danny Auble authored 13 years ago
  
  8793ca21
Sep 06, 2011
- Cosmetic fix for printing out debug info in the priority plugin. · 7423c5d4
  Danny Auble authored 13 years ago
  
  7423c5d4
- BLUEGENE - added FIXME comment for future reference to the deny_wrap code. · ac465b76
  Danny Auble authored 13 years ago
  
  ac465b76
- BLUEGENE - slight change in documentation of new deny_wrap to ba_geo_test_all() · 35ef6e55
  Danny Auble authored 13 years ago
  
  35ef6e55