- Oct 20, 2011
-
-
Morris Jette authored
-
Danny Auble authored
-
- Oct 19, 2011
-
-
Danny Auble authored
-
- Oct 18, 2011
-
-
Morris Jette authored
-
Morris Jette authored
-
Matthieu Hautreux authored
-
- Oct 13, 2011
-
-
Matthieu Hautreux authored
The addition of the default slurm cg with the cpuset subsystem was incomplete, which prevented it from working. The contents of cpuset.cpus and cpuset.mems were not replicated from the parent, resulting in "No space left on device" errors when trying to add tasks to the step cg.
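A minimal sketch of the replication step described above, not the actual plugin code (which is C). With the cgroup v1 cpuset controller, a newly created child cgroup starts with empty cpuset.cpus and cpuset.mems, and attaching a task to it fails with ENOSPC until both are set. The function name and the directory arguments are hypothetical.

```python
import os


def replicate_cpuset_from_parent(parent_dir: str, child_dir: str) -> None:
    """Copy cpuset.cpus and cpuset.mems from a parent cgroup to its child.

    Hypothetical helper: parent_dir and child_dir are assumed to be
    directories inside a mounted cpuset hierarchy.
    """
    for name in ("cpuset.cpus", "cpuset.mems"):
        # Read the parent's current value, e.g. "0-7" or "0-1".
        with open(os.path.join(parent_dir, name)) as src:
            value = src.read().strip()
        # Write the same value into the child so tasks can be attached.
        with open(os.path.join(child_dir, name), "w") as dst:
            dst.write(value)
```

The copy has to happen after the child directory is created and before the first task is added to it.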
-
Matthieu Hautreux authored
In order to distinguish between slurm-related cg and system-related cg, ensure that all slurm-related cgroup directories are created under a single directory. This directory is named slurm, or slurm_nodename when multiple slurmd daemons are in use.
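The naming rule above can be sketched as follows. This is only an illustration: the function name is hypothetical, and passing None to mean "single slurmd" is an assumption, not the actual implementation.

```python
from typing import Optional


def slurm_cgroup_dirname(nodename: Optional[str] = None) -> str:
    """Return the top-level directory name for slurm-related cgroups.

    Hypothetical helper: nodename is given only in the multiple-slurmd case.
    """
    return f"slurm_{nodename}" if nodename else "slurm"
```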
-
- Oct 12, 2011
-
-
Mark A. Grondona authored
Add the amount of memory allocated by slurm to the job or step to the debug message in memcg_initialize(). Also, change the message from debug to info, so that a user can see the information by using --slurmd-debug=1.
-
Mark A. Grondona authored
For debugging purposes, add a debug level message with some values of interest just after task_cgroup_memory has initialized.
-
Mark A. Grondona authored
Add a new configuration parameter MinRAMSpace, which sets a lower bound on memory.limit_in_bytes and memory.memsw.limit_in_bytes. This is required in case an administrator or user sets an absurdly low value for the memory limit, potentially causing slurmstepd to be terminated by the OOM killer. MinRAMSpace is set in MB of RAM and is 30 by default (an arbitrarily chosen value).
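A rough sketch of the lower bound, not the plugin's actual code. The function name is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
MIB = 1024 * 1024  # assumption: MinRAMSpace megabytes are counted as 2**20 bytes


def apply_min_ram_space(limit_bytes: int, min_ram_space_mb: int = 30) -> int:
    """Raise a memory limit to the MinRAMSpace floor if it falls below it."""
    return max(limit_bytes, min_ram_space_mb * MIB)
```

With the default of 30, a requested limit of 1 MB would be raised to 30 MB, while a limit of 512 MB would pass through unchanged.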
-
Mark A. Grondona authored
The use of whole percent values for cgroup.conf parameters such as AllowedRAMSpace, MaxRAMPercent, AllowedSwapSpace and MaxSwapPercent may be too coarse grained on systems with large amounts of memory. (e.g. 1% of 64G is over 650MB). This patch allows these percentage values to be arbitrary floating point numbers to allow finer grained tuning of these limits and parameters.
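To illustrate the granularity argument, the sketch below converts a percentage of total memory into bytes. The helper name is hypothetical and the rounding (truncation to whole bytes) is an assumption.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def percent_of(total_bytes: int, percent: float) -> int:
    """Return percent (may be fractional) of total_bytes, truncated to bytes."""
    return int(total_bytes * percent / 100.0)


one_percent = percent_of(64 * GIB, 1.0)    # about 655 MiB: the smallest whole-percent step
tenth_percent = percent_of(64 * GIB, 0.1)  # about 65.5 MiB with a fractional value
```

On a 64 GiB node, whole percent values can only move a limit in steps of about 655 MiB. A value such as 0.1 narrows that step to about 65.5 MiB.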
-
Mark A. Grondona authored
Treat a 0 byte memory limit from SLURM as unlimited and instead use MaxRAMPercent and MaxSwapPercent as RAM and Swap limits for the job/job step. This avoids creating a memory cgroup with limit_in_bytes = 0, which would end up causing the cgroup to OOM before slurmstepd could even be started. This also allows systems in which SLURM isn't explicitly allocating memory to use the task/cgroup plugin with ConstrainRAMSpace=yes.
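A minimal sketch of the zero-limit handling, not the actual task/cgroup code. The function name and signature are hypothetical. Capping nonzero requests at the same MaxRAMPercent bound is an assumption that follows the upper-bound commit elsewhere in this log.

```python
GIB = 1024 ** 3


def effective_ram_limit(requested_bytes: int, total_ram_bytes: int,
                        max_ram_percent: float) -> int:
    """Pick the RAM limit to write to memory.limit_in_bytes.

    A request of 0 means "no explicit allocation" and falls back to the
    MaxRAMPercent bound instead of producing a 0 byte cgroup limit.
    """
    max_bytes = int(total_ram_bytes * max_ram_percent / 100.0)
    if requested_bytes == 0:
        return max_bytes
    # Assumption: nonzero requests are also capped at the upper bound.
    return min(requested_bytes, max_bytes)
```

The same pattern would apply to the swap limit, with MaxSwapPercent in place of MaxRAMPercent.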
-
Mark A. Grondona authored
Calculate the upper bound RAM in bytes and Swap in bytes that may be used by any one cgroup and apply this limit in the task/cgroup code.
-
Mark A. Grondona authored
There was some duplicated code in task_cgroup_memory_create. In order to facilitate extending this code in the future, refactor it into a common function memcg_initialize().
-
Mark A. Grondona authored
The cgroups code currently assumes cgroup subsystems will be mounted under /cgroup, which is not the ideal location in many situations. Add a new cgroup.conf parameter to redefine the mount point to an arbitrary location (for example, some systems may already have cgroupfs mounted under /dev/cgroup or /sys/fs/cgroup).
-
- Oct 11, 2011
-
-
Matthieu Hautreux authored
With release_agent notified at the step cgroup level, the step cgroup can be removed while slurmstepd has not yet finished its internal epilog mechanisms. Inhibiting the release agent at the step level and ensuring the step cgroup's proper removal helps to guarantee that the node will only be eligible for job execution once the resources are completely available (no longer used by the job or the epilogs).
-
- Oct 05, 2011
-
-
Danny Auble authored
-
Danny Auble authored
block happens correctly now.
-
- Oct 03, 2011
-
-
Danny Auble authored
-
- Sep 30, 2011
-
-
Morris Jette authored
Fix bugs in sched/backfill with respect to QOS reservation support and job time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputing Center).
-
- Sep 29, 2011
-
-
Danny Auble authored
is in an error state, won't deny jobs.
-
Danny Auble authored
-
Danny Auble authored
restarts of the slurmctld.
-
Danny Auble authored
admin sets the state to error.
-
- Sep 26, 2011
-
-
Morris Jette authored
Many cosmetic modifications to eliminate warning messages from the GCC version 4.6 compiler, mostly due to unused variables.
-
- Sep 17, 2011
-
-
Danny Auble authored
jobs happen to be running on blocks not in the new config.
-
- Sep 16, 2011
-
-
Morris Jette authored
salloc/mpirun does not play well together with task affinity socket binding. The following example illustrates the problem.

[sulu] (slurm) mnp> salloc -p bones-only -N1-1 -n3 --cpu_bind=socket mpirun cat /proc/self/status | grep Cpus_allowed_list
salloc: Granted job allocation 387
--------------------------------------------------------------------------
An invalid physical processor id was returned ...

The problem is that with mpirun jobs Slurm launches only a single task, regardless of the value of -n. This confuses the socket binding logic in task affinity. The result is that task affinity binds the task to only a single cpu, instead of all the allocated cpus on the socket. When mpi attempts to bind to any of the other allocated cpus on the socket, it gets the "invalid physical processor id" error.

Note that the problem may occur even if socket binding is not explicitly requested by the user. If task/affinity is configured and the allocated CPUs are a whole number of sockets, Slurm will use "implicit auto binding" to sockets, triggering the problem.

Patch from Martin Perry (Bull).
-
- Sep 12, 2011
-
-
Danny Auble authored
-
Danny Auble authored
type set it that way right off the start.
-
Danny Auble authored
conn_types.
-
Danny Auble authored
-
- Sep 10, 2011
-
-
Danny Auble authored
-
Danny Auble authored
conn_types in a block definition.
-
Danny Auble authored
allocation with jobs running on blocks that don't exist in the static setup.
-
- Sep 09, 2011
-
-
Morris Jette authored
This modification improves the performance of SLURM's preemption logic by reducing the execution time of the scheduling logic and doing a better job of minimizing the number of jobs preempted to initiate a new job. Based largely upon work by Phil Eckert, LLNL.
-
- Sep 08, 2011
-
-
Danny Auble authored
-
- Sep 06, 2011
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
ba_geo_test_all()
-