- Nov 28, 2012
Danny Auble authored
you query against that with -N and -E you will get all jobs during that time instead of only the jobs that ran on the nodes given with -N.
Signed-off-by: Danny Auble <da@schedmd.com>
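For example, a time-windowed query such as the following (the node name and dates are illustrative) should now return only the jobs that ran on the given node in that window:

    sacct -N bgq0001 -S 2012-11-01 -E 2012-11-28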
- Nov 27, 2012
Danny Auble authored
Danny Auble authored
was already in error and isn't deallocating, and the underlying hardware goes bad, one could get overlapping blocks in error, making the code assert when a new job request comes in.
Danny Auble authored
overcommit.
- Nov 20, 2012
Danny Auble authored
slurmctld restart.
Morris Jette authored
- Nov 19, 2012
Danny Auble authored
allocation.
Morris Jette authored
NOTE: If you were setting the environment variable SLURMSTEPD_OOM_ADJ=-17, it should now be set to -1000 for Linux kernel 2.6.36 or later.
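For example (illustrative only; where this variable is set is site-specific, e.g. an init script or environment file used to launch the daemons):

    # Kernels before 2.6.36 used the oom_adj scale, where -17 disabled the OOM killer:
    #SLURMSTEPD_OOM_ADJ=-17
    # Kernel 2.6.36 and later use the oom_score_adj scale, where -1000 disables it:
    SLURMSTEPD_OOM_ADJ=-1000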
Danny Auble authored
- Nov 07, 2012
Danny Auble authored
Danny Auble authored
specifying the number of tasks and not the number of nodes.
- Nov 05, 2012
Morris Jette authored
On a job kill request, send SIGCONT and SIGTERM, wait KillWait seconds, and then send SIGKILL. Previously only SIGKILL was sent to the tasks.
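A minimal C sketch of that escalation sequence (a standalone illustration, not the actual slurmd code; the process group and KillWait value are assumed inputs):

    #include <signal.h>
    #include <unistd.h>

    /* Escalate signals to a job's process group: wake any stopped
     * tasks, ask them to exit, then force-kill the survivors once
     * the KillWait grace period expires. */
    static void terminate_tasks(pid_t pgid, unsigned int kill_wait)
    {
        killpg(pgid, SIGCONT);  /* resume stopped tasks so SIGTERM can be handled */
        killpg(pgid, SIGTERM);  /* request a clean exit */
        sleep(kill_wait);       /* KillWait seconds from slurm.conf */
        killpg(pgid, SIGKILL);  /* force-kill anything still running */
    }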
- Nov 02, 2012
Morris Jette authored
Morris Jette authored
- Oct 25, 2012
Morris Jette authored
Incorrect error codes were returned in some cases, especially if the slurmdbd was down.
Morris Jette authored
- Oct 24, 2012
Morris Jette authored
Previously, for Linux systems, all information was placed on a single line.
- Oct 23, 2012
Danny Auble authored
interface instead of through the normal method.
- Oct 22, 2012
Danny Auble authored
- Oct 18, 2012
Danny Auble authored
for passthrough gets removed on a dynamic system.
Danny Auble authored
in error for passthrough.
Danny Auble authored
Danny Auble authored
user's pending allocation was started with srun, and then for some reason the slurmctld was brought down, and while it was down the srun was removed.
Danny Auble authored
This is needed for when a free request is added to a block while jobs are still finishing up, so that we don't start new jobs on the block, since they would fail at start.
- Oct 17, 2012
Morris Jette authored
Previously the node count would change from c-node count to midplane count (but still be interpreted as a c-node count).
- Oct 16, 2012
Danny Auble authored
- Oct 02, 2012
Morris Jette authored
See bugzilla bug 132. When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD (2.5.0-pre3).

Conditions:
* SelectType=select/cons_res
* SelectTypeParameters=CR_Core_Memory
* Using threads, e.g. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=400"

Description:
In the cons_res plugin, _verify_node_state() in job_test.c checks if a node has sufficient memory for a job. However, the per-CPU memory limits appear to be scaled by the number of threads. This new value may exceed the available memory on the node. And, once a node is overcommitted on memory, future memory checks in _verify_node_state() will always succeed.

Scenario to reproduce:
With the example node linux0, we run a single-core job with 250MB/core:

    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit: ((real - alloc) >= job mem), i.e. ((400 - 0) >= 250), and the job starts. Then the memory requirement is doubled:

    "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for job X"
    "slurmd: scaling CPU count by factor of 2"

This job should not have started. While the first job is still running, we submit a second, identical job:

    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit: ((400 - 500) >= 250). The unsigned int wraps, the test passes, and the job starts. This second job also should not have started.
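The wraparound is easy to demonstrate in isolation. A minimal standalone C sketch of the failing test (illustrative variable names and values, not the actual cons_res code):

    #include <stdio.h>

    int main(void)
    {
        unsigned int real_mem  = 400; /* node's RealMemory (MB) */
        unsigned int alloc_mem = 500; /* allocated after CPU-count scaling */
        unsigned int job_mem   = 250; /* new job's request (MB) */

        /* (real - alloc) underflows to a huge unsigned value,
         * so the fit test wrongly passes. */
        if (real_mem - alloc_mem >= job_mem)
            printf("bug: job admitted, wrapped diff = %u\n",
                   real_mem - alloc_mem);

        /* Reordering the comparison avoids the underflow. */
        if (alloc_mem + job_mem <= real_mem)
            printf("job fits\n");
        else
            printf("job rejected\n");
        return 0;
    }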
Morris Jette authored
- Sep 27, 2012
Danny Auble authored
purged from the system if its front-end node goes down.
Danny Auble authored
database, and the job is running on a small block, make sure we free up the correct node count.
Bill Brophy authored
- Sep 24, 2012
Morris Jette authored
This addresses bug 130.
- Sep 21, 2012
Danny Auble authored
with a job running or trying to run on it.
- Sep 20, 2012
Danny Auble authored
are planning on using the block. Previously it would fail those jobs erroneously.
- Sep 19, 2012
Danny Auble authored
- Sep 18, 2012
Morris Jette authored
Danny Auble authored
- Sep 17, 2012
Danny Auble authored
Danny Auble authored
or previous piecemeal method.
- Sep 15, 2012
Danny Auble authored
Adapted from a patch from Stephen Trofinoff <trofinoff@cscs.ch>