- Jun 05, 2013
-
-
Danny Auble authored
-
Janne Blomqvist authored
Andy Wettstein (University of Chicago) reported privately to me that slurmctld 2.5.4 crashed with a division-by-zero error after he enabled the priority/multifactor2 plugin. I was able to reproduce the crash by creating an account hierarchy where all the accounts and users had zero shares. See bug 315.
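The failure mode described above can be illustrated with a minimal sketch. This is not Slurm's code (Slurm is written in C, and the plugin's actual arithmetic is not shown in this log). The helper `normalized_shares` is hypothetical and assumes that a share is normalized by dividing it by the sum of its siblings' shares, which is exactly the kind of expression that divides by zero when every sibling has zero shares.

```python
def normalized_shares(sibling_shares):
    """Return each sibling's fraction of the total shares.

    Hypothetical helper, not a Slurm API. It assumes normalization
    divides each share by the sibling total. When every sibling has
    zero shares the total is zero, so 0.0 is returned for each entry
    instead of performing the division.
    """
    total = sum(sibling_shares)
    if total == 0:
        # Guard: an all-zero hierarchy would otherwise divide by zero.
        return [0.0 for _ in sibling_shares]
    return [share / total for share in sibling_shares]


# An account hierarchy where every account has zero shares no longer crashes.
all_zero = normalized_shares([0, 0, 0])
# A normal hierarchy is unaffected by the guard.
mixed = normalized_shares([1, 3])
```

Returning 0.0 for every sibling is only one possible policy for the all-zero case; the commit itself may handle it differently.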
-
- Jun 04, 2013
-
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
Without this change, POE appears to ignore the -procs argument, resulting in a job step request with multiple host names but an ntask count of only one.
-
- Jun 03, 2013
-
-
jette authored
Previously, if the required node had no available CPUs left, other nodes in the job allocation would be used.
-
Hongjia Cao authored
We're having some trouble getting our slurm jobs to successfully restart after a checkpoint. For this test, I'm using sbatch and a simple, single-threaded executable. Slurm is 2.5.4, blcr is 0.8.5.

I'm submitting the job using sbatch:
$ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:
$ scontrol checkpoint create 137
.... time passes ....
$ scontrol vacate 137

At that point, I see the checkpoint file from blcr in the current directory and the checkpoint file from Slurm in /var/spool/slurm-llnl/checkpoint. However, when I attempt to restart the job:
$ scontrol checkpoint restart 137
scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:
[2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
[2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
[2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
[2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
-
- May 30, 2013
-
-
Morris Jette authored
Uninitialized variables resulted in the error "cons_res: sync loop not progressing, holding job #".
-
- May 29, 2013
-
-
jette authored
The most notable problem case is on a Cray, where a job step specifically requests one or more nodes that are not the first nodes in the job allocation.
-
- May 23, 2013
-
-
Morris Jette authored
The problem we have observed is that the backfill scheduler temporarily gives up its locks (for one second) but then reclaims them before the backlog of work completes, which keeps the backfill scheduler running for a very long time under heavy load. See bug 297.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Fix minor bug in sdiag backfill scheduling time reported on BlueGene systems.
Improve explanation of backfill scheduling cycle time calculation.
-
Morris Jette authored
Defers (rather than forgets) a reboot request when a job is running on the node within a reservation.
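The difference between deferring and forgetting a request can be sketched with a small model. Everything here is hypothetical and is not Slurm's implementation: the `Node` class, its fields, and its methods are invented for illustration, and the sketch ignores reservations entirely. It only shows the general idea that a reboot requested while a job is running is remembered and carried out once the node becomes idle.

```python
class Node:
    """Toy model of a compute node (hypothetical, not a Slurm structure)."""

    def __init__(self, name):
        self.name = name
        self.running_jobs = 0
        self.reboot_pending = False
        self.reboots = 0

    def job_started(self):
        self.running_jobs += 1

    def job_finished(self):
        self.running_jobs -= 1
        # A deferred request is honored as soon as the node is idle.
        if self.running_jobs == 0 and self.reboot_pending:
            self._reboot()

    def request_reboot(self):
        if self.running_jobs > 0:
            # Defer: remember the request instead of dropping it.
            self.reboot_pending = True
        else:
            self._reboot()

    def _reboot(self):
        self.reboot_pending = False
        self.reboots += 1


node = Node("n1")
node.job_started()
node.request_reboot()   # node is busy, so the request is deferred
node.job_finished()     # node is idle, so the deferred reboot happens now
```

A "forgetting" implementation would simply return from `request_reboot` when the node is busy, and the request would be lost.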
-
Danny Auble authored
-
Danny Auble authored
-
- May 22, 2013
-
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-
jette authored
-
jette authored
-
- May 21, 2013
-
-
Morris Jette authored
-
- May 18, 2013
-
-
Danny Auble authored
all preemptable jobs on the midplane instead of just the ones it needed to.
-
Danny Auble authored
-
- May 16, 2013
-
-
Morris Jette authored
This bug was introduced in commit f1cf6d2d, the fix for bug 290.
-
Danny Auble authored
-
- May 14, 2013
-
-
Morris Jette authored
-
Morris Jette authored
-
- May 13, 2013
-
-
Morris Jette authored
-
Morris Jette authored
Downing the node will kill all jobs allocated to the node, which is very bad on something like a BlueGene system.
-
- May 11, 2013
-
-
David Bigagli authored
-
- May 10, 2013
-
-
Hongjia Cao authored
Fix for the following problem: if a node is excised from a job and a reconfiguration (e.g., a partition update) is done while the job is still running, the node is left in the idle state but remains unavailable until the next reconfiguration or restart of slurmctld after the job finishes.
-
- May 08, 2013
-
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
the node tab and we didn't notice.
-
- May 07, 2013
-
-
David Bigagli authored
-
- May 04, 2013
-
-
Morris Jette authored
Response to bug 274
-
- May 03, 2013
-
-
jette authored
Make test work if current working directory is not in the search path.
Check for appropriate task rank on POE-based systems.
Disable the entire test on POE systems.
-
- May 02, 2013
-
-
jette authored
Without this change, pmdv12 was bound to one CPU and could not use all of the resources allocated to the job step for the tasks that it launches.
-