- Feb 05, 2013
-
-
Danny Auble authored
-
Danny Auble authored
to compile correctly on newer compilers
Signed-off-by: Danny Auble <da@schedmd.com>
-
Don Lipari authored
-
jette authored
-
jette authored
-
Morris Jette authored
If a job involved in a dependency completes and is purged, the logic used to test for circular dependencies can use the invalid pointer and generate an invalid memory reference before the pointer is cleared from the dependency list data structure.
-
Morris Jette authored
-
- Feb 04, 2013
-
-
Morris Jette authored
-
jette authored
Without this change, allocations of less than a whole node can result in incorrect task binding on POWER7 processors.
-
Morris Jette authored
-
jette authored
-
Morris Jette authored
-
- Feb 01, 2013
-
-
Morris Jette authored
-
jette authored
Without this change, the testing only happens with task/affinity.
-
jette authored
-
Morris Jette authored
This bug was introduced in 2.5.2 for the case where a GPU count was configured without device files. Unfortunately, it left the CUDA_VISIBLE_DEVICES environment variable unset if GPUs were configured but the job did not request any of them.
-
Nathan Yee authored
-
Morris Jette authored
-
Morris Jette authored
-
- Jan 31, 2013
-
-
Morris Jette authored
-
Morris Jette authored
-
Nathan Yee authored
-
Morris Jette authored
This eliminates the need for libnrt.so on the head node.
-
- Jan 30, 2013
-
-
Morris Jette authored
-
David Bigagli authored
-
Morris Jette authored
-
- Jan 29, 2013
-
-
Danny Auble authored
block, destroy it correctly.
-
Danny Auble authored
at least 3.5.0. This avoids a stack overflow when running jobs on more than 120k nodes.
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
The new callbacks are not fleshed out, but this eliminates a build error.
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
function.
-
Morris Jette authored
-
- Jan 28, 2013
-
-
David Bigagli authored
-
- Jan 26, 2013
-
-
Danny Auble authored
-
- Jan 23, 2013
-
-
jette authored
I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be allocated to jobs. It can be reproduced as follows.

First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set to allocate nodes exclusively by default). Then set one of the nodes allocated to the job (cn2) to state DOWN:

    srun: error: Node failure on cn2
    srun: error: Node failure on cn2
    srun: error: cn2: task 0: Killed
    ^Csrun: interrupt (one more within 1 sec to abort)
    srun: task 1: running
    srun: task 0: exited abnormally
    ^Csrun: sending Ctrl-C to job 22605.0
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
    srun: Force Terminated job step 22605.0

Then change the state of the node to IDLE again. It still cannot be allocated to jobs:

    srun: job 22606 queued and waiting for resources

    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    22606 work hostname root PD 0:00 1 (Resources)
    22604 work sbatch root R 3:06 1 cn1

    NodeName=cn2 Arch=x86_64 CoresPerSocket=8 CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc Gres=(null) NodeAddr=cn2 NodeHostName=cn2 OS=Linux RealMemory=30000 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32 CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

I traced and located the problem in select/cons_res. The call sequence is:

    slurmctld/node_mgr.c: update_node()
    => slurmctld/job_mgr.c: kill_running_job_by_node_name()
    => excise_node_from_job()
    => plugins/select/cons_res/select_cons_res.c: select_p_job_resized()
    => _rm_job_from_one_node()
    => _build_row_bitmaps()
    => common/job_resources: remove_job_from_cores()

If there are other jobs running in the partition, the partition row bitmap will not be set correctly.

In the example above, before _build_row_bitmaps(), the output of _dump_part() is:

    [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
    [2013-01-19T13:24:56+08:00] row0: num_jobs 2: bitmap: 16,32-63

After setting the node down, the output of _dump_part() is:

    [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
    [2013-01-19T13:24:56+08:00] row0: num_jobs 2: bitmap: 16,32-47

The cores of cn2 are not marked as available. Instead, cores of other nodes are released. When another job requires node cn2, the following log message appears:

    [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy

I do not understand the design of select/cons_res well and I do not know how to fix this. But it seems that _build_row_bitmaps() should not be called, since the job is not removed entirely; only one of its nodes is released.
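The bitmap discrepancy in the report can be illustrated with a small sketch. This is not Slurm code. It assumes 16 cores per node (taken from CPUTot=16 in the report) and, as a guess that fits the logged ranges, that cn2 owns core bits 32-47. The helper names `node_cores` and `format_bitmap` and the constant `CN2_INDEX` are hypothetical. The sketch only compares the row bitmap one would expect after releasing cn2's cores with the one that was logged.

```python
CORES_PER_NODE = 16  # from CPUTot=16 in the report


def node_cores(node_index, cores_per_node=CORES_PER_NODE):
    """Return the set of core bit indices owned by one node (hypothetical layout)."""
    start = node_index * cores_per_node
    return set(range(start, start + cores_per_node))


def format_bitmap(bits):
    """Render a set of bit indices in the compact range style seen in the log."""
    runs = []
    for b in sorted(bits):
        if runs and b == runs[-1][1] + 1:
            runs[-1][1] = b  # extend the current run
        else:
            runs.append([b, b])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


# Row bitmap logged before the node was set DOWN: "16,32-63"
row_before = {16} | set(range(32, 64))

# Assumption: cn2 is node index 2, so its cores are bits 32-47.
CN2_INDEX = 2

# Releasing only cn2's cores should clear exactly that node's range.
expected_after = row_before - node_cores(CN2_INDEX)

# Row bitmap actually logged after the node was set DOWN: "16,32-47"
observed_after = {16} | set(range(32, 48))

print("before:  ", format_bitmap(row_before))
print("expected:", format_bitmap(expected_after))
print("observed:", format_bitmap(observed_after))
```

Under these assumptions the expected bitmap is 16,48-63 while the logged one is 16,32-47, which matches the report's observation that cn2's cores stay marked as busy while another node's cores are released.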
-
Morris Jette authored
-
- Jan 22, 2013
-
-
Danny Auble authored
-