- Jan 29, 2013
-
-
Morris Jette authored
-
David Bigagli authored
-
David Bigagli authored
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
function.
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- Jan 28, 2013
-
-
Danny Auble authored
-
David Bigagli authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- Jan 26, 2013
-
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- Jan 25, 2013
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
No change in logic yet, just a variable rename
-
Morris Jette authored
-
Morris Jette authored
-
- Jan 24, 2013
-
-
Morris Jette authored
Add squeue options to print array_job_id and array_task_id Change the environment variables SLURM_ARRAY_JOBID to SLURM_ARRAY_JOB_ID and SLURM_ARRAY_ID to SLURM_ARRAY_TASK_ID Substantial updates to web page
-
Morris Jette authored
-
Morris Jette authored
Put "switches" in alphabetic order Remove "\n" from switches output, that adds extra space in display
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- Jan 23, 2013
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
I run into a problem with slurm-2.5.1 that IDLE nodes can not be allocated to jobs. This can be reproduced as follows: First, submit a job with --no-kill option (I have SLURM_EXCLUSIVE set to allocate nodes exclusively by default). Then set one of the nodes allocated to the job(cn2) to state DOWN: srun: error: Node failure on cn2 srun: error: Node failure on cn2 srun: error: cn2: task 0: Killed ^Csrun: interrupt (one more within 1 sec to abort) srun: task 1: running srun: task 0: exited abnormally ^Csrun: sending Ctrl-C to job 22605.0 srun: Job step aborted: Waiting up to 2 seconds for job step to finish. srun: Force Terminated job step 22605.0 Then change state of the node to IDLE again. But it can not be allocated to jobs: srun: job 22606 queued and waiting for resources JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 22606 work hostname root PD 0:00 1 (Resources) 22604 work sbatch root R 3:06 1 cn1 NodeName=cn2 Arch=x86_64 CoresPerSocket=8 CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc Gres=(null) NodeAddr=cn2 NodeHostName=cn2 OS=Linux RealMemory=30000 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32 CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 I traced and located the problem in select/cons_res. The call sequence is: slurmctld/node_mgr.c: update_node() => slurmctld/job_mgr.c: kill_running_job_by_node_name() => excise_node_from_job() => plugins/select/cons_res/select_cons_res.c: select_p_job_resized() => _rm_job_from_one_node() => _build_row_bitmaps() => common/job_resources: remove_job_from_cores() If there are other jobs running in the partition, the partition row bitmap will not be set correctly. In the example above, before _build_row_bitmaps(), output of _dump_part() is: [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1 [2013-01-19T13:24:56+08:00] row0: num_jobs 2: bitmap: 16,32-63 after setting the node down, output of _dump_part() is [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1 [2013-01-19T13:24:56+08:00] row0: num_jobs 2: bitmap: 16,32-47 Cores of cn2 are not marked as available. Instead, cores of other nodes are released. When another job requires the node cn2, the following log message appears: [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy I do not understand the design of select/cons_res well and I do not know how to fix this. But it seems that _build_row_bitmaps() should not be called, since the job is not removed totally, but only one of the nodes released.
-
Morris Jette authored
-