Commits · d0d01878cc2f16bc7d70500dcca01fe350740ef5 · tud-zih-energy / Slurm

Jan 29, 2013
- Merge branch 'slurm-2.5' · d0d01878
  Morris Jette authored 12 years ago
  
  d0d01878
- Avoid invalid memory when GRES has no available plugin · 8a63416d
  David Bigagli authored 12 years ago
  
  8a63416d
- Fix for write off end of allocated memory · e60849a5
  David Bigagli authored 12 years ago
  
  e60849a5
- No change in logic. Change formatting to match Linux kernel standard · c10d7f6e
  Morris Jette authored 12 years ago
  
  c10d7f6e
- Make squeue job array compressed format be the default · 04f57f70
  Morris Jette authored 12 years ago
  
  04f57f70
- Change variable info to be io_info to avoid confusion with the info · 23bf863f
  Danny Auble authored 12 years ago
  
  function.
  23bf863f
- Increase memory limit in a test · 072814db
  jette authored 12 years ago
  
  072814db
- Merge branch 'slurm-2.5' · 41a969a6
  Morris Jette authored 12 years ago
  
  41a969a6
- Fix typo in log message · 3164de16
  Morris Jette authored 12 years ago
  
  3164de16
Jan 28, 2013
- reset errno to 0 if able to communicate with a socket · 0ca4d83e
  Danny Auble authored 12 years ago
  
  0ca4d83e
- Fix typo in squeue man page · 282b964c
  David Bigagli authored 12 years ago
  
  282b964c
- No change in logic. Change formatting to match standard · 78ef038c
  Morris Jette authored 12 years ago
  
  78ef038c
- add job array support to sstat, sprio and sattach · 70532fe3
  Morris Jette authored 12 years ago
  
  70532fe3
- Add job array support for scontrol hold/holdu/release commands · 98e6b2dc
  Morris Jette authored 12 years ago
  
  98e6b2dc
- Add function to translate job ID string to number, supports job arrays · 98c07f7e
  Morris Jette authored 12 years ago
  
  98c07f7e
Jan 26, 2013
- reset errno to 0 if able to communicate with a socket · eedbceda
  Danny Auble authored 12 years ago
  
  eedbceda
- Fix for squeue step filtering logic with job arrays · f8ed1750
  Morris Jette authored 12 years ago
  
  f8ed1750
- Correction to squeue job/step filtering logic for job arrays · 15327452
  Morris Jette authored 12 years ago
  
  15327452
- Add job array support to scontrol show job and show step · 619c0398
  Morris Jette authored 12 years ago
  
  619c0398
Jan 25, 2013
- Add squeue filtering for jobs or steps with job array ID values · 43bb4daa
  Morris Jette authored 12 years ago
  
  43bb4daa
- Add support for job array expressions in scancel · bd8732a6
  Morris Jette authored 12 years ago
  
  bd8732a6
- Modify scancel to support job arrays using a single RPC · 6b427224
  Morris Jette authored 12 years ago
  
  6b427224
- Add slurmctld support to signal entire job array with one RPC · d22491f6
  Morris Jette authored 12 years ago
  
  d22491f6
- Change "batch_flag" to "flags" in kill_job data structure and functions · 7a513f19
  Morris Jette authored 12 years ago
  
  No change in logic yet, just a variable rename
  7a513f19
- Fixes for squeue job ID filtering with job arrays · ee9a5abe
  Morris Jette authored 12 years ago
  
  ee9a5abe
- Optimize squeue output for job arrays · 57e4b8e1
  Morris Jette authored 12 years ago
  
  57e4b8e1
Jan 24, 2013
- More job array support · f4464c58
  Morris Jette authored 12 years ago
  
  Add squeue options to print array_job_id and array_task_id.
  Change the environment variables SLURM_ARRAY_JOBID to SLURM_ARRAY_JOB_ID and SLURM_ARRAY_ID to SLURM_ARRAY_TASK_ID.
  Substantial updates to web page.
  f4464c58
- Add sview support for job arrays · 56eb4257
  Morris Jette authored 12 years ago
  
  56eb4257
- Minor improvement to sview full job display · f2f1f46c
  Morris Jette authored 12 years ago
  
  Put "switches" in alphabetic order.
  Remove "\n" from switches output, which adds extra space in display.
  f2f1f46c
- Update job array web page · e49a3e33
  Morris Jette authored 12 years ago
  
  e49a3e33
- Add testimonial from Colin McMurtrie to web page · 241e2660
  Morris Jette authored 12 years ago
  
  241e2660
- Add scancel support for job arrays · b20837a7
  Morris Jette authored 12 years ago
  
  b20837a7
Jan 23, 2013

- Add support for job array IDs within stdin/out/err file names · b8b16876
  Morris Jette authored 12 years ago
  
  b8b16876
- Minor format changes to team web page · 297ff720
  Morris Jette authored 12 years ago
  
  297ff720
- Add initial version of job array web page · 59d22ac5
  Morris Jette authored 12 years ago
  
  59d22ac5
- Added SLURM_ARRAY_ID to environment of job array · ded211de
  Morris Jette authored 12 years ago
  
  ded211de
- Added SLURM_SUBMIT_HOST to salloc, sbatch and srun job environment. · 4a55e5b1
  Morris Jette authored 12 years ago
  
  4a55e5b1
- Merge branch 'slurm-2.5' · d9bd45db
  Morris Jette authored 12 years ago
  
  d9bd45db

- In select/cons_res, correct logic when job removed from only some nodes. · eb3c1046
  jette authored 12 years ago

I ran into a problem with slurm-2.5.1 in which IDLE nodes cannot be
allocated to jobs. This can be reproduced as follows:

First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set to
allocate nodes exclusively by default). Then set one of the nodes
allocated to the job (cn2) to state DOWN:

srun: error: Node failure on cn2
srun: error: Node failure on cn2
srun: error: cn2: task 0: Killed
^Csrun: interrupt (one more within 1 sec to abort)
srun: task 1: running
srun: task 0: exited abnormally
^Csrun: sending Ctrl-C to job 22605.0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: Force Terminated job step 22605.0

Then change the state of the node to IDLE again. It still cannot be
allocated to jobs:

srun: job 22606 queued and waiting for resources

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  22606      work hostname     root  PD       0:00      1 (Resources)
  22604      work   sbatch     root   R       3:06      1 cn1

NodeName=cn2 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
   Gres=(null)
   NodeAddr=cn2 NodeHostName=cn2
   OS=Linux RealMemory=30000 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

I traced and located the problem in select/cons_res. The call sequence
is:

slurmctld/node_mgr.c: update_node() =>
slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
excise_node_from_job() =>
plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
_rm_job_from_one_node() => _build_row_bitmaps() =>
common/job_resources: remove_job_from_cores()

If there are other jobs running in the partition, the partition row
bitmap will not be set correctly. In the example above, before
_build_row_bitmaps(), output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63

After setting the node down, the output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47

Cores of cn2 are not marked as available. Instead, cores of other nodes
are released. When another job requires the node cn2, the following log
message appears:

[2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy
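
The effect on the row bitmap can be reproduced with a small toy model. This is only an illustrative Python sketch, not Slurm code: the names `NODE_FIRST_CORE`, `node_cores` and `format_bitmap` are hypothetical, and the core layout (cn2 owning cores 32-47, the job's other node owning cores 48-63) is an assumption chosen to be consistent with the _dump_part() output above, since the report does not state which core indices belong to which node.

```python
# Toy model of a partition row bitmap as a set of busy core indices.
# ASSUMPTION: the layout below is hypothetical; the report does not say
# which global core indices belong to which node.
CORES_PER_NODE = 16
NODE_FIRST_CORE = {"cn2": 32, "other": 48}


def node_cores(node):
    """Return the set of global core indices assumed to belong to a node."""
    first = NODE_FIRST_CORE[node]
    return set(range(first, first + CORES_PER_NODE))


def format_bitmap(bits):
    """Render a set of core indices as comma-separated ranges."""
    parts = []
    run_start = prev = None
    for b in sorted(bits):
        if prev is not None and b == prev + 1:
            prev = b  # extend the current run
            continue
        if run_start is not None:
            parts.append(str(run_start) if run_start == prev
                         else f"{run_start}-{prev}")
        run_start = prev = b  # start a new run
    if run_start is not None:
        parts.append(str(run_start) if run_start == prev
                     else f"{run_start}-{prev}")
    return ",".join(parts)


# Row bitmap before the node is set DOWN, as reported by _dump_part().
before = {16} | set(range(32, 64))

# What the report observed: the other node's cores were cleared.
observed = before - node_cores("other")

# What one would expect: only cn2's cores are cleared.
expected = before - node_cores("cn2")

print(format_bitmap(before))    # 16,32-63
print(format_bitmap(observed))  # 16,32-47
print(format_bitmap(expected))  # 16,48-63
```

Under that assumed layout, the observed bitmap still has all of cn2's cores set, which matches the "node cn2 busy" message.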

I do not understand the design of select/cons_res well and I do not know
how to fix this. But it seems that _build_row_bitmaps() should not be
called, since the job is not removed entirely; only one of its nodes is
released.

eb3c1046

- Merge branch 'slurm-2.5' · 5157a780
  Morris Jette authored 12 years ago
  
  5157a780