- Mar 20, 2013
- Morris Jette authored
- Hongjia Cao authored
- Danny Auble authored
  cluster.
- jette authored
- jette authored
- jette authored
- Morris Jette authored
- Mar 19, 2013
- Morris Jette authored
  Conflicts: src/plugins/sched/backfill/backfill.c
- Don Lipari authored
- Morris Jette authored
- Hongjia Cao authored
  select()/FD_ISSET() does not work for file descriptors larger than 1023.
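  (Background, not part of the commit message: select() uses a fixed-size
  fd_set bitmask of FD_SETSIZE bits, typically 1024, so FD_SET/FD_ISSET on
  larger descriptor values index past the mask. A minimal sketch of the
  usual poll()-based alternative follows; the helper name is illustrative,
  not the actual SLURM change.)

      #include <poll.h>

      /* Wait until fd is readable or timeout_ms elapses.  Unlike select(),
       * poll() takes an array of descriptors rather than a fixed-size
       * bitmask, so it also works for fd values >= FD_SETSIZE (1024).
       * Returns 1 if readable, 0 on timeout, -1 on error. */
      static int wait_readable(int fd, int timeout_ms)
      {
          struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
          int rc = poll(&pfd, 1, timeout_ms);

          if (rc <= 0)
              return rc;          /* 0 = timeout, -1 = error (see errno) */
          return (pfd.revents & POLLIN) ? 1 : -1;
      }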
- Morris Jette authored
- Hongjia Cao authored
  avoid add/remove node resource of job if the node is lost by resize

  I found another case where an idle node cannot be allocated. It can be
  reproduced as follows:

  1. Run a job with the -k option:

     [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
     srun: error: Node failure on cn28
     srun: error: Node failure on cn28
     srun: error: cn28: task 10: Killed
     ^Csrun: interrupt (one more within 1 sec to abort)
     srun: tasks 0-9: running
     srun: task 10: exited abnormally
     ^Csrun: sending Ctrl-C to job 106120.0
     srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

  2. Set a node down and then set it idle:

     [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
     [root@mn0 ~]# scontrol update nodename=cn28 state=idle

  3. Restart slurmctld:

     [root@mn0 ~]# service slurm restart
     stopping slurmctld:    [ OK ]
     slurmctld is stopped
     starting slurmctld:    [ OK ]

  4. Cancel the job.

  The node that was set down is then left unavailable:

     [root@mn0 ~]# sinfo -n cn[18-28]
     PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
     work*     up    infinite     11  idle cn[18-28]

     [root@mn0 ~]# srun -w cn[18-28] hostname
     srun: job 106122 queued and waiting for resources

     [root@mn0 slurm]# grep cn28 slurmctld.log
     [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
     [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use

  I made an attempt to fix this with the attached patch. Please review it.
- Morris Jette authored
- Morris Jette authored
  I don't believe save_time_limit was redundant. At least in this case:

      if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
          if (orig_time_limit == NO_VAL)
              orig_time_limit = comp_time_limit;
          job_ptr->time_limit = orig_time_limit;
          [...]

  So later, when updating the db,

      if (save_time_limit != job_ptr->time_limit)
          jobacct_storage_g_job_start(acct_db_conn, job_ptr);

  will cause the db to be updated, while

      if (orig_time_limit != job_ptr->time_limit)
          jobacct_storage_g_job_start(acct_db_conn, job_ptr);

  will not, because job_ptr->time_limit now equals orig_time_limit.
- Morris Jette authored
  Conflicts:
      src/db_api/cluster_report_functions.c
      src/plugins/sched/backfill/backfill.c
- Morris Jette authored
- Don Lipari authored
  If the job's time limit is modified down toward --time-min by the
  backfill scheduler, also update the job's time limit in the database;
  without this change, the database keeps the original limit.
- Mar 18, 2013
- Morris Jette authored
- Mar 14, 2013
- Danny Auble authored
- Morris Jette authored
- Danny Auble authored
- Danny Auble authored
- Danny Auble authored
- Morris Jette authored
- Morris Jette authored
  Add milliseconds to the default log message header (both RFC 5424 and
  ISO 8601 time formats). Milliseconds logging can be disabled with the
  configure parameter "--disable-log-time-msec". The default time format
  changes to ISO 8601 (without time zone information); specify
  "--enable-rfc5424time" to restore the time zone information.
- Mar 13, 2013
- Morris Jette authored
  Add milliseconds to the default log message header with the (default)
  RFC 5424 time format. Milliseconds logging can be disabled with the
  configure parameter "--enable-rfc5424time-secs". A sample time stamp:
  "2013-03-13T14:28:17.767-07:00".
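  (For reference, a minimal sketch of producing a timestamp in this format
  with gettimeofday() and strftime(); it only illustrates the format and is
  not SLURM's actual logging implementation.)

      #include <stdio.h>
      #include <sys/time.h>
      #include <time.h>

      /* Format the current local time as ISO 8601 with milliseconds and a
       * numeric UTC offset, e.g. "2013-03-13T14:28:17.767-07:00". */
      static void iso8601_msec(char *buf, size_t len)
      {
          struct timeval tv;
          struct tm tm;
          char date[32], zone[8];

          gettimeofday(&tv, NULL);
          localtime_r(&tv.tv_sec, &tm);
          strftime(date, sizeof(date), "%Y-%m-%dT%H:%M:%S", &tm);
          strftime(zone, sizeof(zone), "%z", &tm);   /* e.g. "-0700" */
          /* Re-insert the colon in the offset: "-0700" -> "-07:00". */
          snprintf(buf, len, "%s.%03ld%.3s:%s", date,
                   (long)(tv.tv_usec / 1000), zone, zone + 3);
      }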
- David Bigagli authored
- Morris Jette authored
  Conflicts: doc/man/man1/sbatch.1
- Morris Jette authored
- Morris Jette authored
  If a step requests more CPUs than are possible within the specified node
  count of the job allocation, return ESLURM_TOO_MANY_REQUESTED_CPUS rather
  than ESLURM_NODES_BUSY, so the step fails immediately instead of being
  retried.
- Morris Jette authored
- Danny Auble authored
- Danny Auble authored
- Mar 12, 2013
- Morris Jette authored
  Conflicts: src/plugins/select/cons_res/select_cons_res.c
- Morris Jette authored
- Magnus Jonsson authored
  I found a bug in cons_res/select_p_select_nodeinfo_set_all. If a node is
  part of two (or more) partitions, the code will only count the number of
  cores/cpus in the partition that has the most running jobs on that node.
  Patch attached to fix the problem.

  I also added a new function to bitstring to count the number of set bits
  in a range (bit_set_count_range) and made a minor improvement to
  bit_set_count while reviewing the range version.

  Best regards,
  Magnus
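  (A minimal sketch of what a range popcount like bit_set_count_range can
  look like, written against a plain uint64_t word array; the real SLURM
  bitstring code operates on its own bitstr_t type, and the names here are
  illustrative.)

      #include <stddef.h>
      #include <stdint.h>

      /* Count set bits in positions [start, end) of a bit array stored as
       * 64-bit words, least-significant bit first.  Partial words at the
       * edges are masked; everything else is a straight popcount, which is
       * also the trick that speeds up a full-array count. */
      static size_t bit_count_range(const uint64_t *words,
                                    size_t start, size_t end)
      {
          size_t count = 0;

          for (size_t i = start; i < end; ) {
              size_t off  = i % 64;
              uint64_t w  = words[i / 64] >> off;
              size_t take = 64 - off;            /* bits left in this word */

              if (take > end - i)
                  take = end - i;
              if (take < 64)
                  w &= (UINT64_C(1) << take) - 1;  /* drop unwanted bits */

              count += (size_t)__builtin_popcountll(w);
              i += take;
          }
          return count;
      }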
- Mar 11, 2013
- Morris Jette authored
- Morris Jette authored
  This permits default reservation names to be more easily managed.
- Andy Wettstein authored