Commits · 4b86225224dbb9a1af5baaa9b572486c9293d7e3 · tud-zih-energy / Slurm

Mar 29, 2013
- Add sanity check for NULL cluster names trying to register. · 4b862252
  Danny Auble authored 11 years ago
  
  4b862252
- Add Slurm User Group 2013 information (incomplete) · 480a744f
  Morris Jette authored 11 years ago
  
  480a744f
Mar 28, 2013
- Remove vestigial files from Slurm User Group Meeting 2010 · f3cb2fdc
  Morris Jette authored 11 years ago
  
  f3cb2fdc
Mar 27, 2013
- Added support for FreeBSD. · 5338879e
  Jason Bacon authored 11 years ago
  
  5338879e
- Purge vestigial job scripts · ea3c9f0b
  Morris Jette authored 11 years ago
  
  Without this patch, when the slurmd cold starts or slurmstepd terminates abnormally, the job script file can be left around. bug 243
  ea3c9f0b
- Reject job at submit time if the node count is invalid · f1cf6d2d
  Morris Jette authored 11 years ago
  
  Previously such a job submitted to a DOWN partition would be queued. bug 187
  f1cf6d2d
- Fix format problem in salloc man page · 71ab5afb
  Morris Jette authored 11 years ago
  
  71ab5afb
- Add missing environment variables to the man pages for srun/salloc/sbatch · bbe34ded
  Nathan Yee authored 11 years ago
  
  bbe34ded
Mar 26, 2013
- Merge pull request #44 from grondo/slurm-2.5-for-schedmd · 338b99fa
  Danny Auble authored 11 years ago
  
  Fix spank_option_getopt in local context
  338b99fa
- Fix spank_option_getopt in local context · 23e3939b
  Mark A. Grondona authored 11 years ago
  
  In local context (srun, sbatch, salloc), spank_option_getopt() would always return that an option was found due to a missing check for spopt->found in spank_option_getopt. This patch fixes the issue.
  23e3939b
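
  A minimal sketch of the kind of check the commit message above describes, not the actual Slurm patch. The struct `demo_opt` and the function `demo_option_getopt` are hypothetical stand-ins invented for illustration; only the idea of testing a `found` flag before reporting success comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cached plugin option; not Slurm's real struct. */
struct demo_opt {
    const char *name;    /* option name as registered by the plugin */
    const char *optarg;  /* argument supplied by the user, if any */
    bool found;          /* true only if the user actually passed the option */
};

/*
 * Look up an option by name.
 * Returns 0 and stores the argument in *argp if the option was given,
 * -1 if it is unknown or was registered but never passed.
 */
static int demo_option_getopt(const struct demo_opt *opts, size_t n,
                              const char *name, const char **argp)
{
    assert(opts != NULL && name != NULL);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(opts[i].name, name) != 0)
            continue;
        /* Without this test, every registered option would look "found". */
        if (!opts[i].found)
            return -1;
        if (argp != NULL)
            *argp = opts[i].optarg;
        return 0;
    }
    return -1;
}
```

  With the `found` test in place, a registered option that the user did not pass is reported the same way as an unknown option.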
- Fix to man page to fix deprecated reference. · 22b16cef
  Danny Auble authored 11 years ago
  
  22b16cef
- Fixes for Allowing user root to add jobs/steps to the dbd. · 146bd61a
  Danny Auble authored 11 years ago
  
  146bd61a
- Update reference to slurm documentation from www.schedmd.com/slurmdocs to · 782b3fdb
  Danny Auble authored 11 years ago
  
  slurm.schedmd.com
  782b3fdb
- Accounting - Minor fix to avoid reuse of variable erroneously. · 9403500e
  Danny Auble authored 12 years ago
  
  9403500e
- Accounting - When rolling up data from past usage ignore "idle" time from · 2ed8a4d6
  Danny Auble authored 12 years ago
  
  a reservation when it has the "Ignore_Jobs" flag set. Since jobs could run outside of the reservation on its nodes, without this you could have double time.
  2ed8a4d6
Mar 25, 2013
- Modify logging logic to more easily disable · c1c50a42
  Morris Jette authored 12 years ago
  
  c1c50a42
- launch/aprun - correction to tasks/per/node computation · 6486499a
  Morris Jette authored 12 years ago
  
  6486499a
- Cray - Disable enforcement of MaxTasksPerNode · aacdb424
  Morris Jette authored 12 years ago
  
  This is not applicable with launch/aprun
  aacdb424
- Note nature of last two patches from Hongjia Cao · a63e616e
  Morris Jette authored 12 years ago
  
  a63e616e
- fix of not enough cpus allocated to a step · e0b68e0c
  Hongjia Cao authored 12 years ago
  
  e0b68e0c
- fix of not running exclusive step with required nodes · 00494f54
  Hongjia Cao authored 12 years ago
  
  00494f54
Mar 24, 2013
- Note Bright support situation · cc668cb7
  jette authored 12 years ago
  
  cc668cb7
Mar 23, 2013
- Vestigial code cleanup in squeue · 39d20fa5
  Lipari, Don authored 12 years ago
  
  39d20fa5
Mar 22, 2013

Andy Wettstein authored 12 years ago

On Redhat 6 based distros the lua library name is liblua-5.1.so.
Installing the lua-devel package will create the liblua.so symlink, but
if that isn't installed then the lua job submit plugin will fail to
load.
I'm attaching a patch that adds liblua-5.1.so to the search path.

9112d154

Select/cray - Modify build to enable direct use of libslurm library. · 7d4f145a

Morris Jette authored 12 years ago

These changes are required so that select/cray can load select/linear,
  which is a bit more complex than the other select plugin structures.
Export plugin_context_create and plugin_context_destroy symbols from
  libslurm.so.
Correct typo in exported hostlist_sort symbol name
Define some functions in select/cray to avoid undefined symbols if
  the plugin is loaded via libslurm rather than from a slurm command
  (which has all of the required symbols)

7d4f145a

Mar 21, 2013
- Update support info · 33647a85
  Morris Jette authored 12 years ago
  
  33647a85
Mar 20, 2013
- [PATCH] fix of job requiring contiguous nodes can not run · e416e35f
  Hongjia Cao authored 12 years ago
  
  e416e35f
- SlurmDBD - fix to allow user root along with the slurm user to register a · 485cb062
  Danny Auble authored 12 years ago
  
  cluster.
  485cb062
- Decrease time limit in a test in case of small partition time limit · e0020ed1
  jette authored 12 years ago
  
  e0020ed1
- Add more logging information to a test · b912fad0
  jette authored 12 years ago
  
  b912fad0
- initialize timer string to avoid garbage in log messages · 73996996
  Morris Jette authored 12 years ago
  
  73996996
Mar 19, 2013

Log when a job's time limit is changed by backfill scheduling · 03ad76cf
Don Lipari authored 12 years ago

03ad76cf
Select/cons_res - Tighter packing of job allocations on sockets. · 7fcdc7e5
Morris Jette authored 12 years ago

7fcdc7e5
change select() to poll() in waiting for a socket to be readable · 3175cf91
Hongjia Cao authored 12 years ago

select()/FD_ISSET() does not work for file descriptors larger than 1023.

3175cf91
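
The limit mentioned above comes from the fixed size of `fd_set` (`FD_SETSIZE`, typically 1024), whereas `poll()` takes the descriptor as a plain integer in a `struct pollfd`. The following is a minimal sketch of waiting for readability with `poll()`; it is not the code from the commit, and the helper names `wait_readable` and `read_when_ready` are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until fd is readable.
 * Returns 1 if readable (or hung up), 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    assert(fd >= 0);
    do {
        /* Simplification: an interrupting signal restarts the full timeout. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return rc;
    if (pfd.revents & POLLNVAL) {
        /* The descriptor is not open. */
        errno = EBADF;
        return -1;
    }
    /* POLLIN, POLLHUP or POLLERR: the next read() will not block. */
    return 1;
}

/* Read once the descriptor is ready; a timeout is reported as ETIMEDOUT. */
static ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
    int rc = wait_readable(fd, timeout_ms);

    if (rc < 0)
        return -1;
    if (rc == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    return read(fd, buf, len);
}
```

Because the descriptor value is never used as a bit index, the same helper works whether the descriptor is 5 or 5000.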
Note nature of latest change · 8e038b5c
Morris Jette authored 12 years ago

8e038b5c

fix of idle nodes cannot be allocated · 4ea9850a

Hongjia Cao authored 12 years ago

avoid add/remove node resource of job if the node is lost by resize

I found another case where an idle node cannot be allocated. It can be
reproduced as follows:

1. run a job with -k option:

    [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
    srun: error: Node failure on cn28
    srun: error: Node failure on cn28
    srun: error: cn28: task 10: Killed
    ^Csrun: interrupt (one more within 1 sec to abort)
    srun: tasks 0-9: running
    srun: task 10: exited abnormally
    ^Csrun: sending Ctrl-C to job 106120.0
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

2. set a node down and then set it idle:

    [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
    [root@mn0 ~]# scontrol update nodename=cn28 state=idle

3. restart slurmctld

    [root@mn0 ~]# service slurm restart
    stopping slurmctld:                                        [  OK  ]
    slurmctld is stopped
    starting slurmctld:                                        [  OK  ]

4. cancel the job

Then the node that was set down is left unavailable, even though sinfo reports it as idle:

    [root@mn0 ~]# sinfo -n cn[18-28]
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    work*        up   infinite     11   idle cn[18-28]

    [root@mn0 ~]# srun -w cn[18-28] hostname
    srun: job 106122 queued and waiting for resources

    [root@mn0 slurm]# grep cn28 slurmctld.log
    [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
    [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use

I made an attempt to fix this with the attached patch. Please review it.

4ea9850a

Correction in logic issuing call to account for change in job time limit · 9f5a7a0e

Morris Jette authored 12 years ago

I don't believe save_time_limit was redundant.  At least in this case:

    if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
        if (orig_time_limit == NO_VAL)
            orig_time_limit = comp_time_limit;
        job_ptr->time_limit = orig_time_limit;
    [...]

So later, when updating the db,

    if (save_time_limit != job_ptr->time_limit)
        jobacct_storage_g_job_start(acct_db_conn,
                                    job_ptr);

will cause the db to be updated, while,

    if (orig_time_limit != job_ptr->time_limit)
        jobacct_storage_g_job_start(acct_db_conn,
                                    job_ptr);

will not because job_ptr->time_limit now equals orig_time_limit.

9f5a7a0e

Do not report error when job step terminates while sstat is running · 4cb6137c
Morris Jette authored 12 years ago

4cb6137c

Record updated job time limit if modified by backfill · 46348f91

Don Lipari authored 12 years ago

With this change, if the job's time limit is modified down
toward --time-min by the backfill scheduler, the job's
time limit is also updated in the database.

46348f91

Mar 14, 2013
- sreport - Fix by adding planned down time to utilization reports. · dced5e7f
  Danny Auble authored 12 years ago
  
  dced5e7f