  1. Mar 29, 2013
  2. Mar 28, 2013
  3. Mar 27, 2013
  4. Mar 26, 2013
  5. Mar 25, 2013
  6. Mar 24, 2013
  7. Mar 23, 2013
  8. Mar 22, 2013
    • Andy Wettstein's avatar
      Add path for liblua · 9112d154
      Andy Wettstein authored
      On Red Hat 6 based distros the Lua library name is liblua-5.1.so.
      Installing the lua-devel package creates the liblua.so symlink, but
      if that package isn't installed then the lua job submit plugin will
      fail to load.
      I'm attaching a patch that adds liblua-5.1.so to the search path.
      9112d154
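The fallback idea in this commit can be sketched as below. This is only an illustration, not the actual patch: the function names `open_first_available` and `open_lua_library` are hypothetical, and the candidate list contains just the two library names mentioned in the message. On older glibc versions the program may need to be linked with `-ldl`.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Return a handle to the first library in the NULL-terminated list
 * that dlopen() can load, or NULL if none of them loads. */
void *open_first_available(const char **names)
{
    for (size_t i = 0; names[i] != NULL; i++) {
        void *handle = dlopen(names[i], RTLD_NOW | RTLD_GLOBAL);
        if (handle != NULL)
            return handle;
    }
    return NULL;
}

/* Hypothetical wrapper: try the unversioned symlink first, then the
 * versioned name that exists without the lua-devel package. */
void *open_lua_library(void)
{
    const char *candidates[] = { "liblua.so", "liblua-5.1.so", NULL };
    return open_first_available(candidates);
}
```

Trying the unversioned name first keeps the behaviour unchanged on systems where the symlink exists; the versioned name is only a fallback.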
    • Morris Jette's avatar
      Select/cray - Modify build to enable direct use of libslurm library. · 7d4f145a
      Morris Jette authored
      These changes are required so that select/cray can load select/linear,
        which is a bit more complex than the other select plugin structures.
      Export plugin_context_create and plugin_context_destroy symbols from
        libslurm.so.
      Correct typo in exported hostlist_sort symbol name
      Define some functions in select/cray to avoid undefined symbols if
        the plugin is loaded via libslurm rather than from a slurm command
        (which has all of the required symbols)
      7d4f145a
  9. Mar 21, 2013
  10. Mar 20, 2013
  11. Mar 19, 2013
    • Hongjia Cao's avatar
      change select() to poll() in waiting for a socket to be readable · 3175cf91
      Hongjia Cao authored
      select()/FD_ISSET() does not work for file descriptors larger than 1023,
      because an fd_set can only hold descriptors below FD_SETSIZE
      (typically 1024).
      3175cf91
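A minimal sketch of waiting for a single socket to become readable with poll() follows. It is not the code from the commit; the helper name `wait_readable` and its return convention are assumptions made for illustration. Unlike select(), poll() takes the descriptor as a plain integer in a `struct pollfd`, so the descriptor's numeric value does not matter.

```c
#include <errno.h>
#include <poll.h>

/* Wait until fd has data to read (or a hangup/error is pending).
 * Returns 1 if a read() will not block, 0 on timeout, -1 on error.
 * A negative timeout_ms waits indefinitely. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait. Note that this simple
     * loop restarts the full timeout after each interruption. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return rc;

    /* POLLHUP and POLLERR also wake the caller; the following read()
     * reports the actual condition. POLLNVAL (bad fd) is an error. */
    return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : -1;
}
```

A caller would typically invoke `wait_readable(sock, 1000)` before `read()` and treat a return value of 0 as a timeout.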
    • Morris Jette's avatar
      Note nature of latest change · 8e038b5c
      Morris Jette authored
      8e038b5c
    • Hongjia Cao's avatar
      fix of idle nodes cannot be allocated · 4ea9850a
      Hongjia Cao authored
      Avoid adding or removing a job's node resources if the node was lost
      through a resize.

      I found another case in which an idle node cannot be allocated. It can
      be reproduced as follows:
      
      1. run a job with -k option:
      
          [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
          srun: error: Node failure on cn28
          srun: error: Node failure on cn28
          srun: error: cn28: task 10: Killed
          ^Csrun: interrupt (one more within 1 sec to abort)
          srun: tasks 0-9: running
          srun: task 10: exited abnormally
          ^Csrun: sending Ctrl-C to job 106120.0
          srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      
      2. set a node down and then set it idle:
      
          [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
          [root@mn0 ~]# scontrol update nodename=cn28 state=idle
      
      3. restart slurmctld
      
          [root@mn0 ~]# service slurm restart
          stopping slurmctld:                                        [  OK  ]
          slurmctld is stopped
          starting slurmctld:                                        [  OK  ]
      
      4. cancel the job
      
      After that, the node that was set down cannot be allocated, even though
      sinfo reports it as idle:
      
          [root@mn0 ~]# sinfo -n cn[18-28]
          PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
          work*        up   infinite     11   idle cn[18-28]
      
          [root@mn0 ~]# srun -w cn[18-28] hostname
          srun: job 106122 queued and waiting for resources
      
          [root@mn0 slurm]# grep cn28 slurmctld.log
          [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
          [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
      
      I attempted to fix this with the attached patch. Please review it.
      4ea9850a
    • Morris Jette's avatar
      Correction in logic issuing call to account for change in job time limit · 9f5a7a0e
      Morris Jette authored
      I don't believe save_time_limit was redundant.  At least in this case:
      
          if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
              if (orig_time_limit == NO_VAL)
                  orig_time_limit = comp_time_limit;
              job_ptr->time_limit = orig_time_limit;
          [...]

      So later, when updating the db,

          if (save_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn,
                                          job_ptr);

      will cause the db to be updated, while

          if (orig_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn,
                                          job_ptr);

      will not, because job_ptr->time_limit now equals orig_time_limit.
      9f5a7a0e
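The argument in this message can be reproduced with a small stand-alone sketch. Everything here is a simplified assumption rather than Slurm code: `struct job_record` is reduced to one field, `NO_VAL` is a stand-in sentinel, the limits 120 and 60 are arbitrary, and `save_time_limit` is assumed to be a snapshot of the job's time limit taken before the quoted branch runs.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_VAL 0xfffffffeU      /* stand-in sentinel for this sketch */

struct job_record {             /* reduced to the one field we need */
    uint32_t time_limit;
};

/* Mirrors the quoted branch: adopt the computed limit if no original
 * limit was recorded, then assign it to the job. */
static void apply_no_reserve(struct job_record *job,
                             uint32_t *orig_time_limit,
                             uint32_t comp_time_limit)
{
    if (*orig_time_limit == NO_VAL)
        *orig_time_limit = comp_time_limit;
    job->time_limit = *orig_time_limit;
}

/* True if the comparison would trigger a database update. */
bool db_update_needed(uint32_t reference_limit,
                      const struct job_record *job)
{
    return reference_limit != job->time_limit;
}

/* Run the scenario and report what each comparison decides. */
void scenario(bool *with_save, bool *with_orig)
{
    struct job_record job = { .time_limit = 120 };
    uint32_t save_time_limit = job.time_limit;  /* assumed snapshot */
    uint32_t orig_time_limit = NO_VAL;

    apply_no_reserve(&job, &orig_time_limit, 60);

    *with_save = db_update_needed(save_time_limit, &job);
    *with_orig = db_update_needed(orig_time_limit, &job);
}
```

In this scenario the job's limit changes from 120 to 60. The comparison against the snapshot detects the change, while the comparison against `orig_time_limit` does not, because both sides were just set to the same value.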
    • Don Lipari's avatar
      Record updated job time limit if modified by backfill · 46348f91
      Don Lipari authored
      With this change, if the job's time limit is modified down
      toward --time-min by the backfill scheduler, the job's
      time limit is also updated in the database.
      46348f91
  12. Mar 14, 2013