  1. Apr 11, 2016
    • Morris Jette's avatar
backfill - minor performance enhancements · 395b5505
      Morris Jette authored
The gprof tool shows most time being consumed by the bit_test()
  function as called from the select plugin, which in turn was called
  by the backfill scheduler. These changes replace the for-loop end-points.
  Previous logic tested all possible nodes. The new logic identifies
  the first and last bit set in the node bitmap and uses those end-points
  instead. Note that the logic to find the first and last bits set starts
  with a word-based search (testing for a 64-bit zero value rather than
  testing each individual bit). The net result is a small performance
  improvement.
      bug 2588
      395b5505
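The word-based search described in the commit message can be sketched as follows. This is a minimal illustration, not Slurm's actual bitstring code: the function names and the flat `uint64_t` array layout are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of the word-based scan: skip whole 64-bit words
 * that are zero, then test individual bits only inside the first (or
 * last) non-zero word.  Slurm's real bitstr_t API differs. */

/* Return the index of the first set bit, or -1 if the map is empty. */
static int first_set_bit(const uint64_t *map, size_t nwords)
{
    for (size_t w = 0; w < nwords; w++) {
        if (map[w] == 0)
            continue;                   /* word-at-a-time skip */
        for (int b = 0; b < 64; b++)
            if (map[w] & ((uint64_t)1 << b))
                return (int)(w * 64 + b);
    }
    return -1;
}

/* Return the index of the last set bit, or -1 if the map is empty. */
static int last_set_bit(const uint64_t *map, size_t nwords)
{
    for (size_t w = nwords; w-- > 0; ) {
        if (map[w] == 0)
            continue;
        for (int b = 63; b >= 0; b--)
            if (map[w] & ((uint64_t)1 << b))
                return (int)(w * 64 + b);
    }
    return -1;
}
```

With these end-points, the caller's loop runs from `first_set_bit()` to `last_set_bit()` instead of over every possible node index, which is where the small performance gain comes from.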
    • Tim Wickberg's avatar
      Fix three typos. · e6e87c92
      Tim Wickberg authored
      e6e87c92
    • Morris Jette's avatar
      burst_buffer/cray fix for pre_run fail · 8f667db4
      Morris Jette authored
      burst_buffer/cray - Decrement job's prolog_running counter if pre_run fails.
      bug 2621
      8f667db4
    • Morris Jette's avatar
      Reset job's prolog_running counter · f3f41e10
      Morris Jette authored
      If a job is no longer in configuring state, then clear the prolog_running
        counter on slurmctld restart or reconfigure.
      bug 2621
      f3f41e10
  2. Apr 09, 2016
    • Morris Jette's avatar
      Fix for commit e62a9270 · 06776b12
      Morris Jette authored
For the case where a job can't start and there are no running jobs
to remove in order to establish an estimated start time.
      06776b12
    • Morris Jette's avatar
      backfill scheduling enhancement · e62a9270
      Morris Jette authored
When determining when a pending job will be able to start, rather
  than removing each running job one at a time and retesting whether
  the pending job can be scheduled, remove multiple jobs that all end
  at about the same time before testing. This reduces the number of
  calls to the job placement logic, which is time consuming.
      e62a9270
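The batching idea in the commit above can be illustrated with a small sketch. The function name, the sorted end-time array, and the grouping window are all assumptions for this example, not Slurm's actual backfill code: jobs whose end times fall within the same window are removed together, so the expensive placement test runs once per batch instead of once per job.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Count how many placement tests are needed when running jobs whose
 * end times fall within `window` seconds of each other are removed as
 * one batch.  `end_times` must be sorted ascending. */
static int placement_tests_needed(const time_t *end_times, size_t n,
                                  time_t window)
{
    if (n == 0)
        return 0;
    int tests = 1;                       /* first batch */
    time_t batch_start = end_times[0];
    for (size_t i = 1; i < n; i++) {
        if (end_times[i] - batch_start > window) {
            tests++;                     /* new batch => one more test */
            batch_start = end_times[i];
        }
    }
    return tests;
}
```

For six jobs ending at roughly three distinct times, this needs three placement tests rather than six, which is the saving the commit describes.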
  3. Apr 08, 2016
  4. Apr 07, 2016
  5. Apr 06, 2016
  6. Apr 05, 2016
    • Morris Jette's avatar
      Fix backfill scheduler race condition · d8b18ff8
      Morris Jette authored
      Fix backfill scheduler race condition that could cause invalid pointer in
          select/cons_res plugin. Bug introduced in 15.08.9, commit:
          efd9d35e
      
The scenario is as follows:
      1. Backfill scheduler is running, then releases locks
      2. Main scheduling loop starts a job "A"
3. Backfill scheduler resumes, finds job "A" in its queue and
   resets its partition pointer.
      4. Job "A" completes and tries to remove resource allocation record
         from select/cons_res data structure, but fails to find it because
         it is looking in the table for the wrong partition.
      5. Job "A" record gets purged from slurmctld
      6. Select/cons_res plugin attempts to operate on resource allocation
         data structure, finds pointer into the now purged data structure
         of job "A" and aborts or gets SEGV
      Bug 2603
      d8b18ff8
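The hazard in steps 1-3 above is acting on a record cached from before the locks were dropped. A minimal sketch of the defensive pattern, with hypothetical names (not slurmctld's actual structures): after reacquiring the locks, re-check the job's state before writing to the cached record, since the main scheduler may have started the job in the meantime.

```c
#include <assert.h>
#include <string.h>

/* Illustrative job record; fields and names are hypothetical. */
enum job_state { PENDING, RUNNING, COMPLETE };

struct job_record {
    int job_id;
    enum job_state state;
    const char *partition;
};

/* Called by the backfill thread after it reacquires the locks.
 * Returns 0 if the update was applied, -1 if the cached entry was
 * stale (the job already started) and was left untouched. */
static int backfill_update_partition(struct job_record *cached,
                                     const char *new_part)
{
    if (cached->state != PENDING)
        return -1;              /* stale entry: skip, don't clobber */
    cached->partition = new_part;
    return 0;
}
```

Skipping the stale entry prevents the later lookup in step 4 from searching the wrong partition's table.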
    • Danny Auble's avatar
      Remove debug from commit 921c59e4 · 24566dd7
      Danny Auble authored
      24566dd7
  7. Apr 04, 2016
  8. Apr 02, 2016
  9. Apr 01, 2016
  10. Mar 31, 2016
  11. Mar 30, 2016
  12. Mar 28, 2016
    • Morris Jette's avatar
      task/cgroup - Fix task binding to CPUs bug · ddf6d9a4
      Morris Jette authored
There was a subtle bug in how tasks were bound to CPUs which could result
in an "infinite loop" error. The problem was that various socket/core/thread
calculations were based upon the resources allocated to a step rather than
all resources on the node, and rounding errors could occur. Consider for
      example a node with 2 sockets, 6 cores per socket and 2 threads per core.
      On the idle node, a job requesting 14 CPUs is submitted. That job would
be allocated 4 cores on the first socket and 3 cores on the second socket.
      The old logic would get the number of sockets for the job at 2 and the
      number of cores at 7, then calculate the number of cores per socket at
7/2 or 3 (rounding down to an integer). The logic laying out tasks
      would bind the first 3 cores on each socket to the job then not find any
      remaining cores, report the "infinite loop" error to the user, and run
      the job without one of the expected cores. The problem gets even worse
      when there are some allocated cores on a node. In a more extreme case,
      a job might be allocated 6 cores on one socket and 1 core on a second
      socket. In that case, 3 of that job's cores would be unused.
      bug 2502
      ddf6d9a4
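The rounding error from the commit message can be worked through in a few lines. This is a hypothetical simplification of the old math, not the actual task/cgroup code: averaging allocated cores across sockets with integer division, then binding at most that many cores per socket, loses cores whenever the allocation is uneven.

```c
#include <assert.h>

/* Simplified model of the old per-socket averaging.  `alloc[s]` is
 * the number of cores actually allocated to the job on socket s.
 * Returns how many cores the old logic would bind. */
static int cores_bound_old_logic(const int *alloc, int sockets)
{
    int total = 0;
    for (int s = 0; s < sockets; s++)
        total += alloc[s];

    /* Old logic: average cores over sockets, rounding down. */
    int per_socket = total / sockets;   /* e.g. 7 / 2 == 3 */

    int bound = 0;
    for (int s = 0; s < sockets; s++)   /* bind at most per_socket
                                         * cores on each socket */
        bound += (alloc[s] < per_socket) ? alloc[s] : per_socket;
    return bound;
}
```

For the 4+3 allocation this binds only 6 of 7 cores (one lost), and for the more extreme 6+1 allocation only 4 of 7 (three unused), matching both cases described above.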
    • Morris Jette's avatar
      Fix for srun signal handling threading problem · c8d36dba
      Morris Jette authored
      This is a revision to commit 1ed38f26
The root problem is that a pthread is passed an argument which is
a pointer to a variable on the stack. If that variable is over-written,
the signal number received will be garbage, and that bad signal
number will be interpreted by srun to possibly abort the request.
      c8d36dba
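The general hazard described above, and one common fix, can be sketched as follows. This is not srun's actual code; the function names are illustrative. Passing the address of a stack variable to a thread is unsafe if the variable can be overwritten before the thread reads it, so each thread gets its own heap-allocated copy of the value.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static int g_last_sig;              /* signal observed by the thread */

static void *sig_handler_thread(void *arg)
{
    int sig = *(int *)arg;          /* safe: arg is our private copy */
    free(arg);
    g_last_sig = sig;
    return NULL;
}

/* Spawn the handler thread with a heap-allocated copy of signo, so
 * the value cannot be clobbered when the caller's stack changes. */
static int deliver_signal(int signo)
{
    int *copy = malloc(sizeof(*copy));
    pthread_t tid;

    if (copy == NULL)
        return -1;
    *copy = signo;
    if (pthread_create(&tid, NULL, sig_handler_thread, copy) != 0) {
        free(copy);
        return -1;
    }
    return pthread_join(tid, NULL); /* 0 on success */
}
```

The thread, not the spawner, frees the copy, so the value stays valid however long the thread takes to run.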
  13. Mar 26, 2016
    • Morris Jette's avatar
      Revert commit efa83a02 · c1dde86c
      Morris Jette authored
      The previous commit obviously fixed a problem, but introduced a different
      set of problems. This will be pursued later, perhaps in version 16.05.
      c1dde86c
  14. Mar 25, 2016
    • Morris Jette's avatar
      Revert commit 6c14b969 · f5920b77
      Morris Jette authored
      With some configurations and systems, errors of the following sort were
occurring:
      task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
      task/cgroup: task[1] unable to set taskset '0x0'
      f5920b77
    • Morris Jette's avatar
      burst_buffer/cray - pre-run fail fix · 5a48207e
      Morris Jette authored
      burst_buffer/cray - If the pre-run operation fails then don't issue
          duplicate job cancel/requeue unless the job is still in run state. Prevents
          jobs hung in COMPLETING state.
      bug 2587
      5a48207e
  15. Mar 24, 2016
    • Morris Jette's avatar
      Select/cray - Log NHC run time on "scontrol reconfig" · 58627d02
      Morris Jette authored
      Running "scontrol reconfig" releases resources for jobs waiting for
        the completion of Node Health Check so that other jobs can run.
        Cray says to always wait for NHC to complete, but in extreme
        cases that can be 2 hours, during which the entire resource
        allocation for a job may be unusable. Per advice from NERSC,
        the logic to release resources is unchanged, but logging is
        added here.
      58627d02