  1. Jul 26, 2017
  2. Jul 19, 2017
  3. Jul 13, 2017
  4. Feb 09, 2017
  5. Jan 19, 2017
  6. Jan 18, 2017
  7. Jan 11, 2017
    • Fix srun/sattach race condition · 38089f2b
      Morris Jette authored
      The old logic would result in test16.4 failing some of the time.
        The failure was caused by the sattach command attaching to a
        job step before the original srun command received a
        RESPONSE_LAUNCH_TASKS message. That message would then be sent
        to the salloc command. Since srun never got the message, it
        would hang. This change does not mark the job step as RUNNING
        until after the original srun gets sent the RESPONSE_LAUNCH_TASKS
        message and sattach requests are blocked until that time.
      38089f2b
    • Improve an error message · a82a70dc
      Morris Jette authored
      Identify the function where an error is generated.
      a82a70dc
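The ordering described in the srun/sattach fix (38089f2b) can be pictured with a small state gate. This is only an illustrative sketch, not Slurm code: every name below is hypothetical, and "blocked" is modeled as a refusal that the caller would retry, which is an assumption about how the waiting might be expressed.

```c
#include <stdbool.h>

/* Hypothetical names throughout; none of these are Slurm identifiers. */
enum ex_step_state { EX_STEP_STARTING, EX_STEP_RUNNING };

struct ex_step {
    enum ex_step_state state;
};

/* A new step starts out not yet RUNNING. */
void ex_step_init(struct ex_step *step)
{
    step->state = EX_STEP_STARTING;
}

/* Call only after the launch response has been sent to the original
 * launcher. The step becomes RUNNING at this point and not before. */
void ex_launch_response_sent(struct ex_step *step)
{
    step->state = EX_STEP_RUNNING;
}

/* Attach requests are allowed only once the step is RUNNING.
 * A false return means the request has to wait and try again. */
bool ex_attach_allowed(const struct ex_step *step)
{
    return step->state == EX_STEP_RUNNING;
}
```

With this ordering, an attach request that arrives early sees `EX_STEP_STARTING` and cannot intercept a message meant for the original launcher.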
  8. Dec 06, 2016
  9. Sep 28, 2016
    • Add task launch flag field · 3251406c
      Morris Jette authored
      Add "flag" field to launch_tasks_request_msg. Remove the following fields
          (moved into flags): multi_prog, task_flags, user_managed_io, pty,
          buffered_stdio, and labelio. More flags to be added later.
      3251406c
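Folding several per-option fields into one flags word, as commit 3251406c describes, usually looks like the bitmask sketch below. The macro names, bit values, and helper functions are hypothetical and are not the identifiers used in Slurm. The sketch assumes each option behaves as a simple on/off switch, and it does not model task_flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, one per option moved into the flags word. */
#define EX_LAUNCH_MULTI_PROG      0x0001u
#define EX_LAUNCH_USER_MANAGED_IO 0x0002u
#define EX_LAUNCH_PTY             0x0004u
#define EX_LAUNCH_BUFFERED_STDIO  0x0008u
#define EX_LAUNCH_LABEL_IO        0x0010u

/* Set or clear a single option bit in the flags word. */
void ex_set_flag(uint32_t *flags, uint32_t bit, bool on)
{
    if (on)
        *flags |= bit;
    else
        *flags &= ~bit;
}

/* Report whether an option bit is set. */
bool ex_test_flag(uint32_t flags, uint32_t bit)
{
    return (flags & bit) != 0;
}
```

Adding a new option later then needs only a new bit definition, with no change to the message layout.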
  10. Sep 09, 2016
    • Cap job termination message staggering at 5 seconds · 3e2251cb
      Morris Jette authored
      Previous cap was 2 sec (default TCP timeout) times the node count
        and divided by 1000. A 9000 node job would have the messages
        spread out over 18 seconds. This change caps the spread at
        5 seconds and assumes the normal TCP logic can handle the rest.
      bug 3044
      3e2251cb
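The arithmetic in commit 3e2251cb can be checked with a short sketch. The function and macro names are hypothetical. The only inputs taken from the commit message are the factor of 2 seconds per node divided by 1000 (that is, 2 ms per node) and the new 5 second cap.

```c
#include <stdint.h>

/* Hypothetical constants named for illustration only. */
#define EX_MSEC_PER_NODE   2u     /* 2 s * node_count / 1000, expressed in ms */
#define EX_MAX_SPREAD_MSEC 5000u  /* new upper bound: 5 seconds */

/* Spread under the previous rule, in milliseconds. */
uint64_t ex_old_spread_msec(uint32_t node_count)
{
    return (uint64_t)EX_MSEC_PER_NODE * node_count;
}

/* Spread under the new rule: same formula, capped at 5 seconds. */
uint64_t ex_capped_spread_msec(uint32_t node_count)
{
    uint64_t spread = ex_old_spread_msec(node_count);
    return spread > EX_MAX_SPREAD_MSEC ? EX_MAX_SPREAD_MSEC : spread;
}
```

For the 9000-node job in the commit message, the old rule gives 18000 ms and the capped rule gives 5000 ms. Jobs of 2500 nodes or fewer are unaffected by the cap.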
  11. Sep 08, 2016
  12. Aug 16, 2016
    • slurmstepd to load all plugins at startup · 962f0cce
      Morris Jette authored
      slurmstepd modified to pre-load all relevant plugins at startup to avoid
          the possibility of modified plugins later resulting in inconsistent API
          or data structures and a failure of slurmstepd.
      bug 2334
      962f0cce
  13. Jul 27, 2016
  14. Jul 22, 2016
  15. Jul 15, 2016
    • Various cleanup needed for extern step. Continuation of commit 2fc0c860 · c79063b0
      Danny Auble authored
      This sets the state earlier to match a normal step.
      
      Remove the unneeded _send_pending_exit_msgs.  There is only one task and
      we have the message for it, so don't worry about that one.
      
      Most important, wait for the other slurmstepds to send their message,
      otherwise they could be lost on the other end.
      c79063b0
  16. Jul 08, 2016
  17. Jul 07, 2016
  18. Jul 02, 2016
  19. Jul 01, 2016
  20. May 25, 2016
  21. May 24, 2016
  22. May 11, 2016
  23. May 10, 2016
  24. May 06, 2016
    • Fix for slurmstepd segfault · db0fe22e
      John Thiltges authored
      With slurm-15.08.10, we're seeing occasional segfaults in slurmstepd. The logs point to the following line: slurm-15.08.10/src/slurmd/slurmstepd/mgr.c:2612
      
      On that line, _get_primary_group() is accessing the results of getpwnam_r():
          *gid = pwd0->pw_gid;
      
      If getpwnam_r() cannot find a matching password record, it will set the result (pwd0) to NULL, but still return 0. When the pointer is accessed, it will cause a segfault.
      
      Checking the result variable (pwd0) to determine success should fix the issue.
      db0fe22e
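The check described in commit db0fe22e follows from how POSIX defines getpwnam_r(): when no record matches, it returns 0 and stores NULL in the result pointer. A minimal sketch of a lookup that tests both values is shown below. The function name and the fixed buffer size are assumptions made for illustration, and this is not the body of _get_primary_group().

```c
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: fetch a user's primary group ID.
 * Returns 0 on success, -1 if the lookup fails or no record matches.
 * *gid is written only on success. */
int ex_lookup_primary_gid(const char *user, gid_t *gid)
{
    struct passwd pwd;
    struct passwd *result = NULL;
    char buf[16384];  /* fixed scratch buffer, chosen here for brevity */

    int rc = getpwnam_r(user, &pwd, buf, sizeof(buf), &result);

    /* rc == 0 does not imply a match: result is NULL when no entry
     * exists, so both values must be checked before dereferencing. */
    if (rc != 0 || result == NULL)
        return -1;

    *gid = result->pw_gid;
    return 0;
}
```

Testing only the return code would reproduce the crash from the commit message, because the "not found" case also returns 0.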
  25. May 05, 2016
  26. Apr 02, 2016
  27. Mar 29, 2016
  28. Mar 28, 2016
  29. Mar 03, 2016
  30. Mar 01, 2016
    • Defer suspend until launch completes · d2cd18d1
      Morris Jette authored
      This fixes a bug introduced in commit 52fe3de1
      in the event the fork() call fails in slurmstepd.
      d2cd18d1
    • Defer suspend until launch completes · 52fe3de1
      Morris Jette authored
      Ensure that a job is completely launched before trying to suspend it.
      Previous logic would start suspend logic early in the life of the
      slurmstepd process, after its listening socket was open but before
      the tasks were launched. This defers the suspend logic until after
      all prologs and setup complete and the tasks are launched. This is
      important in the case of gang scheduling, in which newly launched
      jobs can be immediately suspended.
      bug 2494
      52fe3de1