  1. Jun 19, 2013
  2. Jun 18, 2013
  3. Jun 05, 2013
  4. May 24, 2013
  5. May 23, 2013
  6. May 21, 2013
  7. May 15, 2013
  8. May 10, 2013
  9. Apr 24, 2013
  10. Jan 29, 2013
  11. Nov 26, 2012
  12. Nov 21, 2012
    • slurmstepd: correct a bug in the IO thread termination monitoring · f297242e
      Matthieu Hautreux authored
      A dedicated thread (_kill_thr) is launched by slurmstepd at the end of a
      step in order to destroy the IO thread if it does not manage to correctly
      terminate by itself after 300 seconds.
      
      Two bugs are corrected in this logic by this patch.
      
      First, the sleep(300) call is not protected against interruptions: if
      slurmstepd receives a signal, the delay can shrink to a few seconds,
      forcing the IO thread to terminate before the grace time expires. The
      logic is modified to ensure that the full delay is respected by
      looping around the sleep().
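      The loop around sleep() can be sketched as follows. This is a minimal
      illustration, not the code from the patch: sleep_full() is a
      hypothetical helper name, and it relies only on the documented behavior
      that sleep() returns the number of unslept seconds when a signal
      handler interrupts it.

```c
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the Slurm sources): sleep for the
 * whole delay even if signal handlers interrupt sleep().
 * Returns the number of times the sleep was interrupted.
 */
static int sleep_full(unsigned int seconds)
{
	unsigned int remaining = seconds;
	int interruptions = 0;

	while (remaining > 0) {
		/* sleep() returns the unslept seconds if interrupted, else 0 */
		remaining = sleep(remaining);
		if (remaining > 0)
			interruptions++;
	}
	return interruptions;
}
```

      With this shape, a call such as sleep_full(300) only returns once the
      remaining time reported by sleep() has reached zero, so a stray signal
      no longer shortens the grace period to a few seconds.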
      
      Second, to terminate the IO thread, a SIGKILL is delivered to it using
      pthread_kill(). However, sending SIGKILL with pthread_kill() is a
      process-wide operation (see man pthread_kill), so all the slurmstepd
      threads are killed and slurmstepd is terminated. This logic is modified
      to use pthread_cancel() instead of pthread_kill(), giving the
      pthread_join() in _wait_for_io() a chance to act as expected.
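      The difference can be illustrated with a small sketch. It is an
      assumption-laden example, not the slurmstepd code: stuck_io_thread(),
      cancel_and_join() and demo_cancel() are hypothetical names, and the
      cancel and the join are done in one function here, whereas the commit
      describes them as happening in _kill_thr and _wait_for_io()
      respectively.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for an IO thread that never finishes by itself. */
static void *stuck_io_thread(void *arg)
{
	(void) arg;
	for (;;)
		sleep(1);	/* sleep() is a cancellation point */
	return NULL;
}

/*
 * Cancel one thread and join it. Unlike pthread_kill(tid, SIGKILL),
 * this terminates only the target thread, not the whole process.
 * Returns 1 if the join reports that the thread was cancelled.
 */
static int cancel_and_join(pthread_t tid)
{
	void *result = NULL;

	if (pthread_cancel(tid) != 0)
		return 0;
	if (pthread_join(tid, &result) != 0)
		return 0;
	return result == PTHREAD_CANCELED;
}

/* Start the stuck thread, then cancel and join it. */
static int demo_cancel(void)
{
	pthread_t tid;

	if (pthread_create(&tid, NULL, stuck_io_thread, NULL) != 0)
		return -1;
	return cancel_and_join(tid);
}
```

      The sketch assumes the default deferred cancellation type, so the
      target thread exits at its next cancellation point and the caller
      keeps running after the join.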
      
      Without this patch, when _kill_thr is interrupted, slurmstepd is
      terminated, leaving the step in an incomplete state, as the node may not
      have been able to send the REQUEST_STEP_COMPLETE to the controller.
      As a result, subsequent steps can no longer be executed and stay
      permanently in the "Job step creation temporarily disabled, retrying"
      state.
  13. Nov 07, 2012
    • Modify default log timestamp to conform to RFC 5424 format · 4b941731
      Janne Blomqvist authored
      The attached patch changes the default timestamp format in logfiles to conform to RFC 5424 (the current version of the syslog RFC). It is identical to the current default "ISO 8601" timestamp used by slurm, except that the timezone offset is appended. This has the following benefits:
      
      1) It's unambiguous.
      
      2) Avoids potential confusion for admins running cluster(s) in different timezones.
      
      3) Might help debug issues related to DST transitions. (More on that later..)
      
      (To be pedantic, an RFC 5424 timestamp is still a valid ISO 8601 timestamp, but the converse is not necessarily true. RFC 3339 is a "profile" of ISO 8601, that is, a subset recommended for internet protocols. The RFC 5424 timestamp, in turn, is a subset of the RFC 3339 timestamps.)
      
      The previous behavior can be restored by running configure with the
      
      --disable-rfc5424time
      
      flag.
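      As an illustration of the format described above, the following sketch
      builds an ISO 8601 style timestamp with the timezone offset appended in
      the +hh:mm form that RFC 5424 uses. It is not the code from the patch:
      format_rfc5424() is a hypothetical helper, the optional fractional
      seconds are omitted, and the sketch assumes that strftime() expands %z
      to a five-character +hhmm offset.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not taken from the Slurm sources): write a
 * timestamp such as 2012-11-07T14:03:05+02:00 into buf.
 * Returns 0 on success, -1 on failure.
 */
static int format_rfc5424(const struct tm *tm, char *buf, size_t len)
{
	char date[32];
	char zone[8];
	int n;

	if (strftime(date, sizeof(date), "%Y-%m-%dT%H:%M:%S", tm) == 0)
		return -1;
	/* %z yields +hhmm; the colon has to be inserted by hand */
	if (strftime(zone, sizeof(zone), "%z", tm) != 5)
		return -1;
	n = snprintf(buf, len, "%s%.3s:%s", date, zone, zone + 3);
	if (n < 0 || (size_t) n >= len)
		return -1;
	return 0;
}
```

      Passing the result of localtime() gives the local offset, which is what
      makes the log lines unambiguous across timezones and DST transitions.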
  14. Oct 22, 2012
  15. Oct 16, 2012
  16. Oct 15, 2012
  17. Jul 16, 2012
  18. Jul 13, 2012
  19. May 29, 2012
  20. May 11, 2012
  21. May 07, 2012
  22. May 05, 2012
  23. Apr 27, 2012
  24. Mar 26, 2012
  25. Mar 22, 2012
  26. Feb 04, 2012
    • Add call to MPI plugin · 231c927c
      Morris Jette authored
      Add call to mpi_hook_slurmstepd_prefork() from slurmstepd
      immediately prior to fork/exec of user tasks.
      Patch from Hongjia Cao, NUDT.
  27. Jan 19, 2012
  28. Jan 18, 2012
  29. Jan 17, 2012
  30. Jan 13, 2012
    • slurmstepd: unblock all signals before invoking user job · 06047590
      Mark A. Grondona authored
      It was found that slurmstepd was intermittently leaving SIGPIPE
      blocked when launching user tasks. This may have something to do
      with the fact that the xsignal_unblock() call in _fork_all_tasks()
      is referencing an extern array (nominally this should have unblocked
      SIGPIPE), but I didn't spend the time to fully track this issue
      down. Instead, I figured there is probably no reason we would _not_
      want to unblock *all* signals, so this patch does that.
      
      Before this change, the following program fails every once in a while:
      
       #include <stdio.h>
       #include <signal.h>
      
       int main (int ac, char **av)
       {
      	int i, rc = 0;
      	struct sigaction act;
      	for (i = 1; i < SIGRTMAX; i++) {
      		sigaction (i, NULL, &act);
      		if (act.sa_handler == SIG_DFL)
      			continue;
      		fprintf (stderr, "Signal %d appears to be ignored!\n", i);
      		rc = 1;
      	}
      	return (rc);
       }
      
      with:
      
       srun -N1 -n1 ./test
       Signal 13 appears to be ignored!
      
      After the change, the program succeeds.