  1. Jan 12, 2011
  2. Jan 11, 2011
  3. Jan 10, 2011
    • 31776ea0
    • -- Add scancel --reservation option to cancel all jobs associated with a specific reservation. · 2aba39da
      Moe Jette authored
      2aba39da
    • fec68ac8
    • Patch from Gerrit: 01_salloc-Bug-Fix-nested-terminal-foreground-process.diff · 69e1d108
      Moe Jette authored
      salloc: notify terminal foreground process
      
      This fixes another bug observed in salloc child process cleanup. I found
      that some shells, e.g. zsh, do not forward all signals to their children.
      
      The patch addresses the case where
       * command_pid is still active but does not equal tpgid,
       * tpgid is not the same as salloc's process group,
       * tpgid is very unlikely to come from another process, since we block
         the suspend/TSTP signal, and
       * signalling command_pid does not automatically imply that the active
         terminal foreground process is also signalled.
      In that case salloc sends a HUP to the terminal foreground process to
      signify "death of controlling process".
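
The conditions listed above can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the helper names fg_group_needs_hup and notify_fg_group are hypothetical, and the sketch assumes the foreground process group is looked up with tcgetpgrp() and signalled with kill() on the negated group ID.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: true when the terminal foreground process group
 * is neither the spawned command nor our own process group, so that
 * signalling command_pid alone would not reach it. */
bool fg_group_needs_hup(pid_t tpgid, pid_t command_pid, pid_t own_pgid)
{
	return tpgid > 0 && tpgid != command_pid && tpgid != own_pgid;
}

/* Hypothetical helper: send SIGHUP to the foreground process group of
 * the terminal open on tty_fd, if the check above says it is needed.
 * Returns 1 if a signal was sent, 0 if none was needed, -1 on error. */
int notify_fg_group(int tty_fd, pid_t command_pid)
{
	pid_t tpgid;

	assert(command_pid > 0);

	tpgid = tcgetpgrp(tty_fd);	/* foreground process group of the tty */
	if (tpgid == (pid_t) -1)
		return -1;		/* not a tty, or no controlling terminal */

	if (!fg_group_needs_hup(tpgid, command_pid, getpgrp()))
		return 0;

	/* A negative pid addresses the whole process group. */
	return kill(-tpgid, SIGHUP) == 0 ? 1 : -1;
}
```

With the values from the listing below (salloc 21260, command zsh 21266, tpgid 21597), the check is true, so the aprun group would receive the HUP.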
      
      This setup fixed the problem on zsh. I then went and tested a more complex setup:
      
      Before:
      -------
      palu2:0 ~>ps  f -o pid,pgid,tpgid,ppid,stat,tty,cmd
        PID  PGID TPGID  PPID STAT TT       CMD
      21117 21117 21597 21116 Ss   pts/9    -bash
      21260 21260 21597 21117 Sl   pts/9     \_ ./slurm_build/git/src/salloc/salloc -v --time=00:01:00 -N17 zsh
      21266 21266 21597 21260 S    pts/9         \_ zsh
      21323 21323 21597 21266 S    pts/9             \_ /bin/bash
      21397 21397 21597 21323 S    pts/9                 \_ -bin/tcsh
      21526 21526 21597 21397 S    pts/9                     \_ /bin/sh
      21597 21597 21597 21526 S+   pts/9                         \_ aprun -N1 -n17 sleep 12345
      21601 21597 21597 21597 S+   pts/9                             \_ aprun -N1 -n17 sleep 12345
      
      After the timeout:
      ------------------
      palu2:0 ~>ps  f -o pid,pgid,tpgid,ppid,stat,tty,cmd
        PID  PGID TPGID  PPID STAT TT       CMD
      21323 21323 21117     1 S    pts/9    /bin/bash
      21397 21397 21117 21323 S    pts/9     \_ -bin/tcsh
      21526 21526 21117 21397 S    pts/9         \_ /bin/sh
      
      ==> The 'dangerous' aprun terminal foreground process group 21597 has been removed, while the child
          subprocess groups 21323, 21397, and 21526 now exist as orphaned groups, to be cleaned up by init.
      69e1d108
  4. Jan 07, 2011
  5. Jan 06, 2011
  6. Jan 03, 2011
  7. Dec 29, 2010
  8. Dec 28, 2010