- Jan 12, 2011
-
-
Joseph P. Donaghy authored
-
Moe Jette authored
until the job is released. Patch from Rod Schultz, Bull.
-
- Jan 11, 2011
-
-
Moe Jette authored
attempts from 60 seconds to 29 seconds. This should eliminate a possible synchronization problem with gang scheduling that could result in job step creation requests only occurring when a job is suspended.
-
Danny Auble authored
BLUEGENE - better checking of small blocks in dynamic mode to determine whether a full midplane job could run.
-
Moe Jette authored
running to be more clear and only print when --verbose option is used.
-
Moe Jette authored
-
Moe Jette authored
-
Danny Auble authored
-
Moe Jette authored
its partition is configured "Shared=EXCLUSIVE" (which is redundant).
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
after a job completed.
-
Moe Jette authored
-
- Jan 10, 2011
-
-
Moe Jette authored
size get queued).
-
Moe Jette authored
specific reservation.
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
salloc: notify terminal foreground process

This fixes another bug observed in salloc child process cleanup. I found that some shells, e.g. zsh, do not forward all signals to their children.

The patch fixes the problem that
 * command_pid is still active but does not equal tpgid,
 * tpgid is not the same as salloc's process group,
 * tpgid is very unlikely to come from another process, since we block the suspend/TSTP signal,
 * signalling command_pid does not automatically imply that the active terminal foreground process is also signalled,
 * hence send a HUP to signify "death of controlling process".

This setup fixed the problem on zsh. I then went and tested a more complex setup:

Before:
-------
    palu2:0 ~>ps f -o pid,pgid,tpgid,ppid,stat,tty,cmd
      PID  PGID TPGID  PPID STAT TT     CMD
    21117 21117 21597 21116 Ss   pts/9  -bash
    21260 21260 21597 21117 Sl   pts/9   \_ ./slurm_build/git/src/salloc/salloc -v --time=00:01:00 -N17 zsh
    21266 21266 21597 21260 S    pts/9       \_ zsh
    21323 21323 21597 21266 S    pts/9           \_ /bin/bash
    21397 21397 21597 21323 S    pts/9               \_ -bin/tcsh
    21526 21526 21597 21397 S    pts/9                   \_ /bin/sh
    21597 21597 21597 21526 S+   pts/9                       \_ aprun -N1 -n17 sleep 12345
    21601 21597 21597 21597 S+   pts/9                           \_ aprun -N1 -n17 sleep 12345

After the timeout:
------------------
    palu2:0 ~>ps f -o pid,pgid,tpgid,ppid,stat,tty,cmd
      PID  PGID TPGID  PPID STAT TT     CMD
    21323 21323 21117     1 S    pts/9  /bin/bash
    21397 21397 21117 21323 S    pts/9   \_ -bin/tcsh
    21526 21526 21117 21397 S    pts/9       \_ /bin/sh

==> The 'dangerous' aprun terminal foreground process group 21597 has been removed, while the child subprocess groups 21323, 21397, and 21526 now exist as orphaned groups, to be cleaned up by init.

01_salloc-Bug-Fix-nested-terminal-foreground-process.diff
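
The conditions listed in the commit message above can be illustrated with a minimal C sketch. This is not the actual salloc patch: the function names `needs_foreground_hup` and `hup_foreground_group` and the `tty_fd` parameter are hypothetical, and only the standard POSIX calls `tcgetpgrp()`, `getpgrp()` and `kill()` are assumed. It shows one way to decide whether the terminal foreground process group needs its own SIGHUP in addition to the signal sent to `command_pid`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the SLURM source): returns true when the
 * terminal foreground process group should receive its own SIGHUP. */
bool needs_foreground_hup(pid_t tpgid, pid_t command_pid, pid_t own_pgid)
{
	if (tpgid <= 0)           /* no foreground group could be determined */
		return false;
	if (tpgid == command_pid) /* signalling command_pid already covers it */
		return false;
	if (tpgid == own_pgid)    /* never hang up our own process group */
		return false;
	return true;
}

/* Hypothetical helper: look up the foreground process group of the terminal
 * open on tty_fd and, if the checks above pass, send SIGHUP to that whole
 * group. Returns 0 if nothing had to be done or the signal was sent,
 * -1 if kill() failed. */
int hup_foreground_group(int tty_fd, pid_t command_pid)
{
	pid_t tpgid = tcgetpgrp(tty_fd);

	if (!needs_foreground_hup(tpgid, command_pid, getpgrp()))
		return 0;
	/* A negative pid makes kill() address the process group. */
	return kill(-tpgid, SIGHUP);
}
```

With the values from the "Before" listing (foreground group 21597, the zsh child of salloc as 21266, salloc's own group 21260), the check reports that the aprun foreground group needs a separate HUP.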
-
- Jan 07, 2011
- Jan 06, 2011
- Jan 03, 2011
-
-
Moe Jette authored
-
Danny Auble authored
-
- Dec 29, 2010
-
-
Moe Jette authored
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
- Dec 28, 2010
-
-
Moe Jette authored
-
Don Lipari authored
-
Moe Jette authored
-
Don Lipari authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-