    Change Cray mpi_fini failure logic · 5f89223f
    Morris Jette authored
    Treat a Cray MPI task that calls exit() without mpi_fini() as a fatal
    error for that specific task only, and let srun handle all timeout logic.
    The previous logic cancelled the entire job step, so the srun options
    for wait time and kill on exit were ignored. The new logic gives
    users the following type of response:
    
    $ srun -n3 -K0 -N3 --wait=60 ./tmp
    Task:0 Cycle:1
    Task:2 Cycle:1
    Task:1 Cycle:1
    Task:0 Cycle:2
    Task:2 Cycle:2
    slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
    srun: error: tux2: task 1: Killed
    Task:0 Cycle:3
    Task:2 Cycle:3
    Task:0 Cycle:4
    ...
    
    bug 1171
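
    The difference between the previous and the new behavior can be pictured with a toy model. This is only an illustrative sketch, not Slurm's code: the function name `tasks_to_kill` and its parameters are hypothetical, and the model assumes a single task exits without calling mpi_fini().

```python
def tasks_to_kill(task_ids, offender, cancel_whole_step):
    """Return the task IDs terminated when `offender` exits without mpi_fini().

    Hypothetical model of the behavior described in the commit message;
    it does not reflect the actual slurmstepd implementation.
    """
    if cancel_whole_step:
        # Previous logic as described: the entire job step is cancelled.
        return sorted(task_ids)
    # New logic as described: only the offending task is treated as failed.
    # What happens to the remaining tasks is left to srun's own options.
    return [offender]


tasks = [0, 1, 2]
print(tasks_to_kill(tasks, offender=1, cancel_whole_step=True))
print(tasks_to_kill(tasks, offender=1, cancel_whole_step=False))
```

    In the transcript above, tasks 0 and 2 keep cycling after task 1 is killed, which corresponds to the second call in this sketch.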