Commit 58c12f7e authored by David Bigagli

Temporary fix. If slurmstepd dies do not hang srun but abort the entire parallel job.
parent 5759a1d1
@@ -1788,6 +1788,16 @@ step_launch_notify_io_failure(step_launch_state_t *sls, int node_id)
node_id);
sls->abort = true;
pthread_cond_broadcast(&sls->cond);
} else {
/* FIXME
* If stepd dies or we see I/O error with stepd.
* Do not abort the whole job but collect all
* tasks on the node just like if they exited.
*/
error("%s: aborting, io error with slurmstepd on node %d",
__func__, node_id);
sls->abort = true;
pthread_cond_broadcast(&sls->cond);
}
pthread_mutex_unlock(&sls->lock);
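
The added branch sets an abort flag and broadcasts on a condition variable while holding the lock, which is what keeps a waiting thread from hanging. The following is a minimal, standalone sketch of that pattern and is not Slurm code. `launch_state_t`, `notify_io_failure`, `wait_for_completion`, and `run_demo` are hypothetical names, and the waiter loop is an assumption about how a consumer of `sls->abort` might look, since the diff does not show that side.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, reduced stand-in for step_launch_state_t. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool abort;        /* set when the whole job must be aborted */
    int tasks_exited;  /* tasks that have reported an exit so far */
    int tasks_total;   /* tasks the waiter expects to finish */
} launch_state_t;

static void launch_state_init(launch_state_t *s, int tasks_total)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->abort = false;
    s->tasks_exited = 0;
    s->tasks_total = tasks_total;
}

/* Same shape as the new else-branch: log, set the flag, and wake
 * every waiter, all while holding the lock. */
static void notify_io_failure(launch_state_t *s, int node_id)
{
    pthread_mutex_lock(&s->lock);
    fprintf(stderr, "%s: aborting, io error on node %d\n",
            __func__, node_id);
    s->abort = true;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Assumed waiter: returns 0 if all tasks exited, -1 if aborted.
 * Without the abort flag this loop would block forever when a
 * node's tasks never report an exit. */
static int wait_for_completion(launch_state_t *s)
{
    int rc;

    pthread_mutex_lock(&s->lock);
    while (!s->abort && s->tasks_exited < s->tasks_total)
        pthread_cond_wait(&s->cond, &s->lock);
    rc = s->abort ? -1 : 0;
    pthread_mutex_unlock(&s->lock);
    return rc;
}

static void *failure_thread(void *arg)
{
    notify_io_failure((launch_state_t *)arg, 3);
    return NULL;
}

/* Starts a thread that reports an I/O failure and waits for the
 * outcome. Returns the waiter's result, or -2 if the thread could
 * not be created. */
static int run_demo(void)
{
    launch_state_t s;
    pthread_t tid;
    int rc;

    launch_state_init(&s, 4);
    if (pthread_create(&tid, NULL, failure_thread, &s) != 0)
        return -2;
    rc = wait_for_completion(&s);
    pthread_join(tid, NULL);
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.lock);
    return rc;
}
```

Because the waiter re-checks the flag under the same lock, the result does not depend on whether the failure is reported before or after the waiter starts waiting.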