From 2929a2895457f4f60eab29e32ab5d5b0e035d6df Mon Sep 17 00:00:00 2001
From: Morris Jette <jette@schedmd.com>
Date: Tue, 20 Oct 2015 08:27:01 -0700
Subject: [PATCH] Warn of risks in job suspend/resume

bug 2031
---
 doc/html/gang_scheduling.shtml | 10 +++++++++-
 doc/man/man1/scontrol.1        | 12 +++++++++++-
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/doc/html/gang_scheduling.shtml b/doc/html/gang_scheduling.shtml
index e3856ced55e..fb6ef72e5a6 100644
--- a/doc/html/gang_scheduling.shtml
+++ b/doc/html/gang_scheduling.shtml
@@ -500,6 +500,14 @@ Note that <I>CR_Core_Memory</I> supports CPU binding, while
 <I>CR_CPU_Memory</I> does not.
 </P>
 
-<p style="text-align:center;">Last modified 24 February 2014</p>
+<P>Note that manually suspending a job (i.e. "scontrol suspend ...") releases
+its CPUs for allocation to other jobs.
+Resuming a previously suspended job may result in multiple jobs being
+allocated the same CPUs, which could trigger gang scheduling of jobs.
+Use of the scancel command to send SIGSTOP and SIGCONT signals would stop a
+job without releasing its CPUs for allocation to other jobs and would be a
+preferable mechanism in many cases.</P>
+
+<p style="text-align:center;">Last modified 20 October 2015</p>
 
 <!--#include virtual="footer.txt"-->
diff --git a/doc/man/man1/scontrol.1 b/doc/man/man1/scontrol.1
index abe90ba14b4..3060f74e7af 100644
--- a/doc/man/man1/scontrol.1
+++ b/doc/man/man1/scontrol.1
@@ -295,7 +295,7 @@ The job_list argument is a comma separated list of job IDs.
 A held job can be released using scontrol to reset its priority (e.g.
 "scontrol release <job_id>"). The command accepts the following option:
 .RS
-.TP 12
+.TP
 \fIState=SpecialExit\fP
 The "SpecialExit" keyword specifies that the job has to be put in a
 special state \fBJOB_SPECIAL_EXIT\fP.
@@ -303,11 +303,21 @@ The "scontrol show job" command will display the JobState as
 \fBSPECIAL_EXIT\fP, while the "squeue" command as \fBSE\fP.
 .RE
 
+.TP
 \fBresume\fP \fIjob_list\fP
 Resume a previously suspended job.
 The job_list argument is a comma separated list of job IDs.
 Also see \fBsuspend\fR.
+\fBNOTE:\fR A suspended job releases its CPUs for allocation to other jobs.
+Resuming a previously suspended job may result in multiple jobs being
+allocated the same CPUs, which could trigger gang scheduling with some
+configurations or severe degradation in performance with other configurations.
+Use of the scancel command to send SIGSTOP and SIGCONT signals would stop a
+job without releasing its CPUs for allocation to other jobs and would be a
+preferable mechanism in many cases.
+Use with caution.
+
 
 .TP
 \fBschedloglevel\fP \fILEVEL\fP
 Enable or disable scheduler logging.
-- 
GitLab
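
The two mechanisms contrasted in the added NOTE can be sketched as shell commands. This is a minimal dry-run sketch, not part of the patch: the helper function names and the job ID 1234 are hypothetical, and each helper only prints the command it stands for. It assumes scancel's --signal option accepts the signal names STOP and CONT; which of a job's processes receive the signal depends on other scancel options and on the site configuration.

```shell
#!/bin/sh
# Dry-run helpers (hypothetical names): each prints a command, it does not run it.
# The job ID passed in is a placeholder.

# Stop a job's processes with SIGSTOP; the job keeps its CPU allocation.
pause_cmd() {
    printf 'scancel --signal=STOP %s\n' "$1"
}

# Continue a job previously stopped with SIGSTOP.
continue_cmd() {
    printf 'scancel --signal=CONT %s\n' "$1"
}

# Suspend a job; per the NOTE above, its CPUs become available to other jobs.
suspend_cmd() {
    printf 'scontrol suspend %s\n' "$1"
}

# Resume a suspended job; its CPUs may by now be allocated to another job.
resume_cmd() {
    printf 'scontrol resume %s\n' "$1"
}

# Show the signal-based pair for a placeholder job ID.
pause_cmd 1234
continue_cmd 1234
```

To act on a real job, the printed line could be reviewed and then run by hand on a cluster where the caller has permission to signal or suspend that job.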