\." $Id$ .\" .TH SRUN "1" "December 2002" "srun 0.1" "slurm components" .SH "NAME" srun \- run parallel jobs .SH SYNOPSIS .B srun [\fIOPTIONS\fR...] \fIexecutable \fR[\fIargs\fR...] .br .B srun \-\-allocate [\fIOPTIONS\fR...] [job_script] .br .B srun \-\-attach=jobid .SH DESCRIPTION Allocate resources and optionally initiate parallel jobs on clusters managed by SLURM. .TP parallel run options .TP \fB\-n\fR, \fB\-\-nprocs\fR=\fInprocs\fR Specify the number of processes to run. Request that .B srun allocate \fInprocs\fR processes. Specification of the number of processes per node may be achieved with the .B -c and .B -N options. If unspecified, the default is one process. .TP \fB\-c\fR, \fB\-\-cpus\-per\-task\fR=\fIncpus\fR Request that \fIncpus\fR be allocated \fBper process\fR. This may be useful if the job will be multithreaded and requires more than one cpu per task for optimal performance. The default is one cpu per process. .TP \fB\-N\fR, \fB\-\-nodes\fR=\fInnodes\fR Request that \fInnodes\fR nodes be allocated to this job. The default is to allocate one cpu per process, such that nodes with one cpu will run one process, nodes with 2 cpus will be allocated 2 processes, etc. The distribution of processes across nodes may be controlled using this option along with the .B -n and .B -c options. .TP \fB\-p\fR, \fB\-\-partition\fR=\fIpartition\fR Request resources from partition "\fIpartition\fR." Partitions are created by the slurm administrator. .TP \fB\-t\fR, \fB\-\-time\fR=\fIminutes\fR Establish a time limit to terminate the job after the specified number of minutes. .TP \fB\-\-cddir\fR=\fIpath\fR have the remote processes do a chdir to \fIpath\fR before beginning execution. The default is to chdir to the current working directory of the \fBsrun\fR process. .TP \fB\-I\fR, \fB\-\-immediate\fR exit if resources are not immediately available. By default, \fB\-\-immediate\fR is off, and .B srun will block until resources become available. .TP \fB\-k\fR, \fB\-\-kill-off\fR Do not automatically terminate a job of one of the nodes it has been allocated fails. The job will assume all responsibilities for fault-tolerance. The default action is to termniate the job upon node failure. .TP \fB\-s\fR, \fB\-\-share\fR The job can share nodes with other running jobs. This may result in faster job initiation and higher system utilization, but lower application performance. .TP \fB\-O\fR, \fB\-\-overcommit\fR overcommit resources. Normally, .B srun will not allocate more than one process per cpu. By specifying \fB\-\-overcommit\fR you are explicitly allowing more than one process per cpu. However no more than \fBMAX_TASKS_PER_NODE\fR tasks are permitted to execute per node. .TP \fB\-T\fR, \fB\-\-threads\fR=\fInthreads\fR Request that .B srun use \fInthreads\fR to initiate and control the parallel job. The default value is the smaller of 10 or the number of nodes allocated. .TP \fB\-l\fR, \fB\-\-label\fR prepend task number to lines of stdout/err. Normally, stdout and stderr from remote tasks is line-buffered directly to the stdout and stderr of .B srun . The \fB\-\-label\fR option will prepend lines of output with the remote task id. .TP \fB\-m\fR, \fB\-\-distribution\fR=(\fIblock\fR|\fIcyclic\fR) Specify an alternate distribution method for remote processes. .RS .TP .B block The block method of distribution will allocate processes in-order to the cpus on a node. This is the default behavior if the number of tasks exceeds the number of nodes requested. 
.TP
\fB\-J\fR, \fB\-\-job\-name\fR=\fIjobname\fR
Specify a name for the job. The specified name will appear along with
the job id number when querying running jobs on the system. The default
is the supplied \fBexecutable\fR program's name.
.TP
\fB\-o\fR, \fB\-\-output\fR=\fImode\fR
Specify the mode for stdout redirection. By default,
.B srun
collects stdout from all tasks and line buffers this output to the
attached terminal. With \fB\-\-output\fR stdout may be redirected to a
file, to one file per task, or to /dev/null. See section
\fBIO Redirection\fR below for the various forms of \fImode\fR.
.TP
\fB\-i\fR, \fB\-\-input\fR=\fImode\fR
Specify how stdin is to be redirected. By default,
.B srun
redirects stdin from the terminal to all tasks. See \fBIO Redirection\fR
below for more options.
.TP
\fB\-e\fR, \fB\-\-error\fR=\fImode\fR
Specify how stderr is to be redirected. By default,
.B srun
redirects stderr to the same file as stdout, if one is specified. The
\fB\-\-error\fR option is provided to allow stdout and stderr to be
redirected to different locations. See \fBIO Redirection\fR below for
more options.
.TP
\fB\-b\fR, \fB\-\-batch\fR
Submit in "batch mode." \fBsrun\fR will make a copy of the
\fIexecutable\fR file (a script) and submit the request for execution
when resources are available. \fBsrun\fR will terminate after the
request has been submitted. The \fIexecutable\fR file will run on the
first node allocated to the job and must contain \fBsrun\fR commands to
initiate parallel tasks. stdin will be redirected from /dev/null, and
stdout and stderr will be redirected to a file (the default is
\fIjobname\fR.out or \fIjobid\fR.out in the current working directory;
see \fB\-o\fR for other IO options). \fIexecutable\fR must be specified
with either a fully qualified pathname or a pathname relative to the
current working directory. The search path will not be used to locate
the file. \fIexecutable\fR will be interpreted by the user's default
shell unless the file begins with "#!" followed by the fully qualified
pathname of a valid shell.
.TP
\fB\-v\fR, \fB\-\-verbose\fR
Verbose operation. Multiple \fB\-v\fR's will further increase the
verbosity of
.BR srun .
.TP
\fB\-d\fR, \fB\-\-slurmd\-debug\fR=\fIlevel\fR
Specify a debug level for slurmd(8). \fIlevel\fR may be an integer value
between 0 [quiet, only errors are displayed] and 6 [insanely verbose
operation]. The slurmd debug information is copied to the stderr of the
job.
.TP
\fB\-W\fR, \fB\-\-wait\fR=\fIseconds\fR
Specify how long to wait after the first task terminates before
terminating all remaining tasks. The default value is unlimited. This
can be useful to ensure that a job is terminated in a timely fashion in
the event that one or more tasks terminate prematurely.
.PP
Allocate options:
.TP
\fB\-A\fR, \fB\-\-allocate\fR
Allocate resources and spawn a shell. When \fB\-\-allocate\fR is
specified to
.BR srun ,
no remote tasks are started. Instead, a subshell is started that has
access to the allocated resources. Multiple jobs can then be run on the
same cpus from within this subshell. See \fBAllocate Mode\fR below.
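.IP
For example, a session in allocate mode might look like the following
sketch (my_program is a placeholder for any executable); the allocation
is released when the subshell exits:
.br
> srun \-A \-N2
.br
> srun hostname
.br
> srun my_program
.br
> exit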
.PP
Attach to running job:
.TP
\fB\-a\fR, \fB\-\-attach\fR=\fIid\fR
This option will attach
.B srun
to a running job with job id = \fIid\fR. Provided that the calling user
has access to that running job, stdout and stderr will be redirected to
the current session and signals received by
.B srun
will be forwarded to the remote processes.
.TP
\fB\-j\fR, \fB\-\-join\fR
Join with a running job. This will duplicate stdout/stderr to the
calling \fBsrun\fR. stdin and signals will not be propagated to the
job. \fB\-\-join\fR is only allowed with \fB\-\-attach\fR.
.TP
\fB\-s\fR, \fB\-\-steal\fR
Steal the connection to the running job. This will close any open
sessions with the specified job and allow stdin and signals to be
propagated. \fB\-\-steal\fR is only allowed with \fB\-\-attach\fR.
.PP
Constraint options. The following options all put constraints on the
nodes that may be considered for the job:
.TP
\fB\-\-mincpus\fR=\fIn\fR
Specify a minimum number of cpus per node.
.TP
\fB\-\-mem\fR=\fIMB\fR
Specify a minimum amount of real memory.
.TP
\fB\-\-vmem\fR=\fIMB\fR
Specify a minimum amount of virtual memory.
.TP
\fB\-\-tmp\fR=\fIMB\fR
Specify a minimum amount of temporary disk space.
.TP
\fB\-C\fR, \fB\-\-constraint\fR=\fIlist\fR
Specify a list of constraints. The \fIlist\fR of constraints is a
comma\-separated list of features that have been assigned to the nodes
by the slurm administrator. If no nodes have the requested feature, then
the job will be rejected by the slurm job manager.
.TP
\fB\-\-contiguous\fR
Demand a contiguous range of nodes. The default is on. Specify
\-\-contiguous=no if a contiguous range of nodes is not a constraint.
.TP
\fB\-w\fR, \fB\-\-nodelist\fR=\fIhost1,host2,...\fR or \fIfilename\fR
Request a specific list of hosts. The job will contain \fIat least\fR
these hosts. The list may be specified as a comma\-separated list of
hosts, a range of hosts (host[1\-5,7,...] for example), or a filename.
The host list will be assumed to be a filename if it contains a "/"
character.
.PP
Help options:
.TP
\-?, \fB\-\-help\fR
Show this help message.
.TP
\fB\-\-usage\fR
Display a brief usage message.
.PP
Other options:
.TP
\fB\-V\fR, \fB\-\-version\fR
Output version information and exit.
.PP
Unless the \fB\-a\fR (\fB\-\-attach\fR) or \fB\-A\fR (\fB\-\-allocate\fR)
options are specified (see \fBAllocate Mode\fR and
\fBAttaching to a running job\fR below),
.B srun
will submit the job request to the slurm job controller, then initiate
all processes on the remote nodes. If the request cannot be met
immediately,
.B srun
will block until the resources are free to run the job. If the \fB\-I\fR
(\fB\-\-immediate\fR) option is specified,
.B srun
will terminate if resources are not immediately available.
.PP
When initiating remote processes,
.B srun
will propagate the current working directory, unless
\fB\-\-cddir\fR=\fIpath\fR is specified, in which case \fIpath\fR will
become the working directory for the remote processes.
.PP
The \fB\-n\fR, \fB\-c\fR, and \fB\-N\fR options control how CPUs and
nodes will be allocated to the job. When specifying only the number of
processes to run with \fB\-n\fR, a default of one CPU per process is
allocated. By specifying the number of CPUs required per task
(\fB\-c\fR), more than one CPU may be allocated per process. If the
number of nodes is specified with \fB\-N\fR,
.B srun
will attempt to allocate \fIat least\fR the number of nodes specified.
.PP
Combinations of the above three options may be used to change how
processes are distributed across nodes and cpus. For instance, by
specifying both the number of processes and the number of nodes on
which to run, the number of processes per node is implied. However, if
the number of CPUs per process is more important, then the number of
processes (\fB\-n\fR) and the number of CPUs per process (\fB\-c\fR)
should be specified.
.PP
.B srun
will refuse to allocate more than one process per CPU unless
\fB\-\-overcommit\fR (\fB\-O\fR) is also specified.
.PP
.B srun
will attempt to meet the above specifications "at a minimum." That is,
if 16 nodes are requested for 32 processes, and some nodes do not have
2 CPUs, the allocation of nodes will be increased in order to meet the
demand for CPUs. In other words, a \fIminimum\fR of 16 nodes is being
requested. However, if 16 nodes are requested for 15 processes,
.B srun
will consider this an error, as 15 processes cannot run across 16 nodes.
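.PP
As an illustration (a sketch; the actual layout depends on the cpus
available per node), the first command below requests 32 processes on a
minimum of 16 nodes, while the second requests 8 CPUs in total for 4
processes of 2 cpus each, leaving the node count up to slurm
(my_threaded_program is a placeholder for any multithreaded executable):
.br
> srun \-n32 \-N16 \-l hostname
.br
> srun \-n4 \-c2 my_threaded_program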
.PP
.B "IO Redirection"
.PP
By default, stdout and stderr will be redirected from all tasks to the
stdout and stderr of
.BR srun ,
and stdin will be redirected from the standard input of
.B srun
to all remote tasks. This behavior may be changed with the
\fB\-\-output\fR, \fB\-\-error\fR, and \fB\-\-input\fR (\fB\-o\fR,
\fB\-e\fR, \fB\-i\fR) options. Valid format specifications for these
options are:
.TP 10
\fBall\fR
stdout and stderr are redirected from all tasks to srun. stdin is
broadcast to all remote tasks. (This is the default behavior.)
.TP
\fBnone\fR
stdout and stderr are not received from any task. stdin is not sent to
any task (stdin is closed).
.TP
\fItaskid\fR
stdout and/or stderr are redirected from only the task with relative id
equal to \fItaskid\fR, where 0 <= \fItaskid\fR < \fIntasks\fR and
\fIntasks\fR is the total number of tasks in the current job step. stdin
is redirected from the stdin of
.B srun
to this same task.
.TP
filename
.B srun
will redirect stdout and/or stderr to the named file from all tasks.
stdin will be redirected from the named file and broadcast to all tasks
in the job.
.TP
format string
.B srun
allows for a format string to be used to generate the named IO file
described above. The following list of format specifiers may be used in
the format string to generate a filename that will be unique to a given
jobid, stepid, node, or task. In each case, the appropriate number of
files are opened and associated with the corresponding tasks.
.RS 10
.TP
%J
jobid.stepid of the running job (e.g. "128.0").
.TP
%j
jobid of the running job.
.TP
%s
stepid of the running job.
.TP
%N
short hostname. This will create a separate IO file per node.
.TP
%n
Node identifier relative to the current job (e.g. "0" is the first node
of the running job). This will create a separate IO file per node.
.TP
%t
task identifier (rank) relative to the current job. This will create a
separate IO file per task.
.PP
A number placed between the percent character and the format specifier
may be used to zero\-pad the result in the IO filename. This number is
ignored if the format specifier corresponds to non\-numeric data (%N
for example). Some examples of how the format string may be used for a
4\-task job step with a Job ID of 128 and step id of 0 are included
below:
.TP 15
job%J.out
job128.0.out
.TP
job%4j.out
job0128.out
.TP
job%j\-%2t.out
job128\-00.out, job128\-01.out, ...
.RE
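.PP
For instance, the following sketch runs a 4\-task job and directs the
output of each task to its own file, using the job%j\-%2t.out format
string shown above:
.br
> srun \-n4 \-o job%j\-%2t.out hostname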
.PP
.B "Allocate Mode"
.PP
When the allocate option is specified (\fB\-A\fR, \fB\-\-allocate\fR),
\fBsrun\fR will not initiate any remote processes after acquiring
resources. Instead, \fBsrun\fR will spawn a subshell which has access
to the acquired resources. Subsequent instances of \fBsrun\fR from
within this subshell will then run on these resources.
.PP
If the name of a script is specified on the command line with
\fB\-\-allocate\fR, the spawned shell will run the specified script.
Resources allocated in this way will only be freed when the subshell
terminates.
.PP
.B "Attaching to a running job"
.PP
Use of the \fB\-a\fR \fIjobid\fR (or \fB\-\-attach\fR) option allows
\fBsrun\fR to reattach to a running job, receiving stdout and stderr
from the job and forwarding signals to the job, just as if the current
session of \fBsrun\fR had started the job. (stdin, however, cannot be
forwarded to the job.)
.PP
There are two ways to reattach to a running job. The default method is
to steal any current connections to the job. In this case, the
\fBsrun\fR process currently managing the job will be terminated, and
control will be relegated to the caller. To allow the current
\fBsrun\fR to continue managing the running job, the \fB\-j\fR
(\fB\-\-join\fR) option may be specified. When joining with the running
job, stdout and stderr are duplicated to the new \fBsrun\fR session,
but signals are not forwarded to the remote job.
.PP
Node and CPU selection options do not make sense when specifying
\fB\-\-attach\fR, and it is an error to use \fB\-n\fR, \fB\-c\fR, or
\fB\-N\fR in attach mode.
.SH "ENVIRONMENT VARIABLES"
.PP
Some
.B srun
options may be set via environment variables. These environment
variables, along with their corresponding options, are listed below.
(Note: command line options will always override these settings.)
.TP 20
SLURM_CPUS_PER_TASK
\fB\-c, \-\-cpus\-per\-task\fR=\fIn\fR
.TP
SLURM_DEBUG
\fB\-v, \-\-verbose\fR
.TP
SLURMD_DEBUG
\fB\-d, \-\-slurmd\-debug\fR
.TP
SLURM_DISTRIBUTION
\fB\-m, \-\-distribution\fR=(\fIblock\fR|\fIcyclic\fR)
.TP
SLURM_NNODES
\fB\-N, \-\-nodes\fR=\fIn\fR
.TP
SLURM_NPROCS
\fB\-n, \-\-nprocs\fR=\fIn\fR
.TP
SLURM_OVERCOMMIT
\fB\-O, \-\-overcommit\fR
.TP
SLURM_PARTITION
\fB\-p, \-\-partition\fR=\fIpartition\fR
.TP
SLURM_STDERRMODE
\fB\-e, \-\-error\fR=\fImode\fR
.TP
SLURM_STDINMODE
\fB\-i, \-\-input\fR=\fImode\fR
.TP
SLURM_STDOUTMODE
\fB\-o, \-\-output\fR=\fImode\fR
.TP
SLURM_WAIT
\fB\-W, \-\-wait\fR=\fIseconds\fR
.PP
Additionally,
.B srun
will set some environment variables in the environment of the executing
tasks on the remote compute nodes. These environment variables are:
.TP 20
SLURM_JOBID
job id of the executing job.
.TP
SLURM_RANK
the MPI rank of the current process.
.TP
SLURM_NPROCS
total number of processes in the current job.
.TP
SLURM_NODELIST
list of nodes that the slurm job is executing on.
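.PP
For example, these task variables can be read directly by a job
script. The following sketch (a hypothetical script named report.sh)
prints each task's rank and host:
.br
> cat report.sh
.br
#!/bin/sh
.br
echo "task $SLURM_RANK of $SLURM_NPROCS in job $SLURM_JOBID on `hostname`"
.br
> srun \-n4 report.sh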
.SH "SIGNALS AND ESCAPE SEQUENCES"
Signals sent to the \fBsrun\fR command are automatically forwarded to
the tasks it is controlling, with a few exceptions. The escape sequence
\fB<control\-c>\fR will report the state of all tasks associated with
the \fBsrun\fR command. If \fB<control\-c>\fR is entered twice within
one second, then the associated SIGINT signal will be sent to all
tasks. If a third \fB<control\-c>\fR is received, the job will be
forcefully terminated without waiting for remote tasks to exit. The
escape sequence \fB<control\-z>\fR is presently ignored. Our intent is
for this to put the \fBsrun\fR command into a mode where various
special actions may be invoked.
.SH "MPI SUPPORT"
On computers with a Quadrics interconnect, \fBsrun\fR directly supports
the Quadrics version of MPI without modification. Applications built
using the Quadrics MPI library will communicate directly over the
Quadrics interconnect without any special \fBsrun\fR options.
.PP
Users may also use MPICH on any computer where that is available. The
\fBmpirun\fR command may need to be provided with information on its
command line identifying the resources to be used. The installer of the
MPICH software may configure it to perform these steps automatically.
At worst, you must specify two parameters:
.TP
\fB\-np SLURM_NPROCS\fR
number of processors to run on.
.TP
\fB\-machinefile <machinefile>\fR
list of computers on which to execute. This list can be constructed by
executing the command \fBsrun /bin/hostname\fR and writing its standard
output to the desired file. Execute \fBmpirun \-\-help\fR for more
options.
.SH "EXAMPLES"
This simple example demonstrates the execution of the command
\fBhostname\fR in eight tasks. At least eight processors will be
allocated to the job (the same as the task count) on however many nodes
are required to satisfy the request. The output of each task will be
preceded with its task number.
.br
> srun \-n8 \-l hostname
.PP
This example demonstrates how one might submit a script for later
execution (batch mode). The script will be initiated when resources are
available and no higher priority job is pending for the same partition.
The script will execute on 4 nodes with one task per node implied. Note
that the script itself executes on only one node. For the script to
utilize all allocated nodes, it must execute the \fBsrun\fR command or
an MPI program.
.br
> cat my_script
.br
#!/bin/csh
.br
date
.br
srun \-l hostname
.br
> srun \-N4 \-b my_script
.PP
This example demonstrates how one executes a simple MPICH job in the
event that MPICH has not been configured to automatically set the
required parameters (this is the worst\-case scenario). We use
\fBsrun\fR to build a list of machines (nodes) to be used by
\fBmpirun\fR in its required format. A sample command line and the
script to be executed follow.
.br
> cat my_script
.br
#!/bin/csh
.br
srun /bin/hostname >nodes
.br
mpirun \-np $SLURM_NPROCS \-machinefile nodes /bin/hostname
.br
rm nodes
.br
> srun \-N2 \-n4 my_script
.PP
If MPICH is configured to directly use SLURM, the execution line is
much simpler:
.br
> mpirun \-np 4 /bin/hostname
.SH "BUGS"
If the number of processors per node allocated to a job is not evenly
divisible by the value of \fBcpus\-per\-task\fR, tasks may be initiated
on nodes lacking a sufficient number of processors for the desired
parallelism. For example, if \fBcpus\-per\-task\fR is three,
\fBnprocs\fR is four, and the job is allocated three nodes, each with
four processors, then the requisite 12 processors have been allocated,
but there is no way for the job to initiate four tasks with each of
them having exclusive access to three processors on the same node. The
\fBnodes\fR and \fBmincpus\fR options may be helpful in preventing this
problem.
.SH "SEE ALSO"
\fBscancel\fR(1), \fBsqueue\fR(1)