From d07c0e87ec0adce956bf3c92bbbd19bd84c0472e Mon Sep 17 00:00:00 2001
From: Moe Jette <jette1@llnl.gov>
Date: Wed, 30 Apr 2003 17:05:54 +0000
Subject: [PATCH] Remove details of allocate and batch job initiation designs
 due to space constraints and make a note to that effect.

---
 doc/clusterworld/report.tex | 117 +-----------------------------------
 1 file changed, 3 insertions(+), 114 deletions(-)

diff --git a/doc/clusterworld/report.tex b/doc/clusterworld/report.tex
index 19e5e24aa5e..d1845a07e41 100644
--- a/doc/clusterworld/report.tex
+++ b/doc/clusterworld/report.tex
@@ -1227,8 +1227,9 @@ list of allocated nodes, job step credential, etc.
 if the request is granted, \srun\ then initializes a listen port for
 stdio connections and connects to the {\tt slurmd}s on the allocated
 nodes requesting that the remote processes be initiated. The {\tt
 slurmd}s begin execution of the tasks and
-connect back to \srun\ for stdout and stderr. This process and other
-initiation modes are described in more detail below.
+connect back to \srun\ for stdout and stderr. This process is described
+in more detail below. Details of the batch and allocate modes of operation
+are not presented due to space constraints.
 
 \subsection{Interactive Job Initiation}
@@ -1290,118 +1291,6 @@ the allocated nodes, it issues a request for the epilog to be run on
 each of the {\tt slurmd}s in the allocation. As {\tt slurmd}s report
 that the epilog ran successfully, the nodes are returned to the
 partition.
-\subsection{Queued (Batch) Job Initiation}
-
-\begin{figure}[tb]
-\centerline{\epsfig{file=../figures/queued-job-init.eps,scale=0.5} }
-\caption{\small Queued job initiation.
- \slurmctld\ initiates the user's job as a batch script on one node.
- Batch script contains an \srun\ call that initiates parallel tasks
- after instantiating job step with controller. The shaded region is
- a compressed representation and is shown in more detail in the
- interactive diagram (Figure~\ref{init-interactive})}
-\label{init-batch}
-\end{figure}
-
-Figure~\ref{init-batch} shows the initiation of a queued job in
-SLURM. The user invokes \srun\ in batch mode by supplying the {\tt --batch}
-option to \srun . Once user options are processed, \srun\ sends a batch
-job request to \slurmctld\ that identifies the stdin, stdout and stderr file
-names for the job, current working directory, environment, requested
-number of nodes, etc.
-The \slurmctld\ queues the request in its priority-ordered queue.
-
-Once the resources are available and the job has a high enough priority, \linebreak
-\slurmctld\ allocates the resources to the job and contacts the first node
-of the allocation requesting that the user job be started. In this case,
-the job may either be another invocation of \srun\ or a job script
-including invocations of \srun . The \slurmd\ on
-the remote node responds to the run request, initiating the job manager,
-session manager, and user script. An \srun\ executed from within the script
-detects that it has access to an allocation and initiates a job step on
-some or all of the nodes within the job.
-
-Once the job step is complete, the \srun\ in the job script notifies
-the \slurmctld\, and terminates. The job script continues executing and
-may initiate further job steps. Once the job script completes, the task
-thread running the job script collects the exit status and sends a task
-exit message to the \slurmctld . The \slurmctld\ notes that the job
-is complete and requests that the job epilog be run on all nodes that
-were allocated. As the {\tt slurmd}s respond with successful completion
-of the epilog, the nodes are returned to the partition.
-
-\subsection{Allocate Mode Initiation}
-
-\begin{figure}[tb]
-\centerline{\epsfig{file=../figures/allocate-init.eps,scale=0.5} }
-\caption{\small Job initiation in allocate mode. Resources are allocated and
- \srun\ spawns a shell with access to the resources. When user runs
- an \srun\ from within the shell, the a job step is initiated under
- the allocation}
-\label{init-allocate}
-\end{figure}
-
-In allocate mode, the user wishes to allocate a job and interactively run
-job steps under that allocation. The process of initiation in this mode
-is shown in Figure~\ref{init-allocate}. The invoked \srun\ sends
-an allocate request to \slurmctld , which, if resources are available,
-responds with a list of nodes allocated, job id, etc. The \srun\ process
-spawns a shell on the user's terminal with access to the allocation,
-then waits for the shell to exit (at which time the job is considered
-complete).
-
-An \srun\ initiated within the allocate sub-shell recognizes that
-it is running under an allocation and therefore already within a
-job. Provided with no other arguments, \srun\ started in this manner
-initiates a job step on all nodes within the current job.
-
-% Maybe later:
-%
-% However, the user may select a subset of these nodes implicitly by using
-% the \srun\ {\tt --nodes} option, or explicitly by specifying a relative
-% nodelist ( {\tt --nodelist=[0-5]} ).
-
-An \srun\ executed from the sub-shell reads the environment and user
-options, then notifies the controller that it is starting a job step under
-the current job. The \slurmctld\ registers the job step and responds
-with a job step credential. \srun\ then initiates the job step using the same
-general method as for interactive job initiation.
-
-When the user exits the allocate sub-shell, the original \srun\ receives
-exit status, notifies \slurmctld\ that the job is complete, and exits.
-The controller runs the epilog on each of the allocated nodes, returning
-nodes to the partition as they successfully complete the epilog.
-
-%
-% Information in this section seems like it should be some place else
-% (Some of it is incorrect as well)
-% -mark
-%
-%\section{Infrastructure}
-%
-%The state of \slurmctld\ is written periodically to disk for fault
-%tolerance. SLURM daemons are initiated via {\tt inittab} using the {\tt
-%respawn} option to insure their continuous execution. If the control
-%machine itself becomes inoperative, its functions can easily be moved in
-%an automated fashion to another node. In fact, the computers designated
-
-%as both primary and backup control machine can easily be relocated as
-%needed without loss of the workload by changing the configuration file
-%and restarting all SLURM daemons.
-%
-%The {\tt syslog} tools are used for logging purposes and take advantage
-
-%of the severity level parameter.
-%
-%Direct use of the Elan interconnect is provided a version of MPI developed
-%and supported by Quadrics. SLURM supports this version of MPI with no
-%modifications.
-%
-%SLURM supports the TotalView debugger\cite{Etnus2002}. This requires
-%\srun\ to not only maintain a list of nodes used by each job step, but
-%also a list of process ids on each node corresponding the application's
-%tasks.
-
 \section{Results}
 
 \begin{figure}[htb]
-- 
GitLab
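
The queued (batch) initiation design removed by this patch still describes a simple control flow: srun --batch hands the script and resource request to slurmctld, the controller holds it in a priority-ordered queue, and once nodes are free the script is started on the first node of the allocation, where each embedded srun becomes a job step; after the script exits, the epilog runs and the nodes return to the partition. The short Python sketch below models that flow for readers who want a concrete trace. It is an illustrative toy only: ToyController, submit_batch, and schedule are invented names with no relation to the actual SLURM code or API.

import heapq
import itertools


class ToyController:
    """Toy stand-in for slurmctld: a priority-ordered queue of batch jobs."""

    def __init__(self, nodes):
        self.free_nodes = list(nodes)      # idle nodes in the partition
        self.pending = []                  # min-heap of (-priority, seq, job)
        self._seq = itertools.count()      # tie-breaker for equal priorities

    def submit_batch(self, script, num_nodes, priority):
        """Model of 'srun --batch': queue the script and its node request."""
        job = {"script": script, "num_nodes": num_nodes}
        heapq.heappush(self.pending, (-priority, next(self._seq), job))

    def schedule(self):
        """When resources are available and a job has high enough priority,
        allocate nodes and run the batch script on the first node."""
        while self.pending and self.pending[0][2]["num_nodes"] <= len(self.free_nodes):
            _, _, job = heapq.heappop(self.pending)
            alloc = [self.free_nodes.pop() for _ in range(job["num_nodes"])]
            # The script runs on alloc[0]; every srun it contains becomes a
            # job step spanning some or all nodes of the existing allocation.
            for job_step in job["script"]:
                job_step(alloc)
            # Script finished: run the epilog and return nodes to the partition.
            self.free_nodes.extend(alloc)


if __name__ == "__main__":
    ctld = ToyController(nodes=["linux%02d" % i for i in range(4)])
    script = [lambda alloc: print("job step running on", alloc)]  # one srun call
    ctld.submit_batch(script, num_nodes=2, priority=10)
    ctld.schedule()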