From f0993a89ca80987d0ddf9c8bcb156f259e8bdfbf Mon Sep 17 00:00:00 2001
From: Moe Jette <jette1@llnl.gov>
Date: Wed, 19 Mar 2003 19:57:20 +0000
Subject: [PATCH] Minor clean-up and updates to section 1 only.

---
 doc/pubdesign/llnl.tex   |  2 +-
 doc/pubdesign/report.tex | 71 +++++++++++++++++++---------------------
 2 files changed, 34 insertions(+), 39 deletions(-)

diff --git a/doc/pubdesign/llnl.tex b/doc/pubdesign/llnl.tex
index 30513f0e947..efc0f64c69d 100644
--- a/doc/pubdesign/llnl.tex
+++ b/doc/pubdesign/llnl.tex
@@ -118,7 +118,7 @@
 
 \put(300,545)
 {
- \makebox(0,0)[l]{\large \shortstack[l]{ UCRL-MA-147996 \\ REV 2}}
+ \makebox(0,0)[l]{\large \shortstack[l]{ UCRL-MA-147996 \\ REV 3}}
 }
 
 \put(60, 460)
diff --git a/doc/pubdesign/report.tex b/doc/pubdesign/report.tex
index cef66f1448f..a0b4b8ea8e7 100644
--- a/doc/pubdesign/report.tex
+++ b/doc/pubdesign/report.tex
@@ -94,7 +94,7 @@ but does assume that the entire cluster is within a single
 administrative domain with a common user base across the
 entire cluster.
 
-\item {\em System administrator friendly}: SLURM is configured a
+\item {\em System administrator friendly}: SLURM is configured with a
 simple configuration file and minimizes distributed state.
 Its configuration may be changed at any time without impacting running jobs.
 Heterogeneous nodes within a cluster may be easily managed.
@@ -116,7 +116,7 @@ conflicting requests for resources by managing a queue of pending work.
 Users interact with SLURM through four command line utilities:
 \srun\ for submitting a job for execution and optionally controlling
 it interactively,
-\scancel\ for early termination of a pending or running job,
+\scancel\ for terminating a pending or running job,
 \squeue\ for monitoring job queues, and
 \sinfo\ for monitoring partition and overall system state.
 System administrators perform privileged operations through an additional
@@ -127,10 +127,6 @@ and directs operations.
 Compute nodes simply run a \slurmd\ daemon (similar to a remote
 shell daemon) to export control to SLURM.
 
-External schedulers and meta-batch systems can submit jobs to SLURM,
-order its queues, and monitor SLURM state through an application
-programming interface (API).
-
 \subsection{What SLURM is Not}
 
 SLURM is not a comprehensive cluster administration or monitoring package.
@@ -151,8 +147,8 @@ external entity.
 Its default scheduler implements First-In First-Out (FIFO).
 An external entity can establish a job's initial priority
 through a plugin.
-An external scheduler may also submit, signal, hold, reorder and
-terminate jobs via the API.
+An external scheduler may also submit, signal, and terminate jobs
+as well as reorder the queue of pending jobs via the API.
 
 \subsection{Architecture}
 
@@ -211,14 +207,13 @@ are explained in more detail below.
 
 \slurmd\ is a multi-threaded daemon running on each compute node and
 can be compared to a remote shell daemon:
-it waits for work, executes the work, returns status,
-then waits for more work.
-Since it initiates jobs for other users, it must run as user {tt root}.
+it reads the common SLURM configuration file, waits for work,
+executes the work, returns status, then waits for more work.
+Since it initiates jobs for other users, it must run as user {\em root}.
 It also asynchronously exchanges node and job status with {\tt slurmctld}.
 The only job information it has at any given time pertains to its
 currently executing jobs.
-\slurmd\ reads the common SLURM configuration file, {\tt /etc/slurm.conf},
-and has five major components:
+\slurmd\ has five major components:
 
 \begin{itemize}
 \item {\em Machine and Job Status Services}: Respond to controller
@@ -249,7 +244,7 @@ termination requests to any set of locally managed processes.
 
 \subsubsection{Slurmctld}
 
-Most SLURM state information exists in the controller, {\tt slurmctld}.
+Most SLURM state information exists in {\tt slurmctld}, also known as the controller.
 \slurmctld\ is multi-threaded with independent read and write locks
 for the various data structures to enhance scalability.
 When \slurmctld\ starts, it reads the SLURM configuration file.
@@ -260,7 +255,7 @@ disk periodically with incremental changes written to disk immediately
 for fault tolerance.
 \slurmctld\ runs in either master or standby mode, depending on the
 state of its fail-over twin, if any.
-\\slurmctld\ need not execute as user {\tt root}.
+\slurmctld\ need not execute as user {\tt root}.
 In fact, it is recommended that a unique user entry be created for
 executing \slurmctld\ and that user must be identified in the SLURM
 configuration file as {\tt SlurmUser}.
@@ -303,10 +298,9 @@ clean-up and performs another scheduling cycle as described above.
 The command line utilities are the user interface to SLURM
 functionality. They offer users access to remote execution and job
 control. They also
-permit administrators to dynamically change the system configuration. The
-utilities read the global configuration file
-to determine the host(s) for \slurmctld\ requests, and the ports for
-both for \slurmctld\ and \slurmd\ requests.
+permit administrators to dynamically change the system configuration.
+These commands all use SLURM APIs which are directly available for
+more sophisticated applications.
 
 \begin{itemize}
 \item {\tt scancel}: Cancel a running or a pending job or job step,
@@ -319,6 +313,7 @@ such as draining a node or partition in preparation for maintenance.
 Many \scontrol\ functions can only be executed by privileged users.
 
 \item {\tt sinfo}: Display a summary of partition and node information.
+An assortment of filtering and output format options are available.
 
 \item {\tt squeue}: Display the queue of running and waiting jobs
 and/or job steps. A wide assortment of filtering, sorting, and output
@@ -376,9 +371,10 @@ permit use of other communications layers.
 At LLNL we are using an Ethernet for SLURM communications and
 the Quadrics Elan switch exclusively for user applications.
 The SLURM configuration file permits the identification of each
-node's name to be used for communications as well as its hostname.
-In the case of a control machine known as {\em mcri} to be communicated
-with using the name {\em emcri}, this is represented in the
+node's hostname as well as its name to be used for communications.
+In the case of a control machine known as {\em mcri} to be
+communicated with using the name {\em emcri} (say to indicate
+an Ethernet communications path), this is represented in the
 configuration file as {\em ControlMachine=mcri ControlAddr=emcri}.
 The name used for communication is the same as the hostname unless
 otherwise specified.
@@ -413,8 +409,7 @@ required by others, set-uid programs may be used to grant specific
 permissions to specific users.
 
 We presently support three authentication mechanisms via plugins:
-{\tt authd}\cite{Authd2002}, {\tt munged} and {\tt none}
-(ie. trust message contents).
+{\tt authd}\cite{Authd2002}, {\tt munged} and {\tt none}.
 A plugin can easily be developed for Kerberos or authentication
 mechanisms as desired.
 The \munged\ implementation is described below.
@@ -439,19 +434,19 @@ In SLURM's case, the user supplied information includes node
 identification information to prevent a credential from being used on
 nodes it is not destined for.
 
-When resources are allocated to a user by the controller, a ``job
-step credential'' is generated by combining the user id, job id,
+When resources are allocated to a user by the controller, a
+{\em job step credential} is generated by combining the user id, job id,
 step id, the list of resources allocated (nodes), and the credential
-lifetime. This ``job step credential'' is encrypted with
+lifetime. This job step credential is encrypted with
 a \slurmctld\ private key. This credential is returned to the
 requesting agent ({\tt srun}) along with the allocation response, and
 must be forwarded to the remote {\tt slurmd}'s upon job step
 initiation. \slurmd\ decrypts this credential with the \slurmctld 's
 public key to verify that the user may access
-resources on the local node. \slurmd\ also uses this ``job step credential''
+resources on the local node. \slurmd\ also uses this job step credential
 to authenticate standard input, output, and error communication streams.
 
-Access to partitions may be restricted via a ``RootOnly'' flag.
+Access to partitions may be restricted via a {\em RootOnly} flag.
 If this flag is set, job submit or allocation requests to this
 partition are only accepted if the effective user ID originating
 the request is a privileged user.
@@ -461,12 +456,12 @@ with exclusive access to partitions. Individual users will not be
 permitted to directly submit jobs to such a partition, which would
 prevent the external scheduler from effectively managing it.
 Access to partitions may also be restricted to users who are
-members of specific Unix groups using a ``AllowGroups'' specification.
+members of specific Unix groups using a {\em AllowGroups} specification.
 
 \subsection{Example: Executing a Batch Job}
 
 In this example a user wishes to run a job in batch mode, in which \srun\ returns
-immediately and the job executes ``in the background'' when resources
+immediately and the job executes in the background when resources
 are available. The job is a two-node run of script containing
 {\em mping}, a simple MPI application.
 The user submits the job:
@@ -529,12 +524,12 @@ prolog program (if one is configured) as user {\tt root}, and executes
 the job script (or command) as the submitting user. The \srun\ within the
 job script detects that it is running with allocated resources from the
 presence of the {\tt SLURM\_JOBID} environment variable. \srun\ connects to
-\slurmctld\ to request a ``job step'' to run on all nodes of the current
+\slurmctld\ to request a job step to run on all nodes of the current
 job. \slurmctld\ validates the request and replies with a job credential
 and switch resources.
 \srun\ then contacts \slurmd 's running on both
 {\em dev6} and {\em dev7}, passing the job credential, environment,
 current working directory, command path and arguments, and interconnect
-information. The {\tt slurmd}'s verify the valid job credential, connect
+information. The {\tt slurmd}'s verify the valid job step credential, connect
 stdout and stderr back to \srun , establish the environment, and execute
 the command as the submitting user.
@@ -605,7 +600,7 @@ stdout of {\tt srun}:
 1 pinged 0: 1048576 bytes 4682.97 uSec 223.91 MB/s
 \end{verbatim}
 
-When the job terminates, \srun\ receives an EOF on each stream and
+When the job terminates, \srun\ receives an EOF (End Of File) on each stream and
 closes it, then receives the task exit status from each {\tt slurmd}.
 The \srun\ process notifies \slurmctld\ that the job is complete and
 terminates. The controller contacts all \slurmd 's allocated to the
@@ -941,10 +936,10 @@ allocated to the job of the termination request.
 The \slurmd\ job termination procedure, including job signaling, is
 described in the slurmd section.
 
-One may think of a ``job'' as described above as an allocation of resource
+One may think of a {\em job} as described above as an allocation of resource
 and a user script rather than a collection of parallel tasks. For that,
 the scripts execute \srun\ commands to initiate the parallel tasks
-or ``job steps''. The job may include multiple job steps, executing
+or {\em job steps}. The job may include multiple job steps, executing
 sequentially and or concurrently either on separate or overlapping nodes.
 Job steps have associated with them specific nodes (some or all of those
 associated with the job), tasks, and a task distribution (cyclic or
@@ -1186,7 +1181,7 @@ forward signals from the user's terminal and so on.
 {\em join} can be considered a variant of {\em attach} in which the job's
 stdout is captured, but stdin and signals can't be sent to it.
 
-An interactive job may also be forced into the ``background'' with a
+An interactive job may also be forced into the background with a
 special control sequence typed at the user's terminal. This sequence
 causes another \srun\ to attach to the running job while the interactive
 \srun\ terminates. Output from the running job is subsequently
@@ -1308,7 +1303,7 @@ working directory, environment, requested number of nodes, etc. The
 
 Once the resources are available and the job has a high enough priority,
 \slurmctld\ allocates the resources to the job and contacts the first node
-of the allocation requesting that the user ``job'' be started. In this case,
+of the allocation requesting that the user job be started. In this case,
 the job may either be another invocation of \srun\ or a {\em job script} which
 may have multiple invocations of \srun\ within it. The \slurmd\ on the remote
 node responds to the run request, initiating the job thread, task thread,
-- 
GitLab