From 327c5068fb114c17a7764deebf72594b991965fc Mon Sep 17 00:00:00 2001 From: Moe Jette <jette1@llnl.gov> Date: Wed, 20 Oct 2004 21:54:51 +0000 Subject: [PATCH] Major revision to draft paper. --- doc/bgl.report/Makefile | 4 +- doc/bgl.report/report.tex | 959 +++++-------------------------------- doc/bgl.report/smap.output | 19 + 3 files changed, 136 insertions(+), 846 deletions(-) create mode 100644 doc/bgl.report/smap.output diff --git a/doc/bgl.report/Makefile b/doc/bgl.report/Makefile index 3014f645804..7fe7a35ec07 100644 --- a/doc/bgl.report/Makefile +++ b/doc/bgl.report/Makefile @@ -10,7 +10,7 @@ REPORT = report -TEX = ../common/llnlCoverPage.tex $(REPORT).tex +TEX = ../common/llnlCoverPage.tex $(REPORT).tex FIGDIR = ../figures FIGS = $(FIGDIR)/arch.eps \ @@ -39,7 +39,7 @@ BIB = ../common/project.bib all: $(REPORT).ps -$(REPORT).dvi: $(TEX) $(FIGS) $(PLOTS) $(BIB) +$(REPORT).dvi: $(TEX) $(FIGS) $(PLOTS) $(BIB) smap.output rm -f *.log *.aux *.blg *.bbl (TEXINPUTS=.:../common::; export TEXINPUTS; \ BIBINPUTS=.:../common::; export BIBINPUTS; \ diff --git a/doc/bgl.report/report.tex b/doc/bgl.report/report.tex index 2deb4deacfb..e34d1c2597f 100644 --- a/doc/bgl.report/report.tex +++ b/doc/bgl.report/report.tex @@ -67,7 +67,7 @@ % couple of macros for the title page and document \def\ctit{SLURM Resource Management for Blue Gene/L} \def\ucrl{UCRL-JC-TBD} -\def\auth{Morris Jette \\ Dan Phung \\ Danny Auble} +\def\auth{Morris Jette \\ Danny Auble \\ Dan Phung} \def\pubdate{October 18, 2004} \def\journal{Conference TBD} @@ -94,7 +94,7 @@ \noindent\normalsize The Blue Gene/L (BGL) system is a highly scalable computer developed by IBM for Lawrence Livermore National Laboratory (LLNL). -The current system has over 130,000 processors interconnected by a +The current system has over 131,000 processors interconnected by a three-dimensional toroidal network with complex rules for managing the network and allocating resources to jobs. 
We selected SLURM (Simple Linux Utility for Resource Management) to
@@ -123,7 +123,7 @@ The Blue Gene/L system offers a unique
cell-based design in which the capacity can be expanded without
introducing bottlenecks.
The Blue Gene/L system delivered to LLNL consists of 131,072
processors and 16TB of memory.
-The peak computational rate will be 360 TeraFLOPs.
+The peak computational rate will exceed 360 TeraFLOPs.

Simple Linux Utility for Resource Management (SLURM)\footnote{A tip of
the hat to Matt Groening and creators of {\em Futurama},
@@ -137,18 +137,20 @@ proven both highly reliable and highly scalable.

\section{Architecture of Blue Gene/L}

-The basic building-blocks of Blue Gene/L are c-nodes consisting
-of two processors based upon the PowerPC 550GX, 256MB of memory
-and support for five separate networks on a single chip.
-These are subsequently grouped into base partitions, each consisting
-of 512 c-nodes in an eight by eight by eight grid with the same
+The basic building blocks of Blue Gene/L are c-nodes.
+Each c-node consists
+of two processors based upon the PowerPC 440, 512 MB of memory
+and support for five separate networks on a single chip.
+One of the processors may be used for computations and the
+second used exclusively for communications.
+Alternatively, both processors may be used for computations.
+These c-nodes are subsequently grouped into base partitions, each consisting
+of 512 c-nodes in an eight by eight by eight array with the same
network support.
The Blue Gene/L system delivered to LLNL consists of 128 base
-partitions organized in an eight by four by four grid.
+partitions organized in an eight by four by four array.
The minimal resource allocation unit for applications is one base
partition so that at most 128 simultaneous jobs may execute.
-However, we anticipate a substantially smaller number of simultaneous
-jobs, each using at least a several base partitions.

The c-nodes execute a custom micro-kernel.
System calls that cannot directly be processed by the c-node
@@ -160,14 +162,28 @@ Three distinct communications networks are supported:
a three-dimensional torus with direct nearest-neighbor connections;
a global tree network for broadcast and reduction operations; and
a barrier network for synchronization.
+The torus network connects each node to
+its nearest neighbors in the X, Y and Z directions for a
+total of six such connections per node.

-Mesh vs. Torus
-Overhead of changes (e.g. reboot nodes)
+Service node, Front-end-node.
+
+Mesh vs. Torus.
+Wiring rules.
+
+Overhead of starting a new job (e.g. reboot nodes).
Etc.
+Be careful not to use non-public information (don't use
+information directly from the "IBM Confidential" documents).

\section{Architecture of SLURM}

+Only a brief description of SLURM architecture and implementation is provided
+here.
+A more thorough treatment of the SLURM design and implementation is
+available \cite{SLURM2002}.
+
Several SLURM features make it well suited to serve as a
resource manager for Blue Gene/L.
@@ -213,20 +229,6 @@ work (normally a parallel job) on the set of allocated nodes.
Finally, it arbitrates conflicting requests for resources by managing a
queue of pending work.

-Users interact with SLURM through five command line utilities: \srun\
-for submitting a job for execution and optionally controlling it
-interactively, \scancel\ for terminating a pending or running job,
-\squeue\ for monitoring job queues, and \sinfo\ for monitoring partition
-and overall system state. The \smap\ command combines some of
-the capabilities of both \sinfo\ and \squeue\, but displays the
-information in a graphical fashion that reflects the system topography.
-System administrators perform privileged operations through an
-additional command line utility, {\tt scontrol}.
-
-The central controller daemon, \slurmctld\, maintains the global
-state and directs operations.
Compute nodes simply run a \slurmd\ daemon
-(similar to a remote shell daemon) to export control to SLURM.
-
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/arch.eps,scale=0.35}}
\caption{\small SLURM architecture}
\label{arch}
\end{figure}
@@ -236,8 +238,32 @@ state and directs operations. Compute nodes simply run a \slurmd\ daemon
As shown in Figure~\ref{arch}, SLURM consists of a \slurmd\ daemon
running on each compute node, a central \slurmctld\ daemon running on
a management node (with optional fail-over twin), and five command
-line utilities: {\tt srun}, {\tt scancel}, {\tt sinfo}, {\tt squeue},
-and {\tt scontrol}, which can run anywhere in the cluster.
+line utilities: \srun, \scancel, \sinfo, \squeue, and \scontrol,
+which can run anywhere in the cluster.
+
+The central controller daemon, \slurmctld\, maintains the global
+state and directs operations.
+\slurmctld\ monitors the state of nodes (through {\tt slurmd}),
+groups nodes into partitions with various constraints,
+manages a queue of pending work, and
+allocates resources to pending jobs and job steps.
+\slurmctld\ does not directly execute any user jobs, but
+provides overall management of jobs and resources.
+
+Compute nodes simply run a \slurmd\ daemon (similar to a remote
+shell daemon) to export control to SLURM.
+Each \slurmd\ monitors machine status,
+performs remote job execution, manages the job's I/O, and otherwise
+manages the jobs and job steps for its execution host.
+
+Users interact with SLURM through four command line utilities:
+\srun\ for submitting a job for execution and optionally controlling
+it interactively,
+\scancel\ for signaling or terminating a pending or running job,
+\squeue\ for monitoring job queues, and
+\sinfo\ for monitoring partition and overall system state.
+System administrators perform privileged operations through an
+additional command line utility, {\tt scontrol}.
The entities managed by these SLURM daemons include {\em nodes}, the compute resource in SLURM, {\em partitions}, which group nodes into @@ -245,11 +271,10 @@ logical disjoint sets, {\em jobs}, or allocations of resources assigned to a user for a specified amount of time, and {\em job steps}, which are sets of (possibly parallel) tasks within a job. Each job in the priority-ordered queue is allocated nodes within a single partition. -Once an allocation request fails, no lower priority jobs for that -partition will be considered for a resource allocation. Once a job is -assigned a set of nodes, the user is able to initiate parallel work in -the form of job steps in any configuration within the allocation. For -instance, a single job step may be started that utilizes all nodes +Once a job is assigned a set of nodes, the user is able to initiate +parallel work in the form of job steps in any configuration within +the job's allocation. +For instance, a single job step may be started that utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation. @@ -267,833 +292,79 @@ in Partition 2 has only one job step using half of the original job allocation. That job might initiate additional job steps to utilize the remaining nodes of its allocation. -\begin{figure}[tb] -\centerline{\epsfig{file=../figures/slurm-arch.eps,scale=0.5}} -\caption{SLURM architecture - subsystems} -\label{archdetail} -\end{figure} - -Figure~\ref{archdetail} shows the subsystems that are implemented -within the \slurmd\ and \slurmctld\ daemons. These subsystems are -explained in more detail below. - -\subsection{Slurmd} - -\slurmd\ is a multi-threaded daemon running on each compute node and -can be compared to a remote shell daemon: it reads the common SLURM -configuration file and saved state information, -notifies the controller that it is active, waits -for work, executes the work, returns status, then waits for more work. 
-Because it initiates jobs for other users, it must run as user {\em root}. -It also asynchronously exchanges node and job status with {\tt slurmctld}. -The only job information it has at any given time pertains to its -currently executing jobs. \slurmd\ has five major components: - -\begin{itemize} -\item {\tt Machine and Job Status Services}: Respond to controller -requests for machine and job state information and send asynchronous -reports of some state changes (e.g., \slurmd\ startup) to the controller. - -\item {\tt Remote Execution}: Start, manage, and clean up after a set -of processes (typically belonging to a parallel job) as dictated by -the \slurmctld\ daemon or an \srun\ or \scancel\ command. Starting a -process may include executing a prolog program, setting process limits, -setting real and effective uid, establishing environment variables, -setting working directory, allocating interconnect resources, setting core -file paths, initializing stdio, and managing process groups. Terminating -a process may include terminating all members of a process group and -executing an epilog program. - -\item {\tt Stream Copy Service}: Allow handling of stderr, stdout, and -stdin of remote tasks. Job input may be redirected -from a single file or multiple files (one per task), an -\srun\ process, or /dev/null. Job output may be saved into local files or -returned to the \srun\ command. Regardless of the location of stdout/err, -all job output is locally buffered to avoid blocking local tasks. - -\item {\tt Job Control}: Allow asynchronous interaction with the Remote -Execution environment by propagating signals or explicit job termination -requests to any set of locally managed processes. - -\end{itemize} - -\subsection{Slurmctld} - -Most SLURM state information exists in {\tt slurmctld}, also known as -the controller. \slurmctld\ is multi-threaded with independent read -and write locks for the various data structures to enhance scalability. 
-When \slurmctld\ starts, it reads the SLURM configuration file and -any previously saved state information. Full controller state -information is written to disk periodically, with incremental changes -written to disk immediately for fault tolerance. \slurmctld\ runs in -either master or standby mode, depending on the state of its fail-over -twin, if any. \slurmctld\ need not execute as user {\em root}. In fact, -it is recommended that a unique user entry be created for executing -\slurmctld\ and that user must be identified in the SLURM configuration -file as {\tt SlurmUser}. \slurmctld\ has three major components: - - -\begin{itemize} -\item {\tt Node Manager}: Monitors the state of each node in the cluster. -It polls {\tt slurmd}s for status periodically and receives state -change notifications from \slurmd\ daemons asynchronously. It ensures -that nodes have the prescribed configuration before being considered -available for use. - -\item {\tt Partition Manager}: Groups nodes into non-overlapping sets -called partitions. Each partition can have associated with it -various job limits and access controls. The Partition Manager also -allocates nodes to jobs based on node and partition states and -configurations. Requests to initiate jobs come from the Job Manager. -\scontrol\ may be used to administratively alter node and partition -configurations. - -\item {\tt Job Manager}: Accepts user job requests and places pending jobs -in a priority-ordered queue. The Job Manager is awakened on a periodic -basis and whenever there is a change in state that might permit a job to -begin running, such as job completion, job submission, partition {\em up} -transition, node {\em up} transition, etc. The Job Manager then makes -a pass through the priority-ordered job queue. The highest priority -jobs for each partition are allocated resources as possible. 
As soon as -an allocation failure occurs for any partition, no lower-priority jobs -for that partition are considered for initiation. After completing the -scheduling cycle, the Job Manager's scheduling thread sleeps. Once a -job has been allocated resources, the Job Manager transfers necessary -state information to those nodes, permitting it to commence execution. -When the Job Manager detects that all nodes associated with a job -have completed their work, it initiates cleanup and performs another -scheduling cycle as described above. - -\end{itemize} - -\subsection{Command Line Utilities} - -The command line utilities offer users access to remote execution and -job control. They also permit administrators to dynamically change -the system configuration. These commands use SLURM APIs that are -directly available for more sophisticated applications. - -\begin{itemize} -\item {\tt scancel}: Cancel a running or a pending job or job step, -subject to authentication and authorization. This command can also be -used to send an arbitrary signal to all processes on all nodes associated -with a job or job step. - -\item {\tt scontrol}: Perform privileged administrative commands -such as bringing down a node or partition in preparation for maintenance. -Many \scontrol\ functions can only be executed by privileged users. - -\item {\tt sinfo}: Display a summary of partition and node information. -An assortment of filtering and output format options are available. - -\item {\tt squeue}: Display the queue of running and waiting jobs and/or -job steps. A wide assortment of filtering, sorting, and output format -options are available. - -\item {\tt srun}: Allocate resources, submit jobs to the SLURM queue, -and initiate parallel tasks (job steps). Every set of executing parallel -tasks has an associated \srun\ that initiated it and, if the \srun\ -persists, manages it. Jobs may be submitted for later execution -(e.g., batch), in which case \srun\ terminates after job submission. 
-Jobs may also be submitted for interactive execution, where \srun\ keeps -running to shepherd the running job. In this case, \srun\ negotiates -connections with remote {\tt slurmd}s for job initiation and to get -stdout and stderr, forward stdin,\footnote{\srun\ command line options -select the stdin handling method, such as broadcast to all tasks, or -send only to task 0.} and respond to signals from the user. \srun\ -may also be instructed to allocate a set of resources and spawn a shell -with access to those resources. - -\end{itemize} - -\subsection{Plugins} - In order to simplify the use of different infrastructures, SLURM uses a general purpose plugin mechanism. A SLURM plugin is a dynamically linked code object that is loaded explicitly at run time by the SLURM libraries. A plugin provides a customized implementation of a well-defined API connected to tasks such as authentication, -interconnect fabric, and task scheduling. A common set of functions is defined +or job scheduling. A common set of functions is defined for use by all of the different infrastructures of a particular variety. For example, the authentication plugin must define functions such as {\tt slurm\_auth\_create} to create a credential, {\tt slurm\_auth\_verify} -to verify a credential to approve or deny authentication, -{\tt slurm\_auth\_get\_uid} to get the uid associated with a specific -credential, etc. It also must define the data structure used, a plugin -type, a plugin version number, etc. When a SLURM daemon is initiated, it +to verify a credential to approve or deny authentication, etc. +When a SLURM command or daemon is initiated, it reads the configuration file to determine which of the available plugins -should be used. For example {\em AuthType=auth/authd} says to use the -plugin for authd based authentication and {\em PluginDir=/usr/local/lib} +should be used. 
For example {\em AuthType=auth/munge} says to use the +plugin for munge based authentication and {\em PluginDir=/usr/local/lib} identifies the directory in which to find the plugin. -\subsection{Communications Layer} - -SLURM presently uses Berkeley sockets for communications. However, -we anticipate using the plugin mechanism to permit use of other -communications layers. At LLNL we are using an ethernet network -for SLURM communications and the Quadrics Elan switch exclusively -for user applications. The SLURM configuration file permits the -identification of each node's hostname as well as its name to be used -for communications. In the case of a control machine known as {\em mcri} -to be communicated with using the name {\em emcri} (say to indicate an -ethernet communications path), this is represented in the configuration -file as {\em ControlMachine=mcri ControlAddr=emcri}. The name used for -communication is the same as the hostname unless otherwise specified. - -Internal SLURM functions pack and unpack data structures in machine -independent format. We considered the use of XML style messages, but we -felt this would adversely impact performance (albeit slightly). If XML -support is desired, it is straightforward to perform a translation and -use the SLURM APIs. - - -\subsection{Security} - -SLURM has a simple security model: any user of the cluster may submit -parallel jobs to execute and cancel his own jobs. Any user may view -SLURM configuration and state information. Only privileged users -may modify the SLURM configuration, cancel any job, or perform other -restricted activities. Privileged users in SLURM include the users -{\em root} and {\em SlurmUser} (as defined in the SLURM configuration file). -If permission to modify SLURM configuration is required by others, set-uid -programs may be used to grant specific permissions to specific users. 
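Pulling together the configuration directives quoted in this paper ({\em AuthType}, {\em PluginDir}, {\em ControlMachine}/{\em ControlAddr}, {\em SlurmUser}, {\em NodeName}), a minimal configuration fragment might look like the following; the host names and values are illustrative only:

```
# Controller host, addressed by a separate name for its ethernet interface
ControlMachine=mcri
ControlAddr=emcri
# Unprivileged account under which slurmctld runs
SlurmUser=slurm
# Authentication plugin and where to find plugins
AuthType=auth/munge
PluginDir=/usr/local/lib
# Node definitions using the numeric range syntax
NodeName=linux[001-512] CPUs=4 RealMemory=1024 TmpDisk=4096 Weight=4 Feature=Linux
```

A single such file is read by every SLURM daemon and command, as noted in the Configuration subsection.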
- -\subsubsection{Communication Authentication.} - -Historically, inter-node authentication has been accomplished via the use -of reserved ports and set-uid programs. In this scheme, daemons check the -source port of a request to ensure that it is less than a certain value -and thus only accessible by {\em root}. The communications over that -connection are then implicitly trusted. Because reserved ports are a -limited resource and set-uid programs are a possible security concern, -we have employed a credential-based authentication scheme that -does not depend on reserved ports. In this design, a SLURM authentication -credential is attached to every message and authoritatively verifies the -uid and gid of the message originator. Once recipients of SLURM messages -verify the validity of the authentication credential, they can use the uid -and gid from the credential as the verified identity of the sender. - -The actual implementation of the SLURM authentication credential is -relegated to an ``auth'' plugin. We presently have implemented three -functional authentication plugins: authd\cite{Authd2002}, -Munge, and none. The ``none'' authentication type employs a null -credential and is only suitable for testing and networks where security -is not a concern. Both the authd and Munge implementations employ -cryptography to generate a credential for the requesting user that -may then be authoritatively verified on any remote nodes. However, -authd assumes a secure network and Munge does not. Other authentication -implementations, such as a credential based on Kerberos, should be easy -to develop using the auth plugin API. - -\subsubsection{Job Authentication.} - -When resources are allocated to a user by the controller, a ``job step -credential'' is generated by combining the uid, job id, step id, the list -of resources allocated (nodes), and the credential lifetime and signing -the result with the \slurmctld\ private key. 
This credential grants the -user access to allocated resources and removes the burden from \slurmd\ -to contact the controller to verify requests to run processes. \slurmd\ -verifies the signature on the credential against the controller's public -key and runs the user's request if the credential is valid. Part of the -credential signature is also used to validate stdout, stdin, -and stderr connections from \slurmd\ to \srun . - -\subsubsection{Authorization.} - -Access to partitions may be restricted via a {\em RootOnly} flag. -If this flag is set, job submit or allocation requests to this partition -are only accepted if the effective uid originating the request is a -privileged user. A privileged user may submit a job as any other user. -This may be used, for example, to provide specific external schedulers -with exclusive access to partitions. Individual users will not be -permitted to directly submit jobs to such a partition, which would -prevent the external scheduler from effectively managing it. Access to -partitions may also be restricted to users who are members of specific -Unix groups using a {\em AllowGroups} specification. - -\section{Slurmctld Design} - -\slurmctld\ is modular and multi-threaded with independent read and -write locks for the various data structures to enhance scalability. -The controller includes the following subsystems: Node Manager, Partition -Manager, and Job Manager. Each of these subsystems is described in -detail below. - -\subsection{Node Management} - -The Node Manager monitors the state of nodes. Node information monitored -includes: - -\begin{itemize} -\item Count of processors on the node -\item Size of real memory on the node -\item Size of temporary disk storage -\item State of node (RUN, IDLE, DRAINED, etc.) 
-\item Weight (preference in being allocated work) -\item Feature (arbitrary description) -\item IP address -\end{itemize} - -The SLURM administrator can specify a list of system node names using -a numeric range in the SLURM configuration file or in the SLURM tools -(e.g., ``{\em NodeName=linux[001-512] CPUs=4 RealMemory=1024 TmpDisk=4096 \linebreak -Weight=4 Feature=Linux}''). These values for CPUs, RealMemory, and -TmpDisk are considered to be the minimal node configuration values -acceptable for the node to enter into service. The \slurmd\ registers -whatever resources actually exist on the node, and this is recorded -by the Node Manager. Actual node resources are checked on \slurmd\ -initialization and periodically thereafter. If a node registers with -less resources than configured, it is placed in DOWN state and -the event logged. Otherwise, the actual resources reported are recorded -and possibly used as a basis for scheduling (e.g., if the node has more -RealMemory than recorded in the configuration file, the actual node -configuration may be used for determining suitability for any application; -alternately, the data in the configuration file may be used for possibly -improved scheduling performance). Note the node name syntax with numeric -range permits even very large heterogeneous clusters to be described in -only a few lines. In fact, a smaller number of unique configurations -can provide SLURM with greater efficiency in scheduling work. - -{\em Weight} is used to order available nodes in assigning work to -them. In a heterogeneous cluster, more capable nodes (e.g., larger memory -or faster processors) should be assigned a larger weight. The units -are arbitrary and should reflect the relative value of each resource. -Pending jobs are assigned the least capable nodes (i.e., lowest weight) -that satisfy their requirements. This tends to leave the more capable -nodes available for those jobs requiring those capabilities. 
- -{\em Feature} is an arbitrary string describing the node, such as a -particular software package, file system, or processor speed. While the -feature does not have a numeric value, one might include a numeric value -within the feature name (e.g., ``1200MHz'' or ``16GB\_Swap''). If the -nodes on the cluster have disjoint features (e.g., different ``shared'' -file systems), one should identify these as features (e.g., ``FS1'', -``FS2'', etc.). Programs may then specify that all nodes allocated to -it should have the same feature, but that any of the specified features -are acceptable (e.g., ``$Feature=FS1|FS2|FS3$'' means the job should be -allocated nodes that all have the feature ``FS1'' or they all have feature -``FS2,'' etc.). - -Node records are kept in an array with hash table lookup. If nodes are -given names containing sequence numbers (e.g., ``lx01'', ``lx02'', etc.), -the hash table permits specific node records to be located very quickly; -therefore, this is our recommended naming convention for larger clusters. - -An API is available to view any of this information and to update some -node information (e.g., state). APIs designed to return SLURM state -information permit the specification of a time stamp. If the requested -data has not changed since the time stamp specified by the application, -the application's current information need not be updated. The API -returns a brief ``No Change'' response rather than returning relatively -verbose state information. Changes in node configurations (e.g., node -count, memory, etc.) or the nodes actually in the cluster should be -reflected in the SLURM configuration files. SLURM configuration may be -updated without disrupting any jobs. - -\subsection{Partition Management} - -The Partition Manager identifies groups of nodes to be used for execution -of user jobs. One might consider this the actual resource scheduling -component. 
Data associated with a partition includes: - -\begin{itemize} -\item Name -\item RootOnly flag to indicate that only users {\em root} or -{\tt SlurmUser} may allocate resources in this partition (for any user) -\item List of associated nodes -\item State of partition (UP or DOWN) -\item Maximum time limit for any job -\item Minimum and maximum nodes allocated to any single job -\item List of groups permitted to use the partition (defaults to ALL) -\item Shared access (YES, NO, or FORCE) -\item Default partition (if no partition is specified in a job request) -\end{itemize} - -It is possible to alter most of this data in real-time in order to affect -the scheduling of pending jobs (currently executing jobs would not be -affected). This information is confined to the controller machine(s) -for better scalability. It is used by the Job Manager (and possibly an -external scheduler), which either exist only on the control machine or -communicate only with the control machine. - -The nodes in a partition may be designated for exclusive or non-exclusive -use by a job. A {\tt shared} value of YES indicates that jobs may -share nodes on request. A {\tt shared} value of NO indicates that -jobs are always given exclusive use of allocated nodes. A {\tt shared} -value of FORCE indicates that jobs are never ensured exclusive -access to nodes, but SLURM may initiate multiple jobs on the nodes -for improved system utilization and responsiveness. In this case, -job requests for exclusive node access are not honored. Non-exclusive -access may negatively impact the performance of parallel jobs or cause -them to fail upon exhausting shared resources (e.g., memory or disk -space). However, shared resources may improve overall system utilization -and responsiveness. The proper support of shared resources, including -enforcement of limits on these resources, entails a substantial amount of -effort, which we are not presently planning to expend. 
However, we have -designed SLURM so as to not preclude the addition of such a capability -at a later time if so desired. Future enhancements could include -constraining jobs to a specific CPU count or memory size within a -node, which could be used to effectively space-share individual nodes. -The Partition Manager will allocate nodes to pending jobs on request -from the Job Manager. - -Submitted jobs can specify desired partition, time limit, node count -(minimum and maximum), CPU count (minimum) task count, the need for -contiguous node assignment, and an explicit list of nodes to be included -and/or excluded in its allocation. Nodes are selected so as to satisfy -all job requirements. For example, a job requesting four CPUs and four -nodes will actually be allocated eight CPUs and four nodes in the case -of all nodes having two CPUs each. The request may also indicate node -configuration constraints such as minimum real memory or CPUs per node, -required features, shared access, etc. Overall there are 13 different -parameters that may identify resource requirements for a job. - -Nodes are selected for possible assignment to a job based on the -job's configuration requirements (e.g., partition specification, minimum -memory, temporary disk space, features, node list, etc.). The selection -is refined by determining which nodes are up and available for use. -Groups of nodes are then considered in order of weight, with the nodes -having the lowest {\em Weight} preferred. Finally, the physical location -of the nodes is considered. - -Bit maps are used to indicate which nodes are up, idle, associated -with each partition, and associated with each unique configuration. -This technique permits scheduling decisions to normally be made by -performing a small number of tests followed by fast bit map manipulations. 
-If so configured, a job's resource requirements would be compared with -the (relatively small number of) node configuration records, each of -which has an associated bit map. Usable node configuration bitmaps would -be ANDed with the selected partitions bit map ANDed with the UP node -bit map and possibly ANDed with the IDLE node bit map (this last test -depends on the desire to share resources). This method can eliminate -tens of thousands of individual node configuration comparisons that -would otherwise be required in large heterogeneous clusters. - -The actual selection of nodes for allocation to a job is currently tuned -for the Quadrics interconnect. This hardware supports hardware message -broadcast only if the nodes are contiguous. If a job is not allocated -contiguous nodes, a slower software based multi-cast mechanism is used. -Jobs will be allocated continuous nodes to the extent possible (in -fact, contiguous node allocation may be specified as a requirement on -job submission). If contiguous nodes cannot be allocated to a job, it -will be allocated resources from the minimum number of sets of contiguous -nodes possible. If multiple sets of contiguous nodes can be allocated -to a job, the one that most closely fits the job's requirements will -be used. This technique will leave the largest continuous sets of nodes -intact for jobs requiring them. - -The Partition Manager builds a list of nodes to satisfy a job's request. -It also caches the IP addresses of each node and provides this information -to \srun\ at job initiation time for improved performance. - -The failure of any node to respond to the Partition Manager only affects -jobs associated with that node. In fact, a job may indicate it should -continue executing even if allocated nodes cease responding. In this -case, the job needs to provide for its own fault tolerance. All other -jobs and nodes in the cluster will continue to operate after a node -failure. 
No additional work is allocated to the failed node, and it will -be pinged periodically to determine when it resumes responding. The node -may then be returned to service (depending on the {\tt ReturnToService} -parameter in the SLURM configuration). - -\subsection{Configuration} - -A single configuration file applies to all SLURM daemons and commands. -Most of this information is used only by the controller. Only the -host and port information is referenced by most commands. - -\subsection{Job Manager} - -There are a multitude of parameters associated with each job, including: -\begin{itemize} -\item Job name -\item Uid -\item Job id -\item Working directory -\item Partition -\item Priority -\item Node constraints (processors, memory, features, etc.) -\end{itemize} - -Job records have an associated hash table for rapidly locating -specific records. They also have bit maps of requested and/or -allocated nodes (as described above). - -The core functions supported by the Job Manager include: -\begin{itemize} -\item Request resource (job may be queued) -\item Reset priority of a job -\item Status job (including node list, memory and CPU use data) -\item Signal job (send arbitrary signal to all processes associated - with a job) -\item Terminate job (remove all processes) -\item Change node count of running job (could fail if insufficient -resources are available) -%\item Preempt/resume job (future) -%\item Checkpoint/restart job (future) - -\end{itemize} - -Jobs are placed in a priority-ordered queue and allocated nodes as -selected by the Partition Manager. SLURM implements a very simple default -scheduling algorithm, namely FIFO. An attempt is made to schedule pending -jobs on a periodic basis and whenever any change in job, partition, -or node state might permit the scheduling of a job. - -We are aware that this scheduling algorithm does not satisfy the needs -of many customers, and we provide the means for establishing other -scheduling algorithms. 
Before a newly arrived job is placed into the -queue, an external scheduler plugin assigns its initial priority. -A plugin function is also called at the start of each scheduling -cycle to modify job or system state as desired. SLURM APIs permit an -external entity to alter the priorities of jobs at any time and re-order -the queue as desired. The Maui Scheduler \cite{Jackson2001,Maui2002} -is one example of an external scheduler suitable for use with SLURM. - -LLNL uses DPCS \cite{DPCS2002} as SLURM's external scheduler. -DPCS is a meta-scheduler with flexible scheduling algorithms that -suit our needs well. -It also provides the scalability required for this application. -DPCS maintains pending job state internally and only transfers the -jobs to SLURM (or another underlying resources manager) only when -they are to begin execution. -By not transferring jobs to a particular resources manager earlier, -jobs are assured of being initiated on the first resource satisfying -their requirements, whether a Linux cluster with SLURM or an IBM SP -with LoadLeveler (assuming a highly flexible application). -This mode of operation may also be suitable for computational grid -schedulers. - -In a future release, the Job Manager will collect resource consumption -information (CPU time used, CPU time allocated, and real memory used) -associated with a job from the \slurmd\ daemons. Presently, only the -wall-clock run time of a job is monitored. When a job approaches its -time limit (as defined by wall-clock execution time) or an imminent -system shutdown has been scheduled, the job is terminated. The actual -termination process is to notify \slurmd\ daemons on nodes allocated -to the job of the termination request. The \slurmd\ job termination -procedure, including job signaling, is described in Section~\ref{slurmd}. - -One may think of a job as described above as an allocation of resources -rather than a collection of parallel tasks. 
The job script executes -\srun\ commands to initiate the parallel tasks or ``job steps. '' The -job may include multiple job steps, executing sequentially and/or -concurrently either on separate or overlapping nodes. Job steps have -associated with them specific nodes (some or all of those associated with -the job), tasks, and a task distribution (cyclic or block) over the nodes. - -The management of job steps is considered a component of the Job Manager. -Supported job step functions include: -\begin{itemize} -\item Register job step -\item Get job step information -\item Run job step request -\item Signal job step -\end{itemize} - -Job step information includes a list of nodes (entire set or subset of -those allocated to the job) and a credential used to bind communications -between the tasks across the interconnect. The \slurmctld\ constructs -this credential and sends it to the \srun\ initiating the job step. - -\subsection{Fault Tolerance} -SLURM supports system level fault tolerance through the use of a secondary -or ``backup'' controller. The backup controller, if one is configured, -periodically pings the primary controller. Should the primary controller -cease responding, the backup loads state information from the last state -save and assumes control. When the primary controller is returned to -service, it tells the backup controller to save state and terminate. -The primary then loads state and assumes control. - -SLURM utilities and API users read the configuration file and initially try -to contact the primary controller. Should that attempt fail, an attempt -is made to contact the backup controller before returning an error. - -SLURM attempts to minimize the amount of time a node is unavailable -for work. Nodes assigned to jobs are returned to the partition as -soon as they successfully clean up user processes and run the system -epilog. 
In this manner, -those nodes that fail to successfully run the system epilog, or those -with unkillable user processes, are held out of the partition while -the remaining nodes are quickly returned to service. - -SLURM considers neither the crash of a compute node nor termination -of \srun\ as a critical event for a job. Users may specify on a per-job -basis whether the crash of a compute node should result in the premature -termination of their job. Similarly, if the host on which \srun\ is -running crashes, the job continues execution and no output is lost. - - -\section{Slurmd Design}\label{slurmd} - -The \slurmd\ daemon is a multi-threaded daemon for managing user jobs -and monitoring system state. Upon initiation it reads the configuration -file, recovers any saved state, captures system state, -attempts an initial connection to the SLURM -controller, and awaits requests. It services requests for system state, -accounting information, job initiation, job state, job termination, -and job attachment. On the local node it offers an API to translate -local process ids into SLURM job id. - -The most common action of \slurmd\ is to report system state on request. -Upon \slurmd\ startup and periodically thereafter, it gathers the -processor count, real memory size, and temporary disk space for the -node. Should those values change, the controller is notified. In a -future release of SLURM, \slurmd\ will also capture CPU and real-memory and -virtual-memory consumption from the process table entries for uploading -to {\tt slurmctld}. - -%FUTURE: Another thread is -%created to capture CPU, real-memory and virtual-memory consumption from -%the process table entries. Differences in resource utilization values -%from one process table snapshot to the next are accumulated. 
\slurmd\ -%insures these accumulated values are not decremented if resource -%consumption for a user happens to decrease from snapshot to snapshot, -%which would simply reflect the termination of one or more processes. -%Both the real and virtual memory high-water marks are recorded and -%the integral of memory consumption (e.g. megabyte-hours). Resource -%consumption is grouped by uid and SLURM job id (if any). Data -%is collected for system users ({\em root}, {\em ftp}, {\em ntp}, -%etc.) as well as customer accounts. -%The intent is to capture all resource use including -%kernel, idle and down time. Upon request, the accumulated values are -%uploaded to \slurmctld\ and cleared. - -\slurmd\ accepts requests from \srun\ and \slurmctld\ to initiate -and terminate user jobs. The initiate job request contains such -information as real uid, effective uid, environment variables, working -directory, task numbers, job step credential, interconnect specifications and -authorization, core paths, SLURM job id, and the command line to execute. -System-specific programs can be executed on each allocated node prior -to the initiation of a user job and after the termination of a user -job (e.g., {\em Prolog} and {\em Epilog} in the configuration file). -These programs are executed as user {\em root} and can be used to -establish an appropriate environment for the user (e.g., permit logins, -disable logins, terminate orphan processes, etc.). \slurmd\ executes -the prolog program, resets its session id, and then initiates the job -as requested. It records to disk the SLURM job id, session id, process -id associated with each task, and user associated with the job. In the -event of \slurmd\ failure, this information is recovered from disk in -order to identify active jobs. 
- -When \slurmd\ receives a job termination request from the SLURM -controller, it sends SIGTERM to all running tasks in the job, -waits for {\em KillWait} seconds (as specified in the configuration -file), then sends SIGKILL. If the processes do not terminate \slurmd\ -notifies \slurmctld , which logs the event and sets the node's state -to DRAINED. After all processes have terminated, \slurmd\ executes the -configured epilog program, if any. - -\section{Command Line Utilities} - -\subsection{scancel} - -\scancel\ terminates queued jobs or signals running jobs or job steps. -The default signal is SIGKILL, which indicates a request to terminate -the specified job or job step. \scancel\ identifies the job(s) to -be signaled through user specification of the SLURM job id, job step id, -user name, partition name, and/or job state. If a job id is supplied, -all job steps associated with the job are affected as well as the job -and its resource allocation. If a job step id is supplied, only that -job step is affected. \scancel\ can only be executed by the job's owner -or a privileged user. - -\subsection{scontrol} - -\scontrol\ is a tool meant for SLURM administration by user {\em root}. -It provides the following capabilities: -\begin{itemize} -\item {\tt Shutdown}: Cause \slurmctld\ and \slurmd\ to save state -and terminate. -\item {\tt Reconfigure}: Cause \slurmctld\ and \slurmd\ to reread the -configuration file. -\item {\tt Ping}: Display the status of primary and backup \slurmctld\ daemons. -\item {\tt Show Configuration Parameters}: Display the values of general SLURM -configuration parameters such as locations of files and values of timers. -\item {\tt Show Job State}: Display the state information of a particular job -or all jobs in the system. -\item {\tt Show Job Step State}: Display the state information of a particular -job step or all job steps in the system. 
-\item {\tt Show Node State}: Display the state and configuration information -of a particular node, a set of nodes (using numeric ranges syntax to -identify their names), or all nodes. -\item {\tt Show Partition State}: Display the state and configuration -information of a particular partition or all partitions. -\item {\tt Update Job State}: Update the state information of a particular job -in the system. Note that not all state information can be changed in this -fashion (e.g., the nodes allocated to a job). -\item {\tt Update Node State}: Update the state of a particular node. Note -that not all state information can be changed in this fashion (e.g., the -amount of memory configured on a node). In some cases, you may need -to modify the SLURM configuration file and cause it to be reread -using the ``Reconfigure'' command described above. -\item {\tt Update Partition State}: Update the state of a partition -node. Note that not all state information can be changed in this fashion -(e.g., the default partition). In some cases, you may need to modify -the SLURM configuration file and cause it to be reread using the -``Reconfigure'' command described above. -\end{itemize} - -\subsection{squeue} - -\squeue\ reports the state of SLURM jobs. It can filter these -jobs input specification of job state (RUN, PENDING, etc.), job id, -user name, job name, etc. If no specification is supplied, the state of -all pending and running jobs is reported. -\squeue\ also has a variety of sorting and output options. - -\subsection{sinfo} - -\sinfo\ reports the state of SLURM partitions and nodes. By default, -it reports a summary of partition state with node counts and a summary -of the configuration of those nodes. A variety of sorting and -output formatting options exist. - -\subsection{srun} - -\srun\ is the user interface to accessing resources managed by SLURM. 
-Users may utilize \srun\ to allocate resources, submit batch jobs, -run jobs interactively, attach to currently running jobs, or launch a -set of parallel tasks (job step) for a running job. \srun\ supports a -full range of options to specify job constraints and characteristics, -for example minimum real memory, temporary disk space, and CPUs per node, -as well as time limits, stdin/stdout/stderr handling, signal handling, -and working directory for job. - -The \srun\ utility can run in four different modes: {\em interactive}, -in which the \srun\ process remains resident in the user's session, -manages stdout/stderr/stdin, and forwards signals to the remote tasks; -{\em batch}, in which \srun\ submits a job script to the SLURM queue for -later execution; {\em allocate}, in which \srun\ requests resources from -the SLURM controller and spawns a shell with access to those resources; -{\em attach}, in which \srun\ attaches to a currently -running job and displays stdout/stderr in real time from the remote -tasks. - -\section{Job Initiation Design} - -There are three modes in which jobs may be run by users under SLURM. The -first and most simple mode is {\em interactive} mode, in which stdout and -stderr are displayed on the user's terminal in real time, and stdin and -signals may be forwarded from the terminal transparently to the remote -tasks. The second mode is {\em batch} or {\em queued} mode, in which the job is -queued until the request for resources can be satisfied, at which time the -job is run by SLURM as the submitting user. In the third mode, {\em allocate} -mode, a job is allocated to the requesting user, under which the user may -manually run job steps via a script or in a sub-shell spawned by \srun . - -\begin{figure}[tb] -\centerline{\epsfig{file=../figures/connections.eps,scale=0.35}} -\caption{\small Job initiation connections overview. 1. \srun\ connects to - \slurmctld\ requesting resources. 2. 
\slurmctld\ issues a response, - with list of nodes and job step credential. 3. \srun\ opens a listen - port for job IO connections, then sends a run job step - request to \slurmd . 4. \slurmd initiates job step and connects - back to \srun\ for stdout/err } -\label{connections} -\end{figure} - -Figure~\ref{connections} shows a high-level depiction of the connections -that occur between SLURM components during a general interactive -job startup. \srun\ requests a resource allocation and job step -initiation from the {\tt slurmctld}, which responds with the job id, -list of allocated nodes, job step credential, etc. if the request is granted, -\srun\ then initializes a listen port for stdio connections and connects -to the {\tt slurmd}s on the allocated nodes requesting that the remote -processes be initiated. The {\tt slurmd}s begin execution of the tasks and -connect back to \srun\ for stdout and stderr. This process is described -in more detail below. Details of the batch and allocate modes of operation -are not presented due to space constraints. - -\subsection{Interactive Job Initiation} - -\begin{figure}[tb] -\centerline{\epsfig{file=../figures/interactive-job-init.eps,scale=0.45} } -\caption{\small Interactive job initiation. \srun\ simultaneously allocates - nodes and a job step from \slurmctld\ then sends a run request to all - {\tt slurmd}s in job. Dashed arrows indicate a periodic request that - may or may not occur during the lifetime of the job} -\label{init-interactive} -\end{figure} - -Interactive job initiation is shown in -Figure~\ref{init-interactive}. The process begins with a user invoking -\srun\ in interactive mode. In Figure~\ref{init-interactive}, the user -has requested an interactive run of the executable ``{\tt cmd}'' in the -default partition. - -After processing command line options, \srun\ sends a message to -\slurmctld\ requesting a resource allocation and a job step initiation. 
-This message simultaneously requests an allocation (or job) and a job -step. \srun\ waits for a reply from {\tt slurmctld}, which may not come -instantly if the user has requested that \srun\ block until resources are -available. When resources are available for the user's job, \slurmctld\ -replies with a job step credential, list of nodes that were allocated, -cpus per node, and so on. \srun\ then sends a message each \slurmd\ on -the allocated nodes requesting that a job step be initiated. -The \slurmd\ daemons verify that the job is valid using the forwarded job -step credential and then respond to \srun . - -Each \slurmd\ invokes a job manager process to handle the request, which -in turn invokes a session manager process that initializes the session for -the job step. An IO thread is created in the job manager that connects -all tasks' IO back to a port opened by \srun\ for stdout and stderr. -Once stdout and stderr have successfully been connected, the task thread -takes the necessary steps to initiate the user's executable on the node, -initializing environment, current working directory, and interconnect -resources if needed. - -Each \slurmd\ forks a copy of itself that is responsible for the job -step on this node. This local job manager process then creates an -IO thread that initializes stdout, stdin, and stderr streams for each -local task and connects these streams to the remote \srun . Meanwhile, -the job manager forks a session manager process that initializes -the session becomes the requesting user and invokes the user's processes. - -As user processes exit, their exit codes are collected, aggregated when -possible, and sent back to \srun\ in the form of a task exit message. -Once all tasks have exited, the session manager exits, and the job -manager process waits for the IO thread to complete, then exits. 
-The \srun\ process either waits for all tasks to exit, or attempts to
-clean up the remaining processes some time after the first task exits
-(based on user option). Regardless, once all tasks are finished,
-\srun\ sends a message to the \slurmctld\ releasing the allocated nodes,
-then exits with an appropriate exit status.
-
-When the \slurmctld\ receives notification that \srun\ no longer needs
-the allocated nodes, it issues a request for the epilog to be run on
-each of the {\tt slurmd}s in the allocation. As {\tt slurmd}s report that the
-epilog ran successfully, the nodes are returned to the partition.
+\section{Blue Gene/L Specific Resource Management Issues}
+
+SLURM had previously been required to address only a one-dimensional
+topology.
+It was obvious that the resource selection logic would require a major
+redesign. Plugin...
+
+The topology requirements also necessitated the addition of several
+\srun\ options:
+{\em --geometry} to specify the dimensions required by the job,
+{\em --no-rotate} to indicate if the geometry specification may be
+rotated in three dimensions, and
+{\em --node-use} to specify if the second processor on a c-node should
+be used to execute the user application or be used for communications.
+
+The \slurmd\ daemon was designed to execute on the individual SLURM
+nodes to monitor the status of that computer, launch job steps, etc.
+BGL prohibits the execution of SLURM daemons within the base partitions.
+In addition, the base partition was a ...
+\slurmd\ needed to execute on a front-end node....
+Disable job step.
+
+Base partitions are virtual nodes to SLURM.
+
+In order to provide users with a clear view of the BGL topology, a new
+tool was developed.
+\smap\ presents the same type of information as the \sinfo\ and \squeue\
+commands, but graphically displays the location of SLURM nodes
+(BGL base partitions) assigned to jobs or partitions, as shown in
+Table~\ref{smap_out}. 
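The effect of the {\em --no-rotate} option can be illustrated with a minimal sketch. This is not SLURM code: the helper function and its behavior are assumptions based on the paper's description of a geometry request fitting the 8x4x4 array of base partitions, with or without axis rotation.

```python
from itertools import permutations

# Assumption: BGL system at LLNL, 128 base partitions in an 8 x 4 x 4 array.
MACHINE = (8, 4, 4)

def fits(geometry, machine=MACHINE, rotate=True):
    """Return True if a requested geometry fits within the machine.

    rotate=False models the --no-rotate case: the dimensions must fit
    in exactly the order given.  Otherwise any axis permutation of the
    requested geometry may be tried.
    """
    candidates = permutations(geometry) if rotate else [tuple(geometry)]
    return any(all(g <= m for g, m in zip(c, machine)) for c in candidates)

print(fits((4, 8, 2)))                 # True: rotates to 8 x 4 x 2
print(fits((4, 8, 2), rotate=False))   # False: 8 exceeds the second dimension
```

A 4x8x2 request can be satisfied only by rotating it into the machine's long axis; with rotation disabled the same request must be rejected, which is why the option matters for jobs with communication patterns tied to specific axes.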
+
+\begin{table}[t]
+\begin{center}
+
+\begin{tabular}[c]{c}
+\\
+\fbox{
+ \begin{minipage}[c]{1.0\linewidth}
+ {\scriptsize \verbatiminput{smap.output} }
+ \end{minipage}
+}
+\\
+\end{tabular}
+\caption{\label{smap_out} Output of \smap\ command}
+\end{center}
+\end{table}
+
+\section{Blue Gene/L Network Wiring Issues}
+
+TBD
 
 \section{Results}
 
-\begin{figure}[htb]
-\centerline{\epsfig{file=../figures/times.eps,scale=0.7}}
-\caption{\small Time to execute /bin/hostname with various node counts}
-\label{timing}
-\end{figure}
-
-We were able to perform some SLURM tests on a 1000-node cluster
-in November 2002. Some development was still underway at that time
-and tuning had not been performed. The results for executing the
-program {\em /bin/hostname} on two tasks per node and various node
-counts are shown in Figure~\ref{timing}. We found SLURM performance
-to be comparable to the
-Quadrics Resource Management System (RMS) \cite{Quadrics2002} for all
-job sizes and about 80 times faster than IBM LoadLeveler\cite{LL2002}
-at tested job sizes.
+TBD
 
 \section{Future Plans}
 
-\section{Acknowledgments}
-
-SLURM is jointly developed by LLNL and Linux NetworX. 
-Contributors to SLURM development include: -\begin{itemize} -\item Jay Windley of Linux NetworX for his development of the plugin -mechanism and work on the security components -\item Mark Seager and Greg Tomaschke for their support of this project -\end{itemize} +TBD \raggedright % make the bibliography diff --git a/doc/bgl.report/smap.output b/doc/bgl.report/smap.output new file mode 100644 index 00000000000..4b4ebd0fc60 --- /dev/null +++ b/doc/bgl.report/smap.output @@ -0,0 +1,19 @@ + a a a a b b d d ID JOBID PARTITION USER NAME ST TIME NODES NODELIST + a a a a b b d d a 12345 batch joseph tst1 R 43:12 64 bgl[000x333] + a a a a b b c c b 12346 debug chris sim3 R 12:34 16 bgl[420x533] +a a a a b b c c c 12350 debug danny job3 R 0:12 8 bgl[622x733] + d 12356 debug dan colu R 18:05 16 bgl[600x731] + a a a a b b d d e 12378 debug joseph asx4 R 0:34 4 bgl[612x713] + a a a a b b d d + a a a a b b c c +a a a a b b c c + + a a a a . . d d + a a a a . . d d + a a a a . . e e Y +a a a a . . e e | + | + a a a a . . d d 0----X + a a a a . . d d / + a a a a . . . . / +a a a a . . . # Z \ No newline at end of file -- GitLab