From 045700f0b98543ddebd56a3da77623c59e96e674 Mon Sep 17 00:00:00 2001
From: Moe Jette <jette1@llnl.gov>
Date: Thu, 21 Oct 2004 00:17:02 +0000
Subject: [PATCH] Major update. Framework of paper largely in place.

---
 doc/bgl.report/report.tex | 114 ++++++++++++++++++++++++++++++++------
 1 file changed, 98 insertions(+), 16 deletions(-)

diff --git a/doc/bgl.report/report.tex b/doc/bgl.report/report.tex
index e34d1c2597f..3f6377510e6 100644
--- a/doc/bgl.report/report.tex
+++ b/doc/bgl.report/report.tex
@@ -310,26 +310,89 @@ identifies the directory in which to find the plugin.
 
 \section {Blue Gene/L Specific Resource Management Issues}
 
-SLURM was only required to address a one-dimensional topology.
-It was obvious that the resource selection logic would require a major
-redesign. Plugin...
-
-The topology requirements also necessitated the addition of several
+Several issues needed to be addressed for SLURM to support BGL:
+pseudo-nodes representing the base partitions, topology,
+\slurmd\ executing only on the front-end node,
+BGL wiring issues, and use of the BGL-specific APIs.
+The BGL wiring issues are extensive and are addressed in a separate section.
+
+Since a BGL base partition is the minimum allocation unit for a job,
+it was natural to consider each one as an independent SLURM node.
+This meant SLURM would manage a quite reasonable 128 nodes
+rather than tens of thousands of individual c-nodes.
+The \slurmd\ daemon was designed to execute on each SLURM
+node to monitor the status of that node, launch job steps, etc.
+Unfortunately, BGL prohibited the execution of SLURM daemons within
+the base partitions on any of the c-nodes.
+SLURM was therefore forced to execute one \slurmd\ for the entire
+BGL system on a front-end node.
+In addition, the typical Unix mechanisms used to interact with a
+compute host do not function with BGL base partitions.
+This issue was addressed by adding a SLURM parameter to
+indicate when it is running with a front-end node, in which case
+there is assumed to be a single \slurmd\ for the entire system.
+
+SLURM was originally designed to address a one-dimensional topology,
+and this impacted a variety of areas from naming conventions to
+node selection.
+SLURM provides resource management on several Linux clusters
+exceeding 1000 nodes, and it is impractical to display or otherwise
+work with hundreds of individual node names.
+SLURM addresses this by using regular expressions to indicate
+ranges of node names.
+For example, "linux[0-1023]" was used to represent 1024 nodes
+with names having a prefix of "linux" and a numeric suffix ranging
+from "0" to "1023".
+The most reasonable way to name the BGL nodes seemed to be
+using a three-digit suffix, but rather than indicating a monotonically
+increasing number, each digit would represent the base partition's
+location in the X, Y and Z dimensions (the value of X ranges
+from 0 to 7, Y from 0 to 3, and Z from 0 to 3 on the LLNL system).
+For example, "bgl012" would represent the base partition at
+the position X=0, Y=1 and Z=2.
+Since BGL resources naturally tend to be rectangular prisms in
+shape, we modified the regular expression to indicate the two
+extreme base partition locations.
+The name prefix is always "bgl".
+Within the brackets one lists the base partition with the smallest
+X, Y and Z coordinates, followed by an "x", followed by the base
+partition with the highest X, Y and Z coordinates.
+For example, "bgl[200x311]" represents the following eight base
+partitions: bgl200, bgl201, bgl210, bgl211, bgl300, bgl301, bgl310
+and bgl311.
+Note that this method cannot accommodate blocks of base
+partitions that wrap over the torus boundaries particularly well,
+although a regular expression of this sort is supported:
+"bgl[000x011,700x711]".
+
+The node selection functionality is another topology-aware
+SLURM component.
+Rather than embedding BGL-specific logic in a multitude of
+locations, all of this logic was put into a single plugin.
+The pre-existing node selection logic was put into a plugin
+supporting typical Linux clusters with node names based
+upon a one-dimensional array.
+The BGL-specific plugin not only selects nodes for pending jobs
+based upon the BGL topology, but also issues the BGL-specific APIs
+to monitor system health (draining nodes with any failure
+mode) and to perform initialization and termination sequences for the job.
+
+BGL's topology requirements necessitated the addition of several
 \srun\ options: {\em --geometry} to specify the dimension required by the job,
 {\em --no-rotate} to indicate of the geometry specification could
 rotate in three-dimensions,
+{\em --comm-type} to indicate the communications type, either mesh or torus,
 {\em --node-use} to specify if the second process on a c-node should
-be used to execute the user application or be used for communications.
-
-The \slurmd\ daemon was designed to execute on the individual SLURM
-nodes to monitor the status of that computer, launch job steps, etc.
-BGL prohibited the execute of SLURM daemons within the base partitions.
-In addition the base partition was a ...
-\slurmd\ needed to execute on front-end-node....
-Disable job step.
-
-Base partitions are virtual nodes to SLURM.
+be used to execute the user application or be used for communications.
+While \srun\ accepts these new options on all computer systems,
+the node selection plugin logic is used to manage this data in an
+opaque data type.
+Since these new data types are unused on non-BGL systems, the
+functions that manage them perform no work there.
+Other computers with other topology requirements will be able to
+take advantage of this plugin infrastructure as well with minimal
+effort.
 
 In order to provide users with a clear view of the BGL topology,
 a new tools was developed.
@@ -354,17 +417,36 @@ Table ~\ref{smap_out}.
 \end{center}
 \end{table}
 
+Rather than modifying SLURM to initiate and manage the parallel
+tasks for BGL jobs, we decided to utilize existing software from IBM.
+This eliminated a multitude of software integration issues.
+SLURM will manage the machine's resources, select resources for
+each job, set an environment variable BGL\_PARTITION\_ID, and spawn
+a script.
+The job will initiate its parallel tasks through the use of {\em mpirun}.
+{\em mpirun} uses BGL-specific APIs to launch and manage the
+tasks.
+An additional benefit of this architecture is that the single \slurmd\
+for the entire system is relieved of job step management, which
+could involve a significant amount of overhead for a computer
+of BGL's size.
+We disabled SLURM's job step support for normal users to
+mitigate the possible impact of users inadvertently attempting
+to initiate job steps through SLURM.
+
 \section{Blue Gene/L Network Wiring Issues}
 
 TBD
 
+Static partitioning
+
 \section{Results}
 
 TBD
 
 \section{Future Plans}
 
-TBD
+Dynamic partitioning
 
 \raggedright
 % make the bibliography
-- 
GitLab
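
To make the base partition naming convention in the patch concrete, the following minimal C sketch expands a prism expression such as "bgl[200x311]" into its individual base partition names. It is illustrative only and not taken from the SLURM sources; the function name expand_prism is made up. Its output reproduces the eight names listed in the text.

#include <stdio.h>

/* Illustrative sketch: expand a BGL prism expression of the form
 * bgl[<lo>x<hi>], where <lo> and <hi> are the three-digit XYZ
 * coordinates of the two extreme base partitions, into individual
 * base partition names.  This mirrors the naming convention
 * described in the paper; it is not SLURM code. */
static void expand_prism(const char *lo, const char *hi)
{
    int x, y, z;

    for (x = lo[0] - '0'; x <= hi[0] - '0'; x++)
        for (y = lo[1] - '0'; y <= hi[1] - '0'; y++)
            for (z = lo[2] - '0'; z <= hi[2] - '0'; z++)
                printf("bgl%d%d%d\n", x, y, z);
}

int main(void)
{
    /* "bgl[200x311]" from the paper: prints bgl200, bgl201, bgl210,
     * bgl211, bgl300, bgl301, bgl310 and bgl311 */
    expand_prism("200", "311");
    return 0;
}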
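
The intent of the {\em --geometry} and {\em --no-rotate} options can be illustrated in the same way. The C sketch below (again illustrative, not the actual node selection plugin; the 8x4x4 dimensions are those of the LLNL system cited above, and the geometry value in main is an arbitrary example) tests whether a requested geometry fits within the machine's base partition grid, trying the six axis permutations unless rotation is disabled.

#include <stdbool.h>
#include <stdio.h>

#define SYS_X 8  /* LLNL system: X ranges 0 to 7 */
#define SYS_Y 4  /* Y ranges 0 to 3 */
#define SYS_Z 4  /* Z ranges 0 to 3 */

/* Does a geometry of the given extent fit in the machine as oriented? */
static bool fits(int x, int y, int z)
{
    return x <= SYS_X && y <= SYS_Y && z <= SYS_Z;
}

/* Test the requested geometry, optionally trying all six rotations
 * (axis permutations), as --no-rotate would permit or forbid. */
static bool geometry_fits(const int geo[3], bool no_rotate)
{
    static const int perm[6][3] = {
        {0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}
    };
    int i;

    if (no_rotate)
        return fits(geo[0], geo[1], geo[2]);

    for (i = 0; i < 6; i++)
        if (fits(geo[perm[i][0]], geo[perm[i][1]], geo[perm[i][2]]))
            return true;
    return false;
}

int main(void)
{
    int geo[3] = {4, 8, 2};  /* an example requested geometry */

    printf("as requested:  %s\n",
           geometry_fits(geo, true) ? "fits" : "does not fit");
    printf("with rotation: %s\n",
           geometry_fits(geo, false) ? "fits" : "does not fit");
    return 0;
}

A 4x8x2 request does not fit the 8x4x4 grid as specified, but a rotation to 8x4x2 does, which is why allowing rotation by default gives the selection logic more placement options.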
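
Finally, the launch flow described in the patch (SLURM sets BGL\_PARTITION\_ID and spawns a script, which starts the parallel tasks through IBM's {\em mpirun}) might look roughly like the wrapper below. This is a sketch under the assumption that the spawned script simply forwards its arguments to mpirun; the actual scripts and mpirun arguments used at LLNL are not given in the paper.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustrative wrapper: runs inside the script spawned by SLURM,
 * checks that a BGL partition was allocated, and hands off to IBM's
 * mpirun, which uses BGL-specific APIs to launch and manage tasks.
 * Argument handling is deliberately minimal. */
int main(int argc, char **argv)
{
    const char *part = getenv("BGL_PARTITION_ID");

    if (part == NULL) {
        fprintf(stderr, "BGL_PARTITION_ID not set; "
                "not running under a SLURM BGL allocation\n");
        return 1;
    }
    fprintf(stderr, "launching tasks on partition %s\n", part);

    /* Forward this wrapper's arguments to mpirun unchanged. */
    execvp("mpirun", argv);
    perror("execvp mpirun");
    return 1;
}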