diff --git a/doc/html/admin.guide.html b/doc/html/admin.guide.html
deleted file mode 100644
index a98cac21d0ce7bbbb2c3f424738879a693fe3bdf..0000000000000000000000000000000000000000
--- a/doc/html/admin.guide.html
+++ /dev/null
@@ -1,696 +0,0 @@
-<html>
-<head>
-<title>SLURM Administrator's Guide</title>
-</head>
-<body>
-
-<h1>SLURM Administrator's Guide</h1>
-
-<h2>Overview</h2>
-Simple Linux Utility for Resource Management (SLURM) is an open source,
-fault-tolerant, and highly scalable cluster management and job
-scheduling system for Linux clusters with thousands of nodes.
-Components include machine status, partition management, job
-management, scheduling and stream copy modules.
-
-<h2>Build Information</h2>
-TBD
-Include PKI build instructions.
-
-<h2>Configuration</h2>
-There is a single SLURM configuration file containing:
-overall SLURM options, node configurations, and partition configurations.
-This file is located at "/etc/slurm.conf" by default.
-The file location can be modified at system build time using the
-DEFAULT_SLURM_CONF parameter.
-The overall SLURM configuration options specify the control and backup
-control machines.
-The locations of daemons, state information storage, and other details
-are specified at build time.
-See the <a href="#Build">Build Parameters</a> section for details.
-The node configuration tells SLURM what nodes it is to manage as well as
-their expected configuration.
-The partition configuration permits you to define sets (or partitions)
-of nodes and establish distinct job limits or access controls for them.
-Configuration information may be read or updated using SLURM APIs.
-This configuration file or a copy of it must be accessible on every
-computer under SLURM management.
-<p>
-The following parameters may be specified:
-<dl>
-<dt>BackupController
-<dd>The name of the machine where SLURM control functions are to be
-executed in the event that ControlMachine fails. This node
-may also be used as a compute server if so desired. It will come into service
-as a controller only upon the failure of ControlMachine and will revert
-to a "standby" mode when the ControlMachine becomes available once again.
-This should be a node name without the full domain name (e.g. "lx0002").
-While not essential, it is highly recommended that you specify a backup controller.
-
-<dt>ControlMachine
-<dd>The name of the machine where SLURM control functions are executed.
-This should be a node name without the full domain name (e.g. "lx0001").
-This value must be specified.
-
-<dt>Epilog
-<dd>Fully qualified pathname of a program to execute as user root on every
-node when a user's job completes (e.g. "/usr/local/slurm/epilog"). This may
-be used to purge files, disable user login, etc. By default there is no
-epilog. A sketch of a minimal epilog script appears after this list.
-
-<dt>FastSchedule
-<dd>If set to 1, then consider the configuration of each node to be that
-specified in the configuration file. If set to 0, then base scheduling
-decisions upon the actual configuration of each node. If the number of
-node configuration entries in the configuration file is significantly
-lower than the number of nodes, setting FastSchedule to 1 will permit
-much faster scheduling decisions to be made. The default value is 1.
-
-<dt>FirstJobId
-<dd>The job id to be used for the first job submitted to SLURM without a
-specific requested value. Job id values generated will be incremented by 1
-for each subsequent job. This may be used to provide a meta-scheduler
-with a job id space which is disjoint from the interactive jobs.
-
-<dt>HashBase
-<dd>If the node names include a sequence number, this value defines the
-base to be used in building a hash table based upon node name. Values of 8
-and 10 are recognized for octal and decimal sequence numbers respectively.
-The value of zero is also recognized for node names lacking a sequence number.
-The use of node names containing a numeric suffix will provide faster
-operation for larger clusters. The default value is 10.
-
-<dt>HeartbeatInterval
-<dd>The interval, in seconds, at which the SLURM controller tests the
-status of other daemons. The default value is 30 seconds.
-
-<dt>InactiveLimit
-<dd>The interval, in seconds, a job is permitted to be inactive (with
-no active job steps) before it is terminated. This permits forgotten
-jobs to be purged in a timely fashion without waiting for their time
-limit to be reached. The default value is unlimited (zero).
-
-<dt>JobCredentialPrivateKey
-<dd>Fully qualified pathname of a file containing a private key used for
-authentication by SLURM daemons.
-
-<dt>JobCredentialPublicCertificate
-<dd>Fully qualified pathname of a file containing a public key used for
-authentication by SLURM daemons.
-
-<dt>KillWait
-<dd>The interval, in seconds, given to a job's processes between the
-SIGTERM and SIGKILL signals upon reaching its time limit.
-If the job fails to terminate gracefully
-in the interval specified, it will be forcibly terminated. The default
-value is 30 seconds.
-
-<dt>Prioritize
-<dd>Fully qualified pathname of a program to execute in order to establish
-the initial priority of a newly submitted job. By default there is no
-prioritization program and each job gets a priority lower than that of
-any existing jobs.
-
-<dt>Prolog
-<dd>Fully qualified pathname of a program to execute as user root on every
-node when a user's job begins execution (e.g. "/usr/local/slurm/prolog").
-This may be used to purge files, enable user login, etc. By default there
-is no prolog.
-
-<dt>ReturnToService
-<dd>If set to 1, then a DOWN node will become available for use
-upon registration. The default value is 0, which
-means that a node will remain DOWN until a system administrator explicitly
-makes it available for use.
-
-<dt>SlurmctldPort
-<dd>The port number that the SLURM controller, <i>slurmctld</i>, listens
-to for work. The default value is SLURMCTLD_PORT as established at system
-build time.
-
-<dt>SlurmctldTimeout
-<dd>The interval, in seconds, that the backup controller waits for the
-primary controller to respond before assuming control. The default value
-is 300 seconds.
-
-<dt>SlurmdPort
-<dd>The port number that the SLURM compute node daemon, <i>slurmd</i>, listens
-to for work. The default value is SLURMD_PORT as established at system
-build time.
-
-<dt>SlurmdTimeout
-<dd>The interval, in seconds, that the SLURM controller waits for <i>slurmd</i>
-to respond before setting that node's state to DOWN. The default value
-is 300 seconds.
-
-<dt>StateSaveLocation
-<dd>Fully qualified pathname of a directory into which the SLURM controller,
-<i>slurmctld</i>, saves its state (e.g. "/usr/local/slurm/checkpoint"). SLURM
-state will be saved here to recover from system failures. The default value is "/tmp".
-If any SLURM daemons terminate abnormally, their core files will also be written
-into this directory.
-
-<dt>TmpFS
-<dd>Fully qualified pathname of the file system available to user jobs for
-temporary storage. This parameter is used in establishing a node's <i>TmpDisk</i>
-space. The default value is "/tmp".
-
-</dl>
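-<p>
-As an illustration of the Prolog and Epilog hooks described above, a
-minimal epilog script might look like the following. This is only a
-sketch; the pathname and the cleanup action are site-specific
-assumptions, not part of the SLURM distribution.
-<pre>
-#!/bin/sh
-# Hypothetical /usr/local/slurm/epilog: runs as user root when a job
-# completes. Purge temporary files more than one day old from the
-# TmpFS file system so that user jobs retain access to most of its space.
-find /tmp -mindepth 1 -mtime +1 -exec rm -rf {} + 2>/dev/null
-exit 0
-</pre>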
-Any text after "#" until the end of the line in the configuration file
-will be considered a comment.
-If you need to use "#" in a value within the configuration file, precede
-it with a backslash ("\").
-The configuration file should contain a keyword followed by an
-equal sign, followed by the value.
-Keyword-value pairs should be separated from each other by white space.
-The field descriptor keywords are case sensitive.
-The size of each line in the file is limited to 1024 characters.
-A sample SLURM configuration file (without node or partition information)
-follows.
-<pre>
-ControlMachine=lx0001 BackupController=lx0002
-Epilog=/usr/local/slurm/epilog Prolog=/usr/local/slurm/prolog
-FastSchedule=1
-FirstJobId=65536
-HashBase=10
-HeartbeatInterval=60
-InactiveLimit=120
-KillWait=30
-Prioritize=/usr/local/maui/priority
-SlurmctldPort=7002 SlurmdPort=7003
-SlurmctldTimeout=300 SlurmdTimeout=300
-StateSaveLocation=/tmp/slurm.state
-TmpFS=/tmp
-</pre>
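-<p>
-The "#" escaping rule above can be illustrated with a short example
-(the pathname is hypothetical):
-<pre>
-# A "#" preceded by a backslash is part of the value; an unescaped "#"
-# begins a comment.
-Epilog=/usr/local/slurm/epilog\#2   # this trailing text is a comment
-</pre>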
TmpFS (for "Temporary File System") -identifies the location which jobs should use for temporary storage. -Note this does not indicate the amount of free -space available to the user on the node, only the total file -system size. The system administration should insure this file -system is purged as needed so that user jobs have access to -most of this space. -The Prolog and/or Epilog programs (specified in the configuration file) -might be used to insure the file system is kept clean. -The default value is 1. - -<dt>Weight -<dd>The priority of the node for scheduling purposes. -All things being equal, jobs will be allocated the nodes with -the lowest weight which satisfies their requirements. -For example, a heterogeneous collection of nodes might -be placed into a single partition for greater system -utilization, responsiveness and capability. It would be -preferable to allocate smaller memory nodes rather than larger -memory nodes if either will satisfy a job's requirements. -The units of weight are arbitrary, but larger weights -should be assigned to nodes with more processors, memory, -disk space, higher processor speed, etc. -Weight is an integer value with a default value of 1. - -</dl> -<p> -Only the NodeName must be supplied in the configuration file; all other -items are optional. -It is advisable to establish baseline node configurations in -the configuration file, especially if the cluster is heterogeneous. -Nodes which register to the system with less than the configured resources -(e.g. too little memory), will be placed in the "DOWN" state to -avoid scheduling jobs on them. -Establishing baseline configurations will also speed SLURM's -scheduling process by permitting it to compare job requirements -against these (relatively few) configuration parameters and -possibly avoid having to perform job requirements -against every individual node's configuration. -The resources checked at node registration time are: Procs, -RealMemory and TmpDisk. -While baseline values for each of these can be established -in the configuration file, the actual values upon node -registration are recorded and these actual values may be -used for scheduling purposes (depending upon the value of -<i>FastSchedule</i> in the configuration file. -Default values can be specified with a record in which -"NodeName" is "DEFAULT". -The default entry values will apply only to lines following it in the -configuration file and the default values can be reset multiple times -in the configuration file with multiple entries where "NodeName=DEFAULT". -The "NodeName=" specification must be placed on every line -describing the configuration of nodes. -In fact, it is generally possible and desirable to define the -configurations of all nodes in only a few lines. -This convention permits significant optimization in the scheduling -of larger clusters. -The field descriptors above are case sensitive. -In order to support the concept of jobs requiring consecutive nodes -on some architectures, -node specifications should be place in this file in consecutive order. -The size of each line in the file is limited to 1024 characters. -<p> -<a name="NodeStates">The node states have the following meanings:</a> -<dl> -<dt>BUSY -<dd>The node has been allocated work (one or more user jobs). - -<dt>DOWN -<dd>The node is unavailable for use. It has been explicitly configured -DOWN or failed to respond to system state inquiries or has -explicitly removed itself from service due to a failure. 
-This state typically indicates some problem
-requiring administrator intervention.
-
-<dt>DRAINED
-<dd>The node is idle, but not available for use. The state of a node
-will automatically change from DRAINING to DRAINED when user job(s) executing
-on that node terminate. Since this state is entered by explicit
-administrator request, additional SLURM administrator intervention is typically
-not required.
-
-<dt>DRAINING
-<dd>The node has been made unavailable for new work by explicit administrator
-intervention. It is processing some work at present and will enter state
-"DRAINED" when that work has been completed. This might be used to
-prepare some nodes for maintenance work.
-
-<dt>IDLE
-<dd>The node is idle and available for use.
-
-<dt>UNKNOWN
-<dd>Default initial node state upon startup of SLURM.
-An attempt will be made to contact the node and acquire current state information.
-
-</dl>
-<p>
-SLURM uses a hash table in order to locate table entries rapidly.
-Each table entry can be directly accessed without any searching
-if the name contains a sequence number suffix. The value of
-<i>HashBase</i> in the configuration file specifies the hashing algorithm.
-Possible values are "10" and "8" for names containing
-decimal and octal sequence numbers respectively,
-or "0" for mixed alpha-numeric names without sequence numbers.
-The default value of <i>HashBase</i> is "10".
-If you use a naming convention lacking a sequence number, it may be
-desirable to review the hashing function <i>hash_index</i> in the
-node_mgr.c module. This is especially important in clusters having
-large numbers of nodes. The sequence numbers can start at any
-desired number, but should be consecutive. The
-sequence number portion may contain leading zeros for a consistent
-name length, if so desired. Note that correct operation
-will be provided with any node names, but performance will suffer
-without this optimization.
-A sample SLURM configuration file (node information only) follows.
-<pre>
-#
-# Node Configurations
-#
-NodeName=DEFAULT TmpDisk=16384 State=IDLE
-NodeName=lx[0001-0002] State=DRAINED
-NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
-NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz,VizTools
-</pre>
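-<p>
-Because a "NodeName=DEFAULT" record applies only to the lines which
-follow it and may be reset, a cluster with two hardware generations
-might also be described as follows (the node names and values are
-hypothetical):
-<pre>
-# Defaults for the older nodes
-NodeName=DEFAULT Procs=16 RealMemory=2048 TmpDisk=16384
-NodeName=lx[0001-1000]
-# Reset the defaults for the newer nodes
-NodeName=DEFAULT Procs=32 RealMemory=4096 TmpDisk=32768
-NodeName=lx[1001-2000]
-</pre>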
- -<dt>Nodes -<dd>Comma separated list of nodes which are associated with this -partition. Node names may be specified using the <a href="#NodeExp"> -real expression syntax</a> described above. A blank list of nodes -(i.e. "Nodes= ") can be used if one wants a partition to exist, -but have no resources (possibly on a temporary basis). - -<dt>PartitionName -<dd>Name by which the partition may be referenced (e.g. "Interactive"). -This name can be specified by users when submitting jobs. - -<dt>Shared -<dd>Ability of the partition to execute more than one job at a -time on each node. Shared nodes will offer unpredictable performance -for application programs, but can provide higher system utilization -and responsiveness than otherwise possible. -Possible values are "FORCE", "YES", and "NO". -The default value is "NO". - -<dt>State -<dd>State of partition or availability for use. Possible values -are "UP" or "DOWN". The default value is "UP". - -</dl> -<p> -Only the PartitionName must be supplied in the configuration file. -Other parameters will assume default values if not specified. -The default values can be specified with a record in which -"PartitionName" is "DEFAULT" if non-standard default values are desired. -The default entry values will apply only to lines following it in the -configuration file and the default values can be reset multiple times -in the configuration file with multiple entries where "PartitionName=DEFAULT". -The configuration of one partition should be specified per line. -The field descriptors above are case sensitive. -The size of each line in the file is limited to 1024 characters. -A sample SLURM configuration file (partition information only) follows. -<p> -A single job may be allocated nodes from only one partition and -satisfy the configuration specifications for that partitions. -The job may specify a particular PartitionName, if so desired, -or use the system's default partition. -<pre> -# -# Partition Configurations -# -PartitionName=DEFAULT MaxTime=30 MaxNodes=2 -PartitionName=login Nodes=lx[0001-0002] State=DOWN -PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES -PartitionName=class Nodes=lx[0031-0040] AllowGroups=students -PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096 RootOnly=YES -</pre> -<p> -APIs and an administrative tool can be used to alter the SLRUM -configuration in real time. -When the SLURM controller restarts, it's state will be restored -to that at the time it terminated unless the SLURM configuration -file is newer, it which case the configuration will be rebuilt -from that file. -State information not incorporated in the configuration file, -such as job state, will be preserved. -A <a href="#SampleConfig">SLURM configuration file</a> is included -at the end of this document. - -<h3>Job Configuration</h3> -The job configuration format specified below is used by the -scontrol administration tool to modify job state information: -<dl> -<dt>Contiguous -<dd>Determine if the nodes allocated to the job must be contiguous. -Acceptable values are "YES" and "NO" with the default being "NO". - -<dt>Features -<dd>Required features of nodes to be allocated to this job. -Features may be combined using "|" for OR, "&" for AND, -and square brackets. -For example, "Features=1000MHz|1200MHz&CoolTool". -The feature list is processes left to right except for -the grouping by brackets. -Square brackets are used to identify alternate features, -but ones that must apply to every node allocated to the job. 
-
-<h3>Job Configuration</h3>
-The job configuration format specified below is used by the
-scontrol administration tool to modify job state information:
-<dl>
-<dt>Contiguous
-<dd>Determines whether the nodes allocated to the job must be contiguous.
-Acceptable values are "YES" and "NO" with the default being "NO".
-
-<dt>Features
-<dd>Required features of nodes to be allocated to this job.
-Features may be combined using "|" for OR, "&" for AND,
-and square brackets.
-For example, "Features=1000MHz|1200MHz&CoolTool".
-The feature list is processed left to right, except for
-the grouping by brackets.
-Square brackets are used to identify alternate features, one of
-which must apply to every node allocated to the job.
-For example, some clusters are configured with more than
-one parallel file system. These parallel file systems
-may be accessible only to a subset of the nodes in a cluster.
-The application may not care which parallel file system
-is used, but all nodes allocated to it must be in the
-subset of nodes accessing a single parallel file system.
-This might be specified with a specification of
-"Features=[PFS1|PFS2|PFS3|PFS4]".
-
-<dt>JobName
-<dd>Name to be associated with the job.
-
-<dt>JobId
-<dd>Identification for the job, a sequence number.
-
-<dt>MinMemory
-<dd>Minimum number of megabytes of real memory per node.
-
-<dt>MinProcs
-<dd>Minimum number of processors per node.
-
-<dt>MinTmpDisk
-<dd>Minimum number of megabytes of temporary disk storage per node.
-
-<dt>ReqNodes
-<dd>The total number of nodes required to execute this job.
-
-<dt>ReqNodeList
-<dd>A comma separated list of nodes to be allocated to the job.
-The nodes may be specified using regular expressions (e.g.
-"lx[0010-0020,0033-0040]" or "baker,charlie,delta").
-
-<dt>ReqProcs
-<dd>The total number of processors required to execute this job.
-
-<dt>Partition
-<dd>Name of the partition in which this job should execute.
-
-<dt>Priority
-<dd>Integer priority of the pending job. The value may
-be specified for jobs initiated by user root; otherwise SLURM will
-select a value. Generally, higher priority jobs will be initiated
-before lower priority jobs.
-
-<dt>Shared
-<dd>Job can share nodes with other jobs. Possible values are YES and NO.
-
-<dt>State
-<dd>State of the job. Possible values are "PENDING", "STARTING",
-"RUNNING", and "ENDING".
-
-<dt>TimeLimit
-<dd>Maximum wall-time limit for the job in minutes. An "UNLIMITED"
-value is represented internally as -1.
-
-</dl>
-
-<a name="Build"><h2>Build Parameters</h2></a>
-The following configuration parameters are established at SLURM build time.
-State and configuration information may be read or updated using SLURM APIs.
-
-<dl>
-<dt>SLURMCTLD_PATH
-<dd>The fully qualified pathname of the file containing the SLURM daemon
-to execute on the ControlMachine, <i>slurmctld</i>. The default value is
-"/usr/local/slurm/bin/slurmctld".
-This file must be accessible to the ControlMachine and BackupController.
-
-<dt>SLURMD_PATH
-<dd>The fully qualified pathname of the file containing the SLURM daemon
-to execute on every compute server node. The default value is
-"/usr/local/slurm/bin/slurmd".
-This file must be accessible to every SLURM compute server.
-
-<dt>SLURM_CONF
-<dd>The fully qualified pathname of the file containing the SLURM
-configuration file. The default value is "/etc/SLURM.conf".
-
-<dt>SLURMCTLD_PORT
-<dd>The port number that the SLURM controller, <i>slurmctld</i>, listens
-to for work.
-
-<dt>SLURMD_PORT
-<dd>The port number that the SLURM compute node daemon, <i>slurmd</i>, listens
-to for work.
-</dl>
-
-<h2>scontrol Administration Tool</h2>
-The tool you will primarily use in the administration of SLURM is scontrol.
-It provides the means of viewing and updating node and partition
-configurations. It can also be used to update some job state
-information. You can execute scontrol with a single keyword on
-the command line or it will query you for input and process those
-keywords on an interactive basis. The scontrol keywords are shown below.
-A <a href="#SampleAdmin">sample scontrol session</a> with examples is appended.
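-<p>
-For instance, a single command may be executed directly from the shell
-(the output shown is illustrative only):
-<pre>
-# scontrol show node lx0003
-Name=lx0003 Partition=debug State=IDLE Procs=16 RealMemory=2048 TmpDisk=16384
-</pre>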
- -<p> -Usage: scontrol [-q | -v] [<keyword>]<br> --q is equivalent to the "quiet" keyword<br> --v is equivalent to the "verbose" keyword<br> -</pre> -<dl> -<dt>abort -<dd>Cause <i>slurmctld</i> terminate immediately and generate a core file. - -<dt>exit -<dd>Terminate scontrol. - -<dt>help -<dd>Display this list of scontrol commands and options. - -<dt>quiet -<dd>Print no messages other than error messages. - -<dt>quit -<dd>Terminate scontrol. - -<dt>reconfigure -<dd>The SLURM control daemon re-reads its configuration files. - -<dt>show <entity> [<ID>] -<dd>Show the configuration for a given entity. Entity must -be "config", "job", "node", "partition" or "step" for SLURM -configuration parameters, job, node, partition, and job step -information respectively. -By default, state information for all records is reported. -If you only wish to see the state of one entity record, -specify either its ID number (assumed if entirely numeric) -or its name. <a href="#NodeExp">Regular expressions</a> may -be used to identify node names. - -<dt>shutdown -<dd>Cause <i>slurmctld</i> to save state and terminate. - -<dt>update <options> -<dd>Update the configuration information. -Options are of the same format as the configuration file -and the output of the <i>scontrol show</i> command. -Not all configuration information can be modified using -this mechanism, such as the configuration of a node -after it has registered (only a node's state can be modified). -One can always modify the SLURM configuration file and -use the reconfigure command to rebuild all controller -information if required. -This command can only be issued by user <b>root</b>. - -<dt>verbose -<dd>Enable detailed logging of scontrol execution state information. - -<dt>version -<dd>Display the scontrol tool version number. -</dl> - -<h2>Miscellaneous</h2> -There is no necessity for synchronized clocks on the nodes. -Events occur either in real-time based upon message traffic -or based upon changes in the time on a node. However, synchronized -clocks will permit easier analysis of SLURM logs from multiple -nodes. -<p> -SLURM uses the syslog function to record events. It uses a -range of importance levels for these messages. Be certain -that your system's syslog functionality is operational. 
- -<a name="SampleConfig"><h2>Sample Configuration File</h2></a> -<pre> -# -# Sample /etc/slurm.conf -# Author: John Doe -# Date: 11/06/2001 -# -ControlMachine=lx0001 BackupController=lx0002 -Epilog="" Prolog="" -FastSchedule=1 -FirstJobId=65536 -HashBase=10 -HeartbeatInterval=60 -InactiveLimit=120 -KillWait=30 -Prioritize=/usr/local/maui/priority -SlurmctldPort=7002 SlurmdPort=7003 -SlurmctldTimeout=300 SlurmdTimeout=300 -StateSaveLocation=/tmp/slurm.state -TmpFS=/tmp -# -# Node Configurations -# -NodeName=DEFAULT TmpDisk=16384 State=IDLE -NodeName=lx[0001-0002] State=DRAINED -NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16 -NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz -# -# Partition Configurations -# -PartitionName=DEFAULT MaxTime=30 MaxNodes=2 -PartitionName=login Nodes=lx[0001-0002] State=DOWN -PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES -PartitionName=class Nodes=lx[0031-0040] AllowGroups=students -PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096 RootOnly=YES -</pre> - -<a name="SampleAdmin"><h2>Sample scontrol Execution</h2></a> -<pre> -Remove node lx0030 from service, removing jobs as needed: - # scontrol - scontrol: update NodeName=lx0030 State=DRAINING - scontrol: show job - ID=1234 Name=Simulation MaxTime=100 Nodes=lx[0029-0030] State=RUNNING User=smith - ID=1235 Name=MyBigTest MaxTime=100 Nodes=lx0020,lx0023 State=RUNNING User=smith - scontrol: update job ID=1234 State=ENDING - scontrol: show job 1234 - Job 1234 not found - scontrol: show node lx0030 - Name=lx0030 Partition=class State=DRAINED Procs=16 RealMemory=2048 TmpDisk=16384 - scontrol: quit -</pre> - -<hr> -URL = http://www-lc.llnl.gov/dctg-lc/slurm/admin.guide.html -<p>Last Modified October 22, 2002</p> -<address>Maintained by <a href="mailto:slurm-dev@lists.llnl.gov"> -slurm-dev@lists.llnl.gov</a></address> -</body> -</html> diff --git a/doc/html/quickstart.html b/doc/html/quickstart.html index 4d20bc7e639200f5aa8f4832e4520e3150da8c18..def26ebacc7fa0664e5b91a5eaa457f32171f517 100644 --- a/doc/html/quickstart.html +++ b/doc/html/quickstart.html @@ -100,6 +100,8 @@ it is restored to service. The controller saves its state to disk whenever there is a change. This state can be recovered by the controller at startup time. <b>slurmctld</b> would typically execute as a special user specifically for this purpose (not user root). +State changes are saved so that jobs and other state can be +preserved when slurmctld moves or is restarted. <p> The <b>slurmd</b> daemon executes on every compute node. It resembles a remote shell daemon to export control to SLURM. @@ -199,12 +201,22 @@ The remaining information provides basic SLURM administration information. Individuals only interested in making use of SLURM need not read read further. -<h3>Authentication</h3> +<h3>Infrastructure</h3> All communications between SLURM components are authenticated. The authentication infrastructure used is specified in the SLURM configuration file and options include: <a href="http://www.theether.org/authd/">authd</a>, munged and none. +<p> +SLURM uses the syslog function to record events. It uses a +range of importance levels for these messages. Be certain +that your system's syslog functionality is operational. +<p> +There is no necessity for synchronized clocks on the nodes. +Events occur either in real-time based upon message traffic. +However, synchronized clocks will permit easier analysis of +SLURM logs from multiple nodes. + <h3>Configuration</h3>