From b345c8d24c6f63dc6c4b2dedbc2bdeaa485764cd Mon Sep 17 00:00:00 2001 From: Don Lipari <lipari1@llnl.gov> Date: Wed, 15 Apr 2009 17:50:20 +0000 Subject: [PATCH] Fixed typos in faq.shtml --- doc/html/faq.shtml | 52 +++++++++++++++++++++++----------------------- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/doc/html/faq.shtml b/doc/html/faq.shtml index f8867fb8e71..5cd9ee95672 100644 --- a/doc/html/faq.shtml +++ b/doc/html/faq.shtml @@ -99,12 +99,12 @@ execute on DOWN nodes?</li> <li><a href="#batch_lost">What is the meaning of the error "Batch JobId=# missing from master node, killing it"?</a></li> <li><a href="#accept_again">What does the messsage -"srun: error: Unable to accept connection: Resources termporarily unavailable" +"srun: error: Unable to accept connection: Resources temporarily unavailable" indicate?</a></li> <li><a href="#task_prolog">How could I automatically print a job's SLURM job ID to its standard output?</li> <li><a href="#moab_start">I run SLURM with the Moab or Maui scheduler. -How can I start a job under SLURM wihtout the scheduler?</li> +How can I start a job under SLURM without the scheduler?</li> <li><a href="#orphan_procs">Why are user processes and <i>srun</i> running even though the job is supposed to be completed?</li> <li><a href="#slurmd_oom">How can I prevent the <i>slurmd</i> and @@ -129,7 +129,7 @@ for an extended period of time. This may be indicative of processes hung waiting for a core file to complete I/O or operating system failure. If this state persists, the system administrator should check for processes -associated with the job that can not be terminated then use the +associated with the job that cannot be terminated then use the <span class="commandline">scontrol</span> command to change the node's state to DOWN (e.g. "scontrol update NodeName=<i>name</i> State=DOWN Reason=hung_completing"), reboot the node, then reset the node's state to IDLE @@ -184,7 +184,7 @@ until no previously submitted job is pending. If the scheduler type is <b>backfi then jobs will generally be executed in the order of submission for a given partition with one exception: later submitted jobs will be initiated early if doing so does not delay the expected execution time of an earlier submitted job. In order for -backfill scheduling to be effective, users jobs should specify reasonable time +backfill scheduling to be effective, users' jobs should specify reasonable time limits. If jobs do not specify time limits, then all jobs will receive the same time limit (that associated with the partition), and the ability to backfill schedule jobs will be limited. The backfill scheduler does not alter job specifications @@ -226,7 +226,7 @@ more information.</p> SLURM has a job purging mechanism to remove inactive jobs (resource allocations) before reaching its time limit, which could be infinite. This inactivity time limit is configurable by the system administrator. -You can check it's value with the command</p> +You can check its value with the command</p> <blockquote> <p><span class="commandline">scontrol show config | grep InactiveLimit</span></p> </blockquote> @@ -251,7 +251,7 @@ the command. For example:</p> </blockquote> <p>srun processes "-N2" as an option to itself. "hostname" is the command to execute and "-pdebug" is treated as an option to the -hostname command. Which will change the name of the computer +hostname command. 
This will change the name of the computer on which SLURM executes the command - Very bad, <b>Don't run this command as user root!</b></p> @@ -305,7 +305,7 @@ that the processes associated with the switch have been terminated to avoid the possibility of re-using switch resources for other jobs (even on different nodes). SLURM considers jobs COMPLETED when all nodes allocated to the -job are either DOWN or confirm termination of all it's processes. +job are either DOWN or confirm termination of all its processes. This enables SLURM to purge job information in a timely fashion even when there are many failing nodes. Unfortunately the job step information may persist longer.</p> @@ -490,7 +490,7 @@ indicate?</b></a><br> The srun command normally terminates when the standard output and error I/O from the spawned tasks end. This does not necessarily happen at the same time that a job step is terminated. For example, -a file system problem could render a spawned tasks non-killable +a file system problem could render a spawned task non-killable at the same time that I/O to srun is pending. Alternately a network problem could prevent the I/O from being transmitted to srun. In any event, the srun command is notified when a job step is @@ -526,8 +526,8 @@ If the user's resource limit is not propagated, the limit in effect for the <i>slurmd</i> daemon will be used for the spawned job. A simple way to control this is to insure that user <i>root</i> has a sufficiently large resource limit and insuring that <i>slurmd</i> takes -full advantage of this limit. For example, you can set user's root's -locked memory limit limit to be unlimited on the compute nodes (see +full advantage of this limit. For example, you can set user root's +locked memory limit ulimit to be unlimited on the compute nodes (see <i>"man limits.conf"</i>) and insuring that <i>slurmd</i> takes full advantage of this limit (e.g. by adding something like <i>"ulimit -l unlimited"</i> to the <i>/etc/init.d/slurm</i> @@ -632,11 +632,11 @@ accommodate all jobs allocated to a node, either running or suspended. <p><a name="fast_schedule"><b>2. How can I configure SLURM to use the resources actually found on a node rather than what is defined in <i>slurm.conf</i>?</b></a><br> -SLURM can either base it's scheduling decisions upon the node +SLURM can either base its scheduling decisions upon the node configuration defined in <i>slurm.conf</i> or what each node actually returns as available resources. This is controlled using the configuration parameter <i>FastSchedule</i>. -Set it's value to zero in order to use the resources actually +Set its value to zero in order to use the resources actually found on each node, but with a higher overhead for scheduling. A value of one is the default and results in the node configuration defined in <i>slurm.conf</i> being used. See "man slurm.conf" @@ -667,7 +667,7 @@ See the slurm.conf and srun man pages for more information.</p> <p><a name="multi_job"><b>5. How can I control the execution of multiple jobs per node?</b></a><br> -There are two mechanism to control this. +There are two mechanisms to control this. If you want to allocate individual processors on a node to jobs, configure <i>SelectType=select/cons_res</i>. See <a href="cons_res.html">Consumable Resources in SLURM</a> @@ -691,7 +691,7 @@ for more information (e.g. "slurmctld -Dvvvvv"). <p><a name="sigpipe"><b>7. 
Why are user tasks intermittently dying at launch with SIGPIPE error messages?</b></a><br> -If you are using ldap or some other remote name service for +If you are using LDAP or some other remote name service for username and groups lookup, chances are that the underlying libc library functions are triggering the SIGPIPE. You can likely work around this problem by setting <i>CacheGroups=1</i> in your slurm.conf @@ -765,7 +765,7 @@ to relocate them. In order to do so, follow this procedure:</p> <li>Stop all SLURM daemons</li> <li>Modify the <i>ControlMachine</i>, <i>ControlAddr</i>, <i>BackupController</i>, and/or <i>BackupAddr</i> in the <i>slurm.conf</i> file</li> -<li>Distribute the updated <i>slurm.conf</i> file file to all nodes</li> +<li>Distribute the updated <i>slurm.conf</i> file to all nodes</li> <li>Restart all SLURM daemons</li> </ol> <p>There should be no loss of any running or pending jobs. Insure that @@ -803,7 +803,7 @@ cluster?</b></a><br> Yes, this can be useful for testing purposes. It has also been used to partition "fat" nodes into multiple SLURM nodes. There are two ways to do this. -The best method for most conditins is to run one <i>slurmd</i> +The best method for most conditions is to run one <i>slurmd</i> daemon per emulated node in the cluster as follows. <ol> <li>When executing the <i>configure</i> program, use the option @@ -822,7 +822,7 @@ slurm.conf. </li> of the node that it is supposed to serve on the execute line.</li> </ol> <p>It is strongly recommended that SLURM version 1.2 or higher be used -for this due to it's improved support for multiple slurmd daemons. +for this due to its improved support for multiple slurmd daemons. See the <a href="programmer_guide.html#multiple_slurmd_support">Programmers Guide</a> for more details about configuring multiple slurmd support.</p> @@ -983,7 +983,7 @@ This error indicates that a job credential generated by the slurmctld daemon corresponds to a job that the slurmd daemon has already revoked. The slurmctld daemon selects job ID values based upon the configured value of <b>FirstJobId</b> (the default value is 1) and each job gets -an value one large than the previous job. +a value one larger than the previous job. On job termination, the slurmctld daemon notifies the slurmd on each allocated node that all processes associated with that job should be terminated. @@ -1031,7 +1031,7 @@ of the program. <p><a name="rpm"><b>27. Why isn't the auth_none.so (or other file) in a SLURM RPM?</b></a><br> -The auth_none plugin is in a separete RPM and not built by default. +The auth_none plugin is in a separate RPM and not built by default. Using the auth_none plugin means that SLURM communications are not authenticated, so you probably do not want to run in this mode of operation except for testing purposes. If you want to build the auth_none RPM then @@ -1080,7 +1080,7 @@ sinfo -t drain -h -o "scontrol update nodename='%N' state=drain reason='%E'" execute on DOWN nodes?</a></b><br> Hierarchical communications are used for sending this message. If there are DOWN nodes in the communications hierarchy, messages will need to -be re-routed. This limits SLURM's ability to tightly synchroize the +be re-routed. This limits SLURM's ability to tightly synchronize the execution of the <i>HealthCheckProgram</i> across the cluster, which could adversely impact performance of parallel applications. The use of CRON or node startup scripts may be better suited to insure @@ -1113,14 +1113,14 @@ is executing. 
If a batch program is expected to be running on some node (i.e. node zero of the job's allocation) and is not found, the message above will be logged and the job cancelled. This typically is associated with exhausting memory on the node or some other critical -failure that can not be recovered from. The equivalent message in +failure that cannot be recovered from. The equivalent message in earlier releases of slurm is "Master node lost JobId=#, killing it". <p><a name="accept_again"><b>33. What does the messsage -"srun: error: Unable to accept connection: Resources termporarily unavailable" +"srun: error: Unable to accept connection: Resources temporarily unavailable" indicate?</b></a><br> -This has been reported on some larger clusters running Suse Linux when +This has been reported on some larger clusters running SUSE Linux when a user's resource limits are reached. You may need to increase limits for locked memory and stack size to resolve this problem. @@ -1128,7 +1128,7 @@ for locked memory and stack size to resolve this problem. SLURM job ID to its standard output?</b></a></br> The configured <i>TaskProlog</i> is the only thing that can write to the job's standard output or set extra environment variables for a job -or job step. To write to the job's standard output, preceed the message +or job step. To write to the job's standard output, precede the message with "print ". To export environment variables, output a line of this form "export name=value". The example below will print a job's SLURM job ID and allocated hosts for a batch job only. @@ -1150,7 +1150,7 @@ fi </pre> <p><a name="moab_start"><b>35. I run SLURM with the Moab or Maui scheduler. -How can I start a job under SLURM wihtout the scheduler?</b></a></br> +How can I start a job under SLURM without the scheduler?</b></a></br> When SLURM is configured to use the Moab or Maui scheduler, all submitted jobs have their priority initialized to zero, which SLURM treats as a held job. The job only begins when Moab or Maui decide where and when to start @@ -1163,7 +1163,7 @@ $ scontrol update jobid=1234 priority=1000000 </pre> <p>Note that changes in the configured value of <i>SchedulerType</i> only take effect when the <i>slurmctld</i> daemon is restarted (reconfiguring -SLURM will not change this parameter. You will also manuallly need to +SLURM will not change this parameter. You will also manually need to modify the priority of every pending job. When changing to Moab or Maui scheduling, set every job priority to zero. When changing from Moab or Maui scheduling, set every job priority to a -- GitLab
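
A note on the TaskProlog mechanism referenced in question 34 above: slurmd parses the TaskProlog script's standard output, copying lines prefixed with "print " to the task's standard output and treating lines of the form "export name=value" as environment variables to add for the task. The sketch below only illustrates that interface; it is not the example script from faq.shtml itself. The MY_JOB_TAG variable is invented for illustration, and the batch-job-only restriction mentioned in the answer is omitted here.
<pre>
#!/bin/sh
# Minimal TaskProlog sketch: slurmd reads this script's stdout.
#   "print ..."        -> written to the task's standard output
#   "export NAME=value" -> environment variable added for the task
if [ "$SLURM_PROCID" = "0" ]
then
   echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
   echo "print SLURM_NODELIST = $SLURM_NODELIST"
   echo "export MY_JOB_TAG=job_$SLURM_JOB_ID"
fi
</pre>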