Commit 37e8283f, authored 12 years ago by Danny Auble
BLUEGENE - update to documentation to explain backfill is fixed
parent 9d947fbf
doc/html/bluegene.shtml: 4 additions, 38 deletions
@@ -433,44 +433,10 @@ etc.). Sample prolog and epilog scripts follow. </p>
 interfere with each other, scheduling is somewhat different on a BlueGene
 system than typical clusters.</p>
-<p><b>IMPORTANT: Choose your <i>SchedType in your slurm.conf</i>
-wisely.</b> The below only really applies to dynamic
-partitioning. <b>If you use static or overlap partitioning always use
-SchedType=sched/builtin.</b></p>
-<p>The way the backfill works is on a node basis or in bluegene's case
-a midplane level. So the problem discribed below happens on
-machines using select/cons_res as well.</p>
-<p>Lets use a bluegene 1 midplane system for simplicities
-sake. (imagine a 1 node 512 core system using cons_res and you will
-have the same picture)</p>
-<p>If you have a job running on 256 of the midplane and the next job
-looking to run is 512 the backfill takes the node off from running
-from the list saying this is claimed for the next run. The next job
-is only 16 nodes and could easily run before the 256 finishes so it
-should be backfilled but because the 512 has already claimed the
-resources it does not run. So it could delay the start of the 16
-node job until it becomes the highest priority job even if it would
-really run before hand.</p>
-<p>Using Builtin fixes this scenario and goes through the list in
-priority running any job it can without respect to backfill. This
-causes a new problem though. If there is a large job there is no
-way for the queue to automatically drain and run it as in
-backfill. So large jobs could starve and will hang out until the
-system is free of other jobs.</p>
-<p>So there are pluses and minuses for each method. On a large
-bluegene install it is probably a good idea to use backfill for most of the
-time and only switch to builtin when you want to stress things
-(backfill is much heavier of a protocol). Smaller systems should probably
-run as builtin all the time especially if there is only 1
-midplane.</p>
-<p>The backfill plugin can be changed to be more resource conscious
-which would resolve all these issues, but this has not happened
-yet. But that is enough about SchedType, onward.</p>
+<p>Starting in 2.4.3 SchedType=sched/backfill works in all modes and
+for all job sizes. Before this release there were issues backfilling
+jobs smaller than a midplane. It is encourged to upgrade to at least
+2.4.3 for better backfill behavior.</p>
 <p>SLURM does support different partitions with an assortment of
 different scheduling parameters.
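
The removed paragraphs describe a single-midplane scenario: a 256-node job is running, a 512-node job is next in priority, and a 16-node job behind it cannot be backfilled because the whole midplane is already claimed. The toy model below is one way to picture that difference. It is only a sketch and not Slurm's backfill code: the `Job` class, the `can_backfill` and `top_job_start` helpers, the `midplane_granular` flag, and all time limits are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass

MIDPLANE_CNODES = 512  # one midplane, as in the removed example


@dataclass
class Job:
    name: str
    cnodes: int   # c-nodes requested
    minutes: int  # time limit; hypothetical value, not taken from the docs


def top_job_start(top, running, now):
    """Earliest time the top-priority job fits, given (job, end_time) pairs."""
    free = MIDPLANE_CNODES - sum(job.cnodes for job, _ in running)
    if top.cnodes <= free:
        return now
    # Release running jobs in order of their end time until the top job fits.
    for job, end in sorted(running, key=lambda pair: pair[1]):
        free += job.cnodes
        if top.cnodes <= free:
            return end
    raise ValueError("top job can never fit on one midplane")


def can_backfill(candidate, top, running, now, midplane_granular):
    """Return True if the candidate may start now behind a blocked top job."""
    free = MIDPLANE_CNODES - sum(job.cnodes for job, _ in running)
    if candidate.cnodes > free:
        return False
    if midplane_granular:
        # The whole midplane is claimed for the top job, so nothing else starts.
        return False
    # Sub-midplane aware: start only if the candidate ends before the top job can begin.
    return now + candidate.minutes <= top_job_start(top, running, now)


running = [(Job("A", 256, 120), 120)]  # A holds 256 c-nodes until t=120
top = Job("B", 512, 60)                # highest priority, must wait for A
small = Job("C", 16, 30)               # would finish long before A does

print(can_backfill(small, top, running, now=0, midplane_granular=True))   # False
print(can_backfill(small, top, running, now=0, midplane_granular=False))  # True
```

With midplane-granular reservations the 16-node job waits even though 256 c-nodes sit idle; once free capacity inside the midplane is taken into account, it starts immediately without delaying the 512-node job.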