Commits · 1e0608cdf6c6697abbb2bcaf0664cceacdccf365 · tud-zih-energy / Slurm

Feb 23, 2021

Requeue or kill job on a prolog failure when PrologFlags is not set. · 1e0608cd

Marshall Garey authored 4 years ago

When PrologFlags is not set, the job prolog runs just before the first
step of the job.

Previously, when PrologFlags was not set, the job was not killed on a
prolog failure. Although the step that triggered the prolog failure
would not run, the job's batch script would continue to run and execute
any remaining steps without re-running the prolog.

Bug 10353

1e0608cd

Fix salloc/srun sometimes not getting notified on prolog failure. · bd04a2de

Marshall Garey authored 4 years ago

slurmd relied on ESLURM_BATCH_ONLY being returned to know if it was a
salloc/srun job and sent a message to the job's stdout about prolog
failure. However, in _job_requeue_op(), a few conditions for returning
ESLURM_DISABLED happen before the condition for returning ESLURM_BATCH_ONLY
so this isn't reliable. Instead this commit always sends the message to
the job's stdout about a prolog failure (if there was a prolog failure)
even to batch jobs.

Bug 10353

bd04a2de

Kill job if any prolog fails and job is not requeued. · c7d1131e

Marshall Garey authored 4 years ago

When a prolog fails, slurmd sends a request to slurmctld to requeue the
job. If the requeue doesn't work (the job used --no-requeue or
JobRequeue=0) then slurmd sends a request to slurmctld to complete the
batch job. This can be wrong for a couple of reasons:

(1) The job isn't necessarily a batch job. It could have been started by
salloc or srun.

(2) If a multi-node job had a prolog fail on a node that wasn't the batch
host and requeuing was disabled, when slurmd sends the complete batch
job request to slurmctld, slurmctld ignores that request because it didn't
originate from the batch host. See _slurm_rpc_complete_batch_script().
Therefore, the job isn't killed and runs to completion, even though the
prolog failed and the node was drained.

This commit sends a kill job request instead of a complete batch job
request, ensuring that the job is killed.

Bug 10353

c7d1131e

Fix memory leak in slurm_kill_job2(). · d6e22d49

Marshall Garey authored 4 years ago

Bug 10353

d6e22d49

When sending a job a warning signal make sure we always send SIGCONT beforehand. · fa5d21a7

Tim McMullan authored 4 years ago

Bug 10675

fa5d21a7

Invalid job credential in prolog when slurmd running in old protocol · 488acd45

Carlos Tripiana Montes authored 4 years ago

Ensure the credential is created with the same protocol as the agent,
which should be the oldest version from the job's nodes.

Bug 10884

488acd45

NEWS entry for last 4 commits. · f417c6fe

Danny Auble authored 4 years ago

Bug 10561

f417c6fe

Limit data when checking for a default wckey. · 0dc23eea

Tim McMullan authored 4 years ago

By sending in the local cluster list and changing the query to only
give the default wckey first and then the newest one, the result will
always be the one we just added, so FOR UPDATE no longer needs to be
specified.

This also makes it so we only query non-deleted wckeys.

Bug 10561

0dc23eea

Limit data when checking for a default association. · 352c1ca3

Tim McMullan authored 4 years ago

By sending in the local cluster list and changing the query to only
give the default assoc first and then the newest one, the result will
always be the one we just added, so FOR UPDATE no longer needs to be
specified.

This also makes it so we only query non-deleted assocs.

Bug 10561

352c1ca3

Only perform default wckey check on clusters we actually modified. · 49d61038

Tim McMullan authored 4 years ago

Bug 10561

49d61038

Only perform default account check on clusters we actually modified. · dd4450c0

Tim McMullan authored 4 years ago

Bug 10561

dd4450c0

Feb 22, 2021
- Fix issue where a job could run in a wrong partition when using EnforcePartLimits=any and partition-based associations. · 603f7f04
  Marshall Garey authored 4 years ago
  
  Bug 10229
  603f7f04
- Add a new error for partition-based associations. · 50cabcc4
  Marshall Garey authored 4 years ago
  
  This will be used in the following commit. Bug 10229
  50cabcc4
- Testsuite - Fix minor regression/typo on 20.14 · 6c840110
  Albert Gil authored 4 years ago
  
  Regression typo added merging up on f8f6dd96. Bug 8811
  6c840110
- Docs - Fix typo for 'freeipmi' · 9d1ea9a4
  George Marselis authored 4 years ago
  
  Bug 10917
  9d1ea9a4
- scrontab - fix memory leak when invalid option found in #SCRON line. · 764fc6a5
  Tim McMullan authored 4 years ago
  
  Bug 10891.
  764fc6a5
- scrontab - fix to return the correct index for a bad #SCRON option. · 82ffeb0e
  Tim McMullan authored 4 years ago
  
  This corrects a condition where the bad line is "scrontab" and one where an invalid option won't be detected if it's not the first one in the list. Bug 10891
  82ffeb0e
- Merge branch 'slurm-20.02' into slurm-20.11 · f8f6dd96
  Albert Gil authored 4 years ago
  
  Extra refactor removing get_node_cnt_in_part.
  f8f6dd96
- Testsuite - Add get_nodes_by_state and list2hostlist · d3d13c66
  Albert Gil authored 4 years ago
  
  The following procs have been replaced by the new ones:
  - get_idle_node_in_part
  - get_partition_nodes
  - available_nodes_hostnames
  - available_nodes

  Bug 8811
  Signed-off-by: Marcin Stolarek <cinek@schedmd.com>
  d3d13c66
- Testsuite - Remove get_node_cnt_in_part using get_partition_param · 5d16c2f4
  Albert Gil authored 4 years ago
  
  Bug 8811 Signed-off-by: Marcin Stolarek <cinek@schedmd.com>
  5d16c2f4
Feb 21, 2021
- Fix sacct not displaying UserCPU, SystemCPU and TotalCPU for large times · 7f3f0eb0
  Tim McMullan authored 4 years ago
  
  Was converting unsigned to signed int. Bug 10568
  7f3f0eb0
Feb 19, 2021

Testsuite - Improve 3.11.11 when def partition has down nodes · f6c4836d

Marcin Stolarek authored 4 years ago

Use get_partition_param instead of available_nodes_hostnames to get all
nodes in partition regardless of state.

Bug 8811.

f6c4836d

Stop using strtok() to parse bf_hetjob_prio in the backfill scheduler · 93c52c41

Michael Hinton authored 4 years ago

Get rid of strtok() and the extra xstrdup() that was covering up the
issue from the previous commit and use xstrncasecmp() instead.

Bug 10895

93c52c41

Fix bf_hetjob_prio truncating SchedulerParameters in main scheduler · 83893900

Michael Hinton authored 4 years ago

strtok() is meant to run until it returns NULL. This is because it actually
modifies the first argument in the meantime. Since we only called it once
while parsing SchedulerParameters in the main scheduler, it basically
truncated everything after bf_hetjob_prio. The solution is to simply use
xstrncasecmp() instead.
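
The truncation described above can be reproduced in isolation. The following is a minimal sketch, not the Slurm code: the function names `parse_with_single_strtok` and `hetjob_prio_is` are hypothetical, and POSIX `strncasecmp()` stands in for Slurm's `xstrncasecmp()` wrapper.

```c
#include <string.h>
#include <strings.h>

/*
 * Broken pattern: a single strtok() call on the live option string.
 * strtok() overwrites the first delimiter after the token with '\0',
 * so the caller's buffer ends right after the bf_hetjob_prio value.
 * Returns the length of params after the call.
 */
size_t parse_with_single_strtok(char *params)
{
    char *opt = strstr(params, "bf_hetjob_prio=");

    if (opt)
        strtok(opt, ",");   /* writes '\0' over the next ',' */
    return strlen(params);
}

/*
 * Non-destructive pattern: compare the value in place with a bounded,
 * case-insensitive prefix comparison and never write to params.
 * Returns 1 if the option value starts with `value`, else 0.
 */
int hetjob_prio_is(const char *params, const char *value)
{
    const char *key = "bf_hetjob_prio=";
    const char *opt = strstr(params, key);

    if (!opt)
        return 0;
    opt += strlen(key);
    return strncasecmp(opt, value, strlen(value)) == 0;
}
```

Given a writable buffer holding `default_queue_depth=50,bf_hetjob_prio=min,bf_window=60`, calling `parse_with_single_strtok()` leaves the buffer ending at `bf_hetjob_prio=min`, while `hetjob_prio_is(buf, "MIN")` reads the same value without altering the buffer.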

The backfill scheduler is unaffected due to 5b833ff3.

Bug 10895

83893900

Feb 18, 2021
- Start NEWS for v20.11.5. · 1fd1f71b
  Tim Wickberg authored 4 years ago
  
  1fd1f71b
- Update META for v20.11.4 release. · d176fa8b
  Tim Wickberg authored 4 years ago
  
  Update slurm.spec as well.
  d176fa8b
Feb 17, 2021
- scrontab - fix for when using the emacs editor · 79326e3c
  Tim McMullan authored 4 years ago
  
  Bug 10876
  79326e3c
Feb 16, 2021

Add scrontab to man2html to be made. · 5e8c9a6d
Danny Auble authored 4 years ago

5e8c9a6d

Fix cgroup namespace detection when running in a container · f744707b

Felip Moll authored 4 years ago

When using containers (such as linux containers or docker), slurmds won't
start because when they call xcgroup_ns_is_available to check that they are
in a cgroup, it looks for the presence of the release_agent file. However,
the release_agent file is unique in that it is only present in the root
cgroup. Since containers are namespace-bound to a container-specific child
subdirectory in each control group, it fails to find release_agent and the
slurmd will fail to start.

The fix relies on checking a 'tasks' file instead of 'release_agent'.

Bug 10468

f744707b

Feb 15, 2021

Fix slurmd regression in 20.11 finding/sending the parent node address. · 9550185c

Tim McMullan authored 4 years ago

The regression was introduced in 30af622a, which added a new case in
reverse_tree_info() that filled in parent_rank with -1 when width > nodes.
This led callers like _send_slurmstepd_init() to attempt to find the
slurm_addr_t of this node's parent slurmd in the step host list with
parent_rank = -1 when not applicable.

This resulted in misleading errors and messages being logged (even with
malformed node names) and incorrect reverse-tree info sent to the stepd.

Bug 10838.

9550185c

Feb 12, 2021
- Merge remote-tracking branch 'origin/slurm-20.02' into slurm-20.11 · c73bf457
  Danny Auble authored 4 years ago
  
  c73bf457
- Testsuite - Only show minutes when time is over a minute. · 2f2d1003
  Scott Jackson authored 4 years ago
  
  Bug 10798
  2f2d1003
- Testsuite - test_id should be a string to avoid 0 truncation · 158ca9f6
  Scott Jackson authored 4 years ago
  
  Bug 10798
  158ca9f6
- Work around glibc bug where "0" as a long double is printed as "nan". · c57311f1
  Albert Gil authored 4 years ago
  
  On broken glibc versions, the zeroes in the association state file will be saved as "nan" in packlongdouble(). Detect if this has happened in unpacklongdouble() and convert back to zero. https://bugzilla.redhat.com/show_bug.cgi?id=1925204 Bug 10824.
  c57311f1
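
The detect-and-convert step described for c57311f1 can be sketched as follows. This is an illustration only, assuming the value travels as text; the function name `parse_packed_long_double` is hypothetical and is not the signature of unpacklongdouble().

```c
#include <math.h>
#include <stdlib.h>

/*
 * Parse a long double that was serialized as text. If the text decodes
 * to NaN (as a broken printf would write for a stored 0), fall back to
 * zero. Note that a genuine NaN is also mapped to zero by this check.
 */
long double parse_packed_long_double(const char *text)
{
    long double val = strtold(text, NULL);

    if (isnan(val))
        val = 0.0L;
    return val;
}
```

Doing the conversion on the read side means state files already written with "nan" by an affected glibc are still read back as zero.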
Feb 11, 2021
- Fix regression preventing a full pass of the main scheduler from ever running. · 5b7d0511
  Marcin Stolarek authored 4 years ago
  
  full_queue needs to be reset after the call to _schedule(). Regression from commit 2427f291. Bug 10818.
  5b7d0511
- Add mapping for XCPU when using --signal option. · d1231890
  Colby Ashley authored 4 years ago
  
  Bug 10542.
  d1231890
Feb 10, 2021
- Merge branch 'slurm-20.02' into slurm-20.11 · 26bb8e97
  Albert Gil authored 4 years ago
  
  26bb8e97
- Testsuite - Add --results-file option to regression.py · 9fcf9861
  Scott Jackson authored 4 years ago
  
  To specify a file to write the json testsuite results to. Bug 10798
  9fcf9861
Feb 09, 2021
- Docs - Remove examples showing TaskPluginParam=sched · 61749165
  Ben Roberts authored 4 years ago
  
  sched was removed as an option for TaskPluginParam. Bug 10780 Signed-off-by: Tim Wickberg <tim@schedmd.com>
  61749165
- Updated documentation saying only numbered steps can be default. · 5fd36652
  Scott Hilton authored 4 years ago
  
  Bug 10673
  5fd36652