- Feb 23, 2021
-
-
Marshall Garey authored
When PrologFlags is not set, the job prolog runs just before the first step of the job. Previously, when PrologFlags was not set, the job was not killed on a prolog failure. Although the step that triggered the prolog failure would not run, the job's batch script would continue to run and execute any remaining steps without re-running the prolog. Bug 10353
-
Marshall Garey authored
slurmd relied on ESLURM_BATCH_ONLY being returned to know if it was a salloc/srun job and sent a message to the job's stdout about prolog failure. However, in _job_requeue_op(), a few conditions for returning ESLURM_DISABLED are checked before the condition for returning ESLURM_BATCH_ONLY, so this isn't reliable. Instead, this commit always sends the message about a prolog failure to the job's stdout (if there was a prolog failure), even for batch jobs. Bug 10353
-
Marshall Garey authored
When a prolog fails, slurmd sends a request to slurmctld to requeue the job. If the requeue doesn't work (the job used --no-requeue or JobRequeue=0) then slurmd sends a request to slurmctld to complete the batch job. This can be wrong for a couple of reasons: (1) The job isn't necessarily a batch job. It could have been started by salloc or srun. (2) If a multi-node job had a prolog fail on a node that wasn't the batch host and requeuing was disabled, when slurmd sends the complete batch job request to slurmctld, slurmctld ignores that request because it didn't originate from the batch host. See _slurm_rpc_complete_batch_script(). Therefore, the job isn't killed and runs to completion, even though the prolog failed and the node was drained. This commit sends a kill job request instead of a complete batch job request, ensuring that the job is killed. Bug 10353
-
Marshall Garey authored
Bug 10353
-
Tim McMullan authored
beforehand. Bug 10675
-
Carlos Tripiana Montes authored
Ensure the credential is created with the same protocol version as the agent, which should be the oldest version among the job's nodes. Bug 10884
-
Danny Auble authored
Bug 10561
-
Tim McMullan authored
By sending in the local cluster list and changing the query to return the default wckey first and then the newest one, the row returned will always be the one we just added, so FOR UPDATE no longer needs to be specified. This also makes the query return only non-deleted wckeys. Bug 10561
-
Tim McMullan authored
By sending in the local cluster list and changing the query to return the default assoc first and then the newest one, the row returned will always be the one we just added, so FOR UPDATE no longer needs to be specified. This also makes the query return only non-deleted assocs. Bug 10561
-
Tim McMullan authored
Bug 10561
-
Tim McMullan authored
Bug 10561
-
- Feb 22, 2021
-
-
Marshall Garey authored
EnforcePartLimits=any and partition based associations. Bug 10229
-
Marshall Garey authored
This will be used in the following commit. Bug 10229
-
Albert Gil authored
Regression: typo introduced when merging up in f8f6dd96. Bug 8811
-
George Marselis authored
Bug 10917
-
Tim McMullan authored
Bug 10891.
-
Tim McMullan authored
This corrects a condition where the bad line is "scrontab" and one where an invalid option won't be detected if it's not the first one in the list. Bug 10891
-
Albert Gil authored
Extra refactor removing get_node_cnt_in_part.
-
Albert Gil authored
The following procs have been replaced by the new ones: get_idle_node_in_part, get_partition_nodes, available_nodes_hostnames, available_nodes. Bug 8811 Signed-off-by: Marcin Stolarek <cinek@schedmd.com>
-
Albert Gil authored
Bug 8811 Signed-off-by: Marcin Stolarek <cinek@schedmd.com>
-
- Feb 21, 2021
-
-
Tim McMullan authored
Was converting unsigned to signed int. Bug 10568
-
- Feb 19, 2021
-
-
Marcin Stolarek authored
Use get_partition_param instead of available_nodes_hostnames to get all nodes in partition regardless of state. Bug 8811.
-
Michael Hinton authored
Get rid of strtok() and the extra xstrdup() that was covering up the issue from the previous commit and use xstrncasecmp() instead. Bug 10895
-
Michael Hinton authored
strtok() is meant to run until it returns NULL. This is because it actually modifies the first argument in the meantime. Since we only called it once while parsing SchedulerParameters in the main scheduler, it basically truncated everything after bf_hetjob_prio. The solution is to simply use xstrncasecmp() instead. The backfill scheduler is unaffected due to 5b833ff3. Bug 10895
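The truncation described above can be reproduced in a few lines. This is a minimal sketch, not the scheduler's actual parsing code: both function names are hypothetical, and POSIX strncasecmp() stands in for Slurm's xstrncasecmp(). The first function calls strtok() once and so leaves a '\0' where the next comma was; the second only compares a prefix and leaves the string untouched.

```c
#include <string.h>
#include <strings.h>

/* Hypothetical helper: parses bf_hetjob_prio with a single strtok() call.
 * Side effect: strtok() overwrites the ',' after the value with '\0',
 * so everything after bf_hetjob_prio is cut off from params. */
int hetjob_prio_is_min_destructive(char *params)
{
	char *p = strstr(params, "bf_hetjob_prio=");
	if (!p)
		return 0;
	char *val = strtok(p + strlen("bf_hetjob_prio="), ",");
	return val && !strcasecmp(val, "min");
}

/* Hypothetical helper: compares the value prefix in place.
 * Nothing is written to params. Note that a prefix compare also
 * accepts longer values that merely start with "min". */
int hetjob_prio_is_min_safe(const char *params)
{
	const char *p = strstr(params, "bf_hetjob_prio=");
	if (!p)
		return 0;
	return !strncasecmp(p + strlen("bf_hetjob_prio="), "min", 3);
}
```

Calling the destructive variant on a buffer holding "bf_hetjob_prio=min,defer,bf_window=60" leaves the buffer reading "bf_hetjob_prio=min", so later lookups for the remaining options find nothing.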
-
- Feb 18, 2021
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec as well.
-
- Feb 17, 2021
-
-
Tim McMullan authored
Bug 10876
-
- Feb 16, 2021
-
-
Danny Auble authored
-
Felip Moll authored
When using containers (such as linux containers or docker), slurmds won't start because when they call xcgroup_ns_is_available to check that they are in a cgroup, it looks for the presence of the release_agent file. However, the release_agent file is unique in that it is only present in the root cgroup. Since containers are namespace-bound to a container-specific child subdirectory in each control group, it fails to find release_agent and the slurmd will fail to start. The fix relies on checking a 'tasks' file instead of 'release_agent'. Bug 10468
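As an illustration of the idea, the check could look roughly like the sketch below. It is not the code in xcgroup_ns_is_available(): the function name, the path handling, and the buffer size are assumptions made for the example. It only tests whether a 'tasks' file exists in the given cgroup directory, which holds for child cgroups as well as the root.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if <cgroup_dir>/tasks exists, 0 otherwise.
 * Unlike release_agent, a tasks file is present in every cgroup
 * directory, not only in the root of the hierarchy. */
int cgroup_dir_has_tasks(const char *cgroup_dir)
{
	char path[4096];
	int n = snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);

	/* Treat an encoding error or an over-long path as "not available". */
	if (n < 0 || (size_t)n >= sizeof(path))
		return 0;
	return access(path, F_OK) == 0;
}
```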
-
- Feb 15, 2021
-
-
Tim McMullan authored
Introduced in 30af622a. That patch added a new case in reverse_tree_info() which filled in parent_rank with -1 when width > nodes. This led callers like _send_slurmstepd_init() to attempt to find the slurm_addr_t of this node's parent slurmd in the step host list with parent_rank = -1 when not applicable. This resulted in misleading errors and messages being logged (even with malformed node names) and incorrect reverse-tree info being sent to the stepd. Bug 10838.
-
- Feb 12, 2021
-
-
Danny Auble authored
-
Scott Jackson authored
Bug 10798
-
Scott Jackson authored
Bug 10798
-
Albert Gil authored
On broken glibc versions, the zeroes in the association state file will be saved as "nan" in packlongdouble(). Detect if this has happened in unpacklongdouble() and convert back to zero. https://bugzilla.redhat.com/show_bug.cgi?id=1925204 Bug 10824.
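The recovery step can be sketched as follows. This assumes the value travels as text, as the quoted "nan" suggests, and the function name is hypothetical rather than Slurm's unpacklongdouble(). The sketch parses the text and maps a NaN result back to zero, on the assumption that a NaN in this file can only have come from the glibc bug.

```c
#include <math.h>
#include <stdlib.h>

/* Hypothetical stand-in for the unpack side: parse a text-encoded
 * long double and undo the "nan" corruption. */
long double unpack_long_double_text(const char *buf)
{
	long double v = strtold(buf, NULL);

	/* Assumption: the original value was zero whenever NaN is read. */
	if (isnan(v))
		v = 0.0L;
	return v;
}
```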
-
- Feb 11, 2021
-
-
Marcin Stolarek authored
full_queue needs to be reset after the call to _schedule(). Regression from commit 2427f291. Bug 10818.
-
Colby Ashley authored
Bug 10542.
-
- Feb 10, 2021
-
-
Albert Gil authored
-
Scott Jackson authored
To specify a file to write the json testsuite results to. Bug 10798
-
- Feb 09, 2021
-
-
Ben Roberts authored
sched was removed as an option for TaskPluginParam. Bug 10780 Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Scott Hilton authored
Bug 10673
-