- Jan 23, 2017
-
Danny Auble authored
-
Morris Jette authored
Add new knl.conf parameters to the capmc_suspend and capmc_resume programs. They are not used by those programs, but we need to prevent an error if those new parameters are used.
-
Morris Jette authored
-
Morris Jette authored
Reset a job's memory limit based upon what's available after node reboot, which can change on a KNL node if the MCDRAM mode is changed on reboot
-
Morris Jette authored
This bug was likely the root cause of bug 3366. If the backfill scheduler allocates resources for a batch job and a node reboot is required, the batch launch RPC is sent to the agent. At that point, there is a race condition between the agent and the job_time_limit() function, which tests for boot completion. If the job_time_limit() function ran first, it would trigger a second launch RPC being sent to the agent. bug 3366
-
Morris Jette authored
Clean up logic to test if a job is configuring. bug 3366
-
Morris Jette authored
Do not launch a batch step while the job is configuring. Previous logic checked for the PrologSlurmctld running, but not nodes booting. Checking the job's CONFIGURING state flag will validate both. bug 3366
-
Morris Jette authored
Add a check to prevent the step allocation logic from executing the job configuration completion logic multiple times (check if the job is configuring before clearing the flag and resetting the time limit). bug 3366
-
Brian Christiansen authored
-
Morris Jette authored
slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld daemon is running or node boot in progress. bug 3366
-
Morris Jette authored
This is required to manage the configuration completion. bug 3366
-
Morris Jette authored
This will be required to lock the job structure. bug 3366
-
Morris Jette authored
Remove the return value from the agent_retry() function. It is not used anywhere and needs to be removed to run as a pthread. bug 3366
-
- Jan 21, 2017
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
Reasonable NFS systems do not need a minute to propagate changes.
-
- Jan 20, 2017
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
In favor of just using the -a option to show the federated tracking jobs. This allows scontrol -a show jobs to show the tracking jobs as well.
-
Brian Christiansen authored
-
Brian Christiansen authored
to indicate whether the job was requeue held or not. This enables the federation to trigger off whether the job was requeue held.
-
Brian Christiansen authored
So that the origin job can tell a remote cluster to cancel the job but mark it as requeued in the database. See the note about the KILL_* flags actually using 12 bits instead of the noted 8 bits.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Follows pattern from c5ace562
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
If a job was requeued while in the completing state, the database wasn't being updated with the requeue state.
-
Brian Christiansen authored
When a fed job is requeued, it needs to be requeued on the clusters that it was submitted to.
-
Brian Christiansen authored
When a fed job is requeued and new sibling jobs are submitted to the other sibling clusters, the restart_cnt needs to go to the siblings in case the job runs on a remote sibling.
-
Brian Christiansen authored
The federation needs to make a job_desc when requeueing jobs to siblings.
-