- Oct 05, 2016
-
-
Morris Jette authored
node_features/knl_cray plugin: drain any node not reported by "capmc node_status" on startup or reconfig. Also retest after a failed node restart for a job.
-
Morris Jette authored
node_features/knl_cray plugin: Remove any KNL MCDRAM or NUMA features from node's configuration if capmc does NOT report the node as being KNL. For example, we don't want a non-KNL node with features="quad,cache".
-
- Oct 04, 2016
-
-
Morris Jette authored
Add new knl.conf configuration parameter CapmcRetries. Modify capmc_suspend and capmc_resume to retry operations when Cray State Manager is down. Add retry logic to node_features/knl_cray to handle Cray State Manager being down. bug 3100
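The retry behaviour described above can be sketched roughly as follows. This is only an illustration, not the plugin's code: the function name `run_with_retries`, the use of `RuntimeError` to signal an unavailable state manager, and the reading of the retry count as "extra attempts after the first" are all assumptions, and the real CapmcRetries semantics may differ.

```python
import time


def run_with_retries(operation, retries, delay_seconds=0.0, sleep=time.sleep):
    """Call operation() until it succeeds or the retry budget is spent.

    Hypothetical helper: `retries` is assumed to mean additional attempts
    after the first one, so the operation runs at most retries + 1 times.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation()
        except RuntimeError:
            # Assumed failure signal, e.g. the state manager being down.
            if attempts > retries:
                raise
            sleep(delay_seconds)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.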
-
- Oct 03, 2016
-
-
Dominik Bartkiewicz authored
-
- Sep 30, 2016
-
-
Alejandro Sanchez authored
Otherwise they'll truncate when packed into the RPC and end up as some bizarre value at the controller. Bug 3098.
-
Dominik Bartkiewicz authored
Set completed time for pending/running runaway jobs to the max of (start, eligible, submit) times. Bug 3075
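As a minimal sketch of the rule stated above, assuming timestamps are integer epoch seconds and that an unset time is stored as 0 (both assumptions, as is the function name `runaway_end_time`):

```python
def runaway_end_time(submit_time, eligible_time, start_time):
    """Pick the completion time assigned to a runaway job.

    Illustrative only: takes the latest of the three known timestamps,
    so a job that never started falls back to its eligible or submit time.
    """
    return max(start_time, eligible_time, submit_time)
```

For example, a pending job with submit time 100, eligible time 120, and no start time (0) would be given a completed time of 120.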
-
Artem Polyakov authored
Avoid using slurm_forward_data because it spawns a thread, which introduces unwanted delays. Bug 3102.
-
Tim Wickberg authored
-
- Sep 29, 2016
-
-
Morris Jette authored
-
Alejandro Sanchez authored
Also correct the value of NICE_OFFSET used within the perl API. Bug 3098.
-
Artem Polyakov authored
Bug 3051.
-
Tim Wickberg authored
Otherwise updates would be rejected for running jobs even if there would be no impact. Most common when the job_submit plugin is overriding QOS/GRES values on everything; without this change an update to "comment" or other fields would fail with ESLURM_JOB_NOT_PENDING. Bug 3117.
-
Tim Wickberg authored
Never run NHC when NHC_NO is set, including the edge case in which NHC would still be launched afterwards. Bug 3105.
-
Tim Wickberg authored
Switch to list_for_each, and check whether the access list actually changed after each update before updating last_part_update. This prevents the backfill scheduler from resetting mid-cycle unnecessarily. Bug 3123.
-
- Sep 28, 2016
-
-
Morris Jette authored
Add "sbatch_wait_nodes" to SchedulerParameters to control default sbatch behaviour with respect to waiting for all allocated nodes to be ready for use. Job can override the configuration option using the --wait-all-nodes=# option. bug 3120
-
- Sep 27, 2016
-
-
Morris Jette authored
Prior logic would treat an execute line like this: "$ sbatch --wait-all-nodes -N3 tmp" with "-N3" as being the argument to the "--wait-all-nodes" option. See bug 3120
-
Morris Jette authored
Add salloc/sbatch/srun option --use-min-nodes to prefer smaller node counts when a range of node counts is specified (e.g. "-N 2-4"). bug 2996
-
- Sep 26, 2016
-
-
Morris Jette authored
Add salloc/sbatch/srun --priority option of "TOP" to set job priority to the highest possible value. This option is only available to Slurm operators and administrators. bug 3115
-
- Sep 24, 2016
-
-
Morris Jette authored
bug 3090
-
- Sep 23, 2016
-
-
Morris Jette authored
Make sure no attempt is made to schedule a requeued job until all steps are cleaned (Node Health Check completes for all steps on a Cray). bug 3082
-
- Sep 22, 2016
-
-
Dominik Bartkiewicz authored
Otherwise the limit check compares the node count against the midplane count. Bug 3049.
-
Alejandro Sanchez authored
Check if node names are contiguous with respect to the node list assigned to the partition, rather than just monotonically increasing. Bug 3006.
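The difference between the two checks can be illustrated with a small sketch. It assumes node lists are plain Python lists of names in partition order; the function name `is_contiguous_in_partition` is hypothetical and this is not the controller's actual implementation.

```python
def is_contiguous_in_partition(partition_nodes, selected_nodes):
    """Return True if selected_nodes occupy adjacent slots in the partition.

    Contiguity is judged by position in the partition's node list rather
    than by the names themselves, so gaps in the naming scheme do not matter.
    """
    position = {name: i for i, name in enumerate(partition_nodes)}
    if any(name not in position for name in selected_nodes):
        return False  # a node outside the partition cannot be contiguous
    slots = sorted(position[name] for name in selected_nodes)
    return all(b - a == 1 for a, b in zip(slots, slots[1:]))
```

With a partition of n1, n2, n4, n5, the selection n2, n4 counts as contiguous even though the names skip n3, while n1, n4 is monotonically increasing but not contiguous.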
-
Janne Blomqvist authored
Bugs 2681 and 2703. Conflicts: NEWS
-
Adam Moody authored
-
Alejandro Sanchez authored
license of a certain type.
-
- Sep 21, 2016
-
-
Morris Jette authored
node_features/knl_cray plugin: Increase default CapmcTimeout parameter from 10 to 60 seconds. bug 3100
-
Morris Jette authored
capmc_suspend/resume - If a request to modify NUMA or MCDRAM state on a set of nodes, or to reboot a set of nodes, fails, then just requeue the job and abort the entire operation rather than trying to operate on individual nodes. bug 3100
-
Morris Jette authored
Allow a node's PowerUp state flag to be cleared using update_node RPC. bug 3100
-
Morris Jette authored
When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode), pass the job ID assigned to the nodes to the ResumeProgram in the SLURM_JOB_ID environment variable. bug 3100
-
- Sep 20, 2016
-
-
Morris Jette authored
Don't log error for job end_time being zero if node health check is still running. bug 3053
-
- Sep 17, 2016
-
-
Morris Jette authored
Restore ability to manually power down nodes, broken in 15.08.12 in commit b4904661. The patch introduced in commit b4904661 (not powering down dead node) has a bad side effect. Adding the "(node_ptr->last_idle != 0)" condition prevents powering down nodes with the following command: "scontrol update nodename=nX state=power_down", because the state update function relies on zeroing the "last_idle" variable when a power_down is requested (see src/slurmctld/node_mgr.c, line 1589). Reverting this commit should solve the problem...but I let you decide... Didier GAZEN
-
- Sep 16, 2016
-
-
Morris Jette authored
node_features/knl_cray: If a node is rebooted outside of Slurm's direction, update its active features with current MCDRAM and NUMA mode information. bug 3071
-
- Sep 15, 2016
-
-
Morris Jette authored
Fix race condition that could result in MCDRAM state information coming from capmc rather than cnselect (used state for next boot rather than latest boot). bug 3080
-
Nicolas Joly authored
-
- Sep 14, 2016
-
-
Alejandro Sanchez authored
No functional change, just silencing the warning message in this instance. Bug 3079.
-
Alejandro Sanchez authored
Bug 3073.
-
- Sep 09, 2016
-
-
Morris Jette authored
Modify srun task completion handling to only build the task/node string for logging purposes if it is needed. Modified for performance purposes. bug 3044
-
Tim Wickberg authored
This reverts commit 1ec2a4ae.
-
Alejandro Sanchez authored
Bug 3063.
-
- Sep 08, 2016
-
-
Morris Jette authored
Restructure srun command locking for task_exit processing logic for improved parallelism. This change decreases the amount of time consumed by serial logic by 2 orders of magnitude. bug 3044
-