- Mar 11, 2016
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Return [0-100:2] formatting, rather than [0,2,4,6,8,...], when using a step function. Was inadvertently broken in 14.11 with commit 5ffdca92. Bug 2535.
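The difference between the two output forms can be illustrated with a small sketch. This is not Slurm's formatting code; the function name `format_index_range` and its fallback rules are assumptions made for the example only.

```python
def format_index_range(indices):
    """Format a sorted list of unique integers as a bracketed range string.

    Hypothetical helper, not Slurm code. Uses "lo-hi:step" when the values
    form an arithmetic progression with step > 1, "lo-hi" for consecutive
    values, and a plain comma list otherwise.
    """
    if not indices:
        return "[]"
    if len(indices) == 1:
        return f"[{indices[0]}]"
    step = indices[1] - indices[0]
    # True only when every neighbouring pair is separated by the same step.
    uniform = step > 0 and all(
        b - a == step for a, b in zip(indices, indices[1:])
    )
    if uniform and step == 1:
        return f"[{indices[0]}-{indices[-1]}]"
    if uniform and len(indices) > 2:
        return f"[{indices[0]}-{indices[-1]}:{step}]"
    return "[" + ",".join(str(i) for i in indices) + "]"
```

With this sketch, `format_index_range(list(range(0, 101, 2)))` yields the compact stepped form instead of a 51-element comma list.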
-
- Mar 10, 2016
-
-
Morris Jette authored
-
Morris Jette authored
burst_buffer/cray plugin: Prevent a requeued job from being restarted while file stage-out is still in progress. Previous logic could restart the job and not perform a new stage-in. bug 2584, comment #45
-
- Mar 09, 2016
-
-
Morris Jette authored
Fix Cray NHC spawning on job requeue. Previous logic would leave nodes allocated to a requeued job as non-usable on job termination. Specifically, each job has a "cleaning/cleaned" flag. Once a job terminates, the cleaning flag is set; after the job's node health check (NHC) completes, the value is set to cleaned. If the job is requeued, on its second (or subsequent) termination, the select/cray plugin is called to launch the NHC. The plugin sees the "cleaned" flag already set, logs "error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen", and returns without launching the NHC. Since the termination of the job NHC triggers releasing job resources (CPUs, memory, and GRES), those resources are never released for use by other jobs. Bug 2384
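The failure mode described above can be modelled in a few lines. This is only an illustrative sketch: the `Job` class, `job_fini`, `requeue`, and the `reset_flags` parameter are hypothetical names, and resetting the flags on requeue is an assumption about one way to avoid the problem, not a description of the actual patch.

```python
class Job:
    """Hypothetical job record carrying the cleaning/cleaned state."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.cleaning = False
        self.cleaned = False
        self.nhc_runs = 0  # how many times the health check was launched


def job_fini(job):
    """Mimic the described check: skip the NHC if the job is already cleaned."""
    if job.cleaned:
        return False  # NHC never launched, so resources are never released
    job.cleaning = True
    job.nhc_runs += 1      # stand-in for launching the node health check
    job.cleaning = False
    job.cleaned = True     # NHC completion is what releases resources
    return True


def requeue(job, reset_flags):
    """Requeue a job; reset_flags=True is the assumed remedy."""
    if reset_flags:
        job.cleaning = False
        job.cleaned = False
```

Calling `job_fini` twice with a `requeue(job, reset_flags=False)` in between reproduces the stuck state, while `reset_flags=True` lets the second termination launch the check again.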
-
David Gloe authored
An error in slurmconfgen_smw.py caused it to parse the nic as the nid. On some systems those values differ, causing the generated slurm.conf file to be incorrect. Bug 2532.
-
- Mar 08, 2016
-
-
Tim Wickberg authored
_set_collectors() already has a run_in_daemon("slurmd") that precludes this from being an issue.
-
Bill Brophy authored
route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
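As a generic illustration of the class of bug, the sketch below guards one-time initialization of shared state with a lock so that concurrent callers never observe it half-built. The names `_init_lock`, `_route_state`, and `get_route_state` are hypothetical, and this is not the change made to route_p_split_hostlist.

```python
import threading

_init_lock = threading.Lock()
_route_state = None  # hypothetical shared state, built on first use


def get_route_state():
    """Return the shared state, initializing it exactly once."""
    global _route_state
    with _init_lock:
        # Only one thread at a time can be inside this block, so the
        # check and the assignment cannot interleave with another caller.
        if _route_state is None:
            _route_state = {"initialized": True}
        return _route_state
```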
-
Tim Wickberg authored
Was incorrectly displaying "(null)" even when loaded successfully.
-
Morris Jette authored
-
Janne Blomqvist authored
-
- Mar 07, 2016
-
-
Tim Wickberg authored
In particular, it seems that MariaDB has lowered the default for innodb_lock_wait_timeout, which can cause issues for the various rollup processes on systems with high job counts.
-
- Mar 05, 2016
-
-
Danny Auble authored
-
Danny Auble authored
-
- Mar 04, 2016
-
-
Danny Auble authored
Step GRES value changed from type "int" to "int64_t" to support larger values. Signed-off-by: Danny Auble <da@schedmd.com>
-
Danny Auble authored
-
- Mar 03, 2016
-
-
Danny Auble authored
-
Brian Christiansen authored
Bug 2507
-
Morris Jette authored
Step GRES value changed from type "int" to "int64_t" to support larger values. Previous logic could fail in step allocation values over 32-bits. Other GRES values are 64-bit.
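To see why a 32-bit field is a problem for large values, the following sketch uses Python's ctypes fixed-width integers to show how a value wraps when stored in 32 bits but survives in 64 bits. The helper names and the example count are assumptions for illustration and do not come from the Slurm source.

```python
import ctypes


def as_int32(value):
    """Return value as it would be stored in a signed 32-bit C integer."""
    # ctypes integer types do no overflow checking; the value is truncated.
    return ctypes.c_int32(value).value


def as_int64(value):
    """Return value as it would be stored in a signed 64-bit C integer."""
    return ctypes.c_int64(value).value


# Hypothetical GRES count larger than 2**31 - 1.
requested = 5 * 1024**3
stored_32 = as_int32(requested)  # truncated, no longer equals requested
stored_64 = as_int64(requested)  # preserved
```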
-
Danny Auble authored
slurmstepd to close potential open ones. It was pointed out that slurmd, when using acct_gather_energy/ipmi, links to freeipmi, which could open /dev/ipmi0 as root without the close-on-exec flag set while launching a step, leaving it open in the user's application. This sets the flag on the first 256 file descriptors to mitigate the concern. Reported by Maksym Planeta. Bug 2506
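The mitigation can be sketched with the POSIX fcntl interface. This is an illustration in Python rather than the slurmstepd code; the function name `set_cloexec_on_fds` and its `first_n` and `start` parameters are assumptions, and the sketch only runs on Unix-like systems.

```python
import fcntl


def set_cloexec_on_fds(first_n=256, start=0):
    """Set FD_CLOEXEC on every open descriptor in [start, first_n).

    Returns the list of descriptors that were open and received the flag.
    Hypothetical helper, not taken from Slurm.
    """
    touched = []
    for fd in range(start, first_n):
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError:
            continue  # descriptor is not open, nothing to do
        # Preserve existing descriptor flags and add close-on-exec.
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        touched.append(fd)
    return touched
```

Sweeping a fixed range is a blunt approach, but it does not require knowing which library opened which descriptor, which matches the concern described above.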
-
- Mar 02, 2016
-
-
Gary B Skouson authored
Previous logic tested whatever the job's partition pointer indicated rather than the partition we are trying to run the job in. This bug was introduced in Slurm version 15.08.5, Nov 16, 2015, commit 94f0e948. Bug 2499
-
Danny Auble authored
patch 2d5066e7
-
Thomas Cadeau authored
-
- Mar 01, 2016
-
-
Danny Auble authored
-
Morris Jette authored
bug 2496
-
Tim Wickberg authored
-
Tim Wickberg authored
Distribute only the maintained doc/html and doc/man directories. Reports and notes are a historical artifact that can be found in git tags if necessary, but have no value for modern installations.
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Morris Jette authored
This fixes a bug introduced in commit 52fe3de1 in the event the fork() call fails in slurmstepd.
-
Morris Jette authored
Ensure that a job is completely launched before trying to suspend it. Previous logic would start suspend logic early in the life of the slurmstepd process, after its listening socket was open but before the tasks were launched. This defers the suspend logic until after all prologs and setup complete and the tasks are launched. This is important in the case of gang scheduling, in which newly launched jobs can be immediately suspended. bug 2494
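The idea of deferring a suspend request until launch has finished can be sketched with an event that the launch path sets once tasks are running. The `StepLauncher` class and its methods are hypothetical and only model the ordering described above; they do not reflect slurmstepd's actual structure.

```python
import threading


class StepLauncher:
    """Hypothetical stand-in for a step daemon; not Slurm code."""

    def __init__(self):
        self._launched = threading.Event()
        self.suspended = False

    def mark_tasks_launched(self):
        """Called once prologs and setup are done and tasks are running."""
        self._launched.set()

    def suspend(self, timeout=None):
        """Suspend only after launch completes.

        Blocks until the launch event is set. Returns False, leaving the
        step untouched, if the timeout expires first.
        """
        if not self._launched.wait(timeout):
            return False
        self.suspended = True
        return True
```

A suspend request that arrives early simply waits (or times out) instead of acting on a step whose tasks do not exist yet.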
-
Morris Jette authored
-
- Feb 29, 2016
-
-
Danny Auble authored
Bug 1976
-
- Feb 26, 2016
-
-
Danny Auble authored
-
Tim Wickberg authored
Add note to slurm.conf man page about setting "--cpu_bind=no" as part of SallocDefaultCommand if a TaskPlugin is in use.
-
Maksym Planeta authored
-
Bjørn-Helge Mevik authored
Test 14.10 in the test suite (of slurm 15.08.8, at least) uses $sinfo -tidle -h -o%n to find idle nodes. This only works if NodeHostname == NodeName on the nodes. The following should work regardless of this: $scontrol show hostnames \$($sinfo -tidle -h -o%N)
-
Tim Wickberg authored
-
- Feb 25, 2016
-
-
Tim Wickberg authored
Since the function is inlined, the single definition let GCC build everything properly, but debug builds (which disable inlining) resulted in: slurmstepd: [465.0]: symbol lookup error: (trimmed path)/task_cgroup.so: undefined symbol: val_to_char when running srun --cpu_bind=v. task/affinity had this definition already; task/cgroup didn't.
-
Morris Jette authored
Reported by valgrind running test7.2, but shouldn't cause any real problem
-