- Mar 21, 2016
-
-
Morris Jette authored
burst_buffer/cray: Set environment variables just before starting job rather than at job submission time to reflect persistent buffers created or modified while the job is pending. bug 2545
-
Danny Auble authored
buffer is found. Bug 2576. A function was taking the same read lock twice, which is not great to begin with but is harmless as long as only read locks are involved. The problem was that after the first read lock was taken, a different thread requested a write lock, so when the second read lock request came in, it deadlocked.
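The deadlock described above can be reproduced in miniature. The sketch below is not Slurm's locking code: it assumes a writer-preferring read/write lock (new readers wait while a writer is queued), and the class and function names are hypothetical. Timeouts stand in for the hang so the example terminates.

```python
import threading
import time


class WriterPreferringRWLock:
    """Hypothetical read/write lock in which a queued writer blocks new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._writers_waiting = 0

    def writers_waiting(self):
        with self._cond:
            return self._writers_waiting

    def acquire_read(self, timeout=None):
        with self._cond:
            # A new reader must wait for active AND queued writers.
            ok = self._cond.wait_for(
                lambda: not self._writer and self._writers_waiting == 0, timeout)
            if ok:
                self._readers += 1
            return ok

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self, timeout=None):
        with self._cond:
            self._writers_waiting += 1
            try:
                ok = self._cond.wait_for(
                    lambda: not self._writer and self._readers == 0, timeout)
                if ok:
                    self._writer = True
                return ok
            finally:
                self._writers_waiting -= 1
                self._cond.notify_all()

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def demo_double_read_lock():
    """Return (second_read_acquired, writer_acquired)."""
    lock = WriterPreferringRWLock()
    writer_result = []
    lock.acquire_read()  # first read lock, held by this thread

    def writer():
        got = lock.acquire_write(timeout=2.0)
        writer_result.append(got)
        if got:
            lock.release_write()

    thread = threading.Thread(target=writer)
    thread.start()
    while lock.writers_waiting() == 0:  # wait until the writer is queued
        time.sleep(0.01)

    # Second read lock from the same thread: blocked behind the writer, which
    # is itself blocked behind the first read lock. Without a timeout this hangs.
    second = lock.acquire_read(timeout=0.2)
    if second:
        lock.release_read()
    lock.release_read()  # dropping the first read lock lets the writer proceed
    thread.join()
    return second, writer_result[0]
```

The second read request fails only because a writer arrived between the two read requests, which matches why a double read lock looks harmless until a write lock is involved.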
-
- Mar 18, 2016
-
-
Morris Jette authored
Avoid possibly aborting srun that gets simultaneous SIGSTOP+SIGCONT while creating the job step. When that happens, the signal handler gets an argument (the signal received) of zero. Here's a log, window 1:
$ srun hostname
srun: Job step creation temporarily disabled, retrying
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 0
srun: Cancelled pending job step
Window 2:
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
....
bug 2494
-
- Mar 17, 2016
-
-
Tim Wickberg authored
The uid is used as part of the hash function, so the old reference must be removed and the hash recalculated whenever the uid may change. Otherwise _delete_assoc_hash will not find it again when the association is removed, causing slurmctld to segfault. Bug 2560.
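The failure mode is generic to any hash table whose key can be mutated in place. The following is a minimal sketch, not Slurm's assoc_mgr code: the class names, the bucket count, and the `uid % NBUCKETS` hash are all assumptions made for illustration.

```python
NBUCKETS = 8  # hypothetical table size


class Assoc:
    """Hypothetical stand-in for an association record."""

    def __init__(self, assoc_id, uid):
        self.assoc_id = assoc_id
        self.uid = uid


def bucket_of(assoc):
    # The uid participates in the hash, as in the commit message.
    return assoc.uid % NBUCKETS


class AssocHash:
    def __init__(self):
        self.buckets = [[] for _ in range(NBUCKETS)]

    def add(self, assoc):
        self.buckets[bucket_of(assoc)].append(assoc)

    def delete(self, assoc):
        """Remove assoc; return False if it is not in the bucket its uid maps to."""
        bucket = self.buckets[bucket_of(assoc)]
        if assoc in bucket:
            bucket.remove(assoc)
            return True
        return False

    def set_uid(self, assoc, new_uid):
        # Correct order: unlink under the old uid, change it, re-insert.
        self.delete(assoc)
        assoc.uid = new_uid
        self.add(assoc)


table = AssocHash()

stale = Assoc(1, 1000)
table.add(stale)
stale.uid = 1003                    # changed in place: entry is now in the wrong bucket
lost = not table.delete(stale)      # lookup goes to the new bucket and misses

fresh = Assoc(2, 1000)
table.add(fresh)
table.set_uid(fresh, 1003)          # remove, update, re-add
found = table.delete(fresh)
```

In this sketch a missed delete only returns False; in C, the stale entry left behind points at freed memory, which is consistent with the segfault described above.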
-
- Mar 16, 2016
-
-
Morris Jette authored
Previous gang scheduling logic maintained information about resources originally allocated to the job and made scheduling decisions on that basis. bug 2494
-
Morris Jette authored
Update gang scheduling table when job manually suspended or resumed. Prior logic could mess up job suspend/resume sequencing. bug 2494
-
Danny Auble authored
time. https://bugs.schedmd.com/show_bug.cgi?id=2547 The code just wasn't fully baked before and was probably written before a lot of the other supporting code was done, i.e. assoc_mgr_set_assoc|qos_tres_cnt were done specifically for this kind of thing. Many of the usage structures weren't realloced either, and the tres_cnt local to each qos and assoc wasn't updated. So all in all pretty bad code - bad Danny. This makes sure everything is set up correctly and no memory corruption happens.
-
Morris Jette authored
Generate burst buffer use completion email immediately after teardown completes rather than at job purge time (likely minutes later). bug 2539
-
Morris Jette authored
Change burst buffer use completion message from "SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to "SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"
-
- Mar 15, 2016
-
-
Alejandro Sanchez authored
-
Tim Wickberg authored
Bug 2543.
-
- Mar 14, 2016
-
-
Danny Auble authored
on only one port like TopologyParam=NoInAddrAny does for everything else.
-
Tim Wickberg authored
There's no /proc on *BSD, and BSD handles OOM in a completely different way.
-
- Mar 11, 2016
-
-
Tim Wickberg authored
Return [0-100:2] formatting, rather than [0,2,4,6,8,...] when using a step function. Was inadvertently broken in 14.11 with commit 5ffdca92. Bug 2535.
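To illustrate the two output forms, here is a small sketch of compressing a list of integers into range notation with an optional step. It is not the Slurm implementation: the function name, the greedy grouping, and the rule that a stepped run needs at least three members are assumptions.

```python
def format_ranges(values):
    """Format integers as "[a-b,c,d-e:step]", collapsing evenly spaced runs."""
    vals = sorted(set(values))
    parts = []
    i = 0
    n = len(vals)
    while i < n:
        if i + 1 < n:
            step = vals[i + 1] - vals[i]
            j = i + 1
            # Extend the run while the spacing stays constant.
            while j + 1 < n and vals[j + 1] - vals[j] == step:
                j += 1
            count = j - i + 1
            if step == 1:
                parts.append(f"{vals[i]}-{vals[j]}")
                i = j + 1
                continue
            if count >= 3:
                # Stepped run: only worthwhile with three or more members.
                parts.append(f"{vals[i]}-{vals[j]}:{step}")
                i = j + 1
                continue
        parts.append(str(vals[i]))
        i += 1
    return "[" + ",".join(parts) + "]"


print(format_ranges(range(0, 101, 2)))  # [0-100:2]
print(format_ranges([1, 2, 3, 7]))      # [1-3,7]
```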
-
- Mar 10, 2016
-
-
Morris Jette authored
-
- Mar 09, 2016
-
-
Morris Jette authored
Fix Cray NHC spawning on job requeue. Previous logic would leave nodes allocated to a requeued job as non-usable on job termination. Specifically, each job has a "cleaning/cleaned" flag. Once a job terminates, the cleaning flag is set, then after the job node health check completes, the value gets set to cleaned. If the job is requeued, on its second (or subsequent) termination, the select/cray plugin is called to launch the NHC. The plugin sees the "cleaned" flag already set, logs "error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen", and returns without ever launching the NHC. Since the termination of the job NHC triggers releasing job resources (CPUs, memory, and GRES), those resources are never released for use by other jobs. Bug 2384
-
David Gloe authored
An error in slurmconfgen_smw.py caused it to parse the nic as the nid. On some systems those values differ, causing the generated slurm.conf file to be incorrect. Bug 2532.
-
- Mar 08, 2016
-
-
Bill Brophy authored
route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
-
Tim Wickberg authored
Was incorrectly displaying "(null)" even when loaded successfully.
-
- Mar 05, 2016
-
-
Danny Auble authored
-
- Mar 04, 2016
-
-
Danny Auble authored
-
- Mar 03, 2016
-
-
Danny Auble authored
-
Brian Christiansen authored
Bug 2507
-
Morris Jette authored
Step GRES value changed from type "int" to "int64_t" to support larger values. Previous logic could fail for step allocation values that do not fit in 32 bits. Other GRES values are 64-bit.
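The effect of the narrower type can be seen by forcing a value through a 32-bit signed integer. This is only an illustration using Python's ctypes, which wraps silently on overflow; the helper names are hypothetical and the exact behavior of the original C code is not shown in the commit message.

```python
import ctypes


def store_as_int32(value):
    """Value as it would read back from a 32-bit signed field (wraps around)."""
    return ctypes.c_int32(value).value


def store_as_int64(value):
    """Value as it would read back from a 64-bit signed field."""
    return ctypes.c_int64(value).value


requested = 3_000_000_000  # larger than INT32_MAX (2_147_483_647)

print(store_as_int32(requested))  # -1294967296
print(store_as_int64(requested))  # 3000000000
```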
-
Danny Auble authored
slurmstepd to close potential open ones. It was pointed out that slurmd, when using acct_gather_energy/ipmi, links to freeipmi, which could open /dev/ipmi0 as root without the close-on-exec flag set while launching a step, leaving it open in the user's application. This sets the flag on the first 256 file descriptors to mitigate the concern. Reported by Maksym Planeta. Bug 2506
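The mitigation amounts to walking the low-numbered descriptors and setting FD_CLOEXEC on each open one. A rough sketch of that idea using the standard fcntl module follows; the function names are hypothetical, and skipping descriptors 0 to 2 (stdin, stdout, stderr) is an assumption, since the commit message does not say where the range starts.

```python
import fcntl


def is_cloexec(fd):
    """Return True if the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def set_cloexec_on_low_fds(limit=256, first=3):
    """Set FD_CLOEXEC on every open descriptor in [first, limit).

    Returns the descriptors that were found open. Starting at 3 is an
    assumption made here so that stdin/stdout/stderr survive exec.
    """
    marked = []
    for fd in range(first, limit):
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError:
            continue  # descriptor is not open
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        marked.append(fd)
    return marked
```

Setting the flag rather than closing the descriptor keeps it usable in the current process while making sure it is not inherited by whatever is exec'd next.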
-
- Mar 02, 2016
-
-
Gary B Skouson authored
Previous logic tested whatever the job's partition pointer indicated rather than the partition we are trying to run the job in. This bug was introduced in Slurm version 15.08.5, Nov 16, 2015, commit 94f0e948. Bug 2499
-
Thomas Cadeau authored
-
- Mar 01, 2016
-
-
Tim Wickberg authored
-
Morris Jette authored
Ensure that a job is completely launched before trying to suspend it. Previous logic would start suspend logic early in the life of the slurmstepd process, after its listening socket was open but before the tasks were launched. This defers the suspend logic until after all prologs and setup complete and the tasks are launched. This is important in the case of gang scheduling, in which newly launched jobs can be immediately suspended. bug 2494
-
- Feb 26, 2016
-
-
Danny Auble authored
-
Tim Wickberg authored
Add note to slurm.conf man page about setting "--cpu_bind=no" as part of SallocDefaultCommand if a TaskPlugin is in use.
-
- Feb 25, 2016
-
-
Danny Auble authored
was also given.
-
- Feb 24, 2016
-
-
Danny Auble authored
a partition.
-
Danny Auble authored
This also reverts most of commit fa331e30 as well as commit bd9fa830 which would try to set the pn_min_cpus every time a job was updated. If a job didn't request node counts then they were hosed. This commit takes away the magic which was screwing things up. Now the person gets what they asked for without magic changing things. Bug 2302 Bug 2742 Bug 2478
-
Danny Auble authored
erroneously.
-
Danny Auble authored
-
Danny Auble authored
-
- Feb 23, 2016
-
-
Danny Auble authored
This whole process could probably be done better by keeping track of old values and new values and only calling one function instead of a pre and post function, but that can probably wait for future generations of the code as it works now and is probably adequate for the time being. Bug 2352
-
- Feb 19, 2016
-
-
Morris Jette authored
BurstBuffer/cray - Defer job cancellation or time limit while "pre-run" operation in progress to avoid inconsistent state due to multiple calls to job termination functions. bug 2454
-
Morris Jette authored
-