- May 13, 2016
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Make test more robust for compute nodes with large CPU counts.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
This change fixes Slurm's ability to optimize resource selection for a job requesting feature counts when some of those node features are currently inactive (a node reboot is required to make them available).
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
# Conflicts:
#   src/common/slurm_acct_gather_energy.c
#   src/common/slurm_acct_gather_filesystem.c
#   src/common/slurm_acct_gather_infiniband.c
#   src/common/slurm_acct_gather_profile.c
#   src/common/slurm_jobacct_gather.c
-
Danny Auble authored
when in use. The problem was that the polling threads in the various acct_gather code paths were detached and could still be polling after the plugin had been unloaded, causing a segfault with a backtrace like this:
  #0 0x00007fe7af008c00 in ?? ()
  #1 0x00007fe7b1138479 in __nptl_deallocate_tsd () at pthread_create.c:175
  #2 0x00007fe7b11398b0 in __nptl_deallocate_tsd () at pthread_create.c:326
  #3 start_thread (arg=0x7fe7b1f12700) at pthread_create.c:346
  #4 0x00007fe7b0e6fb5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
The fix was to make the threads joinable (non-detached) and join them before calling dlclose().
-
Morris Jette authored
Whenever possible, avoid allocating nodes that require a reboot. Previous logic failed to re-sort the job set table based upon the need for rebooting to achieve the desired features (e.g. KNL MCDRAM or CACHE mode). Bug 2726.
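One way to picture the re-sort is a comparator that orders candidate sets so those needing no reboot are considered first. The sketch below is an assumption-laden illustration: `struct node_set`, its fields, and `sort_node_sets` are hypothetical and do not reflect Slurm's real data structures.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical candidate set; Slurm's real table has different fields. */
struct node_set {
    int  node_cnt;      /* nodes available in this set */
    bool needs_reboot;  /* true if the requested feature is inactive */
};

/* Order sets so that those usable without a reboot come first. */
static int cmp_node_set(const void *a, const void *b)
{
    const struct node_set *x = a;
    const struct node_set *y = b;

    if (x->needs_reboot != y->needs_reboot)
        return x->needs_reboot ? 1 : -1;
    return 0;
}

void sort_node_sets(struct node_set *sets, size_t cnt)
{
    /* qsort() is not stable, so the order within each group is unspecified. */
    qsort(sets, cnt, sizeof(*sets), cmp_node_set);
}
```

Walking the sorted table front to back then fills a job's request from reboot-free nodes before falling back to nodes that must be rebooted.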
-
- May 12, 2016
-
-
Morris Jette authored
Put header files in alphabetical order. No change in logic.
-
Danny Auble authored
# Conflicts:
#   src/slurmctld/controller.c
-
Danny Auble authored
-
Danny Auble authored
trying to verify the cluster name (which may try to /create/ files or directories) *before* dropping privileges results in a fatal error, because slurmctld's later attempts to create items ultimately fail. Moving this step until after the privileges and uid have changed allows the process to succeed. Reported by Jon Nelson <jdnelson@dyn.com>. Bug 2728.
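The ordering issue can be illustrated with a small sketch: switch to the target gid and uid first, then run any step that may create files, so that everything on disk ends up owned by the unprivileged user. The names `start_daemon`, `init_step_fn`, and `record_euid_step` are hypothetical, and the real slurmctld startup involves more work than shown here.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical type for a startup step that may create files. */
typedef int (*init_step_fn)(void);

/* Illustrative step: records which effective uid it ran under. */
static uid_t euid_seen_by_step = (uid_t) -1;

int record_euid_step(void)
{
    euid_seen_by_step = geteuid();
    return 0;
}

/* Drop privileges first, then run the file-creating step. */
int start_daemon(uid_t uid, gid_t gid, init_step_fn step)
{
    /* Change the group before the user: once setuid() has succeeded,
     * the process may no longer be permitted to change its gid. */
    if (setgid(gid) != 0)
        return -1;
    if (setuid(uid) != 0)
        return -1;

    /* Files created from here on belong to the target user, so later
     * accesses under that uid will not fail on ownership. */
    return step();
}
```

If the step ran before the `setuid()` call, any files it created would be owned by the original (typically root) user and could be unwritable afterward.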
-
Morris Jette authored
Reject invalid step at submit time rather than leaving it queued. Bug 2722 describes one of the use cases triggering the bug.
-
Morris Jette authored
-
Morris Jette authored
Minor update to commit 2fad3bcf. This leaves the files locked until the file write completes.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Morris Jette authored
Disable a job deadline test with no input time limit if the default partition has a default time limit.
-
Morris Jette authored
This partially restores commit 03b2cfb5. The logic was not closing the file descriptor, which left the file locked and leaked an open file descriptor.
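The lock-write-close sequence at issue can be sketched as below. This is an illustration under assumptions, not the code from either commit: `write_locked` is a hypothetical helper, and it uses a POSIX `fcntl()` advisory lock, which may differ from the locking wrapper Slurm actually uses.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write data to path while holding a write lock. */
int write_locked(const char *path, const char *data)
{
    struct flock lock;
    size_t len = strlen(data);
    int rc = 0;
    int fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return -1;

    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;   /* l_start = l_len = 0 covers the whole file */

    /* Hold the lock for the entire truncate-and-write sequence. */
    if (fcntl(fd, F_SETLKW, &lock) < 0)
        rc = -1;
    else if (ftruncate(fd, 0) < 0)
        rc = -1;
    else if (write(fd, data, len) != (ssize_t) len)
        rc = -1;

    /* Close on every path: this releases the lock and the descriptor.
     * Skipping it leaves the file locked and leaks the descriptor. */
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Funnelling every exit through the single `close()` keeps the lock held until the write finishes and guarantees the descriptor is released afterward.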
-
- May 11, 2016
-
-
Danny Auble authored
tasks-per-node/nodes != tasks print warning and ignore ntasks-per-node. Bug 2520
-
Brian Christiansen authored
On a Cray, the output file isn't being created the second time.
-
Morris Jette authored
Make test_id in more tests be just the numeric value rather than "test#.#" for consistency with the other tests.
-
Morris Jette authored
Make test_id in test30.1 be just the numeric value rather than "test30.1" for consistency with the other tests.
-
Brian Christiansen authored
The account still had maxnodes=1 set, preventing the QOS grpnodes limit from taking effect. This showed up on slower machines because it takes a second for the changes to reach the controller.
-
Morris Jette authored
-
Morris Jette authored
The test originally tried to start more jobs than default_queue_depth in SchedulerParameters allows, and failed.
-
Morris Jette authored
Job was failing on Cray/kachina due to timeout. Increase job time limit from 1 to 2 minutes.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
make it to the slurmctld when using message aggregation.
-
Danny Auble authored
-
- May 10, 2016
-
-
Danny Auble authored
make sure we handle it correctly when the database comes back up.
-
Morris Jette authored
Give test job an extra second to start. Test was failing by one second on kachina.
-
Morris Jette authored
Get the maximum file pathname size from system include file rather than local #define. This was causing failures on kachina test.
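As a rough illustration of the approach, the sketch below sizes path buffers from `PATH_MAX` in `<limits.h>` instead of a project-local constant. `build_path` is a hypothetical helper, and the `#ifndef` fallback is an assumption added only because POSIX permits `PATH_MAX` to be undefined on some systems.

```c
#include <limits.h>
#include <stdio.h>

/* POSIX allows PATH_MAX to be left undefined; this fallback value is an
 * assumption for such systems, not something taken from the commit. */
#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

/* Hypothetical helper: join dir and file into buf.
 * Returns 0 on success, -1 if the result would not fit. */
int build_path(char *buf, size_t size, const char *dir, const char *file)
{
    int n = snprintf(buf, size, "%s/%s", dir, file);

    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

A caller would declare `char path[PATH_MAX];` and pass `sizeof(path)`, so the buffer tracks the platform's limit rather than a hard-coded guess.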
-
Morris Jette authored
-
Danny Auble authored
-