- Mar 10, 2016
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Conflicts: NEWS
-
Morris Jette authored
Fix Cray NHC spawning on job requeue. Previous logic would leave nodes allocated to a requeued job as non-usable on job termination. Specifically, each job has a "cleaning/cleaned" flag. Once a job terminates, the cleaning flag is set, then after the job node health check completes, the value gets set to cleaned. If the job is requeued, on its second (or subsequent) termination, the select/cray plugin is called to launch the NHC. The plugin sees the "cleaned" flag already set, logs "error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen", and returns without ever launching the NHC. Since the termination of the job NHC triggers releasing job resources (CPUs, memory, and GRES), those resources are never released for use by other jobs. Bug 2384.
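The flag sequence described above can be modelled with a small sketch. It is an illustration only: the class and function names are hypothetical, and clearing the flag on requeue is an assumed way of fixing the problem, not necessarily what the actual patch does.

```python
from dataclasses import dataclass
from enum import Enum


class CleanState(Enum):
    NONE = "none"
    CLEANING = "cleaning"  # job ended, node health check (NHC) still running
    CLEANED = "cleaned"    # NHC finished, resources released


@dataclass
class Job:
    job_id: int
    clean_state: CleanState = CleanState.NONE
    resources_held: bool = False


def start(job):
    job.resources_held = True


def job_fini(job):
    """Called on every termination; returns True if the NHC was launched."""
    if job.clean_state is CleanState.CLEANED:
        # Failure mode from the commit message: stale flag, NHC never launched.
        return False
    job.clean_state = CleanState.CLEANING
    return True


def nhc_complete(job):
    # Completion of the NHC is what releases the job's resources.
    job.clean_state = CleanState.CLEANED
    job.resources_held = False


def requeue(job, reset_flag=True):
    # Assumed fix: clear the stale flag so the next termination runs the NHC.
    if reset_flag:
        job.clean_state = CleanState.NONE


def run_twice(reset_flag):
    """Run a job, requeue it, run it again; report whether resources leak."""
    job = Job(job_id=1)
    for attempt in range(2):
        start(job)
        if job_fini(job):
            nhc_complete(job)
        if attempt == 0:
            requeue(job, reset_flag)
    return job.resources_held
```

With `reset_flag=False` the second termination skips the NHC and the resources stay held; with `reset_flag=True` they are released.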
-
David Gloe authored
An error in slurmconfgen_smw.py caused it to parse the nic as the nid. On some systems those values differ, causing the generated slurm.conf file to be incorrect. Bug 2532.
-
Tim Wickberg authored
_set_collectors() already has a run_in_daemon("slurmd") that precludes this from being an issue.
-
Bill Brophy authored
route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
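A common remedy for this class of bug is to guard the one-time initialization with a lock. The sketch below shows that pattern in general terms; the names and the chunk width of 50 are assumptions for illustration, and it does not reproduce the actual route plugin code.

```python
import threading

_init_lock = threading.Lock()
_initialized = False
_init_runs = 0       # how many times the setup body actually executed
_tree_width = None   # hypothetical setting computed during initialization


def _ensure_init():
    """Run the one-time setup exactly once, even with concurrent callers."""
    global _initialized, _init_runs, _tree_width
    with _init_lock:
        if not _initialized:
            _tree_width = 50  # assumed illustrative value
            _init_runs += 1
            _initialized = True


def init_runs():
    return _init_runs


def split_hostlist(hosts):
    """Split a list of host names into chunks of at most _tree_width hosts."""
    _ensure_init()
    return [hosts[i:i + _tree_width] for i in range(0, len(hosts), _tree_width)]
```

Every caller takes the lock before inspecting the flag, so no thread can observe a half-initialized state.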
-
Tim Wickberg authored
Was incorrectly displaying "(null)" even when loaded successfully.
-
Morris Jette authored
-
Morris Jette authored
burst_buffer/cray plugin: Prevent a requeued job from being restarted while file stage-out is still in progress. Previous logic could restart the job and not perform a new stage-in. bug 2584, comment #45
-
Morris Jette authored
possible bug in smap Makefile
-
Manuel Rodríguez-Pascual authored
LIBS can already have a value, as described in ./configure --help: "Some influential environment variables: (...) LIBS libraries to pass to the linker, e.g. -l<library>". The original assignment to LIBS overwrote this value. With this change, both the user-defined flags and the NCURSES ones are passed to the linker.
-
- Mar 09, 2016
-
-
Morris Jette authored
Fix Cray NHC spawning on job requeue. Previous logic would leave nodes allocated to a requeued job as non-usable on job termination. Specifically, each job has a "cleaning/cleaned" flag. Once a job terminates, the cleaning flag is set, then after the job node health check completes, the value gets set to cleaned. If the job is requeued, on its second (or subsequent) termination, the select/cray plugin is called to launch the NHC. The plugin sees the "cleaned" flag already set, logs "error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen", and returns without ever launching the NHC. Since the termination of the job NHC triggers releasing job resources (CPUs, memory, and GRES), those resources are never released for use by other jobs. Bug 2384.
-
David Gloe authored
An error in slurmconfgen_smw.py caused it to parse the nic as the nid. On some systems those values differ, causing the generated slurm.conf file to be incorrect. Bug 2532.
-
Morris Jette authored
This matches the documentation
-
- Mar 08, 2016
-
-
Tim Wickberg authored
_set_collectors() already has a run_in_daemon("slurmd") that precludes this from being an issue.
-
Bill Brophy authored
route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
-
Tim Wickberg authored
Was incorrectly displaying "(null)" even when loaded successfully.
-
Morris Jette authored
Capture MCDRAM percentages for various configurations from capmc. This assumes the percentages for various configurations will be identical for all nodes within a cluster.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Janne Blomqvist authored
-
- Mar 07, 2016
-
-
Morris Jette authored
-
Brian Christiansen authored
clang found a null dereference issue, which led to finding the parsing error.
-
Dominik Bartkiewicz authored
Added new job dependency type of "aftercorr" which will start a task of a job array after the corresponding task of another job array completes. bug 2460
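The per-task matching can be sketched as follows. This is a toy model with hypothetical function names and state strings, contrasted with a dependency that waits for the whole upstream array; it is not the scheduler's implementation.

```python
def runnable_aftercorr(dependent_tasks, upstream_state):
    """Task i of the dependent array may start once upstream task i is done."""
    return sorted(
        t for t in dependent_tasks if upstream_state.get(t) == "COMPLETED"
    )


def runnable_after_whole_array(dependent_tasks, upstream_state):
    """For contrast: nothing starts until every upstream task is done."""
    if all(state == "COMPLETED" for state in upstream_state.values()):
        return sorted(dependent_tasks)
    return []


# Upstream array: tasks 0 and 2 finished, task 1 still running.
upstream = {0: "COMPLETED", 1: "RUNNING", 2: "COMPLETED"}
```

With this upstream state, the per-task dependency lets tasks 0 and 2 of the dependent array start, while a whole-array dependency would hold all of them back.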
-
Tim Wickberg authored
In particular, it seems that MariaDB has lowered the default for innodb_lock_wait_timeout, which can cause issues for the various rollup processes on systems with high job counts.
-
- Mar 05, 2016
-
-
Morris Jette authored
Fix some timing issues with respect to rebooting a node, especially a KNL node needing a reboot to change configuration.
-
Danny Auble authored
Previously only gres/gpu would be tracked; now both gres/gpu and gres/gpu:tesla are tracked as separate GRES if configured like AccountingStorageTRES=gres/gpu,gres/gpu:tesla
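As a rough illustration of tracking the generic and the typed GRES side by side, the sketch below tallies allocations against a configured TRES list. The function name is hypothetical, and counting a typed allocation toward both the generic and the typed entry is an assumption about the intended accounting.

```python
def tres_usage(configured, allocations):
    """Tally (name, type, count) allocations against configured TRES names."""
    usage = {name: 0 for name in configured}
    for gres_name, gres_type, count in allocations:
        generic = f"gres/{gres_name}"
        if generic in usage:
            usage[generic] += count
        if gres_type is not None:
            typed = f"{generic}:{gres_type}"
            if typed in usage:
                # Assumption: a typed GPU counts toward both entries.
                usage[typed] += count
    return usage


# Mirrors AccountingStorageTRES=gres/gpu,gres/gpu:tesla from the message above.
configured = ["gres/gpu", "gres/gpu:tesla"]
allocations = [("gpu", "tesla", 2), ("gpu", None, 1)]
```

TRES names that are not configured are simply ignored, so configuring only gres/gpu gives a single combined total.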
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Before, --gres=gpu:tesla required an explicit count (--gres=gpu:tesla:1); now both forms should work.
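A minimal sketch of parsing such a specification with an optional trailing count follows. The function is hypothetical, handles only plain integer counts, and takes the default count of 1 from the equivalence of the two forms stated above.

```python
def parse_gres(spec):
    """Parse 'name[:type][:count]' into (name, type, count)."""
    parts = spec.split(":")
    count = 1  # assumed default when no count is given
    if len(parts) > 1 and parts[-1].isdigit():
        count = int(parts.pop())
    name = parts[0]
    gres_type = parts[1] if len(parts) > 1 else None
    return name, gres_type, count
```

Under this reading, `parse_gres("gpu:tesla")` and `parse_gres("gpu:tesla:1")` return the same tuple.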
-
- Mar 04, 2016
-
-
Danny Auble authored
-
Danny Auble authored
Step GRES value changed from type "int" to "int64_t" to support larger values.
Signed-off-by: Danny Auble <da@schedmd.com>
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
These changes apply to both the main scheduling logic and the backfill scheduler. If some SchedulerParameters value was configured, slurmctld was started, the value was then completely removed, and slurmctld was reconfigured, the value would not be reset to its default; instead, the originally configured value would persist until slurmctld restarted.
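The intended behavior can be sketched as rebuilding the settings from defaults on every (re)configuration instead of updating the previous values in place. The function name is hypothetical, and the default values shown are illustrative assumptions, not taken from the commit.

```python
# Illustrative defaults; the numbers are assumptions for this sketch.
DEFAULTS = {"bf_interval": 30, "bf_window": 1440}


def load_scheduler_params(param_string):
    """Build the parameter table from scratch on every (re)configure."""
    params = dict(DEFAULTS)  # start from defaults, never from the old table
    for item in (param_string or "").split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        if key in params:
            params[key] = int(value)
    return params
```

Because each call starts from a copy of `DEFAULTS`, removing an option from the string and reconfiguring restores its default rather than keeping the old value.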
-
Brian Christiansen authored
Continuation of 31225a82
-