- Aug 23, 2017
-
-
Alejandro Sanchez authored
Running slurmctld under valgrind while operating with jobcomp/elasticsearch reported the following bytes as definitely lost:

==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
==27403==    by 0x22856A: makespace (xstring.c:114)
==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
==27403==    by 0x83C5BCD: ???
==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)

It turns out that the buffer generated in slurm_jobcomp_log_record was xstrdup'ed to the corresponding job_node->serialized_job, but the original buffer wasn't freed afterwards. The fix changes the transfer so that instead of xstrdup'ing the char * we just assign the pointer and set the buffer to NULL. job_node->serialized_job was already xfree'd properly later, when the job was indexed. Discovered while working on Bug 4065.
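The difference between duplicating the buffer and handing it over can be sketched roughly as follows. This is a minimal illustration only, not the plugin's code: struct job_node here is a simplified stand-in, dup_string(), store_by_copy() and store_by_transfer() are hypothetical names, and plain malloc()/free() replace Slurm's xmalloc family.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's job node; not Slurm's real struct. */
struct job_node {
	char *serialized_job;
};

/* Hypothetical helper: heap-allocated copy of s (plays the role of xstrdup). */
char *dup_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Leak-prone variant: the node gets its own copy, so the caller still owns
 * *buf and must free it. Forgetting that free is the leak described above. */
void store_by_copy(struct job_node *node, char **buf)
{
	assert(node && buf && *buf);
	node->serialized_job = dup_string(*buf);
}

/* Fixed variant: hand the allocation to the node and clear the source
 * pointer, so exactly one owner remains and nothing is duplicated. */
void store_by_transfer(struct job_node *node, char **buf)
{
	assert(node && buf && *buf);
	node->serialized_job = *buf;
	*buf = NULL;
}
```

With the transfer variant, the single free of node->serialized_job that happens later releases the only allocation, which matches the behaviour the commit message describes.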
-
- Aug 22, 2017
-
-
Alejandro Sanchez authored
Otherwise the resulting URL may be invalid. Update documentation while here as well. Bug 4065.
-
Tim Shaw authored
Otherwise a race between threads in _check_node_status leads to a crash. Bug 4093.
-
Tim Wickberg authored
Modification of commit c7e6d864. Bug 4095.
-
Philip Kovacs authored
Bug 4095
-
Philip Kovacs authored
Bug 4094
-
- Aug 21, 2017
-
-
Alejandro Sanchez authored
Given a configuration with TopologyParam including the Dragonfly option, if a job requested a --switches count, the count timeout specified by either the job request or the max_switch_wait SchedulerParameters option was not respected. This was due to the leaf_switch_count variable not being incremented in the _eval_nodes_dfly() function when needed, as is done in _eval_nodes_topo(), the latter being an execution path that already waits correctly for the switch count timeout. Bug 4056
-
- Aug 18, 2017
-
-
Alejandro Sanchez authored
-
- Aug 17, 2017
-
-
Tim Wickberg authored
-
Morris Jette authored
Coverity CID 44649. Bug 4085
-
- Aug 16, 2017
-
-
Danny Auble authored
instead of local. Bug 3546
-
- Aug 14, 2017
-
-
Morris Jette authored
-
Danny Auble authored
This reverts commit 00a691b9.
-
Morris Jette authored
-
- Aug 11, 2017
-
-
Dominik Bartkiewicz authored
-
Danny Auble authored
This will allow Dell's custom syscfg to work correctly. NOTE: Dell calls flat memory just memory. Bug 4034
-
Danny Auble authored
No code change, just moving existing code into a switch ready to handle multiple options. Bug 4034
-
Danny Auble authored
Add SystemType to knl_generic.conf in preparation for making KNL work on a Dell system. This is used to distinguish differences between vendors, such as 'Dell'. Bug 4034
-
Danny Auble authored
Bug 4059
-
Dominik Bartkiewicz authored
-
- Aug 10, 2017
-
-
Danny Auble authored
-
- Aug 07, 2017
-
-
Justin Lecher authored
Starting from glibc-2.25 the macros major and minor are only available from sys/sysmacros.h. This patch uses an autoconf macro to detect the location and includes the header accordingly. Bug 3982.
-
Danny Auble authored
-
Dominik Bartkiewicz authored
Bug 4019
-
- Aug 04, 2017
-
-
Morris Jette authored
truncation of core specification and not reserving the specified cores. Fixes Coverity CID 45174 and 45175. Bug 4053
-
Artem Polyakov authored
matter in production, but in testing it can. Bug 4051
-
Danny Auble authored
-
Marshall Garey authored
Fix mysql plugin to correctly return parent limits for all children. Bug 4050
-
Danny Auble authored
the tree. Bug 4050
-
- Aug 02, 2017
-
-
Marshall Garey authored
Would fail when trying to create the clustername file because the StateSaveLocation path didn't exist yet. Bug 3988
-
Marshall Garey authored
srun jobs that could start immediately and requested multiple partitions didn't run in the highest priority partition if that partition wasn't listed first. It's possible that scontrol show jobs will now show the partition list in priority order, since the job's partition list gets sorted by priority. Bug 4015
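Sorting a job's partition list so that the highest priority partition is tried first could look roughly like the sketch below. It is only an illustration under stated assumptions: struct part is a hypothetical, simplified record, the list is a plain array sorted with qsort() rather than Slurm's own list type, and the function names are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified partition record; Slurm's real struct differs. */
struct part {
	const char *name;
	unsigned priority;
};

/* qsort() comparator: higher priority sorts first (descending order). */
int cmp_part_prio_desc(const void *a, const void *b)
{
	const struct part *pa = a;
	const struct part *pb = b;

	if (pa->priority < pb->priority)
		return 1;
	if (pa->priority > pb->priority)
		return -1;
	return 0;
}

/* Reorder the requested partitions so the scheduler sees the highest
 * priority one first, regardless of the order given on the command line. */
void sort_parts_by_priority(struct part *parts, size_t n)
{
	assert(parts != NULL || n == 0);
	qsort(parts, n, sizeof(*parts), cmp_part_prio_desc);
}
```

A side effect of sorting in place, as the commit notes, is that anything printing the list afterwards sees the priority order instead of the order the user typed.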
-
Dominik Bartkiewicz authored
NULL is returned if the token is not found, so testing against '\0' is wrong (although it does work in older compilers). Fixes new GCC 7.1 warning.
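The kind of check involved can be sketched as follows. The commit does not name the search call, so strstr() is used here purely as a stand-in, and has_token() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true when token occurs in str. strstr() returns NULL when there
 * is no match, so the result must be compared against NULL. */
bool has_token(const char *str, const char *token)
{
	const char *hit;

	assert(str && token);
	hit = strstr(str, token);

	/* Writing `hit == '\0'` happens to compile in C because '\0' is an
	 * integer constant zero, but it compares a pointer with a character
	 * constant, which newer GCC versions warn about. */
	return hit != NULL;
}
```

Comparing against NULL states the intent directly: the question is whether a match exists, not whether the matched text is empty.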
-
- Aug 01, 2017
-
-
Tim Shaw authored
Bug 3999
-
Dominik Bartkiewicz authored
Fix bug in selection of GRES bound to specific CPUs where the GRES count is 2 or more. Previous logic could allocate CPUs not available to the job. Bug 4029
-
- Jul 28, 2017
-
-
Danny Auble authored
to have 'socket=' in AuthInfo to work. This is so that people don't have to update their slurmdbd.conf files when upgrading (and to match the documentation). Continuation of the last commit. Bug 4009
-
Danny Auble authored
connection. Bug 4009
-
Alejandro Sanchez authored
jobcomp/elasticsearch saves/loads its state to/from elasticsearch_state. Since the jobcomp API isn't designed with save/load state operations, the plugin's _save_state() isn't extern and isn't available from outside the plugin itself, so it is tightly coupled to the fini() function. This state doesn't follow the same execution path as the rest of the Slurm states, which are all independently scheduled in save_all_state(). So we save it manually here on an RPC of type REQUEST_CONTROL. This way, when the Primary ctld issues a REQUEST_CONTROL to the Backup, which is currently in controller mode, the Backup saves the state, and when the Primary assumes control again it can process the saved pending jobs. The other direction was already handled: when the Primary is running in controller mode and the Backup issues a REQUEST_CONTROL, the Primary is shut down, and on breaking out of the while(1) loop in the ctld main() function there was already a g_slurm_jobcomp_fini() call in place. Bug 3908
-
Dominik Bartkiewicz authored
-
- Jul 27, 2017
-
-
Alejandro Sanchez authored
When more than one ping cycle is spawned simultaneously (for instance REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the selected nodes), we do not track a separate ping_start time for each cycle. When ping_begin() is called, the information about the previous ping cycle is lost. Then when ping_end() is called for the first of the two cycles, we set ping_start=0, which is incorrectly used to check whether the last cycle ran for more than PING_TIMEOUT seconds (100s), thus incorrectly triggering:

error("Node ping apparently hung, many nodes may be DOWN or configured "
      "SlurmdTimeout should be increased");

Bug 3914
-
- Jul 26, 2017
-
-
Danny Auble authored
-