Commits · 8172b7dfb06bac6d8ee26960e39b1bd5133548f3 · tud-zih-energy / Slurm

Aug 23, 2017

jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df

Running slurmctld under valgrind while operating with jobcomp/elasticsearch
reported the following bytes definitely lost:

==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
==27403==    by 0x22856A: makespace (xstring.c:114)
==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
==27403==    by 0x83C5BCD: ???
==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)

It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
the corresponding job_node->serialized_job, but the originally generated buffer
wasn't freed afterwards. The fix changes the transfer so that instead of
xstrdup'ing the char * we just assign the pointer and NULL the buffer.

The job_node->serialized_job was already xfree'd properly later when the job
was indexed.

Discovered while working on Bug 4065.
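
The difference between copying the buffer and transferring ownership of it can be sketched in plain C. This is a minimal illustration, not the plugin's code: `struct job_node` here is a hypothetical stand-in with a single field, the function names are invented, and `malloc`/`free` replace Slurm's `xmalloc` wrappers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's queued job record. */
struct job_node {
	char *serialized_job;
};

/* Leaky pattern: the node receives a copy, so the caller still owns
 * *buffer and must free it. Forgetting that free loses the original. */
void attach_by_copy(struct job_node *node, char **buffer)
{
	assert(node && buffer && *buffer);
	size_t len = strlen(*buffer) + 1;
	node->serialized_job = malloc(len);
	if (node->serialized_job)
		memcpy(node->serialized_job, *buffer, len);
}

/* Fixed pattern: hand the pointer over and clear the source, so there
 * is exactly one owner and exactly one free later on. */
void attach_by_transfer(struct job_node *node, char **buffer)
{
	assert(node && buffer && *buffer);
	node->serialized_job = *buffer;
	*buffer = NULL;
}
```

After `attach_by_transfer()` the caller's pointer is NULL, so a later free of it is harmless and the only remaining allocation is the one released when the node itself is freed.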

8172b7df

Aug 22, 2017
- Strip trailing slashes from the JobCompLoc for jobcomp/elasticsearch. · 60eed77f
  Alejandro Sanchez authored 7 years ago
  
  Otherwise the resulting URL may be invalid. Update documentation while here as well. Bug 4065.
  60eed77f
- Change capmc_node_bitmap to a local variable. · b56f12e0
  Tim Shaw authored 7 years ago
  
  Otherwise a race between threads in _check_node_status leads to a crash. Bug 4093.
  b56f12e0
- Fail on EPERM as you would any other error. · a5b47f7b
  Tim Wickberg authored 7 years ago
  
  Modification of commit c7e6d864. Bug 4095.
  a5b47f7b
- In salloc with --uid option, drop supplementary groups before changing UID · c7e6d864
  Philip Kovacs authored 7 years ago
  
  bug 4095
  c7e6d864
- Eliminate -Wformat-truncation warnings · d04fa289
  Philip Kovacs authored 7 years ago
  
  Bug 4094
  d04fa289
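
The trailing-slash stripping described for JobCompLoc in commit 60eed77f above can be sketched as follows. This is a minimal sketch, not the code from the commit: the function name `strip_trailing_slashes` is hypothetical, and the example URL is an assumed value.

```c
#include <assert.h>
#include <string.h>

/* Remove every trailing '/' from url in place and return url.
 * Note: a string made only of slashes becomes empty. */
char *strip_trailing_slashes(char *url)
{
	assert(url);
	size_t len = strlen(url);
	while (len > 0 && url[len - 1] == '/')
		url[--len] = '\0';
	return url;
}
```

With this, a location such as `http://localhost:9200//` would be reduced to `http://localhost:9200` before any path is appended to it.
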
Aug 21, 2017

select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d

Alejandro Sanchez authored 7 years ago

Given a configuration whose TopologyParam includes the Dragonfly option, if a
job requested a --switches count, the count timeout specified by either
the job request or the max_switch_wait SchedulerParameters option was not
respected. This was because the leaf_switch_count variable was not incremented
in the _eval_nodes_dfly() function when needed, as is done in
_eval_nodes_topo(), the latter being an execution path that already waits
correctly for the switch count timeout.

Bug 4056

46c0919d

Aug 18, 2017
- correct type cast · 896e462f
  Alejandro Sanchez authored 7 years ago
  
  896e462f
Aug 17, 2017
- Remove errant newline. · 66784220
  Tim Wickberg authored 7 years ago
  
  66784220
- mpi/mvapich - Buffer being only partially cleared. No failures observed. · e7831316
  Morris Jette authored 7 years ago
  
  Coverity CID 44649 Bug 4085
  e7831316
Aug 16, 2017
- Add 'slurmdbd:' to the accounting plugin to notify message is from dbd · 8014b5a4
  Danny Auble authored 7 years ago
  
  instead of local. Bug 3546
  8014b5a4
Aug 14, 2017
- CRAY - Fix BB to handle type= correctly, regression in 17.02.6. · f151c6c0
  Morris Jette authored 7 years ago
  
  f151c6c0
- Revert "CRAY - Fix BB to handle type= correctly, regression in 17.02.6." · 80a6fa49
  Danny Auble authored 7 years ago
  
  This reverts commit 00a691b9.
  80a6fa49
- CRAY - Fix BB to handle type= correctly, regression in 17.02.6. · 00a691b9
  Morris Jette authored 7 years ago
  
  00a691b9
Aug 11, 2017
- Fix potential coredump introduced by commit 9af4a934 · 8076278b
  Dominik Bartkiewicz authored 7 years ago
  
  8076278b
- Add Dell option to the node_features/knl_generic plugin. · ab5c0900
  Danny Auble authored 7 years ago
  
  This will allow Dell's custom syscfg to work correctly. NOTE: Dell calls flat memory just memory. Bug 4034
  ab5c0900
- Continuation of last commit. · e7f66309
  Danny Auble authored 7 years ago
  
  No code change, just moving existing code into a switch ready to handle multiple options. Bug 4034
  e7f66309
- Add SystemType to knl_generic.conf for knl_generic in preparations for making... · 92184850
  Danny Auble authored 7 years ago
  
  Add SystemType to knl_generic.conf for knl_generic in preparation for making KNL work on a Dell system. This is used to distinguish differences in vendors such as 'Dell'. Bug 4034
  92184850
- Fix overlapping reservation resize · 9af4a934
  Danny Auble authored 7 years ago
  
  Bug 4059
  9af4a934
- Fix incorrect lock levels when creating or updating a reservation · 605d7e1f
  Dominik Bartkiewicz authored 7 years ago
  
  605d7e1f
Aug 10, 2017
- Add fake syscfg file to test output from the Dell flavor. · 356b3272
  Danny Auble authored 7 years ago
  
  356b3272
Aug 07, 2017
- Include sysmacros.h when required for major() and minor(). · 35b505cc
  Justin Lecher authored 7 years ago
  
  Starting from glibc-2.25 the macros major and minor are only available from sys/sysmacros.h. This patch uses an autoconf macro to detect the location and includes the header accordingly. Bug 3982.
  35b505cc
- Make it so the cray/switch plugin grabs new DebugFlags on a reconfigure. · d30f79d1
  Danny Auble authored 7 years ago
  
  d30f79d1
- Close race condition on Slurm structures when setting DebugFlags. · 13b78dd2
  Dominik Bartkiewicz authored 7 years ago
  
  Bug 4019
  13b78dd2
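
The conditional include described in commit 35b505cc above might look like the sketch below. The commit message does not say which autoconf macro is used; this sketch assumes `AC_HEADER_MAJOR`, which defines `MAJOR_IN_SYSMACROS` or `MAJOR_IN_MKDEV` in the generated config header. The helper `split_dev` is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>

/* In a real build these symbols come from the configure-generated
 * config.h. One is defined by hand here (assuming a glibc system) so
 * that the sketch compiles on its own. */
#if !defined(MAJOR_IN_SYSMACROS) && !defined(MAJOR_IN_MKDEV)
#  define MAJOR_IN_SYSMACROS 1
#endif

#if defined(MAJOR_IN_MKDEV)
#  include <sys/mkdev.h>
#elif defined(MAJOR_IN_SYSMACROS)
#  include <sys/sysmacros.h>
#endif

/* Split a device number into its major and minor parts. */
void split_dev(dev_t dev, unsigned int *maj, unsigned int *min)
{
	assert(maj && min);
	*maj = major(dev);
	*min = minor(dev);
}
```
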
Aug 04, 2017
- Correct buffer size used in determining specialized cores to avoid possible · 75bb7c40
  Morris Jette authored 7 years ago
  
  truncation of core specification and not reserving the specified cores. Fixes Coverity CID 45174 and 45175 Bug 4053
  75bb7c40
- Fix issue with pmi[2|x] when TreeWidth=1. This will very likely never · b72096ac
  Artem Polyakov authored 7 years ago
  
  matter in production, but in testing it can. Bug 4051
  b72096ac
- Sort TRES id's on limits when getting them from the database. · 7e55acf7
  Danny Auble authored 7 years ago
  
  7e55acf7
- Continuation of last commit. · 5c2a74a5
  Marshall Garey authored 7 years ago
  
  Fix mysql plugin to correctly return parent limits for all children. Bug 4050
  5c2a74a5
- Fix inherited association 'max' TRES limits combining multiple limits in · ab24f8b4
  Danny Auble authored 7 years ago
  
  the tree. Bug 4050
  ab24f8b4
Aug 02, 2017

Fix starting ctld w/out existing StateSaveLocation · ec78d45a

Marshall Garey authored 7 years ago

Would fail when trying to create the clustername file because the
StateSaveLocation path didn't exist yet.

Bug 3988

ec78d45a

Fix srun jobs to run in high prio partition · 948de46b

Marshall Garey authored 7 years ago

srun jobs that could start immediately and requested multiple partitions
didn't run in the highest priority partition if the highest priority
partition wasn't listed first.

It's possible that scontrol show jobs will now show the partition list
in priority order, since the job's partition list gets sorted by
priority.

Bug 4015

948de46b

Fix strchr return value tests. · a5630a9b

Dominik Bartkiewicz authored 7 years ago

NULL is returned if the token is not found; testing against '\0'
is wrong (although it does work okay in older compilers).

Fixes new GCC 7.1 warning.
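
A minimal sketch of the corrected test follows. The helper name `has_token` is hypothetical and is not taken from the commit. Comparing the pointer against '\0' happens to compile because '\0' is an integer constant with value 0 and so acts as a null pointer constant in C, but it reads as a character test and is what newer compilers warn about.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if token occurs in str. */
bool has_token(const char *str, char token)
{
	assert(str);
	/* Misleading form: strchr(str, token) != '\0'
	 * (compares a pointer with a character constant). */
	return strchr(str, token) != NULL;
}
```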

a5630a9b

Aug 01, 2017
- Increase buffer to handle long /proc/<pid>/stat output · 9f3b04c0
  Tim Shaw authored 7 years ago
  
  Bug 3999
  9f3b04c0
- Fix GRES selection with CPU binding · e94fdf2e
  Dominik Bartkiewicz authored 7 years ago
  
  Fix bug in selection of GRES bound to specific CPUs where the GRES count is 2 or more. Previous logic could allocate CPUs not available to the job. bug 4029
  e94fdf2e
Jul 28, 2017

Partial revert of commit making it possible again to not have · 1f6555c7

Danny Auble authored 7 years ago

to have 'socket=' in AuthInfo to work.

This is to make it so people don't have to update their slurmdbd.conf's
when upgrading (and to match documentation).

Continuation of last commit

Bug 4009

1f6555c7

Fix issue when using an alternate munge key when communicating on a persistent · 591dc036

Danny Auble authored 7 years ago

connection.

Bug 4009

591dc036

jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a

Alejandro Sanchez authored 7 years ago

jobcomp/elasticsearch saves/loads its state to/from elasticsearch_state.  Since
the jobcomp API isn't designed with save/load state operations, the plugin's
_save_state() isn't extern and isn't available from outside the plugin itself,
so it is tightly coupled to the fini() function. This state doesn't follow the
same execution path as the rest of the Slurm states, which are all
independently scheduled in save_all_state(). So we save it manually here on an
RPC of type REQUEST_CONTROL.

As a result, when the Primary ctld issues a REQUEST_CONTROL to the Backup
while the Backup is in controller mode, the Backup saves the state, and when
the Primary assumes control again it can process the saved pending jobs.  The
opposite direction was already handled: when the Primary is running
in controller mode and the Backup issues a REQUEST_CONTROL, the Primary is
shut down, and after breaking out of the while(1) loop in the ctld main()
function there was already a g_slurm_jobcomp_fini() call in place.

Bug 3908

8944b77a

Fix for uninitialized federation lock · eb963179

Dominik Bartkiewicz authored 7 years ago

eb963179

Jul 27, 2017

Fix bug when tracking multiple simultaneous spawned ping cycles · f7463ef5

Alejandro Sanchez authored 7 years ago

When more than 1 ping cycle is spawned simultaneously (for instance
REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the selected nodes),
we do not track a separate ping_start time for each cycle. When ping_begin()
is called, the information about the previous ping cycle is lost. Then when
ping_end() is called for the first of the two cycles, we set ping_start=0,
which is incorrectly used to see if the last cycle ran for more than
PING_TIMEOUT seconds (100s), thus incorrectly triggering the:

error("Node ping apparently hung, many nodes may be DOWN or configured "
"SlurmdTimeout should be increased");

Bug 3914
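
One way to avoid losing track of overlapping cycles is to count the cycles in flight and clear the start time only when the last one ends. The sketch below illustrates that idea and is an assumption about the approach, not the code from the commit: the `sketch_` function names and the counter are hypothetical, and the current time is passed in as a parameter so the logic can be exercised without waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define PING_TIMEOUT 100	/* seconds, as quoted in the commit message */

static int ping_count = 0;	/* cycles currently in flight (assumed counter) */
static time_t ping_start = 0;	/* start of the oldest active cycle, 0 when idle */

/* A new cycle starts; remember the time only if no cycle is active. */
void sketch_ping_begin(time_t now)
{
	if (ping_count++ == 0)
		ping_start = now;
}

/* A cycle ends; reset the start time only when the last one finishes. */
void sketch_ping_end(void)
{
	assert(ping_count > 0);
	if (--ping_count == 0)
		ping_start = 0;
}

/* Report a hang only while a cycle is active and has run too long. */
bool sketch_ping_hung(time_t now)
{
	return ping_count > 0 && difftime(now, ping_start) > PING_TIMEOUT;
}
```

With two overlapping cycles, the first `sketch_ping_end()` leaves the start time untouched, so the timeout check is never evaluated against a start time of 0.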

f7463ef5

Jul 26, 2017
- Fix issue where UnkillableStepProgram if step was in an ending state. · 9f48e07c
  Danny Auble authored 7 years ago
  
  9f48e07c