Commits · 8172b7dfb06bac6d8ee26960e39b1bd5133548f3 · tud-zih-energy / Slurm

Aug 23, 2017

jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df

Running slurmctld under valgrind while operating with jobcomp/elasticsearch
reported the following bytes definitely lost:

==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
==27403==    by 0x22856A: makespace (xstring.c:114)
==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
==27403==    by 0x83C5BCD: ???
==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)

It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
the corresponding job_node->serialized_job, but the originally generated buffer
wasn't freed afterwards. The fix changes the transfer so that, instead of
xstrdup'ing the char *, we just assign the pointer and NULL the buffer.

The job_node->serialized_job was already xfree'd properly later when the job
was indexed.

Discovered while working on Bug 4065.

8172b7df
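
The ownership transfer described in 8172b7df can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the struct layout and the helper names `transfer_buffer` and `release_node` are hypothetical, and plain `free` stands in for Slurm's `xfree`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the job_node structure named in the commit. */
struct job_node {
	char *serialized_job;
};

/*
 * Hand the generated buffer to the node instead of duplicating it.
 * The caller's pointer is set to NULL so exactly one owner remains
 * and nothing is left behind to leak.
 */
static void transfer_buffer(struct job_node *node, char **buffer)
{
	assert(node && buffer);
	node->serialized_job = *buffer;
	*buffer = NULL;
}

/* Later cleanup by the single owner (free stands in for xfree). */
static void release_node(struct job_node *node)
{
	assert(node);
	free(node->serialized_job);
	node->serialized_job = NULL;
}
```

With a duplicate-and-forget transfer, both the copy and the original would need freeing; with the pointer hand-off, only the node's later cleanup is required.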

Aug 22, 2017
- Strip trailing slashes from the JobCompLoc for jobcomp/elasticsearch. · 60eed77f
  Alejandro Sanchez authored 7 years ago
  
  Otherwise the resulting URL may be invalid. Update documentation while here as well. Bug 4065.
  60eed77f
- Change capmc_node_bitmap to a local variable. · b56f12e0
  Tim Shaw authored 7 years ago
  
  Otherwise a race between threads in _check_node_status leads to a crash. Bug 4093.
  b56f12e0
- Eliminate -Wformat-truncation warnings · d04fa289
  Philip Kovacs authored 7 years ago
  
  Bug 4094
  d04fa289
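
The trailing-slash handling from 60eed77f above could look something like the sketch below. The function name `strip_trailing_slashes` is hypothetical, and the commit may implement the stripping differently.

```c
#include <assert.h>
#include <string.h>

/*
 * Remove every trailing '/' from url, in place.
 * Hypothetical helper; not taken from the Slurm source.
 */
static void strip_trailing_slashes(char *url)
{
	size_t len;

	assert(url);
	len = strlen(url);
	while (len > 0 && url[len - 1] == '/')
		url[--len] = '\0';
}
```

Stripping first means a path component can later be appended with a single '/' separator without producing a doubled slash.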
Aug 21, 2017

select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d

Alejandro Sanchez authored 7 years ago

Given a configuration whose TopologyParam includes the Dragonfly option, if a
job requested --switches count, the count timeout specified by either
the job request or the max_switch_wait SchedulerParameters option was not
respected. This was because the leaf_switch_count variable was not incremented
in the _eval_nodes_dfly() function when needed, as is done in
_eval_nodes_topo(), the latter being an execution path that already waits
correctly for the switch count timeout.

Bug 4056

46c0919d

Aug 17, 2017
- mpi/mvapich - Buffer being only partially cleared. No failures observed. · e7831316
  Morris Jette authored 7 years ago
  
  Coverity CID 44649 Bug 4085
  e7831316
Aug 16, 2017
- Add 'slurmdbd:' to the accounting plugin to notify message is from dbd · 8014b5a4
  Danny Auble authored 7 years ago
  
  instead of local. Bug 3546
  8014b5a4
Aug 14, 2017
- CRAY - Fix BB to handle type= correctly, regression in 17.02.6. · f151c6c0
  Morris Jette authored 7 years ago
  
  f151c6c0
- Revert "CRAY - Fix BB to handle type= correctly, regression in 17.02.6." · 80a6fa49
  Danny Auble authored 7 years ago
  
  This reverts commit 00a691b9.
  80a6fa49
- CRAY - Fix BB to handle type= correctly, regression in 17.02.6. · 00a691b9
  Morris Jette authored 7 years ago
  
  00a691b9
Aug 11, 2017

Add Dell option to the node_features/knl_generic plugin. · ab5c0900

Danny Auble authored 7 years ago

This will allow Dell's custom syscfg to work correctly.

NOTE: Dell calls flat memory just memory.

Bug 4034

ab5c0900

Continuation of last commit. · e7f66309

Danny Auble authored 7 years ago

No code change, just moving existing code into a switch ready to handle
multiple options.

Bug 4034

e7f66309

Add SystemType to knl_generic.conf for knl_generic in preparation for making... · 92184850

Danny Auble authored 7 years ago

Add SystemType to knl_generic.conf for knl_generic in preparation for making KNL work on a Dell system.

Add SystemType to knl_generic.conf.  This is used to distinguish
differences in vendors such as 'Dell'.

    Bug 4034

92184850

Aug 10, 2017
- Add fake syscfg file to test output from the Dell flavor. · 356b3272
  Danny Auble authored 7 years ago
  
  356b3272
Aug 07, 2017
- Include sysmacros.h when required for major() and minor(). · 35b505cc
  Justin Lecher authored 7 years ago
  
  Starting from glibc-2.25 the macros major and minor are only available from sys/sysmacros.h. This patch uses an autoconf macro to detect the location and includes the header accordingly. Bug 3982.
  35b505cc
- Make it so the cray/switch plugin grabs new DebugFlags on a reconfigure. · d30f79d1
  Danny Auble authored 7 years ago
  
  d30f79d1
- Close race condition on Slurm structures when setting DebugFlags. · 13b78dd2
  Dominik Bartkiewicz authored 7 years ago
  
  Bug 4019
  13b78dd2
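
For the sysmacros.h change in 35b505cc above, a conditional include along these lines is one common approach. The commit only says that an autoconf macro is used. Assuming that macro is `AC_HEADER_MAJOR` (which defines `MAJOR_IN_MKDEV` or `MAJOR_IN_SYSMACROS`) is a guess, and `split_dev` is a hypothetical helper.

```c
#include <assert.h>
#include <sys/types.h>

/*
 * Assumption: in a real build, config.h (generated by an autoconf check
 * such as AC_HEADER_MAJOR) defines one of these macros. It is hard-coded
 * here only so the sketch compiles on a glibc system without configure.
 */
#ifndef MAJOR_IN_MKDEV
#define MAJOR_IN_SYSMACROS 1
#endif

#ifdef MAJOR_IN_MKDEV
#  include <sys/mkdev.h>
#endif
#ifdef MAJOR_IN_SYSMACROS
#  include <sys/sysmacros.h>
#endif

/* Split a device number into its major and minor parts. */
static void split_dev(dev_t dev, unsigned int *maj, unsigned int *min)
{
	assert(maj && min);
	*maj = major(dev);
	*min = minor(dev);
}
```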
Aug 04, 2017
- Sort TRES id's on limits when getting them from the database. · 7e55acf7
  Danny Auble authored 7 years ago
  
  7e55acf7
- Continuation of last commit. · 5c2a74a5
  Marshall Garey authored 7 years ago
  
  Fix mysql plugin to correctly return parent limits for all children. Bug 4050
  5c2a74a5
- Fix inherited association 'max' TRES limits combining multiple limits in · ab24f8b4
  Danny Auble authored 7 years ago
  
  the tree. Bug 4050
  ab24f8b4
Aug 01, 2017
- Increase buffer to handle long /proc/<pid>/stat output · 9f3b04c0
  Tim Shaw authored 7 years ago
  
  Bug 3999
  9f3b04c0
Jul 28, 2017

Partial revert of commit making it possible again to not have · 1f6555c7

Danny Auble authored 7 years ago

to have 'socket=' in AuthInfo to work.

This is to make it so people don't have to update their slurmdbd.conf's
when upgrading (and to match documentation).

Continuation of last commit

Bug 4009

1f6555c7

Jul 26, 2017

Fix regression in commit that would put the stepd pid into the... · f28b1a97

Dominik Bartkiewicz authored 7 years ago

Fix regression in commit e5c05549 that would put the stepd pid into the memory cgroup instead of the task's pid.

After e5c05549, the result of getpid() was put into the cgroup. Before
e5c05549 this was done in the child of the fork, where getpid() returns
the task's pid, but moving the call into the parent broke this logic.

This patch adds pid to the input parameters of task_g_pre_launch_priv
so that the correct pid can be used.

f28b1a97
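
The pid mix-up described in f28b1a97 comes down to where getpid() is called relative to fork(). A small sketch follows. The helper name `spawn_and_compare` is hypothetical and this is not the slurmstepd code.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child and report, from the parent's side, both the child's pid
 * (fork's return value) and the parent's own getpid().
 * Returns 0 on success, -1 if fork failed.
 */
static int spawn_and_compare(pid_t *child_pid, pid_t *parent_pid)
{
	pid_t pid;

	assert(child_pid && parent_pid);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(0);	/* child: getpid() here would be the task's pid */

	/* parent: getpid() is the parent's pid, never the child's */
	*parent_pid = getpid();
	*child_pid = pid;
	waitpid(pid, NULL, 0);
	return 0;
}
```

Code that runs in the parent therefore has to be given the child's pid explicitly, which is why passing pid as a parameter fixes the problem.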

Jul 19, 2017

Prevent slurmctld abort with gres socket binding · c850ccf4

Morris Jette authored 7 years ago

Fix for possible slurmctld abort with use of salloc/sbatch/srun
    --gres-flags=enforce-binding option.
Bug 4008

c850ccf4

Jul 07, 2017
- Follow on to last commit. Make it so no job that is currently pending · 66201036
  Danny Auble authored 7 years ago
  
  will have a time displayed when truncating time. Bug 3940.
  66201036
- Set job/step start and end times to 0 when using --truncate and start > end. · 3e11d04c
  Alejandro Sanchez authored 7 years ago
  
  Otherwise we can end up printing Start times greater than End times, leading to confusion when reading sacct output. 0 is displayed as Unknown. Cosmetic change. Bug 3940.
  3e11d04c
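
The zeroing rule from 3e11d04c above can be illustrated with the sketch below. It assumes that truncation clamps the start and end times to the reporting window, which is how an inverted pair could arise. The helper name `truncate_times` and the clamping step are assumptions, not taken from the sacct source.

```c
#include <assert.h>
#include <time.h>

/*
 * Clamp [*start, *end] to the window [win_start, win_end] (assumed
 * truncation behavior), then zero both if the result is inverted.
 * A value of 0 is what gets displayed as Unknown.
 */
static void truncate_times(time_t win_start, time_t win_end,
			   time_t *start, time_t *end)
{
	assert(start && end);
	if (*start < win_start)
		*start = win_start;
	if (*end > win_end)
		*end = win_end;
	if (*start > *end) {
		*start = 0;
		*end = 0;
	}
}
```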
Jun 30, 2017

Burst buffer size unit changes · 7e161809

Alejandro Sanchez authored 7 years ago

burst_buffer logic modified to support sizes in both SI and IEC size units
    (e.g. M/MiB for powers of 1024, MB for powers of 1000).
Bug 3922

7e161809
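
The unit convention from 7e161809 (M/MiB for powers of 1024, MB for powers of 1000) can be sketched as a small parser. The function name `parse_size`, the accepted prefixes and the error handling are all hypothetical, and the burst_buffer plugin's own parser may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert strings such as "4M", "4MiB" or "4MB" to a byte count.
 * A bare prefix or an "iB" suffix selects powers of 1024; a "B" suffix
 * selects powers of 1000. Returns 0 on malformed input (so a literal
 * zero size is indistinguishable from an error in this sketch).
 * No overflow checking is done.
 */
static uint64_t parse_size(const char *str)
{
	static const char prefixes[] = "KMGTP";
	char *end = NULL;
	const char *p;
	uint64_t value, base, mult = 1;
	int exp, i;

	assert(str);
	value = strtoull(str, &end, 10);
	if (end == str)
		return 0;		/* no digits at all */
	if (*end == '\0')
		return value;		/* plain byte count */

	p = strchr(prefixes, *end);
	if (!p)
		return 0;		/* unknown prefix */
	exp = (int)(p - prefixes) + 1;	/* K=1, M=2, G=3, ... */
	end++;

	if (*end == '\0' || !strcmp(end, "iB"))
		base = 1024;
	else if (!strcmp(end, "B"))
		base = 1000;
	else
		return 0;		/* unknown suffix */

	for (i = 0; i < exp; i++)
		mult *= base;
	return value * mult;
}
```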

Jun 13, 2017

Add LaunchParameters option of cray_net_exclusive. · 23721c4c

Tim Wickberg authored 7 years ago

Changes the alpsc_configure_nic() call to set the exclusive flag,
and 100 for both the cpu and memory scaling values.

Should only be used with exclusive jobs without concurrent steps
running on a node, otherwise oversubscription of the GNI resources
can occur leading to performance issues.

Bug 3713.

23721c4c

Jun 12, 2017
- Fix bug in task/affinity that could result in slurmd fatal error · 6dd2be3b
  Morris Jette authored 7 years ago
  
  An array was only being partially cleared due to bad logic. Bug 3876
  6dd2be3b
- Only set kmem cgroup limit if ConstrainKmemSpace=yes · ba32ac48
  Tim Wickberg authored 7 years ago
  
  Bug 3874.
  ba32ac48
Jun 09, 2017
- Correction to commit 47b5fe60 to eliminate memory leak · a54455c4
  Morris Jette authored 7 years ago
  
  a54455c4
Jun 08, 2017

Improve preempted job selection logic · 47b5fe60

Dominik Bartkiewicz authored 7 years ago

Improve selection of jobs to preempt when there are multiple partitions
    with jobs subject to preemption.
Bug 3824

47b5fe60

Jun 02, 2017
- Fix regression from commit 3e8aa451 (wrong list given in · 59a820ad
  Dominik Bartkiewicz authored 7 years ago
  
  list_for_each)
  59a820ad
May 31, 2017
- Prevent segfault in sacctmgr due to bad handling of return code. · 15276c01
  Tim Shaw authored 7 years ago
  
  Bug 3840.
  15276c01
May 30, 2017
- don't clear GRES from non-KNL node · 56ea068c
  Tim Shaw authored 7 years ago
  
  node_features/knl_cray plugin: Don't clear configured GRES from non-KNL node. Bug 3768
  56ea068c
- Reset variables if no qos exist. Continuation of 8129acfe and others · 757a3169
  Danny Auble authored 7 years ago
  
  757a3169
- Fix correct checking for commit 8129acfe · 50fffb31
  Danny Auble authored 7 years ago
  
  50fffb31
- Avoid need for lock from previous commit. · 8129acfe
  Danny Auble authored 7 years ago
  
  8129acfe
- Use job_state_qos_grp_limit for a fuller result from previous commits · f8ca7493
  Danny Auble authored 7 years ago
  
  f8ca7493
May 26, 2017
- Backfill partitions that use QOS Grp limits to "float" better. · 3e8aa451
  Dominik Bartkiewicz authored 7 years ago
  
  Initial fix for handling floating partitions that use qos grp limits. Bug 3776
  3e8aa451