- Sep 09, 2017
Tim Wickberg authored
-
- Sep 08, 2017
Tim Wickberg authored
Since ReleaseAgent is no longer required, we can strip out all the supporting logic for it.
-
Morris Jette authored
Accidentally removed bracket in checked-in code
-
Morris Jette authored
-
- Sep 07, 2017
Dominik Bartkiewicz authored
Bug 3824
-
Morris Jette authored
Do not run the Node Health Check on termination of the external step, since that termination happens when the job allocation ends and the job's NHC will be executed then anyway. Bug 4074
-
Danny Auble authored
-
- Sep 05, 2017
Morris Jette authored
Reported by Clang
-
- Sep 01, 2017
Morris Jette authored
Prevent a heterogeneous job allocation from including the same nodes in multiple components (required by MPI jobs spanning components).
-
Tim Wickberg authored
A lot of slurmdbd operations are authenticated by the accounting_storage plugin rather than in slurmdbd itself. To allow the drop_priv flag to work, it must be checked in is_user_min_admin_level() in addition to the various functions in proc_req.c.
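Roughly how such a check might look; this is a sketch only, with an assumed connection struct, invented helper, and simplified signature rather than the actual Slurm code:

/*
 * Illustrative sketch, not the real patch: an authentication helper used by
 * the accounting_storage plugin honors a drop_priv flag carried on the
 * connection, so a privileged caller can be demoted for testing.
 */
#include <stdbool.h>
#include <sys/types.h>

typedef enum { ADMIN_NONE, ADMIN_OPERATOR, ADMIN_SUPER_USER } admin_level_t;

struct db_conn {
	bool drop_priv;		/* request that privileges be dropped (assumed field) */
	/* ... */
};

static admin_level_t _lookup_admin_level(struct db_conn *conn, uid_t uid)
{
	(void)conn;
	(void)uid;
	return ADMIN_SUPER_USER;	/* placeholder for the real user lookup */
}

bool is_user_min_admin_level(struct db_conn *conn, uid_t uid,
			     admin_level_t min_level)
{
	/* Honor drop_priv here as well as in the slurmdbd RPC handlers,
	 * since many operations are only authenticated in this plugin. */
	if (conn && conn->drop_priv)
		return false;

	return _lookup_admin_level(conn, uid) >= min_level;
}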
-
Tim Wickberg authored
This reverts commit 47dad9e8.
-
Tim Wickberg authored
On second thought, we should be using these or quite similar functions. This reverts commit a44ef130.
-
- Aug 31, 2017
Tim Wickberg authored
A lot of slurmdbd operations are authenticated by the accounting_storage plugin rather than in slurmdbd itself. To allow the drop_priv flag to work, it must be checked in is_user_min_admin_level() in addition to the various functions in proc_req.c.
-
Tim Wickberg authored
-
Artem Polyakov authored
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
-
Artem Polyakov authored
Add a new performance debugging feature to the pmix plugin that allows measuring plain collective performance on selected message sizes. This functionality is similar to a ping-pong feature. Performance results for the existing point-to-point modes:

size     sapi          dtcp          ducx
1        0.002319442   0.000492334   0.000135243
2        0.002318223   0.000453552   0.000137356
4        0.00227046    0.00045832    0.000137117
8        0.002342675   0.000463539   0.000136455
16       0.00235131    0.000481208   0.00013619
32       0.002333058   0.000562986   0.000140756
64       0.002456691   0.000883791   0.000142574
128      0.002953556   0.001326429   0.000142336
256      0.003892236   0.002324766   0.000161224
512      0.006044123   0.004371988   0.000177675
1024     0.010324001   0.008485476   0.000224325
2048     0.018556118   0.016488896   0.000347243
4096     0.035331223   0.032744778   0.000481764
8192     0.06957123    0.065519465   0.001194106
16384    0.137925333   0.130130662   0.002544668
32768    0.272100422   0.259290563   0.009916888
65536    0.543431362   0.486692217   0.012841119

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
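A hedged sketch of the measurement idea only; the real code lives in the pmix plugin and drives its own transports (sapi/dtcp/ducx), while everything below is a stand-in: loop over message sizes, time a number of collective rounds, and report the average latency, much like a ping-pong benchmark.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* stand-in for one collective exchange of `size` bytes over a given mode */
static void run_collective(const char *mode, void *buf, size_t size)
{
	(void)mode;
	(void)buf;
	(void)size;	/* transport-specific work happens here */
}

static double now_sec(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	const int rounds = 100;
	const char *mode = "dtcp";	/* one of the modes in the table above */

	for (size_t size = 1; size <= 65536; size *= 2) {
		void *buf = malloc(size);
		double t0 = now_sec();

		for (int i = 0; i < rounds; i++)
			run_collective(mode, buf, size);

		/* print "size  average seconds per collective" */
		printf("%zu %.9f\n", size, (now_sec() - t0) / rounds);
		free(buf);
	}
	return 0;
}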
-
Artem Polyakov authored
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
-
Artem Polyakov authored
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
-
Artem Polyakov authored
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
-
Artem Polyakov authored
There were segmentation faults because of a double free of a pending list when the UCX component was trying to connect multiple times.
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
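A minimal sketch of the failure mode and the usual remedy, not the actual pmix/UCX plugin code: if a connect routine can be retried, the pending list must not be freed twice, so free and NULL the pointer in one place.

#include <stdlib.h>

struct ucx_ep {
	void **pending;		/* queue of deferred sends (illustrative) */
	size_t pending_cnt;
};

static void _free_pending(struct ucx_ep *ep)
{
	if (!ep->pending)	/* already released by an earlier attempt */
		return;
	free(ep->pending);
	ep->pending = NULL;	/* makes a second call harmless */
	ep->pending_cnt = 0;
}

int ucx_connect(struct ucx_ep *ep)
{
	/* ... establish the endpoint, flush deferred sends ... */
	_free_pending(ep);	/* safe even if connect is invoked again */
	return 0;
}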
-
- Aug 30, 2017
David Gloe authored
Statically linked Cray PMI applications still expect to use some file paths containing the old SLURM_ID_HASH format. Some Cray customers have certification requirements that make recompilation difficult. The attached patch defines a macro to convert the new SLURM_ID_HASH to the old format, and writes the files and symlinks necessary for statically linked Cray PMI applications to work. Bug 4114
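A sketch of the compatibility idea only; the exact bit layouts below are assumptions for illustration, not taken from the patch. Suppose the new hash packs the step ID into the upper 32 bits of a 64-bit value, while the old Cray PMI format encoded it as stepid * 10^10 + jobid; a conversion macro then lets statically linked applications keep using paths built from the old format.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* assumed new-style layout: step ID in the upper 32 bits */
#define ID_HASH(jobid, stepid) \
	(((uint64_t)(stepid) << 32) | (uint64_t)(jobid))

/* re-encode a new-style hash in the assumed legacy decimal layout */
#define ID_HASH_LEGACY(hash) \
	(((hash) >> 32) * 10000000000ULL + ((hash) & 0xffffffffULL))

int main(void)
{
	uint64_t h = ID_HASH(1234, 7);
	printf("new=%" PRIu64 " legacy=%" PRIu64 "\n", h, ID_HASH_LEGACY(h));
	return 0;
}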
-
- Aug 29, 2017
Brian Christiansen authored
Bug 4090
-
Danny Auble authored
-
Morris Jette authored
This applies to job steps for MPI. Even if there is more than one pack job component in a single MPI_COMM_WORLD, they will share a common SLURM_JOBID.
-
Brian Christiansen authored
as reported when compiling with optimizations (-O2).
-
Brian Christiansen authored
as reported when compiling with optimizations (-O2). Initialize the variables early since they were being initialized inside their loops and later checked for -1. Technically this couldn't have happened since, for example, user_part_inx1 would only be set to -1 if max_backfill_job_per_user_part was set, and user_part_inx is only checked later if max_backfill_job_per_user_part is set. The same applies to part_inx with max_backfill_job_per_part and user_inx with max_backfill_job_per_user.
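A generic illustration of the pattern (simplified names, not the backfill scheduler itself): a variable assigned only inside a conditional loop trips -Wmaybe-uninitialized at -O2 even when later uses are guarded by the same condition, and initializing it at declaration silences the false positive.

#include <stdio.h>

int find_index(const int *table, int len, int limit)
{
	int idx = -1;			/* initialize early, not inside the loop */

	if (limit) {			/* only scan when a limit is configured */
		for (int i = 0; i < len; i++) {
			if (table[i] > limit) {
				idx = i;
				break;
			}
		}
	}

	/* idx is only meaningful when limit was set, mirroring how the
	 * backfill indexes are only checked when their limit is configured */
	if (limit && idx != -1)
		printf("first entry over limit at %d\n", idx);

	return idx;
}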
-
Brian Christiansen authored
reported when compiling with optimizations (-O2). The compiler ignores the (void) cast and reports the error.
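The commit does not name the call site; this generic example shows the class of warning being described, assuming a warn_unused_result function such as write() under -O2 with fortification, where a (void) cast is not enough and the result has to be consumed instead.

#include <unistd.h>

void log_bytes(int fd, const void *buf, size_t len)
{
	/* (void)write(fd, buf, len);   -- the cast alone can still warn */
	ssize_t rc = write(fd, buf, len);	/* consume the result instead */
	if (rc < 0) {
		/* best-effort logging: nothing more to do */
	}
}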
-
Brian Christiansen authored
reported when compiling with optimizations (-O2). field_id may be uninitialized or hold the value from the previous iteration of the while loop. The only possible values of dataset_loc->type are:

typedef enum {
	PROFILE_FIELD_NOT_SET,
	PROFILE_FIELD_UINT64,
	PROFILE_FIELD_DOUBLE
} acct_gather_profile_field_type_t;

and the while loop condition ensures that PROFILE_FIELD_NOT_SET is never handled. So instead of handling PROFILE_FIELD_NOT_SET directly, just catch everything else with the "default" case statement and continue the loop.
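A sketch of that pattern with simplified surroundings (the loop body and type codes are stand-ins): the loop only ever sees PROFILE_FIELD_UINT64 or PROFILE_FIELD_DOUBLE, so the remaining enum value is folded into a "default" branch that skips the iteration, which also guarantees field_id is set before use.

typedef enum {
	PROFILE_FIELD_NOT_SET,
	PROFILE_FIELD_UINT64,
	PROFILE_FIELD_DOUBLE
} acct_gather_profile_field_type_t;

struct dataset {
	acct_gather_profile_field_type_t type;
};

void pack_datasets(const struct dataset *sets, int count)
{
	for (int i = 0; i < count && sets[i].type != PROFILE_FIELD_NOT_SET; i++) {
		int field_id;	/* assigned on every path below */

		switch (sets[i].type) {
		case PROFILE_FIELD_UINT64:
			field_id = 1;	/* illustrative type codes */
			break;
		case PROFILE_FIELD_DOUBLE:
			field_id = 2;
			break;
		default:		/* PROFILE_FIELD_NOT_SET: nothing to pack */
			continue;
		}

		(void)field_id;		/* ... pack the dataset using field_id ... */
	}
}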
-
Brian Christiansen authored
-
Danny Auble authored
relies on the primary to do so. There is a potential race condition if the backup DBD tries to create/check the database at the same time as the primary. This patch removes this race by not allowing the backup to do the check/create. Bug 3827
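A minimal sketch of the guard with invented names, not the actual slurmdbd code: only the primary performs the database create/check at startup, so a backup can never race the primary on the same schema.

#include <stdbool.h>
#include <stdio.h>

static bool backup_mode = true;		/* set from configuration in reality */

static int create_or_check_database(void)
{
	/* connect to the database, create/verify tables, etc. */
	return 0;
}

static int dbd_startup(void)
{
	if (backup_mode) {
		printf("backup: skipping database check, primary owns it\n");
		return 0;
	}
	return create_or_check_database();
}

int main(void)
{
	return dbd_startup();
}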
-
Morris Jette authored
Coverity 44943, 44944
-
Morris Jette authored
Coverity CID 44941, 44942
-
Morris Jette authored
This sets more per-pack-job environment variables for launched steps. All of the following are used by Open MPI: SLURM_CPUS_PER_TASK, SLURM_STEP_NUM_TASKS, SLURM_TASKS_PER_NODE. A few more env vars are still needed by OpenMPI.
-
- Aug 25, 2017
Morris Jette authored
These are required by OpenMPI
-
Morris Jette authored
Coverity CID 44723
-
- Aug 24, 2017
Alejandro Sanchez authored
Testing whether curl_handle != NULL or rc != SLURM_SUCCESS was already done in the if/else statements directly above, jumping to the corresponding goto cleanup label if needed. Thus the removed test could never evaluate to true, and Coverity properly warned about this. Regression introduced in commit 5f5e6472 (code cleanup).
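The shape of the dead code Coverity flagged, with stand-in names and no real libcurl calls: once both failure paths above jump to cleanup, a repeated test of the same conditions further down can never be true.

#define SUCCESS 0
#define ERROR  (-1)

static void *fake_init(void)		/* stands in for curl_easy_init() */
{
	static int handle;
	return &handle;
}

static int perform_transfer(void *handle)	/* stands in for the real transfer */
{
	(void)handle;
	return SUCCESS;
}

int send_record(void)
{
	void *curl_handle = NULL;
	int rc = SUCCESS;

	curl_handle = fake_init();
	if (curl_handle == NULL) {
		rc = ERROR;
		goto cleanup;		/* failure already routed to cleanup */
	} else {
		rc = perform_transfer(curl_handle);
		if (rc != SUCCESS)
			goto cleanup;	/* ...as is this one */
	}

	/*
	 * A further "if (!curl_handle || rc != SUCCESS)" test here is dead:
	 * both conditions were handled above, so removing it is safe.
	 */

cleanup:
	/* release curl_handle, free buffers, etc. */
	return rc;
}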
-
Morris Jette authored
-
- Aug 23, 2017
Alejandro Sanchez authored
-
Alejandro Sanchez authored
Running slurmctld under valgrind while operating with jobcomp/elasticsearch reported the following bytes definitely lost:

==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
==27403==    by 0x22856A: makespace (xstring.c:114)
==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
==27403==    by 0x83C5BCD: ???
==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)

It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to the corresponding job_node->serialized_job, but the originally generated buffer wasn't freed afterwards. The fix consists in changing the transfer so that instead of xstrdup'ing the char * we just assign the pointer and NULL the buffer. The job_node->serialized_job was already xfree'd properly later when the job was indexed. Discovered while working on Bug 4065.
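A sketch of the ownership transfer described above, using plain malloc/free and simplified names in place of Slurm's xmalloc wrappers: instead of duplicating the buffer (and leaking the original), hand the pointer to the job node and NULL the source so exactly one owner frees it later.

#include <stdlib.h>
#include <string.h>

struct job_node {
	char *serialized_job;
};

void save_record(struct job_node *node, char **buffer)
{
	/* leaky version:
	 *   node->serialized_job = strdup(*buffer);   // *buffer never freed
	 */
	node->serialized_job = *buffer;	/* take ownership of the buffer */
	*buffer = NULL;			/* caller must not free or reuse it */
}

void index_job(struct job_node *node)
{
	/* the single owner releases it once the job has been indexed */
	free(node->serialized_job);
	node->serialized_job = NULL;
}

int main(void)
{
	char *buffer = malloc(16);
	struct job_node node = { NULL };

	strcpy(buffer, "serialized job");
	save_record(&node, &buffer);	/* buffer is now NULL, node owns it */
	index_job(&node);
	return 0;
}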
-
Tim Wickberg authored
This should only happen due to ESLURM_RESULT_TOO_LARGE, which leads to no list being packed. Follow-on to 390da8cf / 8cf1835c. Bug 3624.
-