Commits · 6a2c08edbc2604a80aae40ad7fcef4cc1984d0a6 · tud-zih-energy / Slurm

Dec 14, 2013
- Don't run the init on the DBD. Addition to commit 226b49a3 · 6a2c08ed
  Danny Auble authored 11 years ago
  
  6a2c08ed
- Make test more robust · 28dea960
  Morris Jette authored 11 years ago
  
  Test would periodically fail due to expect timing. This seems to fix the problem.
  28dea960
- Fix minor memory leak · b4015f90
  Danny Auble authored 11 years ago
  
  b4015f90
Dec 13, 2013

Fix erroneous error messages when running gang scheduling. · 206dc223
Danny Auble authored 11 years ago

206dc223
Add typecast to prevent build warning · bc441bef
Morris Jette authored 11 years ago

bc441bef

Fix slurmstepd race condition causing abort · be703c47

Morris Jette authored 11 years ago

Fix slurmstepd race condition when separate threads are reading and
modifying the job's environment, which can result in the slurmstepd failing
with an invalid memory reference. Observed at shutdown when trying
to run the task epilog and trying to read the env var:
SLURM_STEP_KILLED_MSG_NODE_ID

be703c47

Avoid test generating core file on segv · 6a5a6a9b

Morris Jette authored 11 years ago

We do not want to look at the core file, so avoid generating it
and then having to manually clear it later.

6a5a6a9b

Dec 12, 2013

Prevent sstat abort with jobacct_gather/none · 226b49a3

Morris Jette authored 11 years ago

Without this change, sstat would try to unpack accounting data
that was never packed, resulting in message unpack errors.

226b49a3

Major test re-write · da93d677

Morris Jette authored 11 years ago

There were some parsing issues and the test was not as general
as it should have been.

da93d677

Fix typos · 6516b9c0
Danny Auble authored 11 years ago

6516b9c0
Fix regression in 06b41cdc that would throw away initialized variable. · 76c031e9
Danny Auble authored 11 years ago

76c031e9

slurmstepd variable initialization · 06b41cdc

Morris Jette authored 11 years ago

Without this patch, free() is called on a random memory location
(i.e. whatever is on the stack), which can result in slurmstepd
dying and a completed job not being purged in a timely fashion.

06b41cdc
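
The failure mode described in 06b41cdc (calling free() on an uninitialized stack pointer) can be illustrated with a minimal sketch. This is not the slurmstepd code; `copy_or_fail` and its parameters are hypothetical names chosen for the example. The point is only that a pointer initialized to NULL makes every path into the cleanup label safe, since free(NULL) is defined to do nothing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from Slurm): duplicates src into *result.
 * Returns 0 on success, -1 on failure. */
int copy_or_fail(const char *src, char **result)
{
    char *buf = NULL;   /* initialized, so free(buf) below is always safe */
    int rc = -1;

    assert(result != NULL);
    *result = NULL;

    if (src == NULL)
        goto cleanup;   /* early exit before buf is ever allocated */

    buf = malloc(strlen(src) + 1);
    if (buf == NULL)
        goto cleanup;
    strcpy(buf, src);

    *result = buf;      /* ownership moves to the caller */
    buf = NULL;         /* so the cleanup below does not free it */
    rc = 0;

cleanup:
    free(buf);          /* no-op when buf is NULL */
    return rc;
}
```

Had `buf` been left uninitialized, the early `goto cleanup` would pass whatever happened to be on the stack to free().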

Dec 11, 2013
- Expand slurm upgrade notes · 40628497
  Morris Jette authored 11 years ago
  
  40628497
- HDF5 - Fix minor memory leak. · 960dd5cd
  Danny Auble authored 11 years ago
  
  960dd5cd
- Fix race condition in auth cred create · 925bcfb6
  Morris Jette authored 11 years ago
  
  Fix race condition in authentication credential creation that could corrupt memory. (NOTE: This race condition has existed since 2003 and would be exceedingly rare.)
  925bcfb6
Dec 10, 2013
- add gres.conf optimization advice · 856949e5
  Morris Jette authored 11 years ago
  
  856949e5
- Document how slurmctld -v option is preserved · f28b5d9f
  Morris Jette authored 11 years ago
  
  f28b5d9f
- Expand upgrade notes · e58d28ed
  Morris Jette authored 11 years ago
  
  e58d28ed
Dec 09, 2013

Modify squeue to support longer job ID values · 17f27007
Morris Jette authored 11 years ago
```
This is needed for job arrays with discontiguous task ID values
(e.g. "123_[1,3,5,...99999]")
```
17f27007

Improve sview support for job arrays · d998640f

Morris Jette authored 11 years ago

Previously job arrays were only listed with their native job ID
(e.g. 123_0 listed as 123, 123_1 as 124, etc). Now lists the job ID
using both formats (e.g. "123_1 (124)"). The same format is used
for job step IDs (e.g. "123_1.2 (124.2)").

d998640f
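
The display format described in d998640f can be sketched as two small formatting helpers. This is an illustration only, not sview's actual code; the function names and parameters are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a job array element as
 * "<array_job_id>_<task_id> (<job_id>)". */
void format_array_job_id(char *buf, size_t len, uint32_t array_job_id,
                         uint32_t task_id, uint32_t job_id)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%" PRIu32 "_%" PRIu32 " (%" PRIu32 ")",
             array_job_id, task_id, job_id);
}

/* Hypothetical helper: the same format extended with a step ID,
 * "<array_job_id>_<task_id>.<step_id> (<job_id>.<step_id>)". */
void format_array_step_id(char *buf, size_t len, uint32_t array_job_id,
                          uint32_t task_id, uint32_t job_id,
                          uint32_t step_id)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len,
             "%" PRIu32 "_%" PRIu32 ".%" PRIu32 " (%" PRIu32 ".%" PRIu32 ")",
             array_job_id, task_id, step_id, job_id, step_id);
}
```

With the values from the commit message, array job 123, task 1, native job ID 124 and step 2 yield the two strings quoted there.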

Dec 08, 2013

Describe previous commit in NEWS · b19bd476
jette authored 11 years ago

b19bd476

Fix for dynamic changes to GRES · b9fe6815

jette authored 11 years ago

If the GRES is associated with specific files AND
the GRES count is reset using scontrol AND
the slurmd is restarted either without a gres.conf file or with a count and no specific files AND
the GRES count is then increased using scontrol, then the GRES bitmap will not match its count.

This fixes the root cause of the mismatch between bitmap size and GRES
count and should render the rebuilding of the bitmap unnecessary.
The rebuilding was handled in the following commits:
commit ec4df3bf
commit 1712d619

b9fe6815

Dec 07, 2013
- Fix srun hang when IO fails to start at launch. · f98413b1
  Danny Auble authored 11 years ago
  
  f98413b1
- Handle in the API if parts of the node structure are NULL. · 5d3b1d4e
  Philip D. Eckert authored 11 years ago
  
  5d3b1d4e
Dec 06, 2013

Improve hwloc support for various processors · ac5d734b

Jason Bacon authored 11 years ago

Using CPU: Intel(R) Pentium(R) 4 CPU 2.40GHz (2392.04-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0xf27 Family = f Model = 2 Stepping = 7
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>

It's also using an older version of hwloc (1.3.1) and I have not yet tested it with a newer one, but since 0 and -1 are legitimate return values for hwloc_get_nbobjs_by_type(), I think they should be handled in any case.

From the hwloc_get_nbobjs_by_type() man page:

static inline int hwloc_get_nbobjs_by_type (hwloc_topology_t topology,
hwloc_obj_type_t type) [static]
Returns the width of level type type. If no object for that type
exists, 0 is returned. If there are several levels with objects of that
type, -1 is returned.

I'm attaching a smarter patch that handles both 0 and -1 return values for both CORE and SOCKET. It logs a warning if it has to fudge a 0 return code and bails out with a helpful error message for -1, which I have no idea how to handle. At least people won't have to waste time tracking down the problem this way.

Happy Friday,

Jason

ac5d734b
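
The handling described in ac5d734b can be sketched without linking against hwloc. The helper below is hypothetical, not the patch itself: it takes an integer that follows the return convention quoted from the hwloc_get_nbobjs_by_type() man page (0 when no object of the type exists, -1 when several levels contain it). Treating a 0 result as a count of 1 is an assumption made for this example; the commit message only says the value is "fudged".

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not the actual patch).
 * nbobjs:     value returned by hwloc_get_nbobjs_by_type()
 * level_name: label used in messages, e.g. "core" or "socket"
 * count:      receives the usable object count on success
 * Returns 0 on success, -1 if the count cannot be determined. */
int checked_level_count(int nbobjs, const char *level_name, int *count)
{
    assert(level_name != NULL && count != NULL);

    if (nbobjs < 0) {
        /* -1: several levels hold objects of this type */
        fprintf(stderr,
                "error: cannot determine %s count (hwloc returned %d)\n",
                level_name, nbobjs);
        return -1;
    }
    if (nbobjs == 0) {
        /* Assumption: fall back to one object and warn about it */
        fprintf(stderr, "warning: hwloc reported no %s objects, assuming 1\n",
                level_name);
        *count = 1;
        return 0;
    }
    *count = nbobjs;
    return 0;
}
```

A caller would pass the result of hwloc_get_nbobjs_by_type() for the CORE and SOCKET object types through this check before using the counts.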

Added ApbasilTimeout parameter to the cray.conf · 270f696e
Trofinoff Stephen authored 11 years ago
```
This adds a mechanism to kill a hung apbasil command
```
270f696e
Fix bad print · 1712d619
Morris Jette authored 11 years ago
```
error introduced in commit ec4df3bf
```
1712d619
Fix for hwloc returning zero core count · ec4df3bf
Jason Bacon authored 11 years ago

ec4df3bf

Fix for gres count change · 4e56260f

Morris Jette authored 11 years ago

An abort has been reported if the node's gres count differs from
its bitmap. This has been induced by changing the count manually
(e.g. "scontrol update nodename=tux123 gres=gpu:4"). I have not
been able to reproduce this problem, but this will resize the
bitmap in order to avoid the assert failure.

4e56260f

Dec 05, 2013
- Remove unneeded line · bb45f022
  Danny Auble authored 11 years ago
  
  bb45f022
- Remove notice of CVE with very old/deprecated versions of Slurm in news.html. · 723fb063
  Danny Auble authored 11 years ago
  
  723fb063
Dec 04, 2013
- Add web link to legal notices · 9041c45c
  Morris Jette authored 11 years ago
  
  9041c45c
- jobcomp/filetxt - Reopen the file on SIGHUP · fd7c9360
  Morris Jette authored 11 years ago
  
  Previous logic never reopened the file, preventing proper functioning of logrotate.
  fd7c9360
Dec 03, 2013

Improve REQUEST_JOB_INFO_SINGLE RPC performance · 80d3b343
Morris Jette authored 11 years ago
```
Use hash function to locate job records for improved performance.
```
80d3b343
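
The idea behind 80d3b343 (locating a job record through a hash rather than scanning every record) can be sketched as a small chained hash table keyed by job ID. This is an illustration under assumptions, not slurmctld's data structure; `struct job_rec`, `JOB_HASH_SIZE` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOB_HASH_SIZE 64            /* hypothetical table size */

struct job_rec {                    /* hypothetical, minimal job record */
    uint32_t job_id;
    struct job_rec *hash_next;      /* next record in the same bucket */
};

static struct job_rec *job_hash[JOB_HASH_SIZE];

/* Insert a record at the head of its bucket. */
void job_hash_add(struct job_rec *job)
{
    assert(job != NULL);
    size_t inx = job->job_id % JOB_HASH_SIZE;
    job->hash_next = job_hash[inx];
    job_hash[inx] = job;
}

/* Return the record with the given ID, or NULL if it is not present.
 * Only the records in one bucket are examined. */
struct job_rec *job_hash_find(uint32_t job_id)
{
    struct job_rec *job = job_hash[job_id % JOB_HASH_SIZE];

    while (job != NULL && job->job_id != job_id)
        job = job->hash_next;
    return job;
}
```

A single-job query then touches one bucket instead of walking the whole job list, which is the kind of saving the commit message describes.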

Improve REQUEST_JOB_INFO_SINGLE RPC performance · 14bcfe58

Morris Jette authored 11 years ago

Change partition write lock to a read lock as we use a different
mechanism for hidden partitions in getting individual jobs.

14bcfe58

Add SC13 BOF presentations · d042ce0d
Morris Jette authored 11 years ago

d042ce0d

Correct job dependency string · 08265c03

Morris Jette authored 11 years ago

Correct logic returning remaining job dependencies in job information
reported by scontrol and squeue. Eliminates vestigial descriptors with
no job ID values (e.g. "afterany"). As dependencies are removed, the
job ID values were removed from the strings, but not the descriptors.
This eliminates both. It also checks the full job ID to make sure we do
not remove "afterany:1234" when job "123" completes.

08265c03
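
The behavior described in 08265c03 can be sketched as follows. This is not the slurmctld implementation; `remove_dependency` is a hypothetical helper, and the input format (comma-separated terms of the form `descriptor:id[:id...]`) is an assumption based on the example strings in the commit message. It removes a completed job ID by comparing whole IDs, and drops any descriptor left with no IDs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes into out a copy of deps with job_id removed.
 * IDs are compared in full, so "123" does not match "1234". */
void remove_dependency(const char *deps, const char *job_id,
                       char *out, size_t out_len)
{
    assert(deps != NULL && job_id != NULL && out != NULL && out_len > 0);
    size_t id_len = strlen(job_id);
    const char *term = deps;

    out[0] = '\0';
    while (*term != '\0') {
        const char *term_end = term + strcspn(term, ",");
        size_t desc_len = strcspn(term, ":,");   /* e.g. "afterany" */
        const char *id = term + desc_len;        /* at ':' or at term_end */
        char kept[256] = "";                     /* surviving ":id" pieces */

        while (id < term_end) {
            id++;                                /* skip the ':' */
            size_t len = strcspn(id, ":,");
            int match = (len == id_len) && (strncmp(id, job_id, len) == 0);
            if (len > 0 && !match) {
                size_t used = strlen(kept);
                snprintf(kept + used, sizeof(kept) - used,
                         ":%.*s", (int)len, id);
            }
            id += len;
        }
        if (kept[0] != '\0') {                   /* skip ID-less descriptors */
            size_t used = strlen(out);
            snprintf(out + used, out_len - used, "%s%.*s%s",
                     used ? "," : "", (int)desc_len, term, kept);
        }
        term = (*term_end == ',') ? term_end + 1 : term_end;
    }
}
```

For example, removing job "123" from "afterany:123:1234,afterok:123" leaves "afterany:1234": the longer ID 1234 survives, and the emptied "afterok" descriptor is dropped.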

Dec 02, 2013
- fix race condition in batch exit code · 6d1d932b
  Morris Jette authored 11 years ago
  
  Fix race condition on batch job termination that could result in a job exit code of 0xfffffffe if the slurmd on node zero registers its active jobs at the same time that slurmstepd is recording the job's exit code. bug 535
  6d1d932b
- Remove trailing spaces from docs · 1526643c
  Morris Jette authored 11 years ago
  
  1526643c
- Fixed sh5util loop when there are no node-step files. · fff922e9
  David Bigagli authored 11 years ago
  
  fff922e9