- Apr 07, 2011
- Moe Jette authored
  errors after slurmctld restarts.
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Danny Auble authored
- Danny Auble authored
  Fix so slurmctld will correctly pack 2.1 step information. (Only needed if a 2.1 client is talking to a 2.2 slurmctld.)
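  As an illustration of the entry above, here is a minimal, self-contained sketch of version-conditional packing: a field that exists only in the newer wire format is omitted when the requesting client speaks the 2.1 protocol. The buffer helper, struct fields, and version constants are hypothetical stand-ins, not the actual SLURM pack API.

    /* Hypothetical sketch of protocol-versioned packing; not SLURM source. */
    #include <stdint.h>
    #include <stdio.h>

    enum { PROTO_2_1 = 21, PROTO_2_2 = 22 };     /* illustrative version tags */

    struct buf { unsigned char data[256]; size_t len; };

    static void pack32(uint32_t v, struct buf *b)
    {
        /* big-endian, as wire protocols typically require */
        b->data[b->len++] = (v >> 24) & 0xff;
        b->data[b->len++] = (v >> 16) & 0xff;
        b->data[b->len++] = (v >> 8)  & 0xff;
        b->data[b->len++] =  v        & 0xff;
    }

    struct step_info {          /* trimmed-down, illustrative step record */
        uint32_t job_id;
        uint32_t step_id;
        uint32_t num_tasks;
        uint32_t time_limit;    /* assume this field is new in 2.2 */
    };

    /* Pack a step record in the layout the requesting client understands. */
    static void pack_step_info(const struct step_info *s, struct buf *b,
                               int client_version)
    {
        pack32(s->job_id, b);
        pack32(s->step_id, b);
        pack32(s->num_tasks, b);
        if (client_version >= PROTO_2_2)
            pack32(s->time_limit, b);    /* omitted for 2.1 clients */
    }

    int main(void)
    {
        struct step_info s = { 1234, 0, 16, 60 };
        struct buf old_buf = { .len = 0 }, new_buf = { .len = 0 };

        pack_step_info(&s, &old_buf, PROTO_2_1);
        pack_step_info(&s, &new_buf, PROTO_2_2);
        printf("2.1 layout: %zu bytes, 2.2 layout: %zu bytes\n",
               old_buf.len, new_buf.len);
        return 0;
    }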
- Danny Auble authored
- Apr 06, 2011
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Don Lipari authored
- Don Lipari authored
- Apr 05, 2011
- Moe Jette authored
  resources (gres).
- Moe Jette authored
- Moe Jette authored
  scheduling which prevented some jobs from starting as soon as possible.
- Danny Auble authored
- Danny Auble authored
- Moe Jette authored
- Moe Jette authored
  improve wording on an option.
- Moe Jette authored
- Moe Jette authored
- Apr 04, 2011
- Danny Auble authored
- Moe Jette authored
  select/cons_res plugin. Patch from Rod Schulz, Bull.
- Danny Auble authored
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Apr 03, 2011
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
  options and not really tested, but this is a good start.
- Moe Jette authored
  This avoids a warning message that is repeated on each reconfiguration of slurm and is due to a dangling group configuration in LDAP entries. The error occurs when traversing the secondary group members of a given group name, when trying to add these to a configured group. If these secondary group members have no valid login (e.g. disabled via LDAP configuration), the error is repeated on each reconfigure of slurm.

  The error is harmless: since the users have no valid login, they can not log into the system anyway. I have raised the issue described below with our LDAP admin; there was no reply (likely because it was not considered important enough). Since slurm is not a tool to debug the work of system administrators, and since the secondary group members can not log in anyway, this patch replaces the error message with a comment. It leaves untouched the positive case of secondary group members that are successfully added to a configured group because they have a valid passwd/LDAP login entry.

  Here is the case which gets repeated on our system, showing that each error message corresponds to a 'no such user' error when trying to look up the user id:

  [2011-03-29T08:19:35] error: Could not find user baradmin in configured group csstaff
  [2011-03-29T08:19:35] error: Could not find user mvalle in configured group csstaff
  [2011-03-29T08:19:35] error: Could not find user puradm in configured group csstaff
  [2011-03-29T08:19:35] error: Could not find user ggobbi in configured group csappli
  [2011-03-29T08:19:35] error: Could not find user mvalle in configured group csappli

  palu2:0 ~> getent group csstaff
  csstaff:*:1000:baradmin,biddisco,jfavre,mvalle,puradm
  palu2:0 ~> id baradmin
  id: baradmin: No such user
  palu2:1 ~> id mvalle
  id: mvalle: No such user
  palu2:1 ~> id puradm
  id: puradm: No such user
  ==> The secondary group members 'biddisco' and 'jfavre' are ok, no warnings.

  palu2:1 ~> getent group csappli
  csappli:*:1010:ajocksch,alam,amangili,annaloro,biddisco,cordery,cponti,fgilles,ggobbi,grenker,jfavre,mgg,mvalle,nstring,piccinal,robinson,soumagne,tack,tadrian,uvaretto,wsawyer
  palu2:0 ~> id ggobbi
  id: ggobbi: No such user
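  A minimal sketch of the pattern the entry above describes: expand a configured group with getgrnam(), look each secondary member up with getpwnam(), and quietly skip members without a passwd entry instead of logging an error on every reconfigure. The function name and message texts are illustrative, not the actual slurmctld code.

    /* Illustrative only; uses standard libc getgrnam()/getpwnam(). */
    #include <grp.h>
    #include <pwd.h>
    #include <stdio.h>

    static void expand_configured_group(const char *group_name)
    {
        struct group *grp = getgrnam(group_name);
        if (grp == NULL) {
            fprintf(stderr, "error: no such group %s\n", group_name);
            return;
        }

        for (char **member = grp->gr_mem; member && *member; member++) {
            struct passwd *pw = getpwnam(*member);
            if (pw == NULL) {
                /* no valid login: note it quietly rather than as an error */
                fprintf(stderr, "debug: skipping %s in group %s (no passwd entry)\n",
                        *member, group_name);
                continue;
            }
            /* valid login: add pw->pw_uid to the group's allowed user list */
            printf("adding uid %u (%s) to group %s\n",
                   (unsigned) pw->pw_uid, pw->pw_name, group_name);
        }
    }

    int main(void)
    {
        expand_configured_group("csstaff");    /* group name from the log above */
        return 0;
    }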
- Moe Jette authored
  When running in multiple-slurmd mode, the actual hardware configuration reported by the slurmd is ignored, and internal entries (via register_front_ends()) just use 1 as a dummy value for CPUs, sockets, cores, and threads. On a dual-core service node this led to continual warning messages like:

  [2011-04-01T10:06:40] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
  [2011-04-01T10:07:24] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
- Moe Jette authored
  This audits the select/cray code so that it does not accidentally dereference a NULL job_ptr. This instance happens once, upon restart of slurmctld (detailed description below). Similar checks are also in place in other select plugins; in any case, it is better to check this. Almost all cases use xassert(); the only exception is p_job_fini(), which assumes NULL means there is nothing to be finalized.
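  A minimal sketch of that hardening, with illustrative stand-in names (SLURM's xassert() is approximated with assert(); the record type and entry points are hypothetical): ordinary entry points treat a NULL job_ptr as a programming error, while the finish path treats NULL as "nothing to finalize".

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    #define xassert(expr) assert(expr)     /* stand-in for SLURM's xassert() */

    struct job_record {                    /* trimmed-down, illustrative record */
        unsigned int job_id;
    };

    /* Typical entry point: a NULL job_ptr is a programming error. */
    static int do_job_test(struct job_record *job_ptr)
    {
        xassert(job_ptr);
        printf("testing resources for job %u\n", job_ptr->job_id);
        return 0;
    }

    /* Finish path: after a slurmctld restart job_ptr may legitimately be
     * NULL, which simply means there is nothing to clean up. */
    static int do_job_fini(struct job_record *job_ptr)
    {
        if (job_ptr == NULL)
            return 0;
        printf("releasing resources of job %u\n", job_ptr->job_id);
        return 0;
    }

    int main(void)
    {
        struct job_record job = { 42 };
        do_job_test(&job);
        do_job_fini(&job);
        do_job_fini(NULL);                 /* tolerated: nothing to finalize */
        return 0;
    }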
- Moe Jette authored
  When running in multiple-slurmd mode, the actual hardware configuration reported by the slurmd is ignored, and internal entries (via register_front_ends()) just use 1 as a dummy value for CPUs, sockets, cores, and threads. On a dual-core service node this led to continual warning messages like:

  [2011-04-01T10:06:40] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
  [2011-04-01T10:07:24] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)

  Since validate_nodes_via_front_end() ignores the reported values, it is safe to use the actual hardware configuration here, which also helps with taking stock of the current cluster configuration (e.g. via scontrol show slurmd). After applying this patch, the slurmds report without warnings as:

  [2011-04-01T12:03:38] slurmd version 2.3.0-pre4 started
  [2011-04-01T12:03:38] slurmd started on Fri 01 Apr 2011 12:03:38 +0200
  [2011-04-01T12:03:38] Procs=2 Sockets=1 Cores=2 Threads=1 Memory=3886 TmpDisk=1943 Uptime=14355
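  To make the idea concrete, here is a small, self-contained sketch of copying the slurmd-reported hardware counts into the front-end record instead of hard-coding 1. The struct layouts and the register_front_end() helper are hypothetical, not the actual SLURM data structures.

    #include <stdint.h>
    #include <stdio.h>

    struct slurmd_registration {     /* what the slurmd reports; illustrative */
        uint16_t cpus, sockets, cores, threads;
        uint32_t real_memory_mb, tmp_disk_mb;
    };

    struct front_end_record {        /* controller-side record; illustrative */
        uint16_t cpus, sockets, cores, threads;
        uint32_t real_memory_mb, tmp_disk_mb;
    };

    static void register_front_end(struct front_end_record *fe,
                                   const struct slurmd_registration *reg)
    {
        /* use the reported hardware configuration, not dummy values of 1 */
        fe->cpus           = reg->cpus;
        fe->sockets        = reg->sockets;
        fe->cores          = reg->cores;
        fe->threads        = reg->threads;
        fe->real_memory_mb = reg->real_memory_mb;
        fe->tmp_disk_mb    = reg->tmp_disk_mb;
    }

    int main(void)
    {
        /* values taken from the slurmd log lines quoted above */
        struct slurmd_registration reg = { 2, 1, 2, 1, 3886, 1943 };
        struct front_end_record fe;

        register_front_end(&fe, &reg);
        printf("Procs=%u Sockets=%u Cores=%u Threads=%u Memory=%u TmpDisk=%u\n",
               (unsigned) fe.cpus, (unsigned) fe.sockets, (unsigned) fe.cores,
               (unsigned) fe.threads, (unsigned) fe.real_memory_mb,
               (unsigned) fe.tmp_disk_mb);
        return 0;
    }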
- Moe Jette authored
  This caused segfaults/core dumps when the slurmd/slurmctld unloaded the select/cray plugin.