- Apr 07, 2011
- Moe Jette authored
  errors after slurmctld restarts.
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Danny Auble authored
- Danny Auble authored
  Fix so slurmctld will correctly pack 2.1 step information. (Only needed if a 2.1 client is talking to a 2.2 slurmctld.)
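  As an illustration of the entry above, here is a minimal, self-contained sketch of version-conditional packing: a field that exists only in the newer wire format is omitted when the requesting client speaks the 2.1 protocol. The buffer helper, struct fields, and version constants are hypothetical stand-ins, not the actual SLURM pack API.

    /* Hypothetical sketch of protocol-versioned packing; not SLURM source. */
    #include <stdint.h>
    #include <stdio.h>

    enum { PROTO_2_1 = 21, PROTO_2_2 = 22 };     /* illustrative version tags */

    struct buf { unsigned char data[256]; size_t len; };

    static void pack32(uint32_t v, struct buf *b)
    {
        /* big-endian, as wire protocols typically require */
        b->data[b->len++] = (v >> 24) & 0xff;
        b->data[b->len++] = (v >> 16) & 0xff;
        b->data[b->len++] = (v >> 8)  & 0xff;
        b->data[b->len++] =  v        & 0xff;
    }

    struct step_info {          /* trimmed-down, illustrative step record */
        uint32_t job_id;
        uint32_t step_id;
        uint32_t num_tasks;
        uint32_t time_limit;    /* assume this field is new in 2.2 */
    };

    /* Pack a step record in the layout the requesting client understands. */
    static void pack_step_info(const struct step_info *s, struct buf *b,
                               int client_version)
    {
        pack32(s->job_id, b);
        pack32(s->step_id, b);
        pack32(s->num_tasks, b);
        if (client_version >= PROTO_2_2)
            pack32(s->time_limit, b);    /* omitted for 2.1 clients */
    }

    int main(void)
    {
        struct step_info s = { 1234, 0, 16, 60 };
        struct buf old_buf = { .len = 0 }, new_buf = { .len = 0 };

        pack_step_info(&s, &old_buf, PROTO_2_1);
        pack_step_info(&s, &new_buf, PROTO_2_2);
        printf("2.1 layout: %zu bytes, 2.2 layout: %zu bytes\n",
               old_buf.len, new_buf.len);
        return 0;
    }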
- Danny Auble authored
- Apr 06, 2011
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Don Lipari authored
- Don Lipari authored
- Apr 05, 2011
- Moe Jette authored
  resources (gres).
- Moe Jette authored
- Moe Jette authored
  scheduling which prevented some jobs from starting as soon as possible.
- Danny Auble authored
- Danny Auble authored
- Moe Jette authored
- Moe Jette authored
  improve wording on an option.
- Moe Jette authored
- Moe Jette authored
- Apr 04, 2011
- Danny Auble authored
- Moe Jette authored
  select/cons_res plugin. Patch from Rod Schulz, Bull.
- Danny Auble authored
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
- Apr 03, 2011
- Moe Jette authored
- Moe Jette authored
- Moe Jette authored
  options and not really tested, but this is a good start.
- Moe Jette authored
  This avoids a warning message that is repeated on each reconfiguration of slurm and is due to a dangling group configuration in LDAP entries. The error occurs when traversing the secondary group members of a given group name, when trying to add these to a configured group. If these secondary group members have no valid login (e.g. disabled via LDAP configuration), the error is repeated on each reconfigure of slurm.

  The error is harmless: since the users have no valid login, they can not log into the system anyway. I have raised the issue described below with our LDAP admin; there was no reply (likely because it was not considered important enough). Since slurm is not a tool to debug the work of system administrators, and since the secondary group members can not log in anyway, this patch replaces the error message with a comment. It leaves untouched the positive case of secondary group members that are successfully added to a configured group because they have a valid passwd/LDAP login entry.

  Here is the case which gets repeated on our system, showing that each error message corresponds to a 'no such user' error when trying to look up the user id:

  [2011-03-29T08:19:35] error: Could not find user baradmin in configured group csstaff
  [2011-03-29T08:19:35] error: Could not find user mvalle in configured group csstaff
  [2011-03-29T08:19:35] error: Could not find user puradm in configured group csstaff
  [2011-03-29T08:19:35] error: Could not find user ggobbi in configured group csappli
  [2011-03-29T08:19:35] error: Could not find user mvalle in configured group csappli

  palu2:0 ~> getent group csstaff
  csstaff:*:1000:baradmin,biddisco,jfavre,mvalle,puradm
  palu2:0 ~> id baradmin
  id: baradmin: No such user
  palu2:1 ~> id mvalle
  id: mvalle: No such user
  palu2:1 ~> id puradm
  id: puradm: No such user
  ==> The secondary group members 'biddisco' and 'jfavre' are ok, no warnings.

  palu2:1 ~> getent group csappli
  csappli:*:1010:ajocksch,alam,amangili,annaloro,biddisco,cordery,cponti,fgilles,ggobbi,grenker,jfavre,mgg,mvalle,nstring,piccinal,robinson,soumagne,tack,tadrian,uvaretto,wsawyer
  palu2:0 ~> id ggobbi
  id: ggobbi: No such user
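  A minimal sketch of the pattern the entry above describes: expand a configured group with getgrnam(), look each secondary member up with getpwnam(), and quietly skip members without a passwd entry instead of logging an error on every reconfigure. The function name and message texts are illustrative, not the actual slurmctld code.

    /* Illustrative only; uses standard libc getgrnam()/getpwnam(). */
    #include <grp.h>
    #include <pwd.h>
    #include <stdio.h>

    static void expand_configured_group(const char *group_name)
    {
        struct group *grp = getgrnam(group_name);
        if (grp == NULL) {
            fprintf(stderr, "error: no such group %s\n", group_name);
            return;
        }

        for (char **member = grp->gr_mem; member && *member; member++) {
            struct passwd *pw = getpwnam(*member);
            if (pw == NULL) {
                /* no valid login: note it quietly rather than as an error */
                fprintf(stderr, "debug: skipping %s in group %s (no passwd entry)\n",
                        *member, group_name);
                continue;
            }
            /* valid login: add pw->pw_uid to the group's allowed user list */
            printf("adding uid %u (%s) to group %s\n",
                   (unsigned) pw->pw_uid, pw->pw_name, group_name);
        }
    }

    int main(void)
    {
        expand_configured_group("csstaff");    /* group name from the log above */
        return 0;
    }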
- Moe Jette authored
  When running in multiple-slurmd mode, the actual hardware configuration reported by the slurmd is ignored, and internal entries (via register_front_ends()) just use 1 as a dummy value for CPUs, sockets, cores, and threads. On a dual-core service node this led to continual warning messages like:

  [2011-04-01T10:06:40] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
  [2011-04-01T10:07:24] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
- Moe Jette authored
  This audits the select/cray code so that it does not accidentally dereference a NULL job_ptr. This instance happens once, upon restart of slurmctld (detailed description below). Similar checks are also in place in other select plugins; in any case, it is better to check this. Almost all cases use xassert(); the only exception is p_job_fini(), which assumes NULL means there is nothing to be finalized.
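  A minimal sketch of that hardening, with illustrative stand-in names (SLURM's xassert() is approximated with assert(); the record type and entry points are hypothetical): ordinary entry points treat a NULL job_ptr as a programming error, while the finish path treats NULL as "nothing to finalize".

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    #define xassert(expr) assert(expr)     /* stand-in for SLURM's xassert() */

    struct job_record {                    /* trimmed-down, illustrative record */
        unsigned int job_id;
    };

    /* Typical entry point: a NULL job_ptr is a programming error. */
    static int do_job_test(struct job_record *job_ptr)
    {
        xassert(job_ptr);
        printf("testing resources for job %u\n", job_ptr->job_id);
        return 0;
    }

    /* Finish path: after a slurmctld restart job_ptr may legitimately be
     * NULL, which simply means there is nothing to clean up. */
    static int do_job_fini(struct job_record *job_ptr)
    {
        if (job_ptr == NULL)
            return 0;
        printf("releasing resources of job %u\n", job_ptr->job_id);
        return 0;
    }

    int main(void)
    {
        struct job_record job = { 42 };
        do_job_test(&job);
        do_job_fini(&job);
        do_job_fini(NULL);                 /* tolerated: nothing to finalize */
        return 0;
    }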
- Moe Jette authored
  When running in multiple-slurmd mode, the actual hardware configuration reported by the slurmd is ignored, and internal entries (via register_front_ends()) just use 1 as a dummy value for CPUs, sockets, cores, and threads. On a dual-core service node this led to continual warning messages like:

  [2011-04-01T10:06:40] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
  [2011-04-01T10:07:24] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)

  Since validate_nodes_via_front_end() ignores the reported values, it is safe to use the actual hardware configuration here, which also helps with taking stock of the current cluster configuration (e.g. via scontrol show slurmd). After applying this patch, the slurmds report without warnings as:

  [2011-04-01T12:03:38] slurmd version 2.3.0-pre4 started
  [2011-04-01T12:03:38] slurmd started on Fri 01 Apr 2011 12:03:38 +0200
  [2011-04-01T12:03:38] Procs=2 Sockets=1 Cores=2 Threads=1 Memory=3886 TmpDisk=1943 Uptime=14355
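  To make the idea concrete, here is a small, self-contained sketch of copying the slurmd-reported hardware counts into the front-end record instead of hard-coding 1. The struct layouts and the register_front_end() helper are hypothetical, not the actual SLURM data structures.

    #include <stdint.h>
    #include <stdio.h>

    struct slurmd_registration {     /* what the slurmd reports; illustrative */
        uint16_t cpus, sockets, cores, threads;
        uint32_t real_memory_mb, tmp_disk_mb;
    };

    struct front_end_record {        /* controller-side record; illustrative */
        uint16_t cpus, sockets, cores, threads;
        uint32_t real_memory_mb, tmp_disk_mb;
    };

    static void register_front_end(struct front_end_record *fe,
                                   const struct slurmd_registration *reg)
    {
        /* use the reported hardware configuration, not dummy values of 1 */
        fe->cpus           = reg->cpus;
        fe->sockets        = reg->sockets;
        fe->cores          = reg->cores;
        fe->threads        = reg->threads;
        fe->real_memory_mb = reg->real_memory_mb;
        fe->tmp_disk_mb    = reg->tmp_disk_mb;
    }

    int main(void)
    {
        /* values taken from the slurmd log lines quoted above */
        struct slurmd_registration reg = { 2, 1, 2, 1, 3886, 1943 };
        struct front_end_record fe;

        register_front_end(&fe, &reg);
        printf("Procs=%u Sockets=%u Cores=%u Threads=%u Memory=%u TmpDisk=%u\n",
               (unsigned) fe.cpus, (unsigned) fe.sockets, (unsigned) fe.cores,
               (unsigned) fe.threads, (unsigned) fe.real_memory_mb,
               (unsigned) fe.tmp_disk_mb);
        return 0;
    }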
- Moe Jette authored
  This caused segfaults/core dumps when the slurmd/slurmctld unloaded the select/cray plugin.