Commits · 94b48e9b9acf6e4836637a75905b335f43672b63 · tud-zih-energy / Slurm

Jan 16, 2004
- Minor structural mods for clean AIX build. · 94b48e9b
  Moe Jette authored 21 years ago
  
  94b48e9b
Jan 14, 2004
- Explicitly set the sigaction for key signals to SIG_DFL. On thunder SIGTERM · e248f940
  Moe Jette authored 21 years ago
  
  was being ignored by default when manually initiated, which prevents it from terminating gracefully.
  e248f940
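
As an illustration of e248f940, a minimal sketch of resetting signal dispositions to `SIG_DFL` with `sigaction()`. The helper name `reset_signals_to_default` and the choice of signals passed to it are assumptions; the commit does not say which signals it covers beyond SIGTERM.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reset each listed signal to its default action,
 * so an inherited SIG_IGN disposition cannot suppress it.
 * Returns 0 on success, -1 if any sigaction() call fails. */
int reset_signals_to_default(const int *signals, size_t count)
{
    struct sigaction act;
    size_t i;

    assert(signals != NULL);

    memset(&act, 0, sizeof(act));
    act.sa_handler = SIG_DFL;
    sigemptyset(&act.sa_mask);

    for (i = 0; i < count; i++) {
        if (sigaction(signals[i], &act, NULL) < 0)
            return -1;
    }
    return 0;
}
```

A daemon would call this early in startup, for example with an array holding `SIGTERM` and `SIGINT`, before installing its own handlers.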
Jan 13, 2004
- Remember the last ping node time and use that as a basis for determining · 9605ed1a
  Moe Jette authored 21 years ago
  
  when a node becomes DOWN for not responding. This is because if there are a large number of non-responsive nodes, the ping agent can take a long time to complete (one second per non-responsive node or 10 second timeout per node with 10 parallel tasks). This should more properly mark nodes as DOWN.
  9605ed1a
- Unblock the SIGALRM so that hung I/O can be killed for each thread. · ba3a865e
  Moe Jette authored 21 years ago
  
  ba3a865e
- If an Elan system has a node with too few nodes, but the node will be DOWN, · d957fa98
  Moe Jette authored 21 years ago
  
  then don't treat as fatal error.
  d957fa98
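
One reading of 9605ed1a, sketched under assumptions: a node's silence is measured against the recorded last ping time rather than the current time, so a slow ping cycle does not by itself make nodes look overdue. The struct `node_rec`, its fields, and `mark_unresponsive_nodes` are hypothetical names, not Slurm's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical node record; the real Slurm structure differs. */
struct node_rec {
    time_t last_response;   /* when the node last answered */
    bool   down;            /* set when the node is judged DOWN */
};

/* Mark as DOWN every node whose last response is older than
 * (last_ping_time - timeout). Returns the number of nodes newly
 * marked DOWN. */
int mark_unresponsive_nodes(struct node_rec *nodes, size_t count,
                            time_t last_ping_time, time_t timeout)
{
    size_t i;
    int marked = 0;

    assert(nodes != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (!nodes[i].down &&
            nodes[i].last_response < last_ping_time - timeout) {
            nodes[i].down = true;
            marked++;
        }
    }
    return marked;
}
```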
Dec 31, 2003
- Switch plugin module added. While many files were modified, these · 0dd3f0cf
  Moe Jette authored 21 years ago
  
  modifications were relatively minor - mostly changes in function names or arguments.
  0dd3f0cf
Dec 24, 2003
- Fix bug which could prevent DOWN node from transitioning to DRAINED state · 691479ff
  Moe Jette authored 21 years ago
  
  (it was inappropriately going to DRAINING state).
  691479ff
Dec 23, 2003

- Add free of api config data structure on termination to more easily · cec31bda
  Moe Jette authored 21 years ago
  
  see memory leaks.
  cec31bda
- Remove reference to QSW_PACK_SIZE, just use sufficiently large buffer · 06d1ed1d
  Moe Jette authored 21 years ago
  
  (it grows anyway).
  Fix state read logic to better handle error conditions.
  06d1ed1d
- Fix state read to deal with errors better. · 7b2428f5
  Moe Jette authored 21 years ago
  
  Fix update node RPC to handle reason field change without state change. State was being handled as type int instead of uint16_t so NO_VAL check was not working properly.
  7b2428f5
- Fix partition state read to deal with possible EINTR and errors better. · fed83e1b
  Moe Jette authored 21 years ago
  
  fed83e1b
- Fix bug in slurmctld parsing of config file that can cause seg fault. · 40b6c211
  Moe Jette authored 21 years ago
  
  "Scontrol abort" works. It was leaving a hung pthread due to a recent change.
  Fix a couple of potential memory leaks.
  "switch_type" has been added to config data structure, un/pack, etc, but not yet reported to the user or documented.
  The plugins now use function calls to get their type and plugin directory from a common data structure rather than individually reading and parsing the configuration file.
  40b6c211
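
For the EINTR handling mentioned in fed83e1b, a minimal sketch of a read loop that retries interrupted calls and reports real errors. The function name `read_retry` is an assumption; this is the general POSIX pattern, not the code from the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read up to `len` bytes from `fd`, retrying when
 * read() is interrupted by a signal (EINTR).
 * Returns the number of bytes read (short only at end of file),
 * or -1 on any other error. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t total = 0;

    assert(buf != NULL);

    while (total < len) {
        ssize_t n = read(fd, p + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: try again */
            return -1;          /* genuine error */
        }
        if (n == 0)
            break;              /* end of file */
        total += (size_t) n;
    }
    return (ssize_t) total;
}
```

A caller reading a saved state file would compare the return value with the expected record size and treat a short count or -1 as a corrupt or unreadable file.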

Dec 22, 2003
- Fix potential memory leak. · 0c8404d0
  Moe Jette authored 21 years ago
  
  0c8404d0
Dec 19, 2003

- o don't return SLURM_ERROR from function returning type void · a5b05e27
  Mark Grondona authored 21 years ago
  
  a5b05e27
- o QsNet: Use /etc/elanhosts for translating hostnames to ElanIDs · 91114a0f
  Mark Grondona authored 21 years ago
  
  (as well as the reverse) in all qsw code.
   - move elanhosts.[ch] to common
   - initialize elanhost_config on demand in qsw.c
   - remove calls to elanhosts in slurmd/elan_interconnect.c
   - merge libhostlist into libcommon since elanhosts needs it.
  91114a0f

Dec 11, 2003

- Fix handling of shared nodes. Track node use by jobs which don't permit · 0ebd75fb
  Moe Jette authored 21 years ago
  
  sharing via node record of job count (0 | 1) and bitmap of nodes which permit sharing. Previous logic could permit a job accepting shared nodes to be scheduled on a node that already had a running job not accepting shared nodes.
  0ebd75fb
- Don't create parent directories for state save location. On create · 10561a58
  Moe Jette authored 21 years ago
  
  failure, log with fatal() and exit.
  10561a58
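
The bookkeeping described in 0ebd75fb might look roughly like the sketch below. It is an interpretation, not the commit's code: the struct `node_use`, the 64-node `uint64_t` bitmap, and all three function names are hypothetical, and the real scheduler uses Slurm's own bitstring type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-node record: count of running jobs that refuse
 * to share the node (0 or 1). */
struct node_use {
    unsigned no_share_job_cnt;
};

/* A job that accepts sharing may use node `idx` only if the node
 * permits sharing and no non-sharing job is running there. */
bool can_place_sharing_job(const struct node_use *nodes,
                           uint64_t share_bitmap, unsigned idx)
{
    assert(idx < 64);
    return ((share_bitmap >> idx) & 1u) != 0 &&
           nodes[idx].no_share_job_cnt == 0;
}

/* A non-sharing job starts: record it and withdraw the node
 * from the set of shareable nodes. */
void start_exclusive_job(struct node_use *nodes,
                         uint64_t *share_bitmap, unsigned idx)
{
    assert(idx < 64);
    nodes[idx].no_share_job_cnt = 1;
    *share_bitmap &= ~(UINT64_C(1) << idx);
}

/* The non-sharing job ends: the node becomes shareable again. */
void end_exclusive_job(struct node_use *nodes,
                       uint64_t *share_bitmap, unsigned idx)
{
    assert(idx < 64);
    nodes[idx].no_share_job_cnt = 0;
    *share_bitmap |= UINT64_C(1) << idx;
}
```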

Dec 10, 2003
- Alignment change, no changes in logic. · 290fe56c
  Moe Jette authored 21 years ago
  
  290fe56c
- Create parent directories as needed before creating log, pid, TMPDIR, · e8148e2f
  Moe Jette authored 21 years ago
  
  job logs, state save directories, etc.
  e8148e2f
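
A minimal sketch of what "create parent directories as needed" (e8148e2f) can look like in C. The function name `mkdir_parents` and the example path are assumptions; the commit's actual helper is not shown in this log.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: create every missing directory leading up to
 * `path` (the final component, e.g. a log file, is not created).
 * Existing directories are left alone.
 * Returns 0 on success, -1 on error with errno set. */
int mkdir_parents(const char *path, mode_t mode)
{
    size_t len;
    char *buf, *p;

    assert(path != NULL);

    len = strlen(path);
    buf = malloc(len + 1);
    if (buf == NULL)
        return -1;
    memcpy(buf, path, len + 1);

    /* Start past the first character so a leading '/' is skipped. */
    for (p = buf + (len > 0 ? 1 : 0); *p != '\0'; p++) {
        if (*p != '/')
            continue;
        *p = '\0';                          /* cut at this component */
        if (mkdir(buf, mode) < 0 && errno != EEXIST) {
            free(buf);
            return -1;
        }
        *p = '/';                           /* restore and move on */
    }
    free(buf);
    return 0;
}
```

For example, `mkdir_parents("/var/log/slurm/slurmctld.log", 0755)` would create `/var`, `/var/log`, and `/var/log/slurm` if they are missing, leaving the caller to open the log file itself.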
Dec 08, 2003
- Restore job_signal argument inadvertently removed by Jay. · ae687962
  Moe Jette authored 21 years ago
  
  ae687962
Dec 06, 2003
- Avoid sending TASKS=0 to Wiki; Maui silently rejects the job (insofar as Maui... · a7eb09f8
  jwindley authored 21 years ago
  
  Avoid sending TASKS=0 to Wiki; Maui silently rejects the job (insofar as Maui silently does anything)
  a7eb09f8
Dec 05, 2003

- If nodes vanish on a reconfig or slurmctld restart then: · 7ef7afe6
  Moe Jette authored 21 years ago
  
  gracefully kill all jobs allocated resources on those nodes,
  gracefully kill all pending jobs that require those nodes,
  leave pending jobs that exclude those nodes but ignore those nodes.
  Added "best_effort" argument to node_name2bitmap() function.
  Fix potential memory leak when maui scheduler interface resets the required nodes.
  (gnats:342)
  7ef7afe6
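
To illustrate the idea behind the "best_effort" argument in 7ef7afe6, a simplified sketch: unknown node names are either a hard error or skipped, depending on a flag. This is not the signature of Slurm's `node_name2bitmap()`; the function `names_to_bitmap`, the comma-separated input, and the 64-bit bitmap are all assumptions made for the example (the real function handles hostlist expressions).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: set bit i of *bitmap for each name in the
 * comma-separated list `names` that equals known[i].
 * With best_effort false, an unknown name is an error (-1).
 * With best_effort true, unknown names are skipped and the number
 * skipped is returned. Only the first 64 known nodes are mapped. */
int names_to_bitmap(const char *names, const char *const known[],
                    size_t n_known, bool best_effort, uint64_t *bitmap)
{
    const char *p = names;
    int skipped = 0;

    assert(names != NULL && known != NULL && bitmap != NULL);

    *bitmap = 0;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of current name */
        bool found = false;
        size_t i;

        for (i = 0; i < n_known && i < 64; i++) {
            if (strlen(known[i]) == len &&
                strncmp(known[i], p, len) == 0) {
                *bitmap |= UINT64_C(1) << i;
                found = true;
                break;
            }
        }
        if (!found && len > 0) {
            if (!best_effort)
                return -1;
            skipped++;
        }
        p += len;
        if (*p == ',')
            p++;
    }
    return skipped;
}
```

On a restart where a node has vanished from the configuration, a caller could use the best-effort mode to recover the surviving part of a saved node list and then decide what to do with jobs that referenced the missing nodes.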

Dec 03, 2003
- Rename some overly verbose variables. No changes in logic. · 161ed17d
  Moe Jette authored 21 years ago
  
  161ed17d
- Change some verbose variable names for greater clarity. No change in logic. · 4199ab97
  Moe Jette authored 21 years ago
  
  4199ab97
- Fix bug. Job time limits were not enforced if InactiveLimit configured as zero. · f0288857
  Moe Jette authored 21 years ago
  
  f0288857
Nov 25, 2003
- Add batch_flag to job signal call. If set, then signal the batch shell only. · de83f21b
  Moe Jette authored 21 years ago
  
  Otherwise signal all steps associated with the job (unless individual job steps are identified).
  de83f21b
Nov 22, 2003
- Note RPC times over 1 second with message type of error(). · 637c0eae
  Moe Jette authored 21 years ago
  
  Allocate calls were using job table info in place after freeing mutex. New logic copies data structures, frees mutex, sends message, and then frees memory allocated for copy.
  637c0eae
- Restore signal to SIGALRM. Still seeing very rare communication failures · 54ca492d
  Moe Jette authored 21 years ago
  
  with SIGCONT.
  54ca492d
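
The sequence described in 637c0eae (copy the data, release the mutex, send, then free the copy) can be sketched as below. Everything named here is hypothetical: `job_info`, `job_table`, `job_lock`, `reply_with_job_copy`, and the stub sender stand in for slurmctld's real structures and RPC code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define JOB_TABLE_SIZE 16

/* Hypothetical job record and table, protected by job_lock. */
struct job_info {
    unsigned job_id;
    char     node_list[64];
};

static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;
static struct job_info job_table[JOB_TABLE_SIZE];

typedef int (*send_fn)(const struct job_info *);

/* Stub sender used for illustration: records the job id it was given. */
static unsigned last_sent_id;
static int send_stub(const struct job_info *info)
{
    last_sent_id = info->job_id;
    return 0;
}

/* Copy the record while holding the lock, release the lock, then do
 * the (possibly slow) send using only the private copy. */
int reply_with_job_copy(size_t idx, send_fn send_msg)
{
    struct job_info *copy;
    int rc;

    assert(send_msg != NULL);
    if (idx >= JOB_TABLE_SIZE)
        return -1;

    copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return -1;

    pthread_mutex_lock(&job_lock);
    *copy = job_table[idx];         /* snapshot under the lock */
    pthread_mutex_unlock(&job_lock);

    rc = send_msg(copy);            /* lock is not held during I/O */
    free(copy);
    return rc;
}
```

The point of the pattern is that the table entry may change or be freed once the mutex is released, so the sender must never touch it directly.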
Nov 21, 2003
- Do not send messages to srun jobs/steps that are no longer running (the · 314068b2
  Moe Jette authored 21 years ago
  
  requests can get backed up in a queue while the job actually completes).
  314068b2
- Turn off out-of-band launch response message to srun. Appears to be causing · 65a676f9
  Moe Jette authored 21 years ago
  
  some problems with slurmctld communication hangs.
  65a676f9
- Add logging of save state times. · 2e5aeb7d
  Moe Jette authored 21 years ago
  
  2e5aeb7d
- Avoid cache collisions · 1d70b769
  jwindley authored 21 years ago
  
  1d70b769
- Relocate a lock to eliminate a deadlock situation that was arising. · bd7761e3
  Moe Jette authored 21 years ago
  
  bd7761e3
Nov 20, 2003
- Don't log communication errors to srun. Srun processes are subject to going · 5ee03340
  Moe Jette authored 21 years ago
  
  away at any time anyway.
  5ee03340
- Move all slurm message free calls into slurm_protocol_defs. Remove · c862639c
  Moe Jette authored 21 years ago
  
  src/api/free_msg.c.
  c862639c
- Add logic for slurmctld to send launch response to srun whenever a · a7930665
  Moe Jette authored 21 years ago
  
  queued allocation request is satisfied. Srun just has a stub to catch and log the message while using polling to notice the allocation has been made.
  a7930665
- Report job.start_time as NO_VAL if not yet set; do not pass STARTTIME to Maui if not set. · f39b3ce0
  jwindley authored 21 years ago
  
  f39b3ce0
- Move lock unlock to eliminate possible error in managing job data · 33862e7e
  Moe Jette authored 21 years ago
  
  structure.
  33862e7e
Nov 19, 2003
- Clean up some long variable names, no changes in logic. · a1a4d860
  Moe Jette authored 21 years ago
  
  a1a4d860
- Slurmctld restart was preserving node reason, but not state if in some · e121f5a3
  Moe Jette authored 21 years ago
  
  DOWN or DRAIN state and not responding. Changed logic to clear NO_RESPOND flag before proceeding with logic.
  e121f5a3