Commits · 61e7f91e875cfc5ff57d08b316447864fb2ba9e8 · tud-zih-energy / Slurm

Mar 05, 2004

Increase the throughput rate of outgoing slurmctld message traffic. · 61e7f91e

Moe Jette authored 21 years ago

Fixes a problem when running a very large number of simultaneous jobs, where
their inactivity limit was reached due to a backlog of messages.

61e7f91e

Mar 04, 2004
- Added some missing read locks in slurmctld's references to its configuration · 10d7b272
  Moe Jette authored 21 years ago
  
  data structure. This eliminates risks associated with re-reading slurm.conf.
  10d7b272
Feb 12, 2004
- Remove vestigial references to ESLURMD_KILL_JOB_FAILES. · 0f27bda4
  Moe Jette authored 21 years ago
  
  0f27bda4
- Node does not transition from COMPLETING to DOWN state due to node not · cfbdb3b2
  Moe Jette authored 21 years ago
  
  responding. Wait for tasks to complete or administrator to set DOWN.
  cfbdb3b2
Jan 28, 2004
- Fix bug introduced in handling SIGALRM, don't unblock, was causing · 3dbc535a
  Moe Jette authored 21 years ago
  
  slurmctld to exit.
  3dbc535a
Jan 13, 2004
- Unblock the SIGALRM so that hung I/O can be killed for each thread. · ba3a865e
  Moe Jette authored 21 years ago
  
  ba3a865e
Nov 22, 2003
- Restore signal to SIGALRM. Still seeing very rare communication failures · 54ca492d
  Moe Jette authored 21 years ago
  
  with SIGCONT.
  54ca492d
Nov 21, 2003
- Do not send messages to srun jobs/steps that are no longer running (the · 314068b2
  Moe Jette authored 21 years ago
  
  requests can get backed up in a queue while the job actually completes).
  314068b2
Nov 20, 2003
- Don't log communication errors to srun. Srun processes are subject to going · 5ee03340
  Moe Jette authored 21 years ago
  
  away at any time anyway.
  5ee03340
- Move all slurm message free calls into slurm_protocol_defs. Remove · c862639c
  Moe Jette authored 21 years ago
  
  src/api/free_msg.c.
  c862639c
- Add logic for slurmctld to send launch response to srun whenever a · a7930665
  Moe Jette authored 21 years ago
  
  queued allocation request is satisfied. Srun just has a stub to catch and log the message while using polling to notice the allocation has been made.
  a7930665
Oct 29, 2003

Slurmctld now pings srun periodically. If srun fails to respond, the job · e80b2442

Moe Jette authored 21 years ago

and/or job step(s) will have their resources de-allocated and be killed.
A resource allocation will not be released unless no job steps are active
for at least InactiveLimit seconds. DPCS jobs will be subject to this
forced de-allocation if they remain inactive for an extended period of
time, which can get SLURM and DPCS back in sync if DPCS does a cold-start.

e80b2442
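The release rule described in this commit can be sketched as a small predicate. This is only an illustrative model, not the slurmctld code: the function name `allocation_expired` and its parameters are hypothetical, and only the rule itself (no active job steps for at least InactiveLimit seconds) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical model of the rule in the commit message: an allocation is
 * eligible for forced release only when it has no active job steps and
 * has been idle for at least inactive_limit seconds. */
bool allocation_expired(int active_steps, time_t last_active,
                        time_t now, int inactive_limit)
{
    assert(active_steps >= 0 && inactive_limit >= 0);

    if (active_steps > 0)
        return false;   /* a running step keeps the allocation alive */

    return difftime(now, last_active) >= inactive_limit;
}
```

An allocation with a running step is never released by this check, regardless of how long ago srun last responded.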

Oct 11, 2003
- Make agent logic more robust in the face of communications errors. · 7dbf1ccd
  Moe Jette authored 21 years ago
  
  Send multiple SIGALRMs if needed and deal with possible abort of a thread.
  7dbf1ccd
Oct 10, 2003
- Modify agent thread timeout to indicate where the error is. · 848effeb
  Moe Jette authored 21 years ago
  
  848effeb
Sep 25, 2003
- Comment out a too verbose debug3() message listing each RPC sent to each node. · 5ccb629a
  Moe Jette authored 21 years ago
  
  5ccb629a
Sep 23, 2003

Add timers for handling queued agent requests so as to support better · 5ad795fc

Moe Jette authored 21 years ago

scalability. An arbitrary number of requests may be queued and they
are processed one per second until the queue is empty or pending
requests were last attempted recently (configuration parameters set
to 60 seconds as a minimum retry interval).

5ad795fc
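One way to picture the queue handling described above is a once-per-second scan that picks a request only if it has not been attempted within the minimum retry interval. The sketch below is an assumption-laden illustration: `struct queued_req`, `next_eligible()` and `mark_attempted()` are hypothetical names, and only the 60-second interval and the one-per-second pacing come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MIN_RETRY_INTERVAL 60   /* seconds, per the commit message */

/* Hypothetical queue entry; the real agent request carries much more. */
struct queued_req {
    time_t last_attempt;        /* 0 means never attempted */
    struct queued_req *next;
};

/* Called once per second by a timer. Returns the first request that may
 * be tried at time `now`, or NULL when the queue is empty or every
 * pending request was attempted less than MIN_RETRY_INTERVAL ago. */
struct queued_req *next_eligible(struct queued_req *head, time_t now)
{
    for (struct queued_req *r = head; r != NULL; r = r->next) {
        if (r->last_attempt == 0 ||
            difftime(now, r->last_attempt) >= MIN_RETRY_INTERVAL)
            return r;
    }
    return NULL;
}

/* Record that a request was just attempted. */
void mark_attempted(struct queued_req *r, time_t now)
{
    assert(r != NULL);
    r->last_attempt = now;
}
```

Because at most one request is returned per call, a long queue drains at one request per second rather than in a burst.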

Sep 17, 2003
- Relocate pthread_cond_signal() function call to avoid possible deadlock. · cf39645e
  Moe Jette authored 21 years ago
  
  cf39645e
Sep 12, 2003

Note new slurmd return code as not being a "real" error (KILL_JOB response · 9bda8004
Moe Jette authored 21 years ago
when the job does not exist).
9bda8004

Support added for new RPC EPILOG_COMPLETE. Slurmd now responds immediately · 926aa1d9

Moe Jette authored 21 years ago

to job_kill request and slurmctld leaves node and job in COMPLETING state
until the slurmd issues an EPILOG_COMPLETE RPC on each node. This permits
better support for non-killable processes and/or long-running epilog scripts.
Several minor changes in node registration handling and slurmctld agent
logic to better address a flood of incoming RPCs (typically when system
restarts).
(gnats:268)

926aa1d9

Aug 04, 2003

Improve fault-tolerance for batch jobs. If a node fails to respond to the · e1147ea9

Moe Jette authored 21 years ago

batch_job_launch RPC, then deallocate those resources and requeue the job.
If a node registers and fails to show a batch job that should have a
script running there (node zero of allocation), then consider the job
complete.

e1147ea9

Aug 02, 2003
- Only report a node's response error once. Don't keep reporting · 9d351634
  Moe Jette authored 21 years ago
  
  "Can't connect to node" with every ping failure.
  9d351634
- Specify the reason a node is set DOWN, e.g. "Prolog failed", "Low TmpDisk", etc. · 014463e4
  Moe Jette authored 21 years ago
  
  014463e4
- Define proc_req.h, change other timers to use functions defined therein. · cb8201ab
  Moe Jette authored 21 years ago
  
  Changed the logging level of a few other messages.
  cb8201ab
- Add flag to control when node ping runs. Don't start a new ping cycle · 52e76e16
  Moe Jette authored 21 years ago
  
  until the previous one completes. This avoids having too many cycles active (and a bunch of threads too). Ping_nodes control functions moved to a new module.
  52e76e16
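The "don't start a new ping cycle until the previous one completes" rule amounts to a guarded flag. The following is a minimal sketch of that idea, assuming a mutex-protected boolean; the names `ping_begin()` and `ping_end()` and the variables are chosen for illustration and are not claimed to match the module this commit introduced.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t ping_lock = PTHREAD_MUTEX_INITIALIZER;
static bool ping_active = false;    /* true while a ping cycle is running */

/* Try to start a ping cycle. Returns true if the caller may proceed,
 * false if the previous cycle has not finished yet. */
bool ping_begin(void)
{
    bool started = false;

    pthread_mutex_lock(&ping_lock);
    if (!ping_active) {
        ping_active = true;
        started = true;
    }
    pthread_mutex_unlock(&ping_lock);
    return started;
}

/* Mark the current ping cycle as complete. */
void ping_end(void)
{
    pthread_mutex_lock(&ping_lock);
    assert(ping_active);            /* ending an unstarted cycle is a bug */
    ping_active = false;
    pthread_mutex_unlock(&ping_lock);
}
```

A periodic caller would simply skip its turn when `ping_begin()` returns false, which bounds the number of concurrent cycles (and their threads) at one.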
Jul 23, 2003
- Add more logging for agent events. · d6090705
  Moe Jette authored 21 years ago
  
  d6090705
Jul 15, 2003
- Fix a few memory leaks, particularly related to the reconfigure RPC. · 3fa0ec36
  Moe Jette authored 21 years ago
  
  Perform some general code clean-up in those areas.
  3fa0ec36
Jul 14, 2003
- Change pthread_create logic from: if fail/sleep/retry/fatal to · fc0e5f7c
  Moe Jette authored 21 years ago
  
  while fail/if too many then fatal/sleep and retry
  fc0e5f7c
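The "while fail / if too many then fatal / sleep and retry" shape named in this commit might look roughly like the loop below. It is a sketch under stated assumptions: `create_thread_with_retry()`, `MAX_CREATE_RETRIES` and the one-second sleep are hypothetical, and `abort()` stands in for whatever fatal-error routine the real code calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_CREATE_RETRIES 10   /* hypothetical limit, not taken from Slurm */

/* Keep calling pthread_create() until it succeeds. After too many
 * failures give up; otherwise sleep briefly and try again. */
void create_thread_with_retry(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    int retries = 0;
    int rc;

    assert(tid != NULL && fn != NULL);
    while ((rc = pthread_create(tid, NULL, fn, arg)) != 0) {
        if (++retries > MAX_CREATE_RETRIES) {
            fprintf(stderr, "pthread_create failed (error %d), giving up\n", rc);
            abort();            /* stand-in for a fatal() call */
        }
        sleep(1);               /* give the system time to free resources */
    }
}

/* Trivial worker used only to exercise the helper. */
void *set_flag(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}
```

Compared with a single "if fail, sleep, retry, then fatal" sequence, the loop tolerates several consecutive transient failures before giving up.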
Jul 08, 2003
- slurmctld/agent.c kills a non-startable batch job and releases its allocation. · d5a05c4b
  Moe Jette authored 21 years ago
  
  d5a05c4b
Jun 27, 2003
- Restore delayed timeout on KILL_JOB RPC. · e6c8b74b
  Moe Jette authored 21 years ago
  
  e6c8b74b
Jun 23, 2003

Don't flag a node as not responding if it has responded *after* the timed · 7e730695

Moe Jette authored 21 years ago

out message was sent (e.g., slurmd down, msg sent to slurmd, slurmd up
and registers, msg previously sent to slurmd times out).

7e730695

Jun 18, 2003
- Move retry_cnt initialization into while loop. Note and fix retry count errors · 1b89e1b3
  Moe Jette authored 21 years ago
  
  rather than letting agent go off the end of an array.
  1b89e1b3
Jun 13, 2003
- o add new thread state : DSH_JOB_HUNG for nodes where job kill request · 1669e688
  Mark Grondona authored 21 years ago
  
  failed (presumably due to unkillable processes)
  o retry failed JOB_KILL RPCs
  1669e688
Jun 12, 2003
- Don't wait for response from slurmd on reconfig request (was producing an · 804f30ab
  Moe Jette authored 21 years ago
  
  error).
  804f30ab
- Delete a job's steps when its completion is noted via *retry* of job kill · 88ac4b20
  Moe Jette authored 21 years ago
  
  RPC in agent.
  88ac4b20
Jun 11, 2003

o Need to wait kill_wait + 2 seconds for slurmd reply to kill job request · eaa33e31

Mark Grondona authored 21 years ago

   instead of just kill_wait seconds because slurmd sleeps for kill_wait
   seconds, so therefore slurmctld would never recv a reply.

eaa33e31
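The timing argument in this commit can be made concrete with a toy model. The function names and the strict "reply must arrive before the timeout" comparison are assumptions for illustration; only the `kill_wait + 2` figure comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define REPLY_SLACK 2   /* extra seconds, per the commit message */

/* Timeout slurmctld should use when waiting for the kill-job reply. */
int kill_reply_timeout(int kill_wait)
{
    assert(kill_wait >= 0);
    return kill_wait + REPLY_SLACK;
}

/* Toy model: a reply sent after reply_delay seconds is received only
 * if it arrives strictly before the timeout expires. */
bool reply_arrives_in_time(int timeout, int reply_delay)
{
    return reply_delay < timeout;
}
```

With a timeout equal to `kill_wait`, a slurmd that sleeps for `kill_wait` seconds before answering always loses the race; the two extra seconds leave room for the reply to arrive.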

May 28, 2003

o slurm_receive_msg() now takes timeout argument. Updated callers, including · f4e0330e

Mark Grondona authored 21 years ago

   slurm_send_recv_node_msg(), slurm_send_recv_rc_msg(), etc.
 o Fixed fd leak in agent.c using slurm_send_recv_rc_msg() w/ timeout.

f4e0330e

Mar 26, 2003
- Relocate a semaphore lock in the agent that could cause a bogus memory · 9f839e9b
  Moe Jette authored 22 years ago
  
  reference if a reconfigure RPC was active at the same time.
  9f839e9b
Mar 14, 2003
- slurmctld agent treats "INVALID_JOBID" as warning, not fatal condition. · 3d4d2fc2
  Moe Jette authored 22 years ago
  
  This would occur for example if we killed a job that never started any steps on a node.
  3d4d2fc2
- Convert some tests in agent.c from "if (X) fatal();" to "xassert(!X);" for · dc569ced
  Moe Jette authored 22 years ago
  
  improved performance in production.
  dc569ced
Mar 13, 2003
- First cut of new job credential logic. · 46cbfa89
  Moe Jette authored 22 years ago
  
  46cbfa89