- Mar 05, 2004
-
-
Moe Jette authored
Fixes problem running a very large number of simultaneous jobs when their inactivity limit was reached due to a backlog of messages.
-
- Mar 04, 2004
-
-
Moe Jette authored
data structure. This eliminates risks associated with re-reading slurm.conf.
-
- Feb 12, 2004
- Jan 28, 2004
-
-
Moe Jette authored
slurmctld to exit.
-
- Jan 13, 2004
-
-
Moe Jette authored
-
- Nov 22, 2003
-
-
Moe Jette authored
with SIGCONT.
-
- Nov 21, 2003
-
-
Moe Jette authored
requests can get backed up in a queue while the job actually completes).
-
- Nov 20, 2003
- Oct 29, 2003
-
-
Moe Jette authored
and/or job step(s) will have their resources de-allocated and be killed. A resource allocation will not be released unless no job steps are active for at least InactiveLimit seconds. DPCS jobs will be subject to this forced de-allocation if they remain inactive for an extended period of time, which can get SLURM and DPCS back in sync if DPCS does a cold-start.
-
- Oct 11, 2003
-
-
Moe Jette authored
Send multiple SIGALRMs if needed and deal with possible abort of a thread.
-
- Oct 10, 2003
-
-
Moe Jette authored
-
- Sep 25, 2003
-
-
Moe Jette authored
-
- Sep 23, 2003
-
-
Moe Jette authored
scalability. An arbitrary number of requests may be queued; they are processed one per second until the queue is empty or the pending requests were last attempted recently (the minimum retry interval is configured as 60 seconds).
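The retry behavior described in this entry can be illustrated with a minimal sketch. It is not the slurmctld implementation (which is written in C); the class `RetryQueue`, its methods, and the constant `MIN_RETRY_INTERVAL` are hypothetical names chosen for illustration, and the sketch assumes the caller invokes `next_ready()` once per second.

```python
from collections import deque

# Assumption: 60 seconds, taken from the minimum retry interval cited in the entry.
MIN_RETRY_INTERVAL = 60


class RetryQueue:
    """Hypothetical queue of pending requests, each with its last attempt time."""

    def __init__(self, min_retry=MIN_RETRY_INTERVAL):
        self.min_retry = min_retry
        # Each entry is (request, last_attempt), where last_attempt is a
        # timestamp in seconds, or None if the request was never attempted.
        self._queue = deque()

    def add(self, request, last_attempt=None):
        """Queue a request; any number of requests may be queued."""
        self._queue.append((request, last_attempt))

    def next_ready(self, now):
        """Return the oldest request if it may be attempted now, else None.

        Returns None when the queue is empty or when the oldest request
        was last attempted less than min_retry seconds ago.
        """
        if not self._queue:
            return None
        request, last_attempt = self._queue[0]
        if last_attempt is not None and now - last_attempt < self.min_retry:
            return None
        self._queue.popleft()
        return request

    def __len__(self):
        return len(self._queue)
```

A caller would poll `next_ready(now)` once per second and re-queue a request with `add(request, last_attempt=now)` if the attempt fails, so that it is not retried before the minimum interval has elapsed.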
-
- Sep 17, 2003
-
-
Moe Jette authored
-
- Sep 12, 2003
-
-
Moe Jette authored
when the job does not exist).
-
Moe Jette authored
to job_kill request and slurmctld leaves node and job in COMPLETING state until the slurmd issues an EPILOG_COMPLETE RPC on each node. This permits better support for non-killable processes and/or long-running epilog scripts. Several minor changes in node registration handling and slurmctld agent logic to better address a flood of incoming RPCs (typically when system restarts). (gnats:268)
-
- Aug 04, 2003
-
-
Moe Jette authored
batch_job_launch RPC, then deallocate those resources and requeue the job. If a node registers and fails to show a batch job that should have a script running there (node zero of allocation), then consider the job complete.
-
- Aug 02, 2003
-
-
Moe Jette authored
"Can't connect to node" with every ping failure.
-
Moe Jette authored
-
Moe Jette authored
Changed the logging level of a few other messages.
-
Moe Jette authored
until the previous one completes. This avoids having too many cycles active (and a bunch of threads too). Ping_nodes control functions moved to a new module.
-
- Jul 23, 2003
-
-
Moe Jette authored
-
- Jul 15, 2003
-
-
Moe Jette authored
Perform some general code clean-up in those areas.
-
- Jul 14, 2003
-
-
Moe Jette authored
while fail/if too many then fatal/sleep and retry
-
- Jul 08, 2003
-
-
Moe Jette authored
-
- Jun 27, 2003
-
-
Moe Jette authored
-
- Jun 23, 2003
-
-
Moe Jette authored
out message was sent (e.g., slurmd down, msg sent to slurmd, slurmd up and registers, msg previously sent to slurmd times out).
-
- Jun 18, 2003
-
-
Moe Jette authored
rather than letting agent go off the end of an array.
-
- Jun 13, 2003
-
-
Mark Grondona authored
failed (presumably due to unkillable processes)
o retry failed JOB_KILL rpcs
-
- Jun 12, 2003
- Jun 11, 2003
-
-
Mark Grondona authored
instead of just kill_wait seconds because slurmd sleeps for kill_wait seconds, so slurmctld would never receive a reply.
-
- May 28, 2003
-
-
Mark Grondona authored
slurm_send_recv_node_msg(), slurm_send_recv_rc_msg(), etc.
o Fixed fd leak in agent.c using slurm_send_recv_rc_msg() w/ timeout.
-
- Mar 26, 2003
-
-
Moe Jette authored
reference if a reconfigure RPC was active at the same time.
-
- Mar 14, 2003
- Mar 13, 2003
-
-
Moe Jette authored
-