- Mar 05, 2004
-
-
Moe Jette authored
-
- Mar 04, 2004
-
-
Moe Jette authored
data structure. This eliminates risks associated with re-reading slurm.conf.
-
- Dec 31, 2003
-
-
Moe Jette authored
modifications were relatively minor - mostly changes in function names or arguments.
-
- Dec 23, 2003
-
-
Moe Jette authored
"Scontrol abort" works. It was leaving a hung pthread due to a recent change. Fix a couple of potential memory leaks. "switch_type" has been added to the config data structure, un/pack, etc., but is not yet reported to the user or documented. The plugins now use function calls to get their type and plugin directory from a common data structure rather than individually reading and parsing the configuration file.
-
- Nov 25, 2003
-
-
Moe Jette authored
Otherwise signal all steps associated with the job (unless individual job steps are identified).
-
- Nov 22, 2003
-
-
Moe Jette authored
Allocate calls were using job table info in place after freeing mutex. New logic copies data structures, frees mutex, sends message, and then frees memory allocated for copy.
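
The copy-then-release pattern described in this commit can be sketched as follows. This is only an illustration in Python, not the slurmctld code: `job_table`, `job_table_lock`, and `build_allocate_reply` are hypothetical names, and the explicit "free the copy" step from the commit message is left to Python's garbage collector.

```python
import copy
import threading

# Hypothetical stand-ins for the controller's job table and its mutex.
job_table_lock = threading.Lock()
job_table = {42: {"job_id": 42, "nodes": ["n1", "n2"], "state": "PENDING"}}


def build_allocate_reply(job_id, send):
    """Copy the job record under the lock, then send from the private copy."""
    with job_table_lock:
        # Snapshot while no other thread can modify the table.
        snapshot = copy.deepcopy(job_table[job_id])
    # The lock is released here; the (possibly slow) send only touches the copy.
    send(snapshot)
    return snapshot


sent = []
reply = build_allocate_reply(42, sent.append)

# A later change to the live table does not affect the message already built.
with job_table_lock:
    job_table[42]["state"] = "RUNNING"
```

The point of the design is that the mutex is held only for the duration of the copy, so a slow or blocked message send cannot stall other threads, and the message never reads table memory that another thread may change.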
-
- Nov 21, 2003
-
-
Moe Jette authored
-
- Nov 20, 2003
-
-
Moe Jette authored
structure.
-
- Nov 14, 2003
-
-
Moe Jette authored
-
- Nov 10, 2003
-
-
Moe Jette authored
-
- Nov 05, 2003
-
-
Moe Jette authored
-
- Oct 24, 2003
-
-
Moe Jette authored
avoid highly fragmented resource allocations. Add list of excluded nodes to job info dumped and reported. Fix how mismatched RPC version numbers are handled. Let error code get back to the API function. Dump job state information upon each job's termination via plugin. Re-issue incomplete write requests in job/partition state save. Make slurmctld continue proper operation without any default partition (gnats:317). Add command/RPC to delete a partition. Retry socket connection for slurmd/io.c as needed (gnats:253).
-
- Oct 11, 2003
-
-
Mark Grondona authored
- changed defs of HAVE_LIBELAN3 to HAVE_ELAN
-
- Oct 08, 2003
-
-
Moe Jette authored
node registration message to itself.
-
- Sep 29, 2003
- Sep 25, 2003
-
-
Moe Jette authored
-
- Sep 23, 2003
-
-
Moe Jette authored
scalability. An arbitrary number of requests may be queued and they are processed one per second until the queue is empty or pending requests were last attempted recently (configuration parameters set to 60 seconds as a minimum retry interval).
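
A minimal sketch of the queueing behaviour described here, assuming a caller that invokes `process_one` once per second. The class and method names (`RetryQueue`, `add`, `process_one`, `pending`) are hypothetical and do not come from the SLURM source; only the 60-second minimum retry interval is taken from the commit message.

```python
MIN_RETRY_INTERVAL = 60  # seconds; minimum retry interval named in the commit


class RetryQueue:
    """Queue of requests; a failed request is retried no sooner than
    MIN_RETRY_INTERVAL seconds after its last attempt."""

    def __init__(self):
        # Each entry is (request, time_of_last_attempt or None).
        self._entries = []

    def add(self, request):
        self._entries.append((request, None))

    def pending(self):
        return len(self._entries)

    def process_one(self, now, attempt):
        """Attempt at most one eligible request; return it, or None if the
        queue is empty or every pending request was attempted recently."""
        for i, (request, last) in enumerate(self._entries):
            if last is None or now - last >= MIN_RETRY_INTERVAL:
                del self._entries[i]
                if not attempt(request):
                    # Failed: put it back at the tail with a new timestamp.
                    self._entries.append((request, now))
                return request
        return None


q = RetryQueue()
q.add("ping node1")
q.add("ping node2")

attempts = []


def flaky(request):
    attempts.append(request)
    return request != "ping node1"  # node1 never answers in this example


# Simulate one call per second for 62 seconds.
tried = [q.process_one(t, flaky) for t in range(62)]
```

In this run the failing request is attempted at second 0 and again at second 60, the other request succeeds at second 1, and every other tick does nothing.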
-
- Sep 21, 2003
-
-
Moe Jette authored
control (it needs to complete all pending RPCs and save state before the primary reads state and takes over).
-
- Sep 20, 2003
-
-
Moe Jette authored
EPILOG_COMPLETE_MESSAGE. At this time the job is COMPLETED and all associated nodes are available.
-
- Sep 17, 2003
-
-
Moe Jette authored
returned to service. The priority is changed from 1 to the value that would be set for the job if it were submitted at that time. (gnats:279)
-
- Sep 12, 2003
-
-
Moe Jette authored
to job_kill request and slurmctld leaves node and job in COMPLETING state until the slurmd issues an EPILOG_COMPLETE RPC on each node. This permits better support for non-killable processes and/or long-running epilog scripts. Several minor changes in node registration handling and slurmctld agent logic to better address a flood of incoming RPCs (typically when the system restarts). (gnats:268)
-
- Sep 04, 2003
-
-
Moe Jette authored
This prevents an orphan job if srun dies after sending the request, the network fails, or the authentication mechanism fails.
-
- Aug 08, 2003
-
-
Moe Jette authored
Add new logic to prevent some node state transitions via update_node RPC (e.g. IDLE to ALLOCATED).
-
- Aug 02, 2003
- Jul 31, 2003
-
-
Moe Jette authored
Set "reason" field when node set down for slurmd error.
-
Moe Jette authored
-
Moe Jette authored
the backup controller and proc_req.c is the code to process incoming RPCs. No changes in controller logic were made for this. job_mgr.c was also modified to better handle bad job records during data recovery on controller restart.
-
- Jul 29, 2003
- Jul 24, 2003
- Jul 23, 2003
- Jul 15, 2003
- Jul 07, 2003
-
-
Moe Jette authored
Message to controller will retry after SlurmctldTimeout period if message to primary controller fails and backup controller returns error indicating it is in backup mode.
-