- Jun 19, 2013
-
-
Danny Auble authored
-
- Jun 18, 2013
-
-
Danny Auble authored
-
- Jun 05, 2013
-
-
Danny Auble authored
-
Danny Auble authored
Since we don't currently track energy usage per task (only per step). Otherwise we get double the energy.
-
- May 24, 2013
-
-
Danny Auble authored
acctg-freq is a string instead of a uint16_t
-
- May 23, 2013
-
-
Yiannis Georgiou authored
-
- May 21, 2013
-
-
Danny Auble authored
-
Danny Auble authored
acct_gather_infiniband_g_node_init
-
Yiannis Georgiou authored
-
- May 15, 2013
-
-
Danny Auble authored
are called.
-
- May 10, 2013
-
-
Danny Auble authored
-
Rod Schultz authored
-
- Apr 24, 2013
-
-
- Jan 29, 2013
-
-
Morris Jette authored
-
- Nov 26, 2012
-
-
Morris Jette authored
If the slurmstepd connects task I/O, but aborts after srun accepts the connect and before slurmstepd writes data, then srun could possibly hang indefinitely. This probably does not explain failures seen at CEA, but can't hurt matters.
-
- Nov 21, 2012
-
-
Matthieu Hautreux authored
At the end of a step, slurmstepd launches a dedicated thread (_kill_thr) to destroy the IO thread if it has not terminated by itself after 300 seconds. This patch corrects two bugs in that logic. First, the sleep(300) is not protected against interruption: signals received by slurmstepd can cut the delay to a few seconds, forcing the IO thread to terminate before the grace time expires. The logic now loops around sleep() so that the full delay is respected. Second, the IO thread is terminated by delivering SIGKILL with pthread_kill(). However, sending SIGKILL with pthread_kill() is a process-wide operation (see man pthread_kill), so all the slurmstepd threads are killed and slurmstepd is terminated. The logic now uses pthread_cancel() instead of pthread_kill(), giving the pthread_join() in _wait_for_io() a chance to act as expected. Without this patch, when _kill_thr is interrupted, slurmstepd is terminated and the step is left in an incomplete state, as the node may not have been able to send the REQUEST_STEP_COMPLETE to the controller. Consecutive steps can then no longer be executed and stay permanently in the "Job step creation temporarily disabled, retrying" state.
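The two fixes described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the names sleep_full, watchdog, watchdog_args and fake_io_thread are hypothetical, and the grace period is a parameter here rather than the fixed 300 seconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <unistd.h>

/* Sleep for the whole period even when signals interrupt sleep().
 * sleep() returns the number of seconds still unslept. */
void sleep_full(unsigned int seconds)
{
    unsigned int left = seconds;

    while (left > 0)
        left = sleep(left);
}

/* Hypothetical argument block for the watchdog thread. */
struct watchdog_args {
    pthread_t target;     /* thread to cancel */
    unsigned int grace;   /* grace period in seconds */
};

/* Wait out the grace period, then cancel only the target thread.
 * pthread_kill(target, SIGKILL) would end the whole process instead. */
void *watchdog(void *arg)
{
    struct watchdog_args *wa = arg;

    sleep_full(wa->grace);
    pthread_cancel(wa->target);
    return NULL;
}

/* Stand-in for an IO thread that never finishes on its own.
 * pause() is a cancellation point, so pthread_cancel() takes effect. */
void *fake_io_thread(void *arg)
{
    (void) arg;
    for (;;)
        pause();
    return NULL;
}
```

A caller would start fake_io_thread and watchdog with pthread_create() and then pthread_join() the IO thread; a cancelled thread reports PTHREAD_CANCELED as its exit value.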
-
- Nov 07, 2012
-
-
Janne Blomqvist authored
The attached patch changes the default timestamp format in logfiles to conform to RFC 5424 (the current version of the syslog RFC). It is identical to the current default "ISO 8601" timestamp used by slurm, except that the timezone offset is appended. This has the following benefits: 1) It's unambiguous. 2) It avoids potential confusion for admins running cluster(s) in different timezones. 3) It might help debug issues related to DST transitions. (More on that later..) (To be pedantic, an RFC 5424 timestamp is still a valid ISO 8601 timestamp, but the converse is not necessarily true. RFC 3339 is a "profile" of ISO 8601, that is, a subset, recommended for internet protocols. The RFC 5424 timestamp, in turn, is a subset of the RFC 3339 timestamps.) The previous behavior can be restored by running configure with the --disable-rfc5424time flag.
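To make the format concrete, here is a small sketch of producing a local timestamp with the timezone offset appended. It is an illustration only and not the logging code from the patch; format_rfc5424 is a hypothetical helper name, and fractional seconds (which RFC 5424 permits) are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Write t as YYYY-MM-DDThh:mm:ss+hh:mm (local time) into buf.
 * Returns the number of characters written, or 0 on failure. */
size_t format_rfc5424(time_t t, char *buf, size_t len)
{
    struct tm tm;
    char zone[8];
    size_t n;

    if (localtime_r(&t, &tm) == NULL)
        return 0;
    n = strftime(buf, len, "%Y-%m-%dT%H:%M:%S", &tm);
    if (n == 0)
        return 0;
    /* %z yields +hhmm; the RFC form needs a colon: +hh:mm */
    if (strftime(zone, sizeof(zone), "%z", &tm) != 5)
        return 0;
    if (n + 7 > len)    /* six offset characters plus the NUL */
        return 0;
    snprintf(buf + n, len - n, "%.3s:%.2s", zone, zone + 3);
    return n + 6;
}
```

The date and time part is what an offset-free ISO 8601 stamp already contains; only the trailing offset is new, which is why the two formats differ by six characters.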
-
- Oct 22, 2012
-
-
Matthieu Hautreux authored
This privileged call is executed as user root just after each task is forked, before the task becomes the user. It enables the task plugin to perform actions as a privileged user in every task context.
-
- Oct 16, 2012
-
-
Morris Jette authored
-
- Oct 15, 2012
-
-
Morris Jette authored
-
yiannis georgiou authored
Provides the following improvements to the energy accounting framework: 1) the per-step average frequency is now calculated correctly and may be reported by sstat and sacct; 2) the logic of the per-step energy consumption calculation through the rapl plugin is corrected, and this value may also be reported through sstat and sacct; 3) node power and energy monitoring now work correctly through the rapl plugin.
-
- Jul 16, 2012
-
-
Don Albert authored
-
- Jul 13, 2012
-
-
Mark A. Grondona authored
If exec_wait_child_wait_for_parent() fails for any reason, it is safer to abort immediately rather than proceed to execute the user's job.
-
Mark A. Grondona authored
On a failure of fork(2), slurmstepd would print an error and exit, possibly leaving previously forked children waiting. Ensure a better cleanup by killing all active children on fork failure before exiting slurmstepd.
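The cleanup described above might look roughly like this. It is a sketch under assumptions: fork_all and kill_children are hypothetical names, and the child simply blocks in pause() as a stand-in for a task waiting to be told to exec.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Kill and reap every child recorded in pids[0..n-1]. */
void kill_children(const pid_t *pids, int n)
{
    for (int i = 0; i < n; i++)
        kill(pids[i], SIGKILL);
    for (int i = 0; i < n; i++)
        waitpid(pids[i], NULL, 0);
}

/* Fork ntasks children. If fork() fails part-way, kill the children
 * already started instead of leaving them waiting forever.
 * Returns 0 on success, -1 on failure. */
int fork_all(pid_t *pids, int ntasks)
{
    for (int i = 0; i < ntasks; i++) {
        pid_t pid = fork();

        if (pid < 0) {
            kill_children(pids, i);
            return -1;
        }
        if (pid == 0) {
            pause();    /* stand-in: wait for the parent */
            _exit(0);
        }
        pids[i] = pid;
    }
    return 0;
}
```

Reaping with waitpid() after the kill avoids leaving zombies behind when the parent does not exit immediately.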
-
Mark A. Grondona authored
Close the read end of the pipe slurmstepd uses to notify children it is time to call exec(2) in order to save one file descriptor per task. (Previously, the read side of the pipe wasn't closed until exec_wait_info was destroyed)
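A minimal sketch of this pipe handshake follows, assuming one pipe per task. The struct and function names are hypothetical and do not reflect the layout of exec_wait_info. The parent closes its copy of the read end right after fork(), and the child gives up instead of continuing if its wait on the parent fails.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-task bookkeeping kept by the parent. */
struct wait_info_sketch {
    pid_t pid;   /* child process */
    int wfd;     /* write end of the pipe, the only end the parent keeps */
};

/* Fork a child that blocks until the parent writes one byte. */
int launch_waiting_child(struct wait_info_sketch *w)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;
    w->pid = fork();
    if (w->pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (w->pid == 0) {
        char c;

        close(fds[1]);
        /* If the wait fails (error or EOF), abort rather than proceed. */
        if (read(fds[0], &c, 1) != 1)
            _exit(127);
        close(fds[0]);
        _exit(0);   /* a real task would call exec here */
    }
    close(fds[0]);  /* parent never reads: free this descriptor now */
    w->wfd = fds[1];
    return 0;
}

/* Tell the child to go ahead, then collect its exit code. */
int release_child(struct wait_info_sketch *w)
{
    char c = 0;
    int status;
    int rc = (write(w->wfd, &c, 1) == 1) ? 0 : -1;

    close(w->wfd);
    if (waitpid(w->pid, &status, 0) < 0 || rc < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With many tasks, the early close means the parent holds one descriptor per task while the children wait, rather than two.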
-
- May 29, 2012
-
-
Danny Auble authored
maintain.
-
- May 11, 2012
-
-
Martin Perry authored
Original patch from Martin Perry (Bull)
-
- May 07, 2012
-
-
Morris Jette authored
-
- May 05, 2012
-
-
Morris Jette authored
This will eventually permit the slurmctld to respond to the slurmd with a new task launch RPC
-
- Apr 27, 2012
-
-
Morris Jette authored
Cray - Add support for zero compute node resource allocation to run a batch script on the front-end node with no ALPS reservation. Useful for pre- or post-processing. NOTE: The partition must be configured with MinNodes=0.
-
- Mar 26, 2012
-
-
Morris Jette authored
Add job name to call. Add logging for call.
-
- Mar 22, 2012
-
-
Morris Jette authored
-
Matthieu Hautreux authored
Access to a secured FS often requires a valid token in the user context. With SLURM, this token can be obtained using one of the available pluggable frameworks, SPANK or PAM. The IO setup of SLURM may need to access a secured FS (stdout/stderr files). This patch ensures that the pluggable frameworks are activated and called prior to IO setup, and that IO is terminated before the pluggable frameworks' exit calls are made.
-
Matthieu Hautreux authored
Set PR_DUMPABLE as soon as possible, especially before any plugins are loaded. This will allow someone debugging to get a coredump.
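For reference, the Linux call involved is prctl(2) with PR_SET_DUMPABLE. The sketch below only shows that call; make_dumpable is a hypothetical wrapper, and where slurmstepd actually invokes it is not shown here. On Linux the dumpable flag is normally cleared when a process changes its effective user or group ID, which is why it has to be set explicitly.

```c
#include <stdio.h>
#include <sys/prctl.h>

/* Mark the calling process as dumpable so the kernel may write a core
 * file for it. Returns the flag value read back (1 on success), or -1. */
int make_dumpable(void)
{
    if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) < 0) {
        perror("prctl(PR_SET_DUMPABLE)");
        return -1;
    }
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

A core file still depends on the core size resource limit and the system core pattern, which this call does not change.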
-
Matthieu Hautreux authored
To prepare io_setup integration in _fork_all_tasks, error handling must be transformed to not always return SLURM_ERROR but be prepared to return SLURM_SUCCESS in case of an io_setup error.
-
- Feb 04, 2012
-
-
Morris Jette authored
Add call to mpi_hook_slurmstepd_prefork() from slurmstepd immediately prior to fork/exec of user tasks. Patch from Hongjia Cao, NUDT.
-
- Jan 19, 2012
-
-
Morris Jette authored
-
- Jan 18, 2012
-
-
Morris Jette authored
-
- Jan 17, 2012
-
-
Morris Jette authored
-
- Jan 13, 2012
-
-
Mark A. Grondona authored
It was found that slurmstepd was intermittently leaving SIGPIPE blocked when launching user tasks. This may have something to do with the fact that the xsignal_unblock() call in _fork_all_tasks() is referencing an extern array (nominally this should have unblocked SIGPIPE), but I didn't spend the time to fully track this issue down. Instead, I figured there is probably no reason we would _not_ want to unblock *all* signals, so this patch does that. Before this change, the following program fails every once in a while:

    #include <stdio.h>
    #include <signal.h>

    int main (int ac, char **av)
    {
        int i, rc = 0;
        struct sigaction act;

        for (i = 1; i < SIGRTMAX; i++) {
            sigaction (i, NULL, &act);
            if (act.sa_handler == SIG_DFL)
                continue;
            fprintf (stderr, "Signal %d appears to be ignored!\n", i);
            rc = 1;
        }
        return (rc);
    }

with:

    srun -N1 -n1 ./test
    Signal 13 appears to be ignored!

After the change, the program succeeds.
-