- Jul 26, 2017
-
-
Danny Auble authored
Get rid of any race conditions and call anything that was in _pre_task_privileged from the parent instead of the child. NOTE: This should be safe as we don't execute the task until after _exec_wait_child_wait_for_parent is signaled which happens after all this is long over.
-
Danny Auble authored
Bug 3865
-
Dominik Bartkiewicz authored
Fix regression in commit e5c05549 that would put the stepd pid into the memory cgroup instead of the task's pid. Beforehand this would put the result of getpid() into the cgroup. Before e5c05549 this was done in the child of the fork, which would get you the task's pid, but moving it to run in the parent broke this logic. This patch adds pid to the input parameters of task_g_pre_launch_priv so that the correct pid can be used.
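The pid mix-up described above can be illustrated with a minimal sketch. It is not the Slurm code: sketch_pre_launch_priv() is a hypothetical stand-in for task_g_pre_launch_priv that only records the pid it is given, and the child exits immediately instead of waiting for the parent and exec'ing a task. The point is that, in the parent, the task's pid is the value returned by fork(), not getpid().

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for "add this pid to the cgroup". */
static pid_t sketch_recorded_pid = -1;

static void sketch_pre_launch_priv(pid_t task_pid)
{
	sketch_recorded_pid = task_pid;
}

/*
 * Fork a child and run the privileged pre-launch step from the parent.
 * Returns 0 if the recorded pid is the child's pid and not the parent's.
 */
int sketch_launch_and_record(void)
{
	int status;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0)
		_exit(0);	/* a real task would wait, then exec */

	/* Running in the parent: getpid() here would be the wrong pid. */
	sketch_pre_launch_priv(child);

	if (waitpid(child, &status, 0) != child)
		return -1;
	if (sketch_recorded_pid == child && sketch_recorded_pid != getpid())
		return 0;
	return 1;
}
```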
-
- Jul 19, 2017
-
-
Danny Auble authored
step wasn't always gathered correctly. Bug 3531
-
- Jul 13, 2017
-
-
Tim Wickberg authored
-
- Feb 09, 2017
-
-
Morris Jette authored
-
- Jan 19, 2017
-
-
Danny Auble authored
-
Danny Auble authored
jobid and stepid to be used if needing to kill job because it hadn't started yet.
-
- Jan 18, 2017
-
-
Tim Wickberg authored
-
- Jan 11, 2017
-
-
Morris Jette authored
The old logic would result in test16.4 failing some of the time. The failure was caused by the sattach command attaching to a job step before the original srun command received a RESPONSE_LAUNCH_TASKS message. That message would then be sent to the salloc command. Since srun never got the message, it would hang. This change does not mark the job step as RUNNING until after the original srun gets sent the RESPONSE_LAUNCH_TASKS message, and sattach requests are blocked until that time.
-
Morris Jette authored
Identify the function where an error is generated.
-
- Dec 06, 2016
-
-
Tim Wickberg authored
Note that this does not protect against all possible problems here. The setgroups() call in Linux at least is willing to set any gid_t value except -1 on a group, so calls will not always fail on corrupted group lists. Bug 3320.
-
- Sep 28, 2016
-
-
Morris Jette authored
Add "flag" field to launch_tasks_request_msg. Remove the following fields (moved into flags): multi_prog, task_flags, user_managed_io, pty, buffered_stdio, and labelio. More flags to be added later.
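Folding several boolean fields into one flags word can be sketched as follows. All names here (the SKETCH_LAUNCH_* bits, their values, and struct sketch_launch_req) are hypothetical and chosen for illustration only; the actual bit assignments in launch_tasks_request_msg are not given in this log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, one per former boolean field. */
#define SKETCH_LAUNCH_MULTI_PROG      0x0001
#define SKETCH_LAUNCH_USER_MANAGED_IO 0x0002
#define SKETCH_LAUNCH_PTY             0x0004
#define SKETCH_LAUNCH_BUFFERED_STDIO  0x0008
#define SKETCH_LAUNCH_LABEL_IO        0x0010

/* Hypothetical, reduced stand-in for the launch request message. */
struct sketch_launch_req {
	uint32_t flags;
};

/* Set or clear one flag bit. */
void sketch_set_flag(struct sketch_launch_req *req, uint32_t flag, bool on)
{
	if (on)
		req->flags |= flag;
	else
		req->flags &= ~flag;
}

/* Test one flag bit. */
bool sketch_has_flag(const struct sketch_launch_req *req, uint32_t flag)
{
	return (req->flags & flag) != 0;
}
```

New options then need only a new bit rather than a new message field.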
-
- Sep 09, 2016
-
-
Morris Jette authored
Previous cap was 2 sec (default TCP timeout) times the node count and divided by 1000. A 9000 node job would have the messages spread out over 18 seconds. This change caps the spread at 5 seconds and assumes the normal TCP logic can handle the rest. Bug 3044
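The arithmetic in this message can be checked with a small sketch. The function and constant names are hypothetical, and it is an assumption that the new logic keeps the same per-node scaling below the 5 second cap; the message only states the old formula and the new upper bound.

```c
/* Values taken from the commit message. */
#define SKETCH_TCP_TIMEOUT_SEC 2.0	/* default TCP timeout */
#define SKETCH_MAX_SPREAD_SEC  5.0	/* new upper bound on the spread */

/* Old behaviour: timeout times node count, divided by 1000. */
double sketch_old_spread_sec(int node_count)
{
	return SKETCH_TCP_TIMEOUT_SEC * node_count / 1000.0;
}

/* New behaviour (assumed): same scaling, capped at 5 seconds. */
double sketch_new_spread_sec(int node_count)
{
	double spread = sketch_old_spread_sec(node_count);

	return (spread > SKETCH_MAX_SPREAD_SEC) ? SKETCH_MAX_SPREAD_SEC : spread;
}
```

For the 9000 node job mentioned above, the old formula gives 18 seconds and the capped version gives 5.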
-
- Sep 08, 2016
-
-
Morris Jette authored
-
- Aug 16, 2016
-
-
Morris Jette authored
slurmstepd modified to pre-load all relevant plugins at startup to avoid the possibility of modified plugins later resulting in inconsistent API or data structures and a failure of slurmstepd. bug 2334
-
- Jul 27, 2016
-
-
Morris Jette authored
This patch builds upon commit 0c27392b, adding calls to the new function so the slurmctld daemon will log reasons for jobs failing to start. bug 1828
-
- Jul 22, 2016
-
-
Danny Auble authored
or failed based on the signal that would always be killing it.
-
- Jul 15, 2016
-
-
Danny Auble authored
What this does is set the state earlier to match a normal set. Remove the unneeded _send_pending_exit_msgs. There is only one task and we have the message for it, so don't worry about that one. Most important, wait for the other slurmstepds to send their message, otherwise they could be lost on the other end.
-
- Jul 08, 2016
-
-
Danny Auble authored
This will keep from referencing the task array that might not be set up correctly in src/common/plugstack.c _spank_handle_init().
-
- Jul 07, 2016
-
-
Danny Auble authored
sleep is killed.
-
- Jul 02, 2016
-
-
Danny Auble authored
we have the pids added to the system correctly. Most likely related to bug 2874
-
- Jul 01, 2016
-
-
Danny Auble authored
back to the slurmctld.
-
- May 25, 2016
-
-
Tim Wickberg authored
unsetenv is POSIX.1-2001, and should always be available.
-
Tim Wickberg authored
-
- May 24, 2016
-
-
Tim Wickberg authored
Coverity 44992.
-
Morris Jette authored
Add slurm_cond_* macros to wrap pthread_cond_* functions and log error on failures.
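A wrapper of this kind might look roughly like the sketch below. The sketch_cond_* names, the error counter, and the log format are all hypothetical; the real slurm_cond_* macros and their logging calls are not shown in this log. The only facts relied on are that pthread_cond_* functions return 0 on success and an error number on failure.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical counter so callers can see how many calls failed. */
static int sketch_cond_errors = 0;

/* Hypothetical helper: log a failed pthread_cond_* call. */
static void sketch_cond_check(int err, const char *call,
			      const char *file, int line)
{
	if (err) {
		sketch_cond_errors++;
		fprintf(stderr, "%s:%d: %s failed: %s\n",
			file, line, call, strerror(err));
	}
}

#define sketch_cond_init(cond) \
	sketch_cond_check(pthread_cond_init(cond, NULL), \
			  "pthread_cond_init", __FILE__, __LINE__)
#define sketch_cond_signal(cond) \
	sketch_cond_check(pthread_cond_signal(cond), \
			  "pthread_cond_signal", __FILE__, __LINE__)
#define sketch_cond_broadcast(cond) \
	sketch_cond_check(pthread_cond_broadcast(cond), \
			  "pthread_cond_broadcast", __FILE__, __LINE__)
#define sketch_cond_destroy(cond) \
	sketch_cond_check(pthread_cond_destroy(cond), \
			  "pthread_cond_destroy", __FILE__, __LINE__)

/* Exercise each wrapper once; returns the number of logged failures. */
int sketch_cond_demo(void)
{
	pthread_cond_t cond;

	sketch_cond_init(&cond);
	sketch_cond_signal(&cond);
	sketch_cond_broadcast(&cond);
	sketch_cond_destroy(&cond);
	return sketch_cond_errors;
}
```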
-
- May 11, 2016
-
-
Danny Auble authored
make it to the slurmctld when using message aggregation.
-
- May 10, 2016
-
-
Tim Wickberg authored
POSIX, Linux, FreeBSD all agree setpgrp(0, 0) is equivalent to setpgid(0, 0), and setpgrp() is deprecated in POSIX.
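As a minimal sketch of the replacement call, the function below (a hypothetical name, not Slurm code) forks a child that makes itself a process group leader with setpgid(0, 0) and verifies the result with getpgrp(). The call is made in a forked child because setpgid() fails for a session leader, and a freshly forked child is never one.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 0 if a forked child successfully became the leader of a
 * new process group, nonzero otherwise.
 */
int sketch_child_becomes_group_leader(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* pid 0 = calling process, pgid 0 = use its own pid */
		if (setpgid(0, 0) != 0)
			_exit(1);
		_exit(getpgrp() == getpid() ? 0 : 2);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```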
-
Tim Wickberg authored
Continue sorting system includes and un-#ifdef'ing a few additional headers. (HAVE_STRING, HAVE_UNISTD, HAVE_LIMITS, HAVE_PTHREAD). Unconditionally define _GNU_SOURCE when used. Should audit use of this further and possibly define this in config.h directly along with POSIX macros. Remove last vestige of src/common/malloc.[ch].
-
Morris Jette authored
This might possibly be related to bug 2334, but it's a long shot.
-
- May 06, 2016
-
-
John Thiltges authored
With slurm-15.08.10, we're seeing occasional segfaults in slurmstepd. The logs point to the following line:
    slurm-15.08.10/src/slurmd/slurmstepd/mgr.c:2612
On that line, _get_primary_group() is accessing the results of getpwnam_r():
    *gid = pwd0->pw_gid;
If getpwnam_r() cannot find a matching password record, it will set the result (pwd0) to NULL, but still return 0. When the pointer is accessed, it will cause a segfault. Checking the result variable (pwd0) to determine success should fix the issue.
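The check described above can be sketched as follows. This is not the actual _get_primary_group() code: the function name, the fixed buffer size, and the return convention are assumptions made for illustration. The essential part is testing the result pointer as well as the return code before dereferencing it.

```c
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: look up the primary group of user_name.
 * Returns 0 and fills *gid on success; returns -1 if the lookup
 * failed or no matching password record exists.
 */
int sketch_lookup_primary_gid(const char *user_name, gid_t *gid)
{
	struct passwd pwd;
	struct passwd *result = NULL;
	char buf[16384];	/* assumed large enough for this sketch */
	int rc;

	rc = getpwnam_r(user_name, &pwd, buf, sizeof(buf), &result);

	/* rc == 0 with result == NULL means "no such user". */
	if (rc != 0 || result == NULL)
		return -1;

	*gid = result->pw_gid;
	return 0;
}
```

With this shape, a missing user leaves *gid untouched and is reported to the caller instead of crashing the process.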
-
- May 05, 2016
-
-
Morris Jette authored
-
Morris Jette authored
RHEL6 requires resetting the process's "dumpable" flag after all seteuid calls complete in order to generate a core file. bug 2334
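On Linux the "dumpable" flag is controlled through prctl(), so the reset could look like the sketch below. This is an assumption about the mechanism, not a copy of the Slurm change, and the function name is hypothetical. It is Linux-specific and would be called once the last seteuid() call has completed.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: mark the process as dumpable again and report
 * the resulting flag value (1 when core dumps are allowed), or -1 if
 * the flag could not be set.
 */
int sketch_restore_dumpable(void)
{
	if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```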
-
- Apr 02, 2016
-
-
Danny Auble authored
-
- Mar 29, 2016
-
-
Danny Auble authored
-
- Mar 28, 2016
-
-
Danny Auble authored
make the wait to return data only hit after 500 nodes and configurable based on the TcpTimeout value.
-
- Mar 03, 2016
-
-
Morris Jette authored
This may be helpful for timing purposes. Added by Cray request.
-
- Mar 01, 2016
-
-
Morris Jette authored
This fixes a bug introduced in commit 52fe3de1 in the event the fork() call fails in slurmstepd.
-
Morris Jette authored
Ensure that a job is completely launched before trying to suspend it. Previous logic would start suspend logic early in the life of the slurmstepd process, after its listening socket was open but before the tasks were launched. This defers the suspend logic until after all prologs and setup complete and the tasks are launched. This is important in the case of gang scheduling, in which newly launched jobs can be immediately suspended. bug 2494
-