Commits · 7e5d3d15e71ea55881ca6cec5dce73d2ad1f2f04 · tud-zih-energy / Slurm

Jul 26, 2017

Danny Auble authored 7 years ago

Get rid of any race conditions and call anything that was in
_pre_task_privileged from the parent instead of the child.

NOTE: This should be safe as we don't execute the task until after
_exec_wait_child_wait_for_parent is signaled which happens after all this is
long over.

7e5d3d15

If failing after switch_g_job_init happened make sure switch_g_job_fini is called. · 488c7c36
Danny Auble authored 7 years ago
```
Bug 3865
```
488c7c36

Fix regression in commit that would put the stepd pid into the... · f28b1a97

Dominik Bartkiewicz authored 7 years ago

Fix regression in commit e5c05549 that would put the stepd pid into the memory cgroup instead of the task's pid.

Previously this would put the result of getpid() into the cgroup. Before
e5c05549 this was done in the child of the fork, which would get you
the task's pid, but moving it to run in the parent broke this logic.

This patch adds pid to the input parameters of
task_g_pre_launch_priv so that the correct pid can be used.

f28b1a97
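
The pid mix-up described in f28b1a97 can be illustrated with a minimal sketch. It is not the Slurm code: record_task_pid() and launch_and_record() are hypothetical stand-ins for a hook that registers a task's pid (for example with a cgroup). The point is only that, in the parent, getpid() names the parent, so the value returned by fork() has to be passed in explicitly.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a hook that registers a task's pid. */
static pid_t recorded_pid = -1;

static void record_task_pid(pid_t pid)
{
    recorded_pid = pid;
}

/*
 * Fork a short-lived child and record its pid from the parent.
 * Returns the child's pid, or -1 if fork() fails.
 */
static pid_t launch_and_record(void)
{
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);    /* child: stands in for the task */

    /*
     * Parent: getpid() here would return the parent's own pid, so the
     * pid returned by fork() is passed to the hook instead.
     */
    record_task_pid(child);
    waitpid(child, NULL, 0);
    return child;
}
```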

Jul 19, 2017
- Fix race condition when using jobacct_gather/cgroup where the memory of the · e5c05549
  Danny Auble authored 7 years ago
  
  step wasn't always gathered correctly. Bug 3531
  e5c05549
Jul 13, 2017
- Fix typo in comment. · ea95dc90
  Tim Wickberg authored 7 years ago
  
  ea95dc90
Feb 09, 2017
- Add new job state JOB_OOM for job/step out of memory · 818a09e8
  Morris Jette authored 8 years ago
  
  818a09e8
Jan 19, 2017
- Simplify code by using the same correct variable instead of handling it in different ways. · 8950e9e8
  Danny Auble authored 8 years ago
  
  8950e9e8
- Change the argument to step_terminate_monitor_start to stepd_step_rec_t instead of · 023e72e7
  Danny Auble authored 8 years ago
  
  jobid and stepid to be used if needing to kill job because it hadn't started yet.
  023e72e7
Jan 18, 2017
- Change http to https for all links to SchedMD sites. · d7770b9b
  Tim Wickberg authored 8 years ago
  
  d7770b9b
Jan 11, 2017

Fix srun/sattach race condition · 38089f2b

Morris Jette authored 8 years ago

The old logic would result in test16.4 failing some of the time.
The failure was caused by the sattach command attaching to a
job step before the original srun command received a
RESPONSE_LAUNCH_TASKS message. That message would then be sent
to the salloc command. Since srun never got the message, it
would hang. This change does not mark the job step as RUNNING
until after the original srun gets sent the RESPONSE_LAUNCH_TASKS
message and sattach requests are blocked until that time.

38089f2b

Improve an error message · a82a70dc
Morris Jette authored 8 years ago
```
Identify the function where an error is generated.
```
a82a70dc

Dec 06, 2016

Add missing early return to _drop_privileges() if _initgroups() call fails. · b5954e60

Tim Wickberg authored 8 years ago

Note that this does not protect against all possible problems here.
The setgroups() call in Linux at least is willing to set any gid_t
value except -1 on a group, so calls will not always fail on corrupted
group lists.

Bug 3320.

b5954e60

Sep 28, 2016

Add task launch flag field · 3251406c

Morris Jette authored 8 years ago

Add "flag" field to launch_tasks_request_msg. Remove the following fields
    (moved into flags): multi_prog, task_flags, user_managed_io, pty,
    buffered_stdio, and labelio. More flags to be added later.

3251406c
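
As a rough illustration of the pattern in 3251406c (several per-option fields folded into one flags word), here is a minimal sketch. The EX_* macro names, the bit values, the field width, and the ex_launch_msg_t type are all hypothetical and are not taken from the Slurm headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; names and values are illustrative only. */
#define EX_LAUNCH_MULTI_PROG      0x0001
#define EX_LAUNCH_USER_MANAGED_IO 0x0002
#define EX_LAUNCH_PTY             0x0004
#define EX_LAUNCH_BUFFERED_STDIO  0x0008
#define EX_LAUNCH_LABEL_IO        0x0010

/* Hypothetical message type carrying one flags word instead of many fields. */
typedef struct {
    uint32_t flags;
} ex_launch_msg_t;

/* Set one option bit in the message. */
static void ex_set_flag(ex_launch_msg_t *msg, uint32_t flag)
{
    msg->flags |= flag;
}

/* Return true if the option bit is set in the message. */
static bool ex_test_flag(const ex_launch_msg_t *msg, uint32_t flag)
{
    return (msg->flags & flag) != 0;
}
```

New options can then be added as further bits without changing the layout of the message.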

Sep 09, 2016

Cap job termination message staggering at 5 seconds · 3e2251cb

Morris Jette authored 8 years ago

Previous cap was 2 sec (default TCP timeout) times the node count,
divided by 1000. A 9000-node job would have the messages
spread out over 18 seconds. This change caps the spread at
5 seconds and assumes the normal TCP logic can handle the rest.
bug 3044

3e2251cb
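
The arithmetic in 3e2251cb can be worked through with a small sketch. It assumes the spread is still computed as the timeout times the node count divided by 1000 and is then clamped to 5 seconds; the function and macro names are hypothetical and the real code may compute the value differently.

```c
/* Values taken from the commit message above. */
#define EX_TCP_TIMEOUT_SEC 2    /* default TCP timeout, in seconds */
#define EX_MAX_SPREAD_SEC  5    /* new upper bound on the spread */

/* Spread as described for the previous logic: 2 sec * node_cnt / 1000. */
static int ex_old_spread_sec(int node_cnt)
{
    return EX_TCP_TIMEOUT_SEC * node_cnt / 1000;
}

/* Same value, clamped to the 5 second cap (assumed form of the new logic). */
static int ex_new_spread_sec(int node_cnt)
{
    int spread = ex_old_spread_sec(node_cnt);

    return (spread > EX_MAX_SPREAD_SEC) ? EX_MAX_SPREAD_SEC : spread;
}
```

For the 9000-node job in the message, the uncapped value is 2 * 9000 / 1000 = 18 seconds, which the cap reduces to 5.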

Sep 08, 2016
- Correct comment · e3d6dc68
  Morris Jette authored 8 years ago
  
  e3d6dc68
Aug 16, 2016

slurmstepd to load all plugins at startup · 962f0cce

Morris Jette authored 8 years ago

slurmstepd modified to pre-load all relevant plugins at startup to avoid
    the possibility of modified plugins later resulting in inconsistent API
    or data structures and a failure of slurmstepd.
bug 2334

962f0cce

Jul 27, 2016

Add calls from slurmstepd to log reason for failure · e35da197

Morris Jette authored 8 years ago

This patch builds upon commit 0c27392b,
adding calls to the new function so the slurmctld daemon will log
reasons for jobs failing to start.
bug 1828

e35da197

Jul 22, 2016
- Always report a 0 exit code for the extern step instead of being canceled · d1fbb57b
  Danny Auble authored 8 years ago
  
  or failed based on the signal that would always be killing it.
  d1fbb57b
Jul 15, 2016

Various cleanup needed for extern step. Continuation of commit · c79063b0

Danny Auble authored 8 years ago

What this does is set the state earlier to match a normal step.

Remove the unneeded _send_pending_exit_msgs. There is only one task and
we have the message for it, so don't worry about that one.

Most important, wait for the other slurmstepds to send their messages;
otherwise they could be lost on the other end.

c79063b0

Jul 08, 2016

Send in a -1 for a taskid into spank_task_post_fork for the extern_step. · 1c38fa64

Danny Auble authored 8 years ago

This will keep from referencing the task array that might not be set up
correctly in src/common/plugstack.c _spank_handle_init().

1c38fa64

Jul 07, 2016
- If extern step doesn't get added into the proctrack plugin make sure the · 6f5fc4a0
  Danny Auble authored 8 years ago
  
  sleep is killed.
  6f5fc4a0
Jul 02, 2016
- Handle a few error codes when dealing with the extern step to make sure · 5d3e5e1e
  Danny Auble authored 8 years ago
  
  we have the pids added to the system correctly. Most likely related to bug 2874
  5d3e5e1e
Jul 01, 2016
- Make it so the extern step sends a message with accounting information · ee8edf8f
  Danny Auble authored 8 years ago
  
  back to the slurmctld.
  ee8edf8f
May 25, 2016
- Remove local copy of unsetenv function used for AIX compatibility. · 7bb3fea7
  Tim Wickberg authored 8 years ago
  
  unsetenv is part of POSIX.1-2001 and should always be available.
  7bb3fea7
- Cleanup and remove HAVE_AIX ifdef blocks. · 992d2805
  Tim Wickberg authored 8 years ago
  
  992d2805
May 24, 2016
- Fix assignment instead of comparison to prevent infinite loop on failed execve. · a8601e91
  Tim Wickberg authored 8 years ago
  
  Coverity 44992.
  a8601e91
- Use slurm_cond macros · 695caf51
  Morris Jette authored 8 years ago
  
  Add slurm_cond_* macros to wrap pthread_cond_* functions and log error on failures.
  695caf51
May 11, 2016
- Fix issue when TopologyParam=NoInAddrAny is set the responses wouldn't · 03a9e836
  Danny Auble authored 8 years ago
  
  make it to the slurmctld when using message aggregation.
  03a9e836
May 10, 2016

Remove SETPGRP_TWO_ARGS, replace setpgrp(0, 0) with setpgid(0, 0) · f735a0a3

Tim Wickberg authored 8 years ago

POSIX, Linux, FreeBSD all agree setpgrp(0, 0) is equivalent to setpgid(0, 0),
and setpgrp() is deprecated in POSIX.

f735a0a3
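
A minimal sketch of what setpgid(0, 0) does, as referenced in f735a0a3: a forked child makes itself the leader of a new process group. The helper name is hypothetical and this is not the slurmstepd code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that calls setpgid(0, 0) and checks that its process
 * group id now equals its own pid.
 * Returns 1 if it does, 0 if it does not, -1 on error.
 */
static int ex_child_becomes_group_leader(void)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        if (setpgid(0, 0) < 0)
            _exit(2);
        _exit(getpgrp() == getpid() ? 0 : 1);
    }
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return (WEXITSTATUS(status) == 0) ? 1 : 0;
}
```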

Remove #ifdef HAVE_CONFIG_H blocks, always rely on config.h · 1c1adb26

Tim Wickberg authored 8 years ago

Continue sorting system includes and un-#ifdef'ing a few additional headers.
(HAVE_STRING, HAVE_UNISTD, HAVE_LIMITS, HAVE_PTHREAD).

Unconditionally define _GNU_SOURCE when used. Should audit use of this further
and possibly define this in config.h directly along with POSIX macros.

Remove last vestige of src/common/malloc.[ch].

1c1adb26

Move array from stack to heap · ba2fc67a
Morris Jette authored 8 years ago
```
This might possibly be related to bug 2334, but it's a long shot.
```
ba2fc67a

May 06, 2016

Fix for slurmstepd segfault · db0fe22e

John Thiltges authored 8 years ago

With slurm-15.08.10, we're seeing occasional segfaults in slurmstepd. The logs point to the following line: slurm-15.08.10/src/slurmd/slurmstepd/mgr.c:2612

On that line, _get_primary_group() is accessing the results of getpwnam_r():
    *gid = pwd0->pw_gid;

If getpwnam_r() cannot find a matching password record, it will set the result (pwd0) to NULL, but still return 0. When the pointer is accessed, it will cause a segfault.

Checking the result variable (pwd0) to determine success should fix the issue.

db0fe22e
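
The check described in db0fe22e can be sketched as follows. This is an illustration, not the patched _get_primary_group(): the function name, the fixed buffer size, and the return convention are assumptions. It treats both a nonzero return code and a NULL result pointer as failure before touching the record.

```c
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Look up the primary group of user_name.
 * Returns 0 and stores the gid on success, -1 if the lookup fails or
 * no matching password record exists.
 */
static int ex_get_primary_group(const char *user_name, gid_t *gid)
{
    struct passwd pwd;
    struct passwd *result = NULL;
    char buf[16384];    /* assumed large enough for this sketch */
    int rc;

    rc = getpwnam_r(user_name, &pwd, buf, sizeof(buf), &result);

    /*
     * getpwnam_r() may return 0 with result set to NULL when no entry
     * matches, so the result pointer must be checked as well.
     */
    if (rc != 0 || result == NULL)
        return -1;

    *gid = result->pw_gid;
    return 0;
}
```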

May 05, 2016
- Expand comment for better clarity · 91a7587f
  Morris Jette authored 8 years ago
  
  91a7587f
- Make slurmstepd dumpable · e2937345
  Morris Jette authored 8 years ago
  
  RHEL6 requires resetting the process's "dumpable" flag after all seteuid calls complete in order to generate a core file. bug 2334
  e2937345
Apr 02, 2016
- Add spank_task_post_fork to the extern step. · c280838a
  Danny Auble authored 8 years ago
  
  c280838a
Mar 29, 2016
- Add spank_init and spank_fini to the extern step. · 68d7448e
  Danny Auble authored 9 years ago
  
  68d7448e
Mar 28, 2016
- When a stepd is about to shut down and send its response to srun · ea470f71
  Danny Auble authored 9 years ago
  
  make the wait to return data only hit after 500 nodes and configurable based on the TcpTimeout value.
  ea470f71
Mar 03, 2016
- Add slurmstepd logging just before fork/exec · 916b5e3e
  Morris Jette authored 9 years ago
  
  This may be helpful for timing purposes. Added by Cray request.
  916b5e3e
Mar 01, 2016

Defer suspend until launch completes · d2cd18d1

Morris Jette authored 9 years ago

This fixes a bug introduced in commit 52fe3de1
in the event the fork() call fails in slurmstepd.

d2cd18d1

Defer suspend until launch completes · 52fe3de1

Morris Jette authored 9 years ago

Ensure that a job is completely launched before trying to suspend it.
Previous logic would start suspend logic early in the life of the
slurmstepd process, after its listening socket was open but before
the tasks were launched. This defers the suspend logic until after
all prologs and setup completes and the tasks are launched. This is
important in the case of gang scheduling, in which newly launched
jobs can be immediately suspended.
bug 2494

52fe3de1