Commits · d323d859351a8984e7299233c55a142b17f9f10c · tud-zih-energy / Slurm

Jan 07, 2014
- Modify slurmstepd to log using LogTimeFormat in slurm.conf. · d323d859
  David Bigagli authored 11 years ago
  
  d323d859
Oct 15, 2013
- Memory freeing up to avoid minor memory leaks at close of daemons · 46bac772
  Danny Auble authored 11 years ago
  
  46bac772
Jul 22, 2013
- Rename job_* functions to stepd_step_rec functions. · 2242efc8
  Danny Auble authored 11 years ago
  
  2242efc8
Jul 20, 2013
- change name of variable from slurmd_job_t to stepd_step_rec_t · 1240ce09
  Nathan Yee authored 11 years ago
  
  1240ce09
Jun 19, 2013
- add fini_setproctitle to end of step when doing memory check · 8d2f9c04
  Danny Auble authored 11 years ago
  
  8d2f9c04
- Allow way to easily run memcheck on slurmstepd · fdf52997
  Danny Auble authored 11 years ago
  
  fdf52997
Jun 05, 2013

Fix issue where if a job was needing memory limit enforcement and · c2395acb

Danny Auble authored 11 years ago

the user specifies a polling frequency larger than the default
(meaning the enforcement could happen at a much slower pace),
then deny the job.

c2395acb

May 24, 2013
- code to make the uint16_t to char * work correctly · 56781f49
  Danny Auble authored 11 years ago
  
  56781f49
- Infrastructure for setting up the polling thread. Not working until · 46c5a011
  Danny Auble authored 11 years ago
  
  acctg-freq is a string instead of a uint16_t
  46c5a011
Apr 27, 2013

Nowhere else in the code is the return code checked here. With the advent · 4abcaf36

Danny Auble authored 11 years ago

of the ThreadID DebugFlag this happens to make it so no jobs can run.

Currently we do not see any negative effects of not checking the return
code, but this probably needs more thought.

Possibly revert 27564d62

4abcaf36

Apr 24, 2013
- changed http://www.schedmd.com/slurmdocs/ to http://slurm.schedmd.com/ · ee7bca79
  Danny Auble authored 11 years ago
  
  ee7bca79
Jan 29, 2013
- No change in logic. Change formatting to match Linux kernel standard · c10d7f6e
  Morris Jette authored 12 years ago
  
  c10d7f6e
Nov 26, 2012

Add timeout on srun's I/O connect message to better handle some failure modes · 8405b4eb

Morris Jette authored 12 years ago

If slurmstepd connects task I/O but aborts after srun accepts the connection
and before slurmstepd writes data, then srun could hang indefinitely.
This probably does not explain failures seen at CEA, but can't hurt matters.
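
The failure mode described above (a peer that connects and then never writes) can be bounded with a timed wait before the read. The following is a minimal sketch using POSIX `poll()`, not srun's actual code; the function name `read_with_timeout` and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the Slurm source.
 * Wait up to timeout_ms for fd to become readable, then read once.
 * Returns the number of bytes read (0 on EOF), or -1 with errno set
 * (ETIMEDOUT if nothing arrived in time).
 * Note: an EINTR restarts the wait with the full timeout.
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	assert(buf != NULL);
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc < 0)
		return -1;		/* poll() failed, errno already set */
	if (rc == 0) {
		errno = ETIMEDOUT;	/* peer never wrote anything */
		return -1;
	}
	return read(fd, buf, len);	/* data, EOF, or hang-up */
}
```

A caller that gets `-1` with `errno == ETIMEDOUT` can close the connection instead of blocking forever on a peer that has gone away.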

8405b4eb

Jul 19, 2012
- move a verbose message to debug · bba19262
  Danny Auble authored 12 years ago
  
  bba19262
Jul 16, 2012
- Add support for user setting of cpu frequency for a job step · ac3fb2bf
  Don Albert authored 12 years ago
  
  ac3fb2bf
Jul 13, 2012
- Fix initialization of protocol_version for some messages to make sure it · b34e5c28
  Danny Auble authored 12 years ago
  
  is always set when sending or receiving a message.
  b34e5c28
May 29, 2012
- Changed jobacct_gather plugin infrastructure to be cleaner and easier to · 6d015792
  Danny Auble authored 12 years ago
  
  maintain.
  6d015792
Mar 21, 2012

slurmstepd: refactor spank prolog/epilog code · e409986a

Mark A. Grondona authored 13 years ago

Add new handle_spank_mode() function in slurmstepd to handle
when slurmstepd is called with "spank prolog" or "spank epilog".
In this function, the slurmd_conf_lite is read to handle reinitializing
the log facility as defined by slurmd config.

e409986a

slurmd/slurmstepd: factor out read/write of slurmd_conf_lite · 00e71ef3

Mark A. Grondona authored 13 years ago

Factor out the read and write of the packed slurmd_conf_lite
data between slurmd and slurmstepd. This simplifies the code
in which that data is handled, and will allow for other callers
in the future.

00e71ef3

slurmstepd: Add new mode to run spank job prolog/epilog · 1e01c729

Mark A. Grondona authored 13 years ago

The spank_job_prolog() and spank_job_epilog() spank calls need
to be run in a different address space from slurmd. This not only allows
reinitializing the spank plugin stack on each run of the prolog or
epilog, but also ensures that any static data in plugins does not
propagate to each invocation of the job prolog and epilog (e.g. global
variables). Additionally, it is much safer to run these plugins
in a new process because we may be calling prolog/epilog for multiple
jobs at the same time.

This patch runs spank_job_prolog() or spank_job_epilog() from slurmstepd
when slurmstepd is invoked as

 slurmstepd spank [prolog|epilog]

The environment variables SLURM_JOBID and SLURM_UID are used to set
the jobid and uid for the prolog/epilog. Spank plugin options may
also be passed through the current environment.
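
The invocation and environment handling described above could look roughly like the sketch below. It is an illustration only: `spank_mode_t`, `parse_spank_mode`, `parse_u32` and `env_to_u32` are hypothetical names and do not come from the Slurm source.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; not taken from the Slurm source. */
typedef enum {
	MODE_NORMAL,		/* ordinary slurmstepd start-up */
	MODE_SPANK_PROLOG,	/* "slurmstepd spank prolog" */
	MODE_SPANK_EPILOG,	/* "slurmstepd spank epilog" */
	MODE_INVALID		/* "spank" followed by anything else */
} spank_mode_t;

/* Classify the command line. */
spank_mode_t parse_spank_mode(int argc, char **argv)
{
	assert(argv != NULL);
	if (argc < 2 || strcmp(argv[1], "spank") != 0)
		return MODE_NORMAL;
	if (argc != 3)
		return MODE_INVALID;
	if (strcmp(argv[2], "prolog") == 0)
		return MODE_SPANK_PROLOG;
	if (strcmp(argv[2], "epilog") == 0)
		return MODE_SPANK_EPILOG;
	return MODE_INVALID;
}

/* Parse a non-negative decimal string into a uint32_t. 0 on success. */
int parse_u32(const char *s, uint32_t *out)
{
	char *end = NULL;
	unsigned long n;

	assert(out != NULL);
	if (s == NULL || *s < '0' || *s > '9')
		return -1;	/* rejects NULL, "", signs and whitespace */
	errno = 0;
	n = strtoul(s, &end, 10);
	if (errno != 0 || *end != '\0' || n > UINT32_MAX)
		return -1;
	*out = (uint32_t)n;
	return 0;
}

/* Read e.g. SLURM_JOBID or SLURM_UID from the environment. */
int env_to_u32(const char *name, uint32_t *out)
{
	assert(name != NULL);
	return parse_u32(getenv(name), out);
}
```

In this sketch a missing or malformed `SLURM_JOBID` or `SLURM_UID` is reported as an error rather than silently defaulting to 0, which would otherwise be read as job 0 or the root UID.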

1e01c729

slurmstepd: Move handling of cmdline to a function · a136a5ab

Mark A. Grondona authored 13 years ago

Move special handling of slurmstepd cmdline to a function for
future expansion.

a136a5ab

Mar 20, 2012

Improve task binding logic · f2fab483

Morris Jette authored 13 years ago

Improve task binding logic by making fuller use of HWLOC library,
especially with respect to Opteron 6000 series processors. Work contributed
by Komoto Masahiro.

f2fab483

Feb 02, 2012

Transfer GPU file information to slurmstepd · bccf0f85

Morris Jette authored 13 years ago

Add logic to cache GPU file information (bitmap index mapping to device
file number) in the slurmd daemon and transfer that information to the
slurmstepd whenever a job step is initiated. This is needed to set the
appropriate CUDA_VISIBLE_DEVICES environment variable value when the
devices are not in strict numeric order (e.g. some GPUs are skipped).
Based upon work by Nicolas Bigaouette.
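
To illustrate why the index-to-device mapping matters, here is a minimal sketch that builds a comma-separated device list from an allocation bitmap and a cached table of device file numbers. It is not the gres/gpu plugin code; `build_visible_devices` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not taken from the Slurm source.
 * allocated[i] says whether bitmap index i belongs to the step;
 * dev_num[i] is the device file number cached for that index.
 * Writes a comma-separated list into buf. Returns 0 on success,
 * -1 if buf is too small.
 */
int build_visible_devices(const bool *allocated, const int *dev_num,
			  size_t count, char *buf, size_t buf_len)
{
	size_t used = 0;

	assert(allocated != NULL && dev_num != NULL && buf != NULL);
	if (buf_len == 0)
		return -1;
	buf[0] = '\0';

	for (size_t i = 0; i < count; i++) {
		int n;

		if (!allocated[i])
			continue;
		n = snprintf(buf + used, buf_len - used, "%s%d",
			     used ? "," : "", dev_num[i]);
		if (n < 0 || (size_t)n >= buf_len - used)
			return -1;	/* output would be truncated */
		used += (size_t)n;
	}
	return 0;
}
```

As an assumed example, take a node whose device files are numbered 0, 2 and 3 (number 1 is skipped). A step allocated bitmap indexes 1 and 2 gets the list `2,3` from this table. Using the bitmap indexes directly would give `1,2`, which names a device number that does not exist on that node.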

bccf0f85

Cosmetic changes, no change in logic · d4bfab24
Morris Jette authored 13 years ago

d4bfab24

Aug 09, 2011
- change link from LLNL to SchedMD · 52dcbea8
  Danny Auble authored 13 years ago
  
  52dcbea8
Apr 22, 2011
- moved debug to be actually printed out · 72c2620c
  Danny Auble authored 13 years ago
  
  72c2620c
Apr 10, 2011

slurmstepd: avoid coredump in case of NULL job · e0d92b8a

Moe Jette authored 13 years ago

We build Slurm with --enable-memory-leak-debug and twice encountered the same core
dump when user 'root' was trying to run jobs during a maintenance session.

The root user is not in the accounting database, which explains the errors seen
below. The gdb session shows that in this invocation 

palu7:0 log>stat /var/crash/palu7-slurmstepd-6602.core 
...
Modify: 2011-04-04 19:34:44.000000000 +0200

slurmctld.log
[2011-04-04T19:34:44] _slurm_rpc_submit_batch_job JobId=3254 usec=1773
[2011-04-04T19:34:44] ALPS RESERVATION #5, JobId 3254: BASIL -n 1920 -N 0 -d 1 -m 1333
[2011-04-04T19:34:44] sched: Allocate JobId=3254 NodeList=nid000[03-13,18-29,32-88] #CPUs=1920
[2011-04-04T19:34:44] error: slurmd error 4005 running JobId=3254 on front_end=palu7: User not found on host
[2011-04-04T19:34:44] update_front_end: set state of palu7 to DRAINING
[2011-04-04T19:34:44] completing job 3254
[2011-04-04T19:34:44] Requeue JobId=3254 due to node failure
[2011-04-04T19:34:44] sched: job_complete for JobId=3254 successful
[2011-04-04T19:34:44] requeue batch job 3254
[2011-04-04T20:28:43] sched: Cancel of JobId=3254 by UID=0, usec=57285

(gdb) core-file palu7-slurmstepd-6602.core 
[New Thread 6604]
Core was generated by `/opt/slurm/2.3.0/sbin/slurmstepd'.
Program terminated with signal 11, Segmentation fault.
#0  main (argc=1, argv=0x7fffd65a1fd8) at slurmstepd.c:413
413             jobacct_gather_g_destroy(job->jobacct);
(gdb) print job
$1 = (slurmd_job_t *) 0x0
(gdb) list
408
409     #ifdef MEMORY_LEAK_DEBUG
410     static void
411     _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
412     {
413             jobacct_gather_g_destroy(job->jobacct);
414             if (!job->batch)
415                     job_destroy(job);
416             /*
417              * The message cannot be freed until the jobstep is complete
(gdb) print msg
$2 = (slurm_msg_t *) 0x916008
(gdb) print rc
$3 = -1
(gdb) 

The patch tests for a NULL job argument for the calls that need to dereference the job pointer.
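
The shape of such a guard can be sketched as follows. This is not the actual patch: `struct job_rec`, `acct_destroy`, `job_rec_destroy` and the two counters are stand-ins invented for the illustration, loosely mirroring the `_step_cleanup()` listing in the gdb session above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for slurmd_job_t; only the fields the sketch touches. */
struct job_rec {
	void *jobacct;
	bool batch;
};

/* Hypothetical counters so the effect of the cleanup is observable. */
int acct_destroyed = 0;
int jobs_destroyed = 0;

/* Stand-in for jobacct_gather_g_destroy(). */
void acct_destroy(void *jobacct)
{
	free(jobacct);
	acct_destroyed++;
}

/* Stand-in for job_destroy(); requires a valid record. */
void job_rec_destroy(struct job_rec *job)
{
	assert(job != NULL);
	free(job);
	jobs_destroyed++;
}

/*
 * Cleanup that tolerates a NULL job, which can happen when step
 * setup fails before a job record has been built (rc == -1 in the
 * core dump above).
 */
void step_cleanup(struct job_rec *job, int rc)
{
	(void)rc;
	if (job != NULL) {
		acct_destroy(job->jobacct);
		if (!job->batch)
			job_rec_destroy(job);
	}
	/* Cleanup that does not depend on the job continues here. */
}
```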

e0d92b8a

Mar 31, 2011
- Move log_init() higher in the code for slurmd and slurmstepd. · da46b821
  Moe Jette authored 14 years ago
  
  Also add the function to sview. log_init() is now one of the first statements for all commands and daemons.
  da46b821
Aug 27, 2010
- Add support for batch jobs to set gpu gres env var · 5ecf0584
  Moe Jette authored 14 years ago
  
  Update docs and fix minor bug in step gpu gres env var
  5ecf0584
- Get gres/gpu setting CUDA_VISIBLE_DEVICES env var set. · d4dfa52e
  Moe Jette authored 14 years ago
  
  More testing is needed, but basic functionality is now available!!!
  d4dfa52e
Aug 26, 2010
- load debug_flags into slurmstepd and also load the job's and step's gres allocation · 8d5c71fb
  Moe Jette authored 14 years ago
  
  8d5c71fb
Aug 04, 2010
- removed last patch for later insert · a7b97af9
  Danny Auble authored 14 years ago
  
  a7b97af9
- fixed if( to be if ( · 61cb9251
  Danny Auble authored 14 years ago
  
  61cb9251
Jul 14, 2010

Better handling of select plugins that don't exist on various systems for... · 44927f67

Danny Auble authored 14 years ago

Better handling of select plugins that don't exist on various systems for cross cluster communication.  Slurmctld, slurmd, and slurmstepd now only load the default select plugin as well.

44927f67

Jul 01, 2010
- Replaced slurm_addr with slurm_addr_t · 2a7a6cd7
  Danny Auble authored 14 years ago
  
  2a7a6cd7
Apr 16, 2010
- Forward step failure reason back to slurmd before in some cases it would just... · 5ae02542
  Danny Auble authored 14 years ago
  
  Forward the step failure reason back to slurmd. Previously, in some cases, only SLURM_FAILURE was returned.
  5ae02542
Dec 23, 2009

Added logic to make sure if enforcing a memory limit when using the... · 1b64fa5a

Danny Auble authored 15 years ago

Added logic so that, when a memory limit is enforced through the jobacct_gather plugin, a user can no longer turn off the enforcement.

1b64fa5a

Sep 10, 2009

svn merge -r18529:18676 https://eris.llnl.gov/svn/slurm/branches/slurm-2.1.topo.addr · bd8435c1

Moe Jette authored 15 years ago

 -- Move processing of node configuration information in slurm.conf and
    topology information in topology.conf from slurmctld into common and load 
    that information into slurmd. Use it to set environment variables for jobs
    SLURM_TOPOLOGY_ADDR and SLURM_TOPOLOGY_ADDR_PATTERN describing the network 
    topology for each task. Based upon patch from Matthieu Hautreux (CEA).

bd8435c1

Mar 27, 2009
- svn merge -r16751:17054 https://eris.llnl.gov/svn/slurm/branches/slurm-1.4.labelio · 1d32d815
  David J. Bremer authored 16 years ago
  
  1d32d815
Mar 26, 2009

Set "/proc/self/oom_adj" for slurmd and slurmstepd daemons based upon · e12ba6fe

Moe Jette authored 16 years ago

  the values of SLURMD_OOM_ADJ and SLURMSTEPD_OOM_ADJ environment
  variables. This can be used to prevent daemons being killed when
  a node's memory is exhausted.
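
A minimal sketch of this mechanism follows, assuming the daemon parses the variable as a decimal integer and writes it to the file unchanged. The function names are hypothetical, and the range check reflects the values historically accepted by `oom_adj` (-17 to 15), not anything stated in the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helpers; not taken from the Slurm source. */

/* Parse a decimal integer in [-17, 15]. Returns 0 on success. */
int parse_oom_adj(const char *s, int *adj)
{
	char *end = NULL;
	long v;

	assert(adj != NULL);
	if (s == NULL || *s == '\0')
		return -1;
	errno = 0;
	v = strtol(s, &end, 10);
	if (errno != 0 || *end != '\0' || v < -17 || v > 15)
		return -1;
	*adj = (int)v;
	return 0;
}

/* Write the value to path (normally "/proc/self/oom_adj"). */
int write_oom_adj(const char *path, int adj)
{
	FILE *fp = fopen(path, "w");
	int ok;

	if (fp == NULL)
		return -1;
	ok = fprintf(fp, "%d", adj) > 0;
	if (fclose(fp) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Apply the adjustment named by env_name (e.g. "SLURMD_OOM_ADJ").
 * Returns 0 if the variable is unset or was applied, -1 on error.
 */
int apply_oom_adj_from_env(const char *env_name, const char *path)
{
	const char *val = getenv(env_name);
	int adj;

	if (val == NULL)
		return 0;	/* nothing requested */
	if (parse_oom_adj(val, &adj) != 0)
		return -1;
	return write_oom_adj(path, adj);
}
```

The path is a parameter only so the sketch can be exercised against an ordinary file; a daemon would pass `/proc/self/oom_adj`.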

e12ba6fe