  1. Jan 07, 2014
  2. Oct 15, 2013
  3. Jul 22, 2013
  4. Jul 20, 2013
  5. Jun 19, 2013
  6. Jun 05, 2013
  7. May 24, 2013
  8. Apr 27, 2013
  9. Apr 24, 2013
  10. Jan 29, 2013
  11. Nov 26, 2012
  12. Jul 19, 2012
  13. Jul 16, 2012
  14. Jul 13, 2012
  15. May 29, 2012
  16. Mar 21, 2012
    • slurmstepd: refactor spank prolog/epilog code · e409986a
      Mark A. Grondona authored
      Add new handle_spank_mode() function in slurmstepd to handle
      when slurmstepd is called with "spank prolog" or "spank epilog".
      In this function, the slurmd_conf_lite is read to handle reinitializing
      the log facility as defined by slurmd config.
    • slurmd/slurmstepd: factor out read/write of slurmd_conf_lite · 00e71ef3
      Mark A. Grondona authored
      Factor out the read and write of the packed slurmd_conf_lite
      data between slurmd and slurmstepd. This simplifies the code
      in which that data is handled, and will allow for other callers
      in the future.
    • slurmstepd: Add new mode to run spank job prolog/epilog · 1e01c729
      Mark A. Grondona authored
      The spank_job_prolog() and spank_job_epilog() spank calls need
      to be run in a different address space from slurmd. This not only allows
      reinitializing the spank plugin stack on each run of the prolog or
      epilog, but also ensures that any static data in plugins does not
      propagate to each invocation of the job prolog and epilog (e.g. global
      variables). Additionally, it is much safer to run these plugins
      in a new process because we may be calling prolog/epilog for multiple
      jobs at the same time.
      
      This patch runs spank_job_prolog() or spank_job_epilog() from slurmstepd
      when slurmstepd is invoked as
      
       slurmstepd spank [prolog|epilog]
      
      The environment variables SLURM_JOBID and SLURM_UID are used to set
      the jobid and uid for the prolog/epilog. Spank plugin options may
      also be passed through the current environment.
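
The invocation described above can be sketched as follows. This is a minimal stand-alone illustration, not the slurmstepd source: the names `spank_mode_t`, `parse_spank_mode()` and `env_to_u32()` are hypothetical, and only the `spank [prolog|epilog]` argument form and the `SLURM_JOBID` / `SLURM_UID` variables are taken from the commit message.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mode enum for illustration; not a slurmstepd type. */
typedef enum {
	MODE_NONE,     /* no "spank" argument: normal start-up */
	MODE_PROLOG,   /* "slurmstepd spank prolog" */
	MODE_EPILOG,   /* "slurmstepd spank epilog" */
	MODE_INVALID   /* "spank" given with a bad or missing sub-mode */
} spank_mode_t;

/* Classify the command line. */
static spank_mode_t parse_spank_mode(int argc, char **argv)
{
	if (argc < 2 || strcmp(argv[1], "spank") != 0)
		return MODE_NONE;
	if (argc != 3)
		return MODE_INVALID;
	if (strcmp(argv[2], "prolog") == 0)
		return MODE_PROLOG;
	if (strcmp(argv[2], "epilog") == 0)
		return MODE_EPILOG;
	return MODE_INVALID;
}

/*
 * Read a decimal unsigned 32-bit value from environment variable `name`.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
static int env_to_u32(const char *name, uint32_t *out)
{
	const char *val = getenv(name);
	char *end = NULL;
	unsigned long n;

	/* Reject unset, empty, signed or whitespace-prefixed values. */
	if (val == NULL || !isdigit((unsigned char) *val))
		return -1;
	errno = 0;
	n = strtoul(val, &end, 10);
	if (errno != 0 || *end != '\0' || n > UINT32_MAX)
		return -1;
	*out = (uint32_t) n;
	return 0;
}
```

A caller would first check `parse_spank_mode(argc, argv)` and, for the prolog or epilog mode, obtain the job id and uid with `env_to_u32("SLURM_JOBID", &jobid)` and `env_to_u32("SLURM_UID", &uid)` before running the plugin stack in that separate process.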
    • slurmstepd: Move handling of cmdline to a function · a136a5ab
      Mark A. Grondona authored
      Move special handling of slurmstepd cmdline to a function for
      future expansion.
  17. Mar 20, 2012
    • Improve task binding logic · f2fab483
      Morris Jette authored
      Improve task binding logic by making fuller use of HWLOC library,
      especially with respect to Opteron 6000 series processors. Work contributed
      by Komoto Masahiro.
  18. Feb 02, 2012
    • Transfer GPU file information to slurmstepd · bccf0f85
      Morris Jette authored
      Add logic to cache GPU file information (bitmap index mapping to device
      file number) in the slurmd daemon and transfer that information to the
      slurmstepd whenever a job step is initiated. This is needed to set the
      appropriate CUDA_VISIBLE_DEVICES environment variable value when the
      devices are not in strict numeric order (e.g. some GPUs are skipped).
      Based upon work by Nicolas Bigaouette.
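
To picture why the index-to-device mapping matters, here is a minimal sketch that turns allocated bitmap indices into a comma-separated list of device file numbers. It is an illustration under stated assumptions, not Slurm code: the function name `build_visible_devices()` and its parameters are hypothetical, and the commit message does not specify the exact string Slurm writes to `CUDA_VISIBLE_DEVICES`.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * dev_num[i]  : device file number for bitmap index i (assumed table,
 *               e.g. {0, 2, 3} when device 1 is skipped)
 * alloc_idx[] : bitmap indices allocated to the job step
 *
 * Writes a comma-separated list of device file numbers into buf.
 * Returns 0 on success, -1 on an out-of-range index or a too-small buffer.
 */
static int build_visible_devices(const int *dev_num, size_t dev_cnt,
				 const size_t *alloc_idx, size_t alloc_cnt,
				 char *buf, size_t buf_len)
{
	size_t used = 0;

	if (buf_len == 0)
		return -1;
	buf[0] = '\0';
	for (size_t i = 0; i < alloc_cnt; i++) {
		int n;

		if (alloc_idx[i] >= dev_cnt)
			return -1;
		/* Prefix every entry except the first with a comma. */
		n = snprintf(buf + used, buf_len - used, "%s%d",
			     i ? "," : "", dev_num[alloc_idx[i]]);
		if (n < 0 || (size_t) n >= buf_len - used)
			return -1;   /* output error or truncated */
		used += (size_t) n;
	}
	return 0;
}
```

With the assumed table `{0, 2, 3}` and allocated indices `{1, 2}`, the buffer holds `2,3` rather than `1,2`, which is the distinction the cached mapping makes possible; the result could then be exported with `setenv("CUDA_VISIBLE_DEVICES", buf, 1)`.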
    • Cosmetic changes, no change in logic · d4bfab24
      Morris Jette authored
  19. Aug 09, 2011
  20. Apr 22, 2011
  21. Apr 10, 2011
    • slurmstepd: avoid coredump in case of NULL job · e0d92b8a
      Moe Jette authored
      We build slurm with --enable-memory-leak-debug and twice encountered the same core
      dump when user 'root' was trying to run jobs during a maintenance session.
      
      The root user is not in the accounting database, which explains the errors seen
      below. The gdb session shows that in this invocation the job pointer was NULL:
      
      palu7:0 log>stat /var/crash/palu7-slurmstepd-6602.core 
      ...
      Modify: 2011-04-04 19:34:44.000000000 +0200
      
      slurmctld.log
      [2011-04-04T19:34:44] _slurm_rpc_submit_batch_job JobId=3254 usec=1773
      [2011-04-04T19:34:44] ALPS RESERVATION #5, JobId 3254: BASIL -n 1920 -N 0 -d 1 -m 1333
      [2011-04-04T19:34:44] sched: Allocate JobId=3254 NodeList=nid000[03-13,18-29,32-88] #CPUs=1920
      [2011-04-04T19:34:44] error: slurmd error 4005 running JobId=3254 on front_end=palu7: User not found on host
      [2011-04-04T19:34:44] update_front_end: set state of palu7 to DRAINING
      [2011-04-04T19:34:44] completing job 3254
      [2011-04-04T19:34:44] Requeue JobId=3254 due to node failure
      [2011-04-04T19:34:44] sched: job_complete for JobId=3254 successful
      [2011-04-04T19:34:44] requeue batch job 3254
      [2011-04-04T20:28:43] sched: Cancel of JobId=3254 by UID=0, usec=57285
      
      (gdb) core-file palu7-slurmstepd-6602.core 
      [New Thread 6604]
      Core was generated by `/opt/slurm/2.3.0/sbin/slurmstepd'.
      Program terminated with signal 11, Segmentation fault.
      #0  main (argc=1, argv=0x7fffd65a1fd8) at slurmstepd.c:413
      413             jobacct_gather_g_destroy(job->jobacct);
      (gdb) print job
      $1 = (slurmd_job_t *) 0x0
      (gdb) list
      408
      409     #ifdef MEMORY_LEAK_DEBUG
      410     static void
      411     _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
      412     {
      413             jobacct_gather_g_destroy(job->jobacct);
      414             if (!job->batch)
      415                     job_destroy(job);
      416             /*
      417              * The message cannot be freed until the jobstep is complete
      (gdb) print msg
      $2 = (slurm_msg_t *) 0x916008
      (gdb) print rc
      $3 = -1
      (gdb) 
      
      The patch tests for a NULL job argument for the calls that need to dereference the job pointer.
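
The guard described above can be sketched in isolation as follows. This is a minimal stand-alone illustration, not the actual patch: `job_t`, `release_jobacct()` and `release_job()` are hypothetical stand-ins for `slurmd_job_t`, `jobacct_gather_g_destroy()` and `job_destroy()`, and the counter exists only to make the behaviour observable.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for slurmd_job_t. */
typedef struct {
	void *jobacct;
	bool  batch;
} job_t;

static int released = 0;   /* counts release calls, for illustration only */

static void release_jobacct(void *jobacct)
{
	free(jobacct);
	released++;
}

static void release_job(job_t *job)
{
	free(job);
	released++;
}

/* Clean up after a step; job may be NULL when set-up failed early. */
static void step_cleanup(job_t *job)
{
	if (job != NULL) {
		/* Only dereference the job pointer once it is known valid. */
		release_jobacct(job->jobacct);
		job->jobacct = NULL;
		if (!job->batch)
			release_job(job);
	}
	/* Clean-up that does not need the job pointer would continue here. */
}
```

Calling `step_cleanup(NULL)` is then a no-op for the job-specific calls instead of a segmentation fault, which matches the failure shown in the gdb session.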
  22. Mar 31, 2011
  23. Aug 27, 2010
  24. Aug 26, 2010
  25. Aug 04, 2010
  26. Jul 14, 2010
  27. Jul 01, 2010
  28. Apr 16, 2010
  29. Dec 23, 2009
  30. Sep 10, 2009
    • svn merge -r18529:18676 https://eris.llnl.gov/svn/slurm/branches/slurm-2.1.topo.addr · bd8435c1
      Moe Jette authored
       -- Move processing of node configuration information in slurm.conf and
          topology information in topology.conf from slurmctld into common and load
          that information into slurmd. Use it to set the job environment variables
          SLURM_TOPOLOGY_ADDR and SLURM_TOPOLOGY_ADDR_PATTERN, which describe the
          network topology for each task. Based upon patch from Matthieu Hautreux (CEA).
  31. Mar 27, 2009
  32. Mar 26, 2009