- Jan 07, 2014
-
-
David Bigagli authored
-
- Oct 15, 2013
-
-
Danny Auble authored
-
- Jul 22, 2013
-
-
Danny Auble authored
-
- Jul 20, 2013
-
-
Nathan Yee authored
-
- Jun 19, 2013
-
-
Danny Auble authored
-
Danny Auble authored
-
- Jun 05, 2013
-
-
Danny Auble authored
the user specifies a polling frequency larger than the default (meaning the enforcement would happen at a possibly much slower pace), then deny the job.
-
- May 24, 2013
-
-
Danny Auble authored
-
Danny Auble authored
acctg-freq is a string instead of a uint16_t
-
- Apr 27, 2013
-
-
Danny Auble authored
of the ThreadID DebugFlag this happens to make it so no jobs can run. Currently we do not see any negative effects of not checking the return code, but this probably needs to be thought out more. Possibly revert 27564d62
-
- Apr 24, 2013
-
-
- Jan 29, 2013
-
-
Morris Jette authored
-
- Nov 26, 2012
-
-
Morris Jette authored
If slurmstepd connects task I/O but aborts after srun accepts the connection and before slurmstepd writes data, then srun could hang indefinitely. This probably does not explain failures seen at CEA, but can't hurt matters.
-
- Jul 19, 2012
-
-
Danny Auble authored
-
- Jul 16, 2012
-
-
Don Albert authored
-
- Jul 13, 2012
-
-
Danny Auble authored
is always set when sending or receiving a message.
-
- May 29, 2012
-
-
Danny Auble authored
maintain.
-
- Mar 21, 2012
-
-
Mark A. Grondona authored
Add new handle_spank_mode() function in slurmstepd to handle when slurmstepd is called with "spank prolog" or "spank epilog". In this function, the slurmd_conf_lite is read to handle reinitializing the log facility as defined by slurmd config.
-
Mark A. Grondona authored
Factor out the read and write of the packed slurmd_conf_lite data between slurmd and slurmstepd. This simplifies the code in which that data is handled, and will allow for other callers in the future.
-
Mark A. Grondona authored
The spank_job_prolog() and spank_job_epilog() spank calls need to be run in a different address space from slurmd. This not only allows reinitializing the spank plugin stack on each run of the prolog or epilog, but also ensures that any static data in plugins (e.g. global variables) does not propagate to each invocation of the job prolog and epilog. Additionally, it is much safer to run these plugins in a new process because we may be calling prolog/epilog for multiple jobs at the same time. This patch runs spank_job_prolog() or spank_job_epilog() from slurmstepd when slurmstepd is invoked as "slurmstepd spank [prolog|epilog]". The environment variables SLURM_JOBID and SLURM_UID are used to set the jobid and uid for the prolog/epilog. Spank plugin options may also be passed through the current environment.
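As an illustration of the invocation described above, the following is a minimal sketch (not the actual slurmstepd code) of detecting the "spank prolog" / "spank epilog" command line and parsing numeric values such as those carried in SLURM_JOBID and SLURM_UID. The names spank_mode, parse_spank_mode() and parse_id_string() are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mode values; the real slurmstepd code may differ. */
enum spank_mode { SPANK_MODE_NONE, SPANK_MODE_PROLOG, SPANK_MODE_EPILOG };

/* Return the spank mode selected by "slurmstepd spank [prolog|epilog]". */
enum spank_mode parse_spank_mode(int argc, char **argv)
{
    if (argc < 3 || strcmp(argv[1], "spank") != 0)
        return SPANK_MODE_NONE;
    if (strcmp(argv[2], "prolog") == 0)
        return SPANK_MODE_PROLOG;
    if (strcmp(argv[2], "epilog") == 0)
        return SPANK_MODE_EPILOG;
    return SPANK_MODE_NONE;
}

/*
 * Parse a non-negative decimal id from a string such as the value of
 * SLURM_JOBID or SLURM_UID. Returns 0 on success, -1 if the string is
 * missing, empty or malformed.
 */
int parse_id_string(const char *val, unsigned long *out)
{
    char *end = NULL;
    unsigned long n;

    assert(out != NULL);
    /* Require a leading digit so that signs and whitespace are rejected. */
    if (val == NULL || val[0] < '0' || val[0] > '9')
        return -1;
    errno = 0;
    n = strtoul(val, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;
    *out = n;
    return 0;
}
```

A caller could pass getenv("SLURM_JOBID") and getenv("SLURM_UID") to parse_id_string() and refuse to run the prolog or epilog if either call fails.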
-
Mark A. Grondona authored
Move special handling of slurmstepd cmdline to a function for future expansion.
-
- Mar 20, 2012
-
-
Morris Jette authored
Improve task binding logic by making fuller use of HWLOC library, especially with respect to Opteron 6000 series processors. Work contributed by Komoto Masahiro.
-
- Feb 02, 2012
-
-
Morris Jette authored
Add logic to cache GPU file information (bitmap index mapping to device file number) in the slurmd daemon and transfer that information to the slurmstepd whenever a job step is initiated. This is needed to set the appropriate CUDA_VISIBLE_DEVICES environment variable value when the devices are not in strict numeric order (e.g. some GPUs are skipped). Based upon work by Nicolas Bigaouette.
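The mapping described above can be sketched as follows. This is a hedged illustration only: it assumes the CUDA_VISIBLE_DEVICES value is a comma-separated list built from the cached device file numbers, and build_visible_devices() with its parameters is a hypothetical helper, not a function from the Slurm source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build a comma-separated device list for the allocated GPUs.
 * dev_num[i] is the device file number for bitmap index i (assumed to be
 * the cached information), alloc_idx[] holds the allocated bitmap indexes.
 * Returns 0 on success, -1 on a bad index or if buf is too small.
 */
int build_visible_devices(const int *dev_num, size_t ndev,
                          const size_t *alloc_idx, size_t nalloc,
                          char *buf, size_t buflen)
{
    size_t used = 0;

    assert(dev_num != NULL && alloc_idx != NULL && buf != NULL);
    if (buflen == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < nalloc; i++) {
        int n;

        if (alloc_idx[i] >= ndev)
            return -1;      /* index outside the cached table */
        n = snprintf(buf + used, buflen - used, "%s%d",
                     i ? "," : "", dev_num[alloc_idx[i]]);
        if (n < 0 || (size_t)n >= buflen - used)
            return -1;      /* output error or truncation */
        used += (size_t)n;
    }
    return 0;
}
```

With a cached table of {0, 1, 3, 4} (device 2 skipped) and allocated bitmap indexes 2 and 3, the helper produces "3,4" rather than "2,3", which is the case the commit message calls out.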
-
Morris Jette authored
-
- Aug 09, 2011
-
-
Danny Auble authored
-
- Apr 22, 2011
-
-
Danny Auble authored
-
- Apr 10, 2011
-
-
Moe Jette authored
We build slurm with --enable-memory-leak-debug and encountered twice the same core dump when user 'root' was trying to run jobs during a maintenance session. The root user is not in the accounting database, which explains the errors seen below. The gdb session shows that in this invocation

    palu7:0 log>stat /var/crash/palu7-slurmstepd-6602.core
    ...
    Modify: 2011-04-04 19:34:44.000000000 +0200

slurmctld.log

    [2011-04-04T19:34:44] _slurm_rpc_submit_batch_job JobId=3254 usec=1773
    [2011-04-04T19:34:44] ALPS RESERVATION #5, JobId 3254: BASIL -n 1920 -N 0 -d 1 -m 1333
    [2011-04-04T19:34:44] sched: Allocate JobId=3254 NodeList=nid000[03-13,18-29,32-88] #CPUs=1920
    [2011-04-04T19:34:44] error: slurmd error 4005 running JobId=3254 on front_end=palu7: User not found on host
    [2011-04-04T19:34:44] update_front_end: set state of palu7 to DRAINING
    [2011-04-04T19:34:44] completing job 3254
    [2011-04-04T19:34:44] Requeue JobId=3254 due to node failure
    [2011-04-04T19:34:44] sched: job_complete for JobId=3254 successful
    [2011-04-04T19:34:44] requeue batch job 3254
    [2011-04-04T20:28:43] sched: Cancel of JobId=3254 by UID=0, usec=57285

    (gdb) core-file palu7-slurmstepd-6602.core
    [New Thread 6604]
    Core was generated by `/opt/slurm/2.3.0/sbin/slurmstepd'.
    Program terminated with signal 11, Segmentation fault.
    #0  main (argc=1, argv=0x7fffd65a1fd8) at slurmstepd.c:413
    413             jobacct_gather_g_destroy(job->jobacct);
    (gdb) print job
    $1 = (slurmd_job_t *) 0x0
    (gdb) list
    408
    409     #ifdef MEMORY_LEAK_DEBUG
    410     static void
    411     _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
    412     {
    413             jobacct_gather_g_destroy(job->jobacct);
    414             if (!job->batch)
    415                     job_destroy(job);
    416             /*
    417              * The message cannot be freed until the jobstep is complete
    (gdb) print msg
    $2 = (slurm_msg_t *) 0x916008
    (gdb) print rc
    $3 = -1
    (gdb)

The patch tests for a NULL job argument for the calls that need to dereference the job pointer.
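The kind of guard this message describes can be sketched as below. This is not the actual patch: job_t, job_new(), job_free(), acct_destroy() and step_cleanup() are simplified, hypothetical stand-ins for the Slurm types and functions, used only to show a cleanup routine that tolerates a NULL job pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the job structure. */
typedef struct job {
    int *jobacct;   /* accounting data owned by the job */
    bool batch;     /* batch jobs are destroyed elsewhere */
} job_t;

static int live_jobs = 0;   /* bookkeeping for this example only */

job_t *job_new(bool batch)
{
    job_t *job = calloc(1, sizeof(*job));

    if (job == NULL)
        return NULL;
    job->jobacct = malloc(sizeof(*job->jobacct));
    job->batch = batch;
    live_jobs++;
    return job;
}

void acct_destroy(int *acct)
{
    free(acct);     /* free(NULL) is a no-op */
}

void job_free(job_t *job)
{
    assert(live_jobs > 0);
    live_jobs--;
    free(job);
}

/*
 * Clean up after a step. The job pointer may be NULL when step setup
 * failed early, so it is only dereferenced inside the NULL check.
 * Returns the number of objects released.
 */
int step_cleanup(job_t *job, int rc)
{
    int released = 0;

    (void)rc;       /* unused in this sketch */
    if (job != NULL) {
        acct_destroy(job->jobacct);
        job->jobacct = NULL;
        released++;
        if (!job->batch) {
            job_free(job);
            released++;
        }
    }
    return released;
}
```

Calling step_cleanup(NULL, -1) corresponds to the failing case in the gdb session (job is 0x0 and rc is -1) and returns without touching the pointer.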
-
- Mar 31, 2011
-
-
Moe Jette authored
Also add the function to sview. log_init() is now one of the first statements for all commands and daemons.
-
- Aug 27, 2010
- Aug 26, 2010
-
-
Moe Jette authored
-
- Aug 04, 2010
-
-
Danny Auble authored
-
Danny Auble authored
-
- Jul 14, 2010
-
-
Danny Auble authored
Better handling of select plugins that don't exist on various systems for cross cluster communication. Slurmctld, slurmd, and slurmstepd now only load the default select plugin as well.
-
- Jul 01, 2010
-
-
Danny Auble authored
-
- Apr 16, 2010
-
-
Danny Auble authored
Forward the step failure reason back to slurmd. Before, in some cases only SLURM_FAILURE would be returned.
-
- Dec 23, 2009
-
-
Danny Auble authored
Added logic to make sure that, when a memory limit is enforced through the jobacct_gather plugin, a user can no longer turn off the logic that enforces the limit.
-
- Sep 10, 2009
-
-
https://eris.llnl.gov/svn/slurm/branches/slurm-2.1.topo.addr
Moe Jette authored
-- Move processing of node configuration information in slurm.conf and topology information in topology.conf from slurmctld into common and load that information into slurmd. Use it to set the environment variables SLURM_TOPOLOGY_ADDR and SLURM_TOPOLOGY_ADDR_PATTERN for jobs, describing the network topology for each task. Based upon a patch from Matthieu Hautreux (CEA).
-
- Mar 27, 2009
-
-
- Mar 26, 2009
-
-
Moe Jette authored
the values of SLURMD_OOM_ADJ and SLURMSTEPD_OOM_ADJ environment variables. This can be used to prevent daemons being killed when a node's memory is exhausted.
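A minimal sketch of how such an environment variable value might be validated and applied follows. It assumes the value is an integer in the legacy Linux oom_adj range of -17 to 15 and that it is written to a proc file; parse_oom_adj() and write_oom_adj() are hypothetical helpers, not functions from the Slurm source.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse an OOM adjustment from a string such as the value of
 * SLURMD_OOM_ADJ or SLURMSTEPD_OOM_ADJ. The accepted range of -17..15
 * is an assumption based on the legacy Linux oom_adj interface.
 * Returns 0 on success, -1 if the value is missing or invalid.
 */
int parse_oom_adj(const char *val, int *out)
{
    char *end = NULL;
    long n;

    assert(out != NULL);
    if (val == NULL || *val == '\0')
        return -1;
    errno = 0;
    n = strtol(val, &end, 10);
    if (errno != 0 || end == val || *end != '\0')
        return -1;
    if (n < -17 || n > 15)
        return -1;
    *out = (int)n;
    return 0;
}

/* Write the adjustment to the given file; returns 0 on success. */
int write_oom_adj(const char *path, int adj)
{
    FILE *fp = fopen(path, "w");
    int ok;

    if (fp == NULL)
        return -1;
    ok = fprintf(fp, "%d\n", adj) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A daemon could call parse_oom_adj(getenv("SLURMSTEPD_OOM_ADJ"), &adj) at startup and, on success, pass the result to write_oom_adj() with a path such as /proc/self/oom_adj (the path is an assumption for this example).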
-