- Aug 22, 2011
-
-
Danny Auble authored
_job_create() consistent with similar logic in select_nodes().
-
Danny Auble authored
partition in the slurm.conf.
-
- Aug 19, 2011
-
-
Morris Jette authored
One of our testers created an illegal topology.conf file. He has a config you probably wouldn't see in production, but can see in testing when you are sometimes given a collection of miscellaneous resources.

    switch1 --|-- nodes
              |-- switch2 -- nodes

He tried the topology.conf file below. Switch s1 is defined twice. Slurm accepted this config, but wouldn't allocate nodes from both switches to one job.

    SwitchName=s1 Nodes=xna[14-26]
    SwitchName=s2 Nodes=xna[41-43]
    SwitchName=s1 Switches=s2

I believe slurm shouldn't allow the second definition of switch s1. The attached patch checks for duplicate switch names. Patch from Rod Schultz, Bull.
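
The duplicate-name check described above can be sketched as follows. This is only an illustration in Python, not the patch itself (the patch is not shown in this log); the function name `duplicate_switch_names` and the simplified line parsing are assumptions made for the example.

```python
import re
from collections import Counter


def duplicate_switch_names(conf_text):
    """Return switch names that are defined more than once, in first-seen order.

    Hypothetical helper: it only looks at the SwitchName= key at the start of
    each line and ignores everything else in topology.conf.
    """
    names = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        match = re.match(r"SwitchName=(\S+)", line, re.IGNORECASE)
        if match:
            names.append(match.group(1))
    counts = Counter(names)
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


# The configuration from the commit message: s1 is defined twice.
conf = """\
SwitchName=s1 Nodes=xna[14-26]
SwitchName=s2 Nodes=xna[41-43]
SwitchName=s1 Switches=s2
"""
print(duplicate_switch_names(conf))
```

A configuration reader could reject the file whenever this list is non-empty.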
-
- Aug 17, 2011
-
-
Danny Auble authored
This reverts commit 350ef5dc.
-
- Aug 16, 2011
-
-
Danny Auble authored
-
- Aug 12, 2011
-
-
Danny Auble authored
next parallel step is run on a sub block, SLURM won't oversubscribe cnodes.
-
Danny Auble authored
-
- Aug 11, 2011
-
-
Danny Auble authored
-
Morris Jette authored
BLUEGENE - Modify "scontrol show step" to show I/O nodes (BGL and BGP) or c-nodes (BGQ) allocated to each step. Change field name from "Nodes=" to "BP_List=".
-
- Aug 10, 2011
-
-
Danny Auble authored
cannot fit into the available shape.
-
Morris Jette authored
Previous code would fail when trying to launch more than 4096 tasks, which is a problem on BGQ systems where SLURM actually launches job steps.
-
Danny Auble authored
or not.
-
- Aug 09, 2011
-
-
Morris Jette authored
This change applies only to Cray systems and only when using the srun wrapper for aprun. Map --exclusive to -F exclusive and --share to -F share. Note this does not consider the partition's Shared configuration, so it is an imperfect mapping of options.
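
The option mapping described above can be sketched as follows. This is a minimal Python illustration, not the wrapper's code; the function name `map_sharing_options` is hypothetical, and passing every other argument through unchanged is a simplification assumed for the example.

```python
# Mapping stated in the commit message: srun option -> aprun arguments.
SHARING_OPTIONS = {
    "--exclusive": ["-F", "exclusive"],
    "--share": ["-F", "share"],
}


def map_sharing_options(srun_args):
    """Translate the two sharing options; leave all other arguments alone.

    Hypothetical helper. It does not look at the partition's Shared
    configuration, which is the imperfection the commit message notes.
    """
    aprun_args = []
    for arg in srun_args:
        aprun_args.extend(SHARING_OPTIONS.get(arg, [arg]))
    return aprun_args


print(map_sharing_options(["--exclusive", "a.out"]))
```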
-
Morris Jette authored
A node DOWN to ALPS will be marked DOWN to SLURM only after reaching SlurmdTimeout. In the interim, the node state will be NO_RESPOND. This change makes SLURM handling of the node DOWN state more consistent with ALPS. This change affects only Cray systems.
-
Morris Jette authored
Fix the node state accounting to be consistent with the node state set by ALPS.
-
- Aug 05, 2011
-
-
Danny Auble authored
be the same.
-
Danny Auble authored
previously marked down by alps.
-
- Aug 04, 2011
-
-
Morris Jette authored
Require SchedulerTimeSlice configuration parameter to be at least 5 seconds to avoid thrashing slurmd daemon. Addresses Cray bug 774692.
-
Morris Jette authored
Change in GRES behavior for job steps: A job step's default generic resource allocation will be set to that of the job. If a job step's --gres value is set to "none" then none of the generic resources which have been allocated to the job will be allocated to the job step. Add srun environment value of SLURM_STEP_GRES to set default --gres value for a job step.
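
The rule described above can be sketched as follows. This is a Python illustration of the decision logic only, not SLURM's implementation; the function name `step_gres` is hypothetical, and letting an explicit --gres value take precedence over SLURM_STEP_GRES is an assumption of the example.

```python
def step_gres(job_gres, step_option=None, env=None):
    """Return the generic resources a job step would receive.

    job_gres    -- the GRES string allocated to the job
    step_option -- the step's --gres value, or None if not given
    env         -- environment mapping that may hold SLURM_STEP_GRES
    """
    env = env or {}
    # Assumption: the command-line option wins over the environment value.
    requested = step_option if step_option is not None else env.get("SLURM_STEP_GRES")
    if requested is None:
        return job_gres  # default: same allocation as the job
    if requested == "none":
        return ""  # none of the job's generic resources go to the step
    return requested


print(step_gres("gpu:2"))
print(step_gres("gpu:2", step_option="none"))
print(step_gres("gpu:2", env={"SLURM_STEP_GRES": "gpu:1"}))
```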
-
- Aug 03, 2011
-
-
Morris Jette authored
On Bluegene systems, smap's command-line mode would generate an invalid memory reference due to an uninitialized variable.
-
Danny Auble authored
a POLLERR the dbd_fail callback is called.
-
- Aug 02, 2011
-
-
Danny Auble authored
the DBD where both remained up but were disconnected the slurmctld would get registered again with the DBD.
-
Danny Auble authored
-
- Aug 01, 2011
-
-
Morris Jette authored
With sched/wiki or sched/wiki2 (Maui or Moab scheduler), ensure that a requeued job's priority is reset to zero.
-
Morris Jette authored
-
- Jul 29, 2011
-
-
Danny Auble authored
-
- Jul 28, 2011
-
-
Morris Jette authored
Add the ability for a user to limit the number of leaf switches in a job's allocation using the --switch option of salloc, sbatch and srun. There is also a new SchedulerParameters value of max_switch_wait, which a SLURM administrator can use to set a maximum job delay and prevent a user job from blocking lower priority jobs for too long. Based on work by Rod Schultz, Bull.
-
- Jul 22, 2011
-
-
Morris Jette authored
BlueGene: Permit users to specify a separate connection type for each dimension (e.g. "--conn-type=torus,mesh,torus").
-
Morris Jette authored
On Cray systems with the srun2aprun wrapper, build an srun man page that describes which options are available with the wrapper.
-
- Jul 21, 2011
-
-
Morris Jette authored
Restore node configuration information (CPUs, memory, etc.) for powered down nodes when the slurmctld daemon restarts rather than waiting for the node to be restored to service and getting the information from the node (NOTE: Only relevant if FastSchedule=0).
-
- Jul 20, 2011
-
-
Morris Jette authored
Fix bug in select/cons_res task distribution logic when tasks-per-node=0. Eliminates misleading slurmctld message "error: cons_res: _compute_c_b_task_dist oversubscribe." This problem was introduced in SLURM version 2.2.5 in order to fix a task distribution problem when cpus_per_task=0. Patch from Rod Schultz, Bull.
-
- Jul 14, 2011
-
-
Morris Jette authored
Set SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables for both interactive (salloc) and batch jobs if the job has a memory limit. For Cray systems also set CRAY_AUTO_APRUN_OPTIONS environment variable with the memory limit.
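
A job script could read these variables along the following lines. This is a minimal Python sketch; the function name `job_memory_limit` is hypothetical, and the example assumes the values are plain integers (the commit message does not state the format or units).

```python
import os


def job_memory_limit(env=None):
    """Return (limit, scope) from the job environment, or None if no limit is set.

    Hypothetical helper: scope is "per CPU" or "per node" depending on which
    of the two variables named in the commit message is present.
    """
    env = os.environ if env is None else env
    for name, scope in (
        ("SLURM_MEM_PER_CPU", "per CPU"),
        ("SLURM_MEM_PER_NODE", "per node"),
    ):
        value = env.get(name)
        if value is not None:
            return int(value), scope  # assumption: integer value
    return None


print(job_memory_limit({"SLURM_MEM_PER_NODE": "2048"}))
```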
-
- Jul 13, 2011
-
-
Morris Jette authored
For front-end configurations (Cray and IBM BlueGene), bind each batch job to a unique CPU to limit the damage which a single job can cause. Previously any single job could use all CPUs causing problems for other jobs or system daemons. This addresses a problem reported by Steve Trofinoff, CSCS.
-
- Jul 12, 2011
-
-
Danny Auble authored
man pages. Patch by Nancy Kritkausky, Bull.
-
Danny Auble authored
Bill Brophy, Bull.
-
Morris Jette authored
Note the job and partition state file formats have changed and RPCs with information for jobs and partitions have changed.
-
- Jul 06, 2011
-
-
Morris Jette authored
Fix bug in generic resource tracking of gres associated with specific CPUs. Resources were being over-allocated.
-
Morris Jette authored
Fix memory buffering bug if an AllowGroups parameter of a partition has 100 or more users. Patch by Andriy Grytsenko (Massive Solutions Limited).
-
- Jul 05, 2011
-
-
Morris Jette authored
Add cgroup support for device files in both the task/cgroup plugin and generic resource (GRES) logic. Based upon patch by Yiannis Georgiou.
-
Morris Jette authored
When suspending a job, wait 2 seconds instead of 1 second between sending SIGTSTP and SIGSTOP. Some MPI implementations were not stopping within the 1 second delay.
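
The two-signal suspend sequence described above can be sketched as follows. This is a Python illustration, not SLURM's code; the function name `suspend_process` and its injectable `kill` and `sleep` parameters are assumptions made so the sequence can be exercised without signalling a real process.

```python
import os
import signal
import time


def suspend_process(pid, delay=2.0, kill=os.kill, sleep=time.sleep):
    """Ask a process to stop, give it time to react, then force the stop.

    SIGTSTP can be caught, so the process gets a chance to pause cleanly;
    SIGSTOP cannot be caught and stops it unconditionally.
    """
    kill(pid, signal.SIGTSTP)
    sleep(delay)  # 2 seconds, per the commit message (previously 1)
    kill(pid, signal.SIGSTOP)


# Usage, with a real process ID in place of the hypothetical child_pid:
#     suspend_process(child_pid)
```

A longer delay trades a slower suspend for a better chance that the process has stopped on its own before SIGSTOP arrives.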
-