This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in SLURM 2.4.0.pre3
=============================
-- Let a job be submitted even if it exceeds a QOS limit. Job will be left
in a pending state until the QOS limit or job parameters change. Patch by
Phil Eckert, LLNL.
-- Add sacct support for the option "--name". Work by Yuri D'Elia, Center for
Biomedicine, EURAC Research, Italy.
-- Add an srun shepherd process to cancel a job and/or step if the srun process
is killed abnormally (e.g. SIGKILL).
-- BGQ - handle deadlock issue when a nodeboard goes into an error state.
-- BGQ - more thorough handling of blocks with multiple jobs running on them.
-- Fix man2html process to compile in the build directory instead of the
source dir.
-- Behavior of srun --multi-prog modified so that any program arguments
specified on the command line will be appended to the program arguments
specified in the program configuration file.
-- Add new command, sdiag, which reports a variety of job scheduling
statistics. Based upon work by Alejandro Lucero Palau, BSC.
* Changes in SLURM 2.4.0.pre2
=============================
-- CRAY - Add support for GPU memory allocation using SLURM GRES (Generic
RESource) support. Work by Steve Trofinoff, CSCS.
-- Add support for job allocations with multiple job constraint counts. For
example: salloc -C "[rack1*2&rack2*4]" ... will allocate the job 2 nodes
from rack1 and 4 nodes from rack2. Support for only a single constraint
name has been added to job step support.
-- BGQ - Remove old method for marking cnodes down.
-- BGQ - Remove BGP images from view in sview.
-- BGQ - print out failed cnodes in scontrol show nodes.
-- BGQ - Add srun option of "--runjob-opts" to pass options to the runjob
command.
-- FRONTEND - handle step launch failure better.
-- BGQ - Added a mutex to protect the now changing ba_system pointers.
-- BGQ - added new functionality for sub-block allocations - no preemption
for this yet though.
-- Add --name option to squeue to filter output by job name. Patch from Yuri
D'Elia.
-- BGQ - Added linking to runjob client library which gives support to totalview
to use srun instead of runjob.
-- Add numeric range checks to scontrol update options. Patch from Phil
Eckert, LLNL.
-- Add ReconfigFlags configuration option to control actions of "scontrol
reconfig". Patch from Don Albert, Bull.
-- BGQ - handle reboots with multiple jobs running on a block.
-- BGQ - Add message handler thread to forward signals to runjob process.
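As an illustration of the multiple constraint count syntax described above
("[rack1*2&rack2*4]"), the following is a minimal Python sketch of how such a
specification breaks down into (name, node count) pairs. It is not SLURM's
parser; the function name is hypothetical, and treating a name without "*N"
as a count of 1 is an assumption made only for this example.

```python
def parse_constraint_counts(spec):
    """Split a bracketed constraint list into (name, count) pairs.

    Hypothetical helper for illustration only; not part of SLURM.
    """
    spec = spec.strip()
    if not (spec.startswith("[") and spec.endswith("]")):
        raise ValueError("expected a bracketed constraint list")
    pairs = []
    for term in spec[1:-1].split("&"):
        name, sep, count = term.partition("*")
        if not name:
            raise ValueError("empty constraint name")
        # Assumption: a constraint given without "*N" counts as one node.
        pairs.append((name, int(count) if sep else 1))
    return pairs


print(parse_constraint_counts("[rack1*2&rack2*4]"))
```

With the example from the entry above, the sketch yields 2 nodes for rack1
and 4 nodes for rack2.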
* Changes in SLURM 2.4.0.pre1
=============================
-- BGQ - use the ba_geo_tables to figure out the blocks instead of the old
algorithm. This improves timing in the worst cases and simplifies the code
greatly.
-- BLUEGENE - Change to output tools labels from BP to Midplane
(i.e. BP List -> MidplaneList).
-- BLUEGENE - read MPs and BPs from the bluegene.conf
-- Modify srun's SIGINT handling logic timer (two SIGINTs within one second) to
be based on a microsecond rather than a second timer.
-- Modify advance reservation to accept multiple specific block sizes rather
than a single node count.
-- Permit administrator to change a job's QOS to any value without validating
the job's owner has permission to use that QOS. Based upon patch by Phil
Eckert (LLNL).
-- Add trigger flag for a permanent trigger. The trigger will NOT be purged
after an event occurs, but only when explicitly deleted.
-- Interpret a reservation with Nodes=ALL and a Partition specification as
reserving all nodes within the specified partition rather than all nodes
on the system. Based upon patch by Phil Eckert (LLNL).
-- Add the ability to reboot all compute nodes after they become idle. The
RebootProgram configuration parameter must be set and an authorized user
must execute the command "scontrol reboot_nodes". Patch from Andriy
Grytsenko (Massive Solutions Limited).
-- Modify slurmdbd.conf parsing to accept DebugLevel strings (quiet, fatal,
info, etc.) in addition to numeric values. The parsing of slurm.conf was
modified in the same fashion for SlurmctldDebug and SlurmdDebug values.
The output of sview and "scontrol show config" was also modified to report
those values as strings rather than numeric values.
-- Changed default value of StateSaveLocation configuration parameter from
-- Prevent an association from being deleted if it has any jobs in running,
pending or suspended state. Previous code prevented this only for running
jobs.
-- If a job can not run due to QOS or association limits, then do not cancel
the job, but leave it pending in a system held state (priority = 1). The
job will run when its limits or the QOS/association limits change. Based
upon a patch by Phil Eckert (LLNL).
-- BGQ - Added logic to keep track of cnodes in an error state inside of a
booted block.
-- Added the ability to update a node's NodeAddr and NodeHostName with
scontrol. Also enable setting a node's state to "future" using scontrol.
-- Add a node state flag of CLOUD and save/restore NodeAddr and NodeHostName
information for nodes with a flag of CLOUD.
-- Cray: Add support for job reservations with node IDs that are not in
numeric order. Fix for Bugzilla #5.
-- Fix association limit support for jobs queued for multiple partitions.
-- BLUEGENE - fix issue for sub-midplane systems to create a full system
block correctly.
-- BLUEGENE - Added option to the bluegene.conf to indicate that you are
running on a sub-midplane system.
-- Added the UserID of the user issuing the RPC to the job_submit/lua
functions.
-- Fixed issue where, if a job ended with ESLURMD_UID_NOT_FOUND and
ESLURMD_GID_NOT_FOUND, slurm would be overzealous in treating a missing
GID or UID as a fatal error.
-- If job time limit exceeds partition maximum, but job's minimum time limit
does not, set job's time limit to partition maximum at allocation time.
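The time limit rule in the last entry above can be sketched in Python as
follows. This is only an illustration of the stated rule, not the slurmctld
code; the function and parameter names are hypothetical, all values are
assumed to be in minutes, and returning None for the case the entry does not
cover is a choice made for this example.

```python
def allocation_time_limit(job_limit, job_min_limit, partition_max):
    """Return the time limit to apply at allocation time.

    Hypothetical helper for illustration only; not part of SLURM.
    """
    if job_limit <= partition_max:
        # The requested limit already fits within the partition maximum.
        return job_limit
    if job_min_limit is not None and job_min_limit <= partition_max:
        # Limit exceeds the maximum but the job's minimum does not:
        # use the partition maximum.
        return partition_max
    # Not covered by the entry above; left undecided in this sketch.
    return None


print(allocation_time_limit(120, 30, 60))
```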
* Changes in SLURM 2.3.3
========================
-- Fix task/cgroup plugin error when used with GRES. Patch by Alexander
Bersenev (Institute of Mathematics and Mechanics, Russia).
-- Permit pending job exceeding a partition limit to run if its QOS flag is
modified to permit the partition limit to be exceeded. Patch from Bill
Brophy, Bull.
-- sacct search for jobs using filtering was ignoring wckey filter.
-- Fixed issue with QOS preemption when adding new QOS.
-- Fixed issue with comment field being used in a job finishing before it
starts in accounting.
-- Add slashes in front of derived exit code when modifying a job.
-- Handle numeric suffix of "T" for terabyte units. Patch from John Thiltges,
University of Nebraska-Lincoln.
-- Prevent resetting a held job's priority when updating other job parameters.
Patch from Alejandro Lucero Palau, BSC.
-- Improve logic to import a user's environment. Needed with --get-user-env
option used with Moab. Patch from Mark Grondona, LLNL.
-- Fix bug in sview layout if node count less than configured grid_x_width.
-- Modify PAM module to prefer to use SLURM library with same major release
number that it was built with.
-- Permit gres count configuration of zero.
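For the "T" suffix entry above, a minimal Python sketch of suffix handling is
shown below. It is not SLURM's implementation: the function name is
hypothetical, and the set of suffixes (K, M, G, T) and the use of 1024 as the
multiplier base are assumptions made for this example.

```python
# Assumption: binary multipliers; the entry above does not state the base.
_MULTIPLIERS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a string such as "2T" into a plain integer.

    Hypothetical helper for illustration only; not part of SLURM.
    """
    text = text.strip()
    if not text:
        raise ValueError("empty value")
    suffix = text[-1].upper()
    if suffix in _MULTIPLIERS:
        return int(text[:-1]) * _MULTIPLIERS[suffix]
    # No recognized suffix: treat the value as a plain integer.
    return int(text)


print(parse_count("2T"))
```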
* Changes in SLURM 2.3.2
========================
-- Add configure option of "--without-rpath" which builds SLURM tools without
the rpath option, which will work if Munge and BlueGene libraries are in
the default library search path and make system updates easier.
-- Fixed issue where, if a job ended with ESLURMD_UID_NOT_FOUND and
ESLURMD_GID_NOT_FOUND, slurm would be overzealous in treating a missing
GID or UID as a fatal error.
-- Backfill scheduling - Add SchedulerParameters configuration parameter of
"bf_res" to control the resolution in the backfill scheduler's data about
when jobs begin and end. Default value is 60 seconds (used to be 1 second).
-- Cray - Remove the "family" specification from the GPU reservation request.

-- Updated set_oomadj.c, replacing deprecated oom_adj reference with
oom_score_adj
-- Fix resource allocation bug, generic resources allocation was ignoring the
job's ntasks_per_node and cpus_per_task parameters. Patch from Carles
Fenoy, BSC.
-- Avoid orphan job step if slurmctld is down when a job step completes.
-- Fix Lua link order, patch from Pär Andersson, NSC.
-- Set SLURM_CPUS_PER_TASK=1 when user specifies --cpus-per-task=1.
-- Fix for fatal error managing GRES. Patch by Carles Fenoy, BSC.
-- Fixed race condition when using the DBD in accounting where if a job
wasn't started at the time the eligible message was sent but started
before the db_index was returned information like start time would be lost.
-- Fix issue in accounting where normalized shares could be updated
incorrectly when getting fairshare from the parent.
-- Fixed issue where the cluster's default QOS was not filled in correctly
when associations are not enforced but QOS support is wanted.
-- Fix in select/cons_res for "fatal: cons_res: sync loop not progressing"
with some configurations and job option combinations.
-- BLUEGENE - Fixed issue with handling HTC modes and rebooting.
-- Do not remove the backup slurmctld's pid file when it assumes control, only
when it actually shuts down. Patch from Andriy Grytsenko (Massive Solutions
Limited).
-- Avoid clearing a job's reason from JobHeldAdmin or JobHeldUser when it is
otherwise updated using scontrol or sview commands. Patch based upon work
by Phil Eckert (LLNL).
-- BLUEGENE - Fix for if changing the defined blocks in the bluegene.conf and
jobs happen to be running on blocks not in the new config.
-- Many cosmetic modifications to eliminate warning messages from GCC version
4.6 compiler.
-- Fix for sview reservation tab when finding correct reservation.
-- Fix for handling QOS limits per user on a reconfig of the slurmctld.
-- Do not treat the absence of a gres.conf file as a fatal error on systems
configured with GRES, but set GRES counts to zero.
-- BLUEGENE - Correctly update the state in the reason of a block if an
admin sets the state to error.
-- BLUEGENE - handle reason of blocks in error more correctly between
restarts of the slurmctld.
-- BLUEGENE - Fix minor potential memory leak when setting block error reason.
-- BLUEGENE - Fix so that jobs are not denied when running in Static/Overlap
mode and the full system block is in an error state.
-- Fix for accounting where your cluster isn't numbered in counting order
(i.e. 1-9,0 instead of 0-9). The bug would cause 'sacct -N nodename' to
not give correct results on these systems.
-- Fix to GRES allocation logic when resources are associated with specific
CPUs on a node. Patch from Steve Trofinoff, CSCS.
-- Fix bugs in sched/backfill with respect to QOS reservation support and job
time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputer
Center).
-- BGQ - fix to set up corner correctly for sub block jobs.
-- Major re-write of the CPU Management User and Administrator Guide (web