This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in SLURM 2.3.3
========================
-- Fix task/cgroup plugin error when used with GRES. Patch by Alexander
Bersenev (Institute of Mathematics and Mechanics, Russia).
-- Permit pending job exceeding a partition limit to run if its QOS flag is
modified to permit the partition limit to be exceeded. Patch from Bill
Brophy, Bull.
-- sacct search for jobs using filtering was ignoring wckey filter.
-- Fixed issue with QOS preemption when adding new QOS.
-- Fixed issue with comment field being used in a job finishing before it
starts in accounting.
-- Add slashes in front of derived exit code when modifying a job.
-- Handle numeric suffix of "T" for terabyte units. Patch from John Thiltges,
University of Nebraska-Lincoln.
-- Prevent resetting a held job's priority when updating other job parameters.
Patch from Alejandro Lucero Palau, BSC.
-- Improve logic to import a user's environment. Needed with --get-user-env
option used with Moab. Patch from Mark Grondona, LLNL.
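
The "T" suffix entry above can be illustrated with a small sketch. This is
not SLURM's C implementation: the helper name parse_size_mb, the choice of
megabytes as the base unit, and the 1024-based multipliers are all
assumptions made for illustration only.

```python
# Hypothetical helper, NOT SLURM's actual code: converts a size string with an
# optional M/G/T suffix into megabytes. Base unit and multipliers are assumed.
_MULTIPLIERS_MB = {"M": 1, "G": 1024, "T": 1024 * 1024}


def parse_size_mb(text):
    """Return the size in megabytes for strings such as "512", "4G" or "2T"."""
    text = text.strip()
    if not text:
        raise ValueError("empty size string")
    suffix = text[-1].upper()
    if suffix in _MULTIPLIERS_MB:
        # Suffix present: scale the numeric part by the matching multiplier.
        return int(text[:-1]) * _MULTIPLIERS_MB[suffix]
    # No recognized suffix: treat the value as already being in megabytes.
    return int(text)
```

Under these assumptions a value such as "2T" is scaled by 1024 * 1024, while
a bare number is passed through unchanged.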
* Changes in SLURM 2.3.2
========================
-- Add configure option of "--without-rpath" which builds SLURM tools without
the rpath option, which will work if Munge and BlueGene libraries are in
the default library search path and make system updates easier.
-- Fixed issue where, if a job ended with ESLURMD_UID_NOT_FOUND and
ESLURMD_GID_NOT_FOUND, SLURM would be overzealous in treating a missing
GID or UID as a fatal error.
-- Backfill scheduling - Add SchedulerParameters configuration parameter of
"bf_res" to control the resolution in the backfill scheduler's data about
when jobs begin and end. Default value is 60 seconds (used to be 1 second).
-- Cray - Remove the "family" specification from the GPU reservation request.
-- Updated set_oomadj.c, replacing deprecated oom_adj reference with
oom_score_adj
-- Fix resource allocation bug, generic resources allocation was ignoring the
job's ntasks_per_node and cpus_per_task parameters. Patch from Carles
Fenoy, BSC.
-- Avoid orphan job step if slurmctld is down when a job step completes.
-- Fix Lua link order, patch from Pär Andersson, NSC.
-- Set SLURM_CPUS_PER_TASK=1 when user specifies --cpus-per-task=1.
-- Fix for fatal error managing GRES. Patch by Carles Fenoy, BSC.
-- Fixed race condition when using the DBD in accounting where if a job
wasn't started at the time the eligible message was sent but started
before the db_index was returned information like start time would be lost.
-- Fix issue in accounting where normalized shares could be updated
incorrectly when getting fairshare from the parent.
-- Fixed case where associations are not enforced but QOS support is wanted:
the default QOS on the cluster is now filled in correctly.
-- Fix in select/cons_res for "fatal: cons_res: sync loop not progressing"
with some configurations and job option combinations.
-- BLUEGENE - Fixed issue with handling HTC modes and rebooting.
-- Do not remove the backup slurmctld's pid file when it assumes control, only
when it actually shuts down. Patch from Andriy Grytsenko (Massive Solutions
Limited).
-- Avoid clearing a job's reason from JobHeldAdmin or JobHeldUser when it is
otherwise updated using scontrol or sview commands. Patch based upon work
by Phil Eckert (LLNL).
-- BLUEGENE - Fix for if changing the defined blocks in the bluegene.conf and
jobs happen to be running on blocks not in the new config.
-- Many cosmetic modifications to eliminate warning messages from GCC version
4.6 compiler.
-- Fix for sview reservation tab when finding correct reservation.
-- Fix for handling QOS limits per user on a reconfig of the slurmctld.
-- Do not treat the absence of a gres.conf file as a fatal error on systems
configured with GRES, but set GRES counts to zero.
-- BLUEGENE - Update correctly the state in the reason of a block if an
admin sets the state to error.
-- BLUEGENE - handle reason of blocks in error more correctly between
restarts of the slurmctld.
-- BLUEGENE - Fix minor potential memory leak when setting block error reason.
-- BLUEGENE - Fix so that jobs are not denied when running in Static/Overlap
mode and the full system block is in an error state.
-- Fix for accounting where your cluster isn't numbered in counting order
(i.e. 1-9,0 instead of 0-9). The bug would cause 'sacct -N nodename' to
not give correct results on these systems.
-- Fix to GRES allocation logic when resources are associated with specific
CPUs on a node. Patch from Steve Trofinoff, CSCS.
-- Fix bugs in sched/backfill with respect to QOS reservation support and job
time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputer
Center).
-- BGQ - fix to set up corner correctly for sub block jobs.
-- Major re-write of the CPU Management User and Administrator Guide (web
page) by Martin Perry, Bull.
-- BLUEGENE - When removing blocks that once existed from the system, cleanup
of the old blocks now happens correctly.
-- Prevent slurmctld crashing with configuration of MaxMemPerCPU=0.
-- Prevent job hold by operator or account coordinator of his own job from
being an Administrator Hold rather than User Hold by default.
-- Cray - Fix for srun.pl parsing to avoid adding spaces between option and
argument (e.g. "-N2" parsed properly without changing to "-N 2").
-- Major updates to cgroup support by Mark Grondona (LLNL) and Matthieu
Hautreux (CEA) and Sam Lang. Fixes timing problems with respect to the
task_epilog. Allows cgroup mount point to be configurable. Added new
configuration parameters MaxRAMPercent and MaxSwapPercent. Allow cgroup
configuration parameters that are percentages to be floating point.
-- Fixed issue where sview wasn't displaying correct nice value for jobs.
-- Fixed issue where sview wasn't displaying correct min memory per node/cpu
value for jobs.
-- Disable some SelectTypeParameters for select/linear that aren't compatible.
-- Move slurm_select_init to proper place to avoid loading multiple select
plugins in the slurmd.
-- BGQ - Include runjob_plugin.so in the bluegene rpm.
-- Report correct job "Reason" if needed nodes are DOWN, DRAINED, or
NOT_RESPONDING, "Resources" rather than "PartitionNodeLimit".
-- BLUEGENE - Fixed issues with running on a sub-midplane system.
-- Added some missing calls to allow older versions of SLURM to talk to newer.
-- Do not attempt to run HealthCheckProgram on powered down nodes. Patch from
Ramiro Alba, Centre Tecnològic de Transferència de Calor, Spain.
* Changes in SLURM 2.3.0-2
==========================
-- Fix issue where if a job was pending and the slurmctld was restarted a
variable wasn't initialized in the job structure making it so that job
wouldn't run.
========================
-- BLUEGENE - make sure we only set the jobinfo_select start_loc on a job
when we are on a small block, not a regular one.
-- BGQ - fix issue where the correct amount of memory was not being copied.
-- BLUEGENE - fix clean start if jobs were running when the slurmctld was
shutdown and then the system size changed. This would probably only happen
if you were emulating a system.
-- Fix sview for calling a cray system from a non-cray system to get the
correct geometry of the system.
-- BLUEGENE - fix to correctly import previous version of block state file.
-- BLUEGENE - handle loading better when doing a clean start with static
blocks.
-- Add sinfo format and sort option "%n" for NodeHostName and "%o" for
NodeAddr.
-- If a job is deferred due to partition limits, then re-test those limits
after a partition is modified. Patch from Don Lipari.
-- Fix bug which would crash slurmctld if job's owner (not root) tries to clear
a job's licenses by setting value to "".
-- Cosmetic fix for printing out debug info in the priority plugin.
-- In sview when switching from a bluegene machine to a regular linux cluster
and vice versa the node->base partition lists will be displayed if setup
in your .slurm/sviewrc file.
-- BLUEGENE - Fix for creating full system static block on a BGQ system.
-- BLUEGENE - Fix deadlock issue if toggling between Dynamic and Static block
allocation with jobs running on blocks that don't exist in the static
setup.
-- BLUEGENE - Modify code to only give HTC states to BGP systems and not
allow them on Q systems.
-- BLUEGENE - Make it possible for an admin to define multiple dimension
conn_types in a block definition.
-- BGQ - Alter tools to output multiple dimensional conn_type.
-- With sched/wiki or sched/wiki2 (Maui or Moab scheduler), ensure that a
requeued job's priority is reset to zero.
-- BLUEGENE - fix to run steps correctly in a BGL/P emulated system.
-- Fixed issue where if there was a network issue between the slurmctld and
the DBD where both remained up but were disconnected the slurmctld would
get registered again with the DBD.
-- Fixed issue where if the DBD connection from the ctld goes away because of
a POLLERR the dbd_fail callback is called.
-- BLUEGENE - Fix to smap command-line mode display.
-- Change in GRES behavior for job steps: A job step's default generic
resource allocation will be set to that of the job. If a job step's --gres
value is set to "none" then none of the generic resources which have been
allocated to the job will be allocated to the job step.
-- Add srun environment value of SLURM_STEP_GRES to set default --gres value
for a job step.
-- Require SchedulerTimeSlice configuration parameter to be at least 5 seconds
to avoid thrashing slurmd daemon.
-- Cray - Fix to make nodes state in accounting consistent with state set by
ALPS.
-- Cray - A node DOWN to ALPS will be marked DOWN to SLURM only after reaching
SlurmdTimeout. In the interim, the node state will be NO_RESPOND. This
change makes SLURM handling of the node DOWN state more
consistent with ALPS. This change affects only Cray systems.
-- Cray - Fix to work with 4.0.* instead of just 4.0.0
-- Cray - Modify srun/aprun wrapper to map --exclusive to -F exclusive and
--share to -F share. Note this does not consider the partition's Shared
configuration, so it is an imperfect mapping of options.
-- BLUEGENE - Added notice in the print config to tell if you are emulated
or not.
-- BLUEGENE - Fix job step scalability issue with large task count.
-- BGQ - Improved c-node selection when asked for a sub-block job that
cannot fit into the available shape.
-- BLUEGENE - Modify "scontrol show step" to show I/O nodes (BGL and BGP) or
c-nodes (BGQ) allocated to each step. Change field name from "Nodes=" to
"BP_List=".
-- Code cleanup on step request to get the correct select_jobinfo.
-- Memory leak fixed for rolling up accounting with down clusters.
-- BGQ - fix issue where if first job step is the entire block and then the
next parallel step is run on a sub block, SLURM won't oversubscribe cnodes.
-- Treat duplicate switch name in topology.conf as fatal error. Patch from Rod
Schultz, Bull.
-- Minor update to documentation describing the AllowGroups option for a