Newer
Older
This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
-- Set DEFAULT flag in partition structure when slurmctld reads the
configuration file. Patch from Rémi Palancher.
-- Fix for possible deadlock in accounting logic: Avoid calling
jobacct_gather_g_getinfo() until there is data to read from the socket.
-- Fix typo in accounting when using reservations. Patch from Alejandro
Lucero Palau.
-- Fix to the multifactor priority plugin to calculate effective usage earlier
to give a correct priority on the first decay cycle after a restart of the
slurmctld. Patch from Martin Perry, Bull.
-- Permit user root to run a job step for any job as any user. Patch from
Didier Gazen, Laboratoire d'Aerologie.
-- BLUEGENE - fix for not allowing jobs if all midplanes are drained and all
blocks are in an error state.
-- Avoid slurmctld abort due to bad pointer when setting an advanced
reservation MAINT flag if it contains no nodes (only licenses).
-- Fix bug when requeued batch job is scheduled to run on a different node
zero, but attemts job launch on old node zero.
-- Fix bug in step task distribution when nodes are not configured in numeric
order. Patch from Hongjia Cao, NUDT.
-- Fix for srun allocating running within existing allocation with --exclude
option and --nnodes count small enough to remove more nodes. Patch from
Phil Eckert, LLNL.
-- Work around to handle certain combinations of glibc/kernel
(i.e. glibc-2.14/Linux-3.1) to correctly open the pty of the slurmstepd
as the job user. Patch from Mark Grondona, LLNL.
-- Modify linking to include "-ldl" only when needed. Patch from Aleksej
Saushev.
-- Fix smap regression to display nodes that are drained or down correctly.
-- Several bug fixes and performance improvements with related to batch
scripts containing very large numbers of arguments. Patches from Par
Andersson, NSC.
-- Fixed extremely hard to reproduce threading issue in assoc_mgr.
* Changes in SLURM 2.3.3
========================
-- Fix task/cgroup plugin error when used with GRES. Patch by Alexander
Bersenev (Institute of Mathematics and Mechanics, Russia).
-- Permit pending job exceeding a partition limit to run if its QOS flag is
modified to permit the partition limit to be exceeded. Patch from Bill
Brophy, Bull.
-- sacct search for jobs using filtering was ignoring wckey filter.
-- Fixed issue with QOS preemption when adding new QOS.
-- Fixed issue with comment field being used in a job finishing before it
starts in accounting.
-- Add slashes in front of derived exit code when modifying a job.
-- Handle numeric suffix of "T" for terabyte units. Patch from John Thiltges,
University of Nebraska-Lincoln.
-- Prevent resetting a held job's priority when updating other job parameters.
Patch from Alejandro Lucero Palau, BSC.
-- Improve logic to import a user's environment. Needed with --get-user-env
option used with Moab. Patch from Mark Grondona, LLNL.
-- Fix bug in sview layout if node count less than configured grid_x_width.
-- Modify PAM module to prefer to use SLURM library with same major release
number that it was built with.
-- Permit gres count configuration of zero.
-- Fix race condition where sbcast command can result in deadlock of slurmd
daemon. Patch by Don Albert, Bull.
-- Fix bug in srun --multi-prog configuration file to avoid printing duplicate
record error when "*" is used at the end of the file for the task ID.
-- Let operators see reservation data even if "PrivateData=reservations" flag
is set in slurm.conf. Patch from Don Albert, Bull.
-- Added new sbatch option "--export-file" as needed for latest version of
Moab. Patch from Phil Eckert, LLNL.
-- Fix for sacct printing CPUTime(RAW) where the the is greater than a 32 bit
number.
-- Fix bug in --switch option with topology resulting in bad switch count use.
Patch from Alejandro Lucero Palau (Barcelona Supercomputer Center).
-- Fix PrivateFlags bug when using Priority Multifactor plugin. If using sprio
all jobs would be returned even if the flag was set.
Patch from Bill Brophy, Bull.
-- Fix for possible invalid memory reference in slurmctld in job dependency
logic. Patch from Carles Fenoy (Barcelona Supercomputer Center).
* Changes in SLURM 2.3.2
========================
-- Add configure option of "--without-rpath" which builds SLURM tools without
the rpath option, which will work if Munge and BlueGene libraries are in
the default library search path and make system updates easier.
-- Fixed issue where if a job ended with ESLURMD_UID_NOT_FOUND and
ESLURMD_GID_NOT_FOUND where slurm would be a little over zealous
in treating missing a GID or UID as a fatal error.
-- Backfill scheduling - Add SchedulerParameters configuration parameter of
"bf_res" to control the resolution in the backfill scheduler's data about
when jobs begin and end. Default value is 60 seconds (used to be 1 second).
-- Cray - Remove the "family" specification from the GPU reservation request.

Morris Jette
committed
-- Updated set_oomadj.c, replacing deprecated oom_adj reference with
oom_score_adj
-- Fix resource allocation bug, generic resources allocation was ignoring the
job's ntasks_per_node and cpus_per_task parameters. Patch from Carles
Fenoy, BSC.
-- Avoid orphan job step if slurmctld is down when a job step completes.
-- Fix Lua link order, patch from Pär Andersson, NSC.
-- Set SLURM_CPUS_PER_TASK=1 when user specifies --cpus-per-task=1.
-- Fix for fatal error managing GRES. Patch by Carles Fenoy, BSC.
-- Fixed race condition when using the DBD in accounting where if a job
wasn't started at the time the eligible message was sent but started
before the db_index was returned information like start time would be lost.
-- Fix issue in accounting where normalized shares could be updated
incorrectly when getting fairshare from the parent.
-- Fixed if not enforcing associations but want QOS support for a default
qos on the cluster to fill that in correctly.
-- Fix in select/cons_res for "fatal: cons_res: sync loop not progressing"
with some configurations and job option combinations.
-- BLUEGNE - Fixed issue with handling HTC modes and rebooting.
-- Do not remove the backup slurmctld's pid file when it assumes control, only
when it actually shuts down. Patch from Andriy Grytsenko (Massive Solutions
Limited).
-- Avoid clearing a job's reason from JobHeldAdmin or JobHeldUser when it is
otherwise updated using scontrol or sview commands. Patch based upon work
by Phil Eckert (LLNL).
-- BLUEGENE - Fix for if changing the defined blocks in the bluegene.conf and
jobs happen to be running on blocks not in the new config.
-- Many cosmetic modifications to eliminate warning message from GCC version
4.6 compiler.
-- Fix for sview reservation tab when finding correct reservation.
-- Fix for handling QOS limits per user on a reconfig of the slurmctld.
-- Do not treat the absence of a gres.conf file as a fatal error on systems
configured with GRES, but set GRES counts to zero.
-- BLUEGENE - Update correctly the state in the reason of a block if an
admin sets the state to error.
-- BLUEGENE - handle reason of blocks in error more correctly between
restarts of the slurmctld.
-- BLUEGENE - Fix minor potential memory leak when setting block error reason.
-- BLUEGENE - Fix if running in Static/Overlap mode and full system block
is in an error state, won't deny jobs.
-- Fix for accounting where your cluster isn't numbered in counting order
(i.e. 1-9,0 instead of 0-9). The bug would cause 'sacct -N nodename' to
not give correct results on these systems.
-- Fix to GRES allocation logic when resources are associated with specific
CPUs on a node. Patch from Steve Trofinoff, CSCS.
-- Fix bugs in sched/backfill with respect to QOS reservation support and job
time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputer
Center).
-- BGQ - fix to set up corner correctly for sub block jobs.
-- Major re-write of the CPU Management User and Administrator Guide (web
page) by Martin Perry, Bull.
-- BLUEGENE - If removing blocks from system that once existed cleanup of old
block happens correctly now.
-- Prevent slurmctld crashing with configuration of MaxMemPerCPU=0.
-- Prevent job hold by operator or account coordinator of his own job from
being an Administrator Hold rather than User Hold by default.
-- Cray - Fix for srun.pl parsing to avoid adding spaces between option and
argument (e.g. "-N2" parsed properly without changing to "-N 2").
-- Major updates to cgroup support by Mark Grondona (LLNL) and Matthieu
Hautreux (CEA) and Sam Lang. Fixes timing problems with respect to the
task_epilog. Allows cgroup mount point to be configurable. Added new
configuration parameters MaxRAMPercent and MaxSwapPercent. Allow cgroup
configuration parameters that are precentages to be floating point.
-- Fixed issue where sview wasn't displaying correct nice value for jobs.
-- Fixed issue where sview wasn't displaying correct min memory per node/cpu
value for jobs.
-- Disable some SelectTypeParameters for select/linear that aren't compatible.
-- Move slurm_select_init to proper place to avoid loading multiple select
plugins in the slurmd.
-- BGQ - Include runjob_plugin.so in the bluegene rpm.
-- Report correct job "Reason" if needed nodes are DOWN, DRAINED, or
NOT_RESPONDING, "Resources" rather than "PartitionNodeLimit".
-- BLUEGENE - Fixed issues with running on a sub-midplane system.
-- Added some missing calls to allow older versions of SLURM to talk to newer.
-- Do not attempt to run HeathCheckProgram on powered down nodes. Patch from
Ramiro Alba, Centre Tecnològic de Tranferència de Calor, Spain.
* Changes in SLURM 2.3.0-2
==========================
-- Fix issue where if a job was pending and the slurmctld was restarted a
variable wasn't initialized in the job structure making it so that job
wouldn't run.
========================
-- BLUEGENE - make sure we only set the jobinfo_select start_loc on a job
when we are on a small block, not a regular one.
-- BGQ - fix issue where not copying the correct amount of memory.
-- BLUEGENE - fix clean start if jobs were running when the slurmctld was
shutdown and then the system size changed. This would probably only happen
if you were emulating a system.
-- Fix sview for calling a cray system from a non-cray system to get the
correct geometry of the system.
-- BLUEGENE - fix to correctly import pervious version of block state file.
-- BLUEGENE - handle loading better when doing a clean start with static
blocks.
-- Add sinfo format and sort option "%n" for NodeHostName and "%o" for
NodeAddr.
-- If a job is deferred due to partition limits, then re-test those limits
after a partition is modified. Patch from Don Lipari.
-- Fix bug which would crash slurmcld if job's owner (not root) tries to clear
a job's licenses by setting value to "".
-- Cosmetic fix for printing out debug info in the priority plugin.
-- In sview when switching from a bluegene machine to a regular linux cluster
Loading
Loading full blame...