This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
-- Improve support for overlapping advanced reservations. Patch from
Bill Brophy, Bull.
-- Modify Makefiles for support of Debian hardening flags. Patch from
Simon Ruderich.
-- CRAY: Fix support for configuration with SlurmdTimeout=0 (never mark
node that is DOWN in ALPS as DOWN in SLURM).
-- Fixed the setting of SLURM_SUBMIT_DIR for jobs submitted by Moab (BZ#1467).
Patch by Don Lipari, LLNL.
-- Correction to init.d/slurmdbd exit code for status option. Patch by Bill
Brophy, Bull.
-- When the optional max_time is not specified for --switches=count, the site
max (SchedulerParameters=max_switch_wait=seconds) is used for the job.
Based on patch from Rod Schultz.
-- Fix bug in select/cons_res plugin when used with topology/tree and a node
range count in job allocation request.
-- Fixed moab_2_slurmdb.pl script to correctly work for end records.
-- Add support for new SchedulerParameters of max_depend_depth defining the
maximum number of jobs to test for circular dependencies (i.e. job A waits
for job B to start and job B waits for job A to start). Default value is
10 jobs.
-- Fix potential race condition if MinJobAge is very low (e.g. 1) when using
slurmdbd accounting and running large numbers of jobs (>50 sec). Job
information could be corrupted before it had a chance to reach the DBD.
-- Fix state restore of job limit set from admin value for min_cpus.
-- Fix clearing of limit values if an admin removes the limit for max cpus
and time limit where it was previously set by an admin.
-- Set DEFAULT flag in partition structure when slurmctld reads the
configuration file. Patch from Rémi Palancher.
-- Fix for possible deadlock in accounting logic: Avoid calling
jobacct_gather_g_getinfo() until there is data to read from the socket.
-- Fix typo in accounting when using reservations. Patch from Alejandro
Lucero Palau.
-- Fix to the multifactor priority plugin to calculate effective usage earlier
to give a correct priority on the first decay cycle after a restart of the
slurmctld. Patch from Martin Perry, Bull.
-- Permit user root to run a job step for any job as any user. Patch from
Didier Gazen, Laboratoire d'Aerologie.
-- BLUEGENE - fix for not allowing jobs if all midplanes are drained and all
blocks are in an error state.
-- Avoid slurmctld abort due to bad pointer when setting an advanced
reservation MAINT flag if it contains no nodes (only licenses).
-- Fix bug when requeued batch job is scheduled to run on a different node
zero, but attempts job launch on old node zero.
-- Fix bug in step task distribution when nodes are not configured in numeric
order. Patch from Hongjia Cao, NUDT.
-- Fix for srun running within an existing allocation with the --exclude
option and a --nnodes count small enough to remove more nodes. Patch from
Phil Eckert, LLNL.
-- Work around to handle certain combinations of glibc/kernel
(e.g. glibc-2.14/Linux-3.1) to correctly open the pty of the slurmstepd
as the job user. Patch from Mark Grondona, LLNL.
-- Modify linking to include "-ldl" only when needed. Patch from Aleksej
Saushev.
-- Fix smap regression to display nodes that are drained or down correctly.
-- Several bug fixes and performance improvements related to batch
scripts containing very large numbers of arguments. Patches from Par
Andersson, NSC.
-- Fixed extremely hard to reproduce threading issue in assoc_mgr.
-- Correct "scontrol show daemons" output if there is more than one
ControlMachine configured.
-- Add node read lock where needed in slurmctld/agent code.
-- Added test for LUA library named "liblua5.1.so.0" in addition to
"liblua5.1.so" as needed by Debian. Patch by Remi Palancher.
-- Added partition default_time field to job_submit LUA plugin. Patch by
Remi Palancher.
-- Fix bug in cray/srun wrapper stdin/out/err file handling.
-- In cray/srun wrapper, only include aprun "-q" option when srun "--quiet"
option is used.
-- BLUEGENE - fix issue where if a small block was in error it could hold up
the queue when trying to place a larger than midplane job.
-- CRAY - ignore all interactive nodes and jobs on interactive nodes.
-- Add new job state reason of "FrontEndDown" which applies only to Cray and
IBM BlueGene systems.
-- Cray - Enable configure option of "--enable-salloc-background" to permit
the srun and salloc commands to be executed in the background. This does
NOT remove the ALPS limitation that only one job reservation can be created
for each Linux session ID.
-- Cray - For srun wrapper when creating a job allocation, set the default job
name to the executable file's name.
-- FRONTEND - if a front end unexpectedly reboots kill all jobs but don't
mark front end node down.
-- FRONTEND - don't down a front end node if you have an epilog error.
-- Cray - fix for if a frontend slurmd was started after the slurmctld had
already pinged it on startup the unresponding flag would be removed from
the frontend node.
-- Cray - Fix issue on smap not displaying grid correctly.
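
The max_depend_depth entry above bounds how many jobs are tested when looking
for circular dependencies. The following is a minimal sketch of that idea in
Python, not SLURM's implementation: the function name, the dictionary layout
and the traversal order are all assumptions made for illustration. Only the
default of 10 jobs comes from the entry itself.

```python
def has_circular_dependency(job_id, depends_on, max_depth=10):
    """Return True if following dependencies from job_id leads back to it.

    depends_on maps a job id to the job ids it waits for (hypothetical
    layout). At most max_depth distinct jobs are tested, so a cycle longer
    than that goes undetected, which is the trade-off a depth limit makes.
    """
    to_visit = list(depends_on.get(job_id, ()))
    seen = set()
    tested = 0
    while to_visit and tested < max_depth:
        current = to_visit.pop()
        if current == job_id:
            return True          # walked back to the starting job
        if current in seen:
            continue             # already tested via another path
        seen.add(current)
        tested += 1
        to_visit.extend(depends_on.get(current, ()))
    return False


# Job A waits for job B and job B waits for job A.
deps = {"A": ["B"], "B": ["A"]}
print(has_circular_dependency("A", deps))
```

A larger limit finds longer cycles at the cost of more work per test, which is
presumably why the value is configurable.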
* Changes in SLURM 2.3.3
========================
-- Fix task/cgroup plugin error when used with GRES. Patch by Alexander
Bersenev (Institute of Mathematics and Mechanics, Russia).
-- Permit pending job exceeding a partition limit to run if its QOS flag is
modified to permit the partition limit to be exceeded. Patch from Bill
Brophy, Bull.
-- sacct search for jobs using filtering was ignoring wckey filter.
-- Fixed issue with QOS preemption when adding new QOS.
-- Fixed issue with comment field being used in a job finishing before it
starts in accounting.
-- Add slashes in front of derived exit code when modifying a job.
-- Handle numeric suffix of "T" for terabyte units. Patch from John Thiltges,
University of Nebraska-Lincoln.
-- Prevent resetting a held job's priority when updating other job parameters.
Patch from Alejandro Lucero Palau, BSC.
-- Improve logic to import a user's environment. Needed with --get-user-env
option used with Moab. Patch from Mark Grondona, LLNL.
-- Fix bug in sview layout if node count less than configured grid_x_width.
-- Modify PAM module to prefer to use SLURM library with same major release
number that it was built with.
-- Permit gres count configuration of zero.
-- Fix race condition where sbcast command can result in deadlock of slurmd
daemon. Patch by Don Albert, Bull.
-- Fix bug in srun --multi-prog configuration file to avoid printing duplicate
record error when "*" is used at the end of the file for the task ID.
-- Let operators see reservation data even if "PrivateData=reservations" flag
is set in slurm.conf. Patch from Don Albert, Bull.
-- Added new sbatch option "--export-file" as needed for latest version of
Moab. Patch from Phil Eckert, LLNL.
-- Fix for sacct printing CPUTime(RAW) where the value is greater than a
32-bit number.
-- Fix bug in --switch option with topology resulting in bad switch count use.
Patch from Alejandro Lucero Palau (Barcelona Supercomputer Center).
-- Fix PrivateFlags bug when using Priority Multifactor plugin. If using sprio
all jobs would be returned even if the flag was set.
Patch from Bill Brophy, Bull.
-- Fix for possible invalid memory reference in slurmctld in job dependency
logic. Patch from Carles Fenoy (Barcelona Supercomputer Center).
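
The entry above about the numeric suffix "T" concerns parsing size values such
as "2T". A minimal sketch of suffix handling in Python follows. It is not
SLURM's parser: the function name, the set of accepted suffixes and the use of
1024-based multipliers are assumptions made for illustration.

```python
# Assumed binary multipliers; the unit the result is expressed in depends on
# the option being parsed and is not specified here.
_MULTIPLIERS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a string such as '512', '4G' or '2T' to an integer count."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    suffix = text[-1].upper()
    if suffix in _MULTIPLIERS:
        return int(text[:-1]) * _MULTIPLIERS[suffix]
    return int(text)             # no suffix: value is used unchanged


print(parse_size("2T"))
```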
* Changes in SLURM 2.3.2
========================
-- Add configure option of "--without-rpath" which builds SLURM tools without
the rpath option, which will work if Munge and BlueGene libraries are in
the default library search path and make system updates easier.
-- Fixed issue where, if a job ended with ESLURMD_UID_NOT_FOUND or
ESLURMD_GID_NOT_FOUND, SLURM would be overzealous in treating a missing
GID or UID as a fatal error.
-- Backfill scheduling - Add SchedulerParameters configuration parameter of
"bf_res" to control the resolution in the backfill scheduler's data about
when jobs begin and end. Default value is 60 seconds (used to be 1 second).
-- Cray - Remove the "family" specification from the GPU reservation request.

-- Updated set_oomadj.c, replacing deprecated oom_adj reference with
oom_score_adj
-- Fix resource allocation bug, generic resources allocation was ignoring the
job's ntasks_per_node and cpus_per_task parameters. Patch from Carles
Fenoy, BSC.
-- Avoid orphan job step if slurmctld is down when a job step completes.
-- Fix Lua link order, patch from Pär Andersson, NSC.
-- Set SLURM_CPUS_PER_TASK=1 when user specifies --cpus-per-task=1.
-- Fix for fatal error managing GRES. Patch by Carles Fenoy, BSC.
-- Fixed race condition when using the DBD in accounting where if a job
wasn't started at the time the eligible message was sent but started
before the db_index was returned information like start time would be lost.
-- Fix issue in accounting where normalized shares could be updated
incorrectly when getting fairshare from the parent.
-- Fixed case where associations are not enforced but QOS support is wanted:
the default QOS on the cluster is now filled in correctly.
-- Fix in select/cons_res for "fatal: cons_res: sync loop not progressing"
with some configurations and job option combinations.
-- BLUEGENE - Fixed issue with handling HTC modes and rebooting.
-- Do not remove the backup slurmctld's pid file when it assumes control, only
when it actually shuts down. Patch from Andriy Grytsenko (Massive Solutions
Limited).
-- Avoid clearing a job's reason from JobHeldAdmin or JobHeldUser when it is
otherwise updated using scontrol or sview commands. Patch based upon work
by Phil Eckert (LLNL).
-- BLUEGENE - Fix for if changing the defined blocks in the bluegene.conf and
jobs happen to be running on blocks not in the new config.
-- Many cosmetic modifications to eliminate warning message from GCC version
4.6 compiler.
-- Fix for sview reservation tab when finding correct reservation.
-- Fix for handling QOS limits per user on a reconfig of the slurmctld.
-- Do not treat the absence of a gres.conf file as a fatal error on systems
configured with GRES, but set GRES counts to zero.
-- BLUEGENE - Update correctly the state in the reason of a block if an
admin sets the state to error.
-- BLUEGENE - handle reason of blocks in error more correctly between
restarts of the slurmctld.
-- BLUEGENE - Fix minor potential memory leak when setting block error reason.
-- BLUEGENE - Fix if running in Static/Overlap mode and full system block
is in an error state, won't deny jobs.
-- Fix for accounting where your cluster isn't numbered in counting order
(i.e. 1-9,0 instead of 0-9). The bug would cause 'sacct -N nodename' to
not give correct results on these systems.
-- Fix to GRES allocation logic when resources are associated with specific
CPUs on a node. Patch from Steve Trofinoff, CSCS.
-- Fix bugs in sched/backfill with respect to QOS reservation support and job
time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputer