This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in SLURM 2.3.5
========================
 -- Improve support for overlapping advanced reservations. Patch from
    Bill Brophy, Bull.
 -- Modify Makefiles for support of Debian hardening flags. Patch from
    Simon Ruderich.
 -- CRAY: Fix support for configuration with SlurmdTimeout=0 (never mark
    node that is DOWN in ALPS as DOWN in SLURM).
 -- Fixed the setting of SLURM_SUBMIT_DIR for jobs submitted by Moab (BZ#1467).
    Patch by Don Lipari, LLNL.
 -- Correction to init.d/slurmdbd exit code for status option. Patch by Bill
    Brophy, Bull.
 -- When the optional max_time is not specified for --switches=count, the site
    max (SchedulerParameters=max_switch_wait=seconds) is used for the job.
    Based on patch from Rod Schultz.
 -- Fix bug in select/cons_res plugin when used with topology/tree and a node
    range count in job allocation request.
 -- Fixed moab_2_slurmdb.pl script to correctly work for end records.
 -- Add support for new SchedulerParameters of max_depend_depth defining the
    maximum number of jobs to test for circular dependencies (i.e. job A waits
    for job B to start and job B waits for job A to start). Default value is
    10 jobs.
 -- Fix potential race condition if MinJobAge is very low (e.g. 1) and using
    slurmdbd accounting and running large numbers of jobs (>50 sec).  Job
    information could be corrupted before it had a chance to reach the DBD.
 -- Fix state restore of job limit set from admin value for min_cpus.
 -- Fix clearing of limit values if an admin removes the limit for max cpus
    and time limit where it was previously set by an admin.
 -- Fix issue where log message is more than 256 chars and then has a format.
 -- Fix sched/wiki2 to support job account name, gres, partition name, wckey,
    or working directory that contains "#" (a job record separator).
 -- CRAY - fix for handling memory requests from user for an allocation.
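
The max_depend_depth entry above caps how many jobs are tested when looking
for circular dependencies. A minimal Python sketch of such a bounded check
follows. The function name, the dict-based dependency map and the
breadth-first traversal are illustrative assumptions, not the slurmctld
implementation.

```python
from collections import deque


def has_circular_dependency(job_id, depends_on, max_depend_depth=10):
    """Return True if job_id is reachable through its own dependencies.

    depends_on is a hypothetical mapping from a job ID to the job IDs it
    waits for. At most max_depend_depth jobs are tested; once that many
    have been examined the search stops and reports no cycle.
    """
    queue = deque(depends_on.get(job_id, ()))
    seen = set()
    tested = 0
    while queue and tested < max_depend_depth:
        current = queue.popleft()
        if current == job_id:
            return True  # the chain leads back to the starting job
        if current in seen:
            continue  # already examined via another path
        seen.add(current)
        tested += 1
        queue.extend(depends_on.get(current, ()))
    return False
```

With this sketch, a dependency chain longer than the limit is not detected,
which is the trade-off a bounded search makes to keep the test cheap.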

* Changes in SLURM 2.3.4
========================
 -- Set DEFAULT flag in partition structure when slurmctld reads the
    configuration file. Patch from Rémi Palancher.
 -- Fix for possible deadlock in accounting logic: Avoid calling
    jobacct_gather_g_getinfo() until there is data to read from the socket.
 -- Fix typo in accounting when using reservations. Patch from Alejandro
    Lucero Palau.
 -- Fix to the multifactor priority plugin to calculate effective usage earlier
    to give a correct priority on the first decay cycle after a restart of the
    slurmctld. Patch from Martin Perry, Bull.
 -- Permit user root to run a job step for any job as any user. Patch from
    Didier Gazen, Laboratoire d'Aerologie.
 -- BLUEGENE - fix for not allowing jobs if all midplanes are drained and all
    blocks are in an error state.
 -- Avoid slurmctld abort due to bad pointer when setting an advanced
    reservation MAINT flag if it contains no nodes (only licenses).
 -- Fix bug when requeued batch job is scheduled to run on a different node
    zero, but attempts job launch on old node zero.
 -- Fix bug in step task distribution when nodes are not configured in numeric
    order. Patch from Hongjia Cao, NUDT.
 -- Fix for srun allocating running within existing allocation with --exclude
    option and --nnodes count small enough to remove more nodes. Patch from
    Phil Eckert, LLNL.
 -- Workaround to handle certain combinations of glibc/kernel
    (e.g. glibc-2.14/Linux-3.1) to correctly open the pty of the slurmstepd
    as the job user. Patch from Mark Grondona, LLNL.
 -- Modify linking to include "-ldl" only when needed. Patch from Aleksej
    Saushev.
 -- Fix smap regression to display nodes that are drained or down correctly.
 -- Several bug fixes and performance improvements related to batch
    scripts containing very large numbers of arguments. Patches from Pär
    Andersson, NSC.
 -- Fixed extremely hard to reproduce threading issue in assoc_mgr.
 -- Correct "scontrol show daemons" output if there is more than one
    ControlMachine configured.
 -- Add node read lock where needed in slurmctld/agent code.
 -- Added test for LUA library named "liblua5.1.so.0" in addition to
    "liblua5.1.so" as needed by Debian. Patch by Rémi Palancher.
 -- Added partition default_time field to job_submit LUA plugin. Patch by
    Rémi Palancher.
 -- Fix bug in cray/srun wrapper stdin/out/err file handling.
 -- In cray/srun wrapper, only include aprun "-q" option when srun "--quiet"
    option is used.
 -- BLUEGENE - fix issue where if a small block was in error it could hold up
    the queue when trying to place a larger than midplane job.
 -- CRAY - ignore all interactive nodes and jobs on interactive nodes.
 -- Add new job state reason of "FrontEndDown" which applies only to Cray and
    IBM BlueGene systems.
 -- Cray - Enable configure option of "--enable-salloc-background" to permit
    the srun and salloc commands to be executed in the background. This does
    NOT remove the ALPS limitation that only one job reservation can be created
    for each Linux session ID.
 -- Cray - For srun wrapper when creating a job allocation, set the default job
    name to the executable file's name.
 -- Add support for Cray ALPS 5.0.0
 -- FRONTEND - if a front end unexpectedly reboots kill all jobs but don't
    mark front end node down.
 -- FRONTEND - don't down a front end node if you have an epilog error.
 -- Cray - fix for if a frontend slurmd was started after the slurmctld had
    already pinged it on startup the unresponding flag would be removed from
    the frontend node.
 -- Cray - Fix issue on smap not displaying grid correctly.
 -- Fixed minor memory leak in sview.

* Changes in SLURM 2.3.3
========================
 -- Fix task/cgroup plugin error when used with GRES. Patch by Alexander
    Bersenev (Institute of Mathematics and Mechanics, Russia).
 -- Permit pending job exceeding a partition limit to run if its QOS flag is
    modified to permit the partition limit to be exceeded. Patch from Bill
    Brophy, Bull.
 -- BLUEGENE - Fixed preemption issue.
 -- sacct search for jobs using filtering was ignoring wckey filter.
 -- Fixed issue with QOS preemption when adding new QOS.
 -- Fixed issue with comment field being used in a job finishing before it
    starts in accounting.
 -- Add slashes in front of derived exit code when modifying a job.
 -- Handle numeric suffix of "T" for terabyte units. Patch from John Thiltges,
    University of Nebraska-Lincoln.
 -- Prevent resetting a held job's priority when updating other job parameters.
    Patch from Alejandro Lucero Palau, BSC.
 -- Improve logic to import a user's environment. Needed with --get-user-env
    option used with Moab. Patch from Mark Grondona, LLNL.
 -- Fix bug in sview layout if node count less than configured grid_x_width.
 -- Modify PAM module to prefer to use SLURM library with same major release
    number that it was built with.
 -- Permit gres count configuration of zero.
 -- Fix race condition where sbcast command can result in deadlock of slurmd
    daemon. Patch by Don Albert, Bull.
 -- Fix bug in srun --multi-prog configuration file to avoid printing duplicate
    record error when "*" is used at the end of the file for the task ID.
 -- Let operators see reservation data even if "PrivateData=reservations" flag
    is set in slurm.conf. Patch from Don Albert, Bull.
 -- Added new sbatch option "--export-file" as needed for latest version of
    Moab. Patch from Phil Eckert, LLNL.
 -- Fix for sacct printing CPUTime(RAW) where the value is greater than a
    32-bit number.
 -- Fix bug in --switches option with topology resulting in bad switch count use.
    Patch from Alejandro Lucero Palau (Barcelona Supercomputer Center).
 -- Fix PrivateFlags bug when using Priority Multifactor plugin.  If using sprio
    all jobs would be returned even if the flag was set.
    Patch from Bill Brophy, Bull.
 -- Fix for possible invalid memory reference in slurmctld in job dependency
    logic. Patch from Carles Fenoy (Barcelona Supercomputer Center).
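
The entry above on the numeric suffix "T" concerns parsing of size values
with unit suffixes. A minimal Python sketch of that kind of parsing follows.
The function name, the choice of megabytes as the base unit and the
1024-based multipliers are assumptions made for illustration, not taken
from the SLURM source.

```python
# Multipliers relative to megabytes; 1024-based scaling is an assumption.
_SUFFIX_TO_MB = {"M": 1, "G": 1024, "T": 1024 * 1024}


def parse_size_mb(text):
    """Convert a string such as "512", "4G" or "2T" to megabytes."""
    text = text.strip().upper()
    if not text:
        raise ValueError("empty size value")
    multiplier = 1
    if text[-1] in _SUFFIX_TO_MB:
        multiplier = _SUFFIX_TO_MB[text[-1]]
        text = text[:-1]  # drop the suffix, keep the digits
    return int(text) * multiplier
```

A value without a suffix is taken as already being in the base unit, and an
unrecognized suffix raises ValueError from int().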

* Changes in SLURM 2.3.2
========================
 -- Add configure option of "--without-rpath" which builds SLURM tools without
    the rpath option, which will work if Munge and BlueGene libraries are in
    the default library search path and make system updates easier.
 -- Fixed issue where, if a job ended with ESLURMD_UID_NOT_FOUND or
    ESLURMD_GID_NOT_FOUND, SLURM would be overzealous in treating a missing
    GID or UID as a fatal error.
 -- Backfill scheduling - Add SchedulerParameters configuration parameter of
    "bf_res" to control the resolution in the backfill scheduler's data about
    when jobs begin and end. Default value is 60 seconds (used to be 1 second).
 -- Cray - Remove the "family" specification from the GPU reservation request.
 -- Updated set_oomadj.c, replacing deprecated oom_adj reference with
    oom_score_adj
 -- Fix resource allocation bug, generic resources allocation was ignoring the
    job's ntasks_per_node and cpus_per_task parameters. Patch from Carles
    Fenoy, BSC.
 -- Avoid orphan job step if slurmctld is down when a job step completes.
 -- Fix Lua link order, patch from Pär Andersson, NSC.
 -- Set SLURM_CPUS_PER_TASK=1 when user specifies --cpus-per-task=1.
 -- Fix for fatal error managing GRES. Patch by Carles Fenoy, BSC.
 -- Fixed race condition when using the DBD in accounting where if a job
    wasn't started at the time the eligible message was sent but started
    before the db_index was returned information like start time would be lost.
 -- Fix issue in accounting where normalized shares could be updated
    incorrectly when getting fairshare from the parent.
 -- Fixed case where associations are not enforced but QOS support is wanted,
    so that the default QOS on the cluster is filled in correctly.
 -- Fix in select/cons_res for "fatal: cons_res: sync loop not progressing"
    with some configurations and job option combinations.
 -- BLUEGENE - Fixed issue with handling HTC modes and rebooting.
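
The "bf_res" entry above sets the resolution, in seconds, of the backfill
scheduler's data about when jobs begin and end. One way to picture a coarser
resolution is rounding times to a multiple of that value, as in the Python
sketch below. The function name and the choice to round up are assumptions
for illustration; the changelog does not say how the scheduler applies the
resolution.

```python
def round_up_to_resolution(timestamp, resolution=60):
    """Round a time in seconds up to the next multiple of resolution."""
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    # Ceiling division on integers, then scale back to seconds.
    return -(-timestamp // resolution) * resolution
```

With the new default of 60, times of 61 and 119 seconds fall in the same
slot, so fewer distinct time points need to be tracked than with a
resolution of 1 second.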

* Changes in SLURM 2.3.1
========================
 -- Do not remove the backup slurmctld's pid file when it assumes control, only
    when it actually shuts down. Patch from Andriy Grytsenko (Massive Solutions
    Limited).
 -- Avoid clearing a job's reason from JobHeldAdmin or JobHeldUser when it is
    otherwise updated using scontrol or sview commands. Patch based upon work
    by Phil Eckert (LLNL).
 -- BLUEGENE - Fix for if changing the defined blocks in the bluegene.conf and
    jobs happen to be running on blocks not in the new config.
 -- Many cosmetic modifications to eliminate warning messages from GCC version
    4.6 compiler.
 -- Fix for sview reservation tab when finding correct reservation.
 -- Fix for handling QOS limits per user on a reconfig of the slurmctld.
 -- Do not treat the absence of a gres.conf file as a fatal error on systems
    configured with GRES, but set GRES counts to zero.
 -- BLUEGENE - Update correctly the state in the reason of a block if an
    admin sets the state to error.
 -- BLUEGENE - handle reason of blocks in error more correctly between
    restarts of the slurmctld.
 -- BLUEGENE - Fix minor potential memory leak when setting block error reason.
 -- BLUEGENE - Fix so that jobs are not denied when running in Static/Overlap
    mode and the full system block is in an error state.
 -- Fix for accounting where your cluster isn't numbered in counting order
    (i.e. 1-9,0 instead of 0-9).  The bug would cause 'sacct -N nodename' to
    not give correct results on these systems.