This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in SLURM 2.3.4
========================
 -- Set DEFAULT flag in partition structure when slurmctld reads the
    configuration file. Patch from Rémi Palancher.
 -- Fix for possible deadlock in accounting logic: Avoid calling
    jobacct_gather_g_getinfo() until there is data to read from the socket.
 -- Fix typo in accounting when using reservations. Patch from Alejandro
    Lucero Palau.
 -- Fix to the multifactor priority plugin to calculate effective usage earlier
    to give a correct priority on the first decay cycle after a restart of the
    slurmctld. Patch from Martin Perry, Bull.
 -- Permit user root to run a job step for any job as any user. Patch from
    Didier Gazen, Laboratoire d'Aerologie.
 -- BLUEGENE - fix for not allowing jobs if all midplanes are drained and all
    blocks are in an error state.
 -- Avoid slurmctld abort due to bad pointer when setting an advanced
    reservation MAINT flag if it contains no nodes (only licenses).
 -- Fix bug when requeued batch job is scheduled to run on a different node
    zero, but attempts job launch on old node zero.
 -- Fix bug in step task distribution when nodes are not configured in numeric
    order. Patch from Hongjia Cao, NUDT.
 -- Fix for srun allocation when running within an existing allocation with
    the --exclude option and an --nnodes count small enough to remove more
    nodes. Patch from Phil Eckert, LLNL.
 -- Work around to handle certain combinations of glibc/kernel
    (i.e. glibc-2.14/Linux-3.1) to correctly open the pty of the slurmstepd
    as the job user. Patch from Mark Grondona, LLNL.
 -- Modify linking to include "-ldl" only when needed. Patch from Aleksej
    Saushev.
 -- Fix smap regression to display nodes that are drained or down correctly.
 -- Several bug fixes and performance improvements related to batch scripts
    containing very large numbers of arguments. Patches from Par Andersson,
    NSC.
 -- Fixed extremely hard to reproduce threading issue in assoc_mgr.

* Changes in SLURM 2.3.3
========================
 -- Fix task/cgroup plugin error when used with GRES. Patch by Alexander
    Bersenev (Institute of Mathematics and Mechanics, Russia).
 -- Permit pending job exceeding a partition limit to run if its QOS flag is
    modified to permit the partition limit to be exceeded. Patch from Bill
    Brophy, Bull.
 -- BLUEGENE - Fixed preemption issue.
 -- sacct search for jobs using filtering was ignoring wckey filter.
 -- Fixed issue with QOS preemption when adding new QOS.
 -- Fixed issue with comment field being used in a job finishing before it
    starts in accounting.
 -- Add slashes in front of derived exit code when modifying a job.
 -- Handle numeric suffix of "T" for terabyte units. Patch from John Thiltges,
    University of Nebraska-Lincoln.
 -- Prevent resetting a held job's priority when updating other job parameters.
    Patch from Alejandro Lucero Palau, BSC.
 -- Improve logic to import a user's environment. Needed with --get-user-env
    option used with Moab. Patch from Mark Grondona, LLNL.
 -- Fix bug in sview layout if node count less than configured grid_x_width.
 -- Modify PAM module to prefer to use SLURM library with same major release
    number that it was built with.
 -- Permit gres count configuration of zero.
 -- Fix race condition where sbcast command can result in deadlock of slurmd
    daemon. Patch by Don Albert, Bull.
 -- Fix bug in srun --multi-prog configuration file to avoid printing duplicate
    record error when "*" is used at the end of the file for the task ID.
 -- Let operators see reservation data even if "PrivateData=reservations" flag
    is set in slurm.conf. Patch from Don Albert, Bull.
 -- Added new sbatch option "--export-file" as needed for latest version of
    Moab. Patch from Phil Eckert, LLNL.
 -- Fix for sacct printing CPUTime(RAW) where the value is greater than a
    32-bit number.
 -- Fix bug in --switch option with topology resulting in bad switch count use.
    Patch from Alejandro Lucero Palau (Barcelona Supercomputer Center).
 -- Fix PrivateFlags bug when using Priority Multifactor plugin.  If using sprio
    all jobs would be returned even if the flag was set.
    Patch from Bill Brophy, Bull.
 -- Fix for possible invalid memory reference in slurmctld in job dependency
    logic. Patch from Carles Fenoy (Barcelona Supercomputer Center).
* Changes in SLURM 2.3.2
========================
 -- Add configure option of "--without-rpath" which builds SLURM tools without
    the rpath option, which will work if Munge and BlueGene libraries are in
    the default library search path and make system updates easier.
 -- Fixed issue where a job ending with ESLURMD_UID_NOT_FOUND or
    ESLURMD_GID_NOT_FOUND caused SLURM to be overzealous in treating a
    missing GID or UID as a fatal error.
 -- Backfill scheduling - Add SchedulerParameters configuration parameter of
    "bf_res" to control the resolution in the backfill scheduler's data about
    when jobs begin and end. Default value is 60 seconds (used to be 1 second).
 -- Cray - Remove the "family" specification from the GPU reservation request.
 -- Updated set_oomadj.c, replacing deprecated oom_adj reference with
    oom_score_adj.
 -- Fix resource allocation bug, generic resources allocation was ignoring the
    job's ntasks_per_node and cpus_per_task parameters. Patch from Carles
    Fenoy, BSC.
 -- Avoid orphan job step if slurmctld is down when a job step completes.
 -- Fix Lua link order, patch from Pär Andersson, NSC.
 -- Set SLURM_CPUS_PER_TASK=1 when user specifies --cpus-per-task=1.
 -- Fix for fatal error managing GRES. Patch by Carles Fenoy, BSC.
 -- Fixed race condition when using the DBD in accounting where if a job
    wasn't started at the time the eligible message was sent but started
    before the db_index was returned information like start time would be lost.
 -- Fix issue in accounting where normalized shares could be updated
    incorrectly when getting fairshare from the parent.
 -- Fixed issue when not enforcing associations but wanting QOS support: the
    default QOS on the cluster is now filled in correctly.
 -- Fix in select/cons_res for "fatal: cons_res: sync loop not progressing"
    with some configurations and job option combinations.
 -- BLUEGENE - Fixed issue with handling HTC modes and rebooting.
* Changes in SLURM 2.3.1
========================
 -- Do not remove the backup slurmctld's pid file when it assumes control, only
    when it actually shuts down. Patch from Andriy Grytsenko (Massive Solutions
    Limited).
 -- Avoid clearing a job's reason from JobHeldAdmin or JobHeldUser when it is
    otherwise updated using scontrol or sview commands. Patch based upon work
    by Phil Eckert (LLNL).
 -- BLUEGENE - Fix for changing the defined blocks in the bluegene.conf while
    jobs are running on blocks not in the new config.
 -- Many cosmetic modifications to eliminate warning messages from GCC version
    4.6 compiler.
 -- Fix for sview reservation tab when finding correct reservation.
 -- Fix for handling QOS limits per user on a reconfig of the slurmctld.
 -- Do not treat the absence of a gres.conf file as a fatal error on systems
    configured with GRES, but set GRES counts to zero.
 -- BLUEGENE - Update correctly the state in the reason of a block if an
    admin sets the state to error.
 -- BLUEGENE - handle reason of blocks in error more correctly between
    restarts of the slurmctld.
 -- BLUEGENE - Fix minor potential memory leak when setting block error reason.
 -- BLUEGENE - Fix so that jobs are not denied when running in Static/Overlap
    mode and the full system block is in an error state.
 -- Fix for accounting where your cluster isn't numbered in counting order
    (i.e. 1-9,0 instead of 0-9).  The bug would cause 'sacct -N nodename' to
    not give correct results on these systems.
 -- Fix to GRES allocation logic when resources are associated with specific
    CPUs on a node. Patch from Steve Trofinoff, CSCS.
 -- Fix bugs in sched/backfill with respect to QOS reservation support and job
    time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputer
    Center).
 -- BGQ - fix to set up corner correctly for sub block jobs.
 -- Major re-write of the CPU Management User and Administrator Guide (web
    page) by Martin Perry, Bull.
 -- BLUEGENE - When removing blocks that once existed from the system, cleanup
    of the old blocks now happens correctly.
 -- Prevent slurmctld crashing with configuration of MaxMemPerCPU=0.
 -- Prevent job hold by operator or account coordinator of his own job from
    being an Administrator Hold rather than User Hold by default.
 -- Cray - Fix for srun.pl parsing to avoid adding spaces between option and
    argument (e.g. "-N2" parsed properly without changing to "-N 2").
 -- Major updates to cgroup support by Mark Grondona (LLNL) and Matthieu
    Hautreux (CEA) and Sam Lang. Fixes timing problems with respect to the
    task_epilog. Allows cgroup mount point to be configurable. Added new
    configuration parameters MaxRAMPercent and MaxSwapPercent. Allow cgroup
    configuration parameters that are percentages to be floating point.
 -- Fixed issue where sview wasn't displaying correct nice value for jobs.
 -- Fixed issue where sview wasn't displaying correct min memory per node/cpu
    value for jobs.
 -- Disable some SelectTypeParameters for select/linear that aren't compatible.
 -- Move slurm_select_init to proper place to avoid loading multiple select
    plugins in the slurmd.
 -- BGQ - Include runjob_plugin.so in the bluegene rpm.
 -- Report correct job "Reason" if needed nodes are DOWN, DRAINED, or
    NOT_RESPONDING, "Resources" rather than "PartitionNodeLimit".
 -- BLUEGENE - Fixed issues with running on a sub-midplane system.
 -- Added some missing calls to allow older versions of SLURM to talk to newer.
 -- BGQ - allow steps to be run.
 -- Do not attempt to run HealthCheckProgram on powered down nodes. Patch from
    Ramiro Alba, Centre Tecnològic de Transferència de Calor, Spain.
* Changes in SLURM 2.3.0-2
==========================
 -- Fix for memory issue inside sview.
 -- Fix issue where if a job was pending and the slurmctld was restarted a
    variable wasn't initialized in the job structure making it so that job
    wouldn't run.
* Changes in SLURM 2.3.0
========================
 -- BLUEGENE - make sure we only set the jobinfo_select start_loc on a job
    when we are on a small block, not a regular one.
 -- BGQ - fix issue where not copying the correct amount of memory.
 -- BLUEGENE - fix clean start if jobs were running when the slurmctld was
    shutdown and then the system size changed.  This would probably only happen
    if you were emulating a system.
 -- Fix sview for calling a cray system from a non-cray system to get the
    correct geometry of the system.
 -- BLUEGENE - fix to correctly import previous version of block state file.
 -- BLUEGENE - handle loading better when doing a clean start with static
    blocks.
 -- Add sinfo format and sort option "%n" for NodeHostName and "%o" for
    NodeAddr.
 -- If a job is deferred due to partition limits, then re-test those limits
    after a partition is modified. Patch from Don Lipari.
 -- Fix bug which would crash slurmctld if job's owner (not root) tries to clear
    a job's licenses by setting value to "".
 -- Cosmetic fix for printing out debug info in the priority plugin.
 -- In sview when switching from a bluegene machine to a regular linux cluster