NEWS

This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.

* Changes in SLURM 2.4.3
========================
 -- Accounting - Fix so complete 32 bit numbers can be put in for a priority.
 -- cgroups - fix if initial directory is non-existent SLURM creates it
    correctly.  Before the errno wasn't being checked correctly
 -- BGQ - fixed srun when only requesting a task count and not a node count
    to operate the same way salloc or sbatch did and assign a task per cpu
    by default instead of task per node.

* Changes in SLURM 2.4.2
========================
 -- BLUEGENE - Correct potential deadlock issue when hardware goes bad and
    there are jobs running on that hardware.
 -- If job is submitted to more than one partition, it's partition pointer can
    be set to an invalid value. This can result in the count of CPUs allocated
    on a node being bad, resulting in over- or under-allocation of its CPUs.
    Patch by Carles Fenoy, BSC.
 -- Fix bug in task layout with select/cons_res plugin and --ntasks-per-node
    option. Patch by Martin Perry, Bull.
 -- BLUEGENE - remove race condition where if a block is removed while waiting
    for a job to finish on it the number of unused cpus wasn't updated
    correctly.
 -- BGQ - make sure we have a valid block when creating or finishing a step
    allocation.
 -- BLUEGENE - If a large block (> 1 midplane) is in error and underlying
    hardware is marked bad remove the larger block and create a block over
    just the bad hardware making the other hardware available to run on.
 -- BLUEGENE - Handle job completion correctly if an admin removes a block
    where other blocks on an overlapping midplane are running jobs.
 -- BLUEGENE - correctly remove running jobs when freeing a block.
 -- BGQ - correct logic to place multiple (< 1 midplane) steps inside a
    multi midplane block allocation.
 -- BGQ - Make it possible for a multi midplane allocation to run on more
    than 1 midplane but not the entire allocation.
 -- BGL - Fix for syncing users on block from Tim Wickberg
 -- Fix initialization of protocol_version for some messages to make sure it
    is always set when sending or receiving a message.
 -- Reset backfilled job counter only when explicitly cleared using scontrol.
    Patch from Alejandro Lucero Palau, BSC.
 -- BLUEGENE - Fix for handling blocks when a larger block will not free and
    while it is attempting to free underlying hardware is marked in error
    making small blocks overlapping with the freeing block.  This only
    applies to dynamic layout mode.
 -- Cray and BlueGene - Do not treat lack of usable front-end nodes when
    slurmctld deamon starts as a fatal error. Also preserve correct front-end
    node for jobs when there is more than one front-end node and the slurmctld
    daemon restarts.
 -- Correct parsing of srun/sbatch input/output/error file names so that only
    the name "none" is mapped to /dev/null and not any file name starting
    with "none" (e.g. "none.o").
 -- BGQ - added version string to the load of the runjob_mux plugin to verify
    the current plugin has been loaded when using runjob_mux_refresh_config
 -- CGROUPS - Use system mount/umount function calls instead of doing fork
    exec of mount/umount from Janne Blomqvist.
 -- BLUEGENE - correct start time setup when no jobs are blocking the way
    from Mark Nelson
 -- Fixed sacct --state=S query to return information about suspended jobs
    current or in the past.
 -- FRONTEND - Made error warning more apparent if a frontend node isn't
    configured correctly.
 -- BGQ - update documentation about runjob_mux_refresh_config which works
    correctly as of IBM driver V1R1M1 efix 008.

* Changes in SLURM 2.4.1
========================
 -- Fix bug for job state change from 2.3 -> 2.4 job state can now be preserved
    correctly when transitioning.  This also applies for 2.4.0 -> 2.4.1, no
    state will be lost. (Thanks to Carles Fenoy)

* Changes in SLURM 2.4.0
========================
 -- Cray - Improve support for zero compute note resource allocations.
    Partition used can now be configured with no nodes nodes.
 -- BGQ - make it so srun -i<taskid> works correctly.
 -- Fix parse_uint32/16 to complain if a non-digit is given.
 -- Add SUBMITHOST to job state passed to Moab vial sched/wiki2. Patch by Jon
    Bringhurst (LANL).
 -- BGQ - Fix issue when running with AllowSubBlockAllocations=Yes without
    compiling with --enable-debug
 -- Modify scontrol to require "-dd" option to report batch job's script. Patch
    from Don Albert, Bull.
 -- Modify SchedulerParamters option to match documentation: "bf_res="
    changed to "bf_resolution=". Patch from Rod Schultz, Bull.
 -- Fix bug that clears job pending reason field. Patch fron Don Lipari, LLNL.
 -- In etc/init.d/slurm move check for scontrol after sourcing
    /etc/sysconfig/slurm. Patch from Andy Wettstein, University of Chicago.
 -- Fix in scheduling logic that can delay jobs with min/max node counts.
 -- BGQ - fix issue where if a step uses the entire allocation and then
    the next step in the allocation only uses part of the allocation it gets
    the correct cnodes.
 -- BGQ - Fix checking for IO on a block with new IBM driver V1R1M1 previous
    function didn't always work correctly.
 -- BGQ - Fix issue when a nodeboard goes down and you want to combine blocks
    to make a larger small block and are running with sub-blocks.
 -- BLUEGENE - Better logic for making small blocks around bad nodeboard/card.
 -- BGQ - When using an old IBM driver cnodes that go into error because of
    a job kill timeout aren't always reported to the system.  This is now
    handled by the runjob_mux plugin.
 -- BGQ - Added information on how to setup the runjob_mux to run as SlurmUser.
 -- Improve memory consumption on step layouts with high task count.
 -- BGQ - quiter debug when the real time server comes back but there are
    still messages we find when we poll but haven't given it back to the real
    time yet.
 -- BGQ - fix for if a request comes in smaller than the smallest block and
    we must use a small block instead of a shared midplane block.
 -- Fix issues on large jobs (>64k tasks) to have the correct counter type when
    packing the step layout structure.
 -- BGQ - fix issue where if a user was asking for tasks and ntasks-per-node
    but not node count the node count is correctly figured out.
 -- Move logic to always use the 1st alphanumeric node as the batch host for
    batch jobs.
 -- BLUEGENE - fix race condition where if a nodeboard/card goes down at the
    same time a block is destroyed and that block just happens to be the
    smallest overlapping block over the bad hardware.
 -- Fix bug when querying accounting looking for a job node size.
 -- BLUEGENE - fix possible race condition if cleaning up a block and the
    removal of the job on the block failed.
 -- BLUEGENE - fix issue if a cable was in an error state make it so we can
    check if a block is still makable if the cable wasn't in error.
 -- Put nodes names in alphabetic order in node table.
 -- If preempted job should have a grace time and preempt mode is not cancel
    but job is going to be canceled because it is interactive or other reason
    it now receives the grace time.
 -- BGQ - Modified documents to explain new plugin_flags needed in bg.properties
    in order for the runjob_mux to run correctly.
 -- BGQ - change linking from libslurm.o to libslurmhelper.la to avoid warning.

* Changes in SLURM 2.4.0.rc1
=============================
 -- Improve task binding logic by making fuller use of HWLOC library,
    especially with respect to Opteron 6000 series processors. Work contributed
    by Komoto Masahiro.
 -- Add new configuration parameter PriorityFlags, based upon work by
    Carles Fenoy (Barcelona Supercomputer Center).
 -- Modify the step completion RPC between slurmd and slurmstepd in order to
    eliminate a possible deadlock. Based on work by Matthieu Hautreux, CEA.
 -- Change the owner of slurmctld and slurmdbd log files to the appropriate
    user. Without this change the files will be created by and owned by the
    user starting the daemons (likely user root).
 -- Reorganize the slurmstepd logic in order to better support NFS and
    Kerberos credentials via the AUKS plugin. Work by Matthieu Hautreux, CEA.
 -- Fix bug in allocating GRES that are associated with specific CPUs. In some
    cases the code allocated first available GRES to job instead of allocating
    GRES accessible to the specific CPUs allocated to the job.
 -- spank: Add callbacks in slurmd: slurm_spank_slurmd_{init,exit}
    and job epilog/prolog: slurm_spank_job_{prolog,epilog}
 -- spank: Add spank_option_getopt() function to api
 -- Change resolution of switch wait time from minutes to seconds.
 -- Added CrpCPUMins to the output of sshare -l for those using hard limit
    accounting.  Work contributed by Mark Nelson.
 -- Added mpi/pmi2 plugin for complete support of pmi2 including acquiring
    additional resources for newly launched tasks. Contributed by Hongjia Cao,
    NUDT.
 -- BGQ - fixed issue where if a user asked for a specific node count and more
    tasks than possible without overcommit the request would be allowed on more
    nodes than requested.
 -- Add support for new SchedulerParameters of bf_max_job_user, maximum number
    of jobs to attempt backfilling per user. Work by Bjørn-Helge Mevik,
    University of Oslo.
 -- BLUEGENE - fixed issue where MaxNodes limit on a partition only limited
    larger than midplane jobs.
 -- Added cpu_run_min to the output of sshare --long.  Work contributed by
    Mark Nelson.
 -- BGQ - allow regular users to resolve Rack-Midplane to AXYZ coords.
 -- Add sinfo output format option of "%R" for partition name without "*"
    appended for default partition.
 -- Cray - Add support for zero compute note resource allocation to run batch
    script on front-end node with no ALPS reservation. Useful for pre- or post-
    processing.
 -- Support for cyclic distribution of cpus in task/cgroup plugin from Martin
    Perry, Bull.
 -- GrpMEM limit for QOSes and associations added Patch from Bjørn-Helge Mevik,
    University of Oslo.
 -- Various performance improvements for up to 500% higher throughput depending
    upon configuration. Work supported by the Oak Ridge National Laboratory
    Extreme Scale Systems Center.
 -- Added jobacct_gather/cgroup plugin.  It is not advised to use this in
    production as it isn't currently complete and doesn't provide an equivalent
    substitution for jobacct_gather/linux yet. Work by Martin Perry, Bull.

* Changes in SLURM 2.4.0.pre4
=============================
 -- Add logic to cache GPU file information (bitmap index mapping to device
    file number) in the slurmd daemon and transfer that information to the
    slurmstepd whenever a job step is initiated. This is needed to set the
    appropriate CUDA_VISIBLE_DEVICES environment variable value when the
    devices are not in strict numeric order (e.g. some GPUs are skipped).
    Based upon work by Nicolas Bigaouette.
 -- BGQ - Remove ability to make a sub-block with a geometry with one or more
    of it's dimensions of length 3.  There is a limitation in the IBM I/O
    subsystem that is problematic with multiple sub-blocks with a dimension
    of length 3, so we will disallow them to be able to be created.  This
    mean you if you ask the system for an allocation of 12 c-nodes you will
    be given 16.  If this is ever fix in BGQ you can remove this patch.
 -- BLUEGENE - Better handling blocks that go into error state or deallocate
    while jobs are running on them.
 -- BGQ - fix for handling mix of steps running at same time some of which