This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 17.11.10
===========================
 -- Move priority_sort_part_tier from slurmctld to libslurm to make it possible
    to run the regression tests 24.* without changing that code since it links
    directly to the priority plugin where that function isn't defined.
 -- Fix issue where job time limits can increase to max walltime when updating
    a job with scontrol.
 -- Fix invalid protocol_version manipulation on big endian platforms causing
    srun and sattach to fail.
 -- Fix for QOS, Reservation and Alias env variables in srun.
 -- mpi/pmi2 - Backport 6a702158b49c4 from 18.08 to avoid dangerous detached
 -- When allowing heterogeneous steps make sure we copy all the options to
    avoid copying strings that may be overwritten.
 -- Print correctly when sh5util finds an empty file.
 -- Fix sh5util to not seg fault on exit.
 -- Fix sh5util to check correctly for H5free_memory.
 -- Adjust OOM monitoring function in task/cgroup to prevent problems in
    regression suite from leaked file descriptors.
 -- Fix issue with gres when defined with a type and no count
    (i.e. gres=gpu/tesla) it would get a count of 0.
 -- Allow sstat to talk to slurmd's that are new in protocol version.
 -- Permit database names over 33 characters in accounting_storage/mysql.
 -- Fix negative values when profiling.
 -- Fix srun segfault caused by invalid memory reads on the env.
 -- Fix segfault on job arrays when starting controller without dbd up.
 -- Fix pmi2 to build with gcc 8.0+.
 -- Fix proper alignment of clauses when determining if more nodes are needed
    for an allocation.
 -- Fix race condition when canceling a federation job that just started
    running.
 -- Prevent extra resources from being allocated when combining certain flags.
 -- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing
    when using --hint=nomultithread.
 -- Fix left over socket file when step is ending and using pmi2 with
    %n or %h in the spool dir.
 -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
 -- Fix sacct to not print huge reserve times when the job was never eligible.
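
The big endian protocol_version entry above concerns a class of bug that is
easy to reproduce in miniature. The following is a minimal sketch, not Slurm's
actual code: it assumes a 16-bit version field and uses a made-up value
(EXAMPLE_VERSION). The helper names are hypothetical. It shows that packing
with an explicit network byte order gives the same bytes on every host.

```python
import struct

# Hypothetical 16-bit version value, chosen only for illustration.
EXAMPLE_VERSION = 0x2000


def pack_protocol_version(version: int) -> bytes:
    """Encode a 16-bit version in network (big-endian) byte order."""
    return struct.pack("!H", version)


def unpack_protocol_version(data: bytes) -> int:
    """Decode the first two bytes as a network-order 16-bit version."""
    (version,) = struct.unpack("!H", data[:2])
    return version


wire = pack_protocol_version(EXAMPLE_VERSION)
round_trip = unpack_protocol_version(wire)

# Packing in little-endian order yields different bytes, which is what a
# peer would misread if one side skipped the byte-order conversion.
little_endian = struct.pack("<H", EXAMPLE_VERSION)
```

Because the byte order is stated explicitly ("!"), the result does not depend
on the architecture of the machine running the code.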

* Changes in Slurm 17.11.9-2
============================
 -- Fix printing of node state "drain + reboot" (and other node state flags).
 -- Fix invalid read (segfault) when sorting multi-partition jobs.
 -- Move several new error() messages to debug() to keep them out of users'
    srun output.
* Changes in Slurm 17.11.9
==========================
 -- Fix segfault in slurmctld when a job's node bitmap is NULL during a
    scheduling cycle.  Primarily caused by EnforcePartLimits=ALL.
 -- Remove erroneous unlock in acct_gather_energy/ipmi.
 -- Enable support for hwloc version 2.0.1.
 -- Fix 'srun -q' (--qos) option handling.
 -- Fix socket communication issue that can lead to lost task completion
    messages, which will cause a permanently stuck srun process.
 -- Handle creation of TMPDIR if environment variable is set or changed in
    a task prolog script.
 -- Avoid node layout fragmentation if running with a fixed CPU count but
    without Sockets and CoresPerSocket defined.
 -- burst_buffer/cray - Fix datawarp swap default pool overriding jobdw.
 -- Fix incorrect job priority assignment for multi-partition job with
    different PriorityTier settings on the partitions.
 -- Fix sinfo to print correct node state.
* Changes in Slurm 17.11.8
==========================
 -- Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path.
 -- Do not allocate nodes that were marked down due to the node not responding
    by ResumeTimeout.
 -- task/cray plugin - search for "mems" cgroup information in the file
    "cpuset.mems" then fall back to the file "mems".
 -- Fix ipmi profile debug uninitialized variable.
 -- Improve detection of Lua package on older RHEL distributions.
 -- PMIx: fixed the direct connect inline msg sending.
 -- MYSQL: Fix issue not handling all fields when loading an archive dump.
 -- Allow a job_submit plugin to change the admin_comment field during
    job_submit_plugin_modify().
 -- job_submit/lua - fix access into reservation table.
 -- MySQL - Prevent deadlock caused by archive logic locking reads.
 -- Don't enforce MaxQueryTimeRange when requesting specific jobs.
 -- Modify --test-only logic to properly support jobs submitted to more than
    one partition.
 -- Prevent slurmctld from abort when attempting to set non-existing
 -- Add new job dependency type of "afterburstbuffer". The pending job will be
    delayed until the first job completes execution and its burst buffer
    stage-out is completed.
 -- Reorder proctrack/task plugin load in the slurmstepd to match that of slurmd
    and avoid the race condition that calling task before proctrack can
    introduce.
 -- Prevent reboot of a busy KNL node when requesting inactive features.
 -- Revert to previous behavior when requesting memory per cpu/node introduced
    in 17.11.7.
 -- Fix to reinitialize previously adjusted job members to their original value
    when validating the job memory in multi-partition requests.
 -- Fix _step_signal() from always returning SLURM_SUCCESS.
 -- Combine active and available node feature change logs on one line rather
    than one line per node for performance reasons.
 -- Prevent occasionally leaking freezer cgroups.
 -- Fix potential segfault when closing the mpi/pmi2 plugin.
 -- Fix issues with --exclusive=[user|mcs] to work correctly
    with preemption or when job requests a specific list of hosts.
 -- Make code compile with hdf5 1.10.2+.
 -- mpi/pmix: Fixed the collectives canceling.
 -- SlurmDBD: improve error message handling on archive load failure.
 -- Fix incorrect locking when deleting reservations.
 -- Fix incorrect locking when setting up the power save module.
 -- Fix setting format output length for squeue when showing array jobs.
 -- Add xstrstr function.
 -- Fix printing out of --hint options in sbatch, salloc --help.
 -- Prevent possible divide by zero in _validate_time_limit().
 -- Add Delegate=yes to the slurmd.service file to prevent systemd from
    interfering with the jobs' cgroup hierarchies.
 -- Change the backlog argument to the listen() syscall within srun to 4096
    to match elsewhere in the code, and avoid communication problems at scale.
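
The listen() backlog change above can be illustrated with a short sketch. This
is not srun's code: the helper name open_listener and the loopback address are
assumptions made for the example; only the backlog value 4096 comes from the
entry.

```python
import socket

# Backlog value taken from the changelog entry above.
BACKLOG = 4096


def open_listener(host: str = "127.0.0.1", port: int = 0,
                  backlog: int = BACKLOG) -> socket.socket:
    """Open a TCP listening socket with the requested backlog.

    Port 0 asks the kernel to pick a free ephemeral port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock


listener = open_listener()
bound_port = listener.getsockname()[1]
listener.close()
```

The kernel may silently cap the requested backlog (on Linux, at
net.core.somaxconn), so a large value here is a request rather than a
guarantee.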
* Changes in Slurm 17.11.7
==========================
 -- Fix for possible slurmctld daemon abort with NULL pointer.
 -- Fix different issues when requesting memory per cpu/node.
 -- PMIx - override default paths at configure time if --with-pmix is used.
 -- Have sprio display jobs before eligible time when
    PriorityFlags=ACCRUE_ALWAYS is set.
 -- Make sure locks are always in place when calling _post_qos_list().
 -- Notify srun and ctld when unkillable stepd exits.
 -- Fix slurmstepd deadlock in stepd cleanup caused by race condition in
    the jobacct_gather fini() interfaces introduced in 17.11.6.
 -- Fix slurmstepd deadlock in PMIx startup.
 -- task/cgroup - fix invalid free() if the hwloc library does not return a
    string as expected.
 -- Fix insecure handling of job requested gid field. CVE-2018-10995.
 -- Add --without x11 option to rpmbuild in slurm.spec.
* Changes in Slurm 17.11.6
==========================
 -- CRAY - Add slurmsmwd to the contribs/cray dir.
 -- sview - fix crash when closing any search dialog.
 -- Fix initialization of variable in stepd when using native x11.
 -- Fix reading slurm_io_init_msg to handle partial messages.
 -- Fix scontrol create res segfault when wrong user/account parameters given.
 -- Fix documentation for sacct on parameter -X (--allocations).
 -- Change TRES Weights debug messages to debug3.
 -- FreeBSD - assorted fixes to restore build.
 -- Fix for not tracking environment variables from unrelated different jobs.
 -- PMIX - Added the direct connect authentication.
    When upgrading this may cause issues with jobs using pmix starting on mixed
    slurmstepd versions where some are less than 17.11.6.
 -- Prevent the backup slurmctld from losing the active/available node
    features list on takeover.
 -- Add documentation on fixing IDLE*+POWER nodes caused by capmc being stuck
    on Cray systems.
 -- Fix missing mutex unlock when prolog is failing on a node, leading to a
    hung slurmd.
 -- Fix locking around Cray CCM prolog/epilog.
 -- Add missing fed_mgr read locks.
 -- Fix issue incorrectly setting a job time_start to 0 while requeueing.
 -- smail - remove stray '-s' from mail subject line.
 -- srun - prevent segfault if ClusterName setting is unset but
    SLURM_WORKING_CLUSTER environment variable is defined.
 -- In configurator.html web pages change default configuration from
    task/none to task/affinity plugin and from select/linear plugin to
    select/cons_res plus CR_Core.
 -- Allow jobs to run beyond a FLEX reservation end time.
 -- Fix problem with job state_reason being wrongly set to Reservation.
 -- Prevent bit_ffs() from returning value out of bitmap range.
 -- Improve performance of 'squeue -u' when PrivateData=jobs is enabled.
 -- Make UnavailableNodes value in job reason be correct for each job.
 -- Fix 'squeue -o %s' on Cray systems.
 -- Fix incorrect error thrown when cancelling part of a job array.
 -- Fix error code and scheduling problem for --exclusive=[user|mcs].
 -- Fix build when lz4 is in a non-standard location.
 -- Be able to force power_down of cloud node even if in power_save state.
 -- Allow cloud nodes to be recognized in Slurm when booted out of band.
 -- Fix race condition in _pack_job_gres() when it is called multiple times.
 -- Increase duration of "sleep" command used to keep extern step alive.
 -- Remove unsafe usage of pthread_cancel in slurmstepd that can lead to
    deadlock in glibc.
 -- Fix total TRES Billing on partitions.
 -- Don't tear down a BB if a node fails and --no-kill or resize of a job
    happens.
 -- Remove unsafe usage of pthread_cancel in pmix plugin that can lead to
    deadlock in glibc.
 -- Fix fatal in controller when loading completed trigger.
 -- Ignore reservation overlap at submission time.
 -- GRES type model and QOS limits documentation added.
 -- slurmd - fix ABRT on SIGINT after reconfigure with MemSpecLimit set.
 -- PMIx - move two error messages on retry to debug level, and only display
    the error after the retry count has been exceeded.
 -- Increase number of tries when sending responses to srun.
 -- Fix checkpointing requeued/completing jobs in a bad state which caused a
    segfault on restart.
 -- Fix srun on ppc64 platforms.
 -- Prevent slurmd from starting steps if the Prolog returns an error when using
    PrologFlags=alloc.
 -- priority/multifactor - prevent segfault running sprio if a partition has
    just been deleted and PriorityFlags=CALCULATE_RUNNING is turned on.
 -- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code value.
 -- job_submit/lua - print an error if the script calls log.user in
    job_modify() instead of returning it to the next submitted job erroneously.
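
The bit_ffs() entry above describes a find-first-set result that could fall
outside the bitmap. The sketch below is an illustrative reimplementation, not
Slurm's C function: it models the bitmap as a Python integer with an explicit
size (nbits), and returning -1 when no bit is set in range is this sketch's own
convention.

```python
def bit_ffs(bitmap: int, nbits: int) -> int:
    """Return the index of the lowest set bit in [0, nbits), or -1 if none.

    Bits at positions >= nbits are ignored, so the result can never point
    outside the bitmap.
    """
    if nbits <= 0:
        return -1
    # Discard any stray bits beyond the declared size of the bitmap.
    masked = bitmap & ((1 << nbits) - 1)
    if masked == 0:
        return -1
    # masked & -masked isolates the lowest set bit; bit_length gives its
    # one-based position.
    return (masked & -masked).bit_length() - 1


in_range = bit_ffs(0b1000, 8)
out_of_range = bit_ffs(0b10000, 4)
```

Here in_range is 3, while out_of_range is -1 because the only set bit lies at
index 4, past the end of a 4-bit bitmap.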