This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.
-- Move priority_sort_part_tier from slurmctld to libslurm to make it possible
to run the regression tests 24.* without changing that code since it links
directly to the priority plugin where that function isn't defined.
-- Fix issue where job time limits can increase to max walltime when updating
a job with scontrol.
-- Fix invalid protocol_version manipulation on big endian platforms causing
srun and sattach to fail.
-- Fix for QOS, Reservation and Alias env variables in srun.
-- mpi/pmi2 - Backport 6a702158b49c4 from 18.08 to avoid dangerous detached
thread.
-- When allowing heterogeneous steps make sure we copy all the options to
avoid copying strings that may be overwritten.
-- Print correctly when sh5util finds an empty file.
-- Fix sh5util to not seg fault on exit.
-- Fix sh5util to check correctly for H5free_memory.
-- Adjust OOM monitoring function in task/cgroup to prevent problems in
regression suite from leaked file descriptors.
-- Fix issue with gres when defined with a type and no count
(i.e. gres=gpu/tesla) it would get a count of 0.
-- Allow sstat to talk to slurmd's that are new in protocol version.
-- Permit database names over 33 characters in accounting_storage/mysql.
-- Fix srun segfault caused by invalid memory reads on the env.
-- Fix segfault on job arrays when starting controller without dbd up.
-- Fix proper alignment of clauses when determining if more nodes are needed
for an allocation.
-- Fix race condition when canceling a federation job that just started
running.
-- Prevent extra resources from being allocated when combining certain flags.
-- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing
when using --hint=nomultithread.
-- Fix left over socket file when step is ending and using pmi2 with
%n or %h in the spool dir.
-- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
-- Fix sacct to not print huge reserve times when the job was never eligible.
* Changes in Slurm 17.11.9-2
============================
-- Fix printing of node state "drain + reboot" (and other node state flags).
-- Fix invalid read (segfault) when sorting multi-partition jobs.
-- Move several new error() messages to debug() to keep them out of users'
srun output.
* Changes in Slurm 17.11.9
==========================
-- Fix segfault in slurmctld when a job's node bitmap is NULL during a
scheduling cycle. Primarily caused by EnforcePartLimits=ALL.
-- Remove erroneous unlock in acct_gather_energy/ipmi.
-- Enable support for hwloc version 2.0.1.
-- Fix socket communication issue that can lead to lost task completion
messages, which will cause a permanently stuck srun process.
-- Handle creation of TMPDIR if environment variable is set or changed in
a task prolog script.
-- Avoid node layout fragmentation if running with a fixed CPU count but
without Sockets and CoresPerSocket defined.
-- burst_buffer/cray - Fix datawarp swap default pool overriding jobdw.
-- Fix incorrect job priority assignment for multi-partition job with
different PriorityTier settings on the partitions.
-- Fix sinfo to print correct node state.
* Changes in Slurm 17.11.8
==========================
-- Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path.
-- Do not allocate nodes that were marked down due to the node not responding
by ResumeTimeout.
-- task/cray plugin - search for "mems" cgroup information in the file
"cpuset.mems" then fall back to the file "mems".
-- Fix ipmi profile debug uninitialized variable.
-- Improve detection of Lua package on older RHEL distributions.
-- MYSQL: Fix issue not handling all fields when loading an archive dump.
-- Allow a job_submit plugin to change the admin_comment field during
job_submit_plugin_modify().
-- job_submit/lua - fix access into reservation table.
-- MySQL - Prevent deadlock caused by archive logic locking reads.
-- Don't enforce MaxQueryTimeRange when requesting specific jobs.
-- Modify --test-only logic to properly support jobs submitted to more than
one partition.
-- Prevent slurmctld from aborting when attempting to set a non-existent
qos as def_qos_id.
-- Add new job dependency type of "afterburstbuffer". The pending job will be
delayed until the first job completes execution and its burst buffer
stage-out is completed.
-- Reorder proctrack/task plugin load in the slurmstepd to match that of slurmd
and avoid the race condition that calling task before proctrack can introduce.
-- Prevent reboot of a busy KNL node when requesting inactive features.
-- Revert the change in behavior when requesting memory per cpu/node that was
introduced in 17.11.7.
-- Fix to reinitialize previously adjusted job members to their original value
when validating the job memory in multi-partition requests.
-- Fix _step_signal() always returning SLURM_SUCCESS.
-- Combine active and available node feature change logs on one line rather
than one line per node for performance reasons.
-- Prevent occasionally leaking freezer cgroups.
-- Fix potential segfault when closing the mpi/pmi2 plugin.
-- Fix issues with --exclusive=[user|mcs] to work correctly
with preemption or when job requests a specific list of hosts.
-- Make code compile with hdf5 1.10.2+
-- mpi/pmix: Fix canceling of collectives.
-- SlurmDBD: improve error message handling on archive load failure.
-- Fix incorrect locking when deleting reservations.
-- Fix incorrect locking when setting up the power save module.
-- Fix setting format output length for squeue when showing array jobs.
-- Fix printing out of --hint options in sbatch, salloc --help.
-- Prevent possible divide by zero in _validate_time_limit().
-- Add Delegate=yes to the slurmd.service file to prevent systemd from
interfering with the jobs' cgroup hierarchies.
-- Change the backlog argument to the listen() syscall within srun to 4096
to match elsewhere in the code, and avoid communication problems at scale.
* Changes in Slurm 17.11.7
==========================
-- Fix for possible slurmctld daemon abort with NULL pointer.
-- Fix different issues when requesting memory per cpu/node.
-- PMIx - override default paths at configure time if --with-pmix is used.
-- Have sprio display jobs before eligible time when
PriorityFlags=ACCRUE_ALWAYS is set.
-- Make sure locks are always in place when calling _post_qos_list().
-- Notify srun and ctld when unkillable stepd exits.
-- Fix slurmstepd deadlock in stepd cleanup caused by race condition in
the jobacct_gather fini() interfaces introduced in 17.11.6.
-- Fix slurmstepd deadlock in PMIx startup.
-- task/cgroup - fix invalid free() if the hwloc library does not return a
string as expected.
-- Fix insecure handling of job requested gid field. CVE-2018-10995.
-- Add --without x11 option to rpmbuild in slurm.spec.
* Changes in Slurm 17.11.6
==========================
-- CRAY - Add slurmsmwd to the contribs/cray dir.
-- sview - fix crash when closing any search dialog.
-- Fix initialization of variable in stepd when using native x11.
-- Fix reading slurm_io_init_msg to handle partial messages.
-- Fix scontrol create res segfault when wrong user/account parameters given.
-- Fix documentation for sacct on parameter -X (--allocations)
-- Change TRES Weights debug messages to debug3.
-- FreeBSD - assorted fixes to restore build.
-- Fix to avoid tracking environment variables from unrelated jobs.
-- PMIX - Added the direct connect authentication.
When upgrading this may cause issues with jobs using pmix starting on mixed
slurmstepd versions where some are less than 17.11.6.
-- Prevent the backup slurmctld from losing the active/available node
-- Add documentation on fixing IDLE*+POWER caused by capmc being stuck on Cray
systems.
-- Fix missing mutex unlock when prolog is failing on a node, leading to a
hung slurmd.
-- Fix locking around Cray CCM prolog/epilog.
-- Fix issue incorrectly setting a job time_start to 0 while requeueing.
-- smail - remove stray '-s' from mail subject line.
-- srun - prevent segfault if ClusterName setting is unset but
SLURM_WORKING_CLUSTER environment variable is defined.
-- In configurator.html web pages change default configuration from
task/none to task/affinity plugin and from select/linear plugin to
select/cons_res plus CR_Core.
-- Allow jobs to run beyond a FLEX reservation end time.
-- Fix problem with job state_reason being wrongly set to Reservation.
-- Prevent bit_ffs() from returning value out of bitmap range.
-- Improve performance of 'squeue -u' when PrivateData=jobs is enabled.
-- Make UnavailableNodes value in job reason be correct for each job.
-- Fix 'squeue -o %s' on Cray systems.
-- Fix incorrect error thrown when cancelling part of a job array.
-- Fix error code and scheduling problem for --exclusive=[user|mcs].
-- Fix build when lz4 is in a non-standard location.
-- Be able to force power_down of cloud node even if in power_save state.
-- Allow cloud nodes to be recognized in Slurm when booted out of band.
-- Fix race condition in _pack_job_gres() when it is called multiple times.
-- Increase duration of "sleep" command used to keep extern step alive.
-- Remove unsafe usage of pthread_cancel in slurmstepd that can lead to
deadlock in glibc.
-- Fix total TRES Billing on partitions.
-- Don't tear down a BB if a node fails and --no-kill or resize of a job
happens.
-- Remove unsafe usage of pthread_cancel in pmix plugin that can lead to
deadlock in glibc.
-- Fix fatal in controller when loading completed trigger.
-- Ignore reservation overlap at submission time.
-- Add documentation for GRES type model and QOS limits.
-- slurmd - fix ABRT on SIGINT after reconfigure with MemSpecLimit set.
-- PMIx - move two error messages on retry to debug level, and only display
the error after the retry count has been exceeded.
-- Increase number of tries when sending responses to srun.
-- Fix checkpointing requeued/completing jobs in a bad state which caused a
segfault on restart.
-- Fix srun on ppc64 platforms.
-- Prevent slurmd from starting steps if the Prolog returns an error when using
PrologFlags=alloc.
-- priority/multifactor - prevent segfault running sprio if a partition has
just been deleted and PriorityFlags=CALCULATE_RUNNING is turned on.
-- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code value.
-- job_submit/lua - print an error if the script calls log.user in
job_modify() instead of returning it to the next submitted job erroneously.