This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.
* Changes in Slurm 17.11.6
==========================
-- CRAY - Add slurmsmwd to the contribs/cray dir.
-- sview - fix crash when closing any search dialog.
-- Fix initialization of variable in stepd when using native x11.
-- Fix reading slurm_io_init_msg to handle partial messages.
-- Fix scontrol create res segfault when wrong user/account parameters given.
-- Fix documentation for sacct on parameter -X (--allocations)
-- Change TRES Weights debug messages to debug3.
-- FreeBSD - assorted fixes to restore build.
-- Fix for not tracking environment variables from unrelated jobs.
-- PMIX - Added the direct connect authentication.
When upgrading this may cause issues with jobs using pmix starting on mixed
slurmstepd versions where some are less than 17.11.6.
-- Prevent the backup slurmctld from losing the active/available node
-- Add documentation on fixing the IDLE*+POWER state caused by capmc being
stuck on Cray systems.
-- Fix missing mutex unlock when prolog is failing on a node, leading to a
hung slurmd.
-- Fix locking around Cray CCM prolog/epilog.
-- Fix issue incorrectly setting a job time_start to 0 while requeueing.
-- smail - remove stray '-s' from mail subject line.
-- srun - prevent segfault if ClusterName setting is unset but
SLURM_WORKING_CLUSTER environment variable is defined.
-- In configurator.html web pages change default configuration from
task/none to task/affinity plugin and from select/linear plugin to
select/cons_res plus CR_Core.
-- Allow jobs to run beyond a FLEX reservation end time.
-- Fix problem with job state_reason being wrongly set to Reservation.
-- Prevent bit_ffs() from returning value out of bitmap range.
-- Improve performance of 'squeue -u' when PrivateData=jobs is enabled.
-- Make UnavailableNodes value in job reason be correct for each job.
-- Fix 'squeue -o %s' on Cray systems.
-- Fix incorrect error thrown when cancelling part of a job array.
-- Fix error code and scheduling problem for --exclusive=[user|mcs].
-- Fix build when lz4 is in a non-standard location.
-- Be able to force power_down of cloud node even if in power_save state.
-- Allow cloud nodes to be recognized in Slurm when booted out of band.
-- Fix race condition in _pack_job_gres() when it is called multiple times.
-- Increase duration of "sleep" command used to keep extern step alive.
-- Remove unsafe usage of pthread_cancel in slurmstepd that can lead to
deadlock in glibc.
-- Fix total TRES Billing on partitions.
-- Don't tear down a BB if a node fails and --no-kill or resize of a job
happens.
-- Remove unsafe usage of pthread_cancel in pmix plugin that can lead to
deadlock in glibc.
* Changes in Slurm 17.11.5
==========================
-- Fix cloud nodes getting stuck in DOWN+POWER_UP+NO_RESPOND state after not
responding by ResumeTimeout.
-- Add job's array_task_cnt and user_name along with partitions
[max|def]_mem_per_[cpu|node], max_cpus_per_node, and max_share with the
SHARED_FORCE definition to the job_submit/lua plugin.
-- srun - fix for SLURM_JOB_NUM_NODES env variable assignment.
-- sacctmgr - fix runaway jobs identification.
-- Fix to always set the correct status on job update in mysql.
-- Fix issue where running with an association manager cache (slurmdbd was
down when slurmctld was started) could lose QOS usage information.
-- CRAY - Fix spec file to work correctly.
-- Set scontrol exit code to 1 if attempting to update a node state to DRAIN
or DOWN without specifying a reason.
-- Fix race condition when running with an association manager cache
(slurmdbd was down when slurmctld was started).
-- Print out missing SLURM_PERSIST_INIT slurmdbd message type.
-- Fix two build errors related to use of the O_CLOEXEC flag with older glibc.
-- Add Google Cloud Platform integration scripts into contribs directory.
-- Fix minor potential memory leak in backfill plugin.
-- Add missing node flags (maint/power/etc) to node states.
-- Fix issue where job time limits may end up at 1 minute when using the
NoReserve flag on their QOS.
-- Fix security issue in accounting_storage/mysql plugin by always escaping
strings within the slurmdbd. CVE-2018-7033.
-- Soften messages about best_fit topology to debug2 to avoid alarm.
-- Fix issue in sreport reservation utilization report to handle more
allocated time than 100% (Flex reservations).
-- When a job is requesting a Flex reservation prefer the reservation's nodes
over any other nodes.
* Changes in Slurm 17.11.4
==========================
-- Add fatal_abort() function to be able to get core dumps if we hit an
"impossible" edge case.
-- Link slurmd against all libraries that slurmstepd links to.
-- Fix limits enforce order when they're set at partition and other levels.
-- Add slurm_load_single_node() function to the Perl API.
-- slurm.spec - change dependency for --with lua to use pkgconfig.
-- Fix small memory leaks in node_features plugins on reconfigure.
-- slurmdbd - only permit requests to update resources from operators or
administrators.
-- Fix handling of partial writes in io_init_msg_write_to_fd() which can
lead to job step launch failure under higher cluster loads.
-- MYSQL - Fix to handle quotes in a given work_dir of a job.
-- sbcast - fix a race condition that leads to "Unspecified error".
-- Log that support for the ChosLoc configuration parameter will end in Slurm
version 18.08.
-- Fix backfill performance issue where bf_min_prio_reserve was not respected.
-- Print MaxQueryTimeRange in "sacctmgr show config".
-- Correctly check return codes when creating a step to determine whether to
wait and retry.
-- Fix issue where a job could be denied by Reason=MaxMemPerLimit when not
requesting any tasks.
-- In perl tools, fix regexp that caused extra results to be shown incorrectly.
-- Add some extra locks in fed_mgr to be extra safe.
-- Minor memory leak fixes in the fed_mgr on slurmctld shutdown.
-- Make sreport job reports also report duplicate jobs correctly.
-- Fix issues restoring certain Partition configuration elements, especially
when ReconfigFlags=KeepPartInfo is enabled.
-- Don't add TRES whose value is NO_VAL64 when building string line.
-- Fix removing array jobs from hash in slurmctld.
-- Print out missing user messages from jobsubmit plugin when srun/salloc are
waiting for an allocation.
-- Handle --clusters=all as case insensitive.
-- Only check requested clusters in federation when using --test-only
submission option.
-- In the federation, make it so you can cancel stranded sibling jobs.
-- Silence an error from PSS memory stat collection process.
-- Requeue jobs allocated to nodes requested to DRAIN or FAIL if nodes are
POWER_SAVE or POWER_UP, preventing jobs from starting on NHC-failed nodes.
-- Make MAINT and OVERLAP reservation flags order agnostic on overlap test.
-- Preserve node features when slurmctld daemons are reconfigured, including
active and available KNL features.
-- Prevent creation of multiple io_timeout threads within srun, which can
lead to fatal() messages when those unexpected and additional mutexes are
destroyed when srun shuts down.
-- burst_buffer/cray - Prevent use of "#DW create_persistent" and
"#DW destroy_persistent" directives available in Cray CLE6.0UP06. This
will be supported in Slurm version 18.08. Use "#BB" directives until then.
-- Fix task/cgroup affinity to behave correctly.
-- FreeBSD - fix build on systems built with WITHOUT_KERBEROS.
-- Fix to restore pn_min_memory calculated result to correctly enforce
MaxMemPerCPU setting on a partition when the job uses --mem.
-- slurmdbd - prevent infinite loop if a QOS is set to preempt itself.
-- Fix issue with log rotation for slurmstepd processes.
-- Revert node_features changes in 17.11.3 that lead to various segfaults on
slurmctld startup.
* Changes in Slurm 17.11.3
==========================
-- Sort sreport's reservation report by cluster, time_start, resv_name instead
of cluster, resv_name, time_start.
-- Avoid setting node in COMPLETING state indefinitely if the job initiating
the node reboot is cancelled while the reboot is in progress.
-- Scheduling fix for changing node features without any NodeFeatures plugins.
-- Improve logic when summarizing job arrays mail notifications.
-- Add scontrol -F/--future option to display nodes in FUTURE state.
-- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE.
-- When a job array is preempting make it so tasks in the array don't wait
to preempt other possible jobs.
-- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free
in slurmstepd.
-- node_feature/knl_cray - Fix memory leaks that occur when slurmctld is
reconfigured.
-- node_feature/knl_cray - Fix memory leak that can occur during normal
operation.
-- Fix srun environment variables for --prolog script.
-- Fix job array dependency with "aftercorr" option when some array tasks in
the first job fail. This fix lets all task array elements that can run
proceed rather than stopping all subsequent task array elements.
-- Fix potential deadlock in the slurmctld when using list_for_each.
-- Fix for possible memory corruption in srun when running heterogeneous job
steps.
-- Fix output file containing "%t" (task ID) for heterogeneous job step to
be based upon global task ID rather than task ID for that component of the
heterogeneous job step.
-- MYSQL - Fix potential abort when attempting to make an account a parent of
itself.
-- Fix potentially uninitialized variable in slurmctld.
-- MYSQL - Fix issue for multi-dimensional machines when using sacct to
find jobs that ran on specific nodes.
-- Reject --acctg-freq at submit if invalid.
-- Added info string on sh5util when deleting an empty file.
-- Correct dragonfly topology support when job allocation specifies desired
switch count.
-- Fix minor memory leak on an sbcast error path.
-- Fix issues when starting the backup slurmdbd.
-- Revert uid check when requesting a jobid from a pid.
-- task/cgroup - add support to detect OOM_KILL cgroup events.
-- Fix whole node allocation cpu counts when --hint=nomultithread.
-- Allow execution of task prolog/epilog when uid has access
rights by a secondary group id.
-- Validate command existence on the srun *[pro|epi]log options
if LaunchParameter test_exec is set.
-- Fix potential memory leak if clean starting and the TRES didn't change
from when last started.
-- Fix for association MaxWall enforcement when none is given at submission.
-- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld.