Newer
Older
This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.
* Changes in Slurm 17.02.5
==========================
-- Prevent segfault if a job was blocked from running by a QOS that is then
deleted.
-- Improve selection of jobs to preempt when there are multiple partitions
with jobs subject to preemption.
-- Only set kmem limit when ConstrainKmemSpace=yes is set in cgroup.conf.
-- Fix bug in task/affinity that could result in slurmd fatal error.
-- Increase number of jobs that are tracked in the slurmd as finishing at one
time.
-- Note when a job finishes in the slurmd to avoid a race when launching a
batch job takes longer than it takes to finish.
-- Improve slurmd startup on large systems (> 10000 nodes)
-- Add LaunchParameters option of cray_net_exclusive to control whether all
jobs on the cluster have exclusive access to their assigned nodes.
* Changes in Slurm 17.02.4
==========================
-- Do not attempt to schedule jobs after changing the power cap if there are
already many active threads.
-- Job expansion example in FAQ enhanced to demonstrate operation in
heterogeneous environments.

Alejandro Sanchez
committed
-- Prevent scontrol crash when operating on array and no-array jobs at once.
-- knl_cray plugin: Log incomplete capmc output for a node.
-- knl_cray plugin: Change capmc parsing of mcdram_pct from string to number.
-- When rebooting a node and using the PrologFlags=alloc make sure the
prolog is ran after the reboot.
-- node_features/knl_generic - If a node is rebooted for a pending job, but
fails to enter the desired NUMA and/or MCDRAM mode then drain the node and
requeue the job.
-- node_features/knl_generic disable mode change unless RebootProgram
configured.
-- Add new burst_buffer function bb_g_job_revoke_alloc() to be executed
if there was a failure after the initial resource allocation. Does not
release previously allocated resources.
-- Test if the node_bitmap on a job is NULL when testing if the job's nodes
are ready. This will be NULL is a job was revoked while beginning.
-- Fix incorrect lock levels when testing when job will run or updating a job.
-- Add missing locks to job_submit/pbs plugin when updating a jobs
dependencies.
-- Add min_memory_per_node|cpu to the job_submit/lua plugin to deal with lua
not being able to deal with pn_min_memory being a uint64_t. Scripts are
urged to change to these new variables avoid issue. If not set the
variables will be 'nil'.
-- Calculate priority correctly when 'nice' is given.
-- Fix minor typos in the documentation.
-- node_features/knl_cray: Preserve non-KNL active features if slurmctld
reconfigured while node boot in progress.
-- node_features/knl_generic: Do not repeatedly log errors when trying to read
KNL modes if not KNL system.
-- Add missing QOS read lock to backfill scheduler.
-- When doing a dlopen on liblua only attempt the version compiled against.
-- Fix null-dereference in sreport cluster ulitization when configured with
memory-leak-debug.
-- Fix Partition info in 'scontrol show node'. Previously duplicate partition
names, or Partitions the node did not belong to could be displayed.
-- Fix it so the backup slurmdbd will take control correctly.

Alejandro Sanchez
committed
-- Fix unsafe use of MAX() macro, which could result in problems cleaning up

Alejandro Sanchez
committed
accounting plugins in slurmd, or repeat job cancellation attempts in
scancel.
-- Fix 'scontrol update reservation duration=unlimited' to set the duration
to 365-days (as is done elsewhere), rather than 49710 days.
-- Check if variable given to scontrol show job is a valid jobid.
-- Fix WithSubAccounts option to not include WithDeleted unless requested.

Dominik Bartkiewicz
committed
-- Prevent a job tested on multiple partitions from being marked
WHOLE_NODE_USER.

Dominik Bartkiewicz
committed
-- Prevent a race between completing jobs on a user-exclusive node from
leaving the node owned.
-- When scheduling take the nodes in completing jobs out of the mix to reduce
fragmentation. SchedulerParameters=reduce_completing_frag
-- For jobs submited to multiple partitions, report the job's earliest start
time for any partition.
-- Backfill partitions that use QOS Grp limits to "float" better.
-- node_features/knl_cray: don't clear configured GRES from non-KNL node.
-- sacctmgr - prevent segfault in command when a request is denied due
to a insufficient priviledges.
-- Add warning about libcurl-devel not being installed during configure.
-- Streamline job purge by handling file deletion on a separate thread.
-- Always set RLIMIT_CORE to the maximum permitted for slurmd, to ensure
core files are created even on non-developer builds.
-- Fix --ntasks-per-core option/environment variable parsing to set
the requested value, instead of always setting one.
-- If trying to cancel a step that hasn't started yet for some reason return
a good return code.
-- Fix issue with sacctmgr show where user=''
* Changes in Slurm 17.02.3
==========================
-- Increase --cpu_bind and --mem_bind field length limits.
-- Fix segfault when using AdminComment field with job arrays.
-- Clear Dependency field when all dependencies are satisfied.
-- Add --array-unique to squeue which will display one unique pending job
array element per line.
-- Reset backfill timers correctly without skipping over them in certain
circumstances.
-- When running the "scontrol top" command, make sure that all of the user's
jobs have a priority that is lower than the selected job. Previous logic
would permit other jobs with equal priority (no jobs with higher priority).
-- Fix perl api so we always get an allocation when calling Slurm::new().
-- Fix issue with cleaning up cpuset and devices cgroups when multiple steps
end at the same time.
-- Document that PriorityFlags option of DEPTH_OBLIVIOUS precludes the use of
FAIR_TREE.
-- Fix issue if an invalid message came in a Slurm daemon/command may abort.
-- Make it impossible to use CR_CPU* along with CR_ONE_TASK_PER_CORE. The
options are mutually exclusive.
-- ALPS - Fix scheduling when ALPS doesn't agree with Slurm on what nodes
are free.
-- When removing a partition make sure it isn't part of a reservation.
-- Fix seg fault if loading attempting to load non-existent burstbuffer plugin.
-- Fix to backfill scheduling with respect to QOS and association limits. Jobs
submitted to multiple partitions are most likley to be effected.
-- sched/backfill: Improve assoc_limit_stop configuration parameter support.
-- sched/backfill: Fix bug related to advanced reservations and the need to
reboot nodes to change KNL mode.
-- Preempt plugins - fix check for 'preempt_youngest_first' option.
-- Preempt plugins - fix incorrect casts in preempt_youngest_first mode.
-- Preempt/job_prio - fix incorrect casts in sort function.
-- Fix to make task/affinity work with ldoms where there are more than 64
cpus on the node.
-- When using node_features/knl_generic make it so the slurmd doesn't segfault
when shutting down.

Tim Wickberg
committed
-- Fix potential double-xfree() when using job arrays that can lead to
slurmctld crashing.
-- Fix priority/multifactor priorities on a slurmctld restart if not using
accounting_storage/[mysql|slurmdbd].
-- Fix NULL dereference reported by CLANG.
-- Update proctrack documentation to strongly encourage use of
proctrack/cgroup.
-- Fix potential memory leak if job fails to begin after nodes have been
selected for a job.
-- Handle a job that made it out of the select plugin without a job_resrcs
pointer.
-- Fix potential race condition when persistent connections are being closed at
shutdown.
-- Fix incorrect locks levels when submitting a batch job or updating a job
in general.
-- CRAY - Move delay waiting for job cleanup to after we check once.
-- MYSQL - Fix memory leak when loading archived jobs into the database.
-- Fix potential race condition when starting the priority/multifactor plugin's
decay thread.
-- Sanity check to make sure we have started a job in acct_policy.c before we
clear it as started.
-- Allow reboot program to use arguments.
-- Message Aggr - Remove race condition on slurmd shutdown with respects to
destroying a mutex.
-- Fix updating job priority on multiple partitions to be correct.
-- Don't remove admin comment when updating a job.

Dominik Bartkiewicz
committed
-- Return error when bad separator is given for scontrol update job licenses.
* Changes in Slurm 17.02.2
==========================
-- Update hyperlink to LBNL Node Health Check program.
-- burst_buffer/cray - Add support for line continuation.
-- If a job is cancelled by the user while it's allocated nodes are being
reconfigured (i.e. the capmc_resume program is rebooting nodes for the job)
and the node reconfiguration fails (i.e. the reboot fails), then don't
requeue the job but leave it in a cancelled state.
-- capmc_resume (Cray resume node script) - Do not disable changing a node's
active features if SyscfgPath is configured in the knl.conf file.
-- Improve the srun documentation for the --resv-ports option.
-- burst_buffer/cray - Fix parsing for discontinuous allocated nodes. A job
allocation of "20,22" must be expressed as "20\n22".
-- Fix rare segfault when shutting down slurmctld and still sending data to
the database.
-- Fix gres output of a job if it is updated while pending to be displayed
correctly with Slurm tools.
-- Fix missing unlock when job_list doesn't exist when starting priority/
multifactor.
-- Fix segfault if slurmctld is shutting down and the slurmdbd plugin was
in the middle of setting db_indexes.
-- Add ESLURM_JOB_SETTING_DB_INX to errno to note when a job can't be updated
because the dbd is setting a db_index.
-- Fix possible double insertion into database when a job is updated at the
moment the dbd is assigning a db_index.
-- Fix memory error when updating a job's licenses.

Danny Auble
committed
-- Fix seff to work correctly with non-standard perl installs.
-- Export missing slurmdbd_defs_[init|fini] needed for libslurmdb.so to work.
-- Fix sacct from returning way more than requested when querying against a job
array task id.
-- Fix double read lock of tres when updating gres or licenses on a job.
-- Make sure locks are always in place when calling
assoc_mgr_make_tres_str_from_array.
-- Prevent slurmctld SEGV when creating reservation with duplicated name.
-- Consider QOS flags Partition[Min|Max]Nodes when doing backfill.
-- Fix slurmdbd_defs.c to not have half symbols go to libslurm.so and the
other half go to libslurmdb.so.
-- Fix 'scontrol show jobs' to remove an errant newline when 'Switches' is
printed.
-- Better code for handling memory required by a task on a heterogeneous
system.
-- Fix regression in 17.02.0 with respects to GrpTresMins on a QOS or
Association.
Loading
Loading full blame...