This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and admins.
* Changes in Slurm 14.03.12
===========================
-- Make it so previous versions of salloc/srun work with newer versions
of Slurm daemons.
* Changes in Slurm 14.03.11
===========================
-- ALPS - Fix depth for Memory items in BASIL with CLE 5.2
(changed starting in 5.2.3).
-- ALPS - Fix issue when tracking memory on a PerNode basis instead of
PerCPU.
-- Modify assoc_mgr_fill_in_qos() to allow for a flag to know if the QOS read
lock was locked outside of the function or not.
-- Give even better estimates on pending node count if no node count
is requested.
-- Fix jobcomp/mysql plugin for MariaDB 10+/MySQL 5.6+ to work with reserved
word "partition".
-- If requested (scontrol reboot node_name) reboot a node even if it has
a maintenance reservation that is not active yet.
-- Fix issue where exclusive allocations wouldn't lay tasks out correctly
with CR_PACK_NODES.
-- Do not requeue a batch job from slurmd daemon if it is killed while in
the process of being launched (a race condition introduced in v14.03.9).
-- Do not let srun overwrite SLURM_JOB_NUM_NODES if already in an allocation.
-- Prevent a job's end_time from being too small after a basil reservation
error.
-- Prevent sbatch --ntasks-per-core option from setting an invalid
SLURM_NTASKS_PER_CORE environment value.
-- Prevent scancel abort when no job satisfies filter options.
-- ALPS - Fix --ntasks-per-core option on multiple nodes.
-- Double max string that Slurm can pack from 16MB to 32MB to support
larger MPI2 configurations.
-- Log Cray MPI job calling exit() without mpi_fini(), but do not treat it as
a fatal error. This partially reverts logic added in version 14.03.9.
-- sview - Fix displaying of suspended steps elapsed times.
-- Increase number of messages that get cached before throwing them away
when the DBD is down.
-- Prevent jobs from starting in overlapping reservations when they won't
finish before a "maint" reservation begins.
-- Fix "squeue --start" to override SQUEUE_FORMAT env variable.
-- Restore GRES functionality with select/linear plugin. It was broken in
version 14.03.10.
-- Fix possible race condition when attempting to use QOS on a system running
accounting_storage/filetxt.
-- Sanity check for correct QOS on startup.
* Changes in Slurm 14.03.10
===========================
-- Treat non-zero SlurmSchedLogLevel without SlurmSchedLogFile as a fatal
error.
-- Correct sched_config.html documentation: SchedulingParameters
should be SchedulerParameters.
-- When using gres and cgroup ConstrainDevices set correct access
permission for the batch step.
-- Fix minor memory leak in jobcomp/mysql on slurmctld reconfig.
-- Fix bug that prevented preservation of a job's GRES bitmap on slurmctld
restart or reconfigure (bug was introduced in 14.03.5 "Clear record of a
job's gres when requeued" and only applies when GRES mapped to specific
files).
-- BGQ: Fix race condition when job fails due to hardware failure and is
requeued. Previous code could result in slurmctld abort with NULL pointer.
-- Prevent negative job array index, which could cause slurmctld to crash.
-- Fix issue with squeue/scontrol showing correct node_cnt when only tasks
are specified.
-- Check the status of the database connection before using it.
-- ALPS - If an allocation requests -n set the BASIL -N option to the
number of tasks / number of nodes.
-- ALPS - Don't set the env var APRUN_DEFAULT_MEMORY, it is not needed anymore.
-- Give better estimates on pending node count if no node count is requested.
-- BLUEGENE - Fix issue where requeuing jobs could cause an assert.
* Changes in Slurm 14.03.9
==========================
-- If slurmd fails to stat(2) the configuration, print the string describing
the error code.
-- Fix for mixing core base reservations with whole node based reservations
to avoid overlapping erroneously.
-- BLUEGENE - Remove references to Base Partition.
-- sview - Fix display of blocks when sview is compiled on a non-bluegene
system and then used to view a BGQ.
-- Fix bug in update reservation. When modifying the reservation the end time
was set incorrectly.
-- The start time of a reservation that is in ACTIVE state cannot be modified.
-- Update the cgroup documentation about release agent for devices.
-- MYSQL - fix for setting up preempt list on a QOS for multiple QOS.
-- Correct a minor error in the scancel.1 man page related to the
--signal option.
-- Enhance the scancel.1 man page to document the sequence of signals sent
-- Fix slurmstepd core dump if the cgroup hierarchy is not completed
when terminating the job.
-- Fix hostlist_shift to be able to give correct node names on names with a
different number of dimensions than the cluster.
-- BLUEGENE - Fix invalid pointer in corner case in the plugin.
-- Make sure on a reconfigure the select information for a node is preserved.
-- Correct logic to support job GRES specification over 31 bits (problem
in logic converting int to uint32_t).
-- Remove logic that was creating GRES bitmap for node when not needed (only
needed when GRES mapped to specific files).
-- BLUEGENE - Fix sinfo -tr; before, it would only print idle nodes correctly.
-- BLUEGENE - Fix for licenses_only reservation on bluegene systems.
-- sview - Verify pointer before using strchr.
-- Fix -M option on tools talking to a Cray from a non-Cray.
-- CRAY - Fix rpmbuild issue for missing file slurm.conf.template.
-- Fix race condition when dealing with removing many associations at
different times when reservations are using the associations that are
being deleted.
-- When a node's state is set to power_down/power_up, then execute
SuspendProgram/ResumeProgram even if previously executed for that node.
-- Fix logic determining when job configuration (i.e. running node power up
logic) is complete.
-- Setting the state of a node in powered down state to "resume" will
no longer cause it to reboot, but only clear the "drain" state flag.
-- Fix srun documentation to remove the claim that SLURM_NODELIST is
equivalent to the -w option (it isn't).
-- Fix issue with --hint=nomultithread and allocations with steps running
arbitrary layouts (test1.59).
-- PrivateData=reservation modified to permit users to view the reservations
which they have access to (rather than preventing them from seeing ANY
reservation). Backport from 14.11 commit 77c2bd25c.
-- Fix PrivateData=reservation when using associations to give privileges to
a reservation.
-- Better checking to see if select plugin is linear or not.
-- Add support for time specification of "fika" (3 PM).
-- Provide better estimate of minimum node count for pending jobs using more
job parameters.
-- ALPS - Add SubAllocate to cray.conf file for those who like the way <=2.5
did the ALPS reservation.
-- Safer check to avoid invalid reads when shutting down the slurmctld with
lots of jobs.
-- Fix minor memory leak in the backfill scheduler when shutting down.
-- Add ArchiveResvs to the output of sacctmgr show config and init the variable
on slurmdbd startup.
-- SLURMDBD - Only set the archive flag if purging the object
(i.e. ArchiveJobs PurgeJobs). This is only a cosmetic change.
-- Fix for job step memory allocation logic if step requests GRES and memory
allocations are not managed.
-- Fix sinfo to display mixed nodes as allocated in '%F' output.
-- sview - Fix cpu and node counts for partitions.
-- Ignore NO_VAL in SLURMDB_PURGE_* macros.
-- ALPS - Don't drain nodes if epilog fails. It leaves them in drain state
with no way to get them out.
-- Fix issue with task/affinity oversubscribing cpus erroneously when
using --ntasks-per-node.
-- MYSQL - Fix load of archive files.
-- Treat Cray MPI job calling exit() without mpi_fini() as fatal error for
that specific task and let srun handle all timeout logic.
-- Fix small memory leak in jobcomp/mysql.
-- Correct tracking of licenses for suspended jobs on slurmctld reconfigure or
restart.
-- If a batch job fails to launch, requeue it in hold.
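
Note on the GRES entry above about counts over 31 bits: the sketch below is
not Slurm code. It only illustrates, under the assumption that a count passes
through a signed 32-bit C int before reaching a uint32_t, why values above
2^31 - 1 are at risk. The function names and the sample count are
hypothetical.

```python
import ctypes


def as_signed_int32(value):
    """Reinterpret the low 32 bits of value as a signed C int."""
    return ctypes.c_int32(value).value


def as_uint32(value):
    """Reinterpret the low 32 bits of value as an unsigned 32-bit integer."""
    return ctypes.c_uint32(value).value


# Hypothetical GRES count that needs more than 31 bits.
gres_count = 2**31 + 5

# Held in a signed int, the top bit is read as a sign bit and the count
# turns negative.
signed_view = as_signed_int32(gres_count)

# Held in a uint32_t from the start, the count is preserved.
unsigned_view = as_uint32(gres_count)

print(signed_view, unsigned_view)
```

Any comparison or arithmetic performed on the signed view would see a
negative count, which is the general class of problem the entry describes.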
* Changes in Slurm 14.03.8
==========================
-- Fix minor memory leak when job doesn't have nodes on it (meaning the job
has finished).
-- Fix sinfo/sview to be able to query against nodes in reserved and other
states.
-- Make sbatch/salloc read in (SLURM|(SBATCH|SALLOC))_HINT in order to
handle sruns in the script that will use it.
-- srun properly interprets a leading "." in the executable name based upon
the working directory of the compute node rather than the submit host.
-- Fix Lustre misspellings in hdf5 guide
-- Fix wrong reference in slurm.conf man page to what --profile option should
be used for AcctGatherFilesystemType.
-- Update HDF5 document to point out the SlurmdUser is who creates the
ProfileHDF5Dir directory as well as all its sub-directories and files.
-- CRAY NATIVE - Remove error message for sruns run inside an salloc that
had --network= specified.
-- Defer job step initiation if required GRES are in use by other steps rather
than immediately returning an error.
-- Deprecate --cpu_bind from sbatch and salloc. These never worked correctly
and only caused confusion. Since the cpu_bind options mostly refer to a
step, we opted to only allow srun to set them in future versions.
-- Modify sgather to work if Nodename and NodeHostname differ.
-- Changed use of JobContainerPlugin where it should be JobContainerType.
-- Fix for possible error if job has GRES, but the step explicitly requests a
GRES count of zero.
-- Make "srun --gres=none ..." work when executed without a job allocation.
-- Change the global eio_shutdown_time to a field in eio handle.
-- Advanced reservation fixes for heterogeneous systems, especially when
reserving cores.
-- If --hint=nomultithread is used in a job allocation make sure any sruns
run inside the allocation can read the environment correctly.
-- If batchdir can't be made set errno correctly so the slurmctld is notified
correctly.
-- Remove repeated batch complete if batch directory isn't able to be made
since the slurmd will send the same message.
-- sacctmgr - Fix default format for list transactions.
-- BLUEGENE - Fix backfill issue with backfilling jobs on blocks already
reserved for higher priority jobs.
-- When creating job arrays the job specification files for each element
are hard links to the first element specification files. If the controller