This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and admins.
* Changes in Slurm 14.03.11
===========================
-- ALPS - Fix depth for Memory items in BASIL with CLE 5.2
(changed starting in 5.2.3).
-- ALPS - Fix issue when tracking memory on a PerNode basis instead of
PerCPU.
-- Modify assoc_mgr_fill_in_qos() to allow for a flag to know if the QOS read
lock was locked outside of the function or not.
-- Give even better estimates on pending node count if no node count
is requested.
-- Fix jobcomp/mysql plugin for MariaDB 10+/MySQL 5.6+ to work with reserved
word "partition".
-- If requested (scontrol reboot node_name) reboot a node even if it has
a maintenance reservation that is not active yet.
-- Fix issue where exclusive allocations wouldn't lay tasks out correctly
with CR_PACK_NODES.
-- Do not requeue a batch job from slurmd daemon if it is killed while in
the process of being launched (a race condition introduced in v14.03.9).
-- Do not let srun overwrite SLURM_JOB_NUM_NODES if already in an allocation.
-- Prevent a job's end_time from being too small after a basil reservation
error.
-- Prevent sbatch --ntasks-per-core option from setting an invalid
SLURM_NTASKS_PER_CORE environment value.
-- Prevent scancel abort when no job satisfies filter options.
-- ALPS - Fix --ntasks-per-core option on multiple nodes.
-- Double max string that Slurm can pack from 16MB to 32MB to support
larger MPI2 configurations.
-- Log Cray MPI job calling exit() without mpi_fini(), but do not treat it as
a fatal error. This partially reverts logic added in version 14.03.9.
-- sview - Fix displaying of suspended steps elapsed times.
-- Increase number of messages that get cached before throwing them away
when the DBD is down.
-- Prevent jobs from starting in overlapping reservations if they won't finish
before a "maint" reservation begins.
-- Fix "squeue --start" to override SQUEUE_FORMAT env variable.
-- Restore GRES functionality with select/linear plugin. It was broken in
version 14.03.10.
-- Fix possible race condition when attempting to use QOS on a system running
accounting_storage/filetxt.
* Changes in Slurm 14.03.10
===========================
-- Treat non-zero SlurmSchedLogLevel without SlurmSchedLogFile as a fatal
error.
-- Correct sched_config.html documentation: SchedulingParameters
should be SchedulerParameters.
-- When using gres and cgroup ConstrainDevices set correct access
permission for the batch step.
-- Fix minor memory leak in jobcomp/mysql on slurmctld reconfig.
-- Fix bug that prevented preservation of a job's GRES bitmap on slurmctld
restart or reconfigure (bug was introduced in 14.03.5 "Clear record of a
job's gres when requeued" and only applies when GRES mapped to specific
files).
-- BGQ: Fix race condition when job fails due to hardware failure and is
requeued. Previous code could result in slurmctld abort with NULL pointer.
-- Prevent negative job array index, which could cause slurmctld to crash.
-- Fix issue with squeue/scontrol not showing correct node_cnt when only tasks
are specified.
-- Check the status of the database connection before using it.
-- ALPS - If an allocation requests -n set the BASIL -N option to the
number of tasks / number of nodes.
-- ALPS - Don't set the env var APRUN_DEFAULT_MEMORY, it is not needed anymore.
-- Give better estimates on pending node count if no node count is requested.
-- BLUEGENE - Fix issue where requeuing jobs could cause an assert.
* Changes in Slurm 14.03.9
==========================
-- If slurmd fails to stat(2) the configuration, print the string describing
the error code.
-- Fix for mixing core base reservations with whole node based reservations
to avoid overlapping erroneously.
-- BLUEGENE - Remove references to Base Partition.
-- sview - Fix display of blocks when sview is compiled on a non-bluegene
system and then used to view a BGQ.
-- Fix bug in update reservation. When modifying the reservation the end time
was set incorrectly.
-- The start time of a reservation that is in ACTIVE state cannot be modified.
-- Update the cgroup documentation about release agent for devices.
-- MYSQL - fix for setting up preempt list on a QOS for multiple QOS.
-- Correct a minor error in the scancel.1 man page related to the
--signal option.
-- Enhance the scancel.1 man page to document the sequence of signals sent
-- Fix slurmstepd core dump if the cgroup hierarchy is not completed
when terminating the job.
-- Fix hostlist_shift to be able to give correct node names on names with a
different number of dimensions than the cluster.
-- BLUEGENE - Fix invalid pointer in corner case in the plugin.
-- Make sure on a reconfigure the select information for a node is preserved.
-- Correct logic to support job GRES specification over 31 bits (problem
in logic converting int to uint32_t).
-- Remove logic that was creating GRES bitmap for node when not needed (only
needed when GRES mapped to specific files).
-- BLUEGENE - Fix sinfo -tr; previously it would only print idle nodes correctly.
-- BLUEGENE - Fix for licenses_only reservation on bluegene systems.
-- sview - Verify pointer before using strchr.
-- Fix -M option on tools talking to a Cray from a non-Cray.
-- CRAY - Fix rpmbuild issue for missing file slurm.conf.template.
-- Fix race condition when dealing with removing many associations at
different times when reservations are using the associations that are
being deleted.
-- When a node's state is set to power_down/power_up, then execute
SuspendProgram/ResumeProgram even if previously executed for that node.
-- Fix logic determining when job configuration (i.e. running node power up
logic) is complete.
-- Setting the state of a node in powered down state to "resume" will
no longer cause it to reboot, but only clear the "drain" state flag.
-- Fix srun documentation to remove the claim that SLURM_NODELIST is equivalent
to the -w option (it isn't).
-- Fix issue with --hint=nomultithread and allocations with steps running
arbitrary layouts (test1.59).
-- PrivateData=reservation modified to permit users to view the reservations
which they have access to (rather than preventing them from seeing ANY
reservation). Backport from 14.11 commit 77c2bd25c.
-- Fix PrivateData=reservation when using associations to give privileges to
a reservation.
-- Better checking to see if select plugin is linear or not.
-- Add support for time specification of "fika" (3 PM).
-- Provide better estimate of minimum node count for pending jobs using more
job parameters.
-- ALPS - Add SubAllocate to cray.conf file for those who like the way <=2.5
did the ALPS reservation.
-- Safer check to avoid invalid reads when shutting down the slurmctld with
lots of jobs.
-- Fix minor memory leak in the backfill scheduler when shutting down.
-- Add ArchiveResvs to the output of sacctmgr show config and init the variable
on slurmdbd startup.
-- SLURMDBD - Only set the archive flag if purging the object
(i.e. ArchiveJobs PurgeJobs). This is only a cosmetic change.
-- Fix for job step memory allocation logic if step requests GRES and memory
allocations are not managed.
-- Fix sinfo to display mixed nodes as allocated in '%F' output.
-- Sview - Fix cpu and node counts for partitions.
-- Ignore NO_VAL in SLURMDB_PURGE_* macros.
-- ALPS - Don't drain nodes if epilog fails. It leaves them in drain state
with no way to get them out.
-- Fix issue with task/affinity oversubscribing cpus erroneously when
using --ntasks-per-node.
-- MYSQL - Fix load of archive files.
-- Treat Cray MPI job calling exit() without mpi_fini() as fatal error for
that specific task and let srun handle all timeout logic.
-- Fix small memory leak in jobcomp/mysql.
-- Correct tracking of licenses for suspended jobs on slurmctld reconfigure or
restart.
-- If a batch job fails to launch, requeue it in hold.
* Changes in Slurm 14.03.8
==========================
-- Fix minor memory leak when job doesn't have nodes on it (meaning the job
has finished).
-- Fix sinfo/sview to be able to query against nodes in reserved and other
states.
-- Make sbatch/salloc read in (SLURM|(SBATCH|SALLOC))_HINT in order to
handle sruns in the script that will use it.
-- srun properly interprets a leading "." in the executable name based upon
the working directory of the compute node rather than the submit host.
-- Fix Lustre misspellings in hdf5 guide
-- Fix wrong reference in slurm.conf man page to what --profile option should
be used for AcctGatherFilesystemType.
-- Update HDF5 document to point out the SlurmdUser is who creates the
ProfileHDF5Dir directory as well as all its sub-directories and files.
-- CRAY NATIVE - Remove error message for sruns run inside an salloc that
had --network= specified.
-- Defer job step initiation if required GRES are in use by other steps rather
than immediately returning an error.
-- Deprecate --cpu_bind from sbatch and salloc. These never worked correctly
and only caused confusion. Since the cpu_bind options mostly refer to a
step, we opted to only allow srun to set them in future versions.
-- Modify sgather to work if Nodename and NodeHostname differ.
-- Changed use of JobContainerPlugin where it should be JobContainerType.
-- Fix for possible error if job has GRES, but the step explicitly requests a
GRES count of zero.
-- Make "srun --gres=none ..." work when executed without a job allocation.
-- Change the global eio_shutdown_time to a field in eio handle.
-- Advanced reservation fixes for heterogeneous systems, especially when
reserving cores.
-- If --hint=nomultithread is used in a job allocation make sure any sruns
run inside the allocation can read the environment correctly.
-- If batchdir can't be made set errno correctly so the slurmctld is notified
correctly.
-- Remove repeated batch complete if batch directory isn't able to be made
since the slurmd will send the same message.
-- sacctmgr fix default format for list transactions.
-- BLUEGENE - Fix backfill issue with backfilling jobs on blocks already
reserved for higher priority jobs.
-- When creating job arrays the job specification files for each element
are hard links to the first element's specification files. If the controller
fails to make the links the files are copied instead.
-- Fix error handling for job array create failure due to inability to copy
job files (script and environment).
-- Added patch in the contribs directory for integrating make version 4.0 with
Slurm and renamed the previous patch "make-3.81.slurm.patch".
-- Don't wait for an update message from the DBD to finish before sending rc
message back. In slow systems with many associations this could speed