Newer
Older
This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.
* Changes in Slurm 14.11.5
==========================
-- Correct the squeue command taking into account that a node can
have NULL name if it is not in DNS but still in slurm.conf.
-- Fix slurmdbd regression which would cause a segfault when a node is set
down with no reason.
-- BGQ - Fix issue with job arrays not being handled correctly
in the runjob_mux plugin.
-- Print FAIR_TREE, if configured, in "scontrol show config" output for
PriorityFlags.
-- Add SLURM_JOB_GPUS environment variable to those available in the Prolog.
-- Load lua-5.2 library if using lua5.2 for lua job submit plugin.
-- GRES logic: Prevent bad node_offset due to not preserving no_consume flag.
-- Fix wrong variables used in the wrapper functions needed for systems that
don't support strong_alias
-- Fix code for apple computers SOL_TCP is not defined
-- Cray/BASIL - Check for mysql credentials in /root/.my.cnf.

Brian Christiansen
committed
-- Fix sprio showing wrong priority for job arrays until priority is
recalculated.
-- Account to batch step all CPUs that are allocated to a job not
just one since the batch step has access to all CPUs like other steps.

Brian Christiansen
committed
-- Fix job getting EligibleTime set before meeting dependency requirements.
-- Correct the initialization of QOS MinCPUs per job limit.
-- Set the debug level of information messages in cgroup plugin to debug2.
-- For job running under a debugger, if the exec of the task fails, then
cancel its I/O and abort immediately rather than waiting 60 seconds for
I/O timeout.
-- Fix associations not getting default qos set until after a restart.
-- Set the value of total_cpus not to be zero before invoking
acct_policy_job_runnable_post_select.
-- MySQL - When requesting cluster resources, only return resources for the
cluster(s) requested.
-- Add TaskPluginParam=autobind=threads option to set a default binding in the
case that "auto binding" doesn't find a match.
-- Introduce a new SchedulerParameters variable nohold_on_prolog_fail.
If configured don't requeue jobs on hold is a Prolog fails.
* Changes in Slurm 14.11.4
==========================
-- Make sure assoc_mgr locks are initialized correctly.
-- Correct check of enforcement when filling in an association.
-- Make sacctmgr print out classification correctly for clusters.
-- Add array_task_str to the perlapi job info.
-- Fix for slurmctld abort with GRES types configured and no CPU binding.
-- Fix for GRES scheduling where count > 1 per topology type (or GRES types).
-- Make CR_ONE_TASK_PER_CORE work correctly with task/affinity.
-- job_submit/pbs - Fix possible deadlock.
-- job_submit/lua - Add "alloc_node" to job information available.
-- Fix memory leak in mysql accounting when usage rollup happens.
-- If users specify ALL together with other variables using the
--export sbatch/srun command line option, propagate the users'
environ to the execution side.
-- Fix job array scheduling anomaly that can stop scheduling of valid tasks.
-- Fix perl api tests for libslurmdb to work correctly.
-- Remove some misleading logs related to non-consumable GRES.
-- Allow --ignore-pbs to take effect when read as an #SBATCH argument.

Brian Christiansen
committed
-- Fix Slurmdb::clusters_get() in perl api from not returning information.
-- Fix TaskPluginParam=Cpusets from logging error message about not being able
to remove cpuset dir which was already removed by the release_agent.
-- Fix the file name substitution for job stderr when %A, %a %j and %u
are specified.
-- Remove minor warning when compiling slurmstepd.
-- Fix database resources so they can add new clusters to them after they have
initially been added.
-- Use the slurm_getpwuid_r wrapper of getpwuid_r to handle possible
interrupts.
-- Correct the scontrol man page and command listing which node states can
be set by the command.
-- Stop sacct from printing non-existent stat information for
Front End systems.
-- Correct srun and acct_gather.conf man pages, mention Filesystem instead
of Lustre.
-- When a job using multiple partition starts send to slurmdbd only
the partition in which the job runs.
-- ALPS - Fix depth for MemoryAllocation in BASIL with CLE 5.2.3.
-- Fix assoc_mgr hash to deal with users that don't have a uid yet when making
reservations.
-- When a job uses multiple partition set the environment variable
SLURM_JOB_PARTITION to be the one in which the job started.
-- Print spurious message about the absence of cgroup.conf at log level debug2
instead of info.
-- Enable CUDA v7.0+ use with a Slurm configuration of TaskPlugin=task/cgroup
ConstrainDevices=yes (in cgroup.conf). With that configuration
CUDA_VISIBLE_DEVICES will start at 0 rather than the device number.
-- Fix job array logic that can cause slurmctld to abort.
-- Report job "shared" field properly in scontrol, squeue, and sview.
-- If a job is requeued because of RequeueExit or RequeueExitHold sent event
REQUEUED to slurmdbd.
-- Fix build if hwloc is in non-standard location.
-- Fix slurmctld job recovery logic which could cause the last task in a job
array to be lost.
-- Fix slurmctld initialization problem which could cause requeue of the last
task in a job array to fail if executed prior to the slurmctld loading
the maximum size of a job array into a variable in the job_mgr.c module.
-- Fix fatal in controller when deleting a user association of a user which

Brian Christiansen
committed
had been previously removed from the system.
-- MySQL - If a node state and reason are the same on a node state change
don't insert a new row in the event table.

Brian Christiansen
committed
-- Fix issue with "sreport cluster AccountUtilizationByUser" when using
PrivateData=users.
-- Fix perlapi tests for libslurm perl module.
-- MySQL - Fix potential issue when PrivateData=Usage and a normal user
runs certain sreport reports.
* Changes in Slurm 14.11.3
==========================
-- Prevent vestigial job record when canceling a pending job array record.
-- Fix job array hash table bug, could result in slurmctld infinite loop or
invalid memory reference.
-- In srun honor ntasks_per_node before looking at cpu count when the user
doesn't request a number of tasks.
-- Fix ghost job when submitting job after all jobids are exhausted.
-- MySQL - Enhanced coordinator security checks.
-- Fix for task/affinity if an admin configures a node for having threads
but then sets CPUs to only represent the number of cores on the node.
-- Make it so previous versions of salloc/srun work with newer versions
of Slurm daemons.
-- Avoid delay on commit for PMI rank 0 to improve performance with some
MPI implementations.
-- auth/munge - Correct logic to read old format AccountingStoragePass.
-- Reset node "RESERVED" state as appropriate when deleting a maintenance
reservation.
-- Prevent a job manually suspended from being resumed by gang scheduler once
free resources are available.
-- Prevent invalid job array task ID value if a task is started using gang
scheduling.
-- Fix documentation bugs in slurm.conf.5. DenyAccount should be DenyAccounts.
-- For backward compatibility with older versions of OMPI not compiled
with --with-pmi restore the SLURM_STEP_RESV_PORTS in the job environment.
-- Update the html documentation describing the integration with openmpi.
-- Fix sacct when searching by nodelist.
-- Fix cosmetic info statements when dealing with a job array task instead of
a normal job.
-- BGQ - Put print statement under a DebugFlag. This was just an oversight.
-- BLUEGENE - Remove check that would erroneously remove the CONFIGURING
flag from a job while the job is waiting for a block to boot.
-- Fix segfault in slurmstepd when job exceeded memory limit.
-- Fix race condition that could start a job that is dependent upon a job array
before all tasks of that job array complete.
* Changes in Slurm 14.11.2
==========================
-- Fix issue with association hash not getting the correct index which
could result in seg fault.
-- Avoid huge malloc if GRES configured with "Type" and huge "Count".
-- Fix jobs from starting in overlapping reservations that won't finish before
a "maint" reservation begins.
-- When node gets drained while in state mixed display its status as draining
in sinfo output.
-- Allow priority/multifactor to work with sched/wiki(2) if all priorities
have no weight. This allows for association and QOS decay limits to work.
-- Fix "squeue --start" to override SQUEUE_FORMAT env variable.

Brian Christiansen
committed
-- Fix scancel to be able to cancel multiple jobs that are space delimited.
-- Log Cray MPI job calling exit() without mpi_fini(), but do not treat it as
a fatal error. This partially reverts logic added in version 14.03.9.
-- sview - Fix displaying of suspended steps elapsed times.
-- Increase number of messages that get cached before throwing them away
when the DBD is down.
-- Fix jobs from starting in overlapping reservations that won't finish before
a "maint" reservation begins.
-- Restore GRES functionality with select/linear plugin. It was broken in
version 14.03.10.
-- Fix bug with GRES having multiple types that can cause slurmctld abort.

Brian Christiansen
committed
-- Fix squeue issue with not recognizing "localhost" in --nodelist option.
-- Make sure the bitstrings for a partitions Allow/DenyQOS are up to date
when running from cache.
-- Add smap support for job arrays and larger job ID values.
-- Fix possible race condition when attempting to use QOS on a system running
accounting_storage/filetxt.
-- Fix issue with accounting_storage/filetxt and job arrays not being printed
correctly.
-- In proctrack/linuxproc and proctrack/pgid, check the result of strtol()
for error condition rather than errno, which might have a vestigial error
code.
-- Improve information recording for jobs deferred due to advanced
reservation.
-- Exports eio_new_initial_obj to the plugins and initialize kvs_seq on
mpi/pmi2 setup to support launching.
* Changes in Slurm 14.11.1
==========================
-- Get libs correct when doing the xtree/xhash make check.
-- Update xhash/tree make check to work correctly with current code.
-- Remove the reference 'experimental' for the jobacct_gather/cgroup
plugin.
-- Add QOS manipulation examples to the qos.html documentation page.
-- If 'squeue -w node_name' specifies an unknown host name print
an error message and return 1.
-- Fix race condition in job_submit plugin logic that could cause slurmctld to
Loading
Loading full blame...