This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 20.02.2
==========================
 -- Fix slurmctld segfault when checking no_consume GRES node allocation counts.
 -- Fix resetting of cloud_dns on a reconfigure.
 -- squeue - change output for dependency column to use "(null)" instead of ""
    for no dependencies as documented in the man page, and used by other columns.
 -- Clear node_cnt_wag after job update.
 -- Fix regression where AccountingStoreJobComment was not defaulting to 'yes'.
 -- Send registration message immediately after a node is resumed.
 -- Cray - Fix hetjobs when using only a single component in the step launch.
 -- Cray - Fix hetjobs launched without component 0.
 -- Cray - Quiet cookies missing message which is expected for hetjobs.
 -- Fix handling of -m/--distribution options for across socket/2nd level by
    task/affinity plugin.
 -- Fix grp_node_bitmap error when slurmctld started before slurmdbd.
 -- Fix scheduling issue when there are not enough nodes available to run a job
    resulting in possible job starvation.
 -- Make it so mpi/cray_shasta appears in srun --mpi=list
 -- Don't requeue jobs that have been explicitly canceled.
 -- Fix error message for a regular user trying to update licenses on a running
    job.
 -- Fix backup slurmctld handling for logrotation via SIGUSR2.
 -- Fix reservation feature specification when looking for inactive features
    after active features fails.
 -- Prevent misleading error messages for reservation creation.
 -- Print message in scontrol when a request fails for not having enough nodes.
 -- Fix duplicate output in sacct with multiple resv events.
 -- auth/jwt - return correct gid for a given user. This was incorrectly
    assuming the user's primary group name matched their username.
 -- slurmrestd - permit non-SlurmUser/root job submission.
 -- Use host IP if hostname unknown for job submission for allocating node.
 -- Fix issue with primary_slurmdbd_resumed_operation trigger not happening
    on slurmctld restart.
 -- Fix race in acct_gather_interconnect/ofed on step termination.
 -- Fix typo of SlurmctldProlog -> PrologSlurmctld in error message.
 -- slurm.spec - add SuSE-specific dependencies for optional slurmrestd package.
 -- Fix FreeBSD build issues.
 -- Fixed sbatch not processing --ignore-pbs in batch script.
 -- Don't clear the qos_id of an invalid QOS.
 -- Allow a job that was once FAIL_[QOS|ACCOUNT] to be eligible again if
    the qos|account limitation is remedied.
 -- Fix core reservations using the FLEX flag to allow use of resources
    outside of the reservation allocation.
 -- Fix MPS without File with 1 GPU, and without GPUs.
 -- Add FreeBSD support to proctrack/pgid plugin.
 -- Fix remote dependency testing for meta job in job array.
 -- Fix preemption when dealing with a job array.
 -- Don't send remote non-pending singleton dependencies on federation update.
 -- slurmrestd - fix crash on empty query.
 -- Fix race condition which could lead to invalid references in backfill.
 -- Fix edge case in _remove_job_hash().
 -- Fix exit code when using --cluster/-M client options.
 -- Fix compilation issues in GCC10.
 -- Fix invalid references when federated job is revoked while in backfill loop.
 -- Fix distributing job steps across idle nodes within a job.
 -- Fix detected floating reservation overlapping.
 -- Break infinite loop in cons_tres dealing with incorrect tasks per tres
    request resulting in slurmctld hang.
 -- Send the current (not the previous) reason for a pending job to client
    commands like squeue/scontrol.
 -- Fix incorrect lock levels for select_g_reconfigure().
 -- Handle hidden nodes correctly in slurmrestd.
 -- Allow sacctmgr to use MaxSubmitP[U|A] as format options.
 -- Fix segfault when trying to delete a corrupted association.
 -- Fix setting ntasks-per-core when using --multithread.
 -- Only override job wait reason to priority if Reason=None or
    Reason=Resources.
 -- Perl API / seff - fix missing symbol issue with accounting_storage/slurmdbd.
 -- slurm.spec - add --with cray_shasta option.
 -- Downgrade "Node config differ.." error message if config_overrides enabled.
 -- Add client error when using --gpus-per-socket without --sockets-per-node.
 -- Fix nvml/rsmi debug statements making it to stderr.
 -- NodeSets - fix slurmctld segfault in newer glibc if any nodes have no
    defined features.
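
The auth/jwt gid fix in this release can be illustrated with a minimal sketch:
the primary gid is taken from the user's passwd entry rather than from a group
that happens to share the username. The helper name primary_gid_for_user() is
hypothetical and is not the plugin's code; getpwnam() is the standard POSIX
lookup and is not thread-safe.

```c
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not the auth/jwt implementation): look up the
 * primary gid recorded in the passwd entry for 'user'.
 * Returns 0 and fills *gid on success, -1 if the user is unknown.
 */
int primary_gid_for_user(const char *user, gid_t *gid)
{
	struct passwd *pw = getpwnam(user);

	if (pw == NULL)
		return -1;
	*gid = pw->pw_gid;	/* primary group, independent of any group name */
	return 0;
}
```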
* Changes in Slurm 20.02.1
==========================
 -- Improve job state reason for jobs hitting partition_job_depth.
 -- Speed up testing of singleton dependencies.
 -- Fix negative loop bound in cons_tres.
 -- srun - capture the MPI plugin return code from mpi_hook_client_fini() and
    use as final return code for step failure.
 -- Fix segfault in cli_filter/lua.
 -- Fix --gpu-bind=map_gpu reusability if tasks > elements.
 -- Make sure config_flags on a gres are sent to the slurmctld on node
    registration.
 -- Prolog/Epilog - Fix missing GPU information.
 -- Fix segfault when using config parser for expanded lines.
 -- Fix bit overlap test function.
 -- Don't accrue time if job begin time is in the future.
 -- Remove accrue time when updating a job start/eligible time to the future.
 -- Fix regression in 20.02.0 that broke --depend=expand.
 -- Reset begin time on job release if it's not in the future.
 -- Fix for recovering burst buffers when using high-availability.
 -- Fix invalid read due to freeing an incorrectly allocated env array.
 -- Update slurmctld -i message to warn about losing data.
 -- Fix scontrol cancel_reboot so it clears the DRAIN flag and node reason for a
    pending ASAP reboot.
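
For the --gpu-bind=map_gpu entry in this release, the sketch below shows one
way list reuse can work when there are more tasks than map elements: the index
wraps back to the start of the list. The function name and signature are
hypothetical; this illustrates the wrap-around idea only and is not the Slurm
implementation.

```c
#include <stddef.h>

/*
 * Hypothetical illustration of list reuse: the task with index local_id
 * picks an entry from a map of n_elems GPU ids, wrapping to the start of
 * the list when tasks outnumber entries. Returns -1 for an empty map.
 */
int gpu_for_task(const int *map, size_t n_elems, size_t local_id)
{
	if (n_elems == 0)
		return -1;
	return map[local_id % n_elems];
}
```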
* Changes in Slurm 20.02.0
==========================
 -- Fix minor memory leak in slurmd on reconfig.
 -- Fix invalid ptr reference when rolling up data in the database.
 -- Change shtml2html.py to require python3 for RHEL8 support, and match
    man2html.py.
 -- slurm.spec - override "hardening" linker flags to ensure RHEL8 builds
    in a usable manner.
 -- Fix type mismatches in the perl API.
 -- Prevent use of uninitialized slurmctld_diag_stats.
 -- Fixed various Coverity issues.
 -- Only show warning about root-less topology in daemons.
 -- Fix accounting of jobs in IGNORE_JOBS reservations.
 -- Fix issue with batch steps state not loading correctly when upgrading from
    19.05.
 -- Deprecate max_depend_depth in SchedulerParameters and move it to
    DependencyParameters.
 -- Silence erroneous error on slurmctld upgrade when loading federation state.
 -- Break infinite loop in cons_tres dealing with incorrect tasks per tres
    request resulting in slurmctld hang.
 -- Improve handling of --gpus-per-task to make sure appropriate number of GPUs
    is assigned to job.
 -- Fix seg fault on cons_res when requesting --spread-job.
* Changes in Slurm 20.02.0rc1
=============================
 -- sbatch - fix segfault when no newline at the end of a burst buffer file.
 -- Change scancel to only check job's base state when matching -t options.
 -- Save job dependency list in state files.
 -- cons_tres - allow jobs to be run on systems with root-less topologies.
 -- Restore pre-20.02pre1 PrologSlurmctld synchronization behavior to avoid
    various race conditions, and ensure proper batch job launch.
 -- Add new slurmrestd command/daemon which implements the Slurm REST API.
* Changes in Slurm 20.02.0pre1
==============================
 -- Avoid possible race when 2 conf files are read at the same exact time.
 -- Add last and mean backfill table size to sdiag output.
 -- Add support for additional job submit environment variables:
    SALLOC_MEM_PER_CPU, SALLOC_MEM_PER_NODE, SBATCH_MEM_PER_CPU and
    SBATCH_MEM_PER_NODE.
 -- Add 'Agent thread count' stat to sdiag.
 -- Add sdiag -M, --clusters option.
 -- NodeName configurations with CPUs != Sockets*Cores or
    Sockets*Cores*Threads will be rejected with fatal.
 -- Add scontrol write config <filename> option.
 -- Increase maximum number of hostlist ranges from 64k to 256k.
 -- Don't acquire unneeded locks in slurmctld _run_prolog thread.
 -- Fix sinfo/squeue sort by nodename/nodeaddr/hostname.
 -- Optimize getting wckey and associations usage.
 -- Keep SLURM_MPI_TYPE variable in srun when not set to 'none'.
 -- Remove slurm.spec-legacy packaging file.
 -- pam_slurm_adopt - with action_unknown=newest configured, pick a user job
    even when failing to get cgroup mtime.
 -- Fix "srun --export=" parsing to handle nested commas.
 -- Add default "reboot requested" reason to nodes when rebooting with scontrol.
 -- Duplicate PartitionName entries in slurm.conf will now fatal() instead of
    printing an error message and ignoring the successive records.
 -- Remove the smap command.
 -- Change exclusive behavior of a node to include all GRES on a node as well
    as the cpus.
 -- Append ": reboot issued" to node reason when reboot is issued from
    controller. Previously only happened when nextstate was specified.
 -- Add default jobname of "no-shell" for salloc --no-shell.
 -- Save reservation state when automatically shrinking nodes.
 -- Add slurm.conf option MaxDBDMsgs to control how many messages will be
    stored in the slurmctld before throwing them away when the slurmdbd is down.
 -- Change default SLURM_PMIX_TMPDIR to include user id to avoid potential
    conflicts on development systems running multiple Slurm instances.
 -- Return a newly added ESLURM_DEFER error and set a job state reason to
    FAIL_DEFER for immediate alloc requests if defer in SchedulerParameters.
 -- Make slurmctld fatal if unable to load a script or a job environment when
    building the launch job message.
 -- Removed the checkpoint plugin interface and all associated API calls.
 -- Add job_get_grace_time() functions to preempt plugins and refactor
    slurm_job_check_grace() to use them.
 -- Remove --disable-iso8601 configure option.
 -- Display StepId=<jobid>.batch instead of StepId=<jobid>.4294967294 in output
    of "scontrol show step". (slurm_sprint_job_step_info())
 -- Make it so you can have a grace time when preempting by requeue.
 -- Translate MpiDefault=openmpi to functionally-equivalent MpiDefault=none,
    and remove the mpi/openmpi plugin.
 -- burst_buffer/datawarp - add a set of % symbols that will be replaced by
    job details. E.g., %d will be filled in with the WorkDir for the job.
 -- Fix sacctmgr show events to support node list ranges.
 -- Add SchedulerParameters option bf_one_resv_per_job to disallow adding more
    than one backfill reservation per job.
 -- Allow sacctmgr to filter node events by states that are flags.
 -- Allow sacctmgr to filter node events by REBOOT state/flag.
 -- Add ability to set MailType and MailUser of job with scontrol.
 -- slurm_init_job_desc_msg() initializes mail_type as uint16_t. This allows
    mail_type to be set to NONE with scontrol.
 -- Add new slurm_spank_log() function to print messages back to the user from
    within a SPANK plugin. (This can be done with slurm_error() instead, but
    that will always prepend "error: " to every message which may lead to
    confusion.)
 -- Enforce specification of partition and ALL nodes with PART_NODES flag.
 -- Add 'promiscuous' flag to a reservation.
 -- Implement the idea of PURGE_COMP=timespec.
 -- SPANK - removed never-implemented slurm_spank_slurmd_init() interface. This
    hook has always been accessible through slurm_spank_init() in the
    S_CTX_SLURMD context instead.
 -- sbcast - add new BcastAddr option to NodeName lines to allow sbcast traffic
    to flow over an alternate network path.
 -- Add auth/jwt plugin.
 -- Add new 'scontrol token' subcommand.
 -- PMIx - improve performance of proc map generation.
 -- For a heterogeneous job to be considered for preemption all components must
    be eligible for preemption.
 -- Added JobCompParams to slurm.conf.
 -- Add configuration parameter DependencyParameters to slurm.conf.
 -- Deprecate kill_invalid_depend in SchedulerParameters and move it to new
    DependencyParameters.
 -- Enable job dependencies for any job on any cluster in the same federation.
 -- Stricter escaping of strings sent to Elasticsearch.
 -- Allow clusters to be added automatically to db at startup of ctld.
 -- Add AccountingStorageExternalHost slurm.conf parameter.
 -- Add support for srun -M<cluster> --jobid=# for existing remote allocations.
 -- Remove LicensesUsed from 'scontrol show config'.
 -- sbatch - adjusted backoff times for "--wait" option to reduce load on
    slurmctld. This results in a steady-state delay of 32s between queries,
    instead of the prior 10s delay.
 -- Add SchedulerParameters option bf_running_job_reserve to add backfill
    reservations for jobs running on whole nodes
 -- salloc/sbatch/srun - error on invalid --profile option strings.
 -- Remove max_job_bf option and replace with bf_max_job_test.
 -- Disable sbatch, salloc, srun --reboot for non-admins.
 -- jobcomp/elasticsearch - added connect_timeout and timeout options to
    JobCompParams.
 -- SPANK - added support for S_JOB_GID in the job script context with
    spank_get_item().
 -- Prolog/Epilog - add SLURM_JOB_GID environment variable.
 -- Add gpu/rsmi plugin to support AMD GPUs
 -- Make it so you can "stack" the energy plugins
 -- Add energy accounting plugin for AMD GPU
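
The NodeName consistency rule introduced in this release (CPUs must equal
Sockets*Cores or Sockets*Cores*Threads) can be sketched as a small predicate.
The function and parameter names are hypothetical, and this is not the
slurmctld code; it only restates the rule from the entry above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check, not the slurmctld implementation: a NodeName line is
 * consistent when CPUs equals Sockets*Cores (one CPU per core) or
 * Sockets*Cores*Threads (one CPU per hardware thread).
 */
bool node_cpus_consistent(uint32_t cpus, uint32_t sockets,
			  uint32_t cores_per_socket, uint32_t threads_per_core)
{
	/* Widen before multiplying so large counts cannot overflow. */
	uint64_t cores = (uint64_t) sockets * cores_per_socket;

	return cpus == cores || cpus == cores * threads_per_core;
}
```

For example, a node with 2 sockets, 8 cores per socket and 2 threads per core
passes with CPUs=16 or CPUs=32, while CPUs=24 would be rejected.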

* Changes in Slurm 19.05.7
==========================
 -- Fix handling of -m/--distribution options for across socket/2nd level by
    task/affinity plugin.
 -- Fix grp_node_bitmap error when slurmctld started before slurmdbd.
 -- Fix compilation issues in GCC10.
 -- Fix distributing job steps across idle nodes within a job.
 -- Break infinite loop in cons_tres dealing with incorrect tasks per tres
    request resulting in slurmctld hang.
 -- priority/multifactor - gracefully handle NULL list of associations or array
    of siblings when calculating FairTree fairshare.
* Changes in Slurm 19.05.6
==========================
 -- Fix OverMemoryKill.
 -- Fix memory leak in scontrol show config.
 -- Remove PART_NODES reservation flag after ignoring it at creation.
 -- Fix deprecation of MemLimitEnforce parameter.
 -- X11 forwarding - alter Xauthority regex to work when "FamilyWild" cookies
    are present in the "xauth list" output.
 -- Fix memory leak when utilizing core reservations.
 -- Fix issue where adding WCKeys and then using them right away didn't always
    work.
 -- Add cosmetic batch step to correct component in a hetjob.
 -- Fix to make scontrol write config create a usable config without editing.
 -- Fix memory leak when pinging backup controller.
 -- Fix issue with 'scontrol update' not enforcing all QoS / Association limits.
 -- Fix to properly schedule certain jobs with cons_tres plugin.
 -- Fix FIRST_CORES for reservations when using cons_tres.
 -- Fix sbcast -C argument parsing.
 -- Replace/deprecate max_job_bf with bf_max_job_test and print error message.
 -- sched/backfill - fix options parsing when bf_hetjob_prio enabled.
 -- Fix for --gpu-bind when no gpus requested.
 -- Fix sshare -l crash with large values.
 -- Fix printing NULL job and step pointers.
 -- Break infinite loop in cons_tres dealing with incorrect tasks per tres
    request resulting in slurmctld hang.
 -- Improve handling of --gpus-per-task to make sure appropriate number of GPUs
    is assigned to job.
* Changes in Slurm 19.05.5
==========================
 -- Fix both socket-[un]constrained GRES issues that would lead to incorrect
    GRES allocations and GRES underflow errors at deallocation time.
 -- Reject unrunnable jobs submitted to reservations.
 -- Fix misleading error returned for immediate allocation requests when defer
    in SchedulerParameters by decoupling defer from too fragmented logic.
 -- Fix printf format string error on FreeBSD.
 -- Fix parsing of delay_boot in controller when additional arguments follow it.
 -- Fix --ntasks-per-node in cons_tres.
 -- Fix array tasks getting same reject reason.
 -- Ignore DOWN/DRAIN partitions in reduce_completing_frag logic.
 -- Fix alloc_node validation when updating a job.
 -- Fix for requesting specific nodes when using cons_tres topology.
 -- Ensure X11 is set up before launching a job step.
 -- Fix incorrect SLURM_CLUSTER_NAME env var in batch step.
 -- Perl API - Fix undefined symbol for slurmdbd_pack_fini_msg.
 -- Install slurmdbd.conf.example with 0600 permissions to encourage secure
    use. CVE-2019-19727.
 -- srun - do not continue with job launch if --uid fails. CVE-2019-19728.
* Changes in Slurm 19.05.4
==========================
 -- Don't allow empty string as a reservation name; generate a name if empty
    string is provided.
 -- Fix salloc segfault when using --no-shell option.
 -- Fix divide by zero when normalizing partition priorities.
 -- Restore ability to set JobPriorityFactor to 0 on a partition.
 -- Fix multi-partition non-normalized job priorities.
 -- Adjust precedence between --mem-per-cpu and --mem-per-node to enforce
    them as mutually exclusive. Specifying either on the command line will
    now explicitly override any value inherited through the environment.
 -- Always print node's version, if it exists, in scontrol show nodes.
 -- sbatch - ensure SLURM_NTASKS_PER_NODE is exported when --ntasks-per-node
    is set.
 -- slurmctld - fix memory leak when using DebugFlags=Reservation.
 -- Reset --mem and --mem-per-cpu options correctly when using --mem-per-gpu.
 -- Use correct function signature for step_set_env() in gres plugin interface.
 -- Restore pre-19.05 hostname handling behavior for AllocNodes by always
    truncating to just the host portion and dropping any domain name portion
    returned by gethostbyaddr().
 -- Fix abort initializing a configuration without acct_gather.conf.
 -- Fix GRES binding and CLOUD nodes GRES setup regressions.
 -- Make sview work with glib2 v2.62.
 -- Fix slurmctld abort when in developer mode and submitting to multiple
    partitions with a bad QOS and not enforcing QOS.
 -- Enforce PART_NODES if only PartitionName is specified.
 -- Fix slurmd -G functionality.
 -- Fix build on 32-bit systems.
 -- Remove duplicate log entry on update job.
 -- sched/backfill - fix the estimated sched_nodes for multi-part jobs.
 -- slurm.spec - fix pmix_version global context macro.
 -- Fix cons_tres topology logic incorrectly evaluating insufficient resources.
 -- Fix job "--switches=count@time" option handling in cons_tres topology.
 -- scontrol - allow changes to the WorkDir for pending jobs.
 -- Enable coordinators to delete users if they only belong to accounts that
    the coordinator is over.
 -- Fix regression on update from older versions with DefMemPerCPU.
 -- Fix issues with --gpu-bind while using cgroups.
 -- Suspend nodes after being down for SuspendTime.
 -- Fix rebooting nodes from skipping nextstate states on boot.
 -- Fix regression in reservation creation logic from 19.05.3 which would
    incorrectly deny certain valid reservations from being created.
 -- slurmdbd - process sacct/sacctmgr job queries from older clients correctly.
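
The AllocNodes hostname handling restored in this release keeps only the host
portion of a name and drops any domain portion. A minimal sketch of that
truncation follows; the helper name is hypothetical and this is not the Slurm
code. Note that this naive version would also shorten a dotted IP address
string, so it assumes the input is a hostname.

```c
#include <string.h>

/*
 * Hypothetical helper: cut a fully qualified name such as
 * "node1.example.com" down to "node1" in place. Names without a dot are
 * left unchanged. Assumes 'name' is a hostname, not a dotted IP address.
 */
void truncate_to_short_hostname(char *name)
{
	char *dot = strchr(name, '.');

	if (dot)
		*dot = '\0';
}
```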
* Changes in Slurm 19.05.3-2
============================
 -- Fix missing include for Cray Aries systems.
* Changes in Slurm 19.05.3
==========================
 -- Fix missing check from conversion of cray -> cray_aries.
 -- Improve job state reason string when required nodes are not available by
    not including those that don't belong to the job partition.
 -- Set a more appropriate ESLURM_RESERVATION_MAINT job state reason for jobs
    requesting feature(s) and required nodes are in a maintenance reservation.
 -- Fix logic to better handle maintenance reservations.
 -- Add spank options to cache in remote callback.
 -- Enforce the use of spank_option_getopt().
 -- Fix select plugins' will run test under-allocating nodes usage for
    completing jobs.
 -- Nodes in COMPLETING state treated as being currently available for job
    will-run test.
 -- Cray - fix contribs slurm.conf.j2 with updated cray_aries plugin names.
 -- job_submit/lua - fix problem where nil was expected for min_mem_per_cpu.
 -- Fix extra, unaccounted TRESRunMins usage created by heterogeneous jobs when
    running with the priority/multifactor plugin.
 -- Detach threads once they are done to avoid having to join them
    in track scripts code.
 -- Handle situation where a slurmctld tries to communicate with slurmdbd more
    than once at the same time.
 -- Fix XOR/XAND features like cpu&fastio&[knl|westmere] to be resolved
    correctly.
 -- Don't update [min|max]_exit_code on job array task requeue.
 -- Don't assume the first node of a job is the batch host when testing if the
    job's allocated nodes are booted/ready.
 -- Make --batch=<feature> requests wait for all nodes to be booted so that it
    can choose the batch host after the nodes have been booted -- possibly with
    different features.
 -- Fix talking to batch host on its protocol version when using --batch.
 -- gres/mic plugin - add missing fini() function to clean up plugin state.
 -- Move _validate_node_choice() before prolog/epilog check.
 -- Look forward one week while creating a new reservation.
 -- Set missing resv_desc.flags before calling _select_nodes().
 -- Use correct start_time for TIME_FLOAT reservation in _job_overlap().
 -- Properly enforce a job's mem-per-cpu option when allocating the node
    exclusively to that job.
 -- sched/backfill - clear estimated sched_nodes as done for start_time.
 -- Have safe_[read|write] handle EAGAIN and EINTR.
 -- Fix checking for flag with logical AND.
 -- Correct "extern" definition of variable if compiling with __APPLE__.
 -- Deprecate FastSchedule. FastSchedule will be removed in 20.02.
    The FastSchedule=2 functionality (used for testing and development) has
    been retained as the new SlurmdParameters=config_overrides option.
 -- Fix preemption issue when picking nodes for a feature job request.
 -- Fix race condition preventing held array job from getting a db_index.
 -- Fix select/cons_tres gres code infinite loop leaving slurmctld unresponsive.
 -- Remove redefinition of global variable in gres.c
 -- Fix issue where GPU devices are denied access when MPS is enabled.
 -- Fix uninitialized errors when compiling with CFLAGS="--coverage".
 -- Fix scancel --full for proctrack/cgroups.
 -- Fix sdiag backfill last and mean queue length stats.
 -- Do not remove batch host when resizing/shrinking a batch job.
 -- nss_slurm - fix file descriptor leaks.
 -- Fix preemption for jobs using complex feature requests
    (e.g. -C "[rack1*2&rack2*4]").
 -- Fix memory leaks in preemption when jobs request multiple features.
 -- Allow Operator users to show/fix runaways.
 -- Disallow coordinators from showing/fixing runaways.
 -- mpi/pmi2 - increase array len to avoid buffer size exceeded error.
 -- Preserve rebooting node's nextstate when updating state with scontrol.
 -- Fully merge slurm.conf and gres.conf before node_config_load().
 -- Remove FastSchedule dependence from gres.conf's AutoDetect=nvml.
 -- Forbid mix of typed and untyped GRES of same name in slurm.conf.
 -- cons_tres: Prevent creating a job without CPUs.
 -- Prevent underflow when filtering cores with gres.
 -- proctrack/cray_aries: use current pid instead of thread if we're in a fork.
 -- Fix missing check for prolog launch credential creation failure that can
    lead to segfaults.
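
The safe_[read|write] entry in this release concerns retrying on EAGAIN and
EINTR. The sketch below shows the general retry pattern for a read loop. The
function read_fully() is hypothetical and is not Slurm's safe_read() macro; on
a non-blocking descriptor this simple version spins on EAGAIN instead of
waiting with poll().

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not Slurm's safe_read(): read exactly len bytes,
 * retrying when read() is interrupted (EINTR) or would block (EAGAIN).
 * Returns 0 on success, -1 on end of file or any other error.
 */
int read_fully(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n < 0) {
			if (errno == EINTR || errno == EAGAIN)
				continue;	/* transient condition: try again */
			return -1;
		}
		if (n == 0)
			return -1;	/* end of file before len bytes arrived */
		p += n;
		len -= (size_t) n;
	}
	return 0;
}
```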
* Changes in Slurm 19.05.2
==========================
 -- Wrap END_TIMER{,2,3} macro definition in "do {} while (0)" block.
 -- Allow account coordinators to add users who don't already have an
    association with any account.
 -- If only allowing particular alloc nodes in a partition, deny any request
    coming from an alloc node of NULL.
 -- Prevent partial-load of plugins which can leave certain interfaces in
    an inconsistent state.
 -- Remove stray __USE_GNU macro definitions from source.
 -- Fix loading fed state by backup on subsequent takeovers.
 -- Add missing job read lock when loading fed job state.
 -- Add missing fed_job_info jobs if fed state is lost.
 -- Do not build cgroup plugins on FreeBSD or NetBSD, and use proctrack/pgid
    by default instead.
 -- Do not build switch/cray_aries plugin on FreeBSD, NetBSD, or macOS.
 -- Fix build on FreeBSD.
 -- Fix race condition in route/topology plugin.
 -- In munge decode set the alloc_node field to the text representation of an
    IP address if the reverse lookup fails.
 -- Fix infinite loop in slurmstepd handling for nss_slurm REQUEST_GETGR RPC.
 -- Fix slurmstepd early assertion fail which prevented batch job launch or
    tasks launch on non-Linux systems.
 -- Fix regression with SLURM_STEP_GPUS env var being renamed SLURM_STEP_GRES.
 -- Fix pmix v3 linking if no rpath is allowed on build.
 -- Fix sacctmgr error handling when removing associations and users.
 -- Allow sacctmgr to add users to WCKeys without having TrackWCKey set in the
    slurm.conf.
 -- Allow sacctmgr to delete WCKeys from users.
 -- Change GRES type set by gpu/gpu_nvml plugin to be more specific - based
    on device name instead of brand name.
 -- cli_filter - fix logic error with option lookup functions.
 -- Fix bad testing of NodeFeatures debug flag in contribs/cray.
 -- Cleanup track_script code to avoid race conditions and invalid memory
    access.
 -- Fix jobs being killed after being requeued by preemption.
 -- Make register nodes verify correctly when using cons_tres.
 -- Fix srun --mem-per-cpu being ignored.
 -- Fix segfault in _update_job() under certain conditions.
 -- job_submit/lua - restore slurm.FAILURE as a synonym for slurm.ERROR.
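
The END_TIMER change at the top of this release applies a common C idiom:
wrapping a multi-statement macro in "do {} while (0)" so it behaves as a single
statement. The sketch below uses hypothetical macros (not Slurm's END_TIMER
definitions) to show what goes wrong without the wrapper.

```c
/* Hypothetical two-statement macros; these are NOT Slurm's END_TIMER. */
#define BUMP_BOTH_UNSAFE(a, b)	(a)++; (b)++
#define BUMP_BOTH_SAFE(a, b)	do { (a)++; (b)++; } while (0)

/* Encodes the final (a, b) pair as a * 10 + b for easy comparison. */
int bump_unsafe(int cond)
{
	int a = 0, b = 0;

	if (cond)
		BUMP_BOTH_UNSAFE(a, b);	/* only (a)++ is guarded; (b)++ always runs */
	return a * 10 + b;
}

int bump_safe(int cond)
{
	int a = 0, b = 0;

	if (cond)
		BUMP_BOTH_SAFE(a, b);	/* both increments are guarded */
	return a * 10 + b;
}
```

With cond set to 0, bump_unsafe() still increments b, while bump_safe() leaves
both counters untouched.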
* Changes in Slurm 19.05.1-2
============================
 -- Fix mistake in QOS time limit calculations for UsageFactor != 0 with any
    combination of flags set.
* Changes in Slurm 19.05.1
==========================
 -- accounting_storage/mysql - fix incorrect function names in error messages.
 -- accounting_storage/slurmdbd - trigger an fsync() on the dbd.messages state
    file to ensure it is committed to disk properly.
 -- Prevent JobHeldUser state reason from being updated at allocation time.
 -- Fix dump/load of rejected heterogeneous jobs.
 -- For heterogeneous jobs, do not count each component against the QOS or
    association job limit multiple times.
 -- Comment out documentation for the incomplete and currently unusable
    burst_buffer/generic plugin.
 -- Add new error ESLURM_INVALID_TIME_MIN_LIMIT to make note when a time_min
    limit is invalid based on timelimit.
 -- Correct slurmdb cluster record pack with NULL pointer input.
 -- Clearer error message for ESLURM_INVALID_TIME_MIN_LIMIT.
 -- Fix SchedulerParameters bf_min_prio_reserve error when it is not the last
    parameter.
 -- When fixing runaway jobs, change to reroll from earliest submit time, and
    never reroll from Unix epoch.
 -- Display submit time when running sacctmgr show runawayjobs and add format
    option to display eligible time.
 -- jobcomp/elasticsearch - fix minor race related to JobCompLoc setup.
 -- For HetJobs, ensure SLURM_PACK_JOB_ID is set regardless of whether
    PrologFlags=Alloc is enabled.
 -- Fix PriorityFlags regression with the mutation of FAIR_TREE to NO_FAIR_TREE.
 -- select/cons_res - fix debug flag SelectType handling in select_p_job_test.
 -- Fix sacctmgr archive dump commit confirmation.
 -- Prevent extra resources from being allocated when combining certain flags.
 -- Cray - fix template generator with updated cray_aries plugin names.
 -- accounting_storage/slurmdbd - provide additional detail in several error
    messages.
 -- Backfill - If a job has a time_limit guess the end time of a job better
    if OverTimeLimit is Unlimited.
 -- Remove premature call to get system gpus before querying fake gpus that
    should override the real.
 -- Fix segfault in epilog_set_env() when gres_devices is NULL.
 -- Fix (un)supported states in sacct.
 -- Adjust build system to no longer use the AC_FUNC_MALLOC autoconf macro.
 -- srun - restore the --cpu_bind option to srun.
 -- Add UsageFactorSafe QOS flag to control applying UsageFactor at
    submission/scheduling time.
 -- Create missing reservations on DBD_MODIFY_RESV.
 -- Add error message when attempting to update association manager and object
    doesn't exist.
 -- Fix security issue in accounting_storage/mysql plugin on archive file loads
    by always escaping strings within the slurmdbd. CVE-2019-12838.
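
The accounting_storage/slurmdbd entry in this release adds an fsync() on the
dbd.messages state file. The general write-then-fsync pattern can be sketched
as below. The helper write_and_sync() and its path argument are hypothetical
and this is not the plugin's code; it does not retry on EINTR.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the accounting_storage/slurmdbd code: write a
 * buffer to 'path' and call fsync() before closing so the data is flushed
 * to stable storage. Returns 0 on success, -1 on failure.
 */
int write_and_sync(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n <= 0) {		/* treat a zero-length write as failure */
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}
	if (fsync(fd) < 0) {		/* force the file contents to disk */
		close(fd);
		return -1;
	}
	return close(fd);
}
```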
* Changes in Slurm 19.05.0
==========================
 -- Fix deprecated group by clause to use order by.
 -- NVML - Get rid of unneeded * when passing nvmlDevice_t to functions.
 -- NVML - Fix clang warning about unneeded variable initialization.
 -- NVML - remove unneeded {}.
 -- Add timers to new site_factor plugin APIs to warn of slow-running plugins,
    which can lead to issues with throughput and responsiveness.
 -- X11 forwarding - ignore screen value for local DISPLAY.
 -- Add missing locks protecting slurmctld_config.server_thread_count access.
 -- Fix jobs stuck from FedJobLock when requeueing in a federation
 -- Fix requeueing job in a federation of clusters with differing associations
 -- sacctmgr - free memory before exiting in 'sacctmgr show runaway'.
 -- Fix seff showing memory overflow when steps tres mem usage is 0.
 -- Fix memory leaks in 'sacctmgr show runawayjobs'.
 -- Fix potential deadlock in nss_slurm.
 -- Fix memory leaks due to incomplete slurmdb_cluster_cond_t destructor.
 -- Alter reservation flags column in slurmdbd to use uint64_t instead of
    uint16_t to ensure all current flags are saved correctly. Older releases
    unfortunately could not store details for newer flags (using bits 17-32)
    due to this field being silently truncated.
 -- Modify task layout with --overcommit option plus a heterogeneous job
    allocation so that a cyclic task distribution can start happening before
    all CPUs on all nodes are fully allocated. The number of tasks per node
    will be unchanged from the previous algorithm, but tasks will be distributed
    in a cyclic fashion first and then extra tasks placed on nodes with more
    CPUs. Previously all CPUs would be fully allocated in a cyclic fashion,
    then excess tasks distributed evenly across all allocated nodes.
 -- In select/cons_tres: Only allocate 1 CPU per node with the --overcommit
    option.
 -- In select/cons_res: Only allocate 1 CPU per node with the --overcommit and
    --nodelist options.
 -- Fix DefMemPer[CPU|Node] assignment on multi-partition job requests.
 -- Fix wrongly setting start_time to 0 for multi-part jobs.
 -- Upon archive file name collision, create new archive file instead of
    overwriting the old one to prevent lost records.
 -- Limit archive files to 50000 records per file so that archiving large
    databases will succeed.
 -- Remove stray newlines in SPANK plugin error messages.
 -- Fix archive loading events.
 -- Fix main scheduler from potentially not running through whole queue.
 -- Fix variable initialization to avoid slurmctld abort.
 -- In partition preemption, sort preemptor jobs only if they overlap a
    preemptable partition.
 -- cons_tres/dist_tasks - fix variable usage in cyclic distribution.
 -- cons_res/job_test - prevent a job from overallocating a node's memory.
 -- cons_res/job_test - fix to consider a node's current allocated memory when
    testing a job's memory request.
 -- Fix issue where multi-node job steps on cloud nodes wouldn't finish cleaning
    up until the end of the job (rather than the end of the step).
 -- Fix packing pack_jobid in an sbcast.
 -- Fix GCC 9 compiler warnings.
 -- Add new job bit_flags of JOB_DEPENDENT.
 -- Make it so dependent jobs reset the AccrueTime and do not count against any
    AccrueTime limits.
 -- Fix sacctmgr --parsable2 output for reservations and tres.
 -- In multi-node systems make sure GRES are found on node when not bound to
    specific sockets.
 -- Fix gres-per-task logic for gres not bound to sockets.
 -- Fix issue when --gpus plus --cpus-per-gres was forcing socket binding
    unnecessarily.
 -- Change event table's state column to handle 32bits.
 -- Prevent slurmctld from potential segfault after job_start_data() called
    for completing job.
 -- Fix jobs getting on nodes with "scontrol reboot asap".
 -- Record node reboot events to database.
 -- Fix node reboot failure message getting to event table.
 -- Don't write "(null)" to event table when no event reason exists.
 -- Fix invalid memory read in cons_tres.
 -- Fix minor memory leak when clearing runaway jobs.
 -- Avoid flooding slurmctld and logging when prolog complete RPC errors occur.
 -- Fix slurmctld node_scheduler's feature_bitmap memory leak.
 -- Fatal when reading config if Alloc flag configured on FrontEnd mode.
 -- Modifications needed to run Federations with clusters running
    different select/switch plugins.
 -- Fix Clang errors for zero initializing struct with nested arrays.
 -- Fix minor memory leak in pmi2.
 -- MySQL - Fix minor memory leak when querying suspended jobs fails.
 -- Fix seff human readable memory string for values below a megabyte.
 -- Avoid slurmctld abort if GRES defined in gres.conf, but not in the node
    configuration of slurm.conf.
 -- Calculate task count for job with --gpus-per-task option, but no explicit
    task count.
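
The reservation flags entry above (uint16_t widened to uint64_t) describes a
silent truncation. A minimal sketch of that effect follows, assuming a column
that simply keeps the low bits of whatever is written to it. The flag values
and the store_flags() helper are hypothetical and are not Slurm's constants
or code.

```python
def store_flags(flags: int, column_bits: int) -> int:
    """Keep only the low `column_bits` bits, as a fixed-width unsigned column would."""
    return flags & ((1 << column_bits) - 1)


# Hypothetical flag values for illustration; these are not Slurm's constants.
OLD_FLAG = 1 << 3    # fits within the low 16 bits
NEW_FLAG = 1 << 20   # lies above the low 16 bits

flags = OLD_FLAG | NEW_FLAG
narrow = store_flags(flags, 16)  # the newer flag is dropped without any error
wide = store_flags(flags, 64)    # both flags survive
```

With the 16-bit width only OLD_FLAG is left in the stored value, which is why
details for newer flags could not be recovered from older databases.
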
* Changes in Slurm 19.05.0rc1
=============================
 -- Set CUDA_VISIBLE_DEVICES environment variable in Prolog and Epilog for jobs
    requesting gres/gpu.
 -- Remove '-U' argument - which was deprecated when '-A' was made the single
    character option before the Slurm 2.1 release - as an alternative to
    '--account' for salloc/sbatch/srun.
 -- Remove direct BLCR support and srun_cr.
 -- Make slurm_print_node_table only print a node's slurmd version if it is
    different to the one reported by slurm_load_ctl_conf.
 -- Call gres plugin environment setup even if gres not requested in job.
 -- Do not set CUDA_VISIBLE_DEVICES=NoDevFiles when no gres requested.
 -- If GRES configuration data is unavailable from gres.conf, then use the
    node's "Gres=" information from slurm.conf. This will eliminate or minimize
    the gres.conf file in many situations.
 -- Fix checking IPMI XCC raw command response length.
 -- jobacct_gather/common - improve lightweight process identification.
 -- Cloud/PowerSave Improvements:
    - Better responsiveness to resuming and suspending.
    - Powering down nodes are not eligible to be allocated until after
      SuspendTimeout.
    - Powering down nodes are put in "Powering Down / %" state until after
      SuspendTimeout.
 -- Add idle_on_node_suspend SlurmctldParameter to make nodes idle regardless
    of state when suspended.
 -- Add PowerSave DebugFlag for Suspend/Resume debugging.
 -- Changed "scontrol reboot" to not default to ALL nodes.
 -- Changed "scontrol completing" to include two new fields - EndTime and
    CompletingTime.
 -- select/cons_tres - prevent job from overallocating a node's memory.
 -- Refactor CLI option parsing for salloc/sbatch/srun into a central set of
    functions in src/common/slurm_opt.c. Note that this new option parsing can
    be stricter in a few specific situations - places that used to ignore
    invalid options and still submit/launch a job or job step may return an
    error() and refuse to proceed instead.
 -- Add preempt_send_user_signal SlurmctldParameter option to send user
    signal (e.g. --signal=<SIG_NUM>) at preemption if it hasn't already been
    sent.
 -- Add PreemptExemptTime parameter to slurm.conf and QOS to guarantee a
    minimum runtime before preemption.
 -- Set job's preempt time for non-grace time preemptions.
 -- Add sinfo format option to show used gres.
 -- Add reboot_from_controller SlurmctldParameter to allow RebootProgram to be
    run from the controller instead of the slurmds.
 -- Fix increasing of job size when extern steps exist.
 -- Reset GPU-related arguments to salloc/sbatch/srun for each separate
    heterogeneous job component.
 -- Do not set "(null)" for SLURM_JOB_CONSTRAINTS when no constraints are set
    in PrologSlurmctld/EpilogSlurmctld.
 -- Add SRUN_EXPORT_ENV as an input environment variable to srun.
 -- Return an error for invalid #SBATCH directives, and do not submit the job.
 -- Add S_JOB_ARRAY_ID and S_JOB_ARRAY_TASK_ID to spank_get_item().
 -- Change container_{g,p}_add_pid() to container_{g,p}_join() and remove the
    'pid_t pid' argument.
 -- Add new site_factor plugin type to permit sites to build plugins to set
    and modify the site priority factor value both initially on job submission,
    and periodically every PriorityCalcPeriod.
 -- Rename Cray plugins to cray_aries in preparation for Cray/Shasta.
 -- Allow Het Jobs to work on a Cray.
 -- Add new cli_filter plugin type to permit sites to build plugins to log,
    modify, or reject CLI options within the salloc/sbatch/srun commands
    themselves.
 -- Allocate nodes that are booting. Previously, nodes that were being booted
    were off limits for allocation. This caused more nodes to be booted than
    needed in a cloud environment.
 -- pam_slurm_adopt - inject SLURM_JOB_ID environment variable into adopted
    processes.
 -- PMIx - use the Tree-based collective for empty fence operations.
 -- PMIx - replace use of the non-standard PMIX_VAL_SET macro with the
    standardized PMIX_VALUE_LOAD macro.
 -- slurm.spec - change --without cray option to set configure option of
    --enable-really-no-cray.
 -- slurm.spec - add new --with slurmsmwd option.
 -- pmi2: add mutex locking to all API calls to ensure thread-safety.
 -- Fix QOS usage factor to apply to TRES time limits and usage.
 -- Fix multi-cluster srun's with Select/Cray and other_cons_res.
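
Two entries above concern CUDA_VISIBLE_DEVICES: it is now set in Prolog and
Epilog for jobs requesting gres/gpu, and it is no longer set to "NoDevFiles"
when no gres is requested. The following is a minimal sketch of how a
site-written Prolog helper might read the variable. The function name is
hypothetical, and the comma-separated value format is the usual CUDA
convention rather than something stated in this file.

```python
import os


def visible_gpu_ids(env=None):
    """Return the device identifiers listed in CUDA_VISIBLE_DEVICES.

    An unset or empty variable, or the legacy "NoDevFiles" placeholder,
    is treated as "no GPUs assigned".
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    if not raw or raw == "NoDevFiles":
        return []
    # Assumption: identifiers are separated by commas.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]
```

Passing a plain dict as env makes the helper easy to exercise outside of a
real job.
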
* Changes in Slurm 19.05.0pre3
==============================
 -- Fix RPM packaging for accounting_storage/mysql.

* Changes in Slurm 19.05.0pre2
==============================
 -- Removed select/serial plugin.
 -- Remove 512-character line length limit in slurm_print_topo_record().
    (Used by "scontrol show topology".)
 -- Removed crypto/openssl plugin.
 -- Tweak the sdiag gettimeofday() line format for greater clarity.
 -- Add support for SALLOC/SBATCH/SLURM_NO_KILL environment variables.
    Add salloc/sbatch/srun support for optional "--no-kill=off" option to
    disable the environment variables.
 -- Fix salloc and missing SLURM_NTASKS.
 -- Alter the backfill scheduler behavior to prevent it from scheduling lower
    priority jobs on resources that become available during the backfill
    scheduling cycle when bf_continue is enabled. This behavior was available
    as the bf_ignore_newly_avail_nodes option in 18.08.4+, but is now enabled
    by default. (The SchedulerParameters option of bf_ignore_newly_avail_nodes
    is also now removed, although harmless if still set.)
 -- Make LaunchParameters=send_gids the default, introducing the reverse option
    "disable_send_gids" to go back to the original behavior.
 -- Limit pam_slurm_adopt to run only in the sshd context by default, for
    security reasons. A new module option 'service=<name>' can be used to
    allow different PAM applications to work. The option 'service=*' can be
    used to restore the old behavior of always performing the adopt logic
    regardless of the PAM application context.
 -- pam_slurm_adopt: Use uid to determine whether root is logging in.
 -- Remove sbatch --x11 option. Slurm's internal X11 forwarding is now only
    supported from salloc, or an allocating srun command.
 -- Suppressed printing of job id in sbatch when quiet flag is set.
 -- Changed sreport 'SizesByAccount' and 'SizesByAccountAndWckey' default
    behavior and added new 'AcctAsParent' option.
 -- Add ave watts to api and sview.
 -- Added printf attribute to setenvf() and corrected related warnings.
 -- Kill a running/pending job if it is allocated GRES, that GRES has a "File"
    configuration, and the GRES count changes.
 -- Add new DebugFlag=Accrue for accrue accounting debugging purposes.
 -- Change CryptoType option to CredType, and rename crypto/munge plugin to
    cred/munge.
 -- Add slurmd -G option to print GRES configuration and exit. This is useful
    for testing and debugging.
 -- Support GRES types that include numbers (e.g. "--gres=gpu:123g:2").
 -- Remove MemLimitEnforce parameter and move functionality into
    JobAcctGatherParam=OverMemoryKill.
 -- sview - disable admin mode option (which would not work anyways) if the
    user is not an admin in SlurmDBD.
 -- Remove joules reporting from sview and scontrol.
 -- Change the default fair share algorithm to "fair tree". The new
    PriorityFlags option of NO_FAIR_TREE can be used to revert to "classic"
    fair share scheduling instead.
 -- libslurmdb has been merged into libslurm.
 -- Added -b as a short option for --begin and removed the -b option which
    was a left over artifact from the Moab compatibility work.
 -- Add ArrayTaskThrottle to "scontrol show job" output.
 -- Added SPRIO_FORMAT env variable to the sprio command.
 -- Add batch step at the beginning of a batch job so that squeue, sstat, and
    sacct will show the batch step.
 -- Deprecated 32-bit builds.
 -- Make -l and -o mutually exclusive in sacct, squeue, sinfo, and sprio.
 -- Disable running job expansion by default. A new SchedulerParameter of
    permit_job_expansion has been added for sites that wish to re-enable it.
 -- Permit changing a job array's ArrayTaskThrottle value even if the job is
    terminated (for job requeue).
 -- Add scontrol requeue option of "Incomplete" which will requeue jobs only if
    they failed to complete with an exit code of zero.
 -- Modify GrpNodes limit to apply to unique nodes allocated (avoid double
    counting nodes allocated to multiple jobs in the same QOS or association).
 -- If a job submit does NOT include --cpus-per-task option, then report the
    value as "N/A" rather than always mapping the value to 1.
 -- X11 forwarding - use the raw value from gethostname() with xauth to avoid
    authentication issues when Slurm has internally stripped off the domain
    portion.
 -- Change how slurmd fills in the registration message version string from
    PACKAGE_VERSION to SLURM_VERSION_STRING, affecting how the version is
    displayed with sview, sinfo, scontrol and through the API.
 -- Remove autogen.sh script. Please use the autoreconf command instead.
 -- Disable a configuration of SelectTypeParameters=CR_ONE_TASK_PER_CORE with
    SelectType=select/cons_tres. This will be addressed later.
 -- job_submit/lua - expose more fields off the partition record.
 -- task/cgroup - prevent setting a memory.soft_limit_in_bytes higher than the
    memory.limit_in_bytes since the hard limit will take precedence anyway.
 -- If a GrpNodes limit is configured in an association, partition QOS or
    job QOS then favor use of nodes already allocated to that entity. This
    will result in the configured node "Weight" being incremented by one for
    nodes which are not preferred. Consider adjusting configured node "Weight"
    values to achieve the desired node preferences.
 -- Add full node state debug2 output to slurmdbd node up/down update
 -- Set CUDA_VISIBLE_DEVICES and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment
    variables in Prolog and Epilog for jobs requesting gres/mps.
 -- Added thresholds for backfill parameters.
 -- Fix for backfill sleep overflow when large values are set.
 -- Execute Epilog on nodes relinquished from job (i.e. job resized).
 -- Rename burst_buffer/cray plugin to burst_buffer/datawarp.
 -- X11 Forwarding - reimplement using new internal network forwarding RPCs.
 -- Remove slurm_jobcomp_get_errno and slurm_jobcomp_strerror from jobcomp
    plugin API.
 -- Optimize backfill for checking max jobs per assoc, partition, user, etc.
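
The task/cgroup entry above notes that the hard memory limit takes precedence
over the soft limit. One plausible reading, sketched below, is that a
requested soft limit is capped at the hard limit before it is written. This
is an assumption for illustration only. The function name is hypothetical,
and the plugin may equally skip the setting instead of capping it.

```python
def effective_soft_limit(soft_bytes: int, hard_bytes: int) -> int:
    """Cap the soft limit so that it never exceeds the hard limit."""
    return min(soft_bytes, hard_bytes)
```
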
* Changes in Slurm 19.05.0pre1
==============================
 -- Run epilog and clean up allocation when a job is resized to zero and its
    resources transferred to another job (--depend=expand).
 -- If GRES are associated with specific sockets, identify those sockets in the
    output of "scontrol show node". For example, if all 4 GPUs on a node are
    associated with socket zero, then "Gres=gpu:4(S:0)". If associated
    with sockets 0 and 1 then "Gres=gpu:4(S:0-1)". The information of which
    specific GPUs are associated with specific sockets is not reported, but is
    only available by parsing the gres.conf file.
 -- Add configuration parameter "GpuFreqDef" to control a job's default GPU
    frequency.
 -- Add job flags to the database.  Currently used to determine which scheduler
    scheduled the job.
 -- Add constraints/features to the database.
 -- Add last reason job didn't run before resources/priority to the database.
 -- Make it so we set the alloc_node in a resource allocation based on the auth
    plugin instead of the rpc call.
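
The "scontrol show node" entry above gives two sample values,
"Gres=gpu:4(S:0)" and "Gres=gpu:4(S:0-1)". A minimal sketch of a parser for
exactly those shapes follows. The function name is hypothetical, and the
pattern is an assumption derived only from the two samples: typed GRES,
multiple GRES per line, and comma-separated socket lists are not handled.

```python
import re

# name:count with an optional "(S:first)" or "(S:first-last)" socket suffix.
_GRES_RE = re.compile(
    r"^(?P<name>[^:()]+):(?P<count>\d+)"
    r"(?:\(S:(?P<first>\d+)(?:-(?P<last>\d+))?\))?$"
)


def parse_gres(value: str):
    """Return (name, count, sockets) for strings such as 'Gres=gpu:4(S:0-1)'."""
    if value.startswith("Gres="):
        value = value[len("Gres="):]
    match = _GRES_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported GRES string: {value!r}")
    sockets = []
    if match.group("first") is not None:
        first = int(match.group("first"))
        last_text = match.group("last")
        last = int(last_text) if last_text is not None else first
        sockets = list(range(first, last + 1))
    return match.group("name"), int(match.group("count")), sockets
```
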
* Changes in Slurm 18.08.10
===========================

* Changes in Slurm 18.08.9
==========================
 -- Wrap END_TIMER{,2,3} macro definition in "do {} while (0)" block.
 -- Make sview work with glib2 v2.62.
 -- Make Slurm compile on linux after sys/sysctl.h was deprecated.
 -- Install slurmdbd.conf.example with 0600 permissions to encourage secure
    use. CVE-2019-19727.
 -- srun - do not continue with job launch if --uid fails. CVE-2019-19728.
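
The slurmdbd.conf.example entry above installs the file with 0600 permissions
to encourage secure use. A minimal sketch of a site-side check for the same
property follows: it reports whether a file grants no access to group or
other. The function names are hypothetical and this is not part of Slurm.

```python
import os
import stat


def mode_is_owner_only(mode: int) -> bool:
    """True when no group or other permission bits are set in `mode`."""
    return (stat.S_IMODE(mode) & 0o077) == 0


def file_is_owner_only(path: str) -> bool:
    """Apply the same check to a file on disk."""
    return mode_is_owner_only(os.stat(path).st_mode)
```
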
* Changes in Slurm 18.08.8
==========================
 -- Update "xauth list" to use the same 10000ms timeout as the other xauth
    commands.
 -- Fix issue in gres code to handle a gres cnt of 0.
 -- Don't purge jobs if backfill is running.
 -- Verify job is pending add/removing accrual time.
 -- Don't abort when the job doesn't have an association that was removed
    before the job was able to make it to the database.
 -- Set state_reason if select_nodes() fails job for QOS or Account.
 -- Avoid seg_fault on referencing association without a valid_qos bitmap.
 -- If Association/QOS is removed on a pending job set that job as ineligible.
 -- When changing a job's account/qos, always remove the old limits.
 -- Don't reset a FAIL_QOS or FAIL_ACCOUNT job reason until the qos or
    account changed.
 -- Restore "sreport -T ALL" functionality.
 -- Correctly typecast signals being sent through the api.
 -- Properly initialize structures throughout Slurm.
 -- Sync "numtask" squeue format option for jobs and steps to "numtasks".
 -- Fix sacct -PD to avoid CA before start jobs.
 -- Fix potential deadlock with backup slurmctld.
 -- Fixed issue with jobs not appearing in sacct after dependency satisfied.
 -- Fix showing non-eligible jobs when asking with -j and not -s.
 -- Fix issue with backfill scheduler scheduling tasks of an array
    when not the head job.
 -- accounting_storage/mysql - fix SIGABRT in the archive load logic.
 -- accounting_storage/mysql - fix memory leak in the archive load logic.
 -- Limit records per single SQL statement when loading archived data.
 -- Fix unnecessary reloading of job submit plugins.
 -- Allow job submit plugins to be turned on/off with a reconfigure.
 -- Fix segfault when loading/unloading Lua job submit plugin multiple times.
 -- Fix printing duplicate error messages of jobs rejected by job submit plugin.
 -- Fix printing of job submit plugin messages of het jobs without pack id.
 -- Fix memory leak in group_cache.c
 -- Fix jobs stuck from FedJobLock when requeueing in a federation
 -- Fix requeueing job in a federation of clusters with differing associations
 -- sacctmgr - free memory before exiting in 'sacctmgr show runaway'.
 -- Fix seff showing memory overflow when steps tres mem usage is 0.
 -- Upon archive file name collision, create new archive file instead of
    overwriting the old one to prevent lost records.
 -- Limit archive files to 50000 records per file so that archiving large
    databases will succeed.
 -- Remove stray newlines in SPANK plugin error messages.
 -- Fix archive loading events.
 -- In select/cons_res: Only allocate 1 CPU per node with the --overcommit and
    --nodelist options.
 -- Fix main scheduler from potentially not running through whole queue.
 -- cons_res/job_test - prevent a job from overallocating a node's memory.
 -- cons_res/job_test - fix to consider a node's current allocated memory when
    testing a job's memory request.
 -- Fix issue where multi-node job steps on cloud nodes wouldn't finish cleaning
    up until the end of the job (rather than the end of the step).
 -- Fix issue with a 17.11 sbcast call to a 18.08 daemon.
 -- Add new job bit_flags of JOB_DEPENDENT.
 -- Make it so dependent jobs reset the AccrueTime and do not count against any
    AccrueTime limits.
 -- Fix sacctmgr --parsable2 output for reservations and tres.
 -- Prevent slurmctld from potential segfault after job_start_data() called
    for completing job.
 -- Fix jobs getting on nodes with "scontrol reboot asap".
 -- Record node reboot events to database.
 -- Fix node reboot failure message getting to event table.
 -- Don't write "(null)" to event table when no event reason exists.
 -- Fix minor memory leak when clearing runaway jobs.
 -- Avoid flooding slurmctld and logging when prolog complete RPC errors occur.
 -- Fix GCC 9 compiler warnings.
 -- Fix seff human readable memory string for values below a megabyte.
 -- Fix dump/load of rejected heterogeneous jobs.
 -- For heterogeneous jobs, do not count each component against the QOS or
    association job limit multiple times.
 -- slurmdbd - avoid reservation flag column corruption with the use of newer
    flags, instead preserve the older flag fields that we can still fit in the
    smallint field, and discard the rest.
 -- Fix security issue in accounting_storage/mysql plugin on archive file loads
    by always escaping strings within the slurmdbd. CVE-2019-12838.
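
The archive entries above limit each archive file to 50000 records so that
large databases can be archived. A minimal sketch of that kind of batching
follows, assuming records arrive as any iterable. Only the 50000 figure comes
from the entry. The helper name is hypothetical and this is not the slurmdbd
implementation.

```python
from itertools import islice

MAX_RECORDS_PER_FILE = 50000  # figure taken from the changelog entry


def split_records(records, limit=MAX_RECORDS_PER_FILE):
    """Yield successive lists of at most `limit` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, limit))
        if not batch:
            return
        yield batch
```

Each yielded batch would correspond to one archive file, so no single file
grows beyond the limit regardless of how many records are purged.
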
* Changes in Slurm 18.08.7
==========================
 -- Set debug statement to debug2 to avoid benign error messages.
 -- Add SchedulerParameters option of bf_hetjob_immediate to attempt to start
    a heterogeneous job as soon as all of its components are determined able to
    do so.
 -- Fix underflow causing decay thread to exit.
 -- Fix main scheduler not considering hetjobs when building the job queue.
 -- Fix regression for sacct to display old jobs without a start time.
 -- Fix setting correct number of gres topology bits.
 -- Update hetjobs pending state reason when appropriate.
 -- Fix accounting_storage/filetxt's understanding of TRES.
 -- Set Accrue time when not enforcing limits.
 -- Fix srun segfault when requesting a hetjob with test_exec or bcast options.
 -- Hide multipart priorities log message behind Priority debug flag.
 -- sched/backfill - Make hetjobs sensitive to bf_max_job_start.
 -- Fix slurmctld segfault due to job's partition pointer NULL dereference.
 -- Fix issue with OR'ed job dependencies.
 -- Add new job's bit_flags of INVALID_DEPEND to prevent rebuilding a job's
    dependency string when it has at least one invalid and purged dependency.
 -- Promote federation unsynced siblings log message from debug to info.
 -- burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes.
 -- burst_buffer/cray - fix memory leak due to unfreed job script content.
 -- node_features/knl_cray - fix script_argv use-after-free.
 -- burst_buffer/cray - fix script_argv use-after-free.
 -- Fix invalid reads of size 1 due to non null-terminated string reads.
 -- Add extra debug2 logs to identify why BadConstraints reason is set.
* Changes in Slurm 18.08.6-2
============================
 -- Remove deadlock situation when logging and --enable-debug is used.
 -- Fix RPM packaging for accounting_storage/mysql.
* Changes in Slurm 18.08.6
==========================
 -- Added parsing of -H flag with scancel.
 -- Fix slurmsmwd build on 32-bit systems.
 -- acct_gather_filesystem/lustre - add support for Lustre 2.12 client.
 -- Fix per-partition TRES factors/priority
 -- Fix per-partition NICE priority
 -- Fix partition access check validation for multi-partition job submissions.
 -- Prevent segfault on empty response in 'scontrol show dwstat'.
 -- node_features/knl_cray plugin - Preserve node's active features if it has
    already booted when slurmctld daemon is reconfigured.
 -- Detect missing burst buffer script and reject job.
 -- GRES: Properly reset the topo_gres_cnt_alloc counter on slurmctld restart
    to prevent underflow.
 -- Avoid errors from packing accounting_storage_mysql.so when RPM is built
    without mysql support.
 -- Remove deprecated -t option from slurmctld --help.
 -- acct_gather_filesystem/lustre - fix stats gathering.
 -- Enforce documented default usage start and end times when querying jobs from
    the database.
 -- Fix issues when querying running jobs from the database.
 -- Deny sacct request where start time is later than the end time requested.
 -- Fix sacct verbose about time and states queried.
 -- burst_buffer/cray - allow 'scancel --hurry <jobid>' to tear down a burst
    buffer that is currently staging data out.
 -- X11 forwarding - allow setup if the DISPLAY environment variable lacks
    a screen number. (Permit both "localhost:10.0" and "localhost:10".)
 -- docs - change HTML title to include the page title or man page name.
 -- X11 forwarding - fix an unnecessary error message when using the
    local_xauthority X11Parameters option.
 -- Add use_raw_hostname to X11Parameters.
 -- Fix smail so it passes job arrays to seff correctly.
 -- Don't check InactiveLimit for salloc --no-shell jobs.
 -- Add SALLOC_GRES and SBATCH_GRES as input to salloc/sbatch.
 -- Remove drain state when node doesn't reboot by ResumeTimeout.
 -- Fix considering "resuming" nodes in scheduling.
 -- Do not kill suspended jobs due to exceeding time limit.
 -- Add NoAddrCache CommunicationParameter.
 -- Don't ping powering up cloud nodes.
 -- Add cloud_dns SlurmctldParameter.
 -- Consider --sbindir configure option as the default path to find slurmstepd.
 -- Fix node state printing of DRAINED$
 -- Fix spamming dbd of down/drained nodes in maintenance reservation.
 -- Avoid buffer overflow in time_str2secs.
 -- Calculate suspended time for suspended steps.
 -- Add null check for step_ptr->step_node_bitmap in _pick_step_nodes.
 -- Fix multi-cluster srun issue after 'scontrol reconfigure' was called.
 -- Fix accessing response_cluster_rec outside of write locks.
 -- Fix Lua user messages not showing up on rejected submissions.
 -- Fix printing multi-line error messages on rejected submissions.
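
The X11 forwarding entry above permits both "localhost:10.0" and
"localhost:10" as DISPLAY values. A minimal sketch of a parser that accepts
both forms follows. The function name is hypothetical, the parsing is
deliberately simple, and it is not Slurm's own code.

```python
def parse_display(display: str):
    """Split a DISPLAY value into (host, display_number, screen_or_None)."""
    host, sep, rest = display.rpartition(":")
    if not sep or not rest:
        raise ValueError(f"no display number in {display!r}")
    number, dot, screen = rest.partition(".")
    # The screen number is optional; report None when it is absent.
    return host, int(number), int(screen) if dot else None
```
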
* Changes in Slurm 18.08.5-2
============================
 -- Fix Perl build for 32-bit systems.

* Changes in Slurm 18.08.5
==========================
 -- Backfill - If a job has a time_limit, make a better guess of its end time
    when OverTimeLimit is Unlimited.
 -- Fix "sacctmgr show events event=cluster"
 -- Fix sacctmgr show runawayjobs from sibling cluster
 -- Avoid bit offset of -1 in call to bit_nclear().
 -- Ensure that "hbm" is a configured GresType on knl systems.
 -- Fix NodeFeaturesPlugins=node_features/knl_generic to allow gres other
    than knl.
 -- cons_res: Prevent overflow on multiply.
 -- Better debug for bad values in gres.conf.
 -- Fix double accounting of energy at end of job.
 -- Read gres.conf for cloud nodes on slurmctld.
 -- Don't assume the first node of a job is the batch host when purging jobs
    from a node.
 -- Better debugging when a job doesn't have a job_resrcs ptr.
 -- Store ave watts in energy plugins.
 -- Add XCC plugin for reading Lenovo Power.
 -- Fix minor memory leak when scheduling rebootable nodes.
 -- Fix debug2 prefix for sched log.
 -- Fix printing correct SLURM_JOB_ACCOUNT_PACK_GROUP_* in env for a Het Job.
 -- sbatch - search current working directory first for job script.
 -- Make it so held jobs reset the AccrueTime and do not count against any
    AccrueTime limits.
 -- Add SchedulerParameters option of bf_hetjob_prio=[min|avg|max] to alter the
    job sorting algorithm for scheduling heterogeneous jobs.
 -- Fix initialization of assoc_mgr_locks and slurmctld_locks lock structures.
 -- Fix segfault with job arrays using X11 forwarding.
 -- Revert regression caused by e0ee1c7054 which caused negative values and
    values starting with a decimal to be invalid for PriorityWeightTRES and
    TRESBillingWeight.
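
The regression entry above concerns negative values and values starting with
a decimal point. A minimal sketch of a parser that accepts both follows,
assuming a comma-separated list of name=value pairs. The function name and
the sample string are hypothetical, and unit suffixes are not handled.

```python
def parse_tres_weights(spec: str) -> dict:
    """Parse 'name=value,...' into a dict of floats."""
    weights = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {item!r}")
        # float() accepts both "-1.5" and ".25".
        weights[name.strip()] = float(value)
    return weights
```
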