Skip to content
Snippets Groups Projects
NEWS 552 KiB
Newer Older
David Bigagli's avatar
David Bigagli committed
This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 19.05.0pre2
==============================
 -- Removed select/serial plugin.
 -- Remove 512-character line length limit in slurm_print_topo_record().
    (Used by "scontrol show topology".)
 -- Removed crypto/openssl plugin.
 -- Tweak the sdiag gettimeofday() line format for greater clarity.
 -- Add support for SALLOC/SBATCH/SLURM_NO_KILL environment variables.
    Add salloc/sbatch/srun support for optional "--no-kill=off" option to
    disable the environment variables.
 -- Fix salloc and missing SLURM_NTASKS.
 -- Alter the backfill scheduler behavior to prevent it from scheduling lower
    priority jobs on resources that become available during the backfill
    scheduling cycle when bf_continue is enabled. This behavior was available
    as the bf_ignore_newly_avail_nodes option in 18.08.4+, but is now enabled
    by default. (The SchedulerParameters option of bf_ignore_newly_avail_nodes
    is also now removed, although harmless if still set.)
 -- Make LaunchParameters=send_gids the default introducing the reverse option
    "disable_send_gids to go back to the original behavior.
 -- Limit pam_slurm_adopt to run only in the sshd context by default, for
    security reasons. A new module option 'service=<name>' can be used to
    allow a different PAM applications to work. The option 'service=*' can be
    used to restore the old behavior of always performing the adopt logic
    regardless of the PAM application context.
 -- pam_slurm_adopt: Use uid to determine whether root is logging.
 -- Remove sbatch --x11 option. Slurm's internal X11 forwarding is now only
    supported from salloc, or an allocating srun command.
 -- Suppressed printing of job id in sbatch when quiet flag is set.
Felip Moll's avatar
Felip Moll committed
 -- Changed sreport 'SizesByAccount' and 'SizesByAccountAndWckey' default
    behavior and added new 'AcctAsParent' option.
 -- Add ave watts to api and sview.
 -- Added printf attribute to setenvf() and corrected related warnings.
 -- Kill running/pending job is allocated GRES and that GRES has a "File"
    configuration, and the GRES count changes.
Felip Moll's avatar
Felip Moll committed
 -- Add new DebugFlag=Accrue for accrue accounting debugging purposes.
 -- Change CryptoType option to CredType, and rename crypto/munge plugin to
    cred/munge.
 -- Add slurmd -G option to print GRES configuration and exit. This is useful
    for testing and debugging.
 -- Support GRES types that include numbers (e.g. "--gres=gpu:123g:2").
 -- Remove MemLimitEnforce parameter and move functionality into
    JobAcctGatherParam=OverMemoryKill.
 -- sview - disable admin mode option (which would not work anyways) if the
    user is not an admin in SlurmDBD.
 -- Remove joules reporting from sview and scontrol.
 -- Change the default fair share algorithm to "fair tree". The new
    PriorityFlags option of NO_FAIR_TREE can be used to revert to "classic"
    fair share scheduling instead.
 -- libslurmdb has been merged into libslurm.
Jason Booth's avatar
Jason Booth committed
 -- Added -b as a short option for --begin and removed the -b option which
    was a left over artifact from the Moab compatibility work.
 -- Add ArrayTaskThrottle to "scontrol show job" output.
* Changes in Slurm 19.05.0pre1
==============================
 -- Run epilog and clean up allocation when a job is resized to zero and its
    resources transferred to another job (--depend=expand).
 -- If GRES are associated with specific sockets, identify those sockets in the
    output of "scontrol show node". For example if all 4 GPUs on a node are
    all associated with socket zero, then "Gres=gpu:4(S:0)". If associated
    with sockets 0 and 1 then "Gres=gpu:4(S:0-1)". The information of which
    specific GPUs are associated with specific GPUs is not reported, but only
    available by parsing the gres.conf file.
 -- Add configuration parameter "GpuFreqDef" to control a job's default GPU
    frequency.
 -- Add job flags to the database.  Currently used to determine which scheduler
    scheduled the job.
 -- Add constraints/features to the database.
 -- Add last reason job didn't run before resources/priority to the database.
Danny Auble's avatar
Danny Auble committed
 -- Make it so we set the alloc_node in a resource allocation based on the auth
    plugin instead of the rpc call.
Tim Wickberg's avatar
Tim Wickberg committed
* Changes in Slurm 18.08.5
==========================
 -- Backfill - If a job has a time_limit guess the end time of a job better
    if OverTimeLimit is Unlimited.
 -- Fix "sacctmgr show events event=cluster"
 -- Fix sacctmgr show runawayjobs from sibling cluster
 -- Avoid bit offset of -1 in call to bit_nclear().
 -- Insure that "hbm" is a configured GresType on knl systems.
 -- Fix NodeFeaturesPlugins=node_features/knl_generic to allow other gres
    other than knl.
 -- cons_res: Prevent overflow on multiply.
 -- Better debug for bad values in gres.conf.
 -- Fix double accounting of energy at end of job.
 -- Read gres.conf for cloud nodes on slurmctld.
 -- Don't assume the first node of a job is the batch host when purging jobs
    from a node.
 -- Better debugging when a job doesn't have a job_resrcs ptr.
 -- Store ave watts in energy plugins.
 -- Add XCC plugin for reading Lenovo Power.
 -- Fix minor memory leak when scheduling rebootable nodes.
 -- Fix debug2 prefix for sched log.
 -- Fix printing correct SLURM_JOB_ACCOUNT_PACK_GROUP_* in env for a Het Job.
 -- sbatch - search current working directory first for job script.
 -- Make it so held jobs reset the AccrueTime and do not count against any
    AccrueTime limits.
 -- Add SchedulerParameters option of bf_hetjob_prio=[min|avg|max] to alter the
    job sorting algorithm for scheduling heterogeneous jobs.
 -- Fix initialization of assoc_mgr_locks and slurmctld_locks lock structures.
 -- Fix segfault with job arrays using X11 forwarding.
 -- Revert regression caused by e0ee1c7054 which caused negative values and
    values starting with a decimal to be invalid for PriorityWeightTRES and
    TRESBillingWeight.
 -- Fix possibility to update a job's reservation to none.
 -- Suppress connection errors to primary slurmdbd when backup dbd is active.
 -- Suppress connection errors to primary db when backup db kicks in
 -- Add missing fields for sacct --completion when using jobcomp/filetxt.
 -- Fix incorrect values set for UserCPU, SystemCPU, and TotalCPU sacct fields
    when JobAcctGatherType=jobacct_gather/cgroup.
 -- Fixed srun from double printing invalid option msg twice.
 -- Remove unused -b flag from getopt call in sbatch.
 -- Disable reporting of node TRES in sreport.
 -- Re-enabling features combined by OR within parenthesis for non-knl setups.
 -- Prevent sending duplicate requests to reboot a node before ResumeTimeout.
 -- Down nodes that don't reboot by ResumeTimeout.
 -- Update seff to reflect API change from rss_max to tres_usage_in_max.
 -- Add missing TRES constants from perl API.
 -- Fix issue where sacct would return incorrect array tasks when querying
    specific tasks.
 -- Add missing variables to slurmdb_stats_t in the perlapi.
 -- Fix nodes not getting reboot RPC when job requires reboot of nodes.
 -- Fix failing update the partition list of a job.
 -- Use slurm.conf gres ids instead of gres.conf names to get a gres type name.
 -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc.
    CVE-2019-6438.
Tim Wickberg's avatar
Tim Wickberg committed
* Changes in Slurm 18.08.4
==========================
 -- burst_buffer/cray - avoid launching a job that would be immediately
    cancelled due to a DataWarp failure.
 -- Fix message sent to user to display preempted instead of time limit when
    a job is preempted.
 -- Fix memory leak when a failure happens processing a nodes gres config.
 -- Improve error message when failures happen processing a nodes gres config.
 -- When building rpms ignore redundant standard rpaths and insecure relative
    rpaths, for RHEL based distros which use "check-rpaths" tool.
 -- Don't skip jobs in scontrol hold.
 -- Avoid locking the job_list when unneeded.
 -- Allow --cpu-bind=verbose to be used with SLURM_HINT environment variable.
 -- Make it so fixing runaway jobs will not alter the same job requeued
    when not runaway.
 -- Avoid checking state when searching for runaway jobs.
 -- Remove redundant check for end time of job when searching for runaway jobs.
 -- Make sure that we properly check for runawayjobs where another job might
    have the same id (for example, if a job was requeued) by also checking the
    submit time.
 -- Add scontrol update job ResetAccrueTime to clear a job's time
    previously accrued for priority.
 -- cons_res: Delay exiting cr_job_test until after cores/cpus are calculated
    and distributed.
 -- Fix bug where binary in cwd would trump binary in PATH with test_exec.
 -- Fix check to test printf("%s\n", NULL); to not require
    -Wno-format-truncation CFLAG.
 -- Fix JobAcctGatherParams=UsePss to report the correct usage.
 -- Fix minor memory leak in pmix plugin.
 -- Fix minor memory leak in slurmctld when reading configuration.
 -- Handle return codes correctly from pthread_* functions.
 -- Fix minor memory leak when a slurmd is unable to contact a slurmctld
    when trying to register.
 -- Fix sreport sizesbyaccount report when using Flatview and accounts.
 -- Fix incorrect shift when dealing with node weights and scheduling.
 -- libslurm/perl - Fix segfault caused by incorrect hv_to_slurm_ctl_conf.
 -- Add qos and assoc options to confirmation dialogs.
 -- Handle updating identical license or partition information correctly.
 -- Makes sure accounts and QOS' are all lower case to match documentation
    when read in from the slurm.conf file.
 -- Don't consider partitions without enough nodes in reservation,
    main scheduler.
 -- Set SLURM_NTASKS correctly if having to determine from other options.
 -- Removed GCP scripts from contribs. Now located at:
    https://github.com/SchedMD/slurm-gcp.
 -- Don't check existence of srun --prolog or --epilog executables when set to
    "none" and SLURM_TEST_EXEC is used.
 -- Add "P" suffix support to job and step tres specifications.
 -- When doing a reconfigure handle QOS' GrpJobsAccrue correctly.
 -- Remove unneeded extra parentheses from sh5util.
 -- Fix jobacct_gather/cgroup to work correctly when more than one task is
    started on a node.
 -- If requesting --ntasks-per-node with no tasks set tasks correctly.
 -- Accept modifiers for TRES originally added in 6f0342e0358.
 -- Don't remove reservation on slurmctld restart if nodes are removed from
    configuration.
 -- Fix bad xfree in task/cgroup.
 -- Fix removing counters if a job array isn't subject to limits and is
    canceled while pending.
 -- Make sure SLURM_NTASKS_PER_NODE is set correctly when env is overwritten
    by the command line.
 -- Clean up step on a failed node correctly.
 -- mpi/pmix: Fixed the logging of collective state.
 -- mpi/pmix: Make multi-slurmd work correctly when using ring communication.
 -- mpi/pmix: Fix double invocation of the PMIx lib fence callback.
 -- mpi/pmix: Remove unneeded libpmix callback drop in tree-based coll.
 -- Fix race condition in route/topology when the slurmctld is reconfigured.
 -- In route/topology validate the slurmctld doesn't try to initialize the
    node system.
 -- Fix issue when requesting invalid gres.
 -- Validate job_ptr in backfill before restoring preempt state.
 -- Fix issue when job's environment is minimal and only contains variables
    Slurm is going to replace internally.
 -- When handling runaway jobs remove all usage before rollup to remove any
    time that wasn't existent instead of just updating lines that have time
    with a lesser time.
 -- salloc - set SLURM_NTASKS_PER_CORE and SLURM_NTASKS_PER_SOCKET in the
    environment if the corresponding command line options are used.
 -- slurmd - fix handling of the -f flag to specify alternate config file
    locations.
 -- Fix scheduling logic to avoid using nodes that require a reboot for KNL
    node change when possible.
 -- Fix scheduling logic bug. There should have been a test for _not_
    NODE_SET_REBOOT to continue.
 -- Fix a scheuling logic bug with respect to XOR operation support when there
    are down nodes.
 -- If there is a constraint construct of the form "[...&...]"
    then an error is generated if more than one of those specifications
    contains KNL NUMA or MCDRAM modes.
 -- Fix stepd segfault race if slurmctld hasn't registered with the launching
    slurmd yet delivering it's TRES list.
 -- Add SchedulerParameters option of bf_ignore_newly_avail_nodes to avoid
    scheduling lower priority jobs on resources that become available during
    the backfill scheduling cycle when bf_continue is enabled.
 -- Decrement message_connections in stepd code on error path correctly.
 -- Decrease an error message to be debug.
 -- Fix missing suffixes in squeue.
 -- pam_slurm_adopt - send an error message to the user if no Slurm jobs
    can be located on the node.
 -- Run SlurmctldPrimaryOffProg when the primary slurmctld process shuts down.
 -- job_submit/lua: Add several slurmctld return codes.
 -- job_submit/lua: Add user/group info to jobs.
 -- Fix formatting issues when printing uint64_t.
 -- Bump RLIMIT_NOFILE for daemons in systemd services.
 -- Expand %x in job name in 'scontrol show job'.
 -- salloc/sbatch/srun - print warning if mutually exclusive options of --mem
    and --mem-per-cpu are both set.
Tim Wickberg's avatar
Tim Wickberg committed
* Changes in Slurm 18.08.3
==========================
 -- Fix regression in 18.08.1 that caused dbd messages to not be queued up
 -- Fix regression in 18.08.1 that can cause a slurmctld crash when splitting
    job array elements.
Tim Wickberg's avatar
Tim Wickberg committed
* Changes in Slurm 18.08.2
==========================
 -- Correctly initialize variable in env_array_user_default().
 -- Remove race condition when signaling starting step.
 -- Fix issue where 17.11 job's using GRES in didn't initialize new 18.08
    structures after unpack.
 -- Stop removing nodes once the minimum CPU or node count for the job is
    reached in the cons_res plugin.
 -- Process any changes to MinJobAge and SlurmdTimeout in the slurmctld when
    it is reconfigured to determine changes in its background timers.
 -- Use previous SlurmdTimeout in the slurmctld after a reconfigure to
    determine the time a node has been down.
 -- Fix multi-cluster srun between clusters with different SelectType plugins.
 -- Fix removing job licenses on reconfig/restart when configured license
    counts are 0.
 -- If a job requested multiple licenses and one license was removed then on
    a reconfigure/restart all of the licenses -- including the valid ones
    would be removed.
 -- Fix issue where job's license string wasn't updated after a restart when
    licenses were removed or added.
 -- Add allow_zero_lic to SchedulerParameters.
 -- Avoid scheduling tasks in excess of ArrayTaskThrottle when canceling tasks
    of an array.
 -- Fix jobs that request memory per node and task count that can't be
    scheduled right away.
Loading
Loading full blame...