This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 19.05.0pre4
==============================
 -- Set CUDA_VISIBLE_DEVICES environment variable in Prolog and Epilog for jobs
    requesting gres/gpu.
 -- Remove '-U' argument - which was deprecated when '-A' was made the single
    character option before the Slurm 2.1 release - as an alternative to
    '--account' for salloc/sbatch/srun.
 -- Remove direct BLCR support and srun_cr.
 -- Make slurm_print_node_table only print a node's slurmd version if it is
    different to the one reported by slurm_load_ctl_conf.
 -- Call gres plugin environment setup even if gres not requested in job.
 -- Do not set CUDA_VISIBLE_DEVICES=NoDevFiles when no gres requested.
 -- If GRES configuration data is unavailable from gres.conf, then use the
    node's "Gres=" information in slurm.conf. This will eliminate or minimize
    the gres.conf file in many situations.
 -- Fix checking IPMI XCC raw command response length.
 -- jobacct_gather/common - improve lightweight process identification.
 -- Cloud/PowerSave Improvements:
    - Better responsiveness to resuming and suspending.
    - Powering down nodes not eligible to be allocated until after
      SuspendTimeout.
    - Powering down nodes put in "Powering Down / %" state until after
      SuspendTimeout.
 -- Add idle_on_node_suspend SlurmctldParameter to make nodes idle regardless
    of state when suspended.
 -- Add PowerSave DebugFlag for Suspend/Resume debugging.
 -- Changed "scontrol reboot" to not default to ALL nodes.
 -- Changed "scontrol completing" to include two new fields - EndTime and
    CompletingTime.
 -- select/cons_tres - prevent job from overallocating a node's memory.
 -- Refactor CLI option parsing for salloc/sbatch/srun into a central set of
    functions in src/common/slurm_opt.c. Note that this new option parsing can
    be stricter in a few specific situations - places that used to ignore
    invalid options and still submit/launch a job or job step may return an
    error() and refuse to proceed instead.
 -- Add preempt_send_user_signal SlurmctldParameter option to send user
    signal (e.g. --signal=<SIG_NUM>) at preemption if it hasn't already been
    sent.
 -- Add PreemptExemptTime parameter to slurm.conf and QOS to guarantee a
    minimum runtime before preemption.
 -- Set job's preempt time for non-grace time preemptions.
 -- Add sinfo format option to show used gres.
 -- Add reboot_from_controller SlurmctldParameter to allow RebootProgram to be
    run from the controller instead of the slurmds.
 -- Fix increasing of job size when extern steps exist.
 -- Reset GPU-related arguments to salloc/sbatch/srun for each separate
    heterogeneous job component.
 -- Do not set "(null)" for SLURM_JOB_CONSTRAINTS when no constraints are set
    in PrologSlurmctld/EpilogSlurmctld.
 -- Add SRUN_EXPORT_ENV as an input environment variable to srun.
 -- Return an error for invalid #SBATCH directives, and do not submit the job.
 -- Add S_JOB_ARRAY_ID and S_JOB_ARRAY_TASK_ID to spank_get_item().
 -- Change container_{g,p}_add_pid() to container_{g,p}_join() and remove the
    'pid_t pid' argument.
 -- Add new site_factor plugin type to permit sites to build plugins to set
    and modify the site priority factor value both initially on job submission,
    and periodically every PriorityCalcPeriod.
 -- Rename Cray plugins to cray_aries in preparation for Cray/Shasta.
 -- Allow Het Jobs to work on a Cray.
 -- Add new cli_filter plugin type to permit sites to build plugins to log,
    modify, or reject CLI options within the salloc/sbatch/srun commands
    themselves.
 -- Allocate nodes that are booting. Previously, nodes that were being booted
    were off limits for allocation. This caused more nodes to be booted than
    needed in a cloud environment.
 -- pam_slurm_adopt - inject SLURM_JOB_ID environment variable into adopted
    processes.
 -- PMIx - use the Tree-based collective for empty fence operations.
 -- PMIx - replace use of the non-standard PMIX_VAL_SET macro with the
    standardized PMIX_VALUE_LOAD macro.
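
The PreemptExemptTime entry above describes a guaranteed minimum runtime
before preemption. A minimal sketch of that rule follows. It is not Slurm's
implementation: the function name and arguments are hypothetical and only
illustrate the idea that a job is not eligible for preemption until it has
run for the configured exempt period.

```python
from datetime import datetime, timedelta


def is_preemptable(start_time: datetime, now: datetime,
                   exempt: timedelta) -> bool:
    """Return True once the job has run for at least `exempt`.

    Hypothetical helper: `exempt` stands in for a PreemptExemptTime
    value; this is an illustration, not Slurm's scheduler logic.
    """
    return now - start_time >= exempt


start = datetime(2019, 1, 1, 12, 0, 0)
exempt = timedelta(minutes=10)

# Five minutes in, the job is still inside its exempt window.
early = is_preemptable(start, start + timedelta(minutes=5), exempt)
# Ten minutes in, the minimum runtime has been met.
later = is_preemptable(start, start + timedelta(minutes=10), exempt)
```

Here `early` is False and `later` is True.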
* Changes in Slurm 19.05.0pre3
==============================
 -- Fix RPM packaging for accounting_storage/mysql.

* Changes in Slurm 19.05.0pre2
==============================
 -- Removed select/serial plugin.
 -- Remove 512-character line length limit in slurm_print_topo_record().
    (Used by "scontrol show topology".)
 -- Removed crypto/openssl plugin.
 -- Tweak the sdiag gettimeofday() line format for greater clarity.
 -- Add support for SALLOC/SBATCH/SLURM_NO_KILL environment variables.
    Add salloc/sbatch/srun support for optional "--no-kill=off" option to
    disable the environment variables.
 -- Fix salloc and missing SLURM_NTASKS.
 -- Alter the backfill scheduler behavior to prevent it from scheduling lower
    priority jobs on resources that become available during the backfill
    scheduling cycle when bf_continue is enabled. This behavior was available
    as the bf_ignore_newly_avail_nodes option in 18.08.4+, but is now enabled
    by default. (The SchedulerParameters option of bf_ignore_newly_avail_nodes
    is also now removed, although harmless if still set.)
 -- Make LaunchParameters=send_gids the default, introducing the reverse
    option "disable_send_gids" to go back to the original behavior.
 -- Limit pam_slurm_adopt to run only in the sshd context by default, for
    security reasons. A new module option 'service=<name>' can be used to
    allow a different PAM application to work. The option 'service=*' can be
    used to restore the old behavior of always performing the adopt logic
    regardless of the PAM application context.
 -- pam_slurm_adopt: Use uid to determine whether root is logging in.
 -- Remove sbatch --x11 option. Slurm's internal X11 forwarding is now only
    supported from salloc, or an allocating srun command.
 -- Suppressed printing of job id in sbatch when quiet flag is set.
 -- Changed sreport 'SizesByAccount' and 'SizesByAccountAndWckey' default
    behavior and added new 'AcctAsParent' option.
 -- Add ave watts to api and sview.
 -- Added printf attribute to setenvf() and corrected related warnings.
 -- Kill running/pending job if it is allocated GRES, that GRES has a "File"
    configuration, and the GRES count changes.
 -- Add new DebugFlag=Accrue for accrue accounting debugging purposes.
 -- Change CryptoType option to CredType, and rename crypto/munge plugin to
    cred/munge.
 -- Add slurmd -G option to print GRES configuration and exit. This is useful
    for testing and debugging.
 -- Support GRES types that include numbers (e.g. "--gres=gpu:123g:2").
 -- Remove MemLimitEnforce parameter and move functionality into
    JobAcctGatherParam=OverMemoryKill.
 -- sview - disable admin mode option (which would not work anyway) if the
    user is not an admin in SlurmDBD.
 -- Remove joules reporting from sview and scontrol.
 -- Change the default fair share algorithm to "fair tree". The new
    PriorityFlags option of NO_FAIR_TREE can be used to revert to "classic"
    fair share scheduling instead.
 -- libslurmdb has been merged into libslurm.
 -- Added -b as a short option for --begin and removed the -b option which
    was a left over artifact from the Moab compatibility work.
 -- Add ArrayTaskThrottle to "scontrol show job" output.
 -- Added SPRIO_FORMAT env variable to the sprio command.
 -- Add batch step at the beginning of a batch job so that squeue, sstat, and
    sacct will show the batch step.
 -- Deprecated 32-bit builds.
 -- Make -l and -o mutually exclusive in sacct, squeue, sinfo, and sprio.
 -- Disable running job expansion by default. A new SchedulerParameter of
    permit_job_expansion has been added for sites that wish to re-enable it.
 -- Permit changing a job array's ArrayTaskThrottle value even if the job is
    terminated (for job requeue).
 -- Add scontrol requeue option of "Incomplete" which will requeue jobs only if
    they failed to complete with an exit code of zero.
 -- Modify GrpNodes limit to apply to unique nodes allocated (avoid double
    counting nodes allocated to multiple jobs in the same QOS or association).
 -- If a job submit does NOT include --cpus-per-task option, then report the
    value as "N/A" rather than always mapping the value to 1.
 -- X11 forwarding - use the raw value from gethostname() with xauth to avoid
    authentication issues when Slurm has internally stripped off the domain
    portion.
 -- Change how slurmd fills in the registration message version string from
    PACKAGE_VERSION to SLURM_VERSION_STRING, affecting how the version is
    displayed with sview, sinfo, scontrol and through the API.
 -- Remove autogen.sh script. Please use the autoreconf command instead.
 -- Disable a configuration of SelectTypeParameters=CR_ONE_TASK_PER_CORE with
    SelectType=select/cons_tres. This will be addressed later.
 -- job_submit/lua - expose more fields off the partition record.
 -- task/cgroup - prevent setting a memory.soft_limit_in_bytes higher than the
    memory.limit_in_bytes since the hard limit will take precedence anyway.
 -- If a GrpNodes limit is configured in an association, partition QOS or
    job QOS then favor use of nodes already allocated to that entity. This
    will result in the configured node "Weight" being incremented by one for
    nodes which are not preferred. Consider adjusting configured node "Weight"
    values to achieve the desired node preferences.
 -- Add full node state debug2 output to slurmdbd node up/down update.
 -- Set CUDA_VISIBLE_DEVICES and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment
    variables in Prolog and Epilog for jobs requesting gres/mps.
 -- Added thresholds for backfill parameters.
 -- Fix for backfill sleep overflow when large values are set.
 -- Execute Epilog on nodes relinquished from job (i.e. job resized).
 -- Rename burst_buffer/cray plugin to burst_buffer/datawarp.
 -- X11 Forwarding - reimplement using new internal network forwarding RPCs.
 -- Remove slurm_jobcomp_get_errno and slurm_jobcomp_strerror from jobcomp
    plugin API.
 -- Optimize backfill for checking max jobs per assoc, partition, user, etc.
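
The GrpNodes entry above says the configured node "Weight" is incremented by
one for nodes that are not preferred. A minimal sketch of that adjustment
follows. The function and node names are hypothetical and this is not
Slurm's code. It assumes only what the entry states, plus the usual rule
that nodes with lower weight are selected first.

```python
def effective_weights(configured, preferred):
    """Return node -> weight, adding 1 for nodes not in `preferred`.

    `configured` maps node names to their configured Weight; `preferred`
    is the set of nodes already allocated to the limited entity.
    Both names are illustrative assumptions, not Slurm identifiers.
    """
    return {
        node: weight if node in preferred else weight + 1
        for node, weight in configured.items()
    }


configured = {"n1": 10, "n2": 10, "n3": 20}
weights = effective_weights(configured, preferred={"n1"})

# Lower weight is tried first; ties are broken by name in this sketch only.
order = sorted(weights, key=lambda node: (weights[node], node))
```

With equal configured weights, the already-allocated node n1 now sorts ahead
of n2. A node whose configured weight is much higher (n3 here) stays last,
which is why the entry suggests reviewing configured "Weight" values.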
* Changes in Slurm 19.05.0pre1
==============================
 -- Run epilog and clean up allocation when a job is resized to zero and its
    resources transferred to another job (--depend=expand).
 -- If GRES are associated with specific sockets, identify those sockets in the
    output of "scontrol show node". For example if all 4 GPUs on a node are
    all associated with socket zero, then "Gres=gpu:4(S:0)". If associated
    with sockets 0 and 1 then "Gres=gpu:4(S:0-1)". The information of which
    specific GPUs are associated with specific sockets is not reported, but is
    only available by parsing the gres.conf file.
 -- Add configuration parameter "GpuFreqDef" to control a job's default GPU
    frequency.
 -- Add job flags to the database.  Currently used to determine which scheduler
    scheduled the job.
 -- Add constraints/features to the database.
 -- Add last reason job didn't run before resources/priority to the database.
 -- Make it so we set the alloc_node in a resource allocation based on the auth
    plugin instead of the rpc call.
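
The socket notation described above for "scontrol show node" output (for
example "Gres=gpu:4(S:0-1)") can be read by a script. The sketch below is an
illustration only, not part of Slurm. The function name is hypothetical, and
it assumes the value after "Gres=" has the simple form name:count with an
optional (S:...) suffix. Typed GRES and comma-separated GRES lists are not
handled.

```python
import re

# name:count with an optional "(S:<socket list>)" suffix.
_GRES_RE = re.compile(
    r"^(?P<name>[^:()]+):(?P<count>\d+)(?:\(S:(?P<sockets>[\d,-]+)\))?$"
)


def parse_gres(value):
    """Split e.g. 'gpu:4(S:0-1)' into ('gpu', 4, [0, 1])."""
    match = _GRES_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized GRES value: {value!r}")
    sockets = []
    if match.group("sockets"):
        for part in match.group("sockets").split(","):
            low, _, high = part.partition("-")
            # A single index such as "0" is treated as the range 0-0.
            sockets.extend(range(int(low), int(high or low) + 1))
    return match.group("name"), int(match.group("count")), sockets


one_socket = parse_gres("gpu:4(S:0)")
two_sockets = parse_gres("gpu:4(S:0-1)")
no_sockets = parse_gres("gpu:4")
```

An empty socket list means no socket association was reported for that GRES.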
* Changes in Slurm 18.08.8
==========================
 -- Update "xauth list" to use the same 10000ms timeout as the other xauth
    commands.
 -- Fix issue in gres code to handle a gres cnt of 0.
 -- Don't purge jobs if backfill is running.
 -- Verify job is pending when adding/removing accrual time.