This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 15.08.5
==========================
 -- Prevent "scontrol update job" from updating jobs that have already finished.
 -- Show requested TRES in "squeue -O tres" when job is pending.
 -- Backfill scheduler: Test association and QOS node limits before reserving
    resources for pending job.
 -- burst_buffer/cray: If teardown operation fails, sleep and retry.
 -- Clean up the external pids when using the PrologFlags=Contain feature
    and the job finishes.
 -- burst_buffer/cray: Support file staging when job lacks job-specific buffer
    (i.e. only persistent burst buffers).
 -- Added srun option of --bcast to copy executable file to compute nodes.
 -- Fix for advanced reservation of burst buffer space.
 -- BurstBuffer/cray: Add logic to terminate dw_wlm_cli child processes at
    shutdown.
 -- If job can't be launched or requeued, then terminate it.
 -- BurstBuffer/cray: Enable clearing of burst buffer string on completed job
    as a means of recovering from a failure mode.
 -- Fix wrong memory free when parsing SrunPortRange=0-0 configuration.
 -- BurstBuffer/cray: Fix job record purging if cancelled from pending state.
 -- BGQ - Handle database throw correctly when syncing users on blocks.
 -- MySQL - Make sure we don't have a NULL string returned when not
    requesting any specific association.
 -- sched/backfill: If max_rpc_cnt is configured and the backlog of RPCs has
    not cleared after yielding locks, then continue to sleep.
 -- Preserve the job dependency description displayed in 'scontrol show job'
    even if the dependee job was terminated and cleaned, causing the
    dependent to never run because of DependencyNeverSatisfied.
 -- Correct job task count calculation if only node count and ntasks-per-node
    options supplied.
 -- Make sure the association manager converts any string to be lower case
    as all the associations from the database will be lower case.
 -- Sanity check for xcgroup_delete() to verify incoming parameter is valid.
 -- Fix formatting for sacct with variables that switched from uint32_t to
    uint64_t.
 -- Fix a typo in sacct man page.
 -- Set up extern step to track any children of an ssh if it leaves anything
    else behind.
* Changes in Slurm 15.08.4
==========================
 -- Fix typo for the "devices" cgroup subsystem in pam_slurm_adopt.c
 -- Fix TRES_MAX flag to work correctly.
 -- Improve the systemd startup files.
 -- Added burst_buffer.conf flag parameter of "TeardownFailure" which will
    teardown and remove a burst buffer after failed stage-in or stage-out.
    By default, the buffer will be preserved for analysis and manual teardown.
 -- Prevent a core dump in srun if the signal handler runs during the job
    allocation causing the step context to be NULL.
 -- Don't fail job if multiple prolog operations in progress at slurmctld
    restart time.
 -- Burst_buffer/cray: Fix to purge terminated jobs with burst buffer errors.
 -- Burst_buffer/cray: Don't stall scheduling of other jobs while a stage-in
    is in progress.
 -- Make it possible to query 'extern' step with sstat.
 -- Make 'extern' step show up in the database.
 -- MYSQL - Quote assoc table name in mysql query.
 -- Make SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX, and SLURM_ARRAY_TASK_STEP
    environment variables available to PrologSlurmctld and EpilogSlurmctld.
 -- Fix slurmctld bug in which a pending job array could be canceled
    by a user different from the owner or the administrator.
 -- Support taking node out of FUTURE state with "scontrol reconfig" command.
 -- Sched/backfill: Fix to properly enforce SchedulerParameters of
    bf_max_job_array_resv.
 -- Enable operator to reset sdiag data.
 -- jobcomp/elasticsearch plugin: Add array_job_id and array_task_id fields.
 -- Remove duplicate #define IS_NODE_POWER_UP.
 -- Added SchedulerParameters option of max_script_size.
 -- Add REQUEST_ADD_EXTERN_PID option to add pid to the slurmstepd's extern
    step.
 -- Add unique identifiers to anchor tags in HTML generated from the man pages.
 -- Add with_freeipmi option to spec file.
 -- Minor elasticsearch code improvements.
* Changes in Slurm 15.08.3
==========================
 -- Correct Slurm's RPM build if Munge is not installed.
 -- Job array termination status email ExitCode based upon highest exit code
    from any task in the job array rather than the last task. Also change the
    state from "Ended" or "Failed" to "Mixed" where appropriate.
 -- Squeue recombines pending job array records only if their name and partition
    are identical.
 -- Fix some minor leaks in the job info and step info API.
 -- Export missing QOS id when filling in association with the association
    manager.
 -- Fix invalid reference if a lua job_submit plugin references a default qos
    when a user doesn't exist in the database.
 -- Use association enforcement in the lua plugin.
 -- Fix a few spots missing defines of accounting_enforce or acct_db_conn
    in the plugins.
 -- Show requested TRES in scontrol show jobs when job is pending.
 -- Improve sched/backfill support for job features, especially XOR construct.
 -- Correct scheduling logic for job features option with XOR construct that
    could delay a job's initiation.
 -- Remove unneeded frees when creating a tres string.
 -- Send a tres_alloc_str for the batch step.
 -- Fix incorrect check for slurmdb_find_tres_count_in_string in various places,
    it needed to check for INFINITE64 instead of zero.
 -- Don't allow scontrol to create partitions with the name "DEFAULT".
 -- burst_buffer/cray: Change error from "invalid request" to "permission denied"
    if a non-authorized user tries to create/destroy a persistent buffer.
 -- PrologFlags work: Setting a flag of "Contain" implicitly sets the "Alloc"
    flag. Fix code path which could prevent execution of the Prolog when the
    "Alloc" or "Contain" flag was set.
 -- Fix for acct_gather_energy/cray|ibmaem to work with missed enum.
 -- MYSQL - When inserting a job and begin_time is 0 do not set it to
    submit_time.  0 means the job isn't eligible yet, so treat it as such.
 -- MYSQL - Don't display ineligible jobs when querying for a window of time.
 -- Fix creation of advanced reservation of cores on nodes which are DOWN.
 -- Return permission denied if regular user tries to release job held by an
    administrator.
 -- MYSQL - Fix rollups counting multiple times when multiple jobs are run by
    the same association in an hour.
 -- Burstbuffer/Cray plugin - Fix for persistent burst buffer use.
    Don't call paths if no #DW options.
 -- Modifications to pam_slurm_adopt to work correctly for the "extern" step.
 -- Alphabetize debugflags when printing them out.
 -- Prevent systemd's slurmd service from killing slurmstepds on shutdown.
 -- Fixed counter of not indexed jobs, error_cnt post-increment changed to
    pre-increment.
* Changes in Slurm 15.08.2
==========================
 -- Fix tracking of node state when jobs that have been allocated exclusive
    access to nodes (i.e. entire nodes) later relinquish some nodes. Nodes
    would previously appear partly allocated and prevent use by other jobs.
 -- Correct some cgroup paths ("step_batch" vs. "step_4294967294", "step_exter"
    vs. "step_extern", and "step_extern" vs. "step_4294967295").
 -- Fix advanced reservation core selection logic with network topology.
 -- MYSQL - Remove restriction to have to be at least an operator to query TRES
    values.
 -- For pending jobs have sacct print 0 for nnodes instead of the bogus 2.
 -- Fix updating job in db after extending job's timelimit past partition's
    timelimit.
 -- Prevent srun -I<timeout> from flooding the controller with step create
    requests.
 -- Requeue/hold batch job launch request if job already running (possible if
    node went to DOWN state, but jobs remained active).
 -- If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU,
    then increase its allocated CPU count in order to enforce CPU limits.
 -- Don't mark powered down node as not responding. This could be triggered by
    race condition of the node suspend and ping logic, preventing use of the
    node.
 -- Don't requeue RPC going out from slurmctld to DOWN nodes (can generate
    repeating communication errors).
 -- Propagate sbatch "--dist=plane=#" option to srun.
 -- Add acct_gather_energy/ibmaem plugin for systems with IBM Systems Director
    Active Energy Manager.
 -- Fix spec file to look for mariadb or mysql devel packages for build
    requirements.
 -- MySQL - Improve the code for querying jobs in a suspended state.
 -- Fix slurmctld allowing root to see job steps using squeue -s.
 -- Do not send burst buffer stage out email unless the job uses burst buffers.
 -- Fix sacct to not return all jobs if the -j option is given with a trailing
    ','.
 -- Permit job_submit plugin to set a job's priority.
 -- Fix occasional srun segfault.
 -- Fix issue with sacct, printing 0_0 for arrays that had finished in the
    database but the start record hadn't made it yet.
 -- sacctmgr - Don't allow default account associations to be removed
    from a user.
 -- Fix sacct -j, (nothing but a comma) to not return all jobs.
 -- Fixed slurmctld not sending cold-start messages correctly to the database
    when a cold-start (-c) happens to the slurmctld.
 -- Fix case where the backup slurmdbd would be killed if it had existing
    connections when it gave up control.
 -- Fix task/cgroup affinity to work correctly with multi-socket
    single-threaded cores.  A regression caused only 1 socket to be used on
    this kind of node instead of all that were available.
 -- MYSQL - Fix minor issue where, after an index was added to the database,
    it would previously take 2 restarts of the slurmdbd to make it stick
    correctly.
 -- Add hv_to_qos_cond() and qos_rec_to_hv() functions to the Perl interface.
 -- Add new burst_buffer.conf parameters: ValidateTimeout and OtherTimeout.
    See man page for details.
 -- Fix burst_buffer/cray support for interactive allocations >4GB.
 -- Correct backfill scheduling logic for job with INFINITE time limit.
 -- Fix issue where, on a scontrol reconfig, all available GRES/TRES would be
    zeroed out.
 -- Set SLURM_HINT environment variable when --hint is used with sbatch or
    salloc.
 -- Add scancel -f/--full option to signal all steps including batch script and
    all of its child processes.
 -- Fix salloc -I to accept an argument.
 -- Avoid reporting more allocated CPUs than exist on a node. This can be
    triggered by resuming a previously suspended job, resulting in
    oversubscription of CPUs.
 -- Fix the pty window manager in slurmstepd not to retry IO operation with
    srun if it read EOF from the connection with it.
 -- sbatch --ntasks option to take precedence over --ntasks-per-node plus node
    count, as documented. Set SLURM_NTASKS/SLURM_NPROCS environment variables
    accordingly.
 -- MYSQL - Make sure suspended time is only subtracted from the CPU TRES
    as it is the only TRES that can be given to another job while suspended.
 -- Clarify how TRESBillingWeights operates on memory and burst buffers.