This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 17.11.4
==========================
 -- Add fatal_abort() function to be able to get core dumps if we hit an
    "impossible" edge case.
 -- Link slurmd against all libraries that slurmstepd links to.
 -- Fix limit enforcement order when limits are set at partition and other
    levels.
 -- Add slurm_load_single_node() function to the Perl API.
 -- slurm.spec - change dependency for --with lua to use pkgconfig.
 -- Fix small memory leaks in node_features plugins on reconfigure.
 -- slurmdbd - only permit requests to update resources from operators or
    administrators.
 -- Fix handling of partial writes in io_init_msg_write_to_fd() which can
    lead to job step launch failure under higher cluster loads.
 -- MYSQL - Fix to handle quotes in a given work_dir of a job.
 -- sbcast - fix a race condition that leads to "Unspecified error".
 -- Log that support for the ChosLoc configuration parameter will end in Slurm
    version 18.08.
 -- Fix backfill performance issue where bf_min_prio_reserve was not respected.
 -- Fix MaxQueryTimeRange checks.
 -- Print MaxQueryTimeRange in "sacctmgr show config".
 -- Correctly check return codes when creating a step to determine whether to
    wait and retry.
 -- Fix issue where a job could be denied by Reason=MaxMemPerLimit when not
    requesting any tasks.
 -- In perl tools, fix regexp that caused extra, incorrect results to be shown.
 -- Add some extra locks in fed_mgr to be extra safe.
 -- Minor memory leak fixes in the fed_mgr on slurmctld shutdown.
 -- Make sreport job reports also report duplicate jobs correctly.
 -- Fix issues restoring certain Partition configuration elements, especially
    when ReconfigFlags=KeepPartInfo is enabled.
 -- Don't add TRES whose value is NO_VAL64 when building string line.
 -- Fix removing array jobs from hash in slurmctld.
 -- Print out missing user messages from jobsubmit plugin when srun/salloc are
    waiting for an allocation.
 -- Handle --clusters=all as case insensitive.
 -- Only check requested clusters in federation when using --test-only
    submission option.
 -- In the federation, make it so you can cancel stranded sibling jobs.
 -- Silence an error from PSS memory stat collection process.
 -- Requeue jobs allocated to nodes requested to DRAIN or FAIL if nodes are
    POWER_SAVE or POWER_UP, preventing jobs from starting on NHC-failed nodes.
 -- Make MAINT and OVERLAP reservation flags order agnostic on overlap test.
 -- Preserve node features, including active and available KNL features, when
    slurmctld daemons are reconfigured.
* Changes in Slurm 17.11.3-2
============================
 -- Revert node_features changes in 17.11.3 that lead to various segfaults on
    slurmctld startup.
* Changes in Slurm 17.11.3
==========================
 -- Send SIG_UME correctly to a step.
 -- Sort sreport's reservation report by cluster, time_start, resv_name instead
    of cluster, resv_name, time_start.
 -- Avoid setting node in COMPLETING state indefinitely if the job initiating
    the node reboot is cancelled while the reboot is in progress.
 -- Scheduling fix for changing node features without any NodeFeatures plugins.
 -- Improve logic when summarizing job array mail notifications.
 -- Add scontrol -F/--future option to display nodes in FUTURE state.
 -- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE.
 -- When a job array is preempting make it so tasks in the array don't wait
    to preempt other possible jobs.
 -- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free
    in slurmstepd.
 -- node_features/knl_cray - Fix memory leaks that occur when slurmctld is
    reconfigured.
 -- node_features/knl_cray - Fix memory leak that can occur during normal
    operation.
 -- Fix srun environment variables for --prolog script.
 -- Fix job array dependency with "aftercorr" option when some array tasks in
    the first job fail. This fix lets all task array elements that can run
    proceed rather than stopping all subsequent task array elements.
 -- Fix potential deadlock in the slurmctld when using list_for_each.
 -- Fix for possible memory corruption in srun when running heterogeneous job
    steps.
 -- Fix output file containing "%t" (task ID) for heterogeneous job step to
    be based upon global task ID rather than task ID for that component of the
    heterogeneous job step.
 -- MYSQL - Fix potential abort when attempting to make an account a parent of
    itself.
 -- Fix potentially uninitialized variable in slurmctld.
 -- MYSQL - Fix issue for multi-dimensional machines when using sacct to
    find jobs that ran on specific nodes.
 -- Reject --acctg-freq at submit if invalid.
 -- Added info string on sh5util when deleting an empty file.
 -- Correct dragonfly topology support when job allocation specifies desired
    switch count.
 -- Fix minor memory leak on an sbcast error path.
 -- Fix issues when starting the backup slurmdbd.
 -- Revert uid check when requesting a jobid from a pid.
 -- task/cgroup - add support to detect OOM_KILL cgroup events.
 -- Fix whole node allocation cpu counts when --hint=nomultithread.
 -- Allow execution of task prolog/epilog when uid has access
    rights by a secondary group id.
 -- Validate command existence on the srun *[pro|epi]log options
    if LaunchParameters=test_exec is set.
 -- Fix potential memory leak if clean starting and the TRES didn't change
    from when last started.
 -- Fix for association MaxWall enforcement when none is given at submission.
 -- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld.
 -- burst_buffer/cray: Attempts by job to create persistent burst buffer when
    one already exists owned by a different user will be logged and the job
    held.
 -- CRAY - Remove race in core_spec when adding the slurmstepd to the job
    container, where canceling the step would erroneously also cancel the
    stepd.
 -- Make sure the slurmstepd blocks signals like SIGTERM correctly.
 -- SPANK - When slurm_spank_init_post_opt() fails return error correctly.
 -- When revoking a sibling job in the federation we want to send a start
    message before purging the job record to get the uid of the revoked job.
 -- Make JobAcctGatherParams options case-insensitive. Previously, UsePss
    was the only correct capitalization; UsePSS or usepss were silently
    ignored.
 -- Prevent pthread_atfork handlers from being added unnecessarily after
    'scontrol reconfigure', which can eventually lead to a crash if too
    many handlers have been registered.
 -- Better debug messages when MaxSubmitJobs is hit.
 -- Docs - update squeue man page to describe all possible job states.
 -- Prevent orphaned step_extern steps when a job is cancelled while the
    prolog is still running.
* Changes in Slurm 17.11.2
==========================
 -- jobcomp/elasticsearch - append Content-Type to the HTTP header.
 -- MYSQL - Fix potential abort of slurmdbd when job has no TRES.
 -- Add advanced reservation flag of "REPLACE_DOWN" to replace DOWN or DRAINED
    nodes.
 -- slurm.spec-legacy - add missing libslurmfull.so to slurm.files.
 -- Fix squeue job ID filtering for pending job array records.
 -- Fix potential deadlock in _run_prog() in power save code.
 -- MYSQL - Add dynamic_offset in the database to force range for auto
    increment ids for the tres_table.
 -- MYSQL - Fix fallout from MySQL auto increment bug, see RELEASE_NOTES,
    only affects current 17.11 users tracking licenses or GRES in the database.
 -- Refactor logging logic to avoid possible memory corruption on non-x86
    architectures.
 -- Fix memory leak when getting jobs from the slurmdbd.
 -- Fix incorrect logic behind MemorySwappiness, and only set the value when
    specified in the configuration.
* Changes in Slurm 17.11.1-2
============================
 -- MYSQL - Make index for pack_job_id.
* Changes in Slurm 17.11.1
==========================
 -- Fix --with-shared-libslurm option to work correctly.
 -- Make it so only daemons log errors on configuration option duplicates.
 -- Fix for ConstrainDevices=yes to work correctly.
 -- Fix to purge old jobs using burst buffer if slurmctld daemon restarted
    after the job's burst buffer work was already completed.
 -- Set the logging prefix for slurmstepd as soon as possible.
 -- mpi/pmix: Fix the job registration for PMIx v2.1.
 -- Fix uid check for signaling a step with anything but SIGKILL.
 -- Return ESLURM_TRANSITION_STATE_NO_UPDATE instead of EAGAIN when trying to
    signal a step that is still running a prolog.
 -- Update Cray slurm_playbook.yaml with latest recommended version.
 -- Only say a prolog is done running after the extern step is launched.
 -- Wait to start a batch step until the prolog and extern step are
    fully run/launched.  Only matters if running with
    PrologFlags=[contain|alloc].
 -- Truncate a range for SlurmctldPort to FD_SETSIZE elements and throw an
    error, otherwise network traffic may be lost due to poll() not detecting
    traffic.
 -- Fix for srun --pack-group option that can reuse/corrupt memory.
 -- Fix handling ultra long hostlists in a hostfile.
 -- X11: fix xauth regex to handle '-' in hostnames again.
 -- Fix potential node reboot timeout problem for "scontrol reboot" command.
 -- Add ability for squeue to sort jobs by submit time.
 -- CRAY - Switch to standard pid files on Cray systems.
 -- Update jobcomp records on duplicate inserts.
 -- If unrecognized configuration file option found then print an appropriate
    fatal error message rather than relying upon random errno value.
 -- Initialize job_desc_msg_t's instead of just memset'ing them.
 -- Fix divide by zero when job requests no tasks and more memory than
    MaxMemPer{CPU|NODE}.
 -- Avoid changing Slurm internal errno on syslog() failures.
 -- BB - Only launch dependent jobs after the burst buffer is staged-out
    completely instead of right after the parent job finishes.
 -- node_features/knl_generic - If plugin can not fully load then do not spawn
    a background pthread (which will fail with invalid memory reference).
 -- Don't set the next jobid to give out to the highest jobid in the system on
    controller startup. Just use the checkpointed next use jobid.
 -- Docs - add Slurm/PMIx and OpenMPI build notes to the mpi_guide page.
 -- Add lustre_no_flush option to LaunchParameters for Native Cray systems.
 -- Fix rpmbuild issue with rpm 4.13+ / Fedora 25+.
 -- sacct - fix the display for the NNodes field when using the --units option.
 -- Prevent possible double-xfree on a buffer in stepd_completion.
 -- Fix to record job state on successful allocation but failed reply message.
 -- Fill in the user_name field for batch jobs if not sent by the slurmctld.
    (Which is the default behavior if PrologFlags=send_gids is not enabled.)
    This prevents job launch problems for sites using UsePAM=1.
 -- Handle syncing federated jobs that ran on non-origin clusters and were