This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 17.11.3
==========================
 -- Send SIG_UME correctly to a step.
 -- Sort sreport's reservation report by cluster, time_start, resv_name instead
    of cluster, resv_name, time_start.
 -- Avoid setting node in COMPLETING state indefinitely if the job initiating
    the node reboot is cancelled while the reboot is in progress.
 -- Scheduling fix for changing node features without any NodeFeatures plugins.
 -- Improve logic when summarizing job arrays mail notifications.
 -- Add scontrol -F/--future option to display nodes in FUTURE state.
 -- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE.
 -- When a job array is preempting make it so tasks in the array don't wait
    to preempt other possible jobs.
 -- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free
    in slurmstepd.
 -- node_features/knl_cray - Fix memory leaks that occur when slurmctld is
    reconfigured.
 -- node_features/knl_cray - Fix memory leak that can occur during normal
    operation.
 -- Fix srun environment variables for --prolog script.
 -- Fix job array dependency with "aftercorr" option when some array tasks in
    the first job fail. This fix lets all task array elements that can run
    proceed rather than stopping all subsequent task array elements.
 -- Fix potential deadlock in the slurmctld when using list_for_each.
 -- Fix for possible memory corruption in srun when running heterogeneous job
    steps.
 -- Fix output file containing "%t" (task ID) for heterogeneous job step to
    be based upon global task ID rather than task ID for that component of the
    heterogeneous job step.
 -- MYSQL - Fix potential abort when attempting to make an account a parent of
    itself.
 -- Fix potentially uninitialized variable in slurmctld.
 -- MYSQL - Fix issue for multi-dimensional machines when using sacct to
    find jobs that ran on specific nodes.
 -- Reject --acctg-freq at submit if invalid.
* Changes in Slurm 17.11.2
==========================
 -- jobcomp/elasticsearch - append Content-Type to the HTTP header.
 -- MYSQL - Fix potential abort of slurmdbd when job has no TRES.
 -- Add advanced reservation flag of "REPLACE_DOWN" to replace DOWN or DRAINED
    nodes.
 -- slurm.spec-legacy - add missing libslurmfull.so to slurm.files.
 -- Fix squeue job ID filtering for pending job array records.
 -- Fix potential deadlock in _run_prog() in power save code.
 -- MYSQL - Add dynamic_offset in the database to force range for auto
    increment ids for the tres_table.
 -- MYSQL - Fix fallout from MySQL auto increment bug, see RELEASE_NOTES,
    only affects current 17.11 users tracking licenses or GRES in the database.
 -- Refactor logging logic to avoid possible memory corruption on non-x86
    architectures.
 -- Fix memory leak when getting jobs from the slurmdbd.
 -- Fix incorrect logic behind MemorySwappiness, and only set the value when
    specified in the configuration.
* Changes in Slurm 17.11.1-2
============================
 -- MYSQL - Make index for pack_job_id.
* Changes in Slurm 17.11.1
==========================
 -- Fix --with-shared-libslurm option to work correctly.
 -- Make it so only daemons log errors on configuration option duplicates.
 -- Fix for ConstrainDevices=yes to work correctly.
 -- Fix to purge old jobs using burst buffer if slurmctld daemon restarted
    after the job's burst buffer work was already completed.
 -- Set the logging prefix for slurmstepd as early as possible.
 -- mpi/pmix: Fix the job registration for the PMIx v2.1.
 -- Fix uid check for signaling a step with anything but SIGKILL.
 -- Fix uid check when requesting a jobid from a pid.
 -- Return ESLURM_TRANSITION_STATE_NO_UPDATE instead of EAGAIN when trying to
    signal a step that is still running a prolog.
 -- Update Cray slurm_playbook.yaml with latest recommended version.
 -- Only say a prolog is done running after the extern step is launched.
 -- Wait to start a batch step until the prolog and extern step are
    fully run/launched.  Only matters if running with
    PrologFlags=[contain|alloc].
 -- Truncate a range for SlurmctldPort to FD_SETSIZE elements and throw an
    error, otherwise network traffic may be lost due to poll() not detecting
    traffic.
 -- Fix for srun --pack-group option that can reuse/corrupt memory.
 -- Fix handling ultra long hostlists in a hostfile.
 -- X11: fix xauth regex to handle '-' in hostnames again.
 -- Fix potential node reboot timeout problem for "scontrol reboot" command.
 -- Add ability for squeue to sort jobs by submit time.
 -- CRAY - Switch to standard pid files on Cray systems.
 -- Update jobcomp records on duplicate inserts.
 -- If unrecognized configuration file option found then print an appropriate
    fatal error message rather than relying upon random errno value.
 -- Initialize job_desc_msg_t's instead of just memset'ing them.
 -- Fix divide by zero when job requests no tasks and more memory than
    MaxMemPer{CPU|NODE}.
 -- Avoid changing Slurm internal errno on syslog() failures.
 -- BB - Only launch dependent jobs after the burst buffer is staged-out
    completely instead of right after the parent job finishes.
 -- node_features/knl_generic - If plugin can not fully load then do not spawn
    a background pthread (which will fail with invalid memory reference).
 -- Don't set the next jobid to give out to the highest jobid in the system on
    controller startup. Just use the checkpointed next use jobid.
 -- Docs - add Slurm/PMIx and OpenMPI build notes to the mpi_guide page.
 -- Add lustre_no_flush option to LaunchParameters for Native Cray systems.
 -- Fix rpmbuild issue with rpm 4.13+ / Fedora 25+.
 -- sacct - fix the display for the NNodes field when using the --units option.
 -- Prevent possible double-xfree on a buffer in stepd_completion.
 -- Fix recording of job state on successful allocation but failed reply message.
 -- Fill in the user_name field for batch jobs if not sent by the slurmctld.
    (Which is the default behavior if PrologFlags=send_gids is not enabled.)
    This prevents job launch problems for sites using UsePAM=1.
 -- Handle syncing federated jobs that ran on non-origin clusters and were
    cancelled while the origin cluster was down.
 -- Fix accessing variable outside of lock.
 -- slurm.spec: move libpmi to a separate package to solve a conflict with the
    version provided by PMIx. This will require a separate change to PMIx as
    well.
 -- X11 forwarding: change xauth handling to use hostname/unix:display format,
    rather than localhost:display.
 -- mpi/pmix - Fix warning if not compiling with debug.
* Changes in Slurm 17.11.0
==========================
 -- Fix documentation for MaxQueryTimeRange option in slurmdbd.conf.
 -- Avoid srun abort trying to run on heterogeneous job component that has
    ended.
 -- Add SLURM_PACK_JOB_ID,SLURM_PACK_JOB_OFFSET to PrologSlurmctld and
    EpilogSlurmctld environment.
 -- Treat ":" in #SBATCH arguments as fatal error. The "#SBATCH packjob" syntax
    must be used instead.
 -- job_submit/lua plugin: expose pack_job fields to get.
 -- Prevent scheduling deadlock with multiple components of heterogeneous job
    in different partitions (i.e. one heterogeneous job component is higher
    priority in one partition and another component is lower priority in a
    different partition).
 -- Fix for heterogeneous job starvation bug.
 -- Fix some slurmctld memory leaks.
 -- Add SLURM_PACK_JOB_NODELIST to PrologSlurmctld and EpilogSlurmctld
    environment.
 -- If PrologSlurmctld fails for pack job leader then requeue or kill all
    components of the job.
 -- Fix for multiple --pack-group srun arguments given out of order.
 -- Update slurm.conf(5) man page with updated example logrotate script.
 -- Add SchedulerParameters=whole_pack configuration parameter. If set, then
    hold, release and cancel operations on any component of a heterogeneous job
    will be applied to all components.
 -- Handle FQDNs in xauth cookies for x11 display forwarding properly.
 -- For heterogeneous job steps, the srun --open-mode option default value will
    be set to "append".
 -- Pack job scheduling list not being cleared between runs of the backfill
    scheduler resulted in various anomalies.
 -- Fix backward compatibility for pmix versions < 1.1.5.
 -- Fix use-after-free that can lead to slurmstepd segfaulting when setting
    ulimit values.
 -- Add heterogeneous job start data to sdiag output.
 -- X11 forwarding - handle systems with X11UseLocalhost=no set in sshd_config.
 -- Fix potential issue with missing symbols in gres plugins.
 -- Ignore querying clusters in federation that are down from status commands.
 -- Base federated jobs off of origin job and not the local cluster in API.
 -- Remove erroneous double '-' on rpath for libslurmfull.
 -- Remove version from libslurmfull and move it to $LIBDIR/slurm since the ABI
    could change from one version to the other.
 -- Fix unused wall time for reservations.
 -- Convert old reservation records to insert unused wall into the rows.
 -- slurm.spec: further restructuring and improvements.
 -- Allow node state to be updated between FAIL and DRAIN.
 -- x11 forwarding: handle build with alternate location for libssh2.
* Changes in Slurm 17.11.0rc3
==============================
 -- Fix extern step to wait until launched before allowing job to start.
 -- Add missing locks around figuring out TRES when clean starting the
    slurmctld.
 -- Cray modulefile: avoid removing /usr/bin from path on module unload.
 -- Make recurring reservations show up in the database.
 -- Adjust related resources (cpus, tasks, gres, mem, etc.) when updating
    NumNodes with scontrol.
 -- Don't initialize MPI plugins for batch or extern steps.
 -- slurm.spec - do not install a slurm.conf file under /etc/ld.so.conf.d.
 -- X11 forwarding - fix keepalive message generation code.
 -- If heterogeneous job step is unable to acquire MPI reserved ports then
    avoid referencing NULL pointer. Retry assigning ports ONLY for
    non-heterogeneous job steps.
 -- If any acct_gather_*_init fails, fatal instead of logging an error and
    continuing.
 -- launch/slurm plugin - Avoid using global variable for heterogeneous job
    steps, which could corrupt memory.
* Changes in Slurm 17.11.0rc2
==============================
 -- Prevent slurmctld abort with NodeFeatures=knl_cray and non-KNL nodes lacking
    any configured features.
 -- The --cpu_bind and --mem_bind options have been renamed to --cpu-bind
    and --mem-bind for consistency with the rest of Slurm's options. Both
    old and new syntaxes are supported for now.
 -- Add slurmdb_connection_commit to the slurmdb api to commit when needed.
 -- Add the federation APIs to the slurmdb.h file.
 -- Add job functions to the db_api.