This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 17.11.0pre1
==============================
 -- Interpret all format options in output/error file to log prolog errors. Prior
    logic only supported "%j" (job ID) option.
 -- Add the configure option --with-shared-libslurm which will link to
    libslurm.so instead of libslurm.o thus reducing the footprint of all the
    binaries.
 -- In switch plugin, added plugin_id symbol to plugins and wrapped
    switch_jobinfo_t with dynamic_plugin_data_t in interface calls in
    order to pass switch information between clusters with different switch
    types.
 -- Switch naming of acct_gather_infiniband to acct_gather_interconnect.
 -- Make it so you can "stack" the interconnect plugins.
 -- Add a last_sched_eval timestamp to record when a job was last evaluated
    by the main scheduler or backfill.
 -- Add scancel "--hurry" option to avoid staging out any burst buffer data.
 -- Simplify the sched plugin interface.
 -- Add new advanced reservation flags of "weekday" (repeat on each weekday;
    Monday through Friday) and "weekend" (repeat on each weekend day; Saturday
    and Sunday).
 -- Add new advanced reservation flag of "flex", which permits jobs requesting
    the reservation to begin prior to the reservation's start time and use
    resources inside or outside of the reservation. A typical use case is to
    prevent jobs not explicitly requesting the reservation from using those
    reserved resources rather than forcing jobs requesting the reservation to
    use those resources in the time frame reserved.
 -- Add NoDecay flag to QOS.
 -- Node "OS" field expanded from "sysname" to "sysname release version" (e.g.
    change from "Linux" to
    "Linux 4.8.0-28-generic #28-Ubuntu SMP Sat Feb 8 09:15:00 UTC 2017").
 -- jobcomp/elasticsearch - Add "job_name" and "wc_key" fields to stored
    information.
 -- jobcomp/filetxt - Add ArrayJobId, ArrayTaskId, ReservationName, Gres,
    Account, QOS, WcKey, Cluster, SubmitTime, EligibleTime, DerivedExitCode and
    ExitCode.
 -- scontrol modified to report core IDs for reservations containing individual
    cores.
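
The "weekday" and "weekend" reservation flags above can be illustrated with a
minimal sketch. The function name repeats_on is hypothetical and is not part of
Slurm; it only models which calendar days each flag covers as described above.

```python
from datetime import date


def repeats_on(flag: str, day: date) -> bool:
    """Return True if a reservation with the given repeat flag recurs on `day`.

    Hypothetical helper: models only the "weekday" (Monday through Friday)
    and "weekend" (Saturday and Sunday) flags described in the entry above.
    """
    is_weekday = day.weekday() < 5  # date.weekday(): Monday=0 ... Sunday=6
    if flag == "weekday":
        return is_weekday
    if flag == "weekend":
        return not is_weekday
    raise ValueError(f"unsupported flag: {flag!r}")
```

For example, repeats_on("weekday", date(2017, 3, 6)) is true because that date
is a Monday.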

* Changes in Slurm 17.02.2
==========================
 -- Update hyperlink to LBNL Node Health Check program.
 -- burst_buffer/cray - Add support for line continuation.
 -- If a job is cancelled by the user while its allocated nodes are being
    reconfigured (i.e. the capmc_resume program is rebooting nodes for the job)
    and the node reconfiguration fails (i.e. the reboot fails), then don't
    requeue the job but leave it in a cancelled state.
 -- capmc_resume (Cray resume node script) - Do not disable changing a node's
    active features if SyscfgPath is configured in the knl.conf file.
 -- Improve the srun documentation for the --resv-ports option.
 -- burst_buffer/cray - Fix parsing for discontinuous allocated nodes. A job
    allocation of "20,22" must be expressed as "20\n22".
 -- Fix rare segfault when shutting down slurmctld and still sending data to
    the database.
 -- Fix gres output of a job if it is updated while pending to be displayed
    correctly with Slurm tools.
 -- Fix pam_slurm_adopt.
 -- Fix missing unlock when job_list doesn't exist when starting priority/
    multifactor.
 -- Fix segfault if slurmctld is shutting down and the slurmdbd plugin was
    in the middle of setting db_indexes.
 -- Add ESLURM_JOB_SETTING_DB_INX to errno to note when a job can't be updated
    because the dbd is setting a db_index.
 -- Fix possible double insertion into database when a job is updated at the
    moment the dbd is assigning a db_index.
 -- Fix memory error when updating a job's licenses.
 -- Fix seff to work correctly with non-standard perl installs.
 -- Export missing slurmdbd_defs_[init|fini] needed for libslurmdb.so to work.
 -- Fix sacct returning far more than requested when querying against a job
    array task ID.
 -- Fix double read lock of tres when updating gres or licenses on a job.
 -- Make sure locks are always in place when calling
    assoc_mgr_make_tres_str_from_array.
 -- Prevent slurmctld SEGV when creating reservation with duplicated name.
 -- Consider QOS flags Partition[Min|Max]Nodes when doing backfill.
 -- Fix slurmdbd_defs.c to not have half symbols go to libslurm.so and the
    other half go to libslurmdb.so.
 -- Fix 'scontrol show jobs' to remove an errant newline when 'Switches' is
    printed.
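
The burst_buffer/cray entry above states that a discontinuous allocation of
"20,22" must be expressed as "20\n22". A minimal sketch of that conversion
follows. The function name is hypothetical, and the sketch assumes the input is
a plain comma-separated list; it does not attempt to handle any other node list
syntax.

```python
def to_newline_separated(node_list: str) -> str:
    """Rewrite a comma-separated node list with one entry per line.

    Hypothetical helper illustrating the "20,22" -> "20\\n22" form only.
    """
    return "\n".join(part.strip() for part in node_list.split(","))
```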

* Changes in Slurm 17.02.1-2
============================
 -- Replace clock_gettime with time(NULL) for very old systems without the call.

* Changes in Slurm 17.02.1
==========================
 -- Modify pam module to work when configured NodeName and NodeHostname differ.
 -- Update sbatch/srun man pages to explain the "filename pattern" more clearly.
 -- Add %x to sbatch/srun filename pattern to represent the job name.
 -- job_submit/lua - Add job "bitflags" field.
 -- Update slurm.spec file to note obsolete RPMs.
 -- Fix deadlock scenario when dumping configuration in the slurmctld.
 -- Remove unneeded job lock when running assoc_mgr cache.  This lock could
    cause potential deadlock when/if TRES changed in the database and the
    slurmctld wasn't made aware of the change.  This would be very rare.
 -- Fix missing locks in gres logic to avoid potential memory race.
 -- If gres is NULL on a job don't try to process it when returning detailed
    information about a job to scontrol.
 -- Fix print of consumed energy in sstat when no energy is being collected.
 -- Print formatted tres string when creating/updating a reservation.
 -- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
 -- Prevent manipulation of the cpu frequency and governor for batch or
    extern steps. This addresses an issue where the batch step would
    inadvertently set the cpu frequency maximum to the minimum value
    supported on the node.
 -- Convert a slurmctld power management data structure from array to list in
    order to eliminate the possibility of zombie child suspend/resume
    processes.
 -- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
    fails. Now job will be held. Update job update time when held.
 -- Refactor slurmctld agent logic to eliminate some pthreads.
 -- Added "SyscfgTimeout" parameter to knl.conf configuration file.
 -- Fix for CPU binding for job steps run under a batch job.
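
The filename pattern entries above mention "%x" (job name), and an earlier
entry in this file mentions "%j" (job ID). The sketch below shows how such a
pattern could expand. It is an illustration only, not Slurm's implementation:
the function name is hypothetical and only these two specifiers are handled.

```python
import re


def expand_pattern(pattern: str, job_id: int, job_name: str) -> str:
    """Expand "%j" (job ID) and "%x" (job name) in a filename pattern.

    Hypothetical helper: any other specifier is left untouched.
    """
    values = {"j": str(job_id), "x": job_name}
    return re.sub(r"%([jx])", lambda m: values[m.group(1)], pattern)
```

With a job named "align" and job ID 1234, the pattern "%x-%j.out" would expand
to "align-1234.out".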

* Changes in Slurm 17.02.0
==========================
 -- job_submit/lua - Make "immediate" parameter available.
 -- Fix srun I/O race condition to eliminate an error message that might be
    generated if the application exits with outstanding stdin.
 -- Fix regression when purging/archiving jobs/events.
 -- Add new job state JOB_OOM indicating Out Of Memory condition as detected
    by task/cgroup plugin.
 -- If a QOS has been added to the system, re-evaluate Deny/AllowQOS on
    partitions.
 -- Deny job with duplicate GRES requested.
 -- Fix segfault when loading very old assoc_mgr usage.
 -- CRAY systems: Restore TaskPlugins order of task/cray before task/cgroup.
 -- Task/cray: Treat missing "mems" cgroup with "debug" messages rather than
    "error" messages. The file may be missing at step termination due to a
    change in how cgroups are released at job/step end.
 -- Fix for job constraint specification with counts, --ntasks-per-node value,
    and no node count.
 -- Fix ordering of step task allocation to fill in a socket before going into
    another one.
 -- Fix configure to not require C++.
 -- job_submit/lua - Remove access to slurmctld internal reservation fields of
    job_pend_cnt and job_run_cnt.
 -- Prevent job_time_limit enforcement from blocking other internal operations
    if a large number of jobs need to be cancelled.
 -- Add 'preempt_youngest_order' option to preempt/partition_prio plugin.
 -- Fix controller being able to talk to a pre-released DBD.
 -- Added ability to override the invoking uid for "scontrol update job"
    by specifying "--uid=<uid>|-u <uid>".
 -- Changed file broadcast "offset" from 32 to 64 bits in order to support files
    over 2 GB.
 -- slurm.spec - do not install init scripts alongside systemd service files.

* Changes in Slurm 17.02.0rc1
==============================
 -- Add port info to 'sinfo' and 'scontrol show node'.
 -- Fix errant definition of USE_64BIT_BITSTR which can lead to core dumps.
 -- Move BatchScript to end of each job's information when using
    "scontrol -dd show job" to make it more readable.
 -- Add SchedulerParameters configuration parameter of "default_gbytes", which
    treats numeric only (no suffix) value for memory and tmp disk space as being
    in units of Gigabytes. Mostly for compatibility with LSF.
 -- Fix race condition in srun/sattach logic which would prevent srun from
    terminating.
 -- Bitstring operations are now 64bit instead of 32bit.
 -- Replace hweight() function in bitstring with faster version.
 -- scancel would treat a non-numeric argument as the name of jobs to be
    cancelled (a non-documented feature). Cancelling jobs by name now requires
    the "--jobname=" command line argument.
 -- scancel modified to note that no jobs satisfy the filter options when the
    --verbose option is used along with one or more job filters (e.g. "--qos=").
 -- Change _pack_cred to use pack_bit_str_hex instead of pack_bit_fmt for
    better scalability and performance.
 -- Add BootTime configuration parameter to knl.conf file to optimize resource
    allocations with respect to required node reboots.
 -- Add node_features_p_boot_time() to node_features plugin to optimize
    scheduling with respect to node reboots.
 -- Avoid allocating resources to a job in the event that its run time plus boot
    time (if needed) extend into an advanced reservation.
 -- Burst_buffer/cray - Avoid stage-out operation if job never started.
 -- node_features/knl_cray - Add capability to detect Uncorrectable Memory
    Errors (UME) and if detected then log the event in all job and step stderr
    with a message of the form:
    error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
    Similar logic added to node_features/knl_generic in version 17.02.0pre4.
 -- If job is allocated nodes which are powered down, then reset job start time
    when the nodes are ready and do not charge the job for power up time.
 -- Add the ability to purge transactions from the database.
 -- Add support for requeue'ing of federated jobs (BETA).
 -- Add support for interactive federated jobs (BETA).
 -- Add the ability to purge rolled up usage from the database.
 -- Properly set SLURM_JOB_GPUS environment variable for Prolog.
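
The "default_gbytes" entry above changes how a numeric-only memory or tmp disk
value is read. A minimal sketch of the idea follows. The function name is
hypothetical, and two points are assumptions rather than statements from this
file: that a value without a suffix is otherwise read as megabytes, and that
suffixes scale by powers of 1024.

```python
def to_megabytes(value: str, default_gbytes: bool = False) -> int:
    """Convert a size string such as "4", "512M" or "4G" to megabytes.

    Hypothetical helper. Assumes suffix-less values mean megabytes unless
    default_gbytes is set, and that units scale by 1024.
    """
    units = {"M": 1, "G": 1024, "T": 1024 * 1024}
    suffix = value[-1].upper()
    if suffix in units:
        # An explicit suffix always wins over the default unit.
        return int(value[:-1]) * units[suffix]
    return int(value) * (1024 if default_gbytes else 1)
```

Under these assumptions, "4" reads as 4 MB by default and as 4096 MB when
default_gbytes is set, while "4G" reads as 4096 MB either way.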

* Changes in Slurm 17.02.0pre4
==============================
 -- Add support for per-partition OverTimeLimit configuration.
 -- Add --mem_bind option of "sort" to run zonesort on KNL nodes at step start.
 -- Add LaunchParameters=mem_sort option to configure running of zonesort
    by default at step startup.
 -- Add "FreeSpace" information for each pool to the "scontrol show burstbuffer"
    output. Required changes to the burst_buffer_info_t data structure.
 -- Add new node state flag of NODE_STATE_REBOOT for node reboots triggered by
    "scontrol reboot" commands. Previous logic re-used NODE_STATE_MAINT flag,