This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.
* Changes in Slurm 17.11.0pre1
==============================
-- Interpret all format options in output/error file to log prolog errors. Prior
logic only supported "%j" (job ID) option.
-- Add the configure option --with-shared-libslurm which will link to
libslurm.so instead of libslurm.o thus reducing the footprint of all the
binaries.
-- In switch plugin, added plugin_id symbol to plugins and wrapped
switch_jobinfo_t with dynamic_plugin_data_t in interface calls in
order to pass switch information between clusters with different switch
types.
-- Switch naming of acct_gather_infiniband to acct_gather_interconnect
-- Add a last_sched_eval timestamp to record when a job was last evaluated
by the main scheduler or backfill.
-- Add scancel "--hurry" option to avoid staging out any burst buffer data.
-- Simplify the sched plugin interface.
-- Add new advanced reservation flags of "weekday" (repeat on each weekday;
Monday through Friday) and "weekend" (repeat on each weekend day; Saturday
and Sunday).
-- Add new advanced reservation flag of "flex", which permits jobs requesting
the reservation to begin prior to the reservation's start time and use
resources inside or outside of the reservation. A typical use case is to
prevent jobs not explicitly requesting the reservation from using those
reserved resources rather than forcing jobs requesting the reservation to
use those resources in the time frame reserved.
-- Node "OS" field expanded from "sysname" to "sysname release version" (e.g.
change from "Linux" to
"Linux 4.8.0-28-generic #28-Ubuntu SMP Sat Feb 8 09:15:00 UTC 2017").
-- jobcomp/elasticsearch - Add "job_name" and "wc_key" fields to stored
information.
-- jobcomp/filetxt - Add ArrayJobId, ArrayTaskId, ReservationName, Gres,
Account, QOS, WcKey, Cluster, SubmitTime, EligibleTime, DerivedExitCode and
ExitCode.
-- scontrol modified to report core IDs for reservation containing individual
cores.
-- MYSQL - Get rid of table join during rollup which speeds up the process
dramatically on large job/step tables.
-- Add ability to define features on clusters for directing federated jobs to
different clusters.
-- Add new RPC to process multiple federation RPCs in a single communication.
-- Modify slurm_load_jobs() function to load job information from all clusters
in a federation.
-- Add squeue --local and --sibling options to modify filtering of jobs on
federated clusters.
-- Add SchedulerParameters option of bf_max_job_user_part to specify the
maximum number of jobs per user for any single partition. This differs from
bf_max_job_user in that a separate counter is applied to each partition
rather than having a single counter per user applied to all partitions.
-- Modify backfill logic so that bf_max_job_user, bf_max_job_part and
bf_max_job_user_part options can all be used independently of each other.
-- Add sprio -p/--partition option to filter jobs by partition name.
-- Add partition name to job priority factor response message.
-- Add sprio --local and --sibling options for use in federation of clusters.
-- Add sprio "%c" format to print cluster name in federation mode.
-- Modify sinfo logic to provide a unified view of all nodes and partitions
in a federation, add --local option to only report local state information
even in a cluster, print cluster name with "%V" format option, and
optionally sort by cluster name.
-- If a task in a parallel job fails and it was launched with the
--kill-on-bad-exit option then terminate the remaining tasks using the
SIGCONT, SIGTERM and SIGKILL signals rather than just sending SIGKILL.
-- Include submit_time when doing the sort for job scheduling.
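
The difference between bf_max_job_user and bf_max_job_user_part described
above can be illustrated with a minimal Python sketch. This is not Slurm's
backfill code: the function name, the job tuples and the way limits are
passed are all hypothetical, and the sketch only models the counting rule
stated in the entry (one counter per user across all partitions versus a
separate counter for each user/partition pair).

```python
from collections import Counter


def select_jobs(jobs, max_per_user=None, max_per_user_part=None):
    """Return the job IDs one scheduling pass would consider.

    jobs: iterable of (job_id, user, partition) tuples in priority order.
    A limit of None means that limit is not enforced (hypothetical model).
    """
    per_user = Counter()        # one counter per user, all partitions
    per_user_part = Counter()   # one counter per (user, partition) pair
    selected = []
    for job_id, user, part in jobs:
        if max_per_user is not None and per_user[user] >= max_per_user:
            continue
        if (max_per_user_part is not None
                and per_user_part[(user, part)] >= max_per_user_part):
            continue
        per_user[user] += 1
        per_user_part[(user, part)] += 1
        selected.append(job_id)
    return selected


# Hypothetical queue: alice has two jobs in "debug" and one in "batch".
jobs = [
    (1, "alice", "debug"),
    (2, "alice", "debug"),
    (3, "alice", "batch"),
    (4, "bob", "debug"),
]
```

With a per-user limit of 1, only one of alice's jobs is considered in total.
With a per-user-per-partition limit of 1, one of her jobs is considered in
each partition. Because the two checks are independent, either or both can be
supplied.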
* Changes in Slurm 17.02.3
==========================
-- Increase --cpu_bind and --mem_bind field length limits.
* Changes in Slurm 17.02.2
==========================
-- Update hyperlink to LBNL Node Health Check program.
-- burst_buffer/cray - Add support for line continuation.
-- If a job is cancelled by the user while its allocated nodes are being
reconfigured (i.e. the capmc_resume program is rebooting nodes for the job)
and the node reconfiguration fails (i.e. the reboot fails), then don't
requeue the job but leave it in a cancelled state.
-- capmc_resume (Cray resume node script) - Do not disable changing a node's
active features if SyscfgPath is configured in the knl.conf file.
-- Improve the srun documentation for the --resv-ports option.
-- burst_buffer/cray - Fix parsing for discontinuous allocated nodes. A job
allocation of "20,22" must be expressed as "20\n22".
-- Fix rare segfault when shutting down slurmctld and still sending data to
the database.
-- Fix gres output of a job if it is updated while pending to be displayed
correctly with Slurm tools.
-- Fix missing unlock when job_list doesn't exist when starting priority/
multifactor.
-- Fix segfault if slurmctld is shutting down and the slurmdbd plugin was
in the middle of setting db_indexes.
-- Add ESLURM_JOB_SETTING_DB_INX to errno to note when a job can't be updated
because the dbd is setting a db_index.
-- Fix possible double insertion into database when a job is updated at the
moment the dbd is assigning a db_index.
-- Fix memory error when updating a job's licenses.
-- Fix seff to work correctly with non-standard perl installs.
-- Export missing slurmdbd_defs_[init|fini] needed for libslurmdb.so to work.
-- Prevent sacct from returning far more than requested when querying against a
job array task id.
-- Fix double read lock of tres when updating gres or licenses on a job.
-- Make sure locks are always in place when calling
assoc_mgr_make_tres_str_from_array.
-- Prevent slurmctld SEGV when creating reservation with duplicated name.
-- Consider QOS flags Partition[Min|Max]Nodes when doing backfill.
-- Fix slurmdbd_defs.c to not have half symbols go to libslurm.so and the
other half go to libslurmdb.so.
-- Fix 'scontrol show jobs' to remove an errant newline when 'Switches' is
printed.
-- Better code for handling memory required by a task on a heterogeneous
system.
-- Fix regression in 17.02.0 with respect to GrpTresMins on a QOS or
Association.
-- Schedule interactive jobs quicker.
-- Perl API - correct value of MEM_PER_CPU constant to correctly handle
memory values.
-- Fix 'flags' variable to be 32 bit from the old 16 bit value in the perl api.
-- Export sched_nodes for a job in the perl api.
-- Improve error output when updating a reservation that has already started.
-- Fix --ntasks-per-node issue with srun so DenyOnLimit would work correctly.
-- node_features/knl_cray plugin - Fix memory leak.
-- Fix wrong cpu_per_task count issue on heterogeneous system when dealing with
steps.
-- Fix double free issue when removing usage from an association with sacctmgr.
-- Fix issue with SPANK plugins attempting to set null values as environment
variables, which leads to the command segfaulting on newer glibc versions.
-- Fix race condition on slurmctld startup when plugins have not gone through
init() ahead of the rpc_manager processing incoming messages.
-- job_submit/lua - expose admin_comment field.
-- Allow AdminComment field to be set by the job_submit plugin.
-- Allow AdminComment field to be changed by any Administrator.
-- MYSQL - Streamline job flush sql when doing a clean start on the slurmctld.
-- Fix potential infinite loop when talking to the DBD when shutting down
the slurmctld.
-- Fix MCS filter.
-- Make it so pmix can be included in the plugin rpm without having to
specify --with-pmix.
-- MYSQL - Fix initial load when not using the DBD.
-- Fix scontrol top to not make jobs priority 0 (held).
-- Downgrade info message about exceeding partition time limit to a debug2.
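
The burst_buffer/cray entry above states that an allocation of "20,22" must be
expressed with one entry per line. A minimal Python sketch of that conversion
follows. The function name is hypothetical, this is not the plugin's code, and
the sketch treats each comma-separated item as opaque text, since the entry
does not say how other forms of node list are handled.

```python
def nodes_to_lines(node_list):
    """Rewrite a comma-separated node list with one entry per line."""
    items = (item.strip() for item in node_list.split(","))
    # Skip empty items so a trailing comma does not produce a blank line.
    return "\n".join(item for item in items if item)
```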
* Changes in Slurm 17.02.1-2
============================
-- Replace clock_gettime with time(NULL) for very old systems without the call.
* Changes in Slurm 17.02.1
==========================
-- Modify pam module to work when configured NodeName and NodeHostname differ.
-- Update sbatch/srun man pages to explain the "filename pattern" more clearly.
-- Add %x to sbatch/srun filename pattern to represent the job name.
-- job_submit/lua - Add job "bitflags" field.
-- Update slurm.spec file to note obsolete RPMs.
-- Fix deadlock scenario when dumping configuration in the slurmctld.
-- Remove unneeded job lock when running assoc_mgr cache. This lock could
cause potential deadlock when/if TRES changed in the database and the
slurmctld wasn't made aware of the change. This would be very rare.
-- Fix missing locks in gres logic to avoid potential memory race.
-- If gres is NULL on a job don't try to process it when returning detailed
information about a job to scontrol.
-- Fix print of consumed energy in sstat when no energy is being collected.
-- Print formatted tres string when creating/updating a reservation.
-- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
-- Prevent manipulation of the cpu frequency and governor for batch or
extern steps. This addresses an issue where the batch step would
inadvertently set the cpu frequency maximum to the minimum value
supported on the node.
-- Convert a slurmctld power management data structure from array to list in
order to eliminate the possibility of zombie child suspend/resume
processes.
-- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
fails. Now job will be held. Update job update time when held.
-- Refactor slurmctld agent logic to eliminate some pthreads.
-- Added "SyscfgTimeout" parameter to knl.conf configuration file.
-- Fix for CPU binding for job steps run under a batch job.
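
The "%x" filename pattern entry above, together with the existing "%j" (job
ID) option, can be illustrated with a minimal Python sketch. It is not how
sbatch or srun implement expansion: the function name and arguments are
hypothetical, and real patterns support more specifiers than the two modeled
here.

```python
import re


def expand_pattern(pattern, job_id, job_name):
    """Expand %j (job ID) and %x (job name); leave anything else untouched."""
    values = {"j": str(job_id), "x": job_name}
    return re.sub(r"%([jx])", lambda m: values[m.group(1)], pattern)
```

For example, under these assumptions expand_pattern("%x-%j.out", 1234, "align")
yields "align-1234.out".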
* Changes in Slurm 17.02.0
==========================
-- job_submit/lua - Make "immediate" parameter available.
-- Fix srun I/O race condition to eliminate an error message that might be
generated if the application exits with outstanding stdin.
-- Fix regression when purging/archiving jobs/events.
-- Add new job state JOB_OOM indicating Out Of Memory condition as detected
by task/cgroup plugin.
-- If QOS has been added to the system, re-evaluate Deny/AllowQOS on
partitions.
-- Deny job with duplicate GRES requested.
-- Fix loading super old assoc_mgr usage without segfaulting.
-- CRAY systems: Restore TaskPlugins order of task/cray before task/cgroup.
-- Task/cray: Treat missing "mems" cgroup with "debug" messages rather than
"error" messages. The file may be missing at step termination due to a
change in how cgroups are released at job/step end.
-- Fix for job constraint specification with counts, --ntasks-per-node value,
and no node count.
-- Fix ordering of step task allocation to fill in a socket before going into