This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in Slurm 14.03.0pre6
-- Modify slurmstepd to log messages according to the LogTimeFormat
parameter in slurm.conf.
-- Ensure that overlapping reservations do not oversubscribe available
-- Added core specialization logic to select/cons_res plugin.
-- Added whole_node field to job_resources structure and enable gang scheduling
for jobs with core specialization.
-- When using FastSchedule = 1 the nodes with less than configured resources
are no longer set DOWN, they are set to DRAIN instead.
-- Modified 'sacctmgr show associations' command to show GrpCPURunMins
by default.
-- Replace the hostlist_push() function with a more efficient
-- Modify the reading of Lustre file system statistics to print more
information when debugging and when I/O errors occur.
-- Add specialized core count field to job credential data.
NOTE: This changes the communications protocol from other pre-releases of
version 14.03. All programs must be cancelled and daemons upgraded from
previous pre-releases of version 14.03. Upgrades from version 2.6 or earlier
can take place without loss of jobs.
-- Add version number to node and front-end configuration information visible
using the scontrol tool.
-- Add a RESERVED flag for node state so that idle resources in a
reservation are not reported as "idle".
-- Added core specialization plugin infrastructure.
-- Added new job_submit/throttle plugin to control the rate at which a user
can submit jobs.
-- CRAY - added network performance counters option.
-- Allow scontrol suspend/resume to accept jobid in the format jobid_taskid
to suspend/resume array elements.
-- In the slurmctld job record, split "shared" variable into "share_res" (share
resource) and "whole_node" fields.
-- Fix the format of SLURM_STEP_RESV_PORTS. It was generated incorrectly
when using the hostlist_push_host function and input surrounded by [].
-- Modify the srun --slurmd-debug option to accept debug string tags
(quiet, fatal, error, info, verbose) besides the numerical values.
-- Fix the bug where --cpu_bind=map_cpu is interpreted as mask_cpu.
-- Added squeue format option of "%X" (core specialization count).
-- Added core specialization web page (just a start for now).
-- Fix bug in job step allocation failing due to memory limit.
-- Modify the pbsnodes script to reflect its output on a TORQUE system.
-- Add ability to clear a node's DRAIN flag using scontrol or sview by setting
its state to "UNDRAIN". The node's base state (e.g. "DOWN" or "IDLE") will
not be changed.
-- Modify the output of 'scontrol show partition' by displaying
DefMemPerCPU=UNLIMITED and MaxMemPerCPU=UNLIMITED when these limits are
configured as 0.
-- mpirun-mic - Major re-write of the command wrapper for Xeon Phi use.
-- Add new configuration parameter of AuthInfo to specify port used by
authentication plugin.
-- Corrected slurmstepd ident name when logging to syslog.
-- Fixed sh5util loop when there are no node-step files.
-- Add SLURM_CLUSTER_NAME to environment variables passed to PrologSlurmctld,
Prolog, EpilogSlurmctld, and Epilog
-- Add the idea of running a prolog right when an allocation happens
instead of when running on the node for the first time.
-- If a user runs 'scontrol reconfig' after hostnames or the host count have
changed, the slurmctld throws a fatal error.
-- gres.conf - Add "NodeName" specification so that a single gres.conf file
can be used for a heterogeneous cluster.
-- Add flag to accounting RPC to indicate if job data is packed or not.
-- After all srun tasks have terminated on a node close the stdout/stderr
channel with the slurmstepd on that node.
-- In case of I/O error with slurmstepd log an error message and abort the
job.
-- Add --test-only option to sbatch command to validate the script and options.
The response includes expected start time and resources to be allocated.
-- Remove the ThreadID documentation from slurm.conf. This functionality has
been obsoleted by the LogTimeFormat.
-- Sched plugins - rename global and plugin functions names for consistency
with other plugin types.
-- BGQ - Added RebootQOSList option to bluegene.conf to allow an implicit
reboot of a block if only jobs in the list are running on it when cnodes
go into a failure state.
-- Correct task count of pending job steps.
-- Improve limit enforcement for jobs, set RLIMIT_RSS, RLIMIT_AS and/or
RLIMIT_DATA to enforce memory limit.
-- Pending job steps will have step_id of INFINITE rather than NO_VAL and
will be reported as "TBD" by scontrol and squeue commands.
-- Add logic so PMI_Abort or PMI2_Abort can propagate an exit code.
-- Added SlurmdPlugstack configuration parameter.
-- Added PriorityFlag DEPTH_OBLIVIOUS to have the depth of an association
not affect its priority.
-- Multi-thread the sinfo command (one thread per partition).
-- Added sgather tool to gather files from a job's compute nodes into a
central location.
-- Added configuration parameter FairShareDampeningFactor to offer a greater
priority range based upon utilization.
-- Change MaxArraySize and job's array_task_id from 16-bit to 32-bit field.
Additional Slurm enhancements are required to support larger job arrays.
-- Added -S/--core-spec option to salloc, sbatch and srun commands to reserve
specialized cores for system use. Modify scontrol and sview to get/set
the new field. No enforcement exists yet for these new options.
struct job_info / slurm_job_info_t: Added core_spec
struct job_descriptor / job_desc_msg_t: Added core_spec
-- Do not set SLURM_NODEID environment variable on front-end systems.
-- Convert bitmap functions to use int32_t instead of int in data structures
and function arguments. This is to reliably enable use of bitmaps containing
up to 4 billion elements. Several data structures containing index values
were also changed from data type int to int32_t:
- Struct job_info / slurm_job_info_t: Changed exc_node_inx, node_inx, and
req_node_inx from type int to type int32_t
- job_step_info_t: Changed node_inx from type int to type int32_t
- Struct partition_info / partition_info_t: Changed node_inx from type int
to type int32_t
- block_job_info_t: Changed cnode_inx from type int to type int32_t
- block_info_t: Changed ionode_inx and mp_inx from type int to type int32_t
- Struct reserve_info / reserve_info_t: Changed node_inx from type int to
type int32_t
-- Modify qsub wrapper output to match torque command output, just print the
job ID rather than "Submitted batch job #"
-- Change Slurm error string for ESLURM_MISSING_TIME_LIMIT from
"Missing time limit" to
"Time limit specification required, but not provided"
-- Change salloc job_allocate error message header from
"Failed to allocate resources" to
"Job submit/allocate failed"
-- Modify slurmctld message retry logic to support Cray cold-standby SDB.
-- Added "JobAcctGatherParams" configuration parameter. Value of "NoShare"
disables accounting for shared memory.
-- Added fields to "scontrol show job" output: boards_per_node,
sockets_per_board, ntasks_per_node, ntasks_per_board, ntasks_per_socket,
ntasks_per_core, and nice.
-- Add squeue output format options for job command and working directory
(%o and %Z respectively).
-- Add stdin/out/err to sview job output.
-- Add new job_state of JOB_BOOT_FAIL for job terminations due to failure to
boot its allocated nodes or BlueGene block.
-- CRAY - Add SelectTypeParameters NHC_NO_STEPS and NHC_NO which will disable
the node health check script for steps and allocations respectively.
-- Reservation with CoreCnt: Avoid possible invalid memory reference.
-- Add new error code for attempt to create a reservation with duplicate name.
-- Validate that a hostlist file contains text (i.e. not a binary).
-- switch/generic - propagate switch information from srun down to slurmd and
-- CRAY - Do not package Slurm's libpmi or libpmi2 libraries. The Cray version
of those libraries must be used.
-- Added a new option to the scontrol command to view licenses that are
configured, in use and available: 'scontrol show licenses'.
-- MySQL - Made Slurm compatible with MySQL 5.6.
-- Add task pointer to the task_post_term() function in task plugins. The
terminating task's PID is available in task->pid.
-- Defer sending SIGKILL signal to processes while core dump in progress.
-- Added JobContainerPlugin configuration parameter and plugin infrastructure.
-- Added partition configuration parameters AllowAccounts, AllowQOS,
DenyAccounts and DenyQOS.
-- The rpmbuild option for a cray system with ALPS has changed from
%_with_cray to %_with_cray_alps.
-- The log file timestamp format can now be selected at runtime via the
LogTimeFormat configuration option. See the slurm.conf and slurmdbd.conf
man pages for details.
-- Added switch/generic plugin to convey a job's network topology.
-- BLUEGENE - If block is in 'D' state or has more cnodes in error than
MaxBlockInError set the job wait reason appropriately.
-- API use: Generate an error return rather than fatal error and exit if the
configuration file is absent or invalid. This will permit Slurm APIs to be
more reliably used by other programs.
-- Add support for load-based scheduling, allocate jobs to nodes with the
largest number of available CPUs. Added SchedulingParameters parameter of
"CR_LLN" and partition parameter of "LLN=yes|no".
-- Added job_info() and step_info() functions to the gres plugins to extract
plugin specific fields from the job's or step's GRES data structure.
-- Added sbatch --signal option of "B:" to signal the batch shell rather than
only the spawned job steps.
-- Added sinfo and squeue format option of "%all" to print all fields available
for the data type with a vertical bar separating each field.
-- Add mechanism for job_submit plugin to generate error message for srun,
salloc or sbatch to stderr. New argument added to job_submit function in
the plugin.
-- Add StdIn, StdOut, and StdErr paths to job information dumped with
"scontrol show job".
-- Permit Slurm administrator to submit a batch job as any user.
-- Set a job's RLIMIT_AS limit based upon its memory limit and VsizeFactor
configuration value.
-- Remove Postgres plugins
-- Make jobacct_gather/cgroup work correctly and also make all jobacct_gather
plugins more maintainable.
-- Proctrack/pgid - Add support for proctrack_p_plugin_get_pids() function.
-- Sched/backfill - Change default max_job_bf parameter from 50 to 100.
-- Added -I|--item-extract option to sh5util to extract data item from series.
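
The load-based scheduling entry in this release ("CR_LLN" / "LLN=yes|no")
describes allocating jobs to the nodes with the largest number of available
CPUs. The Python sketch below illustrates only that ordering idea. The node
names, the idle CPU counts and the function pick_least_loaded() are
hypothetical, and this is not the select plugin's actual logic.

```python
def pick_least_loaded(idle_cpus, cpus_needed):
    """Return (node, cpus) pairs, taking the least loaded nodes first.

    idle_cpus maps a node name to its count of available CPUs
    (hypothetical input; Slurm tracks this state internally).
    """
    picked = []
    remaining = cpus_needed
    # Sort by available CPUs, largest first; break ties by node name
    # so the result is deterministic.
    for node, idle in sorted(idle_cpus.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining <= 0:
            break
        if idle <= 0:
            continue
        take = min(idle, remaining)
        picked.append((node, take))
        remaining -= take
    if remaining > 0:
        raise ValueError("not enough available CPUs")
    return picked
```

For example, with idle counts of 2, 8 and 4 CPUs on three nodes, a request
for 10 CPUs is placed on the node with 8 idle CPUs first and the remainder on
the node with 4.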
* Changes in Slurm 2.6.6
-- sched/backfill - Fix bug that could result in failing to reserve resources
for high priority jobs.
-- Correct job RunTime if requeued from suspended state.
-- Reset job priority from zero (held) on manual resume from suspend state.
-- If FastSchedule=0 then do not DOWN a node with low memory or disk size.
-- Update sshare.1 man page making it consistent with sacctmgr.1.
-- Do not reset a job's priority when the slurmctld restarts if previously
set to some specific value.
-- sview - Fix regression where the Node tab wasn't able to add/remove columns.
-- Fix slurmstepd lock when job terminates inside the infiniband
network traffic accounting plugin.
-- Correct the documentation to read filesystem instead of Lustre. Update
the srun help.
-- Fix the acct_gather_filesystem_lustre.c to compute the Lustre accounting
data correctly accumulating differences between sampling intervals.
Fix the data structure mismatch between acct_gather_filesystem_lustre.c
and slurm_jobacct_gather.h which caused the hdf5 plugin to log incorrect
data.
-- Don't allow PMI_TIME to be zero, which would cause a floating point
exception.
-- Fix purging of old reservation errors in database.
-- MYSQL - If starting the plugin and the database isn't up attempt to
connect in a loop instead of producing a fatal.
-- BLUEGENE - If IONodesPerMP changes in bluegene.conf recalculate bitmaps
based on ionode count correctly on slurmctld restart.
-- Fix step allocation when some CPUs are not available due to memory limits.
This happens when one step is active and using memory that blocks the
scheduling of another step on a portion of the CPUs needed. The new step
is now delayed rather than aborting with "Requested node configuration is
not available".
-- Make sure node limits get assessed if no node count was given in request.
-- Removed obsolete slurm_terminate_job() API.
-- Update documentation about QOS limits
-- Retry task exit message from slurmstepd to srun on message timeout.
-- Correction to logic reserving all nodes in a specified partition.
-- Added support for selecting AMD GPU by setting GPU_DEVICE_ORDINAL env var.
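
The Lustre accounting fix in this release concerns accumulating the
differences between sampling intervals rather than raw counter values. A
minimal sketch of that idea follows. The class DeltaAccumulator, its method
names and the reset handling are assumptions made for illustration; this is
not the code in acct_gather_filesystem_lustre.c.

```python
class DeltaAccumulator:
    """Accumulate the growth of a monotonically increasing counter."""

    def __init__(self):
        self.previous = None   # last raw counter value seen
        self.total = 0         # sum of differences between samples

    def sample(self, raw_value):
        """Record one reading and return the difference since the last one."""
        if self.previous is None:
            delta = 0          # first sample only establishes the baseline
        else:
            delta = raw_value - self.previous
            if delta < 0:
                # Counter was reset; count the new value as the growth
                # (an assumption made for this sketch).
                delta = raw_value
        self.previous = raw_value
        self.total += delta
        return delta
```

Each call to sample() contributes only the change since the previous reading,
so the running total reflects activity during the sampled window rather than
the counter's absolute value.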
* Changes in Slurm 2.6.5
-- Correction to hostlist parsing bug introduced in v2.6.4 for hostlists with
more than one numeric range in brackets (e.g. "rack[0-3]_blade[0-63]").
-- Add notification if using proctrack/cgroup and task/cgroup when oom hits.
-- Corrections to advanced reservation logic with overlapping jobs.
-- job_submit/lua - add cpus_per_task field to those available.
-- Add cpu_load to the node information available using the Perl API.
-- Correct a job's GRES allocation data in accounting records for non-Cray
systems.
-- Substantial performance improvement for systems with Shared=YES or FORCE
and large numbers of running jobs (replace bubble sort with quick sort).
-- proctrack/cgroup - Add locking to prevent race condition where one job step
is ending for a user or job at the same time another job step is starting
and the user or job container is deleted from under the starting job step.
-- Fixed sh5util loop when there are no node-step files.
-- Fix race condition on batch job termination that could result in a job exit
code of 0xfffffffe if the slurmd on node zero registers its active jobs at
the same time that slurmstepd is recording the job's exit code.
-- Correct logic returning remaining job dependencies in job information
reported by scontrol and squeue. Eliminates vestigial descriptors with
no job ID values (e.g. "afterany").
-- Improve performance of REQUEST_JOB_INFO_SINGLE RPC by removing unnecessary
locks and use hash function to find the desired job.
-- jobcomp/filetxt - Reopen the file when slurmctld daemon is reconfigured
or gets SIGHUP.
-- Remove notice of CVE with very old/deprecated versions of Slurm in
-- Fix if hwloc_get_nbobjs_by_type() returns zero core count (set to 1).
-- Added ApbasilTimeout parameter to the cray.conf configuration file.
-- Handle in the API if parts of the node structure are NULL.
-- Fix srun hang when IO fails to start at launch.
-- Fix for GRES bitmap not matching the GRES count resulting in abort
(requires manual resetting of GRES count, changes to gres.conf file,
and slurmd restarts).
-- Modify sview to better support job arrays.
-- Modify squeue to support longer job ID values (for many job array tasks).
-- Fix race condition in authentication credential creation that could corrupt
memory. (NOTE: This race condition has existed since 2003 and would be
exceedingly rare.)
-- Slurmstepd variable initialization - Without this patch, free() is called
on a random memory location (i.e. whatever is on the stack), which can
result in slurmstepd dying and a completed job not being purged in a
timely fashion.
-- Fix slurmstepd race condition when separate threads are reading and
modifying the job's environment, which can result in the slurmstepd failing
with an invalid memory reference.
-- Fix erroneous error messages when running gang scheduling.
-- Fix minor memory leak.
-- scontrol modified to suspend, resume, hold, uhold, or release multiple
jobs in a space separated list.
-- Minor debug error when a connection goes away at the end of a job.
-- Validate return code from calls to slurm_get_peer_addr
-- BGQ - Fix issues with making sure all cnodes are accounted for when multiple
steps cause multiple cnodes in one allocation to go into error at the
same time.
-- scontrol show job - Correct NumNodes value calculated based upon job
-- BGQ - Fix issue if user runs multiple sub-block jobs inside a multiple
midplane block that starts on a higher coordinate than it ends (i.e. if a
block has midplanes [0010,0013], 0013 is the start even though it is
listed second in the hostlist).
-- BGQ - Add midplane to the total_cnodes used in the runjob_mux plugin
for better debug.
-- Update AllocNodes paragraph in slurm.conf.5.
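
The hostlist correction in this release involves names with more than one
bracketed numeric range, such as "rack[0-3]_blade[0-63]". The Python sketch
below shows one way such an expression can be expanded. The functions
expand_hostlist() and _expand_group() are hypothetical helpers written for
this illustration and do not mirror Slurm's hostlist implementation.

```python
import itertools
import re

def expand_hostlist(expr):
    """Expand one hostlist expression such as 'rack[0-1]_blade[0-2]'.

    Handles several bracketed groups per name, comma-separated items
    inside a group, and zero-padded ranges. Comma-separated lists of
    whole expressions are outside the scope of this sketch.
    """
    # re.split with a capturing group alternates literal text and
    # bracket contents: ['rack', '0-1', '_blade', '0-2', ''].
    parts = re.split(r"\[([^\]]*)\]", expr)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            choices.append([part])          # literal text
        else:
            choices.append(_expand_group(part))
    return ["".join(combo) for combo in itertools.product(*choices)]

def _expand_group(group):
    """Expand '0-3,7' into ['0', '1', '2', '3', '7'], keeping zero padding."""
    values = []
    for item in group.split(","):
        if "-" in item:
            low, high = item.split("-", 1)
            width = len(low)
            values.extend(str(n).zfill(width)
                          for n in range(int(low), int(high) + 1))
        else:
            values.append(item)
    return values
```

Every bracketed group multiplies the number of names, so the example from the
entry above expands to 4 x 64 = 256 host names.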
* Changes in Slurm 2.6.4
-- Honor ntasks-per-node option with exclusive node allocations.
-- sched/backfill - Prevent invalid memory reference if bf_continue option is
configured and slurm is reconfigured during one of the sleep cycles or if
there are any changes to the partition configuration or if the normal
scheduler runs and starts a job that the backfill scheduler is actively
working on.
-- Update man pages information about acct-freq and JobAcctGatherFrequency
to reflect only the latest supported format.
-- Minor document update to include note about PrivateData=Usage for the
slurm.conf when using the DBD.
-- Expand information reported with DebugFlags=backfill.
-- Initiate jobs pending to run in a reservation as soon as the reservation
becomes active.
-- Purge expired reservation even if it has pending jobs.
-- Corrections to calculation of a pending job's expected start time.
-- Remove some vestigial logic treating job priority of 1 as a special case.
-- Free memory at daemon shutdown to avoid minor memory leaks.
-- Updated documentation to give correct units being displayed.
-- Report AccountingStorageBackupHost with "scontrol show config".
-- init scripts ignore quotes around Pid file name specifications.
-- Fixed typo about command case in quickstart.html.
-- task/cgroup - handle new cpuset files, similar to commit c4223940.
-- Replace the tempname() function call with mkstemp().
-- Fix for --cpu_bind=map_cpu/mask_cpu/map_ldom/mask_ldom plus
--mem_bind=map_mem/mask_mem options, broken in 2.6.2.
-- Restore default behavior of allocating cores to jobs on a cyclic basis
across the sockets unless SelectTypeParameters=CR_CORE_DEFAULT_DIST_BLOCK
or user specifies other distribution options.
-- Enforce JobRequeue configuration parameter on node failure. Previously
always requeued the job.
-- acct_gather_energy/ipmi - Add delay before retry on read error.
-- select/cons_res with GRES and multiple threads per core, fix possible
infinite loop.
-- proctrack/cgroup - Add cgroup create retry logic in case one step is
starting at the same time as another step is ending and the logic to create
and delete cgroups overlaps.
-- Improve setting of job wait "Reason" field.
-- Correct sbatch documentation and job_submit/pbs plugin "%j" is job ID,
not "%J" (which is job_id.step_id).
-- Improvements to sinfo performance, especially for large numbers of
-- SlurmdDebug - Permit changes to slurmd debug level with "scontrol reconfig"
-- smap - Avoid invalid memory reference with hidden nodes.
-- Fix sacctmgr modify qos set preempt+/-=.
-- BLUEGENE - fix issue where node count wasn't set up correctly when srun
performs the allocation, regression in 2.6.3.
-- Add support for dependencies of job array elements (e.g.
"sbatch --depend=afterok:123_4 ...") or all elements of a job array (e.g.
"sbatch --depend=afterok:123 ...").
-- Add support for new options in sbatch qsub wrapper:
-W block=true (wait for job completion)
Clear PBS_NODEFILE environment variable
-- Fixed the MaxSubmitJobsPerUser limit in QOS, which was limiting submissions
one job too early.
-- sched/wiki, sched/wiki2 - Fix to work with changed logic introduced in
version 2.6.3 preventing Maui/Moab from starting jobs.
-- Updated the QOS limits documentation and man page.
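
The switch to mkstemp() noted in this release avoids the window between
choosing a temporary file name and opening the file. Python's
tempfile.mkstemp() follows the same pattern of creating and opening the file
in one step, which the sketch below uses to show the idea. The helper
write_scratch() and the "scratch_" prefix are hypothetical and are not taken
from Slurm.

```python
import os
import tempfile

def write_scratch(data, directory=None):
    """Create a unique temporary file, write data to it, return its path."""
    # mkstemp() creates the file and returns an open descriptor, so no
    # other process can claim the name between creation and open.
    fd, path = tempfile.mkstemp(prefix="scratch_", dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

The caller is responsible for removing the file when it is no longer needed.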
* Changes in Slurm 2.6.3
-- Add support for some new #PBS options in sbatch scripts and qsub wrapper:
-l accelerator=true|false (GPU use)
-l mpiprocs=# (processors per node)
-l naccelerators=# (GPU count)
-l select=# (node count)
-l ncpus=# (task count)
-v key=value (environment variable)
-W depend=opts (job dependencies, including "on" and "before" options)
-W umask=# (set job's umask)
-- Added qalter and qrerun commands to torque package.
-- Corrections to qstat logic: job CPU count and partition time format.
-- Add job_submit/pbs plugin to translate PBS job dependency options to the
extent possible (no support for PBS "before" options) and set some PBS
environment variables.
-- Add spank/pbs plugin to set a bunch of PBS environment variables.
-- Backported sh5util from master to 2.6 as there are some important
bugfixes and the new item extraction feature.
-- select/cons_res - Correct MaxCPUsPerNode partition constraint for CR_Socket.
-- scontrol - for setdebugflags command, avoid parsing "-flagname" as an
scontrol command line option.
-- Fix issue with step accounting if a job is requeued.
-- Close file descriptors on exec of prolog, epilog, etc.
-- Fix issue when a user has held a job and then sets the begin time
into the future.
-- Scontrol - Enable changing a job's stdout file.
-- Fix issues where memory or node count of a srun job is altered while the
srun is pending. The step creation would use the old values and possibly
hang srun since the step wouldn't be able to be created in the modified
-- Add support for new SchedulerParameters value of "bf_max_job_part", the
maximum depth the backfill scheduler should go in any single partition.
-- acct_gather/infiniband plugin - Correct packets_in/out values.
-- BLUEGENE - Don't ignore a conn-type request from the user.
-- BGQ - Force a request on a Q for a MESH to be a TORUS in a dimension that
can only be a TORUS (1).
-- Change max message length from 100MB to 1GB before generating "Insane
message length" error.
-- sched/backfill - Prevent possible memory corruption due to use of
bf_continue option and long running scheduling cycle (pending jobs could
have been cancelled and purged).
-- CRAY - Set AcceleratorAllocation depth correctly for basil 1.3.
-- Created the environment variable SLURM_JOB_NUM_NODES for srun jobs and
updated the srun man page.
-- BLUEGENE/CRAY - Don't set env variables that pertain to a node when Slurm
isn't doing the launching.
-- gres/gpu and gres/mic - Do not treat the existence of an empty gres.conf
file as a fatal error.
-- Fix parsing of the days-0:min time specification, which was not handled
correctly when hours were specified as 0.
-- Subtract the PMII_COMMANDLEN_SIZE in contribs/pmi2/pmi2_api.c to prevent
certain implementations of snprintf() from segfaulting.
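
The time parsing fix in this release concerns specifications of the form
days-0:min. As an illustration of how the day-prefixed forms differ from the
plain ones, the Python sketch below converts a time specification to seconds.
The function parse_time_limit() is hypothetical, the accepted forms are the
ones listed in the sbatch man page, and this is not Slurm's parser.

```python
def parse_time_limit(spec):
    """Convert a time specification to seconds.

    Accepted forms: minutes, minutes:seconds, hours:minutes:seconds,
    days-hours, days-hours:minutes and days-hours:minutes:seconds.
    """
    days = 0
    if "-" in spec:
        day_part, spec = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in spec.split(":")]
        if len(fields) > 3:
            raise ValueError("too many fields")
        # With a day prefix the fields are hours[:minutes[:seconds]].
        # Pad on the right so an hours value of 0 keeps its position.
        hours, minutes, seconds = fields + [0] * (3 - len(fields))
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError("too many fields")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

The point of the sketch is that after a day prefix the first field is always
hours, so "1-0:30" means one day and thirty minutes even though the hours
value is zero.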
* Changes in Slurm 2.6.2
-- Fix issue with reconfig and GrpCPURunMins
-- Fix of wrong node/job state problem after reconfig
-- Allow users who are coordinators to update their own limits in the accounts
they are coordinators over.
-- BackupController - Make sure we have a connection to the DBD first thing
to avoid it thinking we don't have a cluster name.
-- Correct value of min_nodes returned by loading job information to consider
the job's task count and maximum CPUs per node.
-- If running jobacct_gather/none fix issue on unpacking step completion.
-- Reservation with CoreCnt: Avoid possible invalid memory reference.
-- sjstat - Add man page when generating rpms.
-- Make sure GrpCPURunMins is added when creating a user, account or QOS with
-- Fix for invalid memory reference due to multiple free calls caused by
job arrays submitted to multiple partitions.
-- Enforce --ntasks-per-socket=1 job option when allocating by socket.
-- Validate permissions of key directories at slurmctld startup. Report
anything that is world writable.
-- Improve GRES support for CPU topology. Previous logic would pick CPUs then
reject jobs that can not match GRES to the allocated CPUs. New logic first
filters out CPUs that can not use the GRES, next picks CPUs for the job,
and finally picks the GRES that best match those CPUs.
-- Switch/nrt - Prevent invalid memory reference when allocating single adapter
per node of specific adapter type
-- CRAY - Make Slurm work with CLE 5.1.1
-- Fix segfault if submitting to multiple partitions and holding the job.
-- Use MAXPATHLEN instead of the hardcoded value 1024 for maximum file path
-- If OverTimeLimit is defined do not declare failed those jobs that ended
in the OverTimeLimit interval.
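
The startup check added in this release reports key directories that are
world writable. A minimal Python sketch of such a check follows. The function
world_writable() is hypothetical, and the list of directories to inspect is
left to the caller; it is not the list slurmctld uses.

```python
import os
import stat

def world_writable(paths):
    """Return the subset of paths whose mode grants write access to others."""
    flagged = []
    for path in paths:
        mode = os.stat(path).st_mode
        # S_IWOTH is the "write by others" permission bit.
        if mode & stat.S_IWOTH:
            flagged.append(path)
    return flagged
```

An administrator could pass in the directories named in the local
configuration and log whatever the function returns.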
* Changes in Slurm 2.6.1
-- slurmdbd - Allow job derived ec and comments to be modified by non-root
-- Fix issue with job name being truncated to 24 chars when sending a mail
-- Fix minor issues with spec file, missing files and including files
erroneously on a bluegene system.
-- sacct - fix --name and --partition options when using
-- squeue - Remove extra whitespace of default printout.
-- BGQ - added head ppcfloor as an include dir when building.
-- BGQ - Better debug messages in runjob_mux plugin.
-- PMI2 - Updated to build a versioned library.
-- CRAY - Fix srun --mem_bind=local option with launch/aprun.
-- PMI2 Corrected buffer size computation in the pmi2_api.c module.
-- GRES accounting data wrong in database: gres_alloc, gres_req, and gres_used
fields were empty if the job was not started immediately.
-- Fix sbatch and srun task count logic when --ntasks-per-node specified,
but no explicit task count.
-- Corrected the hdf5 profile user guide and the acct_gather.conf
-- IPMI - Fix math bug when getting new wattage.
-- Corrected the AcctGatherProfileType documentation in slurm.conf
-- Corrected the sh5util program to print the header in the csv file
only once, set the debug messages at debug() level, make the argument
check case insensitive and avoid printing duplicate \n.
-- If energy values cannot be collected, send a message to the controller
to drain the node and log an error in the slurmd log file.
-- Handle complete removal of CPURunMins time at the end of the job instead
of at multifactor poll.
-- sview - Add missing debug_flag options.