This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
-- Added squeue format option of "%X" (core specialization count).
-- Added core specialization web page (just a start for now).
-- Added the SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID
in the epilog slurmctld environment.
-- Fix bug in job step allocation failing due to memory limit.
-- Modify the pbsnodes script to reflect its output on a TORQUE system.
-- Add ability to clear a node's DRAIN flag using scontrol or sview by setting
its state to "UNDRAIN". The node's base state (e.g. "DOWN" or "IDLE") will
not be changed.
-- Modify the output of 'scontrol show partition' by displaying
DefMemPerCPU=UNLIMITED and MaxMemPerCPU=UNLIMITED when these limits are
configured as 0.
-- mpirun-mic - Major re-write of the command wrapper for Xeon Phi use.
-- Add new configuration parameter of AuthInfo to specify port used by
authentication plugin.
-- Corrected slurmstepd ident name when logging to syslog.
-- Fixed sh5util loop when there are no node-step files.
-- Add SLURM_CLUSTER_NAME to environment variables passed to PrologSlurmctld,
Prolog, EpilogSlurmctld, and Epilog.
-- Add the idea of running a prolog right when an allocation happens
instead of when running on the node for the first time.
-- If a user runs 'scontrol reconfig' after hostnames or the host count have
changed, the slurmctld throws a fatal error.
-- gres.conf - Add "NodeName" specification so that a single gres.conf file
can be used for a heterogeneous cluster.
-- Add flag to accounting RPC to indicate if job data is packed or not.
-- After all srun tasks have terminated on a node close the stdout/stderr
channel with the slurmstepd on that node.
-- In case of i/o error with slurmstepd log an error message and abort the
job.
-- Add --test-only option to sbatch command to validate the script and options.
The response includes expected start time and resources to be allocated.
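The SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID and SLURM_CLUSTER_NAME entries
above add variables to the environment of prolog/epilog programs. The
following is a minimal sketch, not part of Slurm, of how such a program might
read them. The describe_job() helper and its output format are hypothetical;
only the variable names come from the entries above.

```python
import os


def describe_job(env):
    """Summarize a job from Slurm-provided environment variables.

    `env` is any mapping of variable names to strings (e.g. os.environ).
    The summary format is illustrative only.
    """
    cluster = env.get("SLURM_CLUSTER_NAME", "unknown")
    array_job = env.get("SLURM_ARRAY_JOB_ID")
    array_task = env.get("SLURM_ARRAY_TASK_ID")
    # Both array variables are expected only for job array tasks.
    if array_job is not None and array_task is not None:
        return f"cluster={cluster} array_job={array_job} task={array_task}"
    return f"cluster={cluster} (not a job array task)"


if __name__ == "__main__":
    print(describe_job(os.environ))
```

Passing the environment as an argument keeps the helper easy to exercise
outside of a running Slurm installation.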
==============================
-- Remove the ThreadID documentation from slurm.conf. This functionality has
been obsoleted by the LogTimeFormat.
-- Sched plugins - rename global and plugin functions names for consistency
with other plugin types.
-- BGQ - Added RebootQOSList option to bluegene.conf to allow an implicit
reboot of a block if only jobs in the list are running on it when cnodes
go into a failure state.
-- Correct task count of pending job steps.
-- Improve limit enforcement for jobs, set RLIMIT_RSS, RLIMIT_AS and/or
RLIMIT_DATA to enforce memory limit.
-- Pending job steps will have step_id of INFINITE rather than NO_VAL and
will be reported as "TBD" by scontrol and squeue commands.
-- Add logic so PMI_Abort or PMI2_Abort can propagate an exit code.
-- Added SlurmdPlugstack configuration parameter.
-- Added PriorityFlag DEPTH_OBLIVIOUS to have the depth of an association
not affect its priority.
-- Multi-thread the sinfo command (one thread per partition).
-- Added sgather tool to gather files from a job's compute nodes into a
central location.
-- Added configuration parameter FairShareDampeningFactor to offer a greater
priority range based upon utilization.
-- Change MaxArraySize and job's array_task_id from 16-bit to 32-bit field.
Additional Slurm enhancements are required to support larger job arrays.
-- Added -S/--core-spec option to salloc, sbatch and srun commands to reserve
specialized cores for system use. Modify scontrol and sview to get/set
the new field. No enforcement exists yet for these new options.
struct job_info / slurm_job_info_t: Added core_spec
struct job_descriptor / job_desc_msg_t: Added core_spec
-- Do not set SLURM_NODEID environment variable on front-end systems.
-- Convert bitmap functions to use int32_t instead of int in data structures
and function arguments. This is to reliably enable use of bitmaps containing
up to 4 billion elements. Several data structures containing index values
were also changed from data type int to int32_t:
- Struct job_info / slurm_job_info_t: Changed exc_node_inx, node_inx, and
req_node_inx from type int to type int32_t
- job_step_info_t: Changed node_inx from type int to type int32_t
- Struct partition_info / partition_info_t: Changed node_inx from type int
to type int32_t
- block_job_info_t: Changed cnode_inx from type int to type int32_t
- block_info_t: Changed ionode_inx and mp_inx from type int to type int32_t
- Struct reserve_info / reserve_info_t: Changed node_inx from type int to
type int32_t
-- Modify qsub wrapper output to match torque command output, just print the
job ID rather than "Submitted batch job #"
-- Change Slurm error string for ESLURM_MISSING_TIME_LIMIT from
"Missing time limit" to
"Time limit specification required, but not provided"
-- Change salloc job_allocate error message header from
"Failed to allocate resources" to
"Job submit/allocate failed"
-- Modify slurmctld message retry logic to support Cray cold-standby SDB.
-- Added "JobAcctGatherParams" configuration parameter. Value of "NoShare"
disables accounting for shared memory.
-- Added fields to "scontrol show job" output: boards_per_node,
sockets_per_board, ntasks_per_node, ntasks_per_board, ntasks_per_socket,
ntasks_per_core, and nice.
-- Add squeue output format options for job command and working directory
(%o and %Z respectively).
-- Add stdin/out/err to sview job output.
-- Add new job_state of JOB_BOOT_FAIL for job terminations due to failure to
boot its allocated nodes or BlueGene block.
-- CRAY - Add SelectTypeParameters NHC_NO_STEPS and NHC_NO which will disable
the node health check script for steps and allocations respectively.
-- Reservation with CoreCnt: Avoid possible invalid memory reference.
-- Add new error code for attempt to create a reservation with duplicate name.
-- Validate that a hostlist file contains text (i.e. not a binary).
-- switch/generic - propagate switch information from srun down to slurmd and
slurmstepd.
-- CRAY - Do not package Slurm's libpmi or libpmi2 libraries. The Cray version
of those libraries must be used.
-- Added a new option to the scontrol command to view licenses that are
configured, in use and available: 'scontrol show licenses'.
-- MySQL - Made Slurm compatible with MySQL 5.6.
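The int32_t entries above concern index arrays such as node_inx. As an
illustration only, the sketch below decodes such an array in Python,
assuming the layout described in the slurm.h comments: consecutive
(start, end) index pairs terminated by -1. The expand_node_inx() name is
hypothetical and this is not Slurm code.

```python
def expand_node_inx(node_inx):
    """Expand [start1, end1, start2, end2, ..., -1] into a flat index list.

    Assumes inclusive (start, end) pairs and a -1 terminator; stops early
    if the terminator is reached or a trailing pair is incomplete.
    """
    indices = []
    for i in range(0, len(node_inx) - 1, 2):
        start = node_inx[i]
        if start == -1:
            break
        end = node_inx[i + 1]
        indices.extend(range(start, end + 1))
    return indices


if __name__ == "__main__":
    # Nodes 0 through 3 plus node 8.
    print(expand_node_inx([0, 3, 8, 8, -1]))
```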
==============================
-- Add task pointer to the task_post_term() function in task plugins. The
terminating task's PID is available in task->pid.
-- Defer sending SIGKILL signal to processes while core dump in progress.
-- Added JobContainerPlugin configuration parameter and plugin infrastructure.
-- Added partition configuration parameters AllowAccounts, AllowQOS,
DenyAccounts and DenyQOS.
-- The rpmbuild option for a cray system with ALPS has changed from
%_with_cray to %_with_cray_alps.
-- The log file timestamp format can now be selected at runtime via the
LogTimeFormat configuration option. See the slurm.conf and slurmdbd.conf
man pages for details.
-- Added switch/generic plugin to convey a job's network topology.
-- BLUEGENE - If block is in 'D' state or has more cnodes in error than
MaxBlockInError set the job wait reason appropriately.
-- API use: Generate an error return rather than fatal error and exit if the
configuration file is absent or invalid. This will permit Slurm APIs to be
more reliably used by other programs.
-- Add support for load-based scheduling, allocate jobs to nodes with the
largest number of available CPUs. Added SchedulingParameters parameter of
"CR_LLN" and partition parameter of "LLN=yes|no".
-- Added job_info() and step_info() functions to the gres plugins to extract
plugin specific fields from the job's or step's GRES data structure.
-- Added sbatch --signal option of "B:" to signal the batch shell rather than
only the spawned job steps.
-- Added sinfo and squeue format option of "%all" to print all fields available
for the data type with a vertical bar separating each field.
-- Add mechanism for job_submit plugin to generate error message for srun,
salloc or sbatch to stderr. New argument added to job_submit function in
the plugin.
-- Add StdIn, StdOut, and StdErr paths to job information dumped with
"scontrol show job".
-- Permit Slurm administrator to submit a batch job as any user.
-- Set a job's RLIMIT_AS limit based upon its memory limit and VsizeFactor
configuration value.
-- Remove Postgres plugins.
-- Make jobacct_gather/cgroup work correctly and also make all jobacct_gather
plugins more maintainable.
-- Proctrack/pgid - Add support for proctrack_p_plugin_get_pids() function.
-- Sched/backfill - Change default max_job_bf parameter from 50 to 100.
-- Added -I|--item-extract option to sh5util to extract data item from series.
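The "%all" format option above prints every field separated by a vertical
bar, which is convenient for scripts. The following is a minimal parsing
sketch, assuming the first output line is a header row (i.e. the header was
not suppressed). The parse_bar_separated() helper and the sample column
names are hypothetical, not actual sinfo or squeue output.

```python
def parse_bar_separated(text):
    """Parse '|'-separated output into a list of dicts keyed by header name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = [h.strip() for h in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        fields = [f.strip() for f in ln.split("|")]
        # zip() silently drops extra fields if a row is longer than the header.
        rows.append(dict(zip(header, fields)))
    return rows


if __name__ == "__main__":
    # Made-up sample text; real output has many more columns.
    sample = "JOBID|USER|STATE\n101|alice|RUNNING\n102|bob|PENDING\n"
    for row in parse_bar_separated(sample):
        print(row["JOBID"], row["STATE"])
```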
* Changes in Slurm 2.6.6
========================
* Changes in Slurm 2.6.5
========================
-- Correction to hostlist parsing bug introduced in v2.6.4 for hostlists with
more than one numeric range in brackets (e.g. "rack[0-3]_blade[0-63]").
-- Add notification if using proctrack/cgroup and task/cgroup when oom hits.
-- Corrections to advanced reservation logic with overlapping jobs.
-- job_submit/lua - add cpus_per_task field to those available.
-- Add cpu_load to the node information available using the Perl API.
-- Correct a job's GRES allocation data in accounting records for non-Cray
systems.
-- Substantial performance improvement for systems with Shared=YES or FORCE
and large numbers of running jobs (replace bubble sort with quick sort).
-- proctrack/cgroup - Add locking to prevent race condition where one job step
is ending for a user or job at the same time another job step is starting
and the user or job container is deleted from under the starting job step.
-- Fixed sh5util loop when there are no node-step files.
-- Fix race condition on batch job termination that could result in a job exit
code of 0xfffffffe if the slurmd on node zero registers its active jobs at
the same time that slurmstepd is recording the job's exit code.
-- Correct logic returning remaining job dependencies in job information
reported by scontrol and squeue. Eliminates vestigial descriptors with
no job ID values (e.g. "afterany").
-- Improve performance of REQUEST_JOB_INFO_SINGLE RPC by removing unnecessary
locks and using a hash function to find the desired job.
-- jobcomp/filetxt - Reopen the file when slurmctld daemon is reconfigured
or gets SIGHUP.
-- Remove notice of CVE with very old/deprecated versions of Slurm in
news.html.
-- Fix if hwloc_get_nbobjs_by_type() returns zero core count (set to 1).
-- Added ApbasilTimeout parameter to the cray.conf configuration file.
-- Handle in the API if parts of the node structure are NULL.
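The hostlist entry above refers to expressions with more than one bracketed
numeric range, such as "rack[0-3]_blade[0-63]". To illustrate what such an
expression denotes, here is a small expansion sketch. It is not Slurm's
hostlist implementation: expand_hostlist() is a hypothetical helper that
handles a single expression only (no comma-separated list of expressions)
and assumes zero padding follows the width of each range's lower bound.

```python
import itertools
import re

# Matches one bracketed range specification, capturing its contents.
_BRACKET = re.compile(r"\[([^\]]+)\]")


def _expand_ranges(spec):
    """Expand '0-3,7' into ['0', '1', '2', '3', '7'], keeping zero padding."""
    values = []
    for part in spec.split(","):
        lo, sep, hi = part.partition("-")
        if sep:
            width = len(lo)  # '01-10' keeps two digits; '0-63' adds no padding
            values.extend(str(n).zfill(width) for n in range(int(lo), int(hi) + 1))
        else:
            values.append(part)
    return values


def expand_hostlist(expr):
    """Expand one hostlist expression with any number of bracketed ranges."""
    # With a capture group, re.split alternates literal text and range specs.
    pieces = _BRACKET.split(expr)
    choices = [
        _expand_ranges(piece) if i % 2 else [piece]
        for i, piece in enumerate(pieces)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    hosts = expand_hostlist("rack[0-3]_blade[0-63]")
    print(len(hosts), hosts[0], hosts[-1])
```

The cross product of the two ranges yields 4 x 64 = 256 host names, which
is the case the corrected parser has to handle.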