This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in SLURM 2.3.0.pre7
=============================
-- select/cray: Add support for Accelerator information including model and
memory options.
-- Cray systems: Add support to suspend/resume salloc command to ensure that
aprun does not get initiated when the job is suspended.
-- Cray systems: Modify smap and sview to display all nodes even if multiple
nodes exist at each coordinate.
-- Improve efficiency of select/linear plugin with topology/tree plugin
configured. Patch by Andriy Grytsenko (Massive Solutions Limited).
* Changes in SLURM 2.3.0.pre6
=============================
-- NOTE: THERE HAS BEEN A NEW FIELD ADDED TO THE CONFIGURATION RESPONSE RPC
AS SHOWN BY "SCONTROL SHOW CONFIG". THIS FUNCTION WILL ONLY WORK WHEN THE
SERVER AND CLIENT ARE BOTH RUNNING SLURM VERSION 2.3.0.pre6.
-- Modify job expansion logic to support licenses, generic resources, and
currently running job steps.
-- Added an rpath if using the --with-munge option of configure.
-- Add support for multiple sets of DEFAULT node, partition, and frontend
specifications in slurm.conf so that default values can be changed multiple
times as the configuration file is read.
-- BLUEGENE - Improved logic to place small blocks in free space before freeing
larger blocks.
-- Add optional argument to srun's --kill-on-bad-exit so that user can set
its value to zero and override a SLURM configuration parameter of
KillOnBadExit.
-- Fix bug in GraceTime support for preempted jobs that prevented proper
operation when more than one job was being preempted. Based on patch from
Bill Brophy, Bull.
-- Fix for running sview from a non-bluegene cluster to a bluegene cluster.
Regression from pre5.
-- If job's TMPDIR environment is not set or is not usable, reset to "/tmp".
Patch from Andriy Grytsenko (Massive Solutions Limited).
-- Remove logic for defunct RPC: DBD_GET_JOBS.
-- Propagate DebugFlag changes by scontrol to the plugins.
-- Improve accuracy of REQUEST_JOB_WILL_RUN start time with respect to higher
priority pending jobs.
-- Add -R/--reservation option to squeue command as a job filter.
-- Add scancel support for --clusters option.
-- Note that scontrol and sprio can only support a single cluster at one time.
-- Add support to salloc for a new environment variable SALLOC_KILL_CMD.
-- Add scontrol ability to increment or decrement a job or step time limit.
-- Add support for SLURM_TIME_FORMAT environment variable to control time
stamp output format. Work by Gerrit Renker, CSCS.
-- Fix error handling in mvapich plugin that could cause srun to enter an
infinite loop under rare circumstances.
-- Add support for multiple task plugins. Patch from Andriy Grytsenko (Massive
Solutions Limited).
-- Addition of per-user node/cpu limits for QOS's. Patch from Aaron Knister,
UMBC.
-- Fix logic for multiple job resize operations.
-- BLUEGENE - Many fixes to make things work correctly on an L/P system.
-- Fix bug in layout of job step with --nodelist option plus node count. Old
code could allocate too few nodes.
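
The scontrol entry above on incrementing or decrementing a job or step time
limit can be pictured with a small sketch. It assumes a specification string
in which a leading "+" or "-" means a relative change in minutes and a bare
number means an absolute limit. The function name and the string format are
hypothetical and are not taken from the SLURM sources.

```python
def apply_time_limit_change(current_minutes: int, spec: str) -> int:
    """Return the new time limit in minutes.

    Hypothetical helper: "+N" adds N minutes, "-N" subtracts N minutes,
    and a bare "N" replaces the limit. This mirrors the idea of the
    changelog entry, not SLURM's actual parsing code.
    """
    spec = spec.strip()
    if not spec:
        raise ValueError("empty time limit specification")
    if spec[0] in "+-":
        delta = int(spec[1:])
        if spec[0] == "+":
            new_limit = current_minutes + delta
        else:
            new_limit = current_minutes - delta
    else:
        new_limit = int(spec)
    if new_limit < 0:
        raise ValueError("time limit cannot be negative")
    return new_limit
```

With a current limit of 60 minutes, a specification of "+30" would yield 90
and "-15" would yield 45 under these assumptions.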
* Changes in SLURM 2.3.0.pre5
=============================
-- NOTE: THERE HAS BEEN A NEW FIELD ADDED TO THE JOB STATE FILE. UPGRADES FROM
VERSION 2.3.0-PRE4 WILL RESULT IN LOST JOBS UNLESS THE "orig_dependency"
FIELD IS REMOVED FROM JOB STATE SAVE/RESTORE LOGIC. ON CRAY SYSTEMS A NEW
"confirm_cookie" FIELD WAS ADDED AND HAS THE SAME EFFECT OF DISABLING JOB
STATE RESTORE.
-- BLUEGENE - Improve speed of start up when removing blocks at the beginning.
-- Correct init.d/slurm status to have non-zero exit code if ANY Slurm
daemon that should be running on the node is not running. Patch from Rod
Schulz, Bull.
-- Improve accuracy of response to "srun --test-only jobid=#".
-- Fix bug in front-end configurations which reports job_cnt_comp underflow
errors after slurmctld restarts.
-- Eliminate "error from _trigger_slurmctld_event in backup.c" due to lack of
event triggers.
-- Fix logic in BackupController to properly recover front-end node state and
avoid purging active jobs.
-- Added man pages to html pages and the new cpu_management.html page.
Submitted by Martin Perry / Rod Schultz, Bull.
-- Job dependency information will only show the currently active dependencies
rather than the original dependencies. From Dan Rusak, Bull.
-- Add RPCs to get the SPANK environment variables from the slurmctld daemon.
Patch from Andrej N. Gritsenko.
-- Updated plugins/task/cgroup/task_cgroup_cpuset.c to support newer
HWLOC_API_VERSION.
-- Do not build select/bluegene plugin if C++ compiler is not installed.
-- Add new configure option --with-srun2aprun to build an srun command
which is a wrapper over Cray's aprun command and supports many srun
options. Without this option, the srun command will advise the user
to use the aprun command.
-- Change container ID supported by proctrack plugin from 32-bit to 64-bit.
-- Added contribs/cray/libalps_test_programs.tar.gz with tools to validate
SLURM's logic used to support Cray systems.
-- Create RPM for srun command that is a wrapper for the Cray/ALPS aprun
command. Dependent upon .rpmmacros parameter of "%_with_srun2aprun".
-- Add configuration parameter MaxStepCount to limit effect of bad batch
scripts.
-- Fix for handling a 2.3 system talking to a 2.2 slurmctld.
-- Add contribs/lua/job_submit.license.lua script. Update job_submit and Lua
related documentation.
-- Test if _make_batch_script() is called with a NULL script.
-- Increase hostlist support from 24k to 64k nodes.
-- Renamed the Accounting Storage database's "DerivedExitString" job field to
"Comment". Provided backward compatible support for "DerivedExitString" in
the sacctmgr tool.
-- Added the ability to save the job's comment field to the Accounting
Storage db (to the formerly named, "DerivedExitString" job field). This
behavior is enabled by a new slurm.conf parameter:
AccountingStoreJobComment.
-- Fix srun to handle signals correctly when waiting for a step creation.
-- Preserve the last job ID across slurmctld daemon restarts even if the job
state file can not be fully recovered.
-- Made the hostlist functions able to handle an arbitrary number of
dimensions, regardless of how many dimensions the cluster itself has.
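
The hostlist entries above concern how many node names a single hostlist
expression can describe. As a rough illustration only, the sketch below
expands a one-dimensional expression such as "tux[0-2,5]" into individual
names. The bracketed range syntax is assumed here, the function name is
hypothetical, and the code does not reproduce SLURM's hostlist
implementation or its multi-dimensional handling.

```python
import re
from typing import List


def expand_hostlist(expr: str) -> List[str]:
    """Expand a simple 'prefix[a-b,c]' expression into host names.

    Illustrative sketch: supports one bracket group with comma-separated
    numbers or inclusive ranges, and preserves zero padding.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        # No bracket group: treat the expression as a single host name.
        return [expr]
    prefix, ranges = match.groups()
    hosts = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        high = high or low          # a single number is a range of one
        width = len(low)            # keep leading zeros, e.g. "08"
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts
```

For example, expand_hostlist("n[0-65535]") produces 65,536 names, which is
the scale the 64k-node entry refers to.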
* Changes in SLURM 2.3.0.pre4
=============================
-- Add GraceTime to Partition and QOS data structures. Preempted jobs will be
given this time interval before termination. Work by Bill Brophy, Bull.
-- Add the ability for scontrol and sview to modify slurmctld DebugFlags
values.
-- Various Cray-specific patches:
- Fix a bug in distinguishing XT from XE.
- Avoids problems with empty nodenames on Cray.
- Check whether ALPS is hanging on to nodes, which happens if ALPS has not
yet cleaned up the node partition.
- Stops select/cray from clobbering node_ptr->reason.
- Perform 'safe' release of ALPS reservations using inventory and apkill.
- Compile-time sanity check for the apbasil and apkill files.
- Changes error handling in do_basil_release() (called by
select_g_job_fini()).
- Warn that salloc --no-shell option is not supported on Cray systems.
-- Add a reservation flag of "License_Only". If set, then jobs using the
reservation may use the licenses associated with it plus any compute nodes.
Otherwise the job is limited to the compute nodes associated with the
reservation.
-- Change slurm.conf node configuration parameter from "Procs" to "CPUs".
Both parameters will be supported for now.
-- BLUEGENE - fix for when user requests only midplane names with no count at
job submission time to process the node count correctly.
-- Fix job step resource allocation problem when both node and tasks counts
are specified. New logic selects nodes with larger CPU counts as needed.
-- BGQ - Make srun wrap runjob (still under construction, but works for most
cases).
-- Permit a job's QOS and Comment field to both change in a single RPC. This
was previously disabled since Moab stored the QOS within the Comment field.
-- Add support for jobs to expand in size. Submit additional batch job with
the option "--dependency=expand:<jobid>". See web page "faq.html#job_size"
for details. Restrictions to be removed in the future.
-- Added --with-alps-emulation to configure, and also an optional cray.conf
to set up the ALPS location and database information.
-- Modify PMI data types from 16-bits to 32-bits in order to support MPICH2
jobs with more than 65,536 tasks. Patch from Hongjia Cao, NUDT.
-- Set slurmd's soft process CPU limit equal to its hard limit and notify the
user if the limit is not infinite.
-- Added proctrack/cgroup and task/cgroup plugins from Matthieu Hautreux, CEA.
-- Fix slurmctld restart logic that could leave nodes in UNKNOWN state for a
longer time than necessary after restart.
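
The PMI entry above widens data types from 16 bits to 32 bits so that jobs
with more than 65,536 tasks can be represented. The sketch below shows only
the arithmetic behind that limit by emulating an unsigned field of a given
width with a bit mask. The helper name is hypothetical, and the code makes no
claim about how the earlier PMI code actually behaved with large task counts.

```python
def store_unsigned(value: int, bits: int) -> int:
    """Emulate storing a non-negative integer in an unsigned field.

    Only the low `bits` bits survive, which is what happens when a value
    is truncated to a fixed-width unsigned integer.
    """
    return value & ((1 << bits) - 1)


task_count = 70_000
print(store_unsigned(task_count, 16))  # 4464: the count does not fit
print(store_unsigned(task_count, 32))  # 70000: the count is preserved
```

A 16-bit unsigned field holds values from 0 through 65,535, so any larger
count needs a wider type.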
* Changes in SLURM 2.3.0.pre3
=============================
-- BGQ - Appears to work correctly in emulation mode; no sub-blocks just yet.
-- Minor typos fixed.
-- Various bug fixes for Cray systems.
-- Fix bug in which setting a compute node to idle state failed to set the
system's up_node_bitmap.
-- BLUEGENE - code reorder
-- BLUEGENE - Now only one select plugin for all Bluegene systems.
-- Modify srun to set the SLURM_JOB_NAME environment variable when srun is
used to create a new job allocation. Not set when srun is used to create a
job step within an existing job allocation.
-- Modify init.d/slurm script to start multiple slurmd daemons per compute
node if so configured. Patch from Matthieu Hautreux, CEA.
-- Change license data structure counters from uint16_t to uint32_t to support
larger license counts.
* Changes in SLURM 2.3.0.pre2
=============================
-- Log a job's requeue or cancellation due to preemption to that job's stderr:
"*** JOB 65547 CANCELLED AT 2011-01-21T12:59:33 DUE TO PREEMPTION ***".
-- Added new job termination state of JOB_PREEMPTED, "PR" or "PREEMPTED" to
indicate job termination was due to preemption.
-- Optimize advanced reservations resource selection for computer topology.
The logic has been added to select/linear and select/cons_res, but will
not be enabled until the other select plugins are modified.
-- Disable deletion of partitions that have unfinished jobs (pending,
running or suspended states). Patch from Martin Perry, Bull.
-- In sview, disable the sorting of node records by name at startup for
clusters over 1000 nodes. Users can enable this by selecting the "Name"
tab. This change dramatically improves scalability of sview.
-- Report error when trying to change a node's state from scontrol for Cray
systems.
-- Do not attempt to read the batch script for non-batch jobs. This patch
eliminates some inappropriate error messages.
-- Preserve NodeHostName when reordering nodes due to system topology.
-- On Cray/ALPS systems do node inventory before scheduling jobs.
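
The preemption message quoted in the first entry of this section has a
regular shape, so a script that scans job stderr files could pick it out with
a pattern like the one sketched below. The pattern is derived only from the
single example line shown above. Accepting any uppercase word in place of
CANCELLED is an assumption, since the wording used for a requeue is not given
here, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Built from the one example in the changelog; the action word is assumed
# to be a single uppercase token such as CANCELLED.
PREEMPT_RE = re.compile(
    r"\*\*\* JOB (\d+) ([A-Z]+) AT (\S+) DUE TO PREEMPTION \*\*\*"
)


def parse_preemption_line(line: str):
    """Return (job_id, action, timestamp) or None if the line does not match."""
    match = PREEMPT_RE.search(line)
    if match is None:
        return None
    job_id, action, stamp = match.groups()
    return int(job_id), action, datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
```

Applied to the example line above, the function returns the job ID 65547,
the action CANCELLED, and the timestamp 2011-01-21 12:59:33.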