This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in SLURM 2.3.0.pre5
=============================
-- BLUEGENE - Improve speed of start up when removing blocks at the beginning.
-- Correct init.d/slurm status to have non-zero exit code if ANY Slurm
daemon that should be running on the node is not running. Patch from Rod
Schulz, Bull.
-- Improve accuracy of response to "srun --test-only jobid=#".
-- Correct logic to properly support --ntasks-per-node option in the
select/cons_res plugin. Patch from Rod Schulz, Bull.
-- Fix bug in select/cons_res with respect to generic resource (gres)
scheduling which prevented some jobs from starting as soon as possible.
-- Fix memory leak in select/cons_res when backfill scheduling generic
resources (gres).
-- Fix for nodes configured with more resources than they physically have
when using task/affinity.
-- Fix slurmctld to correctly pack 2.1 step information. (Only needed if
a 2.1 client is talking to a 2.2 slurmctld.)
-- Fix bug in front-end configurations which reports job_cnt_comp underflow
errors after slurmctld restarts.
-- Eliminate "error from _trigger_slurmctld_event in backup.c" due to lack of
event triggers.
-- Fix logic in BackupController to properly recover front-end node state and
avoid purging active jobs.
-- Added man pages to html pages and the new cpu_management.html page.
Submitted by Martin Perry / Rod Schultz, Bull.
* Changes in SLURM 2.3.0.pre4
=============================
-- Add GraceTime to Partition and QOS data structures. Preempted jobs will be
given this time interval before termination. Work by Bill Brophy, Bull.
-- Add the ability for scontrol and sview to modify slurmctld DebugFlags
values.
-- Various Cray-specific patches:
- Fix a bug in distinguishing XT from XE.
- Avoids problems with empty nodenames on Cray.
- Check whether ALPS is hanging on to nodes, which happens if ALPS has not
yet cleaned up the node partition.
- Stops select/cray from clobbering node_ptr->reason.
- Perform 'safe' release of ALPS reservations using inventory and apkill.
- Compile-time sanity check for the apbasil and apkill files.
- Changes error handling in do_basil_release() (called by
select_g_job_fini()).
- Warn that salloc --no-shell option is not supported on Cray systems.
-- Add a reservation flag of "License_Only". If set, then jobs using the
reservation may use the licenses associated with it plus any compute nodes.
Otherwise the job is limited to the compute nodes associated with the
reservation.
-- Change slurm.conf node configuration parameter from "Procs" to "CPUs".
Both parameters will be supported for now.
-- BLUEGENE - Fix to process the node count correctly when a user requests
only midplane names with no count at job submission time.
-- Fix job step resource allocation problem when both node and task counts
are specified. New logic selects nodes with larger CPU counts as needed.
-- BGQ - Make srun wrap runjob (still under construction, but works for
most cases).
-- Permit a job's QOS and Comment field to both change in a single RPC. This
was previously disabled since Moab stored the QOS within the Comment field.
-- Add support for jobs to expand in size. Submit additional batch job with
the option "--dependency=expand:<jobid>". See web page "faq.html#job_size"
for details. Restrictions to be removed in the future.
-- Added --with-alps-emulation to configure, and also an optional cray.conf
to set up the ALPS location and database information.
-- Modify PMI data types from 16-bits to 32-bits in order to support MPICH2
jobs with more than 65,536 tasks. Patch from Hongjia Cao, NUDT.
-- Set slurmd's soft process CPU limit equal to its hard limit and notify the
user if the limit is not infinite.
-- Added proctrack/cgroup and task/cgroup plugins from Matthieu Hautreux, CEA.
-- Fix slurmctld restart logic that could leave nodes in UNKNOWN state for a
longer time than necessary after restart.
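
The PMI entry above (data types widened from 16 bits to 32 bits) can be
illustrated with a short sketch. This is not SLURM code: it only uses
Python's struct module to show why an unsigned 16-bit field cannot carry
larger values while an unsigned 32-bit field can. The helper name fits()
and the sample value are illustrative assumptions.

```python
import struct


def fits(fmt: str, value: int) -> bool:
    """Return True if `value` can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        # struct refuses values outside the range of the chosen field width.
        return False


task_count = 70000  # hypothetical value, larger than a 16-bit field allows

print(fits("!H", task_count))  # "!H" = unsigned 16-bit, network byte order
print(fits("!I", task_count))  # "!I" = unsigned 32-bit, network byte order
```

The first call reports that the value does not fit and the second reports
that it does, which is the practical effect of the wider data type.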
* Changes in SLURM 2.3.0.pre3
=============================
-- BGQ - Appears to work correctly in emulation mode, no sub blocks just yet.
-- Minor typos fixed.
-- Various bug fixes for Cray systems.
-- Fix bug that when setting a compute node to idle state, it was failing to
set the system's up_node_bitmap.
-- BLUEGENE - code reorder
-- BLUEGENE - Now only one select plugin for all Bluegene systems.
-- Modify srun to set the SLURM_JOB_NAME environment variable when srun is
used to create a new job allocation. Not set when srun is used to create a
job step within an existing job allocation.
-- Modify init.d/slurm script to start multiple slurmd daemons per compute
node if so configured. Patch from Matthieu Hautreux, CEA.
-- Change license data structure counters from uint16_t to uint32_t to support
larger license counts.
* Changes in SLURM 2.3.0.pre2
=============================
-- Log a job's requeue or cancellation due to preemption to that job's stderr:
"*** JOB 65547 CANCELLED AT 2011-01-21T12:59:33 DUE TO PREEMPTION ***".
-- Added new job termination state of JOB_PREEMPTED, "PR" or "PREEMPTED" to
indicate job termination was due to preemption.
-- Optimize advanced reservations resource selection for computer topology.
The logic has been added to select/linear and select/cons_res, but will
not be enabled until the other select plugins are modified.
-- Disable deletion of partitions that have unfinished jobs (pending,
running or suspended states). Patch from Martin Perry, BULL.
-- In sview, disable the sorting of node records by name at startup for
clusters over 1000 nodes. Users can enable this by selecting the "Name"
tab. This change dramatically improves scalability of sview.
-- Report error when trying to change a node's state from scontrol on Cray
systems.
-- Do not attempt to read the batch script for non-batch jobs. This patch
eliminates some inappropriate error messages.
-- Preserve NodeHostName when reordering nodes due to system topology.
-- On Cray/ALPS systems do node inventory before scheduling jobs.
-- Disable some salloc options on Cray systems.
-- Disable scontrol's wait_job command on Cray systems.
-- Disable srun command on native Cray/ALPS systems.
-- Updated configure option "--enable-cray-emulation" (still under
development) to emulate a Cray XT/XE system, and auto-detect real
Cray XT/XE systems (removed the no longer needed --enable-cray configure
option). Building on native Cray systems requires the
cray-MySQL-devel-enterprise rpm and expat XML parser library/headers.
* Changes in SLURM 2.3.0.pre1
=============================
-- When a slurmctld closes its connection to the database, its registered
host and port are now removed.
-- Added slurmdbd.conf flag TrackSlurmctldDown. If set, idle resources on a
cluster are marked as down when a slurmctld disconnects or is no longer
reachable.
-- Added support for more than one front-end node to run slurmd on
architectures where the slurmd does not execute on the compute nodes
(e.g. BlueGene). New configuration parameters FrontendNode and FrontendAddr
added. See "man slurm.conf" for more information.
-- With the scontrol show job command when using the --details option, show
a batch job's script.
-- Add ability to create reservations or partitions and submit batch jobs
using sview. Also add the ability to delete reservations and partitions.
-- Added new configuration parameter MaxJobId. Once reached, restart job ID
values at FirstJobId.
-- When restarting slurmctld with priority/basic, increment all job priorities
so the highest job priority becomes TOP_PRIORITY.
-- For batch jobs for which the Prolog fails, substitute the job ID for any
"%j" in the job's output or error file specification.
-- Add licenses field to the sview reservation information.
-- BLUEGENE - Fix handling of an extremely overloaded Dynamic system when
starting jobs on overlapping blocks. Previously the job would be
requeued (happens very rarely).
-- In accounting_storage/filetxt plugin, substitute spaces within job names,
step names, and account names with an underscore to ensure proper parsing.
-- When building contribs/perlapi ignore both INSTALL_BASE and PERL_MM_OPT.
Use PREFIX instead to avoid build errors from multiple installation
specifications.
-- Add job_submit/cnode plugin to support resource reservations of less than
a full midplane on BlueGene computers. Treat cnodes as licenses which can
be reserved and are consumed by jobs. This reservation mechanism for less
than an entire midplane is still under development.
-- Clear a job's "reason" field when a held job is released.
-- When releasing a held job, calculate a new priority for it rather than
just setting the priority to 1.
-- Fix for sview started on a non-bluegene system to pick colors correctly
when talking to a real bluegene system.
-- Improve sched/backfill's expected start time calculation.
-- Prevent abort of sacctmgr for dump command with invalid (or no) filename.
-- Improve handling of job updates when using limits in accounting, and
updating jobs as a non-admin user.
-- Fix for "squeue --states=all" option. Bug would show no jobs.
-- Schedule jobs with reservations before those without reservations.
-- Fix squeue/scancel to query correctly against accounts of different case.
-- Abort an srun command when its associated job gets aborted due to a
dependency that cannot be satisfied.
-- In jobcomp plugins, report start time of zero if pending job is cancelled.
Previously the expected start time may have been reported.
-- Fixed sacctmgr man to state correct variables.
-- Select nodes based upon their Weight when job allocation requests include
a constraint field with a count (e.g. "srun --constraint=gpu*2 -N4 a.out").
-- Add support for user names that are entirely numeric and do not treat them
as UID values. Patch from Dennis Leepow.
-- Pack and unpack double values properly when the value is negative. Patch
from Dennis Leepow.
-- Do not reset a job's priority when requeued or suspended.
-- Fix problem that could let new jobs start on a node in DRAINED state.
-- Fix cosmetic sacctmgr issue where if the user you are trying to add
doesn't exist in the /etc/passwd file and the account you are trying
to add them to doesn't exist it would print (null) instead of the bad
account name.
-- Fix associations/qos so that a previously deleted object is cleared of
all old limits when it is added back.
-- BLUEGENE - Added back a lock when creating dynamic blocks to be more thread
safe on larger systems with heavy load.
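
The MaxJobId entry above (restart job ID values at FirstJobId once MaxJobId
is reached) describes a wrap-around counter. A minimal sketch of that idea
follows. It is not SLURM's implementation: the function name next_job_id()
and the sample values are hypothetical, the sketch assumes MaxJobId is the
last usable ID, and it ignores checks a real controller would need, such as
skipping IDs that are still in use.

```python
def next_job_id(current: int, first_job_id: int, max_job_id: int) -> int:
    """Return the job ID that follows `current`, wrapping at max_job_id.

    Assumption: max_job_id is treated as the last usable ID (inclusive).
    """
    if current >= max_job_id:
        return first_job_id
    return current + 1


# Walk a few IDs across the wrap point with small, made-up limits.
ids = []
job_id = 8
for _ in range(5):
    ids.append(job_id)
    job_id = next_job_id(job_id, first_job_id=1, max_job_id=10)

print(ids)  # [8, 9, 10, 1, 2]
```

With these made-up limits the sequence climbs to 10 and then restarts at 1.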
* Changes in SLURM 2.2.3
========================
-- Update srun, salloc, and sbatch man page description of --distribution
option. Patches from Rod Schulz, Bull.
-- Applied patch from Martin Perry to fix "Incorrect results for task/affinity
block second distribution and cpus-per-task > 1" bug.
-- Avoid setting a job's eligible time while held (priority == 0).
-- Substantial performance improvement to backfill scheduling. Patch from
Bjorn-Helge Mevik, University of Oslo.