Newer
Older
This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and admins.
-- Fix issue where not enforcing QOS but a partition either allows or denies
them.
-- CRAY - Make switch/cray default when running on a Cray natively.
-- CRAY - Make job_container/cncu default when running on a Cray natively.
-- Disable job time limit change if it's preemption is in progress.
-- Correct logic to properly enforce job preemption GraceTime.
-- Fix sinfo -R to print each down/drained node once, rather than once per
partition.
* Changes in Slurm 14.03.3-2
============================
* Changes in Slurm 14.03.3
==========================
-- Correction to default batch output file name. In version 14.03.2 was using
"slurm_<jobid>_4294967294.out" due to error in job array logic.
-- In slurm.spec file, replace "Requires cray-MySQL-devel-enterprise" with
"Requires mysql-devel".
-- Fix race condition if PrologFlags=Alloc,NoHold is used.
-- Cray - Make NPC only limit running other NPC jobs on shared blades instead
of limited non NPC jobs.
-- Fix for sbatch #PBS -m (mail) option parsing.
-- Fix job dependency bug. Jobs dependent upon multiple other jobs may start
prematurely.
-- Set "Reason" field for all elements of a job array on short-circuited
scheduling for job arrays.
-- Allow -D option of salloc/srun/sbatch to specify relative path.
-- Added SchedulerParameter of batch_sched_delay to permit many batch jobs
to be submitted between each scheduling attempt to reduce overhead of
scheduling logic.
-- Added job reason of "SchedTimeout" if the scheduler was not able to reach
the job to attempt scheduling it.
-- Add job's exit state and exit code to email message.
-- scontrol hold/release accepts job name option (in addition to job ID).
-- Handle when trying to cancel a step that hasn't started yet better.
-- Add --priority option to salloc, sbatch and srun commands.
-- Honor partition priorities over job priorities.
-- Fix sacct -c when using jobcomp/filetxt to read newer variables
-- Fix segfault of sacct -c if spaces are in the variables.
-- Release held job only with "scontrol release <jobid>" and not by resetting
the job's priority. This is needed to support job arrays better.
-- Correct squeue command not to merge jobs with state pending and completing
together.
-- Fix issue where user is requesting --acctg-freq=0 and no memory limits.
-- Fix issue with GrpCPURunMins if a job's timelimit is altered while the job
is running.
-- Temporary fix for handling our typemap for the perl api with newer perl.
-- Fix allowgroup on bad group seg fault with the controller.
-- Handle node ranges better when dealing with accounting max node limits.
-- Update configure to set correct version without having to run autogen.sh
* Changes in Slurm 14.03.1
==========================
-- Add support for job std_in, std_out and std_err fields in Perl API.
-- Add "Scheduling Configuration Guide" web page.
-- BGQ - fix check for jobinfo when it is NULL
-- Do not check cleaning on "pending" steps.
-- task/cgroup plugin - Fix for building on older hwloc (v1.0.2).
-- In the PMI implementation by default don't check for duplicate keys.
Set the SLURM_PMI_KVS_DUP_KEYS if you want the code to check for
duplicate keys.
-- Permit user root to propagate resource limits higher than the hard limit
slurmd has on that compute node has (i.e. raise both current and maximum
limits).
-- Fix issue with license used count when doing an scontrol reconfig.
-- Fix the PMI iterator to not report duplicated keys.
-- Fix issue with sinfo when -o is used without the %P option.
-- Rather than immediately invoking an execution of the scheduling logic on
every event type that can enable the execution of a new job, queue its
execution. This permits faster execution of some operations, such as
modifying large counts of jobs, by executing the scheduling logic less
frequently, but still in a timely fashion.
-- If the environment variable is greater than MAX_ENV_STRLEN don't
set it in the job env otherwise the exec() fails.
-- Optimize scontrol hold/release logic for job arrays.
-- Modify srun to report an exit code of zero rather than nine if some tasks
exit with a return code of zero and others are killed with SIGKILL. Only an
exit code of zero did this.
-- Avoid slurmctld crash getting job info if detail_ptr is NULL.
-- Fix sacctmgr add user where both defaultaccount and accounts are specified.
-- Added SchedulerParameters option of max_sched_time to limit how long the
main scheduling loop can execute for.
-- Added SchedulerParameters option of sched_interval to control how frequently
the main scheduling loop will execute.
-- Move start time of main scheduling loop timeout after locks are aquired.
-- Add squeue job format option of "%y" to print a job's nice value.
-- Update scontrol update jobID logic to operate on entire job arrays.
-- Fix PrologFlags=Alloc to run the prolog on each of the nodes in the
allocation instead of just the first.
-- Fix race condition if a step is starting while the slurmd is being
restarted.
-- Make sure a job's prolog has ran before starting a step.
-- BGQ - Fix invalid memory read when using DefaultConnType in the
bluegene.conf
-- Make sure we send node state to the DBD on clean start of controller.
-- Fix some sinfo and squeue sorting anomalies due to differences in data
types.
-- Only send message back to slurmctld when PrologFlags=Alloc is used on a
Cray/ALPS system, otherwise use the slurmd to wait on the prolog to gate
the start of the step.
-- Remove need to check PrologFlags=Alloc in slurmd since we can tell if prolog
has ran yet or not.
-- Fix squeue to use a correct macro to check job state.
-- BGQ - Fix incorrect logic issues if MaxBlockInError=0 in the bluegene.conf.
-- priority/basic - Insure job priorities continue to decrease when jobs are
submitted with the --nice option.
-- Make the PrologFlag=Alloc work on batch scripts
-- Make PrologFlag=NoHold (automatically sets PrologFlag=Alloc) not hold in
salloc/srun, instead wait in the slurmd when a step hits a node and the
prolog is still running.
-- Added --cpu-freq=highm1 (high minus one) option.
-- Expand StdIn/Out/Err string length output by "scontrol show job" from 128
to 1024 bytes.
-- squeue %F format will now print the job ID for non-array jobs.
-- Use quicksort for all priority based job sorting, which improves performance
significantly with large job counts.
-- If a job has already been released from a held state ignore successive
release requests.
-- Fix srun/salloc/sbatch man pages for the --no-kill option.
-- Add squeue -L/--licenses option to filter jobs by license names.
-- Handle abort job on node on front end systems without core dumping.
-- Fix dependency support for job arrays.
-- When updating jobs verify the update request is not identical to
the current settings.
-- When sorting jobs and priorities are equal sort by job_id.
-- Do not overwrite existing reason for node being down or drained.
-- Requeue batch job if Munge is down and credential can not be created.
-- Make _slurm_init_msg_engine() tolerate bug in bind() returning a busy
ephemeral port.
-- Don't block scheduling of entire job array if it could run in multiple
partitions.
-- Introduce a new debug flag Protocol to print protocol requests received
together with the remote IP address and port.
-- CRAY - Set up the network even when only using 1 node.
-- CRAY - Greatly reduce the number of error messages produced from the task
plugin and provide more information in the message.
* Changes in Slurm 14.03.0
==========================
-- job_submit/lua: Fix invalid memory reference if script returns error message
for user.
-- Add logic to sleep and retry if slurm.conf can't be read.
-- Reset a node's CpuLoad value at least once each SlurmdTimeout seconds.
-- Scheduler enhancements for reservations: When a job needs to run in
reservation, but can not due to busy resources, then do not block all jobs
in that partition from being scheduled, but only the jobs in that
reservation.
-- Export "SLURM*" environment variables from sbatch even if --export=NONE.
-- When recovering node state if the Slurm version is 2.6 or 2.5 set the
protocol version to be SLURM_2_5_PROTOCOL_VERSION which is the minimum
supported version.
-- Update the scancel man page documenting the -s option.
-- Update sacctmgr man page documenting how to modify account's QOS.
-- Fix for sjstat which currently does not print >1TB memory values correctly.
-- Change xmalloc()/xfree() to malloc()/free() in hostlist.c for better
performance.
-- Update squeue.1 man page describing the SPECIAL_EXIT state.
-- Added scontrol option of errnumstr to return error message given a slurm
error number.
-- If srun invoked with the --multi-prog option, but no task count, then use
the task count provided in the MPMD configuration file.
-- Prevent sview abort on some systems when adding or removing columns to the
display for nodes, jobs, partitions, etc.
-- Add job array hash table for improved performance.
-- Make AccountingStorageEnforce=all not include nojobs or nosteps.
-- Added sacctmgr mod qos set RawUsage=0.
-- Modify hostlist functions to accept more than two numeric ranges (e.g.
"row[1-3]rack[0-8]slot[0-63]")
-- Run job scheduling logic immediately when nodes enter service.
-- Added sbatch '--parsable' option to output only the job id number and the
cluster name separated by a semicolon. Errors will still be displayed.
-- Added failure management "slurmctld/nonstop" plugin.
-- Prevent jobs being killed when a checkpoint plugin is enabled or disabled.
-- Update the documentation about SLURM_PMI_KVS_NO_DUP_KEYS environment
variable.
-- select/cons_res bug fix for range of node counts with --cpus-per-task
option (e.g. "srun -N2-3 -c2 hostname" would allocate 2 CPUs on the first
node and 0 CPUs on the second node).
-- Change reservation flags field from 16 to 32-bits.
-- Add reservation flag value of "FIRST_CORES".
-- Added the idea of Resources to the database. Framework for handling
Loading
Loading full blame...