This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.
* Changes in Slurm 16.05.0pre1
===============================
-- Add sbatch "--wait" option that waits for job completion before exiting.
Exit code will match that of spawned job.
-- Modify advanced reservation save/restore logic for core reservations to
support configuration changes (changes in configured nodes or cores counts).
-- Allow ControlMachine, BackupController, DbdHost and DbdBackupHost to be
either short or long hostname.
-- Job output and error files can now contain a "%" character by specifying
a file name with two consecutive "%" characters. For example,
"sbatch -o slurm.%%.%j" for job ID 123 will generate an output file named
"slurm.%.123".
-- Pass user name in Prolog RPC from controller to slurmd when using
PrologFlags=Alloc. Allows SLURM_JOB_USER env variable to be set when using
Native Slurm on a Cray.
-- Add "NumTasks" to job information visible to Slurm commands.
-- Add mail wrapper script "smail" that will include job statistics in email
notification messages.
-- Add a PMIx plugin for fast wire up of MPI jobs.
-- Remove vestigial "SICP" job option (inter-cluster job option). Completely
different logic will be forthcoming.
-- Fix case where the primary and backup dbds would both be performing rollup.
-- Add an ack reply from slurmd to slurmstepd when job setup is done and the
job is ready to be executed.
-- Removed support for authd. authd has not been developed and supported since
-- Introduce a new parameter requeue_setup_env_fail in SchedulerParameters.
A job that fails to setup the environment will be requeued and the node
drained.
-- Add ValidateTimeout and OtherTimeout to "scontrol show burst" output.
-- Increase default sbcast buffer size from 512KB to 8MB.
-- Enable the hdf5 profiling of the batch step.
-- Eliminate redundant environment and script files for job arrays.
-- Add configuration checking via the new -C option of slurmctld. To check
for configuration errors in slurm.conf run: 'slurmctld -C'.
-- Stop searching sbatch scripts for #PBS directives after 100 lines of
non-comments. Stop parsing #PBS or #SLURM directives after 1024 characters
into a line. Required for decent performance with huge scripts.
-- Add debug flag for timing Cray portions of the code.
-- Add Multi-Category Security (MCS) infrastructure to permit nodes to be bound
to specific users or groups.
-- Install the pmi2 unix sockets in slurmd spool directory instead of /tmp.
-- Use getaddrinfo and getnameinfo instead of gethostbyaddr and
gethostbyname.
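
The "%%" file name escaping noted above for job output and error files can be illustrated with a small sketch. This is not Slurm's implementation: the function name expand_output_name is hypothetical, and only the "%%" and "%j" patterns mentioned in the entry are handled.

```python
def expand_output_name(pattern: str, job_id: int) -> str:
    """Illustrative only: expand '%%' to a literal '%' and '%j' to the job ID.

    Hypothetical helper, not part of Slurm. Any other '%' sequence is
    copied through unchanged.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "%":
                # Two consecutive '%' characters yield one literal '%'.
                out.append("%")
                i += 2
                continue
            if nxt == "j":
                # '%j' is replaced by the job ID.
                out.append(str(job_id))
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


print(expand_output_name("slurm.%%.%j", 123))  # slurm.%.123
```

The escape is checked before the "%j" substitution so that a literal "%" followed by "j" can still be written as "%%j".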
* Changes in Slurm 15.08.6
==========================
-- In slurmctld log file, log duplicate job ID found by slurmd. Previously was
being logged as prolog/epilog failure.
-- If a job is requeued while in the process of being launched, remove its
job ID from slurmd's record of active jobs in order to avoid generating a
duplicate job ID error when launched for the second time (which would
drain the node).
-- Cleanup messages when handling job script and environment variables in
older directory structure formats.
-- Prevent triggering gang scheduling within a partition if configured with
PreemptType=partition_prio and PreemptMode=suspend,gang.
-- Decrease parallelism in job cancel request to prevent denial of service
when cancelling huge numbers of jobs.
-- If all ephemeral ports are in use, try using other port numbers.
* Changes in Slurm 15.08.5
==========================
-- Prevent "scontrol update job" from updating jobs that have already finished.
-- Show requested TRES in "squeue -O tres" when job is pending.
-- Backfill scheduler: Test association and QOS node limits before reserving
resources for pending job.
-- burst_buffer/cray: If teardown operation fails, sleep and retry.
-- Clean up the external pids when using the PrologFlags=Contain feature
and the job finishes.
-- burst_buffer/cray: Support file staging when job lacks job-specific buffer
(i.e. only persistent burst buffers).
-- Added srun option of --bcast to copy executable file to compute nodes.
-- Fix for advanced reservation of burst buffer space.
-- BurstBuffer/cray: Add logic to terminate dw_wlm_cli child processes at
shutdown.
-- If job can't be launched or requeued, then terminate it.
-- BurstBuffer/cray: Enable clearing of burst buffer string on completed job
as a means of recovering from a failure mode.
-- Fix wrong memory free when parsing SrunPortRange=0-0 configuration.
-- BurstBuffer/cray: Fix job record purging if cancelled from pending state.
-- BGQ - Handle database throw correctly when syncing users on blocks.
-- MySQL - Make sure we don't have a NULL string returned when not
requesting any specific association.
-- sched/backfill: If max_rpc_cnt is configured and the backlog of RPCs has
not cleared after yielding locks, then continue to sleep.
-- Preserve the job dependency description displayed in 'scontrol show job'
even if the dependee job was terminated and cleaned, causing the
dependent to never run because of DependencyNeverSatisfied.
-- Correct job task count calculation if only node count and ntasks-per-node
-- Make sure the association manager converts any string to be lower case
as all the associations from the database will be lower case.
-- Sanity check for xcgroup_delete() to verify incoming parameter is valid.
-- Fix formatting for sacct with variables that switched from uint32_t to
uint64_t.
-- Set up extern step to track any children of an ssh if it leaves anything
else behind.
-- Prevent slurmdbd divide by zero if no associations defined at rollup time.
-- Multifactor - Add sanity check to make sure pending jobs are handled
correctly when PriorityFlags=CALCULATE_RUNNING is set.
-- Add slurmdb_find_tres_count_in_string() to slurm db perl api.
-- Make lua dlopen() conditional on version found at build.
-- sched/backfill - Delay backfill scheduler for completing jobs only if
CompleteWait configuration parameter is set (make code match documentation).
-- Release a job's allocated licenses only after epilog runs on all nodes
rather than at start of termination process.
-- Cray job NHC delayed until after burst buffer released and epilog completes
on all allocated nodes.
-- Fix abort of srun if using PrologFlags=NoHold
-- Let devices step_extern cgroup inherit attributes of job cgroup.
-- Add new hook to Task plugin to be able to put adopted processes in the
step_extern cgroups.
-- Fix AllowUsers documentation in burst_buffer.conf man page. Usernames are
comma separated, not colon delimited.
-- Fix issue with time limit not being set correctly from a QOS when a job
requests no time limit.
-- In both sched/basic and backfill: If a job cannot be started due to some
account/qos limit, then don't start other jobs that could delay it. The
old logic would skip the job and start other jobs, which could delay the
higher priority job.
-- select/cray: Prevent NHC from running more than once per job or step.
-- Fix fields not properly printed when adding an account through sacctmgr.
-- Update LBNL Node Health Check (NHC) link on FAQ.
-- Fix multifactor plugin to prevent slurmctld from getting segmentation fault
should the tres_alloc_cnt be NULL.
-- sbatch/salloc - Move nodelist logic before the time min_nodes is used
so we can set it correctly before tasks are set.
* Changes in Slurm 15.08.4
==========================
-- Fix typo for the "devices" cgroup subsystem in pam_slurm_adopt.c
-- Fix TRES_MAX flag to work correctly.
-- Added burst_buffer.conf flag parameter of "TeardownFailure" which will
teardown and remove a burst buffer after failed stage-in or stage-out.
By default, the buffer will be preserved for analysis and manual teardown.
-- Prevent a core dump in srun if the signal handler runs during the job
allocation causing the step context to be NULL.
-- Don't fail job if multiple prolog operations in progress at slurmctld
restart time.
-- Burst_buffer/cray: Fix to purge terminated jobs with burst buffer errors.
-- Burst_buffer/cray: Don't stall scheduling of other jobs while a stage-in
is in progress.
-- Make it possible to query 'extern' step with sstat.
-- Make 'extern' step show up in the database.
-- MYSQL - Quote assoc table name in mysql query.
-- Make SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX, and SLURM_ARRAY_TASK_STEP
environment variables available to PrologSlurmctld and EpilogSlurmctld.
-- Fix slurmctld bug in which a pending job array could be canceled
by a user different from the owner or the administrator.
-- Support taking node out of FUTURE state with "scontrol reconfig" command.
-- Sched/backfill: Fix to properly enforce SchedulerParameters of
bf_max_job_array_resv.
-- Enable operator to reset sdiag data.
-- jobcomp/elasticsearch plugin: Add array_job_id and array_task_id fields.
-- Remove duplicate #define IS_NODE_POWER_UP.
-- Added SchedulerParameters option of max_script_size.
-- Add REQUEST_ADD_EXTERN_PID option to add pid to the slurmstepd's extern
step.
-- Add unique identifiers to anchor tags in HTML generated from the man pages.
-- Add with_freeipmi option to spec file.
-- Minor elasticsearch code improvements
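
As an illustration of the SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX, and SLURM_ARRAY_TASK_STEP entry above, a PrologSlurmctld or EpilogSlurmctld script could use the three variables roughly as sketched below. The helper name array_task_ids is hypothetical, and the sketch assumes a job array given as a simple min-max:step range, with a step of 1 when the step variable is unset. Arrays defined as arbitrary lists of task IDs are not covered.

```python
import os


def array_task_ids(env=None):
    """Illustrative only: list task IDs implied by the array range variables.

    Hypothetical helper. Assumes the array was specified as min-max:step.
    """
    env = os.environ if env is None else env
    lo = int(env["SLURM_ARRAY_TASK_MIN"])
    hi = int(env["SLURM_ARRAY_TASK_MAX"])
    # Assumption: fall back to a step of 1 if the variable is not set.
    step = int(env.get("SLURM_ARRAY_TASK_STEP", "1"))
    return list(range(lo, hi + 1, step))


# Example environment made up for demonstration purposes.
example = {
    "SLURM_ARRAY_TASK_MIN": "0",
    "SLURM_ARRAY_TASK_MAX": "10",
    "SLURM_ARRAY_TASK_STEP": "5",
}
print(array_task_ids(example))  # [0, 5, 10]
```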
* Changes in Slurm 15.08.3
==========================
-- Correct Slurm's RPM build if Munge is not installed.
-- Job array termination status email ExitCode based upon highest exit code
from any task in the job array rather than the last task. Also change the
state from "Ended" or "Failed" to "Mixed" where appropriate.
-- Squeue recombines pending job array records only if their name and partition
are identical.
-- Fix some minor leaks in the job info and step info API.
-- Export missing QOS id when filling in association with the association
manager.
-- Fix invalid reference if a lua job_submit plugin references a default qos
when a user doesn't exist in the database.
-- Use association enforcement in the lua plugin.
-- Fix a few spots missing defines of accounting_enforce or acct_db_conn
in the plugins.
-- Show requested TRES in scontrol show jobs when job is pending.
-- Improve sched/backfill support for job features, especially XOR construct.
-- Correct scheduling logic for job features option with XOR construct that
could delay a job's initiation.
-- Remove unneeded frees when creating a tres string.
-- Send a tres_alloc_str for the batch step
-- Fix incorrect check for slurmdb_find_tres_count_in_string in various places,
it needed to check for INFINITE64 instead of zero.
-- Don't allow scontrol to create partitions with the name "DEFAULT".
-- burst_buffer/cray: Change error from "invalid request" to "permission denied"
if a non-authorized user tries to create/destroy a persistent buffer.
-- PrologFlags work: Setting a flag of "Contain" implicitly sets the "Alloc"
flag. Fix code path which could prevent execution of the Prolog when the