This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and admins.

* Changes in Slurm 14.11.0pre1
==============================
 -- Modify etc/cgroup.release_common.example to specify the full path to the
    scontrol command. Also find cgroup mount point by reading cgroup.conf file.
 -- Improve qsub wrapper support for passing environment variables.
 -- Modify sdiag to report Slurm RPC traffic by user, type, count and time
    consumed.
 -- In select plugins, stop triggering extra logging based upon the debug flag
    CPU_Bind and use SelectType instead.
 -- Added SchedulerParameters options of bf_yield_interval and bf_yield_sleep
    to control how frequently and for how long the backfill scheduler will
    relinquish its locks.
 -- To support larger numbers of jobs when the StateSaveDirectory is on a
    file system that supports a limited number of files in a directory, add a
    subdirectory called "hash.#" based upon the last digit of the job ID.
 -- More gracefully handle missing batch script file. Just kill the job and do
    not drain the compute node.
 -- Add support for allocation of GRES by model type for heterogeneous systems
    (e.g. request a Kepler GPU, a Tesla GPU, or a GPU of any type).
 -- Record and enable display of nodes anticipated to be used for pending jobs.
 -- Modify squeue --start option to print the nodes expected to be used for
    pending jobs (in addition to expected start time, etc.).
 -- Add association hash to the assoc_mgr.
 -- Better logic to handle resized jobs when the DBD is down.
 -- Introduce MemLimitEnforce yes|no in slurm.conf. If set to "no", Slurm will
    not terminate jobs that exceed their requested memory.
 -- Add support for non-consumable generic resources for resources that are
    limited, but can be shared between jobs.
 -- Introduce 5 new Slurm errors in slurm_errno.h related to jobs to better
    report error conditions.
 -- Modify scontrol to print error message for each array task when updating
    the entire array.
 -- Added gres_drain and gres_used fields to node_info_t.
 -- Added PriorityParameters configuration parameter in slurm.conf.
 -- Introduce automatic job requeue policy based on exit value. See RequeueExit
    and RequeueExitHold descriptions in slurm.conf man page.
 -- Modify slurmd to cache launched job IDs for more responsive job suspend and
    gang scheduling.
 -- Permit job steps full control over cpu_bind options if specialized cores
    are included in the job allocation.
 -- Added ChosLoc configuration parameter to specify the pathname of the
    Chroot OS tool.
 -- Send SIGCONT/SIGTERM when a job is selected for preemption with GraceTime
    configured rather than waiting for GraceTime to be reached before notifying
    the job.
 -- Do not resume a job with specialized cores on a node running another job
    with specialized cores (only one can run at a time).
 -- Add specialized core count to job suspend/resume calls.
 -- task/cgroup - Correct specialized core task binding with user supplied
    invalid CPU mask or map.
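
The "hash.#" StateSaveDirectory entry above can be illustrated with a minimal
sketch. It assumes the subdirectory name is "hash." followed by the last decimal
digit of the job ID. The function name and the example path are hypothetical,
and the layout of files beneath the subdirectory is not shown because this file
does not describe it.

```python
import posixpath


def hash_subdir(state_save_dir, job_id):
    """Return the "hash.#" subdirectory for a job (illustrative only).

    Assumption: "#" is the last decimal digit of the job ID, so job files
    are spread across at most ten subdirectories.
    """
    if job_id < 0:
        raise ValueError("job_id must be non-negative")
    return posixpath.join(state_save_dir, "hash.%d" % (job_id % 10))


# Hypothetical StateSaveDirectory path, used only for illustration.
print(hash_subdir("/var/spool/slurm.state", 1234567))
```

Spreading job files over ten subdirectories reduces the number of entries in
any single directory by roughly a factor of ten.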

* Changes in Slurm 14.03.4
==========================
 -- Fix issue when QOS is not enforced but a partition either allows or denies
    specific QOS.
 -- CRAY - Make switch/cray default when running on a Cray natively.
 -- CRAY - Make job_container/cncu default when running on a Cray natively.
 -- Disable job time limit change if its preemption is in progress.
 -- Correct logic to properly enforce job preemption GraceTime.
 -- Fix sinfo -R to print each down/drained node once, rather than once per
    partition.
 -- If a job has a non-responding node, retry job step create rather than
    returning with DOWN node error.
 -- Support SLURM_CONF path which does not have "slurm.conf" as the file name.
 -- Fix issue where batch cpuset wasn't looked at correctly in
    jobacct_gather/cgroup.
 -- Correct squeue's job node and CPU counts for requeued jobs.
 -- Correct SelectTypeParameters=CR_LLN with job selection of specific nodes.
 -- Only if ALL of their partitions are hidden will a job be hidden by default.
 -- Run EpilogSlurmctld for a job that is killed during slurmctld
    reconfiguration.

* Changes in Slurm 14.03.3-2
============================
 -- BGQ - Fix issue with uninitialized variable.

* Changes in Slurm 14.03.3
==========================
 -- Correction to default batch output file name. Version 14.03.2 was using
    "slurm_<jobid>_4294967294.out" due to error in job array logic.
 -- In slurm.spec file, replace "Requires cray-MySQL-devel-enterprise" with
    "Requires mysql-devel".

* Changes in Slurm 14.03.2
==========================
 -- Fix race condition if PrologFlags=Alloc,NoHold is used.
 -- Cray - Make NPC only limit running other NPC jobs on shared blades instead
    of limiting non-NPC jobs.
 -- Fix for sbatch #PBS -m (mail) option parsing.
 -- Fix job dependency bug. Jobs dependent upon multiple other jobs may start
    prematurely.
 -- Set "Reason" field for all elements of a job array on short-circuited
    scheduling for job arrays.
 -- Allow -D option of salloc/srun/sbatch to specify relative path.
 -- Added SchedulerParameters option of batch_sched_delay to permit many batch jobs
    to be submitted between each scheduling attempt to reduce overhead of
    scheduling logic.
 -- Added job reason of "SchedTimeout" if the scheduler was not able to reach
    the job to attempt scheduling it.
 -- Add job's exit state and exit code to email message.
 -- scontrol hold/release accepts job name option (in addition to job ID).
 -- Better handle attempts to cancel a step that hasn't started yet.
 -- Handle Max/GrpCPU limits better
 -- Add --priority option to salloc, sbatch and srun commands.
 -- Honor partition priorities over job priorities.
 -- Fix sacct -c when using jobcomp/filetxt to read newer variables
 -- Fix segfault of sacct -c if spaces are in the variables.
 -- Release held job only with "scontrol release <jobid>" and not by resetting
    the job's priority. This is needed to support job arrays better.
 -- Correct squeue command not to merge jobs with state pending and completing
    together.
 -- Fix issue where user is requesting --acctg-freq=0 and no memory limits.
 -- Fix issue with GrpCPURunMins if a job's timelimit is altered while the job
    is running.
 -- Temporary fix for handling our typemap for the perl api with newer perl.
 -- Fix controller seg fault caused by allowgroup with a bad group.
 -- Handle node ranges better when dealing with accounting max node limits.

* Changes in Slurm 14.03.1-2
============================
 -- Update configure to set correct version without having to run autogen.sh

* Changes in Slurm 14.03.1
==========================
 -- Add support for job std_in, std_out and std_err fields in Perl API.
 -- Add "Scheduling Configuration Guide" web page.
 -- BGQ - fix check for jobinfo when it is NULL
 -- Do not check cleaning on "pending" steps.
 -- task/cgroup plugin - Fix for building on older hwloc (v1.0.2).
 -- In the PMI implementation by default don't check for duplicate keys.
    Set the SLURM_PMI_KVS_DUP_KEYS environment variable if you want the code
    to check for duplicate keys.
 -- Add job submission time to squeue.
 -- Permit user root to propagate resource limits higher than the hard limit
    slurmd has on that compute node (i.e. raise both current and maximum
    limits).
 -- Fix issue with license used count when doing an scontrol reconfig.
 -- Fix the PMI iterator to not report duplicated keys.
 -- Fix issue with sinfo when -o is used without the %P option.
 -- Rather than immediately invoking an execution of the scheduling logic on
    every event type that can enable the execution of a new job, queue its
    execution. This permits faster execution of some operations, such as
    modifying large counts of jobs, by executing the scheduling logic less
    frequently, but still in a timely fashion.
 -- If an environment variable is longer than MAX_ENV_STRLEN, don't set it
    in the job env, otherwise the exec() fails.
 -- Optimize scontrol hold/release logic for job arrays.
 -- Modify srun to report an exit code of zero rather than nine if some tasks
    exit with a return code of zero and others are killed with SIGKILL. Only an
    exit code of zero did this.
 -- Fix a typo in scontrol man page.
 -- Avoid slurmctld crash getting job info if detail_ptr is NULL.
 -- Fix sacctmgr add user where both defaultaccount and accounts are specified.
 -- Added SchedulerParameters option of max_sched_time to limit how long the
    main scheduling loop can execute for.
 -- Added SchedulerParameters option of sched_interval to control how frequently
    the main scheduling loop will execute.
 -- Move start time of main scheduling loop timeout after locks are acquired.
 -- Add squeue job format option of "%y" to print a job's nice value.
 -- Update scontrol update jobID logic to operate on entire job arrays.
 -- Fix PrologFlags=Alloc to run the prolog on each of the nodes in the
    allocation instead of just the first.
 -- Fix race condition if a step is starting while the slurmd is being
    restarted.
 -- Make sure a job's prolog has run before starting a step.
 -- BGQ - Fix invalid memory read when using DefaultConnType in the
    bluegene.conf
 -- Make sure we send node state to the DBD on clean start of controller.
 -- Fix some sinfo and squeue sorting anomalies due to differences in data
    types.
 -- Only send message back to slurmctld when PrologFlags=Alloc is used on a
    Cray/ALPS system, otherwise use the slurmd to wait on the prolog to gate
    the start of the step.
 -- Remove need to check PrologFlags=Alloc in slurmd since we can tell if prolog
    has run yet or not.
 -- Fix squeue to use a correct macro to check job state.
 -- BGQ - Fix incorrect logic issues if MaxBlockInError=0 in the bluegene.conf.
 -- priority/basic - Ensure job priorities continue to decrease when jobs are
    submitted with the --nice option.
 -- Make PrologFlags=Alloc work on batch scripts.
 -- Make PrologFlags=NoHold (automatically sets PrologFlags=Alloc) not hold in
    salloc/srun, instead wait in the slurmd when a step hits a node and the
    prolog is still running.
 -- Added --cpu-freq=highm1 (high minus one) option.
 -- Expand StdIn/Out/Err string length output by "scontrol show job" from 128
    to 1024 bytes.
 -- squeue %F format will now print the job ID for non-array jobs.
 -- Use quicksort for all priority based job sorting, which improves performance
    significantly with large job counts.
 -- If a job has already been released from a held state ignore successive
    release requests.
 -- Fix srun/salloc/sbatch man pages for the --no-kill option.
 -- Add squeue -L/--licenses option to filter jobs by license names.
 -- Handle abort job on node on front end systems without core dumping.
 -- Fix dependency support for job arrays.
 -- When updating jobs verify the update request is not identical to
    the current settings.