This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in SLURM 2.0.0-rc2
============================
 -- Change fanout logic to start on the calling node instead of the first node
    in the message nodelist.
 -- Fix bug so that smap builds properly on Sun Constellation system.
 -- Filter whitespace out of node feature specifications.
 -- Fixed issue with duration not being honored when updating start time in 
    reservations.
 -- Fix bug in sched/wiki and sched/wiki2 plugins for reporting job resource
    allocation properly when node names are configured out of sort order 
    with more than one numeric suffix (e.g. "tux10-1" is configured after 
    "tux5-1").
 -- Avoid re-use of job_id (if specified at submit time) when the existing
    job is in completing state (possible race condition with Moab).
 -- Added SLURM_DISTRIBUTION to env for salloc.
 -- Add support for "scontrol takeover" command for backup controller to 
    assume control immediately. Patch from Matthieu Hautreux, CEA.
 -- If srun is unable to communicate with the slurmd, tasks are now marked as 
    failed with the controller.
 -- Fixed issues with requeued jobs not being recorded correctly in the 
    accounting records.
 -- Clear node's POWER_SAVE flag if configuration changes to one lacking a
    ResumeProgram.
 -- Extend a job's time limit as appropriate due to delays powering up nodes.
 -- If sbatch is used to launch a job step within an existing allocation (as
    used by LSF) and the required node is powered down, print the message
    "Job step creation temporarily disabled, retrying", sleep, and retry.
 -- Configuration parameter ResumeDelay added to control how much time must 
    elapse after a node has been suspended before it can be resumed (e.g. 
    powered back up).
 -- Fix CPU binding for batch program. Patch from Matthieu Hautreux, CEA.

* Changes in SLURM 2.0.0-rc1
============================
 -- Fix bug in preservation of advanced reservations when slurmctld restarts.
 -- Updated perlapi to correctly match slurm.h structures.
 -- Do not install the srun command on BlueGene systems (mpirun must be used to
 -- Corrections to scheduling logic for topology/tree in configurations where 
    nodes are configured in multiple leaf switches.
 -- Patch from Matthieu Hautreux for backup MySQL daemon support.
 -- Changed DbdBackup to DbdBackupHost for slurmdbd.conf file
 -- Add support for spank_strerror() function and improve error handling in
    general for SPANK plugins.
 -- Added configuration parameter SrunIOTimeout to optionally ping srun's tasks
    for better fault tolerance (e.g. killed and restarted SLURM daemons on 
    compute nodes).
 -- Add slurmctld and slurmd binding to appropriate communications addresses
    based upon NodeAddr, ControllerAddr and BackupAddr configuration 
    parameters. Based upon patch from Matthieu Hautreux, CEA.
    NOTE: Fails with some SlurmDBD configurations.
    NOTE: You must define BIND_SPECIFIC_ADDR to enable this option.
 -- Avoid using powered down nodes when scheduling work if possible. 
    Fix possible invalid memory reference in power save logic.

* Changes in SLURM 1.4.0-pre13
==============================
 -- Added new partition option AllocNodes which controls the hosts from 
    which jobs can be submitted to this partition. From Matthieu Hautreux, CEA.
 -- Better support the --contiguous option for job allocations.
 -- Add new scontrol option: show topology (reports contents of topology.conf 
    file via RPC if topology/tree plugin is configured).
 -- Add advanced reservation display to smap command.
 -- Replaced remaining references to SLURM_JOBID with SLURM_JOB_ID - except
    when needed for backwards compatibility.
 -- Fix logic to properly excise a DOWN node from the allocation of a job
    with the --no-kill option.
 -- The MySQL and PgSQL plugins for accounting storage and job completion are
    now only built if the underlying database libraries exist (previously
    the plugins were built to produce a fatal error when used).
 -- BLUEGENE - scontrol show config will now display bluegene.conf information.

* Changes in SLURM 1.4.0-pre12
==============================
 -- Added support for hard time limits by association, with the new 
    configuration option PriorityUsageResetPeriod. This specifies the interval
    at which to clear the record of time used. This is currently only 
    available with the priority/multifactor plugin.
 -- Added SLURM_SUBMIT_DIR to sbatch's output environment variables.
 -- Backup slurmdbd support implemented.
 -- Update to checkpoint/xlch logic from Hongjia Cao, NUDT.
 -- Added configuration parameter AccountingStorageBackupHost.

* Changes in SLURM 1.4.0-pre11
==============================
 -- Fix slurm.spec file for RPM build.

* Changes in SLURM 1.4.0-pre10
==============================
 -- Critical bug fix in task/affinity when the CoresPerSocket is greater
    than the ThreadsPerCore (invalid memory reference).
 -- Add DebugFlag parameter of "Wiki" to log sched/wiki and wiki2 
    communications in greater detail.
 -- Add "-d <slurmstepd_path>" as an option to the slurmd daemon to
    specify a non-standard slurmstepd file, used for testing purposes.
 -- Minor cleanup to crypto/munge plugin.
    - Restrict uid allowed to decode job credentials in crypto/munge
    - Get slurm user id early in crypto/munge
    - Remove buggy error code handling in crypto/munge
 -- Added sprio command - works only with the priority/multifactor plugin
 -- Add real topology plugin infrastructure (it was initially added 
    directly into slurmctld code). To specify topology information,
    set TopologyType=topology/tree and add configuration information
    to a new file called topology.conf. See "man topology.conf" or
    topology.html web page for details.
 -- Set "/proc/self/oom_adj" for slurmd and slurmstepd daemons based upon
    the values of SLURMD_OOM_ADJ and SLURMSTEPD_OOM_ADJ environment 
    variables. This can be used to prevent the daemons from being killed when
    a node's memory is exhausted. Based upon patch by Hongjia Cao, NUDT.
 -- Fix several bugs in task/affinity: cpuset logic was broken and 
    --cpus-per-task option not properly handled.
 -- Ensure slurmctld adopts SlurmUser GID as well as UID on startup.

* Changes in SLURM 1.4.0-pre9
=============================
 -- OpenMPI users only: Add srun logic to automatically recreate and 
    re-launch a job step if the step fails with a reserved port conflict.
 -- Added TopologyPlugin configuration parameter.
 -- Added switch topology data structure to slurmctld (for use by select 
    plugins) and load it based upon new slurm.conf parameters: SwitchName, 
    Nodes, Switches and LinkSpeed.
 -- Modify select/linear and select/cons_res plugins to optimize resource
    allocation with respect to network topology.
 -- Added support for new configuration parameter EpilogSlurmctld (executed 
    by slurmctld daemon).
 -- Added checkpoint/blcr plugin. SLURM now supports job checkpoint/restart 
    using BLCR. Patch from Hongjia Cao, NUDT, China.
 -- Made a variety of new environment variables available to PrologSlurmctld
    and EpilogSlurmctld. See the "Prolog and Epilog Scripts" section of the 
    slurm.conf man page for details.
 -- NOTE: Cold-start (without preserving state) required for upgrade from 
    version 1.4.0-pre8.

* Changes in SLURM 1.4.0-pre8
=============================
 -- In order to create a new partition using the scontrol command, use
    the "create" option rather than "update" (which will only operate
    upon partitions that already exist).
 -- Added environment variable SLURM_RESTART_COUNT to batch jobs to
    indicate the count of job restarts made.
 -- Added sacctmgr command "show config".
 -- Added the scancel option --nodelist to cancel any jobs running on a
    given list of nodes.
 -- Add partition-specific DefaultTime (default time limit for jobs; if not 
    specified, the partition's MaxTime is used). Patch from Par
    Andersson, National Supercomputer Centre, Sweden.
 -- Add support for the scontrol command to be able to change the Weight
    associated with nodes. Patch from Krishnakumar Ravi[KK] (HP).
 -- Add DebugFlag configuration option of "CPU_Bind" for detailed CPU
    binding information to be logged.
 -- Fix some significant bugs in task binding logic (possible infinite loops
    and memory corruption).
 -- Add new node state flag of NODE_STATE_MAINT indicating the node is in
    a reservation of type MAINT.
 -- Modified task/affinity plugin to automatically bind tasks to sockets,
    cores, or threads as appropriate based upon resource allocation and
    task count. User can override with srun's --cpu_bind option. 
 -- Fix bug in backfill logic for select/cons_res plugin, which resulted in 
    error "cons_res:_rm_job_from_res: node_state mis-count".
 -- Add logic to bind a batch job to the resources allocated to that job.
 -- Add configuration parameter MpiParams for (future) OpenMPI port 
    management. Add resv_port_cnt and resv_ports fields to the job step 
    data structures. Add environment variable SLURM_STEP_RESV_PORTS to
    show what ports are reserved for a job step.
 -- Add support for SchedulerParameters=interval=<sec> to control the time
    interval between executions of the backfill scheduler logic.
 -- Preserve record of last job ID in use even when doing a cold-start unless
    there is no job state file or there is a change in its format (which only 
    happens when there is a change in SLURM's major or minor version number: 
    v1.3 -> v1.4).
 -- Added new configuration parameter KillOnBadExit to kill a job step as soon
    as any task of a job step exits with a non-zero exit code. Patch based
    on work from Eric Lin, Bull.
 -- Add spank plugin calls for use by the salloc and sbatch commands, see 
    "man spank" for details.
 -- NOTE: Cold-start (without preserving state) required for upgrade from 
    version 1.4.0-pre7.

* Changes in SLURM 1.4.0-pre7
=============================
 -- Bug fix for preemption with select/cons_res when there are no idle nodes.
 -- Bug fix for use of srun options --exclusive and --cpus-per-task together
    for job step resource allocation (tracking of cpus in use was bad).
 -- Added the srun option --preserve-env to pass the current values of 
    environment variables SLURM_NNODES and SLURM_NPROCS through to the 
    executable, rather than computing them from command-line parameters.
 -- For select/cons_res or sched/gang only: Validate a job's resource 
    allocation socket and core count on each allocated node. If the node's
    configuration has been changed, then abort the job.
 -- For select/cons_res or sched/gang only: Disable updating a node's 
    processor count if FastSchedule=0. Administrators must set a valid
    processor count although the memory and disk space configuration can
    be loaded from the compute node when it starts.
 -- Add configure option "--disable-iso8601" to disable SLURM use of ISO 8601
    time format at the time of SLURM build. Default output for all commands
    is now ISO 8601 (yyyy-mm-ddThh:mm:ss).
 -- Add support for scontrol to explicitly power a node up or down using the