This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in Slurm 13.12.0pre4
 -- Remove the ThreadID documentation from slurm.conf. This functionality has
    been obsoleted by the LogTimeFormat.
 -- Sched plugins - rename global and plugin function names for consistency
    with other plugin types.
 -- BGQ - Added RebootQOSList option to bluegene.conf to allow an implicit
    reboot of a block if only jobs in the list are running on it when cnodes
    go into a failure state.
* Changes in Slurm 13.12.0pre3
 -- Do not set SLURM_NODEID environment variable on front-end systems.
 -- Convert bitmap functions to use int32_t instead of int in data structures
    and function arguments. This is to reliably enable use of bitmaps containing
    up to 4 billion elements. Several data structures containing index values
    were also changed from data type int to int32_t:
    - Struct job_info / slurm_job_info_t: Changed exc_node_inx, node_inx, and
      req_node_inx from type int to type int32_t
    - job_step_info_t: Changed node_inx from type int to type int32_t
    - Struct partition_info / partition_info_t: Changed node_inx from type int
      to type int32_t
    - block_job_info_t: Changed cnode_inx from type int to type int32_t
    - block_info_t: Changed ionode_inx and mp_inx from type int to type int32_t
    - Struct reserve_info / reserve_info_t: Changed node_inx from type int to
      type int32_t
 -- Modify qsub wrapper output to match torque command output, just print the
    job ID rather than "Submitted batch job #"
 -- Change Slurm error string for ESLURM_MISSING_TIME_LIMIT from
    "Missing time limit" to
    "Time limit specification required, but not provided"
 -- Change salloc job_allocate error message header from
    "Failed to allocate resources" to
    "Job submit/allocate failed"
 -- Modify slurmctld message retry logic to support Cray cold-standby SDB.
* Changes in Slurm 13.12.0pre2
 -- Added "JobAcctGatherParams" configuration parameter. Value of "NoShare"
    disables accounting for shared memory.
 -- Added fields to "scontrol show job" output: boards_per_node,
    sockets_per_board, ntasks_per_node, ntasks_per_board, ntasks_per_socket,
    ntasks_per_core, and nice.
 -- Add squeue output format options for job command and working directory
    (%o and %Z respectively).
 -- Add stdin/out/err to sview job output.
 -- Add new job_state of JOB_BOOT_FAIL for job terminations due to failure to
    boot its allocated nodes or BlueGene block.
 -- CRAY - Add SelectTypeParameters NHC_NO_STEPS and NHC_NO which will disable
    the node health check script for steps and allocations respectively.
 -- Reservation with CoreCnt: Avoid possible invalid memory reference.
 -- Add new error code for attempt to create a reservation with duplicate name.
 -- Validate that a hostlist file contains text (i.e. not a binary).
 -- switch/generic - propagate switch information from srun down to slurmd and
 -- CRAY - Do not package Slurm's libpmi or libpmi2 libraries. The Cray version
    of those libraries must be used.
 -- Added a new option to the scontrol command to view licenses that are
    configured, in use and available. 'scontrol show licenses'.
 -- MySQL - Made Slurm compatible with 5.6
* Changes in Slurm 13.12.0pre1
 -- sview - improve scalability
 -- Add task pointer to the task_post_term() function in task plugins. The
    terminating task's PID is available in task->pid.
 -- Move select/cray to select/alps
 -- Defer sending SIGKILL signal to processes while core dump in progress.
 -- Added JobContainerPlugin configuration parameter and plugin infrastructure.
 -- Added partition configuration parameters AllowAccounts, AllowQOS,
 -- The rpmbuild option for a cray system with ALPS has changed from
    %_with_cray to %_with_cray_alps.
 -- The log file timestamp format can now be selected at runtime via the
    LogTimeFormat configuration option. See the slurm.conf and slurmdbd.conf
 -- Added switch/generic plugin to convey a job's network topology.
 -- BLUEGENE - If block is in 'D' state or has more cnodes in error than
    MaxBlockInError set the job wait reason appropriately.
 -- API use: Generate an error return rather than fatal error and exit if the
    configuration file is absent or invalid. This will permit Slurm APIs to be
    more reliably used by other programs.
 -- Add support for load-based scheduling, allocate jobs to nodes with the
    largest number of available CPUs. Added SchedulingParameters parameter of
    "CR_LLN" and partition parameter of "LLN=yes|no".
 -- Added job_info() and step_info() functions to the gres plugins to extract
    plugin specific fields from the job's or step's GRES data structure.
 -- Added sbatch --signal option of "B:" to signal the batch shell rather than
    only the spawned job steps.
 -- Added sinfo and squeue format option of "%all" to print all fields available
    for the data type with a vertical bar separating each field.
 -- Add mechanism for job_submit plugin to generate error message for srun,
    salloc or sbatch to stderr. New argument added to job_submit function in
 -- Add StdIn, StdOut, and StdErr paths to job information dumped with
    "scontrol show job".
 -- Permit Slurm administrator to submit a batch job as any user.
 -- Set a job's RLIMIT_AS limit based upon its memory limit and VsizeFactor
    configuration value.
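
The RLIMIT_AS entry above can be illustrated with a small arithmetic sketch.
It assumes VsizeFactor is a percentage applied to the job's memory limit and
that a value of 0 means no limit is set. The function name is hypothetical
and this is not Slurm source code.

```python
def address_space_limit_bytes(mem_limit_mb, vsize_factor_percent):
    """Return an illustrative RLIMIT_AS value in bytes, or None for no limit.

    Assumption: the limit is vsize_factor_percent percent of the job's
    memory limit, and a factor of 0 disables the limit.
    """
    if vsize_factor_percent == 0:
        return None
    mem_limit_bytes = mem_limit_mb * 1024 * 1024
    # Integer arithmetic keeps the result exact for whole-percent factors.
    return mem_limit_bytes * vsize_factor_percent // 100
```

For example, a 500 MB memory limit with a factor of 101 gives a limit one
percent above 500 MB.
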
 -- Make jobacct_gather/cgroup work correctly and also make all jobacct_gather
    plugins more maintainable.
 -- Proctrack/pgid - Add support for proctrack_p_plugin_get_pids() function.
 -- Sched/backfill - Change default max_job_bf parameter from 50 to 100.
 -- Added -I|--item-extract option to sh5util to extract data item from series.
* Changes in Slurm 2.6.4
 -- Fixed sh5util to print its usage.
 -- Corrected commit f9a3c7e4e8ec.
 -- Honor ntasks-per-node option with exclusive node allocations.
 -- sched/backfill - Prevent invalid memory reference if bf_continue option is
    configured and slurm is reconfigured during one of the sleep cycles or if
    there are any changes to the partition configuration.
 -- Update man pages information about acct-freq and JobAcctGatherFrequency
    to reflect only the latest supported format.
 -- Minor document update to include note about PrivateData=Usage for the
    slurm.conf when using the DBD.
 -- Expand information reported with DebugFlags=backfill.
 -- Initiate jobs pending to run in a reservation as soon as the reservation
    becomes active.
* Changes in Slurm 2.6.3
 -- Add support for some new #PBS options in sbatch scripts and qsub wrapper:
    -l accelerator=true|false	(GPU use)
    -l mpiprocs=#	(processors per node)
    -l naccelerators=#	(GPU count)
    -l select=#		(node count)
    -l ncpus=#		(task count)
    -v key=value	(environment variable)
    -W depend=opts	(job dependencies, including "on" and "before" options)
    -W umask=#		(set job's umask)
 -- Added qalter and qrerun commands to torque package.
 -- Corrections to qstat logic: job CPU count and partition time format.
 -- Add job_submit/pbs plugin to translate PBS job dependency options to the
    extent possible (no support for PBS "before" options) and set some PBS
    environment variables.
 -- Add spank/pbs plugin to set a bunch of PBS environment variables.
 -- Backported sh5util from master to 2.6 as there are some important
    bugfixes and the new item extraction feature.
 -- select/cons_res - Correct MaxCPUsPerNode partition constraint for CR_Socket.
 -- scontrol - for setdebugflags command, avoid parsing "-flagname" as an
    scontrol command line option.
 -- Fix issue with step accounting if a job is requeued.
 -- Close file descriptors on exec of prolog, epilog, etc.
 -- Fix issue when a user has held a job and then sets the begin time
    into the future.
 -- Scontrol - Enable changing a job's stdout file.
 -- Fix issues where memory or node count of a srun job is altered while the
    srun is pending.  The step creation would use the old values and possibly
    hang srun since the step wouldn't be able to be created in the modified
 -- Add support for new SchedulerParameters value of "bf_max_job_part", the
    maximum depth the backfill scheduler should go in any single partition.
 -- acct_gather/infiniband plugin - Correct packets_in/out values.
 -- BLUEGENE - Don't ignore a conn-type request from the user.
 -- BGQ - Force a request on a Q for a MESH to be a TORUS in a dimension that
    can only be a TORUS (1).
 -- Change max message length from 100MB to 1GB before generating "Insane
    message length" error.
 -- sched/backfill - Prevent possible memory corruption due to use of
    bf_continue option and long running scheduling cycle (pending jobs could
    have been cancelled and purged).
 -- CRAY - Fix AcceleratorAllocation depth for BASIL 1.3.
 -- Created the environment variable SLURM_JOB_NUM_NODES for srun jobs and
    updated the srun man page.
 -- BLUEGENE/CRAY - Don't set env variables that pertain to a node when Slurm
    isn't doing the launching.
 -- gres/gpu and gres/mic - Do not treat the existence of an empty gres.conf
    file as a fatal error.
 -- Fix parsing of the days-0:min time specification when hours are
    specified as 0.
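
To make the time-specification entry above concrete, here is a minimal
parsing sketch. It assumes the accepted forms are minutes, minutes:seconds,
hours:minutes:seconds, days-hours, days-hours:minutes and
days-hours:minutes:seconds. The function name and the choice to round
seconds up to a whole minute are assumptions of this sketch, not Slurm's
implementation.

```python
def parse_time_minutes(spec):
    """Parse a time-limit string into whole minutes (seconds round up)."""
    days = 0
    if "-" in spec:
        # "days-hours[:minutes[:seconds]]"
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError("too many fields: " + spec)
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:        # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:      # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError("too many fields: " + spec)
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return -(-total_seconds // 60)  # ceiling division
```

The case named in the entry, an hours field of 0 after the day count, is
handled like any other value: "1-0:30" is one day plus thirty minutes.
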
 -- switch/nrt - Fix for memory leak.
 -- Subtract the PMII_COMMANDLEN_SIZE in contribs/pmi2/pmi2_api.c to prevent
    certain implementations of snprintf() from segfaulting.
* Changes in Slurm 2.6.2
 -- Fix issue with reconfig and GrpCPURunMins
 -- Fix of wrong node/job state problem after reconfig
 -- Allow users who are coordinators to update their own limits in the accounts
    they are coordinators over.
 -- BackupController - Make sure we have a connection to the DBD first thing
    to avoid it thinking we don't have a cluster name.
 -- Correct value of min_nodes returned by loading job information to consider
    the job's task count and maximum CPUs per node.
 -- If running jobacct_gather/none fix issue on unpacking step completion.
 -- Reservation with CoreCnt: Avoid possible invalid memory reference.
 -- sjstat - Add man page when generating rpms.
 -- Make sure GrpCPURunMins is added when creating a user, account or QOS with
 -- Fix for invalid memory reference due to multiple free calls caused by
    job arrays submitted to multiple partitions.
 -- Enforce --ntasks-per-socket=1 job option when allocating by socket.
 -- Validate permissions of key directories at slurmctld startup. Report
    anything that is world writable.
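
The kind of permission check described in the entry above can be sketched
as follows. This is an illustration only: the function name is hypothetical,
the list of directories to check is supplied by the caller, and slurmctld
itself performs its validation in C.

```python
import os
import stat


def world_writable(paths):
    """Return the subset of paths whose mode grants write access to 'other'."""
    flagged = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # Missing or unreadable paths are skipped in this sketch.
            continue
        if mode & stat.S_IWOTH:
            flagged.append(path)
    return flagged
```

A caller could pass the directories it cares about and log each path that
is returned.
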