This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in Slurm 2.6.4
========================
 -- Fixed sh5util to print its usage.
 -- Corrected commit f9a3c7e4e8ec.
 -- Honor ntasks-per-node option with exclusive node allocations.
 -- sched/backfill - Prevent invalid memory reference if bf_continue option is
    configured and slurm is reconfigured during one of the sleep cycles or if
    there are any changes to the partition configuration or if the normal
    scheduler runs and starts a job that the backfill scheduler is actively
    working on.
 -- Update man pages information about acct-freq and JobAcctGatherFrequency
    to reflect only the latest supported format.
 -- Minor document update to include note about PrivateData=Usage for the
    slurm.conf when using the DBD.
 -- Expand information reported with DebugFlags=backfill.
 -- Initiate jobs pending to run in a reservation as soon as the reservation
    becomes active.
 -- Purge expired reservations even if they have pending jobs.
 -- Corrections to calculation of a pending job's expected start time.
 -- Remove some vestigial logic treating job priority of 1 as a special case.
 -- Free memory to avoid minor memory leaks at daemon shutdown.
 -- Updated documentation to give correct units being displayed.
 -- Report AccountingStorageBackupHost with "scontrol show config".
 -- init scripts ignore quotes around Pid file name specifications.
 -- Fixed typo about command case in quickstart.html.
 -- task/cgroup - handle new cpuset files, similar to commit c4223940.
 -- Replace the tempname() function call with mkstemp().
 -- Fix for --cpu_bind=map_cpu/mask_cpu/map_ldom/mask_ldom plus
    --mem_bind=map_mem/mask_mem options, broken in 2.6.2.
 -- Restore default behavior of allocating cores to jobs on a cyclic basis
    across the sockets unless SelectTypeParameters=CR_CORE_DEFAULT_DIST_BLOCK
    or user specifies other distribution options.
 -- Enforce JobRequeue configuration parameter on node failure. Previously
    always requeued the job.
 -- acct_gather_energy/ipmi - Add delay before retry on read error.
 -- select/cons_res with GRES and multiple threads per core, fix possible
    infinite loop.
 -- proctrack/cgroup - Add cgroup create retry logic in case one step is
    starting at the same time as another step is ending and the logic to create
    and delete cgroups overlaps.
 -- Improve setting of job wait "Reason" field.
 -- Correct sbatch documentation and job_submit/pbs plugin: "%j" is job ID,
    not "%J" (which is job_id.step_id).
 -- Improvements to sinfo performance, especially for large numbers of
    partitions.
 -- SlurmdDebug - Permit changes to slurmd debug level with "scontrol reconfig"
 -- smap - Avoid invalid memory reference with hidden nodes.
 -- Fix sacctmgr modify qos set preempt+/-=.
 -- BLUEGENE - fix issue where node count wasn't set up correctly when srun
    performs the allocation, regression in 2.6.3.
 -- Add support for dependencies of job array elements (e.g. 
    "sbatch --depend=afterok:123_4 ...") or all elements of a job array (e.g.
    "sbatch --depend=afterok:123 ...").
 -- Add support for new options in sbatch qsub wrapper:
    -W block=true	(wait for job completion)
    Clear PBS_NODEFILE environment variable
 -- Fixed the MaxSubmitJobsPerUser limit in QOS, which limited submissions
    one job too early.
 -- sched/wiki, sched/wiki2 - Fix to work with changed logic introduced in
    version 2.6.3, which prevented Maui/Moab from starting jobs.
 -- Updated the QOS limits documentation and man page.

* Changes in Slurm 2.6.3
========================
 -- Add support for some new #PBS options in sbatch scripts and qsub wrapper:
    -l accelerator=true|false	(GPU use)
    -l mpiprocs=#	(processors per node)
    -l naccelerators=#	(GPU count)
    -l select=#		(node count)
    -l ncpus=#		(task count)
    -v key=value	(environment variable)
    -W depend=opts	(job dependencies, including "on" and "before" options)
    -W umask=#		(set job's umask)
 -- Added qalter and qrerun commands to torque package.
 -- Corrections to qstat logic: job CPU count and partition time format.
 -- Add job_submit/pbs plugin to translate PBS job dependency options to the
    extent possible (no support for PBS "before" options) and set some PBS
    environment variables.
 -- Add spank/pbs plugin to set a bunch of PBS environment variables.
 -- Backported sh5util from master to 2.6 as there are some important
    bugfixes and the new item extraction feature.
 -- select/cons_res - Correct MaxCPUsPerNode partition constraint for CR_Socket.
 -- scontrol - for setdebugflags command, avoid parsing "-flagname" as an
    scontrol command line option.
 -- Fix issue with step accounting if a job is requeued.
 -- Close file descriptors on exec of prolog, epilog, etc.
 -- Fix issue when a user has held a job and then sets the begin time
    into the future.
 -- Scontrol - Enable changing a job's stdout file.
 -- Fix issues where memory or node count of a srun job is altered while the
    srun is pending.  The step creation would use the old values and possibly
    hang srun since the step wouldn't be able to be created in the modified
    allocation.
 -- Add support for new SchedulerParameters value of "bf_max_job_part", the
    maximum depth the backfill scheduler should go in any single partition.
 -- acct_gather/infiniband plugin - Correct packets_in/out values.
 -- BLUEGENE - Don't ignore a conn-type request from the user.
 -- BGQ - Force a request on a Q for a MESH to be a TORUS in a dimension that
    can only be a TORUS (1).
 -- Change max message length from 100MB to 1GB before generating "Insane
    message length" error.
 -- sched/backfill - Prevent possible memory corruption due to use of
    bf_continue option and long running scheduling cycle (pending jobs could
    have been cancelled and purged).
 -- CRAY - Set AcceleratorAllocation depth correctly for basil 1.3.
 -- Created the environment variable SLURM_JOB_NUM_NODES for srun jobs and
    updated the srun man page.
 -- BLUEGENE/CRAY - Don't set env variables that pertain to a node when Slurm
    isn't doing the launching.
 -- gres/gpu and gres/mic - Do not treat the existence of an empty gres.conf
    file as a fatal error.
 -- Fix parsing of the days-0:min time specification, which was handled
    incorrectly when hours were specified as 0.
 -- switch/nrt - Fix for memory leak.
 -- Subtract the PMII_COMMANDLEN_SIZE in contribs/pmi2/pmi2_api.c to prevent
    certain implementations of snprintf() from segfaulting.
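
    Illustration for the days-0:min time specification entry in this release:
    the Python sketch below parses the time formats listed in the sbatch man
    page (minutes, minutes:seconds, hours:minutes:seconds, days-hours,
    days-hours:minutes and days-hours:minutes:seconds). It is not Slurm's own
    C parser, and the name parse_time_limit is hypothetical; it only shows that
    an hours field of 0 should be treated like any other value.

```python
def parse_time_limit(spec: str) -> int:
    """Return the number of seconds described by a Slurm-style time string.

    Hypothetical helper for illustration only; not part of Slurm.
    """
    days = 0
    if "-" in spec:
        # A day prefix means the remaining fields are hours[:minutes[:seconds]].
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"too many fields in {spec!r}")
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:
            # A bare number is a count of minutes.
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            # Two fields are minutes:seconds.
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"too many fields in {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# One day, zero hours, thirty minutes.
print(parse_time_limit("1-0:30"))
```

    Under these assumptions "1-0:30" yields one day plus thirty minutes; the
    zero-hours field does not change how the remaining fields are read.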
* Changes in Slurm 2.6.2
========================
 -- Fix issue with reconfig and GrpCPURunMins
 -- Fix of wrong node/job state problem after reconfig
 -- Allow users who are coordinators to update their own limits in the
    accounts they are coordinators over.
 -- BackupController - Make sure we have a connection to the DBD first thing
    to avoid it thinking we don't have a cluster name.
 -- Correct value of min_nodes returned by loading job information to consider
    the job's task count and maximum CPUs per node.
 -- If running jobacct_gather/none fix issue on unpacking step completion.
 -- Reservation with CoreCnt: Avoid possible invalid memory reference.
 -- sjstat - Add man page when generating rpms.
 -- Make sure GrpCPURunMins is added when creating a user, account or QOS with
    sacctmgr.
 -- Fix for invalid memory reference due to multiple free calls caused by
    job arrays submitted to multiple partitions.
 -- Enforce --ntasks-per-socket=1 job option when allocating by socket.
 -- Validate permissions of key directories at slurmctld startup. Report
    anything that is world writable.
 -- Improve GRES support for CPU topology. Previous logic would pick CPUs then
    reject jobs that can not match GRES to the allocated CPUs. New logic first
    filters out CPUs that can not use the GRES, next picks CPUs for the job,
    and finally picks the GRES that best match those CPUs.
 -- Switch/nrt - Prevent invalid memory reference when allocating single adapter
    per node of specific adapter type
 -- CRAY - Make Slurm work with CLE 5.1.1
 -- Fix segfault if submitting to multiple partitions and holding the job.
 -- Use MAXPATHLEN instead of the hardcoded value 1024 for maximum file path
    lengths.
 -- If OverTimeLimit is defined do not declare failed those jobs that ended
    in the OverTimeLimit interval.
* Changes in Slurm 2.6.1
========================
 -- slurmdbd - Allow job derived ec and comments to be modified by non-root
    users.
 -- Fix issue with job name being truncated to 24 chars when sending a mail
    message.
 -- Fix minor issues with spec file, missing files and including files
    erroneously on a bluegene system.
 -- sacct - fix --name and --partition options when using
    accounting_storage/filetxt.
 -- squeue - Remove extra whitespace of default printout.
 -- BGQ - added head ppcfloor as an include dir when building.
 -- BGQ - Better debug messages in runjob_mux plugin.
 -- PMI2 Updated the Makefile.am to build a versioned library.
 -- CRAY - Fix srun --mem_bind=local option with launch/aprun.
 -- PMI2 Corrected buffer size computation in the pmi2_api.c module.
 -- GRES accounting data wrong in database: gres_alloc, gres_req, and gres_used
    fields were empty if the job was not started immediately.
 -- Fix sbatch and srun task count logic when --ntasks-per-node specified,
    but no explicit task count.
 -- Corrected the hdf5 profile user guide and the acct_gather.conf
    documentation.
 -- IPMI - Fix Math bug getting new wattage.
 -- Corrected the AcctGatherProfileType documentation in slurm.conf
 -- Corrected the sh5util program to print the header in the csv file
    only once, set the debug messages at debug() level, make the argument
    check case insensitive and avoid printing duplicate \n.
 -- If energy values cannot be collected, send a message to the controller
    to drain the node and log an error in the slurmd log file.
 -- Handle complete removal of CPURunMins time at the end of the job instead
    of at multifactor poll.
 -- sview - Add missing debug_flag options.
 -- PGSQL - Notes about Postgres functionality being removed in the next
    version of Slurm.
 -- MYSQL - fix issue when rolling up usage and events happened when a cluster
    was down (slurmctld not running) during that time period.
 -- sched/wiki2 - Ensure that Moab gets current CPU load information.
 -- Prevent infinite loop in parsing configuration when including a file
    containing one blank line.
 -- Fix pack and unpack between 2.6 and 2.5.
 -- Fix job state recovery logic in which a job's accounting frequency was
    not set. This would result in a value of 65534 seconds being used (the
    equivalent of NO_VAL in uint16_t), which could result in the job being
    requeued or aborted.
 -- Validate a job's accounting frequency at submission time rather than
    waiting for its initiation to possibly fail.
 -- Fix CPURunMins if a job is requeued from a failed launch.
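
    Illustration for the 65534-second accounting frequency entry in this
    release: the Python sketch below shows how a 32-bit "no value" sentinel
    turns into 65534 when stored in a 16-bit unsigned field. It assumes NO_VAL
    is 0xfffffffe, as in slurm.h; the helper as_uint16 is hypothetical and only
    mimics a C assignment to uint16_t.

```python
# Assumption: NO_VAL is the 32-bit sentinel 0xfffffffe, as in slurm.h.
NO_VAL = 0xFFFFFFFE


def as_uint16(value: int) -> int:
    """Mimic a C assignment to uint16_t by keeping only the low 16 bits."""
    return value & 0xFFFF


# An accounting frequency left unset ends up as a large, valid-looking number.
acct_freq = as_uint16(NO_VAL)
print(acct_freq)  # 65534
```

    A frequency of 65534 seconds looks like a legitimate setting, which is why
    the unset value has to be detected explicitly.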