This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.
* Changes in SLURM 1.2.0-pre6
=============================
-- Maintain actual job step run time with suspend/resume use.
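As a rough illustration of run-time accounting under suspend/resume, the sketch below subtracts suspended intervals from wall-clock time. It is an assumption-laden Python sketch, not SLURM's C implementation; the function name and the (suspend, resume) tuple layout are hypothetical.

```python
def actual_run_time(start, now, suspend_periods):
    """Return seconds a step actually ran, excluding suspended time.

    suspend_periods is a list of (suspend_time, resume_time) pairs
    (hypothetical layout); resume_time is None while still suspended.
    """
    suspended = 0
    for suspend_time, resume_time in suspend_periods:
        # An open interval counts as suspended up to the current time.
        end = now if resume_time is None else resume_time
        suspended += end - suspend_time
    return (now - start) - suspended


# Step started at t=100, suspended from 120 to 150, queried at t=200.
print(actual_run_time(100, 200, [(120, 150)]))
```

The same subtraction applies when extending a termination time for a job that was previously suspended.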
* Changes in SLURM 1.2.0-pre5
=============================
-- Patch from HP patch.1.2.0.pre4.061017.crcore_hints, supports cores as
consumable resource.
-- Added node_inx to job_step_info_t to get the node indices for mapping out
steps in a job by nodes.
-- sview grid added
-- BLUEGENE node_inx added to blocks for reference.
-- Automatic CPU_MASK generation for task launch, new srun option -B.
-- Automatic logical to physical processor identification and mapping.
-- Added new srun options to --cpu_bind: sockets, cores, and threads
-- Updated select/cons_res to operate at socket granularity.
-- New srun task distribution options to -m: plane
-- Multi-core support in sinfo, squeue, and scontrol.
-- Memory can be treated as a consumable resource.
-- New srun options --ntasks-per-[node|socket|core].
* Changes in SLURM 1.2.0-pre3
=============================
-- Remove configuration parameter SchedulerAuth (defunct).
-- Add NextJobId to "scontrol show config" output.
-- Add new slurm.conf parameter MailProg.
-- New forwarding logic. New receive_msg functions depending on what you
are expecting to get back. srun_node_id is no longer passed around in
a slurm_msg_t.
-- Remove sched/wiki plugin (use sched/wiki2 for now)
-- Disable pthread_create() for PMI_send when TotalView is running for
better performance.
-- Fixed certain tests in test suite to not run with bluegene or front-end
systems
-- Removed addresses from slurm_step_layout_t
-- Added new job field, "comment". Set by srun, salloc and sbatch. See
with "scontrol show job". Used in sched/wiki2.
-- Report a job's exit status in "scontrol show job".
-- In sched/wiki2: add support for JOBREQUEUE command.
* Changes in SLURM 1.2.0-pre2
=============================
-- Added function slurm_init_slurm_msg to be used to init any slurm_msg_t.
You no longer need to do any other type of initialization to the type.
-- Fixed task dist to work with hostfile and warn about asking for more tasks
than you have nodes for in arbitrary mode.
-- Added "account" field to job and step accounting information and sacct output.
-- Moved task layout to slurmctld instead of srun. Job step create returns
step_layout structure with hostnames and addresses that correspond
to those nodes.
-- Changed API slurm_lookup_allocation params:
resource_allocation_response_msg_t changed to job_alloc_info_response_msg_t.
The structure is only being renamed, so its contents are the same.
-- alter resource_allocation_response_msg_t see slurm.h.in
-- remove old_job_alloc_msg_t and function slurm_confirm_alloc
-- Slurm configuration files now support an "Include" directive to
include other files inline.
-- BLUEGENE New --enable-bluegene-emulation configure parameter to allow
running system in bluegene emulation mode. Only
really useful for developers.
-- Added new tool sview, a GUI for displaying slurm info.
-- fixed bug in step layout to lay out tasks correctly
* Changes in SLURM 1.2.0-pre1
=============================
-- Fix bug that could run a job's prolog more than once
-- Permit batch jobs to be requeued, scontrol requeue <jobid>
-- Send overcommit flag from srun in RPCs and have slurmd set SLURM_OVERCOMMIT
flag at batch job launch time.
-- Added new configuration parameter MessageTimeout (replaces #define in
the code)
* Changes in SLURM 1.1.16
=========================
- BLUEGENE - fix to make prolog run 5 minutes longer to make sure we have
enough time to free the overlapping blocks when starting a new job on a
block.
- BLUEGENE - edit to the libsched_if.so to read env and look at
MPIRUN_PARTITION to see if we are in slurm or running mpirun natively.
- Plugins are now dlopened RTLD_LAZY instead of RTLD_NOW.
* Changes in SLURM 1.1.15
=========================
- BLUEGENE - fix to be able to create static partitions
- Fixed fanout timeout logic.
- Fix for slurmctld timeout on outgoing message (Hongjia Cao, NUDT.edu.cn).
* Changes in SLURM 1.1.14
=========================
- In sched/wiki2: report job/node id and state only if no changes since
time specified in request.
- In sched/wiki2: include a job's exit code in job state information.
- In sched/wiki2: add event notification logic on job submit and completion.
- In sched/wiki2: add support for JOBWILLRUN command type.
- In sched/wiki2: for job info, include required HOSTLIST if applicable.
- In sched/wiki2: for job info, replace PARTITIONMASK with RCLASS (report
partition name associated with a job, but no task count)
- In sched/wiki2: for job and node info, report all data if TS==0,
volatile data if TS<=update_time, state only if TS>update_time
- In sched/wiki2: add support for CMD=JOBSIGNAL ARG=jobid SIGNAL=name or #
- In sched/wiki2: add support for CMD=JOBMODIFY ARG=jobid [BANK=name]
[TIMELIMIT=minutes] [PARTITION=name]
- In sched/wiki2: add support for CMD=INITIALIZE ARG=[USEHOSTEXP=T|F]
[EPORT=#]; RESPONSE=EPORT=# USEHOSTEXP=T
- In sched/wiki2: fix memory leak.
- Fix sinfo node state filtering when asking for idle nodes that are also
draining.
- Add Fortran extension to slurm_get_rem_time() API.
- Fix bug when changing the time limit of a running job that has previously
been suspended (formerly failed to account for suspend time in setting
termination time).
- Fix for step allocation to be able to specify only a few nodes in a
step and ask for more than specified.
- patch from Hongjia Cao for forwarding logic
- BLUEGENE - able to allocate specific nodes without locking up.
- BLUEGENE - better tracking of blocks that are created dynamically,
less hitting the db2.
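The sched/wiki2 timestamp rule listed above (TS compared against update_time) can be sketched as follows. This is only an illustrative Python sketch, not the plugin's C code; the function name and the returned labels are hypothetical.

```python
def wiki_report_level(ts, update_time):
    """Pick how much job/node data to report for a request timestamp.

    ts is the TS value from the request; update_time is when the record
    last changed. The returned labels are illustrative only.
    """
    if ts == 0:
        return "all"        # TS==0: report all data
    if ts <= update_time:
        return "volatile"   # record changed since TS: report volatile data
    return "state"          # no changes since TS: report id and state only


print(wiki_report_level(0, 500))
print(wiki_report_level(400, 500))
print(wiki_report_level(600, 500))
```

Checking TS==0 first matters, since zero would otherwise also satisfy TS<=update_time.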
* Changes in SLURM 1.1.13
=========================
- Fix hang in sched/wiki2 if Moab stops responding when
response is outgoing.
- BLUEGENE - fix to make sure the block is good to go when picking it
- BLUEGENE - add libsched_if.so so mpirun doesn't try to create a block
by itself.
- Enable specification of srun --jobid=# option with --batch (for user root).
- Verify that job actually starts when requested by sched/wiki2.
- Add new wiki.conf parameters: EPort and JobAggregationTime for event
notification logic (see wiki.conf man page for details)
* Changes in SLURM 1.1.12
=========================
- Sched/wiki2 to report a job's account as COMMENT response to GETJOBS
request.
- Add srun option "--comment" (maps to job account until slurm v1.2,
needed for Moab scheduler functionality).
- fixed some timeout issues in the controller hopefully stopping all the
issues with excessive timeouts.
- unit conversion (i.e. 1024 => 1k) only happens on bgl systems for node
count.
- Sched/wiki2 to report a job's COMPLETETIME and SUSPENDTIME in GETJOBS
response.
- Added support for Mellanox's version of mvapich-0.9.7.
* Changes in SLURM 1.1.11
=========================
- Update file headers adding permission to link with OpenSSL.
- Enable sched/wiki2 message authentication.
- Fix libpmi compilation issue.
- Remove "gcc-c++ python" from slurm.spec BuildRequires. It breaks
the AIX build, so we'll have to find another way to deal with that.
* Changes in SLURM 1.1.10
=========================
-- task distribution fix for steps that are smaller than job allocation.
-- BLUEGENE - fix to only send a success when block was created when trying
to allocate the block.
-- fix so if slurm_send_recv_node_msg fails on the send the auth_cred returned
by the resp is NULL.
-- Fix switch/federation plugin so backup controller can assume control
repeatedly without leaking or corrupting memory.
-- Add new error code (for Maui/Moab scheduler): ESLURM_JOB_HELD
-- Tweak slurmctld's node ping logic to better handle failed nodes with
hierarchical communications fail-over logic.
-- Add support for sched/wiki specific configuration file "wiki.conf".
-- Added sched/wiki2 plugin (new experimental wiki plugin).
* Changes in SLURM 1.1.9
========================
-- BLUEGENE - fix to handle a NO_VAL sent in as num procs in the job
description.
-- Fix bug in slurmstepd code for parsing --multi-prog command script.
Parser was failing for commands with no arguments.
-- Fix bug to check unsigned ints correctly in bitstring.c
-- Alter node count conversion to kilo to only convert numbers divisible by
1024 or 512
* Changes in SLURM 1.1.8
========================
-- Added bug fixes (fault-tolerance and memory leaks) from Hongjia Cao
<hjcao@nudt.edu.cn>
-- Fixed some potential BLUEGENE issues with the bridge log file not having
a mutex around the fclose and fopen.
-- BLUEGENE - srun -n procs now registers correctly
-- Fixed problem with reattach double allocating step_layout->tids
-- BLUEGENE - fix race condition where job is finished before it starts.
* Changes in SLURM 1.1.7
========================
-- BLUEGENE - fixed issue with doing an allocation for nodes since asking
for 32,128, or 512 all mean 1 to the controller.
-- Add "Include" directive to slurm.conf files. If "Include" is found
at the beginning of a line followed by whitespace and then
the full path to a file, that file is included inline with the current
slurm.conf file.
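The "Include" behavior described above can be illustrated with a small Python sketch that expands such lines inline. It is not SLURM's parser: the function name is hypothetical, matching is case-sensitive here by assumption, and the recursion guard is an addition not mentioned in the source.

```python
import os


def expand_includes(path, _seen=frozenset()):
    """Return the lines of a config file with Include lines expanded inline."""
    real = os.path.realpath(path)
    if real in _seen:
        # Guard against a file including itself (assumption, not from SLURM).
        raise ValueError("recursive Include of " + path)
    seen = _seen | {real}
    lines = []
    with open(path) as handle:
        for line in handle:
            parts = line.split(None, 1)
            # "Include" must start the line, followed by whitespace and a path.
            if (line.startswith("Include") and len(parts) == 2
                    and parts[0] == "Include"):
                lines.extend(expand_includes(parts[1].strip(), seen))
            else:
                lines.append(line.rstrip("\n"))
    return lines
```

With a main file containing a line such as `Include /etc/slurm/nodes.conf` (a hypothetical path), the returned list holds the included file's lines at the position of that directive.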
* Changes in SLURM 1.1.6
========================
-- Improved task layout for relative positions
-- Fixed heterogeneous cpu overcommit issue
-- Fix bug where srun would hang if it ran on one node and that
node's slurmd died
-- Fix bug where srun task layout would be bad when min-max node range is
specified (e.g. "srun -N1-4 ...")
-- Made slurmctld_conf.node_prefix only be set on Bluegene systems.
-- Fixed a race condition in the controller to make it so a plugin thread
wouldn't be able to access the slurmctld_conf structure before it was
filled.
* Changes in SLURM 1.1.5
========================
-- Ignore partition's MaxNodes for SlurmUser and root.
-- Fix possible memory corruption with use of PMI_KVS_Create call.
-- Fix race condition with multiple PMI_KVS_Barrier calls.
-- Fix logic in which slurmctld outgoing RPC requests could get delayed.
-- Fix logic for laying out steps without a hostlist.
* Changes in SLURM 1.1.4
========================
-- Improve error handling in hierarchical communications logic.
* Changes in SLURM 1.1.3
========================
-- Fix big-endian bug in the bitstring code which plagued AIX.
-- Fix bug in handling srun's --multi-prog option, could go off end of buffer.
-- Added support for job step completion (and switch window release) on
subset of allocated nodes.
-- BLUEGENE - removed configure option --with-bg-link. The bridge is now linked
with dlopen, so fake database .so files are no longer needed on the frontend
node.
-- BLUEGENE - implemented use of rm_get_partition_info instead of
...partitions_info which has made a much better design improving stability.
-- Streamline PMI communications and increase timeouts for highly parallel
jobs. Improves scalability of PMI.
* Changes in SLURM 1.1.2
========================
-- Fix bug in jobcomp/filetxt plugin to report proper NodeCnt when a job
fails due to a node failure.
-- Fix Bluegene configure to work with the new 64bit libs.
-- Fix bug in controller that causes it to segfault when hit with a malformed
message.
-- For "srun --attach=X" to another user's job, report an error and exit (it
previously just hung).
-- BLUEGENE - fix for doing correct small block logic on user error.
-- BLUEGENE - Added support in slurmd to create a fake libdb2.so if it
doesn't exist so smap won't seg fault
-- BLUEGENE - "scontrol show job" reports "MaxProcs=None" and "Start=None"
if values are not specified at job submit time
-- Add retry logic for PMI communications, may be needed for highly parallel
jobs.
-- Fix bug in slurmd where variable is used in logging message after freed
(slurmstepd rank info).
-- Fix bug in "scontrol show daemons": it now works when NodeName=localhost,
displaying slurmd as running where it actually is.
-- Patch from HP for init nodes before init_bitmaps
-- ctrl-c killed sruns will result in job state as cancelled instead of
completed.
-- BLUEGENE - added configure option --with-bg-link to choose dynamic linking
or static linking with the bridgeapi.
* Changes in SLURM 1.1.1
========================
-- Fix bug in packing job suspend/resume RPC.
-- If a user breaks out of srun before the allocation takes place, mark the
job as CANCELLED rather than COMPLETED and change its start and end time
to that time.
-- Fix bug in PMI support that prevented use of second PMI_Barrier call.
This fix is needed for MVAPICH2 use.
-- Add "-V" options to slurmctld and slurmd to print version number and exit.
-- Fix scalability bug in sbcast.
-- Fix bug in cons_res allocation strategy.
-- Fix bug in forwarding with mpi
-- Fix bug in sacct forwarding with stat option
-- Added nodeid to sacct stat information
-- Cleaned up the way slurm_send_recv_node_msg works; it no longer clears errno
-- Fix error handling bug in the networking code that causes the slurmd to
xassert if the server is not running when the slurmd tries to register.
* Changes in SLURM 1.1.0
========================
-- Fix bug that could temporarily make nodes DOWN when they are really