This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins. 

* Changes in SLURM 1.1.0-pre3
=============================
 -- Added framework for XCPU job launch support.
 -- New general configuration file parser and slurm.conf handling code.
    Allows long lines to be continued on the next line by ending with a "\".
    Whitespace is allowed between the key and "=", and between the "=" and
    value.
    WARNING: A NodeName may now occur only once in a slurm.conf file.
             If you want to temporarily make nodes DOWN in the slurm.conf,
             use the new DownNodes keyword (see "man slurm.conf").
 -- Gracefully handle request to submit batch job from within an existing batch job.
 -- Warn user attempting to create a job allocation from within an existing job
    allocation.
 -- Add web page description for proctrack plugin.
 -- Add new function slurm_job_warn() to notify when a job's time limit
    approaches (not yet fully implemented).
 -- JobAcct plugin renamed from "log" to "linux" in preparation for support of 
    new system types. 
    WARNING: "JobAcctType=jobacct/log" is no longer supported.
 -- Removed vestigial 'bg' names from bluegene plugin and smap
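
The continuation and whitespace rules described in the slurm.conf parser
entry above can be illustrated with a minimal Python sketch. This is not
SLURM's parser (which is written in C); the names logical_lines and
parse_conf_text and the sample keys Alpha and Beta are hypothetical, and
the sketch assumes one key per logical line.

```python
def logical_lines(text):
    """Join physical lines that end with a backslash into one logical line."""
    buf = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            # Continuation: drop the backslash and keep accumulating.
            buf += line[:-1]
            continue
        yield buf + line
        buf = ""
    if buf:
        # A trailing continuation with nothing after it still counts.
        yield buf


def parse_conf_text(text):
    """Return (key, value) pairs, allowing whitespace around the '='."""
    pairs = []
    for line in logical_lines(text):
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in line: {line!r}")
        pairs.append((key.strip(), value.strip()))
    return pairs


# Hypothetical input: one key with spaces around '=', one continued line.
sample = "Alpha = one\nBeta=two \\\nthree\n"
print(parse_conf_text(sample))
```

Joining continued lines before splitting on "=" keeps the two rules
independent of each other, which is the only design point the sketch is
meant to show.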

* Changes in SLURM 1.1.0-pre2
=============================
 -- Added basic "sbcast" support, still needs message fanout logic.
 -- Bluegene specific - Added support for overlapping partitions and 
    dynamic partitioning. 
 -- Bluegene specific - Added support for nodecard sized blocks.
 -- Added logic to accept 1k for 1024 and so on for --nodes option of srun.
    This notation is carried through display tools such as smap, sinfo,
    scontrol, and squeue.
 -- Added bluegene.conf man page.
 -- Added support for memory affinity, see srun --mem_bind option.
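
The "1k for 1024" notation in the --nodes entry above can be sketched in
Python as follows. The function names parse_count and format_count are
hypothetical, and two behaviours are assumptions rather than documented
SLURM behaviour: that the suffix is case-insensitive, and that display
tools abbreviate only exact multiples of 1024.

```python
def parse_count(text):
    """Turn '1k' into 1024, '2k' into 2048; plain digits pass through."""
    text = text.strip().lower()  # assumption: 'K' is treated like 'k'
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


def format_count(n):
    """Inverse for display; assumption: only exact multiples get a 'k'."""
    if n >= 1024 and n % 1024 == 0:
        return f"{n // 1024}k"
    return str(n)


print(parse_count("16k"), format_count(2048))
```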

* Changes in SLURM 1.1.0-pre1
=============================
 -- New --enable-multiple-slurmd configure parameter to allow running
    more than one copy of slurmd on a node at the same time.  Only
    really useful for developers.
 -- Communication from slurmctld and the srun launch command to the slurmds
    now branches out across all processes using a tree-type algorithm.  Spawn
    and batch mode work the same as before.  New slurm.conf variable TreeWidth
    (default 50) sets the number of threads per stop on the tree.
 -- Configuration parameter HeartBeatInterval is deprecated. Now use half
    of SlurmdTimeout and SlurmctldTimeout for communications to slurmd and
    slurmctld daemons respectively.
 -- Add hash tables for select/cons_res plugin (Susanne Balle, HP, 
    patch_02222006).
 -- Remove some use of cr_enabled flag in slurmctld job record, use 
    new flag "test_only" in select_g_job_test() instead.
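
One plausible reading of the tree-type fanout in the TreeWidth entry above
is sketched below in Python. This is an assumption about the general
technique, not a description of SLURM's code: the sender splits the node
list into at most `width` branches, contacts the first node of each
branch, and that node forwards to the rest of its branch in the same way.
The names fanout_plan and levels are hypothetical.

```python
def fanout_plan(nodes, width=50):
    """Split nodes into at most `width` contiguous branches.

    Returns (head, rest) tuples: the sender contacts each head, and the
    head is responsible for forwarding to `rest`.
    """
    if not nodes:
        return []
    branches = min(width, len(nodes))
    base, extra = divmod(len(nodes), branches)
    plan = []
    start = 0
    for i in range(branches):
        # The first `extra` branches take one more node each.
        size = base + (1 if i < extra else 0)
        chunk = nodes[start:start + size]
        plan.append((chunk[0], chunk[1:]))
        start += size
    return plan


def levels(nodes, width=50):
    """Number of forwarding hops needed to reach every node."""
    if not nodes:
        return 0
    return 1 + max(levels(rest, width) for _, rest in fanout_plan(nodes, width))


names = [f"n{i}" for i in range(7)]
print(fanout_plan(names, width=3))
print(levels(names, width=3))
```

With a fixed width, the number of hops grows roughly with the logarithm
of the node count, which is the usual motivation for a tree over a flat
broadcast.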

* Changes in SLURM 1.0.7
========================
 -- Change in how AuthType=auth/dummy is handled for security testing.
 -- Fix for bluegene systems to allow full system partitions to stay booted 
    when other jobs are submitted to the queue.

* Changes in SLURM 1.0.6
========================
 -- Prevent slurmstepd from crashing when srun attaches to batch job.

* Changes in SLURM 1.0.5
========================
 -- Restructure logic for scheduling BlueGene small block jobs. Added
    "test_only" flag to select_p_job_test() in select plugin.
 -- Correct squeue "NODELIST" output for BlueGene small block jobs.
 -- Fix possible deadlock situations on BlueGene plugin on errors.

* Changes in SLURM 1.0.4
========================
 -- Release job allocation if step creation fails (especially for BlueGene).
 -- Fix bug in select/bluegene warm start with changed bglblock layout.
 -- Fix bug for queuing full-system BlueGene jobs.

* Changes in SLURM 1.0.3
========================
 -- Fix bug that could refuse to queue batch jobs for BlueGene system.
 -- Add BlueGene plugin mutex lock for reconfig.
 -- Ignore BlueGene bgljobs in ERROR state (don't try to kill).
 -- Fix job accounting for batch jobs (Andy Riebs, HP, 
    slurm.hp.jobacct_divby0a.patch).
 -- Added proctrack/linuxproc.so to the main RPM.
 -- Added mutex around bridge api file to avoid locking up the api.
 -- BlueGene mod: Terminate slurm_prolog and slurm_epilog immediately if 
    SLURM_JOBID environment variable is invalid.
 -- Federation driver: allow selection of a specific switch interface
    (sni0, sni1, etc.) with -euidevice/MP_EUIDEVICE.
 -- Return an error for "scontrol reconfig" if there is already one in
    progress

* Changes in SLURM 1.0.2
========================
 -- Correctly report DRAINED node state as type OTHER for "sinfo --summarize".
 -- Fixes in sacct use of malloc (Andy Riebs, HP, sacct_malloc.patch).
 -- Smap mods: eliminate screen flicker, fix window resize, report clearer
    message if window too small (Dan Palermo, HP, patch.1.0.0.1.060126.smap).
 -- Sacct mods for inconsistent records (race condition) and replace --debug
    option with --verbose (Andy Riebs, HP, slurm.hp.sacct_exp_vvv.patch).
 -- scancel of a job step will now send a job-step-completed message
    to the controller after verifying that the step has completed on all nodes.
 -- Fix task layout bug in srun.
 -- Added times to node "Reason" field when set down for insufficient 
    resources or if not responding.
 -- Validate operation with Elan switch and heterogeneous nodes.

* Changes in SLURM 1.0.1
========================
 -- Assorted updates and clarifications in documentation.
 -- Detect which munge installation to use (32-bit or 64-bit).

* Changes in SLURM 1.0.0
========================
 -- Fix sinfo filtering bug, especially "sinfo -R" output.
 -- Fix node state change bug, resuming down or drained nodes.
 -- Fix "scontrol show config" to display JobCredentialPrivateKey instead
    of JobCredPrivateKey and JobCredentialPublicCertificate instead of
    JobCredPublicKey.  They now match the options in the slurm.conf.
 -- Fix bug in job accounting for very long node list records (Andy Riebs,
    HP, sacct_buf.patch).
 -- BLUEGENE SPECIFIC - added load function to smap to load an already
    existing bluegene.conf file.
 -- Fix bug in sacct: If user requests specific job or job step ID,
    only the last one with that ID will be reported. If multiple 
    nodes fail, the job has its state recorded as "JOB_TERMINATED...nf"
    (Andy Riebs, HP, slurm.hp.sacct_dup.patch).
 -- Fix some inconsistencies in sacct's help message (Andy Riebs, HP, 
    slurm.hp.sacct_help.patch).
 -- Validate input to sacct command and allow embedded spaces in
    arguments (Andy Riebs, HP, slurm.hp.sacct_validate.patch).

* Changes in SLURM 0.7.0-pre8
=============================
 -- BGL specific -- bug fix for smap configure function down configuration
 -- Add support for job suspend/resume.
 -- Add slurmd cache for group IDs (Takao Hatazaki, HP).
 -- Fix bug in processing of "#SLURM" batch script option parsing.

* Changes in SLURM 0.7.0-pre7
=============================
 -- Fix issue with NODE_STATE_COMPLETING, could start job on node before
    epilog completed.
 -- Added some infrastructure for job suspend/resume (scontrol, api, and 
    slurmctld stub).
 -- Set job's num_procs to the actual processor count allocated to the job.
 -- Fix bug in HAVE_FRONT_END support for cluster emulation.

* Changes in SLURM 0.7.0-pre6
=============================
 -- Added support for task affinity for binding tasks to CPUs (Daniel
    Palermo, HP).
 -- Integrate task affinity support with configuration, add validation 
    test.

* Changes in SLURM 0.7.0-pre5
=============================
 -- Enhanced performance and debugging for slurmctld reconfiguration.
 -- Add "scontrol update Jobid=# Nice=#" support.
 -- Basic slurmctld and tool functionality validated to 16k nodes.
 -- squeue and smap now display correct info for jobs in bluegene environment.
 -- Fix setting of SLURM_NODELIST for batch jobs.
 -- Add SubmitTime to job information available for display.
 -- API function slurm_confirm_allocation() has been marked OBSOLETE
    and will go away in some future version of SLURM.  Use
    slurm_allocation_lookup() instead.
 -- New API calls slurm_signal_job and slurm_signal_job_step to send
    signals directly to the slurmds without triggering the shutdown sequence.
 -- remove "uid" from old_job_alloc_msg_t, no longer needed.
 -- Several bug fixes in maui scheduler plugin from Dave Jackson
    (Cluster Resources).

* Changes in SLURM 0.7.0-pre4
=============================
 -- Remove BNR library functions and add those for PMI (KVS and basic
    MPI-1 functions only for now)
 -- Added Hostfile support for POE and srun.  MP_HOSTFILE env var to set
    location of hostfile.  Tasks will run from list order in the file.  
 -- Removes the slurmd's use of SysV shared memory.  Instead the slurmd
    communicates with the slurmstepd processes through the slurmstepd's
    new named unix domain socket.  The "stepd_api" is used to talk to the
    slurmstepd (src/slurmd/common/stepd_api.[ch]).
 -- Bluegene specific - bluegene block allocator will find most any 
    partition size now.  Added support to start at any point in smap 
    to request a partition instead of always starting at 000.
 -- Bluegene specific - Added support in smap to down or bring up nodes in
    configure mode.  Added commands include allup, alldown,
    up [range], down [range]
 -- Time format in sinfo/squeue/smap/sacct changed from D:HH:MM:SS to 
    D-HH:MM:SS per POSIX standards document.
 -- Treat scontrol update request without any requested changes as an 
    error condition.
 -- Bluegene plugin renamed with BG instead of BGL.  partition_allocator moved
    into bluegene plugin and renamed block_allocator.  Format for bluegene.conf
    file changed also; see the bluegene html page.  Code is backwards
    compatible; smap will generate the new format.
 -- Add srun option --nice to give user some control over job priority.
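
The D-HH:MM:SS time format mentioned in the entry above can be produced
from a number of seconds as in the following Python sketch. The function
name format_elapsed is hypothetical, and always printing the day field
(even when it is zero) is an assumption of this sketch, not a statement
about what sinfo, squeue, smap, or sacct print.

```python
def format_elapsed(seconds):
    """Format a duration in seconds as D-HH:MM:SS."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    # The day field is separated by '-' rather than ':'.
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_elapsed(93784))  # 1-02:03:04
```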

* Changes in SLURM 0.7.0-pre3
=============================
 -- Restructure node states: DRAINING and DRAINED states are replaced 
    with a DRAIN flag. COMPLETING state is changed to a COMPLETING flag. 
 -- Test suite moved into testsuite/expect from separate repository.
 -- Added new document describing slurm APIs (doc/html/api.html).
 -- Permit nodes to be in multiple partitions simultaneously.
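
The node state restructuring described above (a base state plus DRAIN and
COMPLETING flags, in place of separate DRAINING, DRAINED, and COMPLETING
states) can be modelled with a small Python sketch. The class names, the
numeric values, and the rule used to derive the old display names are all
assumptions made for illustration; SLURM's actual constants live in its
C headers and are not reproduced here.

```python
from enum import Enum, Flag


class BaseState(Enum):
    # Hypothetical values, for illustration only.
    DOWN = 1
    IDLE = 2
    ALLOCATED = 3


class NodeFlag(Flag):
    # Flags combine with any base state instead of replacing it.
    NONE = 0
    DRAIN = 1
    COMPLETING = 2


def describe(base, flags=NodeFlag.NONE):
    """Derive a display name from a base state plus flags.

    Assumption: a draining node that is still busy reads as DRAINING,
    otherwise as DRAINED.
    """
    if NodeFlag.DRAIN in flags:
        busy = base is BaseState.ALLOCATED or NodeFlag.COMPLETING in flags
        return "DRAINING" if busy else "DRAINED"
    if NodeFlag.COMPLETING in flags:
        return "COMPLETING"
    return base.name


print(describe(BaseState.ALLOCATED, NodeFlag.DRAIN))
print(describe(BaseState.IDLE, NodeFlag.DRAIN))
```

Keeping the flags separate from the base state means that clearing DRAIN
leaves the underlying state untouched.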

* Changes in SLURM 0.7.0-pre2
=============================
 -- New stdio protocol.  Now srun has just a single TCP stream to each node
    of a job-step.  srun and slurmd communicate over the TCP stream using a
    simple messaging protocol.
 -- Added task plugin and use task prolog/epilog(s).
 -- New slurmd_step functionality added.  Fork exec instead of using shared
    memory.  Not completely tested.
 -- BGL small partition logic in place in plugin and smap.  Scheduler needs  
    to be rewritten to handle multiple partitions on a single node. No 
    documentation written on process yet.
 -- If running select/bluegene plugin without access to BGL DB2, then 
    full-system bglblock is of system size defined in bluegene.conf.

* Changes in SLURM 0.7.0-pre1
=============================
 -- Support deferred initiation of job (e.g. srun --begin=11:30 ...).
 -- Add support for srun --cpus-per-task through task allocation in 
    slurmctld.
 -- Fixed partition_allocator to work without curses
 -- Made change to srun to start message thread before other threads
    to make sure localtime doesn't interfere.
 -- Added new RPCs for slurmctld REQUEST_TERMINATE_JOB or TASKS, 
    REQUEST_KILL_JOB/TASKS changed to REQUEST_SIGNAL_JOB/TASKS.
 -- Add support for e-mail notification on job state changes.
 -- Some infrastructure added for task launch controls (slurm.conf:
    TaskProlog, TaskEpilog, TaskPlugin; srun --task-prolog, --task-epilog).

* Changes in SLURM 0.6.11
=========================
 -- Fix bug in sinfo partition sorting order.
 -- Fix bugs in srun use of #SLURM options in batch script.
 -- Use full Elan credential space rather than re-using credentials as soon 
    as job step completes (helps with fault-tolerance).

* Changes in SLURM 0.6.10
=========================
 -- Fix for slurmd job termination logic (could hang in COMPLETING state).
 -- Sacct bug fixes: Report correct user name for job step, show "uid.gid"
    as fifth field of job step record (Andy Riebs, slurm.hp.sacct_uid.patch).
 -- Add job_id to maui scheduler plugin start job status message.
 -- Fix for srun's handling of null characters in stdout or stderr.
 -- Update job accounting for larger systems (Andy Riebs, uptodate.patch).
 -- Fixes for proctrack/linuxproc and mpich-gm support (Takao Hatazaki, HP).
 -- Fix bug in switch/elan for large task count job having irregular task 
    distribution across nodes.

* Changes in SLURM 0.6.9
========================
 -- Fix bug in mpi plugin to set the ID correctly
 -- Accounting bug causing segv fixed (Andy Riebs, 14oct.jobacct.patch)
 -- Fix for failed launch of a debugged job (e.g. bad executable name).
 -- Wiki plugin fix for tracking allocated nodes (Ernest Artiaga, BSC).
 -- Fix memory leaks in slurmctld and federation plugin.
 -- Fix segfault in federation plugin function fed_libstate_clear().
 -- Align job accounting data (Andy Riebs, slurm.hp.unal_jobacct.patch)
 -- Restore switch state in backup controller restarts

* Changes in SLURM 0.6.8
========================
 -- Prevent invalid AllowGroup value in slurm.conf from causing seg fault.
 -- Fix bug that would cause slurmctld to seg-fault with select/cons_res
    and batch job containing more than one step.

* Changes in SLURM 0.6.7
========================
 -- Make proctrack/linuxproc thread safe, could cause slurmd seg fault.
 -- Propagate umask from srun to spawned tasks.
 -- Fix problem in switch/elan error handling that could hang a slurmd 
    step manager process.
 -- Build on AIX with -bmaxdata:0x70000000 for memory limit more than 256MB.