This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 14.11.2
==========================
 -- Fix Centos5 compile errors.
 -- Fix issue with association hash not getting the correct index which
    could result in seg fault.
 -- Fix salloc/sbatch -B segfault.
 -- Avoid huge malloc if GRES configured with "Type" and huge "Count".
 -- Prevent jobs in overlapping reservations from starting if they won't finish
    before a "maint" reservation begins.
* Changes in Slurm 14.11.1
==========================
 -- Get libs correct when doing the xtree/xhash make check.
 -- Update xhash/tree make check to work correctly with current code.
 -- Remove the reference 'experimental' for the jobacct_gather/cgroup
    plugin.
 -- Add QOS manipulation examples to the qos.html documentation page.
 -- If 'squeue -w node_name' specifies an unknown host name print
    an error message and return 1.
 -- Fix race condition in job_submit plugin logic that could cause slurmctld to
    deadlock.
 -- Job wait reason of "ReqNodeNotAvail" expanded to identify unavailable nodes
    (e.g. "ReqNodeNotAvail(Unavailable:tux[3-6])").
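A script that reads the expanded "ReqNodeNotAvail" reason may want the
individual node names. The following is a minimal sketch, not Slurm's own
hostlist code: expand_hostlist() and unavailable_nodes() are hypothetical
helpers, and they assume a single host expression with one bracket group.

```python
import re


def expand_hostlist(expr):
    """Expand a simple expression such as 'tux[3-6]' into host names.

    Assumption: one bracket group holding comma-separated values and
    ranges. Real Slurm hostlist syntax is richer than this.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # plain host name, nothing to expand
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low
        width = len(low)  # preserve zero padding, e.g. tux[01-03]
        for n in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


def unavailable_nodes(reason):
    """Return the nodes named in a 'ReqNodeNotAvail(Unavailable:...)' reason."""
    match = re.fullmatch(r"ReqNodeNotAvail\(Unavailable:(.+)\)", reason)
    return expand_hostlist(match.group(1)) if match else []
```

For the reason string quoted above, unavailable_nodes() yields tux3 through
tux6; a reason without the parenthesised part yields an empty list.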
* Changes in Slurm 14.11.0
==========================
 -- ALPS - Fix issue with core_spec warning.
 -- Allow multiple partitions to be specified in sinfo -p.
 -- Install the service files in /usr/lib/systemd/system.
 -- MYSQL - Add id_array_job and id_resv keys to $CLUSTER_job_table.  THIS
    COULD TAKE A WHILE TO CREATE THE KEYS SO BE PATIENT.
 -- CRAY - Resize bitmaps if, on a restart, we find we have more blades than
    before.
 -- Add new eio API function for removing unused connections.
 -- ALPS - Fix issue where batch allocations weren't correctly confirmed or
    released.
 -- Define DEFAULT_MAX_TASKS_PER_NODE based on MAX_TASKS_PER_NODE from
    slurm.h as per documentation.
 -- Update the FAQ about relocating slurmctld.
 -- In the memory cgroup enable memory.use_hierarchy in the cgroup root.
 -- Export eio.c functions for use by MPI/PMI2.
 -- Add SLURM_CLUSTER_NAME to job environment.
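A job can read the new SLURM_CLUSTER_NAME variable like any other environment
variable. A minimal sketch follows; the cluster_name() helper and its
"unknown" fallback are illustrative assumptions, not part of Slurm.

```python
import os


def cluster_name(env=None, default="unknown"):
    """Return the cluster name Slurm exported to the job environment.

    Falls back to `default` when the variable is absent, for example when
    the script runs outside a job or under an older Slurm version.
    """
    env = os.environ if env is None else env
    return env.get("SLURM_CLUSTER_NAME", default)
```

Calling cluster_name() with no arguments inspects the real environment;
passing a dictionary makes the helper easy to exercise outside a job.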
* Changes in Slurm 14.11.0rc3
=============================
 -- Allow envs to override autotools binaries in autogen.sh
 -- Added system services files.
 -- If a job pends with DependencyNeverSatisfied, keep it pending even after
    the job which it was depending upon was cleaned.
 -- Let operators (in addition to user root and SlurmUser) see job script for
    other users' jobs.
 -- Perl API modified to return node state of MIXED rather than ALLOCATED if
    only some CPUs allocated.
 -- Double Munge connect retry timeout from 1 to 2 seconds.
 -- sview - Remove unneeded code that was resolved globally in commit
    98e24b0dedc.
 -- Collect and report the accounting of the batch step and its children.
 -- Add configure checks for faccessat and eaccess, and make use of one of
    them if available.
 -- Make configure --enable-developer also set --enable-debug
 -- Introduce a SchedulerParameters variable kill_invalid_depend, if set
    then jobs pending with invalid dependency are going to be terminated.
 -- Move spank_user_task() call in slurmstepd after the task_g_pre_launch()
    so that the task affinity information is available to spank.
 -- Make /etc/init.d/slurm script return value 3 when the daemon is
    not running. This is required by Linux Standard Base Core
    Specification 3.1
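For monitoring scripts that act on the init script's return value, the sketch
below maps the exit codes that the Linux Standard Base defines for the
"status" action. It assumes the entry above refers to that action; the
describe_status() and is_stopped() helpers are hypothetical names.

```python
# Exit codes defined by the LSB for an init script's "status" action.
LSB_STATUS = {
    0: "program is running or service is OK",
    1: "program is dead and /var/run pid file exists",
    2: "program is dead and /var/lock lock file exists",
    3: "program is not running",
    4: "program or service status is unknown",
}


def describe_status(exit_code):
    """Translate a status exit code into its LSB meaning."""
    return LSB_STATUS.get(exit_code, "reserved or non-standard code")


def is_stopped(exit_code):
    """True when the script reports that the daemon is not running."""
    return exit_code == 3
```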
* Changes in Slurm 14.11.0rc2
=============================
 -- Logs for jobs which are explicitly requeued will say so rather than saying
    that a node in their allocation failed.
 -- Updated the documentation about the remote licenses served by
    the Slurm database.
 -- Ensure that slurm_spank_exit() is only called once from srun.
 -- Change the signature of net_set_low_water() to use 4 bytes instead of 8.
 -- Export working_cluster_rec in libslurmdb.so as well as move some function
    definitions needed for drmaa.
 -- If using cons_res or serial cause a fatal in the plugin instead of causing
    the SelectTypeParameters to magically set to CR_CPU.
 -- Enhance task/affinity auto binding to consider tasks * cpus-per-task.
 -- Fix regression in priority/multifactor which would cause memory corruption.
    Issue is only in rc1.
 -- Add PrivateData value of "cloud". If set, powered down nodes in the cloud
    will be visible.
 -- Sched/backfill - Eliminate clearing start_time of running jobs.
 -- Fix various backwards compatibility issues.
 -- If failed to launch a batch job, requeue it in hold.
* Changes in Slurm 14.11.0rc1
=============================
 -- When using cgroup name the batch step as step_batch instead of
    batch_4294967294
 -- Changed LEVEL_BASED priority to be "Fair_Tree"
 -- Port to NetBSD.
 -- BGQ - Add cnode based reservations.
 -- Alongside totalview_jobid implement totalview_stepid available
    to sattach.
 -- Add ability to include other files in slurm.conf based upon the ClusterName.
 -- Update strlcpy to latest upstream version.
 -- Add reservation information in the sacct and sreport output.
 -- Add job priority calculation check for overflow and fix memory leak.
 -- Add SchedulerParameters option of pack_serial_at_end to put serial jobs at
    the end of the available nodes rather than using a best fit algorithm.
 -- Allow regular users to view default sinfo output when
    privatedata=reservations is set.
 -- PrivateData=reservation modified to permit users to view the reservations
    which they have access to (rather than preventing them from seeing ANY
    reservation).
 -- job_submit/lua: Fix job_desc set field logic
* Changes in Slurm 14.11.0pre5
==============================
 -- Fix sbatch --export=ALL; it was treated by srun as a request to explicitly
    export only the environment variable named "ALL".
 -- Improve scheduling of jobs in reservations that overlap other reservations.
 -- Modify sgather to make global file systems easier to configure.
 -- Added sacctmgr reconfig to reread the slurmdbd.conf in the slurmdbd.
 -- Modify scontrol job operations to accept comma delimited list of job IDs.
    Applies to job update, hold, release, suspend, resume, requeue, and
    requeuehold operations.
 -- Refactor job_submit/lua interface. LUA FUNCTIONS NEED TO CHANGE! The
    lua script no longer needs to explicitly load meta-tables, but information
    is available directly using names slurm.reservations, slurm.jobs,
    slurm.log_info, etc. Also, the job_submit.lua script is reloaded when
    updated without restarting the slurmctld daemon.
 -- Allow users to specify --resv_ports to have value 0.
 -- Cray MPMD (Multiple-Program Multiple-Data) support completed.
 -- Added ability for "scontrol update" to reference jobs by JobName (and
    filtered optionally by UserID).
 -- Add support for an advanced reservation start time that remains constant
    relative to the current time. This can be used to prevent the starting of
    longer running jobs on select nodes for maintenance purpose. See the
    reservation flag "TIME_FLOAT" for more information.
 -- Enlarge the jobid field to 18 characters in squeue output.
 -- Added "scontrol write config" option to save a copy of the current
    configuration in a file containing a time stamp.
 -- Eliminate native Cray specific port management. Native Cray systems must
    now use the MpiParams configuration parameter to specify ports to be used
    for communications. When upgrading Native Cray systems from version 14.03,
    all running jobs should be killed and the switch_cray_state file (in
    SaveStateLocation of the nodes where the slurmctld daemon runs) must be
    explicitly deleted.
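The "TIME_FLOAT" reservation entry in this release describes a start time
that stays a fixed distance ahead of the current time. The sketch below only
illustrates that idea; it assumes a job is blocked when its time limit would
carry it past the floating start, and the function names are hypothetical
rather than taken from the scheduler.

```python
from datetime import datetime, timedelta


def floating_start(now, offset):
    """Start of the reservation: always `offset` ahead of the current time."""
    return now + offset


def job_fits_before_reservation(now, time_limit, offset):
    """True if a job started now would end before the floating start.

    Because the start moves with the clock, the answer depends only on
    the job's time limit, which is why long jobs stay blocked.
    """
    return now + time_limit <= floating_start(now, offset)
```

With an offset of two hours, a one-hour job fits and a three-hour job does
not, no matter when the check is made.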
* Changes in Slurm 14.11.0pre4
==============================
 -- Added job array data structure and removed 64k array size restriction.
 -- Added SchedulerParameters options of bf_max_job_array_resv to control how
    many tasks of a job array should have resources reserved for them.
 -- Added more validity checking of incoming job submit requests.
 -- Added srun --export option to set/export specific environment variables.
 -- Scontrol modified to print separate error messages for job arrays with
    different exit codes on the different tasks of the job array. Applies to
    job suspend and resume operations.
 -- Fix race condition in CPU frequency set with job preemption.
 -- Always call select plugin on step termination, even if the job is also
    complete.
 -- Srun executable names beginning with "." will be resolved based upon the
    working directory and path on the compute node rather than the submit node.
 -- Add node state string suffix of "$" to identify nodes in maintenance
    reservation or scheduled for reboot. This applies to scontrol, sinfo,
    and sview commands.
 -- Enable scontrol to clear a node's scheduled reboot by setting its state
    to "RESUME".
 -- As per sbatch and srun documentation, when the --signal option is used
    signal only the steps unless, in the case of a batch job, B is
    specified, in which case signal only the batch script.
 -- Modify AuthInfo configuration parameter to accept credential lifetime
    option.
 -- Modify crypto/munge plugin to use socket and timeout specified in AuthInfo.
 -- If we have a state for a step on completion put that in the database
    instead of guessing off the exit_code.
 -- Added squeue -P/--priority option that can be used to display pending jobs
    in the same order as used by the Slurm scheduler even if jobs are submitted
    to multiple partitions (job is reported once per usable partition).
 -- Improve the pending reason description for various QOS limits. For each
    QOS limit that causes a job to be pending print its specific reason.
    For example if job pends because of GrpCpus the squeue command will
    print QOSGrpCpuLimit as pending reason.
 -- sched/backfill - Set expected start time of job submitted to multiple
    partitions to the earliest start time on any of the partitions.
 -- Introduce a MAX_BATCH_REQUEUE define that indicates how many times a job
    can be requeued upon prolog failure. When the number is reached the job
    is put on hold with reason JobHoldMaxRequeue.
 -- Add sbatch job array option to limit the number of simultaneously running
    tasks from a job array (e.g. "--array=0-15%4").
 -- Implemented a new QOS limit MinCPUs. Users running under a QOS must
    request a minimum number of CPUs which is at least MinCPUs otherwise
    their job will pend.
 -- Introduced a new pending reason WAIT_QOS_MIN_CPUS to reflect the new QOS
    limit.
 -- Job array dependency based upon state is now dependent upon the state of
    the array as a whole (e.g. afterok requires ALL tasks to complete
    successfully, afternotok is true if ANY task does not complete
    successfully, and after requires all tasks to at least be started).
 -- The srun -u/--unbuffered options set the stdout of the task launched
    by srun to be line buffered.
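The whole-array dependency rules listed in this release can be sketched as
follows. This is an illustration under simplifying assumptions, not the
scheduler's implementation: task states are reduced to four hypothetical
strings, and dependency_satisfied() is an invented helper.

```python
# Simplified task states, assumed for this sketch only.
PENDING, RUNNING, COMPLETED, FAILED = "PENDING", "RUNNING", "COMPLETED", "FAILED"


def dependency_satisfied(kind, task_states):
    """Evaluate a dependency against every task of a job array.

    after      -> all tasks have at least started
    afterok    -> all tasks completed successfully
    afternotok -> at least one task did not complete successfully
    """
    if kind == "after":
        return all(state != PENDING for state in task_states)
    if kind == "afterok":
        return all(state == COMPLETED for state in task_states)
    if kind == "afternotok":
        return any(state == FAILED for state in task_states)
    raise ValueError(f"unsupported dependency type: {kind}")
```

For an array with states COMPLETED, FAILED and RUNNING, "after" and
"afternotok" hold while "afterok" does not.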