Commits · 31c87fce85219fb84df309d632a9acfcba2b1c0b · tud-zih-energy / Slurm

Sep 17, 2016

Restore ability to manually power down nodes · da722a89

Morris Jette authored 8 years ago

Restore ability to manually power down nodes, broken in 15.08.12
in commit b4904661

The patch introduced in commit b4904661 (not powering down dead node) has a bad side effect. Adding the "(node_ptr->last_idle != 0)" condition prevents nodes from being powered down with the following command:

scontrol update nodename=nX state=power_down

because the state update function relies on zeroing the "last_idle" variable when a power_down is requested (see src/slurmctld/node_mgr.c, line 1589).

Reverting this commit should solve the problem...but I let you decide...

Didier GAZEN
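
The interaction described in this message can be modelled in a few lines. This is only a sketch in Python, not the slurmctld code: the `Node` class, both function names, and the idle-time comparison are assumptions made for illustration. The only details taken from the message are the `last_idle` field, the fact that a manual power_down request zeroes it, and the added `last_idle != 0` condition.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a node record; only last_idle comes from the commit."""
    name: str
    last_idle: int          # seconds since epoch when the node last became idle
    powered_down: bool = False


def request_power_down(node: Node) -> None:
    """Model of the manual request: the state update zeroes last_idle."""
    node.last_idle = 0


def should_power_down(node: Node, now: int, idle_time: int,
                      require_nonzero: bool) -> bool:
    """Model of the periodic check, with or without the extra condition.

    The idle-time comparison is an assumed simplification.
    """
    if node.powered_down:
        return False
    if require_nonzero and node.last_idle == 0:
        # With the extra condition, a manually requested node is skipped.
        return False
    return node.last_idle < now - idle_time
```

With `require_nonzero=False` a node whose `last_idle` was zeroed always looks idle long enough and is selected; with `require_nonzero=True` the same node is never selected, which matches the regression the message reports.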

da722a89

Sep 16, 2016

Update KNL modes for out-of-band reboot · 3a465f80

Morris Jette authored 8 years ago

node_features/knl_cray: If a node is rebooted outside of Slurm's direction,
    update its active features with current MCDRAM and NUMA mode information.
bug 3071

3a465f80

Sep 15, 2016
- node_features/knl_cray: Fix MCDRAM state source · bbaa48b2
  Morris Jette authored 8 years ago
  
  Fix race condition that could result in MCDRAM state information coming from capmc rather than cnselect (used state for next boot rather than latest boot). bug 3080
  bbaa48b2
- Docs - assorted spelling corrections. · 09d20dc3
  Nicolas Joly authored 8 years ago
  
  09d20dc3
Sep 14, 2016
- Silence srun warning message when intentionally lowering task-per-node count for a step. · daacf5af
  Alejandro Sanchez authored 8 years ago
  
  No functional change, just silencing the warning message in this instance. Bug 3079.
  daacf5af
- Docs - mention SLURM_MEM_PER_CPU/NODE environment variables and allowed suffixes. · f62d3e70
  Alejandro Sanchez authored 8 years ago
  
  Bug 3073.
  f62d3e70
Sep 09, 2016
- Modify srun task completion handling · 166e3bec
  Morris Jette authored 8 years ago
  
  Modify srun task completion handling to only build the task/node string for logging purposes if it is needed. Modified for performance purposes. bug 3044
  166e3bec
- Revert "Fix issue filtering licenses for output with squeue." · 9c4eabed
  Tim Wickberg authored 8 years ago
  
  This reverts commit 1ec2a4ae.
  9c4eabed
- Fix issue filtering licenses for output with squeue. · 1ec2a4ae
  Alejandro Sanchez authored 8 years ago
  
  Bug 3063.
  1ec2a4ae
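
The srun task completion change listed above (166e3bec) builds the task/node string only when it will actually be logged. A rough Python sketch of that pattern follows. The logger name, function names, and range format are all hypothetical, and srun itself is not written in Python.

```python
import logging

logger = logging.getLogger("srun_sketch")  # hypothetical logger name


def format_task_ranges(task_ids):
    """Collapse task IDs into a compact range string such as '0-2,5'."""
    ids = sorted(set(task_ids))
    parts = []
    i = 0
    while i < len(ids):
        j = i
        # Extend the run while the next ID is consecutive.
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


def log_task_exit(task_ids, exit_code):
    """Build the (potentially expensive) string only if it will be emitted.

    Returns True when the string was built, False when the work was skipped.
    """
    if not logger.isEnabledFor(logging.DEBUG):
        return False
    logger.debug("tasks %s exited with code %d",
                 format_task_ranges(task_ids), exit_code)
    return True
```

The saving comes from skipping `format_task_ranges` entirely on the common path, which matters when many tasks exit at once.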
Sep 08, 2016

Restructure srun task_exit logic · 6b6d4e1a

Morris Jette authored 8 years ago

Restructure srun command locking for task_exit processing logic for improved
  parallelism. This change decreases the amount of time consumed by serial
  logic by 2 orders of magnitude.
bug 3044

6b6d4e1a

Sep 07, 2016

Preserve node "RESERVATION" state · 5eee1d28

Morris Jette authored 8 years ago

Preserve node "RESERVATION" state when one of multiple overlapping
    reservations ends. Previous logic would clear the node's
    RESERVATION state flag when any one of the reservations on the
    node ended rather than keeping the node in RESERVATION state
    until the last reservation ended.
bug 3057
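
One way to picture the corrected behaviour is to derive the RESERVATION flag from the set of reservations still covering the node, rather than clearing it whenever any reservation ends. The class below is a hypothetical Python model, not Slurm's data structure.

```python
class NodeReservations:
    """Hypothetical model: a node stays reserved until its last reservation ends."""

    def __init__(self):
        self.active = set()  # names of reservations currently covering the node

    @property
    def reserved(self):
        # The flag is derived from the set, so it cannot be cleared early.
        return bool(self.active)

    def start(self, name):
        self.active.add(name)

    def end(self, name):
        self.active.discard(name)
```

For two overlapping reservations `a` and `b`, ending `a` leaves `reserved` true, and only ending `b` as well clears it.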

5eee1d28

Handle slurmctld restart while compute node reboot request in progress · 4517c454

Morris Jette authored 8 years ago

Handle the case when the slurmctld daemon restarts while a compute node
    reboot is in progress. Return the node to service rather than setting it DOWN.
bug 3042

4517c454

Sep 06, 2016
- Add salloc_wait_nodes option to SchedulerParameters · 2670edc4
  Morris Jette authored 8 years ago
  
  Add salloc_wait_nodes option to the SchedulerParameters parameter in the slurm.conf file controlling when the salloc command returns in relation to when nodes are ready for use (i.e. booted). bug 3043
  2670edc4
- Add mpiexec man page to the script · c7cc6a70
  Gennaro Oliva authored 8 years ago
  
  bug 3055
  c7cc6a70
- Fix mpiexec wrapper to accept task count with more than one digit · ae1e63f2
  Gennaro Oliva authored 8 years ago
  
  bug 3054
  ae1e63f2
Sep 02, 2016
- sacctmgr - Fix displaying nodenames when printing out events or · 40a33cfb
  Danny Auble authored 8 years ago
  
  reservations.
  40a33cfb
Sep 01, 2016
- backfill scheduling partition/qos check · c75da93a
  Morris Jette authored 8 years ago
  
  sched/backfill - Check that a user's QOS is allowed to use a partition before trying to schedule resources on that partition for the job. bug 3039
  c75da93a
- burst_buffer/cray - Hold job after 3 failed pre-run operations · 85983130
  Morris Jette authored 8 years ago
  
  bug 3035 and 3009
  85983130
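
The sched/backfill change above (c75da93a) checks QOS access before attempting to schedule a job on a partition. A minimal sketch of such a filter follows, assuming each partition carries an optional allow list or deny list of QOS names. The function names and dictionary layout are invented for illustration and do not mirror the scheduler's internals.

```python
def qos_may_use_partition(job_qos, allow_qos=None, deny_qos=None):
    """Return True if a job with this QOS may use the partition.

    Assumption: an allow list, when present, takes precedence over a deny list.
    """
    if allow_qos is not None:
        return job_qos in allow_qos
    if deny_qos is not None:
        return job_qos not in deny_qos
    return True


def candidate_partitions(job_qos, partitions):
    """Keep only the partitions worth trying to schedule the job on."""
    return [name for name, limits in partitions.items()
            if qos_may_use_partition(job_qos, **limits)]
```

Filtering before the resource search avoids spending scheduling effort on a partition the job could never run in.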
Aug 30, 2016
- Add additional NEWS entry missed with merge commit f063f761 . · 1341c4a2
  Tim Wickberg authored 8 years ago
  
  1341c4a2
- Move call to _set_job_running_restore after the bitstring is resized. · cdc3214e
  Tim Wickberg authored 8 years ago
  
  Otherwise blade_cnt is potentially greater than bit_size(jobinfo->blade_map) which leads to an assertion failure. Bug 3033.
  cdc3214e
Aug 27, 2016

Provide HWLOC topology in the job-data if Slurm was configured · 2e019441

Artem Polyakov authored 8 years ago

with hwloc.

2e019441

Fix problem updating job state_reason · fb46c84b

Morris Jette authored 8 years ago

This patch has two parts:
1. When a job is initially submitted, Slurm was failing to set
   an initial reason for the job not starting.
2. After a job was submitted, it was sometimes failing to reset
   the job's reason. It was also failing to reset the "last_job_update"
   time, so something like "squeue -i1" would not get the new reason.
bug 3025
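
The two parts of this fix can be sketched together: a reason is recorded at submit time, and every later change to the reason also advances the shared update timestamp that polling clients compare against. The `JobTable` class and its methods are hypothetical. Only the `last_job_update` name comes from the commit message, and the reason strings are placeholders.

```python
import time


class JobTable:
    """Hypothetical model of job reasons plus a shared 'last update' timestamp."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.last_job_update = 0.0
        self.reasons = {}

    def submit(self, job_id, reason="Priority"):
        # Part 1: a newly submitted job always gets an initial reason.
        self.set_reason(job_id, reason)

    def set_reason(self, job_id, reason):
        # Part 2: changing the reason also bumps the timestamp, so a
        # polling client notices that it must refresh.
        if self.reasons.get(job_id) != reason:
            self.reasons[job_id] = reason
            self.last_job_update = self.clock()

    def changed_since(self, timestamp):
        return self.last_job_update > timestamp
```

A client polling once per second, as with "squeue -i1", would call something like `changed_since` with the time of its previous query and would miss the new reason if the timestamp were not advanced.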

fb46c84b

Aug 26, 2016
- Fix multi-partition submit limits test · 2e4552df
  Alejandro Sanchez authored 8 years ago
  
  Fix multipart srun submission with EnforcePartLimits=NO and job violating the partition limits. bug 3025
  2e4552df
- Make partition State independent of EnforcePartLimits value · 76d62ae4
  Alejandro Sanchez authored 8 years ago
  
  bug 3011
  76d62ae4
Aug 25, 2016

Corrections to gres.conf parsing logic · dbfd87e4

Morris Jette authored 8 years ago

If all GRES were not defined on all nodes OR if a regular expression was used
for a GRES file configuration (e.g. in gres.conf
"Type=gpu Files=/dev/nvidia[0-4]"), then memory corruption was likely.
The logic has been bad since its inception several years ago.
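
The bracketed expression in the example stands for several device files, so a single configuration line can yield more entries than it appears to. The commit does not say where the corruption occurred, so the Python sketch below only illustrates the expansion, for the simple case of one numeric range. The function name is hypothetical, and the real parser is not shown in this log.

```python
import re


def expand_file_spec(spec):
    """Expand a single '[lo-hi]' numeric range into explicit paths.

    Specs without a range are returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]
    prefix, suffix = match.group(1), match.group(4)
    low, high = int(match.group(2)), int(match.group(3))
    if low > high:
        raise ValueError(f"empty range in {spec!r}")
    return [f"{prefix}{i}{suffix}" for i in range(low, high + 1)]
```

Applied to the spec from the commit message, `/dev/nvidia[0-4]` expands to five paths, `/dev/nvidia0` through `/dev/nvidia4`.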

dbfd87e4

Aug 24, 2016
- Fix FreeBSD build issue in node_features_knl_cray.c. · 04f0c074
  Joseph Mingrone authored 8 years ago
  
  POLLRDHUP does not exist on BSD, define to POLLHUP as done elsewhere.
  04f0c074
Aug 23, 2016

Cray - correctly detect service vs compute nodes in slurmconfgen_smw.py. · 02347d20

David Gloe authored 8 years ago

The attached patch switches to a more reliable method of detecting
service nodes, using xtcli status. In addition, it switches to the print
function for better compatibility with Python 3.

02347d20

Aug 22, 2016

mpi/pmix: support for passing TMPDIR path through info key · 677f0efb

Boris Karasev authored 8 years ago

677f0efb

mpi/pmix: Introduce support for upcoming PMIx version 2.x · 0a43cfa1

Boris Karasev authored 8 years ago

To ease the distribution process, plugin names will be automatically
adjusted to identify the version of API that it can support,
ie: pmix_v1 and pmix_v2.
This provides the ability for distros to create separate
non-conflicting packages for each API generation.

Bug 2986

0a43cfa1

Aug 20, 2016
- Expected pending job start time not in the past · 99ab4ddf
  Morris Jette authored 8 years ago
  
  Ensure reported expected job start time is not in the past for pending jobs. bug 3002
  99ab4ddf
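
The rule in the entry above (99ab4ddf) amounts to clamping the estimate to the current time. A one-function Python sketch follows. The function name is hypothetical and the commit does not show how Slurm implements this.

```python
def reported_start_time(estimated_start, now):
    """Never report an expected start time earlier than the current time."""
    return max(estimated_start, now)
```

An estimate computed earlier that has since gone stale is therefore reported as "now" rather than as a moment that has already passed.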
Aug 19, 2016

Don't hold job after burst buffer setup failure · 9d9b6832

Morris Jette authored 8 years ago

burst_buffer/cray: Requeue, but do not hold a job which fails the pre_run
    operation.
bug 3009

9d9b6832

Aug 17, 2016
- Update and expand Slurm upgrade guide · 263ce078
  Morris Jette authored 8 years ago
  
  263ce078
Aug 16, 2016
- Fix scancel to allow multiple steps from a single job to be cancelled. · 8dd56d4e
  Alejandro Sanchez authored 8 years ago
  
  Only mark job_id as zero for batch step (when all job steps would be cleared), not for individual steps which prevented successive steps from being cancelled. Bug 2984.
  8dd56d4e
- Export unit convert functions from slurm_protocol_api.c in libslurm.so. · d8a26ee1
  Danny Auble authored 8 years ago
  
  d8a26ee1
- Export functions from parse_time.c in libslurm.so. · d6d85618
  Danny Auble authored 8 years ago
  
  d6d85618
- slurmstepd to load all plugins at startup · 962f0cce
  Morris Jette authored 8 years ago
  
  slurmstepd modified to pre-load all relevant plugins at startup to avoid the possibility of modified plugins later resulting in inconsistent API or data structures and a failure of slurmstepd. bug 2334
  962f0cce
Aug 15, 2016
- Fix accounting for jobs requeued after the previous job was finished. · 8eb0f018
  Danny Auble authored 8 years ago
  
  8eb0f018
Aug 12, 2016
- update new for next tag · 0466a353
  Danny Auble authored 8 years ago
  
  0466a353
- task/affinity plugin buffer allocated too small, can corrupt memory · ac86d075
  Morris Jette authored 8 years ago
  
  ac86d075
Aug 11, 2016
- Update documented support for --ntasks-per-core/socket options · da37379b
  Morris Jette authored 8 years ago
  
  bug 2655
  da37379b