- Sep 06, 2017
Tim Wickberg authored
This creates an empty directory, since we no longer bother copying the job script or environment as of 28b7f853.
- Sep 04, 2017
Alejandro Sanchez authored
CID 175193 (FORWARD_NULL): Theoretically we shouldn't have a job_desc_msg_t without an associated part_record, but just in case let's harden the code. Introduced in previous commit: 24365514.
Alejandro Sanchez authored
Initially, job memory limits were tested at submission time through _validate_min_mem_partition() -> _valid_pn_min_mem(), but not tested again at scheduling time, leading to jobs incorrectly being scheduled against partitions where the job exceeded their MaxMemPer* limit (which can in turn be inherited from the system-wide limit). NOTE: A new WAIT_PN_MEM_LIMIT job_state_reason enum value was added to support this new waiting reason. Bug 2291.
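A minimal sketch of the idea in C, with illustrative stand-in types rather than slurmctld's real structures: re-test the per-CPU memory request against the partition limit at scheduling time, record the new waiting reason on failure, and guard against a missing partition record (cf. CID 175193 above).

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stand-ins for Slurm's internal records. */
    typedef struct {
        uint64_t max_mem_per_cpu;   /* 0 means no partition limit */
    } part_record_sketch_t;

    typedef struct {
        uint64_t pn_min_mem;        /* requested memory per CPU */
        int      state_reason;
    } job_record_sketch_t;

    enum { WAIT_NO_REASON = 0, WAIT_PN_MEM_LIMIT_SKETCH };

    /* Re-check the memory limit at scheduling time, not only at
     * submission.  Also hardened against a missing partition record. */
    static bool job_fits_mem_limit(job_record_sketch_t *job,
                                   const part_record_sketch_t *part)
    {
        if (!part)                  /* no partition record: nothing to test */
            return true;
        if (part->max_mem_per_cpu &&
            job->pn_min_mem > part->max_mem_per_cpu) {
            job->state_reason = WAIT_PN_MEM_LIMIT_SKETCH;
            return false;
        }
        return true;
    }

    int main(void)
    {
        part_record_sketch_t part = { .max_mem_per_cpu = 4096 };
        job_record_sketch_t  job  = { .pn_min_mem = 8192 };

        if (!job_fits_mem_limit(&job, &part))
            puts("job waits: WAIT_PN_MEM_LIMIT");
        return 0;
    }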
- Sep 01, 2017
Danny Auble authored
checked on submit. This only mattered when submitting a job to multiple partitions. Bug 4066
Danny Auble authored
on node 0. Bug 4035
Tim Wickberg authored
Add to _remove_job_hash() the xassert() that the inline version has.
- Aug 22, 2017
Danny Auble authored
- Aug 15, 2017
Morris Jette authored
Coverity CID 174399
Morris Jette authored
This ensures that the batch script can have appropriate environment variables set for all components (e.g. the node list, CPU count, etc. for all pack groups).
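As a sketch of what that means, with hypothetical variable names (the exact SLURM_*_PACK_GROUP_n spelling is an assumption, not taken from the commit):

    #include <stdio.h>

    /* Sketch of exporting per-component values to a pack job's batch
     * script.  printf() stands in for setenv(); names are hypothetical. */
    int main(void)
    {
        const char *nodelists[]  = { "node[01-02]", "node[03-08]" };
        const int   cpu_counts[] = { 4, 24 };
        int ncomps = 2;
        char name[64];

        for (int i = 0; i < ncomps; i++) {
            snprintf(name, sizeof(name),
                     "SLURM_JOB_NODELIST_PACK_GROUP_%d", i);
            printf("%s=%s\n", name, nodelists[i]);
            snprintf(name, sizeof(name),
                     "SLURM_JOB_CPUS_PACK_GROUP_%d", i);
            printf("%s=%d\n", name, cpu_counts[i]);
        }
        return 0;
    }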
- Aug 14, 2017
Morris Jette authored
When the "scontrol update jobid=#" command specifies a pack job leader, then modify all components of the pack job. To modiify only the pack job leader, specify "scontrol update jobid=#+0".
- Aug 12, 2017
Morris Jette authored
Modify scontrol job hold/release and update to operate with heterogeneous job id specification (e.g. "scontrol hold 123+4").
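A rough sketch of parsing such a specification, not the actual scontrol code:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative parser for a heterogeneous job ID of the form
     * "<job_id>+<component>", e.g. "123+4".  Error handling is reduced
     * to a pass/fail return. */
    static int parse_het_job_id(const char *spec, long *job_id, long *comp)
    {
        char *end = NULL;

        *comp = -1;                       /* -1: whole job, no "+N" given */
        *job_id = strtol(spec, &end, 10);
        if (end == spec)
            return -1;                    /* no leading job id */
        if (*end == '+') {
            const char *p = end + 1;
            *comp = strtol(p, &end, 10);
            if (end == p)
                return -1;                /* "+" with no component */
        }
        return (*end == '\0') ? 0 : -1;   /* reject trailing junk */
    }

    int main(void)
    {
        long id, comp;

        if (parse_het_job_id("123+4", &id, &comp) == 0)
            printf("hold job %ld, component %ld\n", id, comp);
        return 0;
    }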
- Aug 11, 2017
Morris Jette authored
Morris Jette authored
Doing so would break the current scheduling logic.
- Aug 09, 2017
Morris Jette authored
This situation would be caused by an invalid user request, not a slurmctld error.
Morris Jette authored
Morris Jette authored
In all of these cases, the input account name is NULL, so there should never be a failure. In every case, the returned association pointer is checked anyway. Coverity CID 44719, 44720, 44721
Morris Jette authored
Coverity CID 45150.
- Aug 02, 2017
Marshall Garey authored
srun jobs that could start immediately and requested multiple partitions didn't run in the highest priority partition if it wasn't listed first. Now that the job's partition list gets sorted by priority, scontrol show job may display the partition list in priority order. Bug 4015.
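A minimal sketch of the new ordering, using invented field names rather than slurmctld's real partition record:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative candidate-partition record. */
    typedef struct {
        const char *name;
        int priority;      /* higher value = higher priority */
    } part_rec_sketch_t;

    /* qsort comparator: highest priority partition first. */
    static int cmp_prio_desc(const void *a, const void *b)
    {
        const part_rec_sketch_t *pa = a, *pb = b;
        return pb->priority - pa->priority;
    }

    int main(void)
    {
        part_rec_sketch_t parts[] =
            { {"debug", 1}, {"high", 10}, {"normal", 5} };
        size_t n = sizeof(parts) / sizeof(parts[0]);

        qsort(parts, n, sizeof(parts[0]), cmp_prio_desc);
        for (size_t i = 0; i < n; i++)
            printf("%s (priority %d)\n", parts[i].name, parts[i].priority);
        return 0;
    }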
- Jul 28, 2017
Morris Jette authored
If a pack job is only partially allocated resources (likely due to limits), deallocate resources from those components which have been started and requeue them.
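Sketched in C with illustrative names, the all-or-nothing rule looks roughly like this: if any component failed to get resources, release the components that did start and requeue them.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative pack job component. */
    typedef struct {
        int  comp_id;
        bool allocated;
    } pack_comp_sketch_t;

    /* If any component lacks an allocation, back out the others. */
    static void requeue_if_incomplete(pack_comp_sketch_t *comps, int n)
    {
        bool all_allocated = true;

        for (int i = 0; i < n; i++)
            if (!comps[i].allocated)
                all_allocated = false;
        if (all_allocated)
            return;

        for (int i = 0; i < n; i++) {
            if (comps[i].allocated) {
                comps[i].allocated = false;   /* stand-in for deallocation */
                printf("component %d deallocated and requeued\n",
                       comps[i].comp_id);
            }
        }
    }

    int main(void)
    {
        pack_comp_sketch_t comps[] = { {0, true}, {1, true}, {2, false} };
        requeue_if_incomplete(comps, 3);
        return 0;
    }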
- Jul 27, 2017
Morris Jette authored
This change adds a new function and moves some logic around so that limits can be tested on a pack job as a whole (that logic still needs to be developed).
- Jul 25, 2017
Morris Jette authored
Don't requeue a batch pack job component that is not found on node zero of the allocation. Only the first pack job component is expected to have a running script.
-
Danny Auble authored
Morris Jette authored
- Jul 24, 2017
Dominik Bartkiewicz authored
Bug 3953
- Jul 19, 2017
Morris Jette authored
This removes several #define statements with different names in various functions.
- Jul 05, 2017
Brian Christiansen authored
It wasn't doing it for origin jobs.
Brian Christiansen authored
Previously, remote jobs would be removed from the job_list as quickly as possible to prevent collisions with requeued jobs, while the origin job would stay around until MinJobAge on the origin cluster. But the origin job didn't have the details from the job that ran on a remote cluster. Now just don't show revoked jobs: the origin tracking job will remain revoked and not shown, and the remote job will hang around for display until MinJobAge. scontrol show jobs will show the job from the cluster that ran it. The job is requeueable as long as the origin job is still in the origin cluster's job_list.
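A minimal sketch of the display rule, with invented state names: revoked jobs stay in the job_list (so requeue still works) but are skipped when jobs are shown.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative job states; "revoked" marks a federation job that
     * ran (or is running) on a sibling cluster. */
    enum job_state_sketch { JOB_PENDING, JOB_RUNNING, JOB_REVOKED };

    /* Display rule sketched from the commit: keep revoked jobs in the
     * list, but don't show them. */
    static bool show_job(enum job_state_sketch state)
    {
        return state != JOB_REVOKED;
    }

    int main(void)
    {
        enum job_state_sketch jobs[] =
            { JOB_RUNNING, JOB_REVOKED, JOB_PENDING };

        for (int i = 0; i < 3; i++)
            printf("job %d: %s\n", i,
                   show_job(jobs[i]) ? "shown" : "hidden");
        return 0;
    }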
Brian Christiansen authored
Just check for the revoked state instead of checking if it's a tracker job, since an origin job will be revoked if it can't run on the origin or if it's running on a remote cluster.
- Jun 27, 2017
Tim Wickberg authored
No longer require part write lock.
Tim Wickberg authored
Replace with direct references to the struct.
- Jun 22, 2017
Brian Christiansen authored
This allows the origin to be able to sync up jobs after it has been down.
Brian Christiansen authored
Isaac Hartung authored
When a non-origin cluster is removed:
- running jobs remain
- fed_details are removed so the job can't call home
- the origin cluster removes the tracking job for running jobs
- pending jobs are removed
- pending srun/sallocs don't get notified
- other clusters remove the removed cluster from viable and active sibs
When an origin cluster is removed:
- all pending jobs are removed from all clusters that had the job
- pending srun/sallocs are notified of termination
- running jobs remain
- Jun 21, 2017
Dominik Bartkiewicz authored
Bug 3757.
- Jun 20, 2017
Danny Auble authored
more than 1 partition or when the partition is changed with scontrol. Bug 3849
- Jun 19, 2017
Isaac Hartung authored
Continuation of b9719be2
Brian Christiansen authored
CID 170772, 170773. Introduced by commit: 250378c2.
- Jun 16, 2017
Tim Shaw authored
Bug 3502.
- Jun 13, 2017
Danny Auble authored
Bug 3888
- Jun 08, 2017
Dominik Bartkiewicz authored
Prevent a segfault from dereferencing a pointer to a QOS that is being deleted. Fix to commit 3e8aa451.
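The idea of the fix, sketched with invented types (not Slurm's actual records): when a QOS is deleted, clear any job pointer that still references it so later dereferences can't hit freed memory.

    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative stand-ins for QOS and job records. */
    typedef struct { const char *name; } qos_rec_sketch_t;
    typedef struct { qos_rec_sketch_t *qos_ptr; } job_rec_sketch_t;

    /* On QOS deletion, NULL out every job's reference to it. */
    static void qos_delete_notify(job_rec_sketch_t *jobs, int njobs,
                                  qos_rec_sketch_t *dead)
    {
        for (int i = 0; i < njobs; i++)
            if (jobs[i].qos_ptr == dead)
                jobs[i].qos_ptr = NULL;   /* prevent dangling dereference */
    }

    int main(void)
    {
        qos_rec_sketch_t q = { "expedite" };
        job_rec_sketch_t jobs[] = { { &q }, { NULL } };

        qos_delete_notify(jobs, 2, &q);
        printf("job 0 qos: %s\n",
               jobs[0].qos_ptr ? jobs[0].qos_ptr->name : "(none)");
        return 0;
    }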