Commits · e416a7539aabe4b89a7892a983e02b6fd47169ac · tud-zih-energy / Slurm

Sep 21, 2016

Pass SLURM_JOB_ID to ResumeProgram · e416a753

Morris Jette authored 8 years ago

When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode),
pass the job ID assigned to the nodes to the ResumeProgram in the
SLURM_JOB_ID environment variable.
bug 3100

e416a753
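
A ResumeProgram written in C could pick up the job ID from its environment roughly as follows. This is a minimal sketch, not code from the commit: the helper name `resume_job_id` is hypothetical, and it assumes the variable holds a plain decimal job ID.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper for a ResumeProgram: return the job ID found in
 * the SLURM_JOB_ID environment variable, or -1 if the variable is
 * unset, empty, or not a non-negative decimal number.
 */
long resume_job_id(void)
{
	const char *val = getenv("SLURM_JOB_ID");
	char *end = NULL;
	long id;

	if (!val || val[0] == '\0')
		return -1;

	errno = 0;
	id = strtol(val, &end, 10);
	if (errno != 0 || *end != '\0' || id < 0)
		return -1;

	return id;
}
```

A return value of -1 lets the program fall back to its usual behavior when the node is resumed for a reason other than a job-driven state change.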

Remove error message on valid state · 1b5382f5

Morris Jette authored 8 years ago

Don't log an error for job end_time being zero if the node health check is
still running.
bug 3053

1b5382f5

Simplify error checking on job_allocate · bdf94e98

Brian Christiansen authored 8 years ago

The previous logic duplicated the checks on error codes returned from
job_allocate(). job_allocate() will set the job state to FAILED if there
was an actual issue.

bdf94e98

Reject job if job violates ANY or ALL part limits · 185ebc81

Brian Christiansen authored 8 years ago

The code was only checking for ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE and
ENFORCE_ALL; however, _slurm_rpc_allocate_resources() and
_slurm_rpc_submit_batch_job() both check for ANY and ALL.

185ebc81

Sep 20, 2016
- fed_mgr - Change it so update messages don't automatically connect · 4945d049
  Danny Auble authored 8 years ago
  
  to siblings (if not already connected). The connection will be made when the next message is sent to them.
  4945d049
- Add missing signal.h include needed for pthread_kill(). · 4c30a972
  Tim Wickberg authored 8 years ago
  
  4c30a972
- Fix clang error (hopefully) · 22007e03
  Danny Auble authored 8 years ago
  
  22007e03
- Fix memory leak when we aren't ready to accept connections from · 94075636
  Danny Auble authored 8 years ago
  
  sibling clusters.
  94075636
- Fix issue when the persist_conn didn't exist when trying to send · b1866f31
  Danny Auble authored 8 years ago
  
  back a message to the caller.
  b1866f31
- If the recv pointer switches more than once during a shutdown of · 012226f6
  Danny Auble authored 8 years ago
  
  a federation connection (for example, when someone adds and removes the cluster from the federation many times in quick succession), the cluster might not be found.
  012226f6
- Fix memory corruption issue · 3dac141f
  Danny Auble authored 8 years ago
  
  3dac141f
- Fix possible invalid memory read · 058598fb
  Danny Auble authored 8 years ago
  
  058598fb
- FED_MGR - add pthread_kill to kill ping thread if running · 9903d242
  Danny Auble authored 8 years ago
  
  9903d242
- Add missing lock/unlock to the fed_mgr to avoid memory corruption. · f59cb41b
  Danny Auble authored 8 years ago
  
  f59cb41b
- Add config.h include to slurm_persist_conn.c for HAVE_SYS_PRCTL_H · 0059928c
  Tim Wickberg authored 8 years ago
  
  Fixes build issue caused by 844830d4.
  0059928c
- Fix FreeBSD build. · 844830d4
  Ben Matthews authored 8 years ago
  
  844830d4
Sep 19, 2016
- FED_MGR - Fix issue with ping thread trying to send on a non-existent · 8d2f6153
  Danny Auble authored 8 years ago
  
  connection
  8d2f6153
- On a fatal, abort so we get a core file instead of just exiting. · 428347cf
  Danny Auble authored 8 years ago
  
  428347cf
- If a message is trying to be freed that never was don't print an · eb25f6f5
  Danny Auble authored 8 years ago
  
  error.
  eb25f6f5
- Minor memory free move. · 9142c0eb
  Danny Auble authored 8 years ago
  
  9142c0eb
- Only start the persistent send when we need to send something, or · fe8bb844
  Danny Auble authored 8 years ago
  
  at startup. Starting it up when you get a connection from another cluster could cause delays in processing the request.
  fe8bb844
- In the fed_mgr, when we are starting up the send connection, we · 4416f257
  Danny Auble authored 8 years ago
  
  want to wait only for message_timeout instead of forever. Otherwise we could deadlock if the other side is trying to do the same thing.
  4416f257
- Remove xmallocs from the fed_mgr ping_thread · ba0c6af8
  Danny Auble authored 8 years ago
  
  ba0c6af8
- Add update mutex to the fed_mgr to only allow one update to be · 0d347008
  Danny Auble authored 8 years ago
  
  processed at a time. Otherwise you could get issues if you are rapidly adding and removing a cluster from a federation. Probably not likely in real life, but in testing that is a different story.
  0d347008
- Always make the connection nonblocking when receiving in the · f48d1ccb
  Danny Auble authored 8 years ago
  
  slurmctld.
  f48d1ccb
- Make error a debug message instead of error since this is an expected · 36305ede
  Danny Auble authored 8 years ago
  
  scenario when first added to a federation.
  36305ede
- Add the idea of an init flag to the fed_mgr · 6de31291
  Danny Auble authored 8 years ago
  
  6de31291
- Merge branch 'slurm-16.05' · 38e8a078
  Morris Jette authored 8 years ago
  
  38e8a078
- Add FAQ describing how to colorize squeue output · 31c87fce
  Damien François authored 8 years ago
  
  31c87fce
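
For the commit above that makes the receiving connection nonblocking (f48d1ccb), the usual POSIX way to switch a descriptor to nonblocking mode is `fcntl()` with `O_NONBLOCK`. The sketch below illustrates that general technique only; the helper name `set_nonblocking` is hypothetical and this is not the code from the commit.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: put a file descriptor into nonblocking mode.
 * Returns 0 on success and -1 on failure (errno is set by fcntl()).
 */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	if (flags & O_NONBLOCK)
		return 0;	/* already nonblocking, nothing to do */

	return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) ? -1 : 0;
}
```

Reading the current flags first and OR-ing in `O_NONBLOCK` preserves any other status flags already set on the descriptor.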
Sep 17, 2016
- Refactor the persistent connections within the federation code to use · 42bb2fb3
  Danny Auble authored 8 years ago
  
  the same logic that was found in the slurmdbd. Now both functionalities share the same code. This was done with the merge right before this commit.
  42bb2fb3
- Merge branch 'persist_conn' · 63be8b75
  Danny Auble authored 8 years ago
  
  63be8b75
- Remove what appears to be an extra return to the database when an · c483b10a
  Danny Auble authored 8 years ago
  
  update is sent to a slurmctld.
  c483b10a
- Refactor the way fed_mgr state is loaded so we can actually use it · 7d6c3b77
  Danny Auble authored 8 years ago
  
  with real persistent connections.
  7d6c3b77
- Move fed_mgr_fini to a place after state is dropped. · 59934649
  Danny Auble authored 8 years ago
  
  59934649
- Change DbdInit to PersistInit · 3dfc5a5b
  Danny Auble authored 8 years ago
  
  3dfc5a5b
- When unpacking a cluster rec that has fed.recv|send set make sure · 6ef736e7
  Danny Auble authored 8 years ago
  
  the fd is set to -1 so we don't try to actually use it.
  6ef736e7
- Fix sacctmgr to only contact clusters that are up and running · 1d6df21d
  Danny Auble authored 8 years ago
  
  when querying if there are runaway jobs.
  1d6df21d
- Put enum numbers in the enum definition to count by. · 12de6490
  Danny Auble authored 8 years ago
  
  12de6490
- Added back missing RPC; I was unable to find the commit it was taken · 2fd8b3cc
  Danny Auble authored 8 years ago
  
  out in, but it is back now!
  2fd8b3cc
- Add msg->conn to the batch job submit · 1c6ef7a0
  Danny Auble authored 8 years ago
  
  1c6ef7a0
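
The commit above that resets the fd when unpacking a cluster record (6ef736e7) follows a common pattern: a descriptor value carried inside a packed record is marked invalid on the receiving side so that nothing tries to use it. The sketch below shows that pattern under stated assumptions; `struct conn_rec` and both function names are hypothetical stand-ins, not Slurm's actual types or API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a connection record that carries an fd. */
struct conn_rec {
	int fd;
};

/*
 * Call after unpacking a record: whatever fd value arrived in the
 * packed data is discarded and replaced with -1 (no descriptor).
 */
void conn_rec_reset_after_unpack(struct conn_rec *conn)
{
	if (conn)
		conn->fd = -1;
}

/* A record is usable only if it holds a non-negative descriptor. */
bool conn_rec_is_open(const struct conn_rec *conn)
{
	return conn != NULL && conn->fd >= 0;
}
```

Using -1 as the sentinel means later code can check `conn_rec_is_open()` before any read or write instead of acting on a stale value.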