- Feb 21, 2013
Danny Auble authored

- Jan 16, 2013
Morris Jette authored
Without this change a high priority batch job may not start at submit time. In addition, a pending job submitted to multiple partitions could be cancelled when the scheduler runs if any one of its partitions could not be used by the job.
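
The corrected behavior is easiest to see as a predicate over all of a job's partitions rather than any single one. A minimal sketch in C, assuming a hypothetical job record and a job_runnable_in_part() test (neither is the actual Slurm source):

    #include <stdbool.h>
    #include <stddef.h>

    struct part_record;                  /* one partition (opaque here) */
    struct job_record {
        struct part_record **part_list;  /* NULL-terminated list of requested partitions */
    };

    /* Assumed predicate: could this job ever run in this partition? */
    extern bool job_runnable_in_part(struct job_record *job,
                                     struct part_record *part);

    /* Cancel only when *no* requested partition can ever run the job;
     * the bug was cancelling when any one partition was unusable. */
    static bool should_cancel(struct job_record *job)
    {
        for (size_t i = 0; job->part_list && job->part_list[i]; i++) {
            if (job_runnable_in_part(job, job->part_list[i]))
                return false;            /* one usable partition: keep it pending */
        }
        return true;
    }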

- Dec 12, 2012
Danny Auble authored
Morris Jette authored
Morris Jette authored

- Nov 22, 2012
Danny Auble authored

- Oct 25, 2012
Danny Auble authored

- Oct 24, 2012
Danny Auble authored
(removed some unused variables but many remain)

- Oct 23, 2012
Danny Auble authored

- Sep 25, 2012
Morris Jette authored
Fix some un/pack logic.
Fix test12.5 for the new sacct help format.
Address various compiler warnings.
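
Un/pack bugs of this kind come from the pack and unpack sides of an RPC drifting out of step. A self-contained sketch of the invariant, with a toy buffer type standing in for Slurm's real routines in src/common/pack.c:

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { unsigned char data[64]; size_t off; } buf_t;  /* toy buffer */

    /* Every field packed must later be unpacked in the same order
     * and with the same width. */
    static void pack32(uint32_t val, buf_t *buf)
    {
        val = htonl(val);                /* wire format: network byte order */
        memcpy(buf->data + buf->off, &val, sizeof(val));
        buf->off += sizeof(val);
    }

    static void unpack32(uint32_t *val, buf_t *buf)
    {
        memcpy(val, buf->data + buf->off, sizeof(*val));
        *val = ntohl(*val);
        buf->off += sizeof(*val);
    }
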
Martin Perry authored
Attached is the energy accounting patch that Martin and Yiannis have been working on. The framework is there, but the functionality is currently not working. They are both on vacation this week and are then back a week before the conference. I thought it would be better to send it now, to get the framework and the structures in place for an official 2.5.0, instead of waiting. If you disagree, just let us know and we can send it again when the low-level functionality is working.

Here is a short summary of our test results:

1. jobacct_gather/none + energy_accounting/none: Looks OK. Did not find any errors.

2. jobacct_gather/linux or cgroup + energy_accounting/none: Looks OK. Did not find any errors.

3. jobacct_gather/linux or cgroup + energy_accounting/rapl: Slurmd aborts when you run a job that uses a node that does not support RAPL. This appears to be because of the error()/pexit() calls at lines 150/151 in energy_accounting_rapl.c. We need to change this code to just issue a debug message and return. For now, energy_accounting must not be configured if the cluster includes any nodes that do not support RAPL. The CPU frequency values reported by jobacct_gather are also not correct.

There are obviously still some problems, so if it would be better to wait for full functionality, just let us know. It may be three weeks before they are able to spend time on fixing these issues, which is why I thought you might prefer to have something with the correct data structures in sooner rather than later.
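
The fix proposed in item 3 is to fail soft on nodes without RAPL support instead of terminating slurmd. A hypothetical reconstruction of that change (the function name and MSR path are illustrative; Slurm's debug() and return codes are assumed from src/common/log.h and slurm/slurm_errno.h):

    #include <fcntl.h>
    #include <unistd.h>

    extern void debug(const char *fmt, ...);  /* assumed: Slurm logging */
    #define SLURM_SUCCESS  0
    #define SLURM_ERROR   -1

    /* RAPL counters are read through model-specific registers, so a
     * node without MSR access cannot support the plugin. */
    static int rapl_check_node_support(void)
    {
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) {
            /* Old behavior: error() then exit, killing slurmd on
             * non-RAPL nodes.  Suggested: log at debug level and return. */
            debug("energy_accounting/rapl: no MSR/RAPL support on this "
                  "node; energy accounting disabled");
            return SLURM_ERROR;
        }
        close(fd);
        return SLURM_SUCCESS;
    }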

- Aug 10, 2012
Matthieu Hautreux authored
Removes the previous 20 minute time limit on file transfers. The previous behavior would fail for large files 20 minutes into the transfer.

- Aug 09, 2012
Matthieu Hautreux authored
Removes the previous 20 minute time limit on file transfers. The previous behavior would fail for large files 20 minutes into the transfer.

- Jul 19, 2012
Alejandro Lucero Palau authored
alejluther authored
Add a reset level to reset_stats to control which values the reset algorithm clears.
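
A reset level lets the caller clear only a subset of the collected statistics. A minimal sketch of the idea, with hypothetical level names and counters (not the actual reset_stats signature):

    enum reset_level {
        RESET_LEVEL_SCHED = 1,           /* main scheduler counters only */
        RESET_LEVEL_BF    = 2,           /* backfill counters only */
        RESET_LEVEL_ALL   = 3,           /* everything */
    };

    struct sched_stats {
        long sched_cycles;
        long bf_cycles;
    };

    /* The level argument controls which values the reset touches. */
    static void reset_stats(struct sched_stats *s, enum reset_level level)
    {
        if (level == RESET_LEVEL_SCHED || level == RESET_LEVEL_ALL)
            s->sched_cycles = 0;
        if (level == RESET_LEVEL_BF || level == RESET_LEVEL_ALL)
            s->bf_cycles = 0;
    }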

- Jul 16, 2012
Morris Jette authored
This addresses trouble ticket 85

- Jul 03, 2012
Morris Jette authored
Alejandro Lucero Palau authored
Add support for advanced reservation of specific cores rather than whole nodes. Current limitations: homogeneous cluster, nodes must be idle when the reservation is created, and no more than one reservation per node. Code is still under development. Work by Alejandro Lucero Palau, et al., BSC.
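
Under the stated limitations (homogeneous cluster, at most one reservation per node), per-core reservations reduce to one core bitmap per node. An illustrative data-structure sketch only, not the Slurm implementation:

    #include <stdint.h>

    #define MAX_RESV_NODES 1024

    struct resv_cores {
        /* bit i set => core i of that node is reserved; one 64-bit
         * word per node assumes <= 64 cores, which the homogeneous-
         * cluster limitation makes easy */
        uint64_t core_bitmap[MAX_RESV_NODES];
    };

    /* With at most one reservation per node, "node already reserved"
     * is just a nonzero-bitmap test. */
    static int node_has_resv(const struct resv_cores *r, int node_inx)
    {
        return r->core_bitmap[node_inx] != 0;
    }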

- Jun 01, 2012
Morris Jette authored

- May 23, 2012
Morris Jette authored
Format is "name:used/total"
Morris Jette authored

- May 22, 2012
Danny Auble authored
Morris Jette authored

- May 11, 2012
Morris Jette authored
Morris Jette authored

- May 10, 2012
Morris Jette authored

- May 08, 2012
Morris Jette authored
Morris Jette authored
Morris Jette authored

- May 05, 2012
Morris Jette authored
Morris Jette authored
This eliminates one of four RPCs needed for each job.

- May 04, 2012
Morris Jette authored
Morris Jette authored

- Mar 20, 2012
Morris Jette authored

- Feb 22, 2012
Pär Andersson authored

- Jan 31, 2012
Didier GAZEN authored
Hi,

With slurm 2.3.2 (or 2.3.3), I encounter the following error when trying to launch as root a command attached to a running user's job, even when I use the --uid=<user> option:

    sila@suse112:~> squeue
      JOBID PARTITION     NAME     USER    STATE   TIME  TIMELIMIT  NODES  CPUS  NODELIST(REASON)
        551     debug mysleep.     sila  RUNNING   0:02  UNLIMITED      1     1  n1
    root@suse112:~ # srun --jobid=551 hostname
    srun: error: Unable to create job step: Access/permission denied   <-- normal behaviour
    root@suse112:~ # srun --jobid=551 --uid=sila hostname
    srun: error: Unable to create job step: Invalid user id            <-- problem

By increasing slurmctld verbosity, the log file displays the following errors:

    slurmctld: debug2: Processing RPC: REQUEST_JOB_ALLOCATION_INFO_LITE from uid=0
    slurmctld: debug: _slurm_rpc_job_alloc_info_lite JobId=551 NodeList=n1 usec=1442
    slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=0
    slurmctld: error: Security violation, JOB_STEP_CREATE RPC from uid=0 to run as uid 1001

The error comes from the function _slurm_rpc_job_step_create (src/slurmctld/proc_req.c).

Here's my patch to prevent the command from failing (but I'm not sure that there are no side effects):
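
The patch itself is not included above. As a hypothetical illustration only, the permission test being described, where root asking to run as the job's owner should pass rather than hit the "Security violation" path, might look like:

    #include <sys/types.h>

    extern uid_t slurm_user_id;          /* assumed: SlurmUser from slurm.conf */

    /* rpc_uid is the authenticated sender (0 in the report); job_uid
     * is the uid the step would run as (1001, i.e. sila). */
    static int step_create_allowed(uid_t rpc_uid, uid_t job_uid)
    {
        if (rpc_uid == job_uid)
            return 1;                    /* owner manages its own job */
        if (rpc_uid == 0 || rpc_uid == slurm_user_id)
            return 1;                    /* root/SlurmUser may act for the user */
        return 0;                        /* rejected: "Security violation" */
    }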

- Jan 27, 2012
Morris Jette authored

- Jan 19, 2012
Danny Auble authored
Danny Auble authored
Fix a bug where all jobs would be returned even if the flag was set. Patch from Bill Brophy, Bull.

- Dec 28, 2011
Morris Jette authored