- Jul 06, 2013
-
Morris Jette authored
-
- Jul 05, 2013
-
John Thiltges authored
When using ThreadsPerCore > 1, it appears that DefMemPerCPU is being scaled by slurmctld, but not by slurmd/slurmstepd. For example, with ThreadsPerCore=2 and DefMemPerCPU=100, running a single-core job we would expect two threads to be allocated and AllocMem on the assigned node to increase by 200MB. scontrol reports that AllocMem increased by 200MB, but the task/cgroup plugin only sees 100MB of RAM. The problem appears to lie in common/slurm_cred.c:format_core_allocs(): the function counts the job/step cores and multiplies the mem_limit values, but it does not scale the CPU count as slurmd/slurmd/req.c:_check_job_credential() does. See bug 309
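A minimal standalone C sketch of the arithmetic at issue (not Slurm source; alloc_mem_mb is a made-up helper): the per-node memory limit has to be computed from the allocated thread count (cores times ThreadsPerCore), as _check_job_credential() does, not from the core count alone.

    /* Standalone illustration (not Slurm source). With ThreadsPerCore=2 and
     * DefMemPerCPU=100, a one-core allocation owns two CPU threads, so the
     * node's memory limit should be 2 * 100 = 200MB. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t alloc_mem_mb(uint32_t cores_alloc, uint32_t threads_per_core,
                                 uint64_t mem_per_cpu_mb)
    {
        uint32_t cpus_alloc = cores_alloc * threads_per_core; /* scale CPU count */
        return (uint64_t) cpus_alloc * mem_per_cpu_mb;
    }

    int main(void)
    {
        /* Unscaled, as in the buggy path: 1 core * 100MB = 100MB in task/cgroup */
        printf("unscaled: %llu MB\n", (unsigned long long) alloc_mem_mb(1, 1, 100));
        /* Scaled as slurmctld accounts it: 1 core * 2 threads * 100MB = 200MB */
        printf("scaled:   %llu MB\n", (unsigned long long) alloc_mem_mb(1, 2, 100));
        return 0;
    }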
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
-
jette authored
-
jette authored
-
jette authored
-
jette authored
-
- Jul 03, 2013
-
Morris Jette authored
-
Piotr Lesnicki authored
This corrects the Slurm PMI2 client, which otherwise fails in the PMI2_Init() step. The hidden bug was introduced in the standalone client derived from BULL's code, and David's modification made it surface: a size one byte too short was passed to strncpy for keys, and David later replaced strncpy with MPICH's MPIU_Strncpy, which forces a terminating '\0'. Bug 359
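A minimal sketch of the off-by-one described above, with a hypothetical buffer size (this is not the PMI2 client code): a copy size one byte too short can go unnoticed with plain strncpy() into a zeroed buffer, but a routine that forces a terminating '\0' inside that size, as the commit describes MPIU_Strncpy doing, chops the last character of the key.

    /* Illustrative only; KEYLEN and the key value are hypothetical. */
    #include <stdio.h>
    #include <string.h>

    #define KEYLEN 16                        /* hypothetical key buffer size */

    /* Stand-in for a '\0'-forcing copy: always terminates within size n. */
    static void copy_forcing_nul(char *dst, const char *src, size_t n)
    {
        strncpy(dst, src, n - 1);
        dst[n - 1] = '\0';
    }

    int main(void)
    {
        const char *key = "PMI_process_map";     /* 15 chars + '\0'            */
        char dst[KEYLEN] = {0};                  /* pre-zeroed buffer          */

        strncpy(dst, key, KEYLEN - 1);           /* size one too short, but    */
        printf("plain strncpy: %s\n", dst);      /* the zeroed buffer saves it */

        copy_forcing_nul(dst, key, KEYLEN - 1);  /* same too-short size now    */
        printf("forced '\\0':   %s\n", dst);     /* chops the last character   */
        return 0;
    }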
-
Morris Jette authored
-
Danny Auble authored
messed up output. Use the parsable options if you want this kind of view.
-
- Jul 02, 2013
-
Morris Jette authored
-
Danny Auble authored
-
- Jul 01, 2013
-
Morris Jette authored
-
Nathan Yee authored
-
- Jun 28, 2013
-
Morris Jette authored
-
Morris Jette authored
Affects jobs with --exclusive and --cpus-per-task options. Bug 355
-
Stephen Trofinoff authored
A simple one-line fix to the "_adjust_cpus_nppcu" function that I had added as part of the NPPCU functionality. It was not a problem until the squeue patch, because squeue was then updated to use this function, and in that one case the default value of the internal variable "ntasks_per_core" ended up being 0 rather than the 0xffff (65535) sentinel I had previously coded for (as in "select/cons_res"). I therefore added a second clause to the if-statement that checks for the sentinel value, also checking whether it is 0. This resolved the problem. Because we do not usually use "select/serial", I had not noticed this before.
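A hedged sketch of the sentinel check described above (the function name matches, but the body is illustrative, not the actual Slurm code): ntasks_per_core can arrive either as the 0xffff "not set" sentinel used by select/cons_res or, on the squeue path against select/serial, as 0, and both cases must leave the CPU count unchanged.

    /* Sentinel-check sketch; the adjustment formula is placeholder math. */
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t adjust_cpus_nppcu(uint16_t ntasks_per_core,
                                      uint16_t threads_per_core,
                                      uint16_t cpus)
    {
        /* The second clause of this if-statement is the one-line fix: treat 0
         * the same as the 0xffff "not set" sentinel and leave cpus alone. */
        if ((ntasks_per_core == 0xffff) || (ntasks_per_core == 0))
            return cpus;
        return cpus * ntasks_per_core / threads_per_core;  /* placeholder math */
    }

    int main(void)
    {
        printf("%u\n", (unsigned) adjust_cpus_nppcu(0xffff, 2, 8)); /* 8: sentinel */
        printf("%u\n", (unsigned) adjust_cpus_nppcu(0,      2, 8)); /* 8: squeue/0 */
        printf("%u\n", (unsigned) adjust_cpus_nppcu(1,      2, 8)); /* 4: adjusted */
        return 0;
    }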
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
-
Phil Eckert authored
-
Daniel M. Weeks authored
-
Morris Jette authored
This can happen if something outside of Slurm opens the srun socket and writes to it, since the data will not be of a form that Slurm can decode. Bug 354
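A hedged sketch of the defensive handling described above, with a hypothetical message header and magic value (not srun's actual protocol code): bytes written by a foreign process will not decode, so they are logged and discarded rather than treated as a fatal condition.

    /* Defensive-decode sketch; the header layout and magic are made up. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MSG_MAGIC 0x5af3u                /* made-up protocol marker */

    struct msg_header {
        uint16_t magic;
        uint16_t type;
        uint32_t body_len;
    };

    /* Returns 0 on success, -1 if the bytes do not form a valid header. */
    static int decode_header(const unsigned char *buf, size_t len,
                             struct msg_header *hdr)
    {
        if (len < sizeof(*hdr))
            return -1;
        memcpy(hdr, buf, sizeof(*hdr));
        if (hdr->magic != MSG_MAGIC)
            return -1;                       /* garbage from an outside writer */
        return 0;
    }

    int main(void)
    {
        const unsigned char garbage[] = "GET / HTTP/1.0\r\n"; /* e.g. a port scan */
        struct msg_header hdr;

        if (decode_header(garbage, sizeof(garbage) - 1, &hdr) < 0)
            fprintf(stderr, "error: ignoring un-decodable message\n");
        else
            printf("type=%u len=%u\n", (unsigned) hdr.type, (unsigned) hdr.body_len);
        return 0;
    }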
-
Morris Jette authored
Rather than a job
-
Morris Jette authored
This removes logic added three years ago that would automatically set a job's cpus_per_task value in order to reset the job's mem_per_cpu value, scaling cpus_per_task by the same factor. Equivalent logic did not exist in the step allocation logic, so just return an error instead. This change will be made in Slurm version 2.6; this patch is for version 2.5. The original patch introducing the problem is commit cc00cc70b9c90816afc511e0261e449857176332. Bug 352
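A hypothetical sketch of the behavior change (function name and error code invented, not Slurm source): a request whose mem_per_cpu exceeds the limit is now rejected outright instead of being silently rewritten by scaling cpus_per_task.

    /* Behavior-change sketch; names and the error code are invented. */
    #include <stdint.h>
    #include <stdio.h>

    #define ESLURM_MEM_LIMIT 1   /* hypothetical error code */

    static int validate_mem_per_cpu(uint64_t req_mem_per_cpu_mb,
                                    uint64_t max_mem_per_cpu_mb)
    {
        /* Old behavior (removed): shrink mem_per_cpu to the limit and scale
         * cpus_per_task up by the same factor. New behavior: just say no. */
        if (max_mem_per_cpu_mb && (req_mem_per_cpu_mb > max_mem_per_cpu_mb))
            return ESLURM_MEM_LIMIT;
        return 0;
    }

    int main(void)
    {
        /* 4096MB per CPU against a 2048MB limit: rejected, not rewritten. */
        printf("%s\n", validate_mem_per_cpu(4096, 2048) ? "rejected" : "ok");
        return 0;
    }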
-
Danny Auble authored
enum.
-
- Jun 27, 2013
-
Rod Schultz authored
Bug 351
-
Morris Jette authored
-
Morris Jette authored
-
Matthieu Hautreux authored
-
Morris Jette authored
This extends the logic of commit ba58d59c to the following RPC types: job complete, batch script complete, and job step complete
-
Danny Auble authored
-
- Jun 26, 2013
-
Martin Perry authored
In acct_gather_energy/rapl initialization, if the fopen of /proc/cpuinfo fails we should treat this as a fatal condition rather than continue. Patch is attached. Problem found by Coverity tool, CID 20186. Bug 331
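A minimal sketch of the hardening described above (not the plugin source; Slurm's own fatal() logging is stood in for by an error message and exit): a failed fopen() of /proc/cpuinfo ends initialization instead of continuing with an unusable setup.

    /* Hardening sketch: bail out if /proc/cpuinfo cannot be opened. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        FILE *fp = fopen("/proc/cpuinfo", "r");
        if (!fp) {
            /* The plugin would call Slurm's fatal(); exiting here has the
             * same effect of refusing to continue after the failure. */
            fprintf(stderr, "fatal: cannot open /proc/cpuinfo: %s\n",
                    strerror(errno));
            exit(1);
        }
        /* ... parse CPU/socket information needed for RAPL readings ... */
        fclose(fp);
        return 0;
    }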
-
Morris Jette authored
-
Morris Jette authored
-
Dominik Friedrich authored
-
Morris Jette authored
This applies the same logic added for job signal and batch job submit in commit ba58d59c
-