- Aug 02, 2012
Morris Jette authored
This patch adds logic to work around advanced reservations better.
-
- Aug 01, 2012
Morris Jette authored
Change node_req field in struct job_resources from 8 to 32 bits so we can run more than 256 jobs per node.
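A minimal sketch of why the width matters (the `node_req` field name comes from the commit message; the surrounding structs are illustrative stand-ins, not the real `struct job_resources`):

```c
#include <stdint.h>

/* With an 8-bit counter, the 257th job on a node wraps the count back
 * toward zero; a 32-bit counter does not. */
struct job_resources_old { uint8_t  node_req; };  /* before: 8 bits  */
struct job_resources_new { uint32_t node_req; };  /* after: 32 bits  */

uint32_t count_jobs_old(int jobs)
{
    struct job_resources_old r = {0};
    for (int i = 0; i < jobs; i++)
        r.node_req++;                 /* silently wraps at 256 */
    return r.node_req;
}

uint32_t count_jobs_new(int jobs)
{
    struct job_resources_new r = {0};
    for (int i = 0; i < jobs; i++)
        r.node_req++;                 /* no wrap for realistic counts */
    return r.node_req;
}
```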
-
Danny Auble authored
-
Morris Jette authored
-
Danny Auble authored
correctly as of IBM driver V1R1M1 efix 008.
-
Morris Jette authored
-
Danny Auble authored
configured correctly.
-
- Jul 31, 2012
Danny Auble authored
current or in the past.
-
Mark Nelson authored
from Mark Nelson
-
Janne Blomqvist authored
Using the syscalls directly rather than calling bin/(u)mount via system() avoids a few fork + exec calls, and provides better error handling if something goes wrong. Users of this functionality are also updated to use slurm_strerror in order to provide a more informative error message. The mount and umount syscalls are Linux-specific, but so are cgroups so no portability is lost.
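A hedged sketch of the approach described above: call `mount(2)`/`umount(2)` directly instead of fork+exec of `/bin/(u)mount`, and report failures with `strerror()` (standing in here for `slurm_strerror()`, which maps SLURM-specific error codes). The function names, paths, and mount flags are illustrative, not the actual SLURM cgroup-plugin symbols; like cgroups, this is Linux-only.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Mount a cgroup hierarchy at `path` without spawning /bin/mount. */
int xcgroup_mount(const char *path)
{
    if (mount("cgroup", path, "cgroup",
              MS_NOSUID | MS_NOEXEC | MS_NODEV, NULL) < 0) {
        /* errno gives a precise reason; system("mount ...") would not. */
        fprintf(stderr, "unable to mount %s: %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}

/* Unmount the hierarchy at `path` without spawning /bin/umount. */
int xcgroup_umount(const char *path)
{
    if (umount(path) < 0) {
        fprintf(stderr, "unable to umount %s: %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}
```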
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
the current plugin has been loaded when using runjob_mux_refresh_config
-
- Jul 30, 2012
Morris Jette authored
-
- Jul 27, 2012
Morris Jette authored
I would like to make two changes to this:
1) Since the reservation name can easily exceed 9 characters, the field should be however large it needs to be, without truncating the name. I did this by scanning the names and then setting the field width to the longest one.
2) The other headers are in capitals, so I changed "ResName State StartTime EndTime Duration Nodelist" to "RESV_NAME STATE START_TIME END_TIME DURATION NODELIST".
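The two changes can be sketched as follows (helper names are illustrative, not the actual sinfo/scontrol symbols):

```c
#include <stdio.h>
#include <string.h>

/* Change 1): size the RESV_NAME column to the longest reservation
 * name so names are never truncated, but never narrower than the
 * header itself. */
int resv_name_width(const char **names, int n)
{
    int width = (int) strlen("RESV_NAME");
    for (int i = 0; i < n; i++) {
        int len = (int) strlen(names[i]);
        if (len > width)
            width = len;
    }
    return width;
}

/* Change 2): print the headers in capitals, matching the other
 * report headers; '*' takes the computed width at run time. */
void print_resv_header(int width)
{
    printf("%-*s %-10s %-19s %-19s %-9s %s\n", width,
           "RESV_NAME", "STATE", "START_TIME", "END_TIME",
           "DURATION", "NODELIST");
}
```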
-
- Jul 26, 2012
Morris Jette authored
-
Morris Jette authored
Correct parsing of srun/sbatch input/output/error file names so that only the name "none" is mapped to /dev/null and not any file name starting with "none" (e.g. "none.o"). This fixes bug #98.
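A minimal sketch of the fix described above (the function name is illustrative, not the actual srun/sbatch symbol):

```c
#include <stdbool.h>
#include <string.h>

/* Map a file name to /dev/null only when it is exactly "none".
 * A prefix test such as strncmp(name, "none", 4) == 0 would wrongly
 * match names like "none.o"; an exact comparison does not. */
bool maps_to_dev_null(const char *name)
{
    return name != NULL && strcmp(name, "none") == 0;
}
```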
-
- Jul 24, 2012
Morris Jette authored
Gres: If a gres has a count of one and an associated file, then on reconfiguration the node's bitmap was not cleared, resulting in an underflow upon job termination or removal from the scheduling matrix by the backfill scheduler.
-
- Jul 23, 2012
Morris Jette authored
Cray and BlueGene - Do not treat lack of usable front-end nodes when the slurmctld daemon starts as a fatal error. Also preserve the correct front-end node for jobs when there is more than one front-end node and the slurmctld daemon restarts.
-
- Jul 19, 2012
Danny Auble authored
while it is attempting to free underlying hardware is marked in error making small blocks overlapping with the freeing block. This only applies to dynamic layout mode.
-
Alejandro Lucero Palau authored
-
- Jul 16, 2012
Morris Jette authored
-
- Jul 13, 2012
Danny Auble authored
is always set when sending or receiving a message.
-
Tim Wickberg authored
-
- Jul 12, 2012
Danny Auble authored
than 1 midplane but not the entire allocation.
-
Danny Auble authored
multi midplane block allocation.
-
Danny Auble authored
-
Danny Auble authored
where other blocks on an overlapping midplane are running jobs.
-
- Jul 11, 2012
Danny Auble authored
hardware is marked bad remove the larger block and create a block over just the bad hardware making the other hardware available to run on.
-
Danny Auble authored
allocation.
-
Danny Auble authored
for a job to finish on it the number of unused cpus wasn't updated correctly.
-
- Jul 10, 2012
Morris Jette authored
When using the jobcomp/script interface, we have noticed the NODECNT environment variable is off by one when logging completed jobs in the NODE_FAIL state (though the NODELIST is correct). This appears to be because in many places job_completion_logger() is called after deallocate_nodes(), which appears to decrement job->node_cnt for DOWN nodes.

If job_completion_logger() only called the job completion plugin, then I would guess that it might be safe to move this call ahead of deallocate_nodes(). However, it seems like job_completion_logger() also does a bunch of accounting stuff (?), so perhaps that would need to be split out first.

Also, there is the possibility that this is working as designed, though if so a well placed comment in the code might be appreciated. If the decreased node count is intended, though, should the DOWN nodes also be removed from the job's NODELIST? - Mark Grondona
-
- Jul 09, 2012
Martin Perry authored
See Bugzilla #73 for a more complete description of the problem. Patch by Martin Perry, Bull.
-
- Jul 06, 2012
Carles Fenoy authored
If a job is submitted to more than one partition, its partition pointer can be set to an invalid value. This can result in the count of CPUs allocated on a node being bad, resulting in over- or under-allocation of its CPUs. Patch by Carles Fenoy, BSC.

Hi all,

After a tough day I've finally found the problem and a solution for 2.4.1. I was able to reproduce the explained behavior by submitting jobs to 2 partitions. This makes the job be allocated in one partition, but in the schedule function the partition of the job is changed to the NON-allocated one. This means the resources cannot be freed at the end of the job.

I've solved this by changing the IS_PENDING test some lines above in the schedule function (job_scheduler.c). This is the code from the git HEAD (line 801). As this file has changed a lot from 2.4.x I have not done a patch, but I'm commenting the solution here. I've moved the if (!IS_JOB_PENDING) after the 2nd line (part_ptr...). This prevents the partition of the job from being changed if it is already starting in another partition.

    job_ptr = job_queue_rec->job_ptr;
    part_ptr = job_queue_rec->part_ptr;
    job_ptr->part_ptr = part_ptr;
    xfree(job_queue_rec);
    if (!IS_JOB_PENDING(job_ptr))
        continue;  /* started in other partition */

Hope this is enough information to solve it. I've just realized (while writing this mail) that my solution has a memory leak, as job_queue_rec is not freed. Regards, Carles Fenoy
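A minimal model of the fix described above, with the queue record freed on the skip path to avoid the leak the author mentions. The types are simplified stand-ins for the real SLURM structures, and the loop body is condensed into a single function for illustration:

```c
#include <stdbool.h>
#include <stdlib.h>

struct part_record { int id; };
struct job_record  { bool pending; struct part_record *part_ptr; };
struct job_queue_rec {
    struct job_record  *job_ptr;
    struct part_record *part_ptr;
};

/* Test IS_JOB_PENDING *before* overwriting job_ptr->part_ptr, so a
 * job already started in another partition keeps the partition it was
 * actually allocated in. Takes ownership of (and frees) the record. */
void schedule_one(struct job_queue_rec *job_queue_rec)
{
    struct job_record  *job_ptr  = job_queue_rec->job_ptr;
    struct part_record *part_ptr = job_queue_rec->part_ptr;

    if (!job_ptr->pending) {      /* started in other partition */
        free(job_queue_rec);      /* do not leak the queue record */
        return;
    }
    job_ptr->part_ptr = part_ptr; /* safe: job is still pending */
    free(job_queue_rec);
}
```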
-
- Jul 03, 2012
Danny Auble authored
there are jobs running on that hardware.
-
Morris Jette authored
-
Alejandro Lucero Palau authored
Add support for advanced reservation for specific cores rather than whole nodes. Current limitations: homogeneous cluster, nodes idle when the reservation is created, and no more than one reservation per node. Code is still under development. Work by Alejandro Lucero Palau, et al., BSC.
-
- Jul 02, 2012
Carles Fenoy authored
correctly when transitioning. This also applies for 2.4.0 -> 2.4.1, no state will be lost. (Thanks to Carles Fenoy)
-
- Jun 29, 2012
Bill Brophy authored
Add reservation flag of Part_Nodes to allocate all nodes in a partition to a reservation and automatically change the reservation as nodes are added to or removed from the partition. Based upon work by Bill Brophy, Bull.
-