- Dec 14, 2013
-
-
Danny Auble authored
226b49a3
-
Morris Jette authored
Test would periodically fail due to expect timing. This seems to fix the problem.
-
Danny Auble authored
-
- Dec 13, 2013
-
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
Fix slurmstepd race condition when separate threads are reading and modifying the job's environment, which can result in the slurmstepd failing with an invalid memory reference. Observed at shutdown when trying to run the task epilog and trying to read the env var: SLURM_STEP_KILLED_MSG_NODE_ID
-
Morris Jette authored
We do not want to look at the core file, so avoid generating it and then having to manually clear it later.
-
- Dec 12, 2013
-
-
Morris Jette authored
Without this change, sstat would try to unpack accounting data that was never packed, resulting in message unpack errors.
-
Morris Jette authored
There were some parsing issues and the test was not as general as it should have been.
-
Danny Auble authored
-
Danny Auble authored
throw away initialized variable.
-
Morris Jette authored
Without this patch, free() is called on a random memory location (i.e. whatever is on the stack), which can result in slurmstepd dying and a completed job not being purged in a timely fashion.
-
- Dec 11, 2013
-
-
Morris Jette authored
-
Danny Auble authored
-
Morris Jette authored
Fix race condition in authentication credential creation that could corrupt memory. (NOTE: This race condition has existed since 2003 and would be exceedingly rare.)
-
- Dec 10, 2013
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- Dec 09, 2013
-
-
Morris Jette authored
This is needed for job arrays with discontiguous task ID values (e.g. "123_[1,3,5,...99999]")
-
Morris Jette authored
Previously, job arrays were only listed with their native job ID (e.g. 123_0 listed as 123, 123_1 as 124, etc). Now the job ID is listed using both formats (e.g. "123_1 (124)"). The same format is used for job step IDs (e.g. "123_1.2 (124.2)").
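The combined format described above can be sketched as follows. This is only an illustration in Python, not the Slurm C code; the function name `format_job_id` and its parameters are hypothetical.

```python
def format_job_id(array_job_id, task_id, native_job_id, step_id=None):
    """Render a job array element as '<array>_<task> (<native>)'.

    Hypothetical helper: when step_id is given, the step suffix is
    appended to both the array form and the native form.
    """
    if step_id is None:
        return f"{array_job_id}_{task_id} ({native_job_id})"
    return f"{array_job_id}_{task_id}.{step_id} ({native_job_id}.{step_id})"


print(format_job_id(123, 1, 124))      # job array element
print(format_job_id(123, 1, 124, 2))   # job step of that element
```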
-
- Dec 08, 2013
-
-
jette authored
-
jette authored
If the GRES is associated with specific files AND the GRES count is reset using scontrol AND the slurmd is restarted either without a gres.conf file or with a count and no specific files AND the GRES count is then increased using scontrol, the GRES bitmap will not match its count. This fixes the root cause of the mismatch between bitmap size and GRES count and should render the rebuilding of the bitmap unnecessary. The rebuilding was handled in commits ec4df3bf and 1712d619.
-
- Dec 07, 2013
-
-
Danny Auble authored
-
Philip D. Eckert authored
-
- Dec 06, 2013
-
-
Jason Bacon authored
Using CPU: Intel(R) Pentium(R) 4 CPU 2.40GHz (2392.04-MHz 686-class CPU) Origin = "GenuineIntel" Id = 0xf27 Family = f Model = 2 Stepping = 7 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> It's also using an older version of hwloc (1.3.1) and I have not yet tested it with a newer one, but since 0 and -1 are legitimate return values for hwloc_get_nbobjs_by_type(), I think they should be handled in any case. From the hwloc_get_nbobjs_by_type() man page: static inline int hwloc_get_nbobjs_by_type (hwloc_topology_t topology, hwloc_obj_type_t type) [static] Returns the width of level type type. If no object for that type exists, 0 is returned. If there are several levels with objects of that type, -1 is returned. I'm attaching a smarter patch that handles both 0 and -1 return values for both CORE and SOCKET. It logs a warning if it has to fudge a 0 return code and bails out with a helpful error message for -1, which I have no idea how to handle. At least people won't have to waste time tracking down the problem this way. Happy Friday, Jason
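The handling policy described in this message (warn on 0, fail on -1) might look roughly like the sketch below. It is written in Python for illustration only and does not call hwloc; `usable_object_count` is a hypothetical name, and treating a count of 0 as 1 is an assumption, since the message does not say what value the patch substitutes.

```python
import logging


def usable_object_count(count, type_name):
    """Interpret a count as returned by hwloc_get_nbobjs_by_type().

    Hypothetical helper. -1 (several levels of that type) is treated as
    fatal; 0 (no object of that type) is logged and replaced by 1, which
    is an assumed fallback value, not taken from the patch.
    """
    if count == -1:
        raise RuntimeError(
            f"hwloc reports several levels with {type_name} objects; "
            "cannot determine the count"
        )
    if count == 0:
        logging.warning("hwloc reports no %s objects; assuming 1", type_name)
        return 1
    return count


print(usable_object_count(4, "CORE"))
print(usable_object_count(0, "SOCKET"))
```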
-
Trofinoff Stephen authored
This adds a mechanism to kill a hung apbasil command
-
Morris Jette authored
error introduced in commit ec4df3bf
-
Jason Bacon authored
-
Morris Jette authored
An abort has been reported if the node's gres count differs from its bitmap. This has been induced by changing the count manually (e.g. "scontrol update nodename=tux123 gres=gpu:4"). I have not been able to reproduce this problem, but this will resize the bitmap in order to avoid the assert failure.
-
- Dec 05, 2013
-
-
Danny Auble authored
-
Danny Auble authored
news.html.
-
- Dec 04, 2013
-
-
Morris Jette authored
-
Morris Jette authored
Previous logic never reopened the file, preventing proper functioning of logrotate.
-
- Dec 03, 2013
-
-
Morris Jette authored
Use hash function to locate job records for improved performance.
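As a generic illustration of locating job records through a hash function instead of scanning a list, here is a minimal Python sketch. The class `JobTable`, the modulo hash, and the chained buckets are assumptions made for the example and are not taken from the commit.

```python
class JobTable:
    """Hypothetical job table: fixed number of buckets, chained on collision."""

    def __init__(self, size=1024):
        self._buckets = [[] for _ in range(size)]

    def _index(self, job_id):
        # Assumed hash function: job ID modulo the table size.
        return job_id % len(self._buckets)

    def add(self, job_id, record):
        self._buckets[self._index(job_id)].append((job_id, record))

    def find(self, job_id):
        # Only the one bucket selected by the hash is searched.
        for stored_id, record in self._buckets[self._index(job_id)]:
            if stored_id == job_id:
                return record
        return None


table = JobTable(size=8)
table.add(123, "job 123")
table.add(131, "job 131")  # lands in the same bucket as 123 when size is 8
print(table.find(131))
```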
-
Morris Jette authored
Change partition write lock to a read lock as we use a different mechanism for hidden partitions in getting individual jobs.
-
Morris Jette authored
-
Morris Jette authored
Correct logic returning remaining job dependencies in job information reported by scontrol and squeue. Eliminates vestigial descriptors with no job ID values (e.g. "afterany"). As dependencies were removed, the job ID values were removed from the strings, but not the descriptors. This eliminates both. It also checks the full job ID to make sure we do not remove "afterany:1234" when job "123" completes.
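The cleanup described above can be sketched in Python as follows. This is an illustration, not the Slurm implementation; `remove_dependency` is a hypothetical name, and the sketch assumes every clause has the form `descriptor:jobid[:jobid...]` with clauses separated by commas.

```python
def remove_dependency(dependency, finished_job_id):
    """Drop finished_job_id from a dependency string.

    Hypothetical helper. Job IDs are compared as whole fields, so job 123
    completing does not touch "afterany:1234". A descriptor left with no
    job IDs is removed together with its last ID.
    """
    finished = str(finished_job_id)
    kept = []
    for clause in dependency.split(","):
        if not clause:
            continue
        descriptor, *job_ids = clause.split(":")
        remaining = [job_id for job_id in job_ids if job_id != finished]
        if remaining:
            kept.append(":".join([descriptor, *remaining]))
    return ",".join(kept)


print(remove_dependency("afterany:123:1234,afterok:55", 123))
```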
-
- Dec 02, 2013
-
-
Morris Jette authored
Fix race condition on batch job termination that could result in a job exit code of 0xfffffffe if the slurmd on node zero registers its active jobs at the same time that slurmstepd is recording the job's exit code. Bug 535.
-
Morris Jette authored
-
David Bigagli authored
-