- May 06, 2015
-
-
Morris Jette authored
-
Danny Auble authored
14.11 worked fine, but there were changes in 15.08 that weren't addressed in the original patch.
-
Danny Auble authored
utilization.
-
David Bigagli authored
-
- May 05, 2015
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Dead assignments, possible zero divide, and wrong data types
-
David Bigagli authored
Minor typo fixes
-
Christopher Bottoms authored
-
Morris Jette authored
Modify all tests to use cancel_job() rather than "exec scancel ..." so that we can do better error handling all in one place. There were several places in the globals module that printed "FAILURE" and set a local "exit_code" variable. I added "global exit_code" to those functions in hopes of those failures being reflected in the test's exit code.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Conflicts: src/plugins/select/cons_res/job_test.c
-
Morris Jette authored
-
Morris Jette authored
Mostly replacing white-space with tabs, also some minor movement of logic
-
Morris Jette authored
Also includes some cosmetic changes. Initialize variables to avoid invalid memory free.
-
- May 04, 2015
-
-
Ryan Cox authored
Here is information about this patch and the reasons for it: http://tech.ryancox.net/2015/04/caller-id-handling-ssh-launched-processes-in-slurm.html

As discussed previously, here is a patch against master (branched this morning at 709f6504). It works, though I'm sure it has some rough edges that you'll find. I had to export a few symbols that weren't there from stepd_api.[ch]. A lot of the code I had to modify is new territory for me, so it's likely I made many mistakes.

There are a few minor things I might end up wanting to change (I'm not exactly in love with some of the variable or function names I chose, though I can live with them). I might make a few minor tweaks to the pam module as well, but it won't affect the RPC code.

Currently the README is written like a manpage. I might turn it into a man page and say "read the manpage" in the README. Here is an excerpt from the README that states how decisions are made:

1) Check the local stepds for a count of jobs owned by the non-root user
   a) If none, deny (option action_no_jobs)
   b) If only one, adopt the process into that job
   c) If multiple, continue
2) Determine src/dst IP/port of socket
3) Issue callerid RPC to slurmd at IP address of source
   a) If the remote slurmd can identify the source job, adopt into that job
   b) If not, continue
4) Pick a random local job from the user to adopt into (option action_unknown)

I tried to thoroughly document the code, so hopefully it makes sense.

Also, I noticed that one of the stepd functions returns a uid_t which is set to -1 on error. The problem with that is that Linux's uid_t is uint32_t.

One area of concern in the code is the stepd calls in the pam_slurm_adopt.c code. I hope I'm doing enough error handling there, but maybe not. What happens if a step is completing or if the step data is still around even though it's actually dead?

The code to actually adopt processes is currently a no-op. That will depend on having the allocation step code added. I haven't checked yet to see if all the relevant plugins (proctrack, jobacct_gather, etc.) have hooks to add a new process to the plugin. If not, they will have to be added as well.

Lastly, I exceed 80 characters on lines with user-visible strings since Slurm follows the Linux kernel coding style. Chapter 2 of https://www.kernel.org/doc/Documentation/CodingStyle says "never break user-visible strings... because that breaks the ability to grep for them" (which I have wished Slurm followed, by the way, since I have hit that issue). I know in the past that you wanted even those lines to be wrapped, but I figured I would ask if anything has changed :)
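The decision procedure quoted from the README in this message can be sketched roughly as follows. This is an illustration only, not the patch's code: `choose_job_to_adopt` and the `lookup_remote_job` callback are hypothetical names, the callback stands in for steps 2 and 3 (determining the socket's src/dst and issuing the callerid RPC), and the configurable options action_no_jobs and action_unknown are hard-coded to the behavior the excerpt describes.

```python
import random
from typing import Callable, Optional, Sequence


def choose_job_to_adopt(
    local_job_ids: Sequence[int],
    lookup_remote_job: Callable[[], Optional[int]],
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Return the job ID to adopt the process into, or None to deny access."""
    # Step 1: count the user's jobs known to the local stepds.
    if not local_job_ids:
        return None  # 1a: no jobs, deny (assumed setting of action_no_jobs)
    if len(local_job_ids) == 1:
        return local_job_ids[0]  # 1b: exactly one job, adopt into it

    # Steps 2-3: the hypothetical callback hides the socket lookup and the
    # callerid RPC to the slurmd at the source address.
    remote_job = lookup_remote_job()
    if remote_job is not None:
        return remote_job  # 3a: the remote slurmd identified the source job

    # Step 4: fall back to a random local job (assumed setting of action_unknown).
    return (rng or random).choice(list(local_job_ids))
```

For example, `choose_job_to_adopt([], lambda: None)` denies access, and `choose_job_to_adopt([101, 102], lambda: 102)` adopts into job 102 because the (simulated) remote lookup identified it.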
-
Morris Jette authored
-
Morris Jette authored
-
Alejandro Sanchez authored
-
- May 02, 2015
-
-
Morris Jette authored
-
- May 01, 2015
-
-
Morris Jette authored
-
Morris Jette authored
The "flags" option was not being forwarded to the slurmctld, but was always set to 0.
-
Morris Jette authored
Change the scancel command to always use the job_id string based API. Add retry logic to the job_id string path. Add more checking for error codes and use appropriate exit codes. Use NO_VAL rather than "-1" for unset signal value.
-
Morris Jette authored
-
David Bigagli authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Jens Svalgaard Kohrt authored
-
Morris Jette authored
Change the temporary file names used by two tests to include the test ID number, so we can see where they came from if left around.
-
Morris Jette authored
-
Morris Jette authored
The jobcomp/elasticsearch plugin's Makefile.am file was accidentally not added to GIT...
-
Morris Jette authored
In the course of testing some scancel changes, a bunch of tests generated "FAILURE" messages due to job cancellation failures, but the tests reported "SUCCESS" at the end and an exit code of zero. This patch adds a check for the return value of the "cancel_job" procedure.
-
- Apr 30, 2015
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
In the slurmctld communication agent, make the thread timeout the configured value of MessageTimeout (or 30 seconds, whichever is larger) rather than a fixed 30 seconds.
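The timeout rule in this commit message amounts to taking the larger of two values. A minimal sketch, assuming the timeout is expressed in whole seconds; the function and constant names are hypothetical and do not come from the slurmctld source:

```python
DEFAULT_AGENT_TIMEOUT_SECS = 30  # the previous fixed value, now used as a floor


def agent_thread_timeout(message_timeout: int) -> int:
    """Return the agent thread timeout: MessageTimeout, but never below 30 seconds."""
    return max(message_timeout, DEFAULT_AGENT_TIMEOUT_SECS)
```

With a configured MessageTimeout of 10 the thread timeout stays at 30 seconds; with a MessageTimeout of 60 it becomes 60 seconds.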
-
Morris Jette authored
-
Morris Jette authored
Conflicts: src/scancel/scancel.c src/scancel/scancel.h
-
Morris Jette authored
Fix scancel bug which could return an error on an attempt to signal a job step. A simple "scancel 12.3" to signal a specific job step would fail. Adding another option (say "-i", "--partition=", etc.) would avoid the failure.
-