- Mar 20, 2013
- Morris Jette authored
- Hongjia Cao authored
- Danny Auble authored
  cluster.
- jette authored
- jette authored
- jette authored
- Morris Jette authored
- Mar 19, 2013
- Morris Jette authored
  Conflicts: src/plugins/sched/backfill/backfill.c
- Don Lipari authored
- Morris Jette authored
- Hongjia Cao authored
  select()/FD_ISSET() does not work for file descriptors larger than 1023.
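  (Background, not part of the commit message: select() uses a fixed-size
  fd_set bitmask of FD_SETSIZE bits, typically 1024, so FD_SET/FD_ISSET on
  larger descriptor values index past the mask. A minimal sketch of the
  usual poll()-based alternative follows; the helper name is illustrative,
  not the actual SLURM change.)

      #include <poll.h>

      /* Wait until fd is readable or timeout_ms elapses.  Unlike select(),
       * poll() takes an array of descriptors rather than a fixed-size
       * bitmask, so it also works for fd values >= FD_SETSIZE (1024).
       * Returns 1 if readable, 0 on timeout, -1 on error. */
      static int wait_readable(int fd, int timeout_ms)
      {
          struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
          int rc = poll(&pfd, 1, timeout_ms);

          if (rc <= 0)
              return rc;          /* 0 = timeout, -1 = error (see errno) */
          return (pfd.revents & POLLIN) ? 1 : -1;
      }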
- Morris Jette authored
- Hongjia Cao authored
  avoid add/remove node resource of job if the node is lost by resize

  I found another case where an idle node cannot be allocated. It can be
  reproduced as follows:

  1. Run a job with the -k option:

     [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
     srun: error: Node failure on cn28
     srun: error: Node failure on cn28
     srun: error: cn28: task 10: Killed
     ^Csrun: interrupt (one more within 1 sec to abort)
     srun: tasks 0-9: running
     srun: task 10: exited abnormally
     ^Csrun: sending Ctrl-C to job 106120.0
     srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

  2. Set a node down and then set it idle:

     [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
     [root@mn0 ~]# scontrol update nodename=cn28 state=idle

  3. Restart slurmctld:

     [root@mn0 ~]# service slurm restart
     stopping slurmctld:    [ OK ]
     slurmctld is stopped
     starting slurmctld:    [ OK ]

  4. Cancel the job.

  The node that was set down is then left unavailable:

     [root@mn0 ~]# sinfo -n cn[18-28]
     PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
     work*     up    infinite     11  idle cn[18-28]

     [root@mn0 ~]# srun -w cn[18-28] hostname
     srun: job 106122 queued and waiting for resources

     [root@mn0 slurm]# grep cn28 slurmctld.log
     [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
     [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use

  I made an attempt to fix this with the attached patch. Please review it.
- Morris Jette authored
- Morris Jette authored
  I don't believe save_time_limit was redundant. At least in this case:

      if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
          if (orig_time_limit == NO_VAL)
              orig_time_limit = comp_time_limit;
          job_ptr->time_limit = orig_time_limit;
          [...]

  So later, when updating the db,

      if (save_time_limit != job_ptr->time_limit)
          jobacct_storage_g_job_start(acct_db_conn, job_ptr);

  will cause the db to be updated, while

      if (orig_time_limit != job_ptr->time_limit)
          jobacct_storage_g_job_start(acct_db_conn, job_ptr);

  will not, because job_ptr->time_limit now equals orig_time_limit.
- Morris Jette authored
  Conflicts:
      src/db_api/cluster_report_functions.c
      src/plugins/sched/backfill/backfill.c
- Morris Jette authored
- Don Lipari authored
  If the job's time limit is modified down toward --time-min by the
  backfill scheduler, also update the job's time limit in the database;
  without this change, the database keeps the original limit.
- Mar 18, 2013
- Morris Jette authored
- Mar 14, 2013
- Danny Auble authored
- Morris Jette authored
- Danny Auble authored
- Danny Auble authored
- Danny Auble authored
- Morris Jette authored
- Morris Jette authored
  Add milliseconds to the default log message header (both RFC 5424 and
  ISO 8601 time formats). Milliseconds logging can be disabled with the
  configure parameter "--disable-log-time-msec". The default time format
  changes to ISO 8601 (without time zone information); specify
  "--enable-rfc5424time" to restore the time zone information.
- Mar 13, 2013
- Morris Jette authored
  Add milliseconds to the default log message header with the (default)
  RFC 5424 time format. Milliseconds logging can be disabled with the
  configure parameter "--enable-rfc5424time-secs". A sample time stamp:
  "2013-03-13T14:28:17.767-07:00".
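  (For reference, a minimal sketch of producing a timestamp in this format
  with gettimeofday() and strftime(); it only illustrates the format and is
  not SLURM's actual logging implementation.)

      #include <stdio.h>
      #include <sys/time.h>
      #include <time.h>

      /* Format the current local time as ISO 8601 with milliseconds and a
       * numeric UTC offset, e.g. "2013-03-13T14:28:17.767-07:00". */
      static void iso8601_msec(char *buf, size_t len)
      {
          struct timeval tv;
          struct tm tm;
          char date[32], zone[8];

          gettimeofday(&tv, NULL);
          localtime_r(&tv.tv_sec, &tm);
          strftime(date, sizeof(date), "%Y-%m-%dT%H:%M:%S", &tm);
          strftime(zone, sizeof(zone), "%z", &tm);   /* e.g. "-0700" */
          /* Re-insert the colon in the offset: "-0700" -> "-07:00". */
          snprintf(buf, len, "%s.%03ld%.3s:%s", date,
                   (long)(tv.tv_usec / 1000), zone, zone + 3);
      }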
- David Bigagli authored
- Morris Jette authored
  Conflicts: doc/man/man1/sbatch.1
- Morris Jette authored
- Morris Jette authored
  If a step requests more CPUs than are possible within the specified node
  count of the job allocation, return ESLURM_TOO_MANY_REQUESTED_CPUS rather
  than ESLURM_NODES_BUSY, so the step fails immediately instead of being
  retried.
- Morris Jette authored
- Danny Auble authored
- Danny Auble authored
- Mar 12, 2013
- Morris Jette authored
  Conflicts: src/plugins/select/cons_res/select_cons_res.c
- Morris Jette authored
- Magnus Jonsson authored
  I found a bug in cons_res/select_p_select_nodeinfo_set_all. If a node is
  part of two (or more) partitions, the code will only count the number of
  cores/cpus in the partition that has the most running jobs on that node.
  Patch attached to fix the problem.

  I also added a new function to bitstring to count the number of set bits
  in a range (bit_set_count_range) and made a minor improvement to
  bit_set_count while reviewing the range version.

  Best regards,
  Magnus
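  (A minimal sketch of what a range popcount like bit_set_count_range can
  look like, written against a plain uint64_t word array; the real SLURM
  bitstring code operates on its own bitstr_t type, and the names here are
  illustrative.)

      #include <stddef.h>
      #include <stdint.h>

      /* Count set bits in positions [start, end) of a bit array stored as
       * 64-bit words, least-significant bit first.  Partial words at the
       * edges are masked; everything else is a straight popcount, which is
       * also the trick that speeds up a full-array count. */
      static size_t bit_count_range(const uint64_t *words,
                                    size_t start, size_t end)
      {
          size_t count = 0;

          for (size_t i = start; i < end; ) {
              size_t off  = i % 64;
              uint64_t w  = words[i / 64] >> off;
              size_t take = 64 - off;            /* bits left in this word */

              if (take > end - i)
                  take = end - i;
              if (take < 64)
                  w &= (UINT64_C(1) << take) - 1;  /* drop unwanted bits */

              count += (size_t)__builtin_popcountll(w);
              i += take;
          }
          return count;
      }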
- Mar 11, 2013
- Morris Jette authored
- Morris Jette authored
  This permits default reservation names to be more easily managed.
- Andy Wettstein authored