Jun 03, 2013
    • Fix for job step allocation with required hostlist and exclusive option · 523b1992
      jette authored
      Previously, if the required node had no available CPUs left, other
      nodes in the job allocation would be used.
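      A minimal sketch of the affected case (host names, CPU counts, and
      step commands here are hypothetical): within an allocation, a second
      exclusive step that requires an already-busy node should now wait for
      CPUs on that node rather than being placed on another node of the
      allocation.

      $ salloc -N 2
      $ # step_a occupies all CPUs of node01 (assuming 8 CPUs per node)
      $ srun -w node01 -n 8 --exclusive step_a &
      $ # previously this step could start on node02 despite -w node01;
      $ # with the fix it pends until CPUs on node01 are free again
      $ srun -w node01 -n 1 --exclusive step_b &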
    • restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our Slurm jobs to restart
      successfully after a checkpoint.  For this test, I'm using sbatch and
      a simple, single-threaded executable.  Slurm is 2.5.4, BLCR is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from BLCR in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
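      The "max_nodes == 0" line above is the key: when the job state was
      saved at checkpoint time, an unset maximum node count was written out
      as 0 instead of NO_VAL, and _job_create rejects that descriptor on
      restart ("Node count specification invalid").  This commit restores
      max_nodes of the descriptor to NO_VAL when checkpointing.  As a
      hedged, untested workaround for unpatched versions (inferred from the
      log above, not from the source), submitting with an explicit node
      count, so that the maximum is set rather than left unset, may avoid
      the error:

      $ # -N 1 sets both the minimum and maximum node count to 1
      $ sbatch -n 1 -N 1 -t 12:00:00 bin/bowtie-ex.sh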