  1. Jun 03, 2013
    • Fix for job step allocation with required hostlist and exclusive option · 523b1992
      jette authored
      Previously, if the required node had no CPUs available, other
      nodes in the job allocation would be used instead.
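      A minimal sketch of the corrected behavior, in C with hypothetical
      names (the real fix lives in Slurm's step allocation code): when a
      step names a required node and requests exclusive access, the
      selector must fail if that node has no free CPUs, rather than
      silently picking another node from the allocation.

      #include <string.h>

      /* Illustrative only: hypothetical types, not Slurm's internals. */
      typedef struct {
          const char *name;
          int idle_cpus;              /* CPUs not yet claimed by other steps */
      } node_t;

      /* Return 0 and set *picked on success; return -1 if the required
       * node cannot satisfy the request.  Never substitute another node. */
      int pick_step_node(const node_t *nodes, int node_cnt,
                         const char *required, int exclusive,
                         const node_t **picked)
      {
          for (int i = 0; i < node_cnt; i++) {
              if (strcmp(nodes[i].name, required) != 0)
                  continue;
              if (exclusive && nodes[i].idle_cpus == 0)
                  return -1;          /* required node full: fail outright */
              *picked = &nodes[i];
              return 0;
          }
          return -1;                  /* required node not in the allocation */
      }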
    • restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our Slurm jobs to restart
      successfully after a checkpoint.  For this test, I'm using sbatch and
      a simple, single-threaded executable.  Slurm is 2.5.4, BLCR is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from blcr in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
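      As the title says, the fix restores max_nodes in the saved job
      descriptor to NO_VAL (no explicit limit), so validation on restart
      no longer sees max_nodes == 0. A hedged sketch using the public
      job_desc_msg_t and NO_VAL from slurm.h; the helper name is
      hypothetical, not the actual checkpoint code path:

      #include <slurm/slurm.h>        /* job_desc_msg_t, NO_VAL */

      /* Hypothetical helper on the checkpoint write path.  A saved
       * max_nodes of 0 fails validation on restart ("_job_create:
       * max_nodes == 0"), so restore it to NO_VAL before packing. */
      static void fix_ckpt_desc(job_desc_msg_t *desc)
      {
          if (desc->max_nodes == 0)
              desc->max_nodes = NO_VAL;
          /* ... pack desc and write the checkpoint file ... */
      }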
  2. May 10, 2013
    • correctly set alloc state of node in select/linear · 0ef764b5
      Hongjia Cao authored
      Fix for the following problem: if a node is excised from a job and a
      reconfiguration (e.g., a partition update) is done while the job is
      still running, the node is left in state idle but is no longer
      available until the next reconfiguration or restart of slurmctld
      after the job finishes.
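      A sketch of the invariant the fix enforces, with hypothetical
      bookkeeping (select/linear's real structures differ): when a node is
      excised from a running job, the plugin's allocation count for that
      node must be decremented, otherwise a reconfiguration rebuilds the
      node as allocated and it stays unusable after the job ends.

      /* Hypothetical per-node state kept by a selector plugin. */
      typedef struct {
          int run_job_cnt;    /* running jobs allocated to this node */
      } node_state_t;

      /* Called when a node is removed from a still-running job; keeping
       * the count accurate lets a later reconfiguration rebuild a
       * consistent allocation state. */
      void excise_node(node_state_t *node)
      {
          if (node->run_job_cnt > 0)
              node->run_job_cnt--;
      }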
  3. May 03, 2013
    • Make test more robust · 2592eb5e
      jette authored
      Make the test work if the current working directory is not in the
      search path
      Check for the appropriate task rank on POE-based systems
      Disable the entire test on POE systems
  4. May 02, 2013
    • POE - Fix logic binding tasks to CPUs. · 48e164e0
      jette authored
      Without this change, pmdv12 was bound to a single CPU and could not
      use all of the resources allocated to the job step for the tasks
      that it launches.
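      The gist of the fix, illustrated with the standard Linux affinity
      API rather than the actual POE code: the process-manager daemon
      (pmdv12) should run with an affinity mask covering every CPU
      allocated to the step, so the tasks it forks can each be bound to
      their own CPUs.

      #define _GNU_SOURCE
      #include <sched.h>

      /* Illustrative only: widen the calling process's affinity to the
       * given CPU ids (e.g., all CPUs allocated to the job step). */
      int widen_affinity(const int *cpus, int ncpus)
      {
          cpu_set_t mask;
          CPU_ZERO(&mask);
          for (int i = 0; i < ncpus; i++)
              CPU_SET(cpus[i], &mask);
          return sched_setaffinity(0, sizeof(mask), &mask); /* 0 = self */
      }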