  1. Jun 04, 2013
  2. Jun 03, 2013
    • Start NEWS for v2.5.8 · c795724d
      Morris Jette authored
    • restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our Slurm jobs to successfully
      restart after a checkpoint.  For this test, I'm using sbatch and a
      simple, single-threaded executable.  Slurm is 2.5.4, BLCR is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from BLCR in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
  3. May 31, 2013
  4. May 30, 2013
  5. May 29, 2013
  6. May 24, 2013
  7. May 23, 2013
  8. May 22, 2013
  9. May 21, 2013
  10. May 18, 2013
  11. May 16, 2013
  12. May 14, 2013
  13. May 13, 2013
  14. May 11, 2013
    • Added MaxCPUsPerNode partition configuration parameter. · e33c5d57
      Morris Jette authored
      This can be especially useful for scheduling GPUs. For example, a node can be
      associated with two Slurm partitions (e.g. "cpu" and "gpu"), and the
      partition/queue "cpu" could be limited to only a subset of the node's CPUs,
      ensuring that one or more CPUs would be available to jobs in the "gpu"
      partition/queue.
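
      A minimal slurm.conf illustration of that setup (the node name, CPU count,
      and GPU count below are made up for the example): a 16-CPU node with GPUs
      is placed in both partitions, and the "cpu" partition is capped so that two
      CPUs always stay free for "gpu" jobs:

      NodeName=tux01 CPUs=16 Gres=gpu:2
      PartitionName=cpu Nodes=tux01 MaxCPUsPerNode=14
      PartitionName=gpu Nodes=tux01

      Jobs in the "cpu" partition can then never allocate all 16 CPUs of tux01, so
      at least two CPUs remain available to jobs submitted to "gpu".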
  15. May 10, 2013
  16. May 08, 2013
  17. May 02, 2013
  18. May 01, 2013
  19. Apr 30, 2013
    • Change maximum delay for state save from 2 secs to 5 secs. · 5a2a76ff
      Morris Jette authored
      Make timeout configurable at build time by defining SAVE_MAX_WAIT.
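
      A build-time override might look like the following (a sketch, assuming the
      usual autoconf-style build and that the in-source default can be overridden
      from the compiler command line; the value 10 is only an example):

      $ ./configure CFLAGS="-DSAVE_MAX_WAIT=10"
      $ make && make install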
    • added script to help manage native and symmetric MPI runs within SLURM · fdf56162
      Olli-Pekka Lehto authored
      Dear all,
      
      As a quick fix, I have put together this script to help manage native and symmetric MPI runs within SLURM. It's a bit bare-bones at the moment, but I needed to get it working quickly :)
      
      It does not provide tight integration between the scheduler and the MPI daemons, and it requires a slot on the host even when running fully on the MIC, so it's far from an optimal solution, but it could serve as a stopgap.
      
      It's inspired by the TACC Stampede documentation. They seem to have a similar script in place.
      
      It's fairly simple: you provide the name of the MIC binary (with -m) and the host binary (with -c). The host MPI/OpenMP parameters are given as usual, and the Xeon Phi side parameters are given as environment variables (MIC_PPN, MIC_OMP_NUM_THREADS). Currently it supports only one card per host, but extending it should be simple enough.
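
      A rough usage sketch (the wrapper's file name, the application binary names,
      and the launch line below are illustrative guesses rather than taken from the
      script itself; only -m, -c, MIC_PPN and MIC_OMP_NUM_THREADS come from the
      description above):

      $ export MIC_PPN=4                # MPI ranks per Xeon Phi card
      $ export MIC_OMP_NUM_THREADS=60   # OpenMP threads per rank on the MIC side
      $ sbatch -N 2 --ntasks-per-node=8 run-sym-mpi.sh -c app.host -m app.mic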
      
      Here are a couple of links to documentation:
      
      Our prototype cluster documentation:
      https://confluence.csc.fi/display/HPCproto/HPC+Prototypes#HPCPrototypes-XeonPhiDevelopment
      Presentation at the PRACE Spring School in Umeå earlier this week:
      https://www.hpc2n.umu.se/sites/default/files/1.03%20CSC%20Cluster%20Introduction.pdf
      
      Feel free to include this in the contribs directory. It might need a bit of cleanup, though, and I don't know when I'll have the time to do that.
      
      I have also added support for the TotalView debugger (provided it's installed and configured properly for Xeon Phi usage).
      
      Future ideas:
      
      For the native MIC client, I've been testing it out a bit and looking at ways to minimize the changes needed for support. The two major challenges seem to be in scheduling and affinity.
      
      I think it might be necessary to put it into a specific topology plugin, like the one for BG/Q, but it looks like a lot of work to do that.
      
      Best regards,
      Olli-Pekka
    • Accounting - make average by task not cpu. · 81ccec93
      Danny Auble authored
  20. Apr 29, 2013