    6a103f2e
    Correct --mem-per-cpu logic for multiple threads per core
    Morris Jette authored
    See bugzilla bug 132
    
    When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be
    overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD
    (2.5.0-pre3).
    
    Conditions:
    -----------
    * SelectType=select/cons_res
    * SelectTypeParameters=CR_Core_Memory
    * Using threads
      - Ex. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=400"
    
    Description:
    ------------
    In the cons_res plugin, _verify_node_state() in job_test.c checks whether a
    node has sufficient memory for a job. However, the per-CPU memory limit
    appears to be scaled by the number of threads per core, and the scaled value
    may exceed the available memory on the node. Once a node is overcommitted on
    memory, every subsequent memory check in _verify_node_state() will succeed.
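
    A minimal sketch of the scaling effect (in C; the function and variable
    names are hypothetical, not taken from the cons_res code):

        #include <stdio.h>

        /* Hypothetical illustration: per-CPU memory is charged for every
         * hardware thread once the CPU count is scaled, so a 250 MB/core
         * request on a ThreadsPerCore=2 node becomes a 500 MB allocation
         * on a 400 MB node. */
        static unsigned int scaled_job_mem(unsigned int mem_per_cpu,
                                           unsigned int cores,
                                           unsigned int threads_per_core)
        {
                /* "slurmd: scaling CPU count by factor of 2" */
                unsigned int cpus = cores * threads_per_core;

                return mem_per_cpu * cpus;
        }

        int main(void)
        {
                /* One core requested with --mem-per-cpu=250 on linux0 */
                printf("allocated: %u MB of 400 MB\n", scaled_job_mem(250, 1, 2));
                return 0;
        }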
    
    Scenario to reproduce:
    ----------------------
    With the example node linux0, we run a single-core job with 250 MB per core:
        srun --mem-per-cpu=250 sleep 60
    
    cons_res checks that it will fit: ((real - alloc) >= job mem)
        ((400 - 0) >= 250) and the job starts
    
    Then, the memory requirement is doubled:
        "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for
    job X"
        "slurmd: scaling CPU count by factor of 2"
    
    This job should not have started
    
    While the first job is still running, we submit a second, identical job:
        srun --mem-per-cpu=250 sleep 60
    
    cons_res checks that it will fit:
        ((400 - 500) >= 250), the unsigned int wraps, the test passes, and the job
    starts
    
    This second job also should not have started
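
    For illustration, a standalone C snippet reproducing the unsigned
    wraparound in the ((real - alloc) >= job mem) test described above
    (variable names are illustrative, not from job_test.c):

        #include <stdio.h>

        int main(void)
        {
                unsigned int real_mem  = 400;   /* RealMemory on linux0 (MB) */
                unsigned int alloc_mem = 500;   /* first job's scaled allocation (MB) */
                unsigned int job_mem   = 250;   /* second job's request (MB) */

                /* 400 - 500 wraps to 4294967196 in unsigned arithmetic, so the
                 * "sufficient memory" test passes on an overcommitted node. */
                if (real_mem - alloc_mem >= job_mem)
                        printf("check passes: %u >= %u\n",
                               real_mem - alloc_mem, job_mem);
                return 0;
        }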