    select/cray: increase robustness of initialisation code · dc8d97eb
    Morris Jette authored
    This improves the initial configuration code:
     a) Better handling of DownNodes lines
        The previous basil_geometry() would set the node Reason field on failure,
        irrespective of whether that node had already been marked using a
        DownNodes line.
    
     b) Check all cases of nodes being invisible to ALPS
        Up until now, basil_geometry() had to be fixed each time a new source of
        discrepancy between ALPS and SDB state was discovered (the most recent
        case was NULL coordinates when taking out a blade). Depending on ALPS
        interface changes, there may be other possibilities. Instead of fixing the
        SLURM code for each new case, it is better to check whether SLURM and ALPS
        agree. The price is a small delay at SLURM initialisation time (since each
        node is first looked up in the ALPS inventory), but it pays off well, as it
        eases system administration by pointing to the source of error.
        Any node that has suddenly disappeared from the ALPS horizon will now show
        up in the logs, and will also be marked down in sinfo.
    
     c) At initialisation time, print a summary of how many ALPS nodes are online.
    
     d) Turn ALPS-node-invisibility error into warning message, since such nodes may
        already have been covered in a DownNodes statement.
    
    By merging basil_get_initial_state() into basil_geometry(), the previously separate
    knowledge about system state (database state, ALPS inventory) is combined, making
    it easier to identify sources of failure.
    Patch from Gerrit Renker, CSCS.
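
The combined check described in points a) to d) can be sketched roughly as follows. This is an illustration only and not the code from the patch: struct sketch_node, alps_has_node(), reconcile_nodes() and the message texts are hypothetical stand-ins, and the ALPS inventory is reduced to a plain array of node IDs. The sketch shows each node being looked up in the inventory, a warning (rather than an error) for invisible nodes, preservation of a Reason that was already set (for example by a DownNodes line), and a summary count at the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in; not the real SLURM node record. */
struct sketch_node {
    int         nid;     /* node ID as recorded in the database (SDB) */
    bool        down;    /* true once the node is marked down */
    const char *reason;  /* NULL unless already set, e.g. by a DownNodes line */
};

/* Return true if nid appears in the (hypothetical) ALPS inventory array. */
bool alps_has_node(const int *inventory, size_t n_inv, int nid)
{
    for (size_t i = 0; i < n_inv; i++)
        if (inventory[i] == nid)
            return true;
    return false;
}

/*
 * Cross-check every configured node against the ALPS inventory.
 * Nodes that ALPS does not report are marked down; an existing reason
 * is kept, so a DownNodes annotation is not overwritten.
 * Returns the number of configured nodes that ALPS reports.
 */
size_t reconcile_nodes(struct sketch_node *nodes, size_t n_nodes,
                       const int *inventory, size_t n_inv)
{
    size_t visible = 0;

    assert(nodes != NULL || n_nodes == 0);

    for (size_t i = 0; i < n_nodes; i++) {
        if (alps_has_node(inventory, n_inv, nodes[i].nid)) {
            visible++;
            continue;
        }
        /* Warning only: the node may already be covered by DownNodes. */
        fprintf(stderr, "warning: nid%05d is not visible to ALPS\n",
                nodes[i].nid);
        nodes[i].down = true;
        if (nodes[i].reason == NULL)
            nodes[i].reason = "node not visible to ALPS";
    }

    /* One-line summary at initialisation time. */
    fprintf(stderr, "info: ALPS reports %zu of %zu configured nodes\n",
            visible, n_nodes);
    return visible;
}
```

Because the lookup runs for every node, the cost grows with the number of nodes times the size of the inventory in this naive form; that is the small initialisation delay mentioned in point b).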