  • select/cray: fix error in 'is_gemini' logic · 6c927b3f
    Moe Jette authored
    The is_gemini logic is too simple: as just observed on a SeaStar system, it can
    be fooled into the wrong result if more than 1 row has NULL coordinates. 
    
    This case happens if a blade has been powered down completely, so that the SeaStar
    network chip is also powered off. The routing system recognizes this case and 
    routes around the powered-down node in the torus. It is plausible that in such a
    case the torus coordinates are NULL, since the node(s) are no longer part of the
    torus. 
    
    (It is also possible to set all nodes on a blade down but leave the power switched
     on. The SeaStar chip, which is independent of the motherboard, then continues to
     provide routing connectivity, so the torus coordinates are all non-NULL. The
     node can do no computing, and its ALPS state is "ROUTING".)
    
    Here is the example that revealed this behaviour: one blade, nodes 804-807,
    had been powered down after a system failure.
    
    mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor;
    +----------+-----------------------------------------+
    | COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
    +----------+-----------------------------------------+
    |     1882 |                                    1878 | 
    +----------+-----------------------------------------+
    
    ==> There are 4 more node IDs than there are distinct coordinates.
    
    mysql> select processor_id,x_coord,y_coord,z_coord from processor\
           WHERE x_coord IS NULL OR y_coord IS NULL OR z_coord IS NULL;
    +--------------+---------+---------+---------+
    | processor_id | x_coord | y_coord | z_coord |
    +--------------+---------+---------+---------+
    |          804 |    NULL |    NULL |    NULL | 
    |          805 |    NULL |    NULL |    NULL | 
    |          806 |    NULL |    NULL |    NULL | 
    |          807 |    NULL |    NULL |    NULL | 
    +--------------+---------+---------+---------+
    
    ==> The corrected query excludes the NULL rows and gives the correct result (equality):
    mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor\
           WHERE x_coord IS NOT NULL AND y_coord IS NOT NULL AND z_coord IS NOT NULL;
    +----------+-----------------------------------------+
    | COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
    +----------+-----------------------------------------+
    |     1878 |                                    1878 | 
    +----------+-----------------------------------------+
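
The effect of the extra WHERE clause can be illustrated outside the database. The following is a minimal Python sketch, not the code from this commit: None stands in for SQL NULL, the function names and the sample rows are hypothetical, and it assumes that the check treats "fewer distinct coordinates than rows" as the sign of a Gemini system (the message above only states that equality is the correct result on SeaStar).

```python
from typing import List, Optional, Tuple

# (processor_id, x_coord, y_coord, z_coord); None stands in for SQL NULL.
Row = Tuple[int, Optional[int], Optional[int], Optional[int]]


def count_all_and_distinct(rows: List[Row], skip_null_coords: bool) -> Tuple[int, int]:
    """Mimic COUNT(*) and COUNT(DISTINCT x_coord,y_coord,z_coord).

    MySQL's multi-column COUNT(DISTINCT ...) ignores rows in which any of
    the listed columns is NULL, so the distinct count always skips them.
    COUNT(*) skips them only when the WHERE ... IS NOT NULL filter is applied.
    """
    if skip_null_coords:
        rows = [r for r in rows if None not in r[1:]]
    total = len(rows)
    distinct = len({r[1:] for r in rows if None not in r[1:]})
    return total, distinct


def looks_like_gemini(rows: List[Row], skip_null_coords: bool = True) -> bool:
    """Hypothetical check: assume nodes share coordinates only on Gemini."""
    total, distinct = count_all_and_distinct(rows, skip_null_coords)
    return distinct < total


# Made-up SeaStar-like data: 16 nodes with unique coordinates, plus one
# powered-down blade (nodes 804-807) whose coordinates are NULL.
seastar_rows: List[Row] = [(i, i % 4, i // 4, 0) for i in range(16)]
seastar_rows += [(n, None, None, None) for n in range(804, 808)]

unfiltered = count_all_and_distinct(seastar_rows, skip_null_coords=False)
filtered = count_all_and_distinct(seastar_rows, skip_null_coords=True)
```

With this sample data the unfiltered counts differ by the four powered-down nodes, so the unfiltered check would misclassify the system, while the filtered counts are equal.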