I have written an algorithm and tested it on an A100 and a V100, but I do not understand why it behaves differently on the two GPUs. I have a square matrix that I split into 20 rectangular blocks. In each step, I work only on the elements below the diagonal, so as the algorithm progresses, the number of elements I am working on decreases.
When I compare the time to compute each individual block against the baseline, it is faster on both machines. For the whole algorithm applied to the square matrix, however, the V100 is faster for small matrices and the A100 is faster for large matrices.
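
To make the shrinking workload concrete, here is a rough sketch that counts the below-diagonal elements per block. It assumes the 20 blocks are equal-width column panels, which may not match the real layout; the function name `below_diagonal_counts` and the size 1000 are only illustrative and are not taken from the actual kernel.

```python
def below_diagonal_counts(n, num_blocks=20):
    """Count strictly-below-diagonal elements in each column panel of an n x n matrix.

    Assumption: the matrix is split into `num_blocks` column panels of
    (nearly) equal width. This is a guess at the layout, not the real kernel.
    """
    counts = []
    for b in range(num_blocks):
        # Column range [start, stop) covered by panel b.
        start = b * n // num_blocks
        stop = (b + 1) * n // num_blocks
        # Column j has n - 1 - j elements strictly below the diagonal.
        counts.append(sum(n - 1 - j for j in range(start, stop)))
    return counts


counts = below_diagonal_counts(1000)
print(counts[0], counts[-1])
```

Under these assumptions the first panel holds about 40 times as many below-diagonal elements as the last one, so the later steps give the GPU much less work per launch than the early ones.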

Why do the two GPUs behave differently?

Why is the speedup for the individual blocks different from the overall speedup for the original square matrix? (The square matrix is just the combination of the rectangular blocks.)

Is there anything specific about the A100's hardware architecture that would explain this?