Hi,
I ran the same code, an FP64 SpTRSV (sparse triangular solve) kernel, on an RTX 3090 and an A100. It runs faster on the RTX 3090, so I profiled it with Nsight Compute and found that the L2 cache hit rate on the A100 is much lower than on the 3090. This holds even when I solve a very small matrix and launch only a single block. What could be the reason?
The FP64 throughput of the A100's CUDA cores is about 20 times that of the RTX 3090, but the Nsight Compute analysis shows that the difference in computation time is not significant, because SpTRSV is memory-bound.
That's because the L2 cache of the A100 is split into two partitions, while the RTX 3090 has only one. In the figure, +inf refers to the amount of data transferred between the two L2 cache partitions of the A100.
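On the point that SpTRSV is memory-bound: a rough roofline estimate shows why the large FP64 advantage of the A100 does not carry over. This is only a back-of-the-envelope sketch. The peak numbers are approximate published specs that I am assuming (about 9.7 TFLOP/s FP64 and 1555 GB/s for the 40 GB A100, about 0.556 TFLOP/s FP64 and 936 GB/s for the RTX 3090), and the arithmetic intensity assumes a CSR matrix with 8-byte values and 4-byte column indices, ignoring traffic for the solution vector. The function and variable names are mine, not from any tool.

```python
# Rough roofline estimate for an FP64 CSR SpTRSV kernel.
# All hardware numbers are approximate published peak specs (assumptions),
# not measurements from the kernel discussed in this thread.


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(compute peak, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Assumed cost per nonzero in CSR: 8-byte value + 4-byte column index,
# one multiply and one subtract, so 2 FLOP per 12 bytes.
INTENSITY = 2 / 12

# Hypothetical spec table; adjust to the exact board you have.
GPUS = {
    "A100 (40 GB)": {"peak_gflops": 9700.0, "bandwidth_gbs": 1555.0},
    "RTX 3090": {"peak_gflops": 556.0, "bandwidth_gbs": 936.0},
}


def estimate(gpus=GPUS, intensity=INTENSITY):
    """Return the roofline-attainable GFLOP/s for each GPU in the table."""
    return {
        name: attainable_gflops(s["peak_gflops"], s["bandwidth_gbs"], intensity)
        for name, s in gpus.items()
    }


if __name__ == "__main__":
    for name, gflops in estimate().items():
        print(f"{name}: roughly {gflops:.0f} GFLOP/s attainable")
```

Under these assumptions both GPUS sit on the bandwidth side of the roofline, so the best-case gap is set by the memory bandwidth ratio (under 2x), not by the FP64 ratio. The estimate also ignores the level-by-level dependencies in SpTRSV, which tend to make the kernel latency-bound and can shrink or reverse that gap, especially for a small matrix and a single block.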