The L2 cache hit rate of the A100 (A800) is very low compared to the RTX 3090

Hi,
I ran the same code on an RTX 3090 and an A100: an fp64 SpTRSV (sparse triangular solve) kernel. I found that it runs faster on the RTX 3090, so I analyzed it with Nsight Compute and found that the L2 cache hit rate on the A100 is very low compared to the 3090. Even if I compute only a very small matrix and launch a single block, it stays the same. What could be the reason for this?
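For reference, this is roughly how I collect the L2 hit rate and the device-memory read volume from a script. It is only a minimal sketch: the binary name ./sptrsv is a placeholder for my kernel executable, and the metric names are the L2 (LTS) and DRAM metrics exposed by recent ncu versions, which may differ in older releases.

```python
# Minimal sketch: run Nsight Compute (ncu) on a placeholder binary ./sptrsv
# and dump the L2 hit rate and DRAM read bytes for each kernel launch as CSV.
import subprocess

metrics = ",".join([
    "lts__t_sector_hit_rate.pct",  # L2 (LTS) sector hit rate
    "dram__bytes_read.sum",        # bytes read from device memory
])

result = subprocess.run(
    ["ncu", "--metrics", metrics, "--csv", "./sptrsv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```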


THANKS!

Possibly due to the A100 using tensor cores for the calculation. FP64 tensor cores are not available on the 3090.

Can you show the individual screenshots instead of the baseline calculation, please?

The (+inf%) for the amount read from device memory is difficult to understand.

The FP64 CUDA-core performance of the A100 is about 20 times higher than that of the 3090, but the Nsight Compute analysis shows that the difference in computation time is not significant, because SpTRSV is memory bound.
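For context, a rough roofline-style estimate with approximate datasheet numbers (assumed here, not taken from the profile) shows why the large FP64 gap barely matters for a memory-bound SpTRSV:

```python
# Rough roofline estimate with approximate datasheet numbers (assumptions):
# A100 ~9.7 TFLOP/s FP64 (CUDA cores), ~1555 GB/s HBM2;
# RTX 3090 ~0.556 TFLOP/s FP64, ~936 GB/s GDDR6X.
peak_fp64 = {"A100": 9.7e12, "RTX3090": 0.556e12}    # FLOP/s
peak_bw   = {"A100": 1.555e12, "RTX3090": 0.936e12}  # bytes/s

# SpTRSV does on the order of 2 FLOPs per nonzero while moving roughly
# 16 bytes per nonzero (fp64 value + index + vector traffic), so the
# arithmetic intensity is well below 1 FLOP/byte (rough assumption).
intensity = 2.0 / 16.0

for gpu in peak_fp64:
    ridge = peak_fp64[gpu] / peak_bw[gpu]  # FLOP/byte where the roofline bends
    bound = "memory" if intensity < ridge else "compute"
    print(f"{gpu}: ridge {ridge:.2f} FLOP/B -> {bound} bound at {intensity:.3f} FLOP/B")
```

With an intensity this low, both GPUs sit on the bandwidth-limited part of the roofline, so the runtime is set by memory traffic rather than FP64 throughput.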


That’s because the L2 cache of the A100 consists of two partitions, while the RTX 3090 only has one. In the figure, +inf refers to the amount of data transferred between the two L2 cache partitions of the A100.

Ah, thank you. The overall data read from device memory is nearly the same: 2.03 MB vs. 2.01 MB.

Could it be that data hits in the far L2 cache partition also contribute to the shown L2 cache misses?

So everything seems to be fine.