The L2 cache hit rate of the A100 (A800) is very low compared to the RTX 3090

Hi,
I ran the same code on an RTX 3090 and an A100: an fp64 SpTRSV (sparse triangular solve) kernel. I found that it runs faster on the RTX 3090, so I analyzed it with Nsight Compute and found that the L2 cache hit rate on the A100 is very low compared to the 3090. Even if I compute only a very small matrix and launch a single block, it stays the same. What could be the reason for this?
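For reference, this is roughly how I collect the L2 hit rate and the device-memory read volume from a script. It is only a minimal sketch: the binary name ./sptrsv is a placeholder for my kernel executable, and the metric names are the L2 (LTS) and DRAM metrics exposed by recent ncu versions, which may differ in older releases.

```python
# Minimal sketch: run Nsight Compute (ncu) on a placeholder binary ./sptrsv
# and dump the L2 hit rate and DRAM read bytes for each kernel launch as CSV.
import subprocess

metrics = ",".join([
    "lts__t_sector_hit_rate.pct",  # L2 (LTS) sector hit rate
    "dram__bytes_read.sum",        # bytes read from device memory
])

result = subprocess.run(
    ["ncu", "--metrics", metrics, "--csv", "./sptrsv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```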


THANKS!

Possibly due to the A100 using tensor cores for the calculation. FP64 tensor cores are not available on the 3090.

Can you show the individual screenshots instead of the baseline calculation, please?

The (+inf%) for the amount read from device memory is difficult to understand.

The FP64 CUDA-core performance of the A100 is about 20 times higher than that of the 3090, but the Nsight Compute analysis shows that the difference in computation time is not significant, because SpTRSV is memory bound.
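For context, a rough roofline-style estimate with approximate datasheet numbers (assumed here, not taken from the profile) shows why the large FP64 gap barely matters for a memory-bound SpTRSV:

```python
# Rough roofline estimate with approximate datasheet numbers (assumptions):
# A100 ~9.7 TFLOP/s FP64 (CUDA cores), ~1555 GB/s HBM2;
# RTX 3090 ~0.556 TFLOP/s FP64, ~936 GB/s GDDR6X.
peak_fp64 = {"A100": 9.7e12, "RTX3090": 0.556e12}    # FLOP/s
peak_bw   = {"A100": 1.555e12, "RTX3090": 0.936e12}  # bytes/s

# SpTRSV does on the order of 2 FLOPs per nonzero while moving roughly
# 16 bytes per nonzero (fp64 value + index + vector traffic), so the
# arithmetic intensity is well below 1 FLOP/byte (rough assumption).
intensity = 2.0 / 16.0

for gpu in peak_fp64:
    ridge = peak_fp64[gpu] / peak_bw[gpu]  # FLOP/byte where the roofline bends
    bound = "memory" if intensity < ridge else "compute"
    print(f"{gpu}: ridge {ridge:.2f} FLOP/B -> {bound} bound at {intensity:.3f} FLOP/B")
```

With an intensity this low, both GPUs sit on the bandwidth-limited part of the roofline, so the runtime is set by memory traffic rather than FP64 throughput.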


That’s because the L2 cache of the A100 consists of two partitions, while the RTX 3090 only has one. In the figure, +inf refers to the amount of data transferred between the two L2 cache partitions of the A100.

Ah, thank you. The overall data read from device memory is nearly the same: 2.03 MB vs. 2.01 MB.

Could it be that data hits in the far L2 cache partition also contribute to the shown L2 cache misses?

So everything seems to be fine.