I’ve been running the same kernel on a 3090 and a V100, and I’m trying to understand the performance difference between the two architectures (the V100 performed worse than the 3090).
The kernel is latency-bound on both platforms and shows similar occupancy and static scheduling statistics, so I’m currently looking at the memory hierarchy as the most likely source of the variation (though I’m not sure). I’ve confirmed that Long Scoreboard dominates the warp states on both platforms, and that warps on the 3090 spend more cycles stalled on it. So I’m now trying to analyze the cache profiling results.
I’m puzzled that the V100 shows a higher L2 hit rate (15% vs. 4.7%) yet also more sectors requested from device memory (352,577,358 vs. 328,928,192). This seems contradictory. The 3090 also shows more packets sent from L1 to the SMs.
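If I’m reading the metrics right, the two numbers may not actually be contradictory: sectors that miss in L2 and go to device memory should be roughly total L2 sector requests × (1 − hit rate), so a higher hit rate can coexist with more device traffic if the total request count is larger. A quick back-of-the-envelope check (the hit rates and device-sector counts are the ones from my profile; I’m assuming a 32-byte sector on both architectures):

```python
# Infer total L2 sector requests from the device-bound sector count
# and the L2 hit rate reported by ncu:
#   sectors_to_device ≈ total_requests * (1 - hit_rate)
SECTOR_BYTES = 32  # assumed: one sector = 32 bytes on both GPUs

gpus = {
    "V100":     {"hit_rate": 0.15,  "sectors_to_device": 352_577_358},
    "RTX 3090": {"hit_rate": 0.047, "sectors_to_device": 328_928_192},
}

for name, m in gpus.items():
    total_requests = m["sectors_to_device"] / (1 - m["hit_rate"])
    dram_gb = m["sectors_to_device"] * SECTOR_BYTES / 1e9
    print(f"{name}: ~{total_requests / 1e6:.0f}M total L2 sector requests, "
          f"~{dram_gb:.1f} GB read from device memory")
```

Under that reading, the V100 would be issuing roughly 415M L2 sector requests versus roughly 345M on the 3090, i.e. about 20% more total L2 traffic, so its better hit rate still leaves more misses in absolute terms. That would shift the question from "why is the hit rate behaving strangely" to "why does the V100 generate more L2 requests (more L1 misses?) for the same kernel" — but I’d appreciate a sanity check on this interpretation.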
Both the 3090 and V100 have identical L1 (128 KB per SM) and L2 (6144 KB) cache sizes.
Additionally, I’ve noticed that ncu reports negative values for the “L1/TEX Reduction” metric. Could this be a bug in ncu itself?