Higher L2 cache hit rate but larger device memory transfer size

Hello everyone,

I’ve been running the same kernel on both an RTX 3090 and a V100 GPU, and I’m trying to understand the performance difference between the two architectures (the V100 achieved lower performance than the 3090).

The kernel is latency-bound on both platforms and shows similar occupancy and static scheduling statistics, so I suspect the memory hierarchy is the most likely source of the performance gap (though I’m not sure). I’ve confirmed that warps on the 3090 experience more long-scoreboard stalls, and that the long scoreboard dominates the warp states on both platforms. So I’m now trying to analyze the cache profiling results.

I’m puzzled that the V100 shows a higher L2 cache hit rate (15% vs. 4.7%) yet also requests more sectors from device memory (352,577,358 vs. 328,928,192). This seems counterintuitive. The 3090 also sends more packets from L1 to the SMs.

Both the 3090 and V100 have identical L1 (128 KB per SM) and L2 (6144 KB) cache sizes.

Additionally, I’ve noticed that ncu reported negative values for the “L1/TEX Reduction” metric. I’m wondering whether this could be a bug in ncu itself.

Keep in mind that what runs on the GPU hardware is not the identical kernel when viewed at the machine-language level, since the V100 and RTX 3090 belong to different architecture families. During the translation from source code to SASS (machine language), architecture-specific optimizations and code-generation details can affect instruction selection, instruction scheduling, register allocation, etc. In the best case, the code executed by the hardware on these two GPUs is very similar.
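One way to see this concretely is to compare the generated SASS for the two architectures (a sketch; `kernel.cu` is a placeholder for your source file):

```shell
# Compile the same source for Volta (sm_70) and Ampere (sm_86),
# then dump the SASS for each to compare instruction selection,
# scheduling, and register allocation.
nvcc -arch=sm_70 -cubin -o kernel_sm70.cubin kernel.cu
nvcc -arch=sm_86 -cubin -o kernel_sm86.cubin kernel.cu
cuobjdump -sass kernel_sm70.cubin > kernel_sm70.sass
cuobjdump -sass kernel_sm86.cubin > kernel_sm86.sass
diff kernel_sm70.sass kernel_sm86.sass | head -40
```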

Do these caches have the same: Set associativity? Line length? Sectoring? Replacement policy? Write policy? The behavior of a cache is not solely a function of its size. In addition, behavioral differences in the L1 cache can trigger behavioral differences in the L2 cache by altering the stream of requests sent to L2.
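As a toy illustration of why sectoring matters (a sketch, not a model of either GPU: assume 128-byte lines split into four 32-byte sectors, with only the touched sectors actually fetched):

```python
LINE_BYTES = 128
SECTOR_BYTES = 32

def sectors_touched(byte_addresses):
    """Count unique (line, sector) pairs that a stream of byte accesses touches."""
    return len({(a // LINE_BYTES, (a % LINE_BYTES) // SECTOR_BYTES)
                for a in byte_addresses})

# A warp of 32 threads, each reading one 4-byte word.
coalesced = [i * 4 for i in range(32)]    # contiguous 128 bytes
strided   = [i * 128 for i in range(32)]  # one word per cache line

print(sectors_touched(coalesced))  # 4 sectors (one full line)
print(sectors_touched(strided))    # 32 sectors (one per line)
```

The same 128 bytes of useful data costs 4 fetched sectors in one case and 32 in the other, so two caches that differ in sectoring (or line length) can report very different sector counts for the same source-level access pattern.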