I’ve been running the same kernel on a 3090 and a V100, and I’m trying to understand the performance difference between the two architectures (the V100 performed worse than the 3090).
The kernel is latency-bound on both platforms and shows similar occupancy and static scheduling statistics, so I’m currently looking at the memory hierarchy as the most likely source of the variation (though I’m not sure). I’ve confirmed that Long Scoreboard dominates the warp states on both platforms, and that warps on the 3090 spend more cycles stalled on it. So I’m now trying to analyze the cache profiling results.
I’m puzzled that the V100 shows a higher L2 hit rate (15% vs. 4.7%) yet also more sectors requested from device memory (352,577,358 vs. 328,928,192). This seems contradictory. The 3090 also shows more packets sent from L1 to the SMs.
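If I’m reading the metrics right, the two numbers may not actually be contradictory: sectors that miss in L2 and go to device memory should be roughly total L2 sector requests × (1 − hit rate), so a higher hit rate can coexist with more device traffic if the total request count is larger. A quick back-of-the-envelope check (the hit rates and device-sector counts are the ones from my profile; I’m assuming a 32-byte sector on both architectures):

```python
# Infer total L2 sector requests from the device-bound sector count
# and the L2 hit rate reported by ncu:
#   sectors_to_device ≈ total_requests * (1 - hit_rate)
SECTOR_BYTES = 32  # assumed: one sector = 32 bytes on both GPUs

gpus = {
    "V100":     {"hit_rate": 0.15,  "sectors_to_device": 352_577_358},
    "RTX 3090": {"hit_rate": 0.047, "sectors_to_device": 328_928_192},
}

for name, m in gpus.items():
    total_requests = m["sectors_to_device"] / (1 - m["hit_rate"])
    dram_gb = m["sectors_to_device"] * SECTOR_BYTES / 1e9
    print(f"{name}: ~{total_requests / 1e6:.0f}M total L2 sector requests, "
          f"~{dram_gb:.1f} GB read from device memory")
```

Under that reading, the V100 would be issuing roughly 415M L2 sector requests versus roughly 345M on the 3090, i.e. about 20% more total L2 traffic, so its better hit rate still leaves more misses in absolute terms. That would shift the question from "why is the hit rate behaving strangely" to "why does the V100 generate more L2 requests (more L1 misses?) for the same kernel" — but I’d appreciate a sanity check on this interpretation.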
Both the 3090 and V100 have identical L1 (128 KB per SM) and L2 (6144 KB) cache sizes.
Additionally, I’ve noticed that ncu reports negative values for the “L1/TEX Reduction” metric. Could this be a bug in ncu itself?