I wrote a kernel to test L2 cache bandwidth on an H100. When I profile the kernel with ncu, I get output like this:
I notice that the L1/TEX cache sector misses to L2 (67,108,864) and the L1/TEX load sectors in the L2 cache (59,359,552) do not match.
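To put a number on the gap, here is a quick calculation from the two counters above. It is only a sketch: the variable names are mine, and it assumes the usual 32-byte sector size.

```python
# Assumption: one sector is 32 bytes (the usual NVIDIA sector size).
SECTOR_BYTES = 32

# Counter values copied from the ncu output quoted above.
l1_misses_to_l2 = 67_108_864          # L1/TEX sector misses to L2
l2_load_sectors_from_l1 = 59_359_552  # L1/TEX load sectors seen in L2

# Absolute gap between the two counters, in sectors and in bytes.
gap_sectors = l1_misses_to_l2 - l2_load_sectors_from_l1
gap_bytes = gap_sectors * SECTOR_BYTES

# Fraction of the L1-side misses that show up as L2-side load sectors.
ratio = l2_load_sectors_from_l1 / l1_misses_to_l2

print(f"gap: {gap_sectors:,} sectors ({gap_bytes / 2**20:.1f} MiB)")
print(f"L2-side / L1-side: {ratio:.2%}")
```

So the L2 side reports roughly 88% of what the L1 side reports, and the L1-side count is exactly 2^26 sectors (2 GiB at 32 bytes per sector).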
The kernel does not use TMA multicast, so I would expect the two sector counts to be equal. Am I missing something?
Thanks a lot.