Hopper L2 partition data copy error?

I am running a depthwise convolution test on a Hopper architecture GPU. However, the data transferred between the L2 partitions does not look right:


From this figure, one can infer that all the data read from HBM will pass through the interconnect between L2 partitions, while all the data written to HBM does not touch the interconnect.

I think the second inference (writes to HBM bypass the interconnect) is plausible; it looks like a design choice to reduce interconnect traffic. The first one does not make sense to me, though. If the memory channels are uniformly distributed over the L2 partitions, then only about half of the data read from HBM should cross the interconnect, but that contradicts the numbers in the figure. Is this an ncu bug, or can anyone explain why the interconnect traffic looks like this?
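The "about half" expectation above can be written out as a small calculation. This is only a sketch of that reasoning, assuming addresses are spread uniformly over the L2 partitions and each partition serves an equal share of the SMs. The function name and the two-partition default are illustrative assumptions, not documented hardware behavior.

```python
from fractions import Fraction


def expected_remote_fraction(num_partitions: int = 2) -> Fraction:
    """Expected share of L2 accesses whose data lives in a non-local partition.

    Assumes a uniform address-to-partition distribution (an assumption for
    illustration; the real mapping is not specified in this thread).
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # A request lands on the requester's own partition with probability 1/P,
    # so it has to cross the fabric with probability (P - 1)/P.
    return Fraction(num_partitions - 1, num_partitions)


print(expected_remote_fraction(2))  # 1/2
```

Under these assumptions, fabric traffic well above half of the HBM read traffic is what makes the figure look surprising.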

The chart is not meant to indicate that “all the data read from HBM will pass through the interconnect between the L2 partitions”. The chart shows that there is an interconnect (the L2 fabric) between the L2 partitions, and that data is transferred between these partitions through this fabric. By itself, it does not indicate what causes this transfer. That being said, if transfers between the two partitions are needed, it means that data was accessed from an SM that wasn’t local to the cache partition this data resided on.
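As a toy illustration of that last point, the sketch below counts an access as fabric traffic only when the requesting SM and the cache line belong to different partitions. The modulo mappings for SMs and lines are hypothetical placeholders chosen for the example; the actual hardware mapping is not described in this thread.

```python
NUM_PARTITIONS = 2  # assumed two L2 partitions, as in the chart


def sm_partition(sm_id: int) -> int:
    """Hypothetical mapping of an SM to its local L2 partition."""
    return sm_id % NUM_PARTITIONS


def line_partition(line_id: int) -> int:
    """Hypothetical mapping of a cache line to the partition that holds it."""
    return line_id % NUM_PARTITIONS


def count_fabric_crossings(accesses):
    """Return (local, remote) counts for an iterable of (sm_id, line_id) pairs."""
    local = remote = 0
    for sm_id, line_id in accesses:
        if sm_partition(sm_id) == line_partition(line_id):
            local += 1   # served by the SM's own partition
        else:
            remote += 1  # data has to cross the L2 fabric
    return local, remote


# Every one of 4 SMs reads each of 8 cache lines once.
accesses = [(sm, line) for sm in range(4) for line in range(8)]
print(count_fabric_crossings(accesses))  # (16, 16)
```

In this model the fabric traffic depends only on which SM touches which line, not on whether the line originally came from HBM, which is the distinction the reply draws.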

You may also refer to “Meanings of L2 --> L2 copy - #2 by felix_dt” and “Kernel Profiling Guide :: Nsight Compute Documentation”.

Are there any techniques that can be applied to decrease L2 fabric traffic? Or any pitfalls regarding L2 fabric traffic that programmers should watch out for?


You may already be aware of it, but there is a discussion of L2 partitioning in the “Dissecting Ampere” GTC presentation.

Whether it is directly applicable to Hopper, I don’t know.

Your link is very interesting! But how do they control which SM accesses which L2 partition? Thanks!

Can programmers assume that CTAs always get good L2 bandwidth, regardless of whether the accessed L2 slice is local or not? I ask because there is almost no CUDA guidance on how to tell whether a kernel's performance bottleneck is the L2 fabric.