From this figure, one might infer that all of the data read from HBM passes through the interconnect between the L2 partitions, while none of the data written to HBM touches the interconnect.
I think the second guess is plausible: it could be a deliberate design choice to reduce interconnect traffic. But the first guess does not make sense to me. If the memory channels are uniformly distributed over the L2 partitions, then a read sector should only cross the fabric when the requesting SM sits on the opposite partition from the one its memory channel is attached to, which should happen about half the time, so only about half of the data read from HBM should pass through the interconnect. That contradicts the numbers in the figure. Is this an ncu bug, or can anyone explain why the interconnect traffic looks like this?
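To make that arithmetic concrete, here is a minimal, hypothetical repro (a sketch, not the kernel behind my original figure): a plain grid-stride copy whose only DRAM traffic is one streaming read and one streaming write. Under the uniform-distribution assumption above, each read sector would be homed on the remote partition roughly half the time, so only about half of the HBM read volume should show up on the fabric link. The section name in the profiling comment may differ across ncu versions; `ncu --list-sections` shows the exact identifiers.

```cpp
// streaming_copy.cu -- minimal sketch to reproduce the memory chart measurement.
// Its only DRAM traffic is the streaming read of `in` and the streaming write of `out`.
//
// Profile the chart with something like (section name assumed, check --list-sections):
//   ncu --section MemoryWorkloadAnalysis_Chart ./streaming_copy
#include <cuda_runtime.h>

__global__ void copyKernel(const float* __restrict__ in, float* __restrict__ out, size_t n) {
    // Grid-stride loop: coalesced, streaming accesses with no reuse.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        out[i] = in[i];
    }
}

int main() {
    const size_t n = 1ull << 28;            // 1 GiB per buffer, far larger than L2
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));   // touch the input so reads are backed by HBM

    copyKernel<<<1024, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```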
The chart is not meant to indicate that “all the data read from HBM will pass through the interconnect between the L2 partitions”. The chart shows that there is an interconnect (the L2 fabric) between the L2 partitions, and that data is transferred between these partitions through this fabric. By itself, it does not indicate what causes these transfers. That being said, if transfers between the two partitions are needed, it means that data was accessed from an SM that was not local to the cache partition the data resided on.
Are there any techniques that can be applied to decrease L2 fabric traffic, or any pitfalls related to L2 fabric traffic that programmers must pay attention to?
Can programmers assume that CTAs always get good L2 bandwidth, regardless of whether the accessed L2 slice is local or not? I ask because there is almost no CUDA guidance on determining whether a kernel's performance bottleneck is the L2 fabric.