From this figure, one might infer that all of the data read from HBM passes through the interconnect between the L2 partitions, while none of the data written to HBM touches the interconnect.
I think the second guess is plausible: it could be a deliberate design choice to reduce interconnect traffic. But the first guess does not make sense to me. If the memory channels are uniformly distributed over the L2 partitions, then a read sector should only cross the fabric when the requesting SM sits on the opposite partition from the one its memory channel is attached to, which should happen about half the time, so only about half of the data read from HBM should pass through the interconnect. That contradicts the numbers in the figure. Is this an ncu bug, or can anyone explain why the interconnect traffic looks like this?
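To make that arithmetic concrete, here is a minimal, hypothetical repro (a sketch, not the kernel behind my original figure): a plain grid-stride copy whose only DRAM traffic is one streaming read and one streaming write. Under the uniform-distribution assumption above, each read sector would be homed on the remote partition roughly half the time, so only about half of the HBM read volume should show up on the fabric link. The section name in the profiling comment may differ across ncu versions; `ncu --list-sections` shows the exact identifiers.

```cpp
// streaming_copy.cu -- minimal sketch to reproduce the memory chart measurement.
// Its only DRAM traffic is the streaming read of `in` and the streaming write of `out`.
//
// Profile the chart with something like (section name assumed, check --list-sections):
//   ncu --section MemoryWorkloadAnalysis_Chart ./streaming_copy
#include <cuda_runtime.h>

__global__ void copyKernel(const float* __restrict__ in, float* __restrict__ out, size_t n) {
    // Grid-stride loop: coalesced, streaming accesses with no reuse.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        out[i] = in[i];
    }
}

int main() {
    const size_t n = 1ull << 28;            // 1 GiB per buffer, far larger than L2
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));   // touch the input so reads are backed by HBM

    copyKernel<<<1024, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```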
The chart is not meant to indicate that “all the data read from HBM will pass through the interconnect between the L2 partitions”. The chart shows that there is an interconnect (the L2 fabric) between the L2 partitions, and that data is transferred between these partitions through this fabric. By itself, it does not indicate what causes these transfers. That being said, if transfers between the two partitions are needed, it means that data was accessed from an SM that was not local to the cache partition the data resided on.
Are there any techniques that can be applied to decrease L2 fabric traffic, or any pitfalls related to L2 fabric traffic that programmers must pay attention to?
Can programmers assume that CTAs always get good L2 bandwidth, regardless of whether the accessed L2 slice is local or not? I ask because there is almost no CUDA guidance on determining whether a kernel's performance bottleneck is the L2 fabric.