Shared Memory Bypass not matching dram__bytes.sum

I was profiling some code with Nsight Compute and noticed this Memory Chart in one of my kernels.

The shared memory bypass metric notes that 134 MB came from global memory space. In that case why doesn’t dram__bytes.sum include the 134 MB?

If it is not coming from DRAM then where is this data coming from?


I’d have thought it was coming from the L2 cache and should be reflected in those metrics.

It uses the L2 cache as an intermediary as far as I can tell, but still originates from global memory.

Also, the L2 capacity is only 40 MB and incoming data to L2 only sums to ~18.9 MB, so even assuming the cache was completely filled before execution, that still leaves ~75 MB unaccounted for.

A minimal reproducible example with the full NCU command line has not been provided, so it is hard to give a complete answer.

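Using the L2 hit rate, the total number of bytes requested from the L2 cache can be estimated, assuming every L2 miss is served by a DRAM read:

L2 bytes hit + miss
= dram_bytes_read / l2_miss_rate
= 10.55 MB / (1.0 - 0.9193)
= 130.73 MB

L2 bytes hit
= 130.73 MB - 10.55 MB = 120.18 MB

The ~130.73 MB total is close to the 134 MB reported for the shared memory bypass.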

The NCU memory tables also report this metric directly.
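As a quick check of the arithmetic, here is a minimal Python sketch. It assumes that every L2 miss corresponds to a DRAM read, so dividing the DRAM bytes read by the miss rate gives the total traffic through L2, and the hit portion is that total minus the misses. The variable names are illustrative and are not NCU metric names.

```python
# Values quoted in this thread (MB and a fractional hit rate).
dram_bytes_read_mb = 10.55
l2_hit_rate = 0.9193

# Assumption: DRAM reads are exactly the bytes that missed in L2.
l2_miss_rate = 1.0 - l2_hit_rate

# miss_rate = miss / (hit + miss)  =>  hit + miss = miss / miss_rate
l2_total_mb = dram_bytes_read_mb / l2_miss_rate

# Bytes served from L2 without going to DRAM.
l2_hit_mb = l2_total_mb - dram_bytes_read_mb

print(f"L2 total (hit + miss): {l2_total_mb:.2f} MB")
print(f"L2 hit:                {l2_hit_mb:.2f} MB")
print(f"L2 miss (DRAM read):   {dram_bytes_read_mb:.2f} MB")
```

With the numbers from this thread, the total comes out at roughly 130.7 MB, which is in the same range as the 134 MB shown on the shared memory bypass link.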

Given the L2 hit rate and the DRAM bytes read, the following is likely true:

The total global memory footprint loaded into shared memory across all thread blocks is <= 10.55 MB. This implies that multiple thread blocks are loading the same data into shared memory, which results in a high L2 hit rate.
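To illustrate how repeated loads of the same data produce a hit rate like this, here is a toy Python model. It is only a sketch: the cache is idealized (nothing is ever evicted, and the first touch of a tile always misses), and the tile count, tile size, and number of blocks per tile are made-up values, not taken from the profiled kernel.

```python
def simulate_l2(num_tiles, tile_bytes, blocks_per_tile):
    """Idealized cache: the first load of a tile misses, later loads hit."""
    cached = set()
    hit_bytes = 0
    miss_bytes = 0
    # Each thread block loads one tile; several blocks share each tile.
    for block in range(num_tiles * blocks_per_tile):
        tile = block % num_tiles
        if tile in cached:
            hit_bytes += tile_bytes
        else:
            cached.add(tile)
            miss_bytes += tile_bytes
    return hit_bytes, miss_bytes


# Hypothetical workload: 64 tiles of 16 KiB, each loaded by 12 blocks.
hit, miss = simulate_l2(num_tiles=64, tile_bytes=16 * 1024, blocks_per_tile=12)
hit_rate = hit / (hit + miss)

print(f"unique footprint (misses): {miss} bytes")
print(f"total bytes requested:     {hit + miss} bytes")
print(f"hit rate:                  {hit_rate:.4f}")
```

In this model the hit rate is 1 - 1/blocks_per_tile, so a hit rate near 0.92 corresponds to each byte being requested about 12 times on average, while only the unique footprint shows up as DRAM traffic.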