Mismatch in memory bandwidth statistics

I’m running a few different kernels through the Visual Profiler, looking specifically at memory bandwidth. For most of the kernels I’ve checked, there’s a massive difference between the “Memory Bandwidth and Utilization” table and the “Memory Statistics” diagram. For example, my L2 read throughput shows 191GB/s in the table and 340MB/s in the diagram. Device memory write throughput is 25.5GB/s in the table and 45.7MB/s in the diagram.

Can anyone explain why I’m getting different numbers?

I’m running on a V100 DGX machine with CUDA 9.2 (it’s not my machine, so I can’t change the version).


The “Memory Bandwidth And Utilization” table shows the memory bandwidth used by that kernel, whereas in the “Memory Statistics” diagram the data path between “Unified Cache” and “L2 Cache” shows the total amount of data transferred, not a rate. Please see the highlighted parts in the attached screenshots.

In your case, 191GB/s is the L2 read bandwidth used by that kernel, and 340MB is the total amount of data read from L2 during the kernel.
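
As a rough sanity check on that reading, dividing the total amount by the bandwidth should give about the same kernel duration for both pairs of numbers you quoted. The short Python sketch below does that arithmetic. The helper name `implied_duration_ms` is made up for illustration, and it assumes decimal units (1 GB = 1000 MB), which may not be exactly how the profiler rounds its values.

```python
def implied_duration_ms(total_mb, throughput_gb_per_s):
    """Kernel duration implied by total data moved and average throughput."""
    # Assumption: decimal units (1 GB = 1000 MB); the profiler may use binary units.
    seconds = (total_mb / 1000.0) / throughput_gb_per_s
    return seconds * 1000.0


# Numbers quoted in the question: (total from diagram, bandwidth from table)
l2_read_ms = implied_duration_ms(340.0, 191.0)     # L2 read
dram_write_ms = implied_duration_ms(45.7, 25.5)    # device memory write

print(f"L2 read implies            {l2_read_ms:.2f} ms")
print(f"Device memory write implies {dram_write_ms:.2f} ms")
```

Both pairs come out at roughly 1.8 ms, which is what you would expect if the table reports a rate and the diagram reports a total for the same kernel. You can compare that figure against the kernel duration shown in the timeline.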