Unified memory in visual profiler

ddpruitt · July 10, 2019, 8:20pm

I’m profiling a few different applications in the visual profiler on V100s and P100s. Some kernels use a lot of read-only global memory, so some of the traffic is routed through the texture subsystem. For some of these kernels, the traffic listed under the memory bandwidth option shows both texture and global memory loads, and in a few cases the two numbers are nearly identical. Totaling them up, I get about 2x as much data read as I expect. I checked my expectation against the number of load instructions executed: the profiler reports no texture loads, but the global load count is consistent with the amount of data I expect to read.
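To make the cross-check concrete, here is a rough sketch of the arithmetic I mean. The function names and all the numbers are made up for illustration, and it assumes the load count is per warp, every warp is fully active, and every thread reads the same number of bytes, which will not hold for every kernel.

```python
WARP_SIZE = 32  # threads per warp on P100 and V100


def expected_global_read_bytes(warp_load_instructions, bytes_per_thread,
                               active_threads_per_warp=WARP_SIZE):
    """Estimate bytes requested by global loads from a warp-level load count.

    Assumes each counted instruction is executed by one warp and each active
    thread requests bytes_per_thread bytes (hypothetical simplification).
    """
    return warp_load_instructions * active_threads_per_warp * bytes_per_thread


def read_ratio(reported_bytes, expected_bytes):
    """Ratio of bytes reported by the profiler to bytes expected."""
    return reported_bytes / expected_bytes


if __name__ == "__main__":
    # Hypothetical kernel: 1,000,000 warp-level loads of one 4-byte float each.
    expected = expected_global_read_bytes(1_000_000, 4)
    # Hypothetical profiler total (global + texture rows added together).
    reported = 256_000_000
    print(expected, read_ratio(reported, expected))
```

A ratio close to 2 is the pattern I keep seeing; a ratio close to 1 would mean the profiler total matches the instruction count.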

The same thing happens with a few kernels that make heavier use of constant and texture memory. The data appears to be counted twice, and the total reads shown in the visual profiler are higher than both the figure reported by the application (when it reports one) and the figure implied by the number of instructions executed.

I’m assuming this is because global reads through the L1 cache that are routed via the texture pipeline are counted both as global reads and as texture reads, and the totals row then adds the two together. This inflates the total throughput, in some cases beyond what I believe the cache is even capable of (I’ve seen it report 20+GB/s total).
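Here is a small sketch of what I suspect is happening to the totals. This is only my guess at the accounting, not something I have confirmed about the profiler, and the function names and byte counts are hypothetical.

```python
def reported_total(global_bytes, texture_bytes):
    """Total as I suspect the profiler computes it: the two rows summed."""
    return global_bytes + texture_bytes


def deduplicated_total(global_bytes, texture_bytes, routed_via_texture_bytes):
    """Total with bytes that appear in both rows counted only once.

    routed_via_texture_bytes is the (assumed) portion of global reads that
    was serviced through the texture pipeline and so shows up in both rows.
    """
    return global_bytes + texture_bytes - routed_via_texture_bytes


if __name__ == "__main__":
    # Hypothetical kernel where every global read went through the texture path.
    global_bytes = 128_000_000
    texture_bytes = 128_000_000
    print(reported_total(global_bytes, texture_bytes))
    print(deduplicated_total(global_bytes, texture_bytes, 128_000_000))
```

If this guess is right, the deduplicated figure should match the amount implied by the load instruction count, while the summed figure is roughly double.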

Is this a bug, intended behavior, or am I missing something else?