Discrepancy in profiler reported stored data size

My kernel writes 8 bytes to global memory for each thread, no matter what. The profiler, however, reported a number different from what I expected. The kernel has one transaction per request, yet the reported global store size is 4 to 10 times larger than expected, varying across runs of the kernel (the kernel does involve random numbers and has divergence, but it should always write exactly 8 bytes per thread).

I want to understand what the profiler data really means. Even if it means something different from what I expected, it should at least report the same number across runs, since the kernel always requests the same amount of data to store.


It is very hard to help you interpret profiler results when you do not post a reproducible example.

You mention both random numbers and divergence. In that case it seems very likely you are achieving one transaction per store instruction but executing a much higher number of store instructions than expected.

The Nsight Visual Studio Edition CUDA Profiler can collect per-SASS-instruction statistics, including instructions executed, an active-mask histogram, a not-predicated-off histogram, and a memory-transactions histogram. With this data it should be easy to see why the stored data size is higher than expected.

If this approach shows a constant data size on each run, then it is possible your algorithm does not evenly distribute memory stores across the frame buffer partitions. On current hardware the PM system cannot observe PM counters from all L2 slices or all frame buffer partitions at once, so the profiler collects what it can and estimates the final value. This can produce the error you are seeing.