For the gst/gld coalesced/uncoalesced counts reported by the Visual Profiler, how many bytes does each counted access represent when every memory operation is a float? Are coalesced accesses all 64 bytes and uncoalesced accesses all 32 bytes?
I am trying to calculate the global memory bandwidth of a program. The GPU I am using (8800 GTX) is only compute capability 1.0, so it does not provide the gst32/64/128 counters from which global memory bandwidth could be derived.
According to the Compute Visual Profiler, gst coalesced is the number of coalesced global memory stores.
I am running a kernel which does just 1 store (of type float). There are 2048 * 256 = 524288 instances (threads) of this kernel on every TPC. The Visual Profiler tells me that all stores are coalesced. Given that accesses are coalesced per half-warp, every 16 threads should produce a single memory access of 64 bytes (sizeof(float) * 16). So I expect the counter to show (2048 * 256)/16 = 32768 accesses of 64 bytes each. What it shows me is 32768 * 4 = 131072. Why is this? Is it counting every byte of the float?
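To make the arithmetic above easy to check, here is a minimal Python sketch of my expectation. It only restates the numbers from this post; the function names are hypothetical helpers, and the assumption that one coalesced store per half-warp moves 64 bytes is mine, not something the profiler documentation confirms.

```python
# Sketch of the expected counter value, under the assumptions stated in the post.
HALF_WARP = 16    # threads whose accesses are coalesced together on CC 1.0
FLOAT_BYTES = 4   # sizeof(float)


def expected_coalesced_stores(num_threads, stores_per_thread=1):
    """Expected number of coalesced stores: one per half-warp per store."""
    return num_threads * stores_per_thread // HALF_WARP


def bytes_per_coalesced_store():
    """Assumed size of one coalesced access: 16 floats."""
    return HALF_WARP * FLOAT_BYTES


num_threads = 2048 * 256                      # 524288 threads, 1 float store each
expected = expected_coalesced_stores(num_threads)
reported = 131072                             # value shown for gst coalesced

print(expected)                      # 32768
print(bytes_per_coalesced_store())   # 64
print(reported / expected)           # 4.0
```

The ratio of 4 happens to equal sizeof(float), which is why I suspect the counter might be per byte, but the arithmetic alone cannot tell me what the counter actually measures.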