I am performing some simple calculations using global memory. I allocate device memory with cudaMalloc, copy the data over with cudaMemcpy, and then pass the device pointers to my kernel call. The kernels are manipulating the data correctly, as I can see when I copy the results back to the CPU. However, when I look at the profiler, gld coalesced, gld uncoalesced, gst coalesced, and gst uncoalesced are all equal to zero. I am using profiler 1.1 to start the session. Any ideas why the counters all show zero when I am pretty sure each kernel is doing many global reads?
Are either or both of you using a GTX 260 or 280? The newer hardware handles memory coalescing in a very different way (see the programming guide), which causes these counters to return zero all the time. It’s kind of annoying, but it is a known issue. All the other profiler counters are still correct.
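If you are not sure which hardware generation you have, a quick way to check is to query the device properties. The following is only a minimal sketch: the helper name `isGtx200Class` is made up for illustration, and it assumes that matching compute capability 1.3 (what the GTX 260 and 280 report) is a good enough test for the hardware described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): returns true for compute
// capability 1.3, which is what the GTX 260 and GTX 280 report.
// Assumption: this is the hardware class whose coalescing counters read zero.
static bool isGtx200Class(int major, int minor) {
    return major == 1 && minor == 3;
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed for device %d: %s\n",
                    dev, cudaGetErrorString(err));
            continue;
        }
        // Print the name and compute capability so the hardware can be identified.
        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
        if (isGtx200Class(prop.major, prop.minor)) {
            printf("  GTX 260/280 class: expect the gld/gst coalesced and "
                   "uncoalesced counters to read zero.\n");
        }
    }
    return 0;
}
```

Compile it with nvcc and run it on the machine you are profiling. If the device reports 1.3, the zero counters are most likely the known issue rather than a problem with your kernels.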