The title just about says it. I have a piece of code that I spent all week optimizing, and when I “break” my access macros so that memory accesses are no longer coalesced, performance stays the same, even on really memory-intensive operations like matrix transposition. So how can I tell whether my accesses are coalesced?
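For reference, since the macros themselves aren't shown, here is a generic sketch of the access pattern that usually causes trouble in a transpose (kernel name and parameters are illustrative, not taken from your code):

```cuda
// Naive transpose: reads from `in` are coalesced (consecutive threads in a
// half-warp read consecutive addresses), but writes to `out` are strided by
// `height`, so each half-warp's stores hit widely separated addresses and
// cannot be coalesced on compute capability 1.x hardware.
__global__ void transpose_naive(float *out, const float *in,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column in `in`
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row in `in`
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];    // strided store
}
```

If a kernel like this runs at the same speed as a coalesced version, that by itself suggests the strided accesses are not the bottleneck you think they are, which is why the profiler counters are the right way to check.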
Use the profiler. It will report uncoalesced loads and stores to global memory during kernel execution.
There’s a big difference between devices of compute capability <=1.1 and devices of compute capability >=1.2.
If you take a look at page 85 of the CUDA Programming Guide, it shows that devices of compute capability 1.2 and higher
can handle various global memory access patterns efficiently that are not coalesced on devices of cc 1.1.
So perhaps you’re using a GPU of compute capability 1.2 or 1.3?
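To make the difference concrete: on cc 1.0/1.1, a half-warp only coalesces when the k-th thread accesses the k-th word of an aligned segment, so even a small offset degenerates into 16 separate transactions; on cc 1.2+, the hardware instead issues the minimal set of 32/64/128-byte segment transactions, so the same pattern costs at most one extra transaction. A hypothetical kernel that exposes this:

```cuda
// Shifted copy: thread i reads in[i + offset]. With offset != 0, each
// half-warp's loads are no longer aligned to a 64-byte segment boundary.
// cc 1.0/1.1: 16 separate 32-bit transactions per half-warp (uncoalesced).
// cc 1.2+:   one or two segment transactions (near full bandwidth).
__global__ void shifted_copy(float *out, const float *in, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i + offset];  // misaligned when offset % 16 != 0
}
```

Timing this kernel with offset 0 versus offset 1 on your card is a quick hardware-level test: a large slowdown points to cc 1.0/1.1 coalescing rules, while nearly identical times are what you would expect on cc 1.2+.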
Do you mean the visual profiler or the command line profiler? I have only ever used the visual profiler, and I don’t know where to get or how to use the command line one.
I mean the Visual Profiler. You can configure the profiling session to collect counters for coalesced and uncoalesced global memory loads and stores.
How do I do that? Is there a profile file somewhere or do I need to muck through the GUI?
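If you’d rather skip the GUI entirely, the command-line profiler is built into the driver and is enabled with environment variables plus a small config file. A sketch based on the 1.x-era profiler docs (file names and `my_app` are illustrative; note the incoherent counters are only populated on cc 1.0/1.1 hardware):

```
# Enable the built-in command-line profiler and point it at a config file.
export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG=profile_config.txt
export CUDA_PROFILE_LOG=cuda_profile.log   # where results are written

# profile_config.txt lists one counter per line, e.g.:
#   gld_coherent     (coalesced global loads)
#   gld_incoherent   (uncoalesced global loads)
#   gst_coherent     (coalesced global stores)
#   gst_incoherent   (uncoalesced global stores)

./my_app   # run your program normally; counters appear in cuda_profile.log
```

Nonzero `gld_incoherent` or `gst_incoherent` values are the direct answer to whether your accesses are coalesced.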