I’m trying to figure out whether, and when, memory coalescing is working in my kernel. I found a number of posts asking for more explanation of the profiler output, but the responses didn’t really answer the question. :-)
I don’t think I’m getting coalesced loads in my actual kernel, so I backed off to a simple test kernel. I have 4 ints per “cell”, naturally stored on the host as an array of structs (AoS). The first test kernel accesses the data in that AoS layout. The second kernel accesses the same data as 4 separate arrays (a struct of arrays, SoA). As expected, the second kernel is much faster, roughly 7x by gputime. Here is the profiler output. I have a GTX 260, so I’m not requesting the gld_incoherent counter (it is documented to always read zero on this hardware).
method=[ _Z16TestCuda_Kernel1If6float3EvjjPKT_S3_PKjPS1_ ] gputime=[ 2736.960 ] cputime=[ 2776.364 ]
occupancy=[ 1.000 ] gld_coherent=[ 69456 ] gld_32b=[ 0 ] gld_64b=[ 0 ] gld_128b=[ 69456 ]
method=[ _Z16TestCuda_Kernel2If6float3EvjjPKT_S3_PKjS5_S5_S5_PS1_ ] gputime=[ 398.496 ] cputime=[ 419.559 ]
occupancy=[ 1.000 ] gld_coherent=[ 34720 ] gld_32b=[ 0 ] gld_64b=[ 34720 ] gld_128b=[ 0 ]
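For reference, the two access patterns behind these numbers can be sketched roughly as follows. This is only a minimal illustration, not the actual kernels from the profiler output (those are templated and take more arguments): `Cell`, `sumAoS`, `sumSoA`, and `compareLayouts` are hypothetical names, and the "add the four ints" body is an assumption made just to give each thread something to load.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 16-byte cell: four ints stored together (array of structs).
struct Cell { int a, b, c, d; };

// AoS: thread i reads the four fields of cells[i]. For any one field,
// neighbouring threads touch addresses sizeof(Cell) = 16 bytes apart.
__global__ void sumAoS(const Cell* cells, int* out, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = cells[i].a + cells[i].b + cells[i].c + cells[i].d;
}

// SoA: each field lives in its own array, so for any one load
// neighbouring threads touch addresses 4 bytes apart.
__global__ void sumSoA(const int* a, const int* b, const int* c,
                       const int* d, int* out, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i] + c[i] + d[i];
}

// Runs both kernels on the same data (n > 0 assumed) and returns true
// when the results match and the copies back to the host succeeded.
bool compareLayouts(unsigned n)
{
    std::vector<Cell> cells(n);
    std::vector<int> a(n), b(n), c(n), d(n);
    for (unsigned i = 0; i < n; ++i) {
        cells[i] = Cell{ static_cast<int>(i % 100), 2, 3, 4 };
        a[i] = cells[i].a; b[i] = cells[i].b;
        c[i] = cells[i].c; d[i] = cells[i].d;
    }

    const size_t bytes = n * sizeof(int);
    Cell* dCells = nullptr;
    int *dA = nullptr, *dB = nullptr, *dC = nullptr, *dD = nullptr;
    int *dOutAoS = nullptr, *dOutSoA = nullptr;
    cudaMalloc(&dCells, n * sizeof(Cell));
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes); cudaMalloc(&dD, bytes);
    cudaMalloc(&dOutAoS, bytes); cudaMalloc(&dOutSoA, bytes);

    cudaMemcpy(dCells, cells.data(), n * sizeof(Cell), cudaMemcpyHostToDevice);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, c.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dD, d.data(), bytes, cudaMemcpyHostToDevice);

    const unsigned threads = 256;
    const unsigned blocks = (n + threads - 1) / threads;
    sumAoS<<<blocks, threads>>>(dCells, dOutAoS, n);
    sumSoA<<<blocks, threads>>>(dA, dB, dC, dD, dOutSoA, n);

    // The blocking copies below also wait for both kernels to finish.
    std::vector<int> rAoS(n), rSoA(n);
    cudaError_t e1 = cudaMemcpy(rAoS.data(), dOutAoS, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(rSoA.data(), dOutSoA, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dCells); cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dD);
    cudaFree(dOutAoS); cudaFree(dOutSoA);

    return e1 == cudaSuccess && e2 == cudaSuccess && rAoS == rSoA;
}
```

Calling `compareLayouts(1 << 20)` from `main` and profiling the run gives one AoS and one SoA kernel to compare. In terms of addresses alone, a half-warp of 16 threads reading one field covers 16 x 16 = 256 bytes in the AoS kernel, but only 16 x 4 = 64 contiguous bytes per array in the SoA kernel.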
My question is: aside from the execution time, what useful information, if any, is the profiler giving me about my memory loads?
Notice that the first kernel, which does the uncoalesced reads, is much slower, as expected. But according to the profiler it is actually doing MORE coherent loads (69456 vs. 34720), and they are 128-byte loads, while the fast, coalesced kernel is doing 64-byte loads. By that measure, shouldn’t the first kernel be faster? What’s up with that?