today, while testing cuda with a cc 1.1 GPU, i decided to do some cuda visual profiling on some SDK examples.
the thing is that we all know that packed float4 arrays should be coalesced when the k-th thread accesses the k-th float4 element. However, if you run the visual profiler on the simpleGL example
you can notice that you get all read and writes uncoalesced!!!
i can confirm that this happens at least under 1.1 compute capability.
does someone know why?
Depends on where it comes from. If allocated with cudaMalloc(), alignment should be fine. If you just have the array as a global variable, I think alignment is only guaranteed to 16 bytes.
i put the buffer on the GPU by using openGL glBufferData instructions, i’m looking into that at the moment.
also, the nbody simulation (from the sdk) is 100% coalesced and uses openGL buffers, so im trying to understand what is the difference on the aligment.
ok after looking for many things, the problem was at the very basic level, the SDK example was using blocks of size (8, 8, 1), and i think the minimum amount of threads is 16/dimension to start talking about coalesced memory.