Today, while testing CUDA on a compute capability 1.1 GPU, I decided to run the CUDA Visual Profiler on some of the SDK examples.
As far as I know, accesses to a packed float4 array should be coalesced when the k-th thread accesses the k-th float4 element. However, if you run the Visual Profiler on the simpleGL example, it reports all reads and writes as uncoalesced!
I can confirm that this happens at least on compute capability 1.1.
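To make the rule I am relying on concrete, here is a small host-side sketch in plain C++ of the coalescing conditions for 128-bit words on cc 1.0/1.1, as I understand them from the programming guide: the k-th thread of a half-warp reads the k-th float4, and all 16 elements sit in one 256-byte aligned block (two 128-byte transactions). The names `float4Address` and `halfWarpIsCoalesced` are hypothetical helpers I made up for illustration, not part of CUDA or the SDK, and the sketch assumes all 16 threads are active. It is not the simpleGL kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed model of the cc 1.0/1.1 rule for 128-bit (float4) accesses.
// Hypothetical helpers for illustration only; not a CUDA API.
constexpr std::size_t kHalfWarp = 16;     // threads per half-warp
constexpr std::size_t kFloat4Bytes = 16;  // sizeof(float4)

// Byte address of element `index` in a packed float4 array starting at `base`.
std::uint64_t float4Address(std::uint64_t base, std::size_t index) {
    return base + index * kFloat4Bytes;
}

// True when the 16 addresses of a half-warp satisfy the assumed rule:
// thread 0 starts on a 256-byte boundary and thread k touches the k-th
// consecutive float4 after it.
bool halfWarpIsCoalesced(const std::uint64_t addr[kHalfWarp]) {
    const std::uint64_t span = kHalfWarp * kFloat4Bytes;  // 256 bytes
    if (addr[0] % span != 0) {
        return false;  // misaligned start
    }
    for (std::size_t k = 0; k < kHalfWarp; ++k) {
        if (addr[k] != addr[0] + k * kFloat4Bytes) {
            return false;  // out of sequence or strided
        }
    }
    return true;
}
```

Under this model, a kernel where thread k accesses `pos[k]` on a 256-byte aligned array passes the check, which is why the profiler output surprises me.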
Does anyone know why?