No, it’s uncoalesced, because all threads in a warp (or half-warp) would be reading the same value. You need to transpose your load so that sequential threads access sequential values. The simplest way to do this is to index based on threadIdx.x.
So instead of looping over the data in every thread, first do a parallel load of the data into shared memory, and then loop over the shared copy to do the computation. Something like this pseudocode (following on from your original pseudocode).
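A minimal sketch of that pattern, assuming a hypothetical kernel with a placeholder sum as the per-thread computation (the actual computation would come from your original code):

```cuda
#define BLOCK_SIZE 256  // assumed thread-block size

__global__ void kernel(const float *g_data, float *g_out, int n)
{
    __shared__ float s_data[BLOCK_SIZE];

    // Coalesced load: sequential threads (threadIdx.x) read
    // sequential global addresses.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        s_data[threadIdx.x] = g_data[idx];
    __syncthreads();

    // Now each thread can loop over the buffered values without
    // touching global memory again.
    float result = 0.0f;
    for (int i = 0; i < blockDim.x; ++i)
        result += s_data[i];  // placeholder computation

    if (idx < n)
        g_out[idx] = result;
}
```

The key point is that the load is indexed by threadIdx.x, so each half-warp's reads fall in one contiguous, coalesced segment; the redundant per-thread reads then hit shared memory instead of DRAM.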
Hi, I implemented Mark’s shared mem buffering and it speeds things up by 15-16x.
I used only 1 KB of shared mem for this, so there is enough space for other things I will need in shared mem. BTW: buffer sizes of 2/4/8 KB didn't improve things.
But there is still a memory bandwidth overuse by a factor of the number of multiprocessors (in my case 12), as all multiprocessors read the same values into their own shared mem.
So I'm still curious whether texture caching could improve on the bandwidth usage.
G80 is a real number cruncher. It looks like CUDA on an 8800 GTS runs about 12 times faster than the best known CPU implementations (on a 3.2 GHz Xeon), and there isn't much optimized in my code yet.
Now I've tried texture caching instead of the shared mem buffering. It's 3x slower (1D, one-component float texture in linear memory; perhaps textures bound to CUDA arrays will do better).
But loading the shared mem buffer from this linear texture gave a slight improvement of about 20%.
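For reference, combining the two approaches might look like the sketch below, using the legacy texture reference API of that CUDA generation (tex1Dfetch for textures bound to linear memory); names and sizes are placeholders:

```cuda
#define BLOCK_SIZE 256  // assumed thread-block size

// 1D one-component float texture bound to linear device memory
// (on the host: cudaBindTexture(0, tex, d_data, n * sizeof(float)))
texture<float, 1, cudaReadModeElementType> tex;

__global__ void kernel(float *g_out, int n)
{
    __shared__ float s_buf[BLOCK_SIZE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage through the texture cache into shared memory,
    // so repeated reads across blocks can hit the cache.
    if (idx < n)
        s_buf[threadIdx.x] = tex1Dfetch(tex, idx);
    __syncthreads();

    // Placeholder computation over the buffered values.
    float result = 0.0f;
    for (int i = 0; i < blockDim.x; ++i)
        result += s_buf[i];

    if (idx < n)
        g_out[idx] = result;
}
```

The texture cache can only help here to the extent that different multiprocessors re-read overlapping data; it does not reduce the total traffic for data that is read exactly once.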