No, it’s uncoalesced, because all threads in a warp (or half-warp) would be reading the same value. You need to transpose your load so that sequential threads access sequential values. The simplest way to do this is to index based on threadIdx.x.
So instead of looping over the data in every thread, first just do a parallel load of the data, and then loop to do the computation. Something like this pseudocode (following on from your original pseudocode).
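A minimal sketch of that idea (the block size, names, and the placeholder summation are mine, just for illustration):

__global__ void kernel(const float *g_data, float *g_out, int n)
{
    __shared__ float s_data[256];  // one element per thread, assumes blockDim.x == 256

    // Coalesced parallel load: thread k reads element k of this block's chunk,
    // so sequential threads access sequential addresses.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = (i < n) ? g_data[i] : 0.0f;
    __syncthreads();  // wait until the whole buffer is loaded

    // Then loop over the buffered data in shared mem, not global mem.
    float result = 0.0f;
    for (int j = 0; j < 256; ++j)
        result += s_data[j];  // placeholder computation

    if (i < n)
        g_out[i] = result;
}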
Hi, I implemented Mark’s shared mem buffering and it speeds things up by 15-16x.
I used only 1 KB of shared mem for this, so there is enough space left for other things I will need in shared mem. BTW: buffer sizes of 2/4/8 KB didn't improve things.
But there is still a memory bandwidth overuse by a factor of the multiprocessor count (in my case 12), as all multiprocessors read the same values into their own shared mem.
So I'm still curious whether texture caching could improve the bandwidth usage.
G80 is a real number cruncher. It looks like CUDA on an 8800 GTS does things about 12 times faster than the best known CPU implementations (on a Xeon 3.2 GHz). And there isn't very much optimized in my code yet.
Now I've tried texture caching instead of the shared mem buffering. It's 3x slower (1D, 1-component float texture in linear space; perhaps array mem textures will do better).
But loading the shared mem buffer from this linear texture did give a slight improvement of 20%.
The texture cache is optimized for 2D locality. A large 1D texture will likely incur more cache misses than a 2D texture would, provided the accesses across threads can be made to exhibit 2D locality.
If this is the case and you had, say, a 1024-element 1D texture, it might be better to store it as a 32x32 2D texture and change your indexing from:
texture(index)
to
texture(index & 31, index >> 5);
(Note I used & and >> instead of % and / to ensure fast performance, since integer modulo and division require several cycles.)
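As a concrete sketch of that indexing (using the legacy texture reference API of that CUDA generation; the names tex, kernel, h_data, and the host-side setup are my assumptions, not from this thread):

texture<float, 2, cudaReadModeElementType> tex;

__global__ void kernel(float *out, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        // index & 31 == index % 32 (column), index >> 5 == index / 32 (row);
        // +0.5f addresses the texel center with unnormalized coordinates.
        out[index] = tex2D(tex, (index & 31) + 0.5f, (index >> 5) + 0.5f);
    }
}

// Host side: copy the 1024 floats into a 32x32 cudaArray and bind it.
// h_data is a hypothetical host array of 1024 floats.
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *arr;
cudaMallocArray(&arr, &desc, 32, 32);
cudaMemcpyToArray(arr, 0, 0, h_data, 1024 * sizeof(float), cudaMemcpyHostToDevice);
cudaBindTextureToArray(tex, arr, desc);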