global mem reads coalesced per block or warp?

I want to work with every thread (in all blocks) on the same data

(every thread has different initial parameters for some test)

so the kernel is basically:



for (i=1..n)



    bestResult=max(result, bestResult)


So is reading GlobalData[i] coalesced for all threads

in a block or a warp ?

I have several blocks to use all multiprocessors. Would it be

a big benefit to use a texture for caching ?

No, it’s uncoalesced, because all threads in a warp (or half-warp) would be reading the same value. You need to transpose your load so that sequential threads access sequential values. The simplest way to do this is to index based on threadIdx.x.

So instead of looping over the data in every thread, first just do a parallel load of the data, and then loop to do the computation. Something like this pseudocode (following on from your original pseudocode).

__shared__ float sData[n];



for (i=1..n/blockDim)

   sData[threadId + i*blockDim]=GlobalData[threadId + i*blockDim]




   bestResult=max(result, bestResult)



So I’m wasting a lot of bandwith !? Im running 1024 blocks with 256 threads.

so if every half-warp re-reads the same values there should be an

bandwith overhead of factor 16384 … wow.

Also thought this, but my data is much bigger than shared mem and if I understand

it right, you shoudn’t use more than half of it (8kb) to enhance block-switching?

So I can only read some data structs to shared , compute, read next structs…

Also I planned to use shared for something else in the future (so kind of

mini stack…)

Perhabs it will be simpler to use texture caching to access the data?

But I don’t understand this texread() stuff. E.g. I could not manage

to read float3 from texture.

But I think to start a new thread with texturing questions

Hi, I implemented Mark’s shared mem buffering and it speeds things up by 15-16x.

I used only 1kb shared mem for this, so there is enough space for other things I will need in shared mem. BTW: buffer size of 2/4/8kb didn’t improve things.

But there is still an mem bandwith overuse of #multiproc (in my case 12) as all multiprocs read the same values in their corresponding shared mem.

So I’m still curious if texture caching could improve on the bandwith usage.

G80 is a real number cruncher. It looks like CUDA on 8800GTS does things by 12 times faster than the best known CPU implementations (on a Xeon 3.2GHz). And still there isn’t very much optimized in my code yet :))

Now I’ve tried texture caching instead of the shared mem buffering. It’s 3x times slower :ermm: (1d, 1 component float texture in linear space, perhabs array mem textures will do better).
But loading shared mem buffer from this linear texture did some slight improvement of 20%.

The texture cache is optimized for 2D locality. Using a large 1D texture will likely result in more cache misses than a 2D texture if accesses across threads can be made to observe 2D locality.

If this is the case, if you, say, had a 1024 element 1D texture, it might be better to store it as a 32x32 2D texture and change your indexing from:



texture(index & 31, index >> 5);

(Note I used & instead of % to ensure fast performance, since modulo requires several cycles.)