Correct understanding of coalesced memory loading?

Hello everyone,

In an attempt to optimize our kernel, we tried to make sure that our global memory loads are coalesced. Below is a small snippet of (pseudo-)code:

__global__ void kernel()
{
    // index calculations.

    T[index] = C1*T[index-1] + C2*T[index+1]
             + C3*T[index-dimx] + C4*T[index+dimx]
             + C5*T[index-dimx*dimy] + C6*T[index+dimx*dimy]
             + C7*T[index];
}

Now our conclusion was that the associated loads can never all be coalesced. This can be seen as follows: if the T[index] loads are coalesced, i.e. they start from a base address which is a multiple of 16, then the T[index-1] loads cannot be coalesced.

Is this correct reasoning or are we missing something here?


Yes, you are missing that it is the address of T[index] minus the thread ID that must be divisible by 16. So if (for a fixed value of “index”, not one that differs with each thread like in your example code) thread 2 loads T[index] and thread 1 loads T[index-1], both loads can be coalesced (also note the relaxed coalescing rules on GTX2xx hardware).
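A minimal sketch of the difference (the kernel names and the `base` parameter here are illustrative, not from the original code):

```cuda
// Each thread of a half-warp reads one consecutive float.
// If 'base' is a multiple of 16, the 16 loads coalesce into one transaction.
__global__ void aligned_read(const float *T, float *out, int base)
{
    int tid = threadIdx.x;               // 0..15 within the half-warp
    out[base + tid] = T[base + tid];     // coalesced when base % 16 == 0
}

// Shifting every thread by -1 moves the whole access window: thread tid
// reads T[base + tid - 1], so the segment starts at base-1, which is
// misaligned -> 16 separate transactions on pre-GTX2xx hardware.
__global__ void shifted_read(const float *T, float *out, int base)
{
    int tid = threadIdx.x;
    out[base + tid] = T[base + tid - 1]; // misaligned unless (base-1) % 16 == 0
}
```

The point is that alignment is a property of the whole half-warp's access window, not of any single thread's address.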

Or in other words: read your data into shared memory first. This will provide the same level of coalescing as GTX2xx hardware provides and in addition reduce your global memory traffic by about half.
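For example, a simplified sketch of that idea for the 1D neighbors only (the tile size and boundary handling are assumptions; the halo treatment at the grid edges is omitted):

```cuda
#define TILE 256

__global__ void stencil_shared(const float *T, float *out,
                               float C1, float C2, float C7)
{
    __shared__ float s[TILE + 2];          // tile plus one halo cell per side
    int gid = blockIdx.x * TILE + threadIdx.x;

    // Coalesced, aligned load of the tile; only the two halo cells
    // are fetched separately.
    s[threadIdx.x + 1] = T[gid];
    if (threadIdx.x == 0)        s[0]        = T[gid - 1];
    if (threadIdx.x == TILE - 1) s[TILE + 1] = T[gid + 1];
    __syncthreads();

    // All neighbor accesses now hit shared memory, not global memory.
    out[gid] = C1 * s[threadIdx.x] + C2 * s[threadIdx.x + 2]
             + C7 * s[threadIdx.x + 1];
}
```

The bulk load `T[gid]` starts at an aligned base address, so it coalesces; the misaligned neighbor reads move into shared memory where alignment does not matter.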

Using texture reads might achieve a similar effect with less coding effort, too.

Thanks Reimar.

Possibly I’m not getting it, but the problem of having a base address (i.e. the address where tid equals zero) that is not divisible by 16, for example for T[index-1], remains even if you first load all the data a block needs into shared memory.

We did find some examples of a 2D texture bind, although they are not yet clear to us, but we didn’t find any example of how to access the neighbors of an element in a 3D data set other than mapping the set to a linear array and applying offsets for the different neighbors. Since this is numerically a very common problem, is there a faster way to do it? Could anyone help us out with a small example?



Update to the CUDA 2.0 beta and use 3D texture reads :)

I’d be careful with recommending 3D textures, the maximum dimensions are extremely limiting. tex1Dfetch with manual index calculation might be more future-proof and not that much slower.
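A sketch of what that could look like (the texture reference name and the index calculation are assumptions, not from the original code):

```cuda
// Texture reference bound to the linear device array holding T.
texture<float, 1, cudaReadModeElementType> texT;

__device__ float fetchT(int x, int y, int z, int dimx, int dimy)
{
    // Manual 3D -> 1D index calculation; the read goes through the
    // texture cache, so misaligned neighbor accesses are much cheaper
    // than uncoalesced plain global loads.
    return tex1Dfetch(texT, (z * dimy + y) * dimx + x);
}

// In the stencil kernel the neighbors would then be read as, e.g.,
//   C1 * fetchT(x-1, y, z, dimx, dimy) + C2 * fetchT(x+1, y, z, dimx, dimy) + ...
```

Note that `tex1Dfetch` on linear memory has its own size limit (2^27 elements), but that is far larger than the 3D texture dimension limits.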

I don’t think the dimensions are that limiting. The maximum 3D texture size will be much bigger than the memory of the largest available GPU (4 GB on the new Tesla).

But where do you fetch from with tex3D()? Is there something like a 3D cudaArray, or should the 3D data set be translated to a 2D array as is done with linear arrays?

Yeah, I remember seeing a 3D cudaArray. You can see it in the 3D texture example in the latest SDK.
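The host-side setup looks roughly like this (a sketch along the lines of the SDK’s 3D texture sample; error checking omitted, the helper name is made up):

```cuda
texture<float, 3, cudaReadModeElementType> tex;   // 3D texture reference

void bind3D(const float *h_data, int dimx, int dimy, int dimz)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(dimx, dimy, dimz);

    // Allocate a 3D cudaArray and copy the host volume into it.
    cudaArray *d_array;
    cudaMalloc3DArray(&d_array, &desc, extent);

    cudaMemcpy3DParms p = {0};
    p.srcPtr   = make_cudaPitchedPtr((void *)h_data,
                                     dimx * sizeof(float), dimx, dimy);
    p.dstArray = d_array;
    p.extent   = extent;
    p.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&p);

    cudaBindTextureToArray(tex, d_array, desc);
}

// Inside a kernel the neighbors are then fetched with tex3D(tex, x-1, y, z) etc.
```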