Correct understanding coalesced memory loading?

jordyvaneijk · July 30, 2008, 11:14am

Hello everyone,

In an attempt to optimize our kernel we tried to make sure that our global memory loads are coherent/coalesced. Below is a small snippet of (pseudo-)code:

__global__ void kernel(){

     //  index calculations.

    T[index] = C1*T[index-1]+C2*T[index+1]+C3*T[index-dimx]

                     + C4*T[index+dimx]+C5*T[index-dimx*dimy]+C6*T[index+dimx*dimy]

                     + C7*T[index];

}

Now our conclusion was that the associated loads can never be coherent, at least not all of them. This can be seen by the fact that if the T[index] loads would be coherent, so having a BaseAdress which is a multiple of 16, T[index-1] can not be coherent.

Is this correct reasoning or are we missing something here?

Jordy

Reimar · July 30, 2008, 11:29am

Yes, you are missing that the address of T[index] minus the thread ID must be divisible by 16. So if (for a fixed number “index”, not one that differs with each thread like in your example code) thread 2 loads T[index] and thread 1 loads T[index-1] both can be coalesced (also note the relaxed coalescing rules for GTX2xx hardware).

Or in other words: read your data into shared memory first, this will provide the same level of coalescing as GTX2xx hardware provides and in addition reduce your global memory bandwidth by half.

Using texure reads might get a similar effect with less coding effort, too.

jordyvaneijk · July 30, 2008, 12:09pm

Thanks Reimar.

Possible I’m not getting it but the problem of having a base address (so if the adres if the tid equals zero) that is not divisible by 16 (for example for T[index-1]) remains if you would load all data needed for a block first into shared memory.

We did find some examples for a 2D texture bind, although they are not clear to us yet, but didn’t find any example how to access the neighbors of an element in a 3D data set other that mapping the set to a linear array and applying offsets for the different neighbors. Since numerically, this is a very common problem, is there a faster way to do this? Could anyone help us out with a small example.

Thanks

Jordy

MisterAnderson42 · July 30, 2008, 1:22pm

Update to the CUDA 2.0 beta and use 3D texture reads :)

Reimar · July 30, 2008, 1:28pm

I’d be careful with recommending 3D textures, the maximum dimensions are extremely limiting. tex1Dfetch with manual index calculation might be more future-proof and not that much slower.

E.D_Riedijk · July 30, 2008, 5:42pm

I don’t think the dimensions are that limiting. The size will be much bigger than the largest available GPU (4 Gb on the new tesla)

Tijn · July 30, 2008, 7:23pm

But where do you fetch from with tex3D() ? Is there something like a 3D cudaArray or should the 3D data set be translated to a 2D array like it is done with linear arrays ?

E.D_Riedijk · July 30, 2008, 7:56pm

yeah, i remember seeing a 3d cuda array. You can see it in the 3d texture example in the latest sdk

Topic		Replies	Views
Coalesced Loads Question CUDA Programming and Performance	4	1906	July 7, 2008
gld coalesced = 0, but addresses are aligned! CUDA Programming and Performance	10	1661	March 20, 2010
Coalesced Memory access related doubt CUDA Programming and Performance	13	2079	December 9, 2010
Need help on non-coalesced access CUDA Programming and Performance	0	1130	May 9, 2009
Memory coalescing and multiple arrays CUDA Programming and Performance	23	11770	March 20, 2009
global mem reads coalesced per block or warp? CUDA Programming and Performance	5	5508	March 6, 2007
question about texture reads and coalescading reads CUDA Programming and Performance	3	1853	December 12, 2008
Coalescing - beginner question CUDA Programming and Performance	10	1783	June 23, 2010
Newbie question regarding global load CUDA Programming and Performance	2	1661	September 2, 2008
Problems with coalescing memory accesses CUDA Programming and Performance	4	3784	August 26, 2009

Correct understanding coalesced memory loading?

Related topics