Coalescing Memory Access Patterns: how the heck do you do it?

The title says it all: I have a working 2-D diffusion code, but it runs very slowly and also very erratically (i.e., putting in more than a few hundred cells returns non-answers). This is essentially a re-post of an earlier topic I posted.

The basic idea is that the loads and stores from global memory issued by a half warp of threads (16 threads on all current CUDA hardware) must occur in an orderly, sequential fashion, addressing a segment of global memory that is aligned on a 32-, 64-, or 128-byte boundary, depending on the word size the kernel is loading or storing. This is nicely illustrated in Chapter 5 of the programming guide.
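
To make the rules concrete, here is a minimal sketch of the two extremes, assuming 4-byte float words (the kernel names are mine, not from the guide). In the first kernel, thread k of each half warp reads word k of an aligned segment, so the 16 loads are serviced as a single transaction; in the second, the loads are scattered across many segments and get serialized:

__global__ void copy_coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];    /* consecutive threads touch consecutive words */
}

__global__ void copy_strided(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];    /* each half warp spans many segments: uncoalesced */
}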

Consider a finite difference calculation like this one, using global memory variables:

d2u[i] = (u[i-1] - 2*u[i] + u[i+1])/(h*h)
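
In kernel form, this stencil might look like the following sketch (the kernel name and the bounds check are my own additions):

__global__ void diffusion_naive(float *d2u, const float *u, float h, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        d2u[i] = (u[i-1] - 2.0f*u[i] + u[i+1]) / (h*h);
}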

Here the loads cannot be coalesced: each half warp's u[i-1] and u[i+1] accesses are shifted one word off the aligned segment boundary. The usual solution is to load the values of u from global memory into on-chip shared memory first, and then perform the calculation using the shared memory copy instead:

__shared__ float shmu[BLOCKSIZE + 2];    /* +2 words for the halo cells */

.....

shmu[j] = u[i];    /* coalesced load, with j = threadIdx.x + 1; the threads
                      at the block edges must also fill the two halo cells */

__syncthreads();

....

d2u[i] = (shmu[j-1] - 2*shmu[j] + shmu[j+1])/(h*h);

There are plenty of examples of this sort of approach floating around in tutorial material on the 'net and in a couple of the SDK examples.
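
To tie the pieces together, here is one minimal, self-contained sketch of the shared memory version, not taken from any particular SDK example; BLOCKSIZE, the kernel name, and the halo handling are my own choices, and the kernel assumes it is launched with BLOCKSIZE threads per block:

#define BLOCKSIZE 128

__global__ void diffusion_shared(float *d2u, const float *u, float h, int n)
{
    __shared__ float shmu[BLOCKSIZE + 2];        /* +2 words for the halo */

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = threadIdx.x + 1;                     /* shift by 1 for the halo */

    if (i < n)
        shmu[j] = u[i];                          /* coalesced load */
    if (threadIdx.x == 0 && i > 0)
        shmu[0] = u[i-1];                        /* left halo cell */
    if (threadIdx.x == blockDim.x - 1 && i < n - 1)
        shmu[BLOCKSIZE + 1] = u[i+1];            /* right halo cell */

    __syncthreads();

    if (i > 0 && i < n - 1)
        d2u[i] = (shmu[j-1] - 2.0f*shmu[j] + shmu[j+1]) / (h*h);
}

The two halo loads are not themselves coalesced, but they cost only two extra transactions per block rather than two per thread, and each interior value of u is now read from global memory just once per block.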