The title says it all; I have a working 2-D diffusion code, but it runs very slowly and also very sporadically (i.e. putting in more than a few hundred cells returns non-answers). This is essentially a re-post of an earlier topic I posted.
The basic idea is that your loads and stores from global memory within a half warp of threads (16 threads on all current CUDA hardware) occur in an orderly, sequential fashion using a global memory which is aligned on 32,64 or 128 byte boundaries, depending on the word size the kernel is loading or storing. This is nicely illustrated in Chapter 5 of the programming guide.
For a finite difference calculation like this using global memory variables:
d2u[i] = (u[i-1] -2*u[i] + u[i+1])/(h*h)
the loads cannot be coalesced. The usual solution is to load the values of u from global memory into local shared memory first, and then perform the calculation using the shared memory variable instead:
__shared__ shmu[];
.....
shmu[j] = u[i];
__syncthreads();
....
d2u[i] = (shmu[j-1] -2*shmu[j] + shmu[j+1])/(h*h)
There are plenty of examples of this sort of approach floating around in tutorial material on the 'net and in a couple of the SDK examples.