I’ve been running a very simple physics code recently (2D diffusion equation) and testing speedup of GPU code vs. CPU code. In particular, I’ve looked at kernels that do coalesced accesses but no shared memory, kernels that use shared memory, and for comparison a kernel that does non-coalesced memory accesses.

My coalesced kernel is:

```
__global__ void kernel(float *ph1, float *ph2)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int n = y * SIZE + x; // row-major: consecutive threads in x hit consecutive addresses

    // b == 0 on the domain edges (x or y equal to 0 or SIZE-1), 1 in the interior
    int b = (x % (SIZE - 1) == 0 || y % (SIZE - 1) == 0 ? 0 : 1);

    if (b) // bulk calculation: explicit five-point diffusion stencil
    {
        ph2[n] = ph1[n] * (1. - 4. * LAMBDA)
               + LAMBDA * (ph1[n + SIZE] + ph1[n - SIZE])
               + LAMBDA * (ph1[n + 1] + ph1[n - 1]);
    }
    else if (x == SIZE - 1) // fixed profile on the right boundary
    {
        ph2[n] = float(y * (SIZE - y)) / float(SIZE * SIZE);
    }
    else ph2[n] = 0.; // remaining boundaries held at zero
}
```

To remove memory coalescing, I simply swap the positions of x and y in the definition of n. My shared-memory kernel can be found in the following post: Previous post
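Concretely, the only change in the non-coalesced variant is the index line (the kernel name here is just illustrative):

```
__global__ void kernel_noncoalesced(float *ph1, float *ph2)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int n = x * SIZE + y; // swapped: consecutive threads now stride by SIZE floats,
                          // so a half-warp touches 16 separate memory segments
    // ... body identical to the coalesced kernel above ...
}
```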

These computations are run on a Tesla C1060 over a square domain. The speedup results are shown in the attachment: green is with shared memory, red is with coalesced global memory, and orange is with non-coalesced global memory.
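For reference, the timings are taken along these lines with CUDA event timers (an illustrative sketch, not the exact benchmark code; block shape and buffer names are placeholders):

```
// Time `steps` iterations of the kernel with cudaEvent timers,
// ping-ponging the two device buffers between iterations.
float time_kernel(float *d_ph1, float *d_ph2, int steps)
{
    dim3 block(16, 16);                          // assumed block shape
    dim3 grid(SIZE / block.x, SIZE / block.y);   // assumes SIZE divisible by 16

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < steps; ++i) {
        kernel<<<grid, block>>>(d_ph1, d_ph2);
        float *tmp = d_ph1; d_ph1 = d_ph2; d_ph2 = tmp; // swap old/new fields
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for all launched kernels to finish

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The CPU reference is timed the same way over the same number of steps, and speedup is the ratio of the two times.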

I don’t understand the non-coalesced result. However inefficient the memory accesses may be in the non-coalesced case, both the computational cost and the memory cost should scale linearly with system size once the SM occupancy limit is reached, so the speedup should converge to a fixed value rather than keep changing. Does anyone know what could be happening here?