Strange speedup results for non-coalesced memory code

I’ve been running a very simple physics code recently (2D diffusion equation) and testing speedup of GPU code vs. CPU code. In particular, I’ve looked at kernels that do coalesced accesses but no shared memory, kernels that use shared memory, and for comparison a kernel that does non-coalesced memory accesses.

My coalesced kernel is

__global__ void kernel(float *ph1, float *ph2)


	int x = blockIdx.x * blockDim.x + threadIdx.x;

	int y = blockIdx.y * blockDim.y + threadIdx.y;

	int n = y * SIZE + x;

	int b = (x % (SIZE-1) == 0 || y % (SIZE-1) == 0 ? 0 : 1);

	if(b)	//bulk calculation


		ph2[n] = ph1[n] * (1. - 4. * LAMBDA)

				+ LAMBDA * (ph1[n+SIZE] + ph1[n-SIZE])

				+ LAMBDA * (ph1[n+1] + ph1[n-1]);


	else if(x == SIZE-1)  //boundaries


		ph2[n] = float(y * (SIZE - y)) / float(SIZE * SIZE);


	else ph2[n] = 0.;


To remove memory coalescing from this, I simply switch the positions of x and y in the definition of n. My shared memory kernel can be found in the following post: Previous post

These computations are done on a Tesla C1060 for a square domain. The speedup results are shown in the attachment – Green is with shared memory, red is with coalesced global memory, and orange is for non-coalesced global memory.

I don’t understand the non-coalesced result. However inefficient the memory accesses may be in the non-coalesced case, computational cost and memory cost should scale linearly with system size once the SM occupancy limit is reached, hence the speedup should converge to a fixed value. Does anyone know what could be happening here?

Your app is memory-bound. And non-coalesced version is 16x slower than coalesced one in C1060.

Also I think that you try n is multiple of 256, then you have partition camping problem in non-coalesced version.

Add boundary condition and try n = odd or prime (no multiple of 256).

Thanks! I’d never heard of this partition camping phenomena before. By n I assume you mean block size and not my n - my n takes on every value up to the total thread number.

You can change shape of thread block. If you have a Fermi card, I think it is good to check if this is an effect of partition camping,

because Fermi card can alleviate partition camping.

Check SDK/transpose/doc to know partition camping.