Disappointing shared memory performance

I’ve been programming with CUDA for a few years now, and from the beginning I was careful about making sure my global memory reads were coalesced and using shared memory whenever possible. Now I’m preparing a talk on GPU programming, and I coded up a simple 2D diffusion solver on a Tesla C1060 so I could give some performance results. My shared memory implementation isn’t performing as well as I think it should, and I wonder if people could either point out what I’m doing wrong, or explain why I should expect what I’m seeing.

Here are the numbers: for 500000 timesteps on a 1600x1600 domain, I get an average kernel execution time of 0.060 ms with no shared memory and 0.047 ms with shared memory. I've tried several different thread arrangements for loading shared memory, and the numbers quoted here are from the best-performing version - there are worse ones.

Presumably, one of the things slowing me down is loading the shared memory ghost cells for each block, since many of them can't be loaded in a coalesced fashion. The fastest method I've found for my block size (16x16, with an 18x18 shared memory array) is to load the center (8 warps, 16 reads), load the top and bottom rows (1 warp, 2 reads), load the sides using a mapping of threadIdx.x (1 warp, 32 reads), and ignore the corners.

This is a total of 50 reads. For comparison, the global-memory-only version issues at least 5 x 16 = 80 reads (five reads for each of the block's 16 half-warps), with possibly a few more because of backwards reads across a 128 byte boundary. Since I expect memory reads to dominate the time taken, I expect shared memory to provide a speedup of about 80/50 = 1.6, whereas the actual speedup is only 0.060/0.047 = 1.28.
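For reference, the global-memory-only version is essentially the same five-point stencil reading straight from ph1 - roughly like this (a sketch, not my exact code, with boundary handling omitted; it works the same way as in the shared memory kernel):

```cuda
// Sketch of the global-memory-only comparison kernel (interior points only;
// boundary handling omitted). Five global reads per thread: center plus
// four neighbours.
__global__ void kernel_global(float *ph1, float *ph2)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int n = y * SIZE + x;

	if (x > 0 && x < SIZE - 1 && y > 0 && y < SIZE - 1)
	{
		ph2[n] = ph1[n] * (1.f - 4.f * LAMBDA)
		       + LAMBDA * (ph1[n + SIZE] + ph1[n - SIZE])
		       + LAMBDA * (ph1[n + 1] + ph1[n - 1]);
	}
}
```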

Here is my kernel:

__global__ void kernel(float *ph1, float *ph2)
{
	__shared__ float ps[BLOCK_SIZE + 2][BLOCK_SIZE + 2];

	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int n = y * SIZE + x;
	int b = (x % (SIZE-1) == 0 || y % (SIZE-1) == 0 ? 0 : 1);

	//define block-specific variables
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int temp;

	//fill shared memory with appropriate variables
	ps[ty+1][tx+1] = ph1[n];
	if(ty < 2)
	{
		temp = ty * BLOCK_SIZE;
		ps[temp + ty][tx+1] = ph1[n + temp * SIZE - SIZE];
		ps[tx + 1][temp + ty] = ph1[n + temp + (tx - ty) * (SIZE - 1) - 1];
	}
	__syncthreads();	//make sure the whole tile is loaded before reading it

	float phi_temp;
	if(b)	//bulk calculation
	{
		phi_temp = ps[ty+1][tx+1] * (1. - 4. * LAMBDA)
				+ LAMBDA * (ps[ty+2][tx+1] + ps[ty][tx+1])
				+ LAMBDA * (ps[ty+1][tx+2] + ps[ty+1][tx]);
	}
	else if(x == SIZE-1)	//boundaries
	{
		phi_temp = float(y * (SIZE - y)) / float(SIZE * SIZE);
	}
	else phi_temp = 0.;

	ph2[n] = phi_temp;
}


Any comments? I’ve tried many different thread patterns for loading shared memory, and this is the fastest I’ve found. In addition, I’ve tried setting my number of blocks to (1600 / (16 - 2))^2 while keeping the block size at 16x16, so that the 16x16 shared memory array is loaded by 16x16 threads, and only the inner 14x14 threads do the actual calculation. This allows me to load shared memory in one fell swoop and avoid non-coalesced ghost cell loading, but the increased number of blocks required to cover the whole domain results in an increase in run time over the version above.
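To be concrete, the overlapping-block variant I mean looks roughly like this (a sketch from memory, not my exact code; kernel_overlap is just a name for this post):

```cuda
// Sketch of the overlapping-block variant: every thread loads exactly one
// element, blocks overlap by 2 in each direction, and only the inner
// 14x14 threads of each block produce output. Boundary handling omitted.
__global__ void kernel_overlap(float *ph1, float *ph2)
{
	__shared__ float ps[BLOCK_SIZE][BLOCK_SIZE];

	int tx = threadIdx.x;
	int ty = threadIdx.y;
	// blocks advance by BLOCK_SIZE - 2, so neighbouring blocks share a halo
	int x = blockIdx.x * (blockDim.x - 2) + tx;
	int y = blockIdx.y * (blockDim.y - 2) + ty;
	int n = y * SIZE + x;

	if (x < SIZE && y < SIZE)	// guard the overhang of the last blocks
		ps[ty][tx] = ph1[n];	// one fully coalesced load per thread
	__syncthreads();

	// only interior threads of the block write a result
	if (tx > 0 && tx < BLOCK_SIZE - 1 && ty > 0 && ty < BLOCK_SIZE - 1
	    && x < SIZE - 1 && y < SIZE - 1)
	{
		ph2[n] = ps[ty][tx] * (1.f - 4.f * LAMBDA)
		       + LAMBDA * (ps[ty+1][tx] + ps[ty-1][tx])
		       + LAMBDA * (ps[ty][tx+1] + ps[ty][tx-1]);
	}
}
```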

    You could reshape the blocks to e.g. 32x8 threads to reduce the expensive uncoalesced reads at the borders.

    If you have enough shared memory, you could use 32x16 threads per block.

    If that leaves enough blocks to fully utilize all SMs and if you can spare some shared memory, blocks could loop over x to process whole lines. This could eliminate the uncoalesced reads completely.

    If you process a single timestep per kernel invocation, you could use a texture to access ph1.

Thanks tera. Those are interesting ideas, but I tested the first two and they didn’t result in better performance.

The result for the first idea is surprising - the number of non-coalesced memory reads should have been halved, but the average kernel time was 0.048 ms (about the same as before). This suggests that my assumption about non-coalesced ghost cell loads being the dominant factor is incorrect.

With the second idea, each SM can only hold one block, and performance is twice as bad as before. Perhaps this means that with only one block per SM there aren’t enough warps for the scheduler to hide memory latency, whereas in the original configuration a lot of latency was being hidden.

The third idea is good, but I don’t want to restrict my x-dimension to 512.

As for the fourth idea, I’ve always managed to avoid using textures until now. Maybe it’s time I learned to do so :-)
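From reading the programming guide, I believe the texture version would look something like this on compute 1.x hardware (untested sketch; tex_ph1 and kernel_tex are names I made up):

```cuda
// Sketch of the texture idea: bind ph1 to a 1D texture reference and fetch
// through the texture cache instead of reading global memory directly.
texture<float, 1, cudaReadModeElementType> tex_ph1;

__global__ void kernel_tex(float *ph2)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int n = y * SIZE + x;

	if (x > 0 && x < SIZE - 1 && y > 0 && y < SIZE - 1)
	{
		ph2[n] = tex1Dfetch(tex_ph1, n) * (1.f - 4.f * LAMBDA)
		       + LAMBDA * (tex1Dfetch(tex_ph1, n + SIZE) + tex1Dfetch(tex_ph1, n - SIZE))
		       + LAMBDA * (tex1Dfetch(tex_ph1, n + 1) + tex1Dfetch(tex_ph1, n - 1));
	}
}

// host side, before launching:
//   cudaBindTexture(0, tex_ph1, d_ph1, SIZE * SIZE * sizeof(float));
// the binding (or the ping-pong buffers) has to be swapped each timestep,
// which is presumably why tera suggested it for one timestep per invocation
```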

I implemented a new version of the kernel with shared memory allocated as a 1D array. At first I thought that there would be 16-way bank conflicts on the vertical ghost cells for 2D and only 2-way bank conflicts for 1D. Now, however, I’m pretty sure that it’s 2-way for both 1D and 2D.

The 1D shared memory version is still faster. The kernel execution time is now 0.042 ms (compared to 0.047 ms before), for a speedup over the no-shared-memory result of 1.43 (as opposed to 1.28). Perhaps this is because the 2D indexing generates extra address arithmetic - shared memory may be fast, but the indexing around it isn’t free.
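The change only affects the indexing - the tile is flattened by hand. Roughly (a sketch of just the affected lines, with PS_W as a name I introduced here for the padded row width):

```cuda
// 1D shared memory indexing sketch: same padded tile, flattened by hand.
#define PS_W (BLOCK_SIZE + 2)	// padded row width

__shared__ float ps[PS_W * PS_W];

int s = (ty + 1) * PS_W + (tx + 1);	// this thread's center element

ps[s] = ph1[n];		// center load, same as before
__syncthreads();

// five-point stencil on the flat array
phi_temp = ps[s] * (1.f - 4.f * LAMBDA)
		+ LAMBDA * (ps[s + PS_W] + ps[s - PS_W])
		+ LAMBDA * (ps[s + 1] + ps[s - 1]);
```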