I’ve been programming with CUDA for a few years now, and from the beginning I was careful about making sure my global memory reads were coalesced and using shared memory whenever possible. Now I’m preparing a talk on GPU programming, and I coded up a simple 2D diffusion solver on a Tesla C1060 so I could give some performance results. My shared memory implementation isn’t performing as well as I think it should, and I wonder if people could either point out what I’m doing wrong, or explain why I should expect what I’m seeing.
Here are the numbers: for 500,000 timesteps on a 1600x1600 domain, I get an average kernel execution time of 0.060 ms with no shared memory, and 0.047 ms with shared memory. Moreover, I've tried several different thread arrangements for loading data into shared memory, and the result I quote here is the best-performing version - there are worse ones.
Presumably, one of the things slowing me down is loading the ghost cells of each block into shared memory, since many of these loads can't be coalesced. The fastest method I've found for my block size (16x16, with an 18x18 shared memory array) is to load the center (8 warps, 16 reads), load the top and bottom rows (1 warp, 2 reads), load the side columns using a mapping of threadIdx.x (1 warp, 32 reads), and ignore the corners.
This is a total of 50 reads per block. For comparison, the global-memory-only version needs at least 5 x 16 = 80 reads per block (five reads from each of the block's 16 half-warps), possibly a few more because of backwards reads across a 128-byte boundary. Since I expect memory reads to dominate the run time, I'd expect shared memory to give a speedup of 80/50 = 1.6, whereas the actual speedup is 0.060/0.047 = 1.28.
Here is my kernel:
__global__ void kernel(float *ph1, float *ph2)
{
    __shared__ float ps[BLOCK_SIZE + 2][BLOCK_SIZE + 2];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int n = y * SIZE + x;   // linear index into the global arrays
    // b is 0 on the domain boundary, 1 in the bulk
    int b = (x % (SIZE - 1) == 0 || y % (SIZE - 1) == 0 ? 0 : 1);

    // define block-specific variables
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int temp;

    // fill shared memory: each thread loads its own cell into the interior
    ps[ty + 1][tx + 1] = ph1[n];

    // the first warp (rows ty = 0 and ty = 1) also loads the ghost cells
    if (ty < 2)
    {
        temp = ty * BLOCK_SIZE;   // 0 for ty == 0, BLOCK_SIZE for ty == 1
        // top (ty == 0) and bottom (ty == 1) ghost rows -- coalesced
        ps[temp + ty][tx + 1] = ph1[n + temp * SIZE - SIZE];
        // left (ty == 0) and right (ty == 1) ghost columns, with tx mapped
        // to the row index -- one transaction per thread, non-coalesced
        // (out of range at the domain edges, but those values are never used)
        ps[tx + 1][temp + ty] = ph1[n + temp + (tx - ty) * (SIZE - 1) - 1];
    }
    __syncthreads();

    float phi_temp;
    if (b)   // bulk calculation: 5-point diffusion stencil
    {
        phi_temp = ps[ty + 1][tx + 1] * (1.f - 4.f * LAMBDA)
                 + LAMBDA * (ps[ty + 2][tx + 1] + ps[ty][tx + 1])
                 + LAMBDA * (ps[ty + 1][tx + 2] + ps[ty + 1][tx]);
    }
    else if (x == SIZE - 1)   // fixed profile on the right boundary
    {
        phi_temp = float(y * (SIZE - y)) / float(SIZE * SIZE);
    }
    else phi_temp = 0.f;   // other boundaries held at zero

    ph2[n] = phi_temp;
}
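(When I was debugging the indexing, I found it useful to compare against a plain host-side version of the same update. Something like the following sketch - with a small illustrative SIZE and an arbitrary LAMBDA, not my actual values - computes what one kernel launch should produce, and can be used to validate the GPU output:)

```c
#include <stdio.h>

#define SIZE 16        /* small test domain; the actual runs use 1600 */
#define LAMBDA 0.2f    /* illustrative value only */

/* One explicit diffusion timestep: same 5-point stencil and boundary
   conditions as the kernel, but computed serially on the host. */
void step_host(const float *ph1, float *ph2)
{
    for (int y = 0; y < SIZE; y++) {
        for (int x = 0; x < SIZE; x++) {
            int n = y * SIZE + x;
            if (x > 0 && x < SIZE - 1 && y > 0 && y < SIZE - 1) {
                /* bulk: 5-point stencil */
                ph2[n] = ph1[n] * (1.f - 4.f * LAMBDA)
                       + LAMBDA * (ph1[n + SIZE] + ph1[n - SIZE])
                       + LAMBDA * (ph1[n + 1] + ph1[n - 1]);
            } else if (x == SIZE - 1) {
                /* fixed profile on the right edge */
                ph2[n] = (float)(y * (SIZE - y)) / (float)(SIZE * SIZE);
            } else {
                ph2[n] = 0.f;   /* other edges held at zero */
            }
        }
    }
}
```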
Any comments? I've tried many different thread patterns for loading shared memory, and this is the fastest I've found. I've also tried a grid of (1600 / (16 - 2))^2 blocks while keeping the block size at 16x16, so that each 16x16 shared memory tile is loaded by the full 16x16 threads and only the inner 14x14 threads do the actual calculation. That lets me load shared memory in one fell swoop and avoid the non-coalesced ghost-cell loads, but the extra blocks needed to cover the whole domain make it slower than the version above.
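(For what it's worth, here is the host-side sanity check I used for that overlapping-tile variant. This is my reconstruction of the indexing - each 16x16 block offset by blockIdx * (16 - 2), inner 14x14 threads computing - not the exact kernel code; it verifies that every interior point gets written exactly once:)

```c
#include <string.h>

#define SIZE 1600
#define BLOCK 16
#define HALO 1                    /* one ghost cell on each side */
#define TILE (BLOCK - 2 * HALO)   /* 14x14 interior computed per block */

/* Returns 1 if the overlapping-tile scheme writes every interior
   point exactly once, 0 otherwise. */
int interior_covered_once(void)
{
    static unsigned char covered[SIZE][SIZE];
    memset(covered, 0, sizeof covered);

    int nblocks = (SIZE - 2 + TILE - 1) / TILE;   /* blocks per dimension */
    for (int by = 0; by < nblocks; by++)
        for (int bx = 0; bx < nblocks; bx++)
            for (int ty = 0; ty < BLOCK; ty++)
                for (int tx = 0; tx < BLOCK; tx++) {
                    /* global coordinates of the cell this thread loads */
                    int x = bx * TILE + tx;
                    int y = by * TILE + ty;
                    /* only the inner 14x14 threads compute a result */
                    if (tx >= HALO && tx < BLOCK - HALO &&
                        ty >= HALO && ty < BLOCK - HALO &&
                        x < SIZE - 1 && y < SIZE - 1)
                        covered[y][x]++;
                }

    for (int y = 1; y < SIZE - 1; y++)
        for (int x = 1; x < SIZE - 1; x++)
            if (covered[y][x] != 1)
                return 0;
    return 1;
}
```

(The coverage check passes, so the slowdown really does come from launching ~(115/100)^2 times as many blocks, not from tiles being skipped or double-computed.)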