Can anyone see any obvious memory coalesence and or bank conflict problems with this kernel? (block size is 16x16)
global void conv_mean_tex_sharedmem_f(float* d_data, int d_pitch, int width, int height)
{
//x y adress within the image corresponding to the thread
const int ix = IMUL(blockDim.x-2, blockIdx.x) + threadIdx.x - 1;
const int iy = IMUL(blockDim.y-2, blockIdx.y) + threadIdx.y - 1;
//texture coord into the image
const float itx = (float)ix + 0.5f;
const float ity = (float)iy + 0.5f;
//shared memory for the entire image bloack and the apron
__shared__ float shared_data[BLOCK_SIZE * BLOCK_SIZE];
//just for clarity
const int spitch = blockDim.x;
const int sx = threadIdx.x;
const int sy = threadIdx.y;
//load into shared memory
shared_data[IDX(spitch,sx,sy)] = tex2D(texData,itx,ity);
__syncthreads();
if((threadIdx.x == 0) || (threadIdx.x == blockDim.x-1) ||
(threadIdx.y == 0) || (threadIdx.y == blockDim.y-1) ||
(ix >= width) || (iy >= height))
return;
d_data[d_pitch*iy+ix] = (shared_data[IDX(spitch,sx-1,sy-1)] +
shared_data[IDX(spitch,sx ,sy-1)] +
shared_data[IDX(spitch,sx+1,sy-1)] +
shared_data[IDX(spitch,sx-1,sy )] +
shared_data[IDX(spitch,sx ,sy )] +
shared_data[IDX(spitch,sx+1,sy )] +
shared_data[IDX(spitch,sx-1,sy+1)] +
shared_data[IDX(spitch,sx ,sy+1)] +
shared_data[IDX(spitch,sx+1,sy+1)])/9.0f;
The reason i ask is because i get faster results without shared memory, just relying on the texture cache:
global void conv_mean_tex_f(float* d_data, int d_pitch, int width, int height)
{
const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
const float x = (float)ix + 0.5f;
const float y = (float)iy + 0.5f;
Maybe it is because device memory reads through texture fetching are cached as described in section 5.4. You might not need the shared memory for caching your data ?
Try using a 2D smem array. Code will be easier to read, and compiler does a good job with indexing.
Also, not that your boundary condition is more complicated in the non-texture code. If the time difference is small, see if this is the additional cost.