Shared memory problem

Can anyone see any obvious memory coalesence and or bank conflict problems with this kernel? (block size is 16x16)

global void conv_mean_tex_sharedmem_f(float* d_data, int d_pitch, int width, int height)
{
//x y adress within the image corresponding to the thread
const int ix = IMUL(blockDim.x-2, blockIdx.x) + threadIdx.x - 1;
const int iy = IMUL(blockDim.y-2, blockIdx.y) + threadIdx.y - 1;
//texture coord into the image
const float itx = (float)ix + 0.5f;
const float ity = (float)iy + 0.5f;

//shared memory for the entire image bloack and the apron
__shared__ float shared_data[BLOCK_SIZE * BLOCK_SIZE];

//just for clarity
const int spitch = blockDim.x;
const int sx = threadIdx.x;
const int sy = threadIdx.y;

//load into shared memory
shared_data[IDX(spitch,sx,sy)] = tex2D(texData,itx,ity);

__syncthreads();

if((threadIdx.x == 0) || (threadIdx.x == blockDim.x-1) || 
   (threadIdx.y == 0) || (threadIdx.y == blockDim.y-1) ||
   (ix >= width) || (iy >= height))
   return;

d_data[d_pitch*iy+ix] = (shared_data[IDX(spitch,sx-1,sy-1)] + 
	                     shared_data[IDX(spitch,sx  ,sy-1)] + 
					     shared_data[IDX(spitch,sx+1,sy-1)] +
						 shared_data[IDX(spitch,sx-1,sy  )] + 
	                     shared_data[IDX(spitch,sx  ,sy  )] + 
					     shared_data[IDX(spitch,sx+1,sy  )] +
					     shared_data[IDX(spitch,sx-1,sy+1)] +
	                     shared_data[IDX(spitch,sx  ,sy+1)] +
					     shared_data[IDX(spitch,sx+1,sy+1)])/9.0f;

}

The reason i ask is because i get faster results without shared memory, just relying on the texture cache:

global void conv_mean_tex_f(float* d_data, int d_pitch, int width, int height)
{
const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
const float x = (float)ix + 0.5f;
const float y = (float)iy + 0.5f;

if(ix >= width || iy >= height)
	return;

d_data[d_pitch*iy+ix] = (tex2D(texData,x-1,y-1) +
						 tex2D(texData,x  ,y-1) +
						 tex2D(texData,x+1,y-1) +
						 tex2D(texData,x-1,y  ) +
						 tex2D(texData,x  ,y  ) +
						 tex2D(texData,x+1,y+1) +
						 tex2D(texData,x  ,y+1) + 
						 tex2D(texData,x-1,y+1))/9.0f;

}

Maybe it is because device memory reads through texture fetching are cached as described in section 5.4. You might not need the shared memory for caching your data ?

What are the performance numbers?

Try using a 2D smem array. Code will be easier to read, and compiler does a good job with indexing.

Also, not that your boundary condition is more complicated in the non-texture code. If the time difference is small, see if this is the additional cost.

Paulius