Need help coding a convolution


I have written a convolution kernel that uses shared memory in order to get good performance.

The problem is that it’s not working as I would like…

Here is the code I’ve written:

In the main function, I have declared my grid and threads like this:

dim3 threads(16, 32, 1);

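// Each block yields only (threads - 2*radius) output pixels per dimension, so round up with integer arithmetic
dim3 grid( (w + threads.x - 2*radius - 1) / (threads.x - 2*radius), (h + threads.y - 2*radius - 1) / (threads.y - 2*radius), 1);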
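As a quick sanity check on the launch configuration, here is a minimal host-side sketch of the tile arithmetic in plain C++. The helper name `blocksNeeded` is hypothetical (it is not part of my code); it assumes each block only produces `threads - 2*radius` output pixels per dimension, because the outer `radius` threads on each side only load the halo.

```cpp
#include <cassert>

// Hypothetical helper (not from the original code): number of blocks needed
// along one dimension when a block of `threads` threads yields only
// (threads - 2*radius) output pixels.
int blocksNeeded(int size, int threads, int radius)
{
    const int tile = threads - 2 * radius;  // useful output pixels per block
    assert(tile > 0);                       // the block must be wider than the halo
    return (size + tile - 1) / tile;        // integer ceiling division
}
```

For the values in this post, `blocksNeeded(352, 16, 5)` gives 59 and `blocksNeeded(288, 32, 5)` gives 14. Casting to float only after an integer division has already truncated does not round up, which is why the ceiling is done in integer arithmetic here.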

Now here is the kernel:

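__global__ void Conv(float *Image, int Radius, int width, int height)
{
   // One tile per block; the +1 pads each row to avoid shared-memory bank conflicts
   __shared__ float LocalMem[16][32+1];

   // Blocks overlap by 2*Radius so that every output pixel has its full neighbourhood in the tile.
   // Cast before subtracting Radius so halo coordinates left of / above the picture
   // come out negative instead of wrapping around as unsigned values.
   int xIndex = (int)(blockIdx.x * (blockDim.x - 2*Radius) + threadIdx.x) - Radius;
   int yIndex = (int)(blockIdx.y * (blockDim.y - 2*Radius) + threadIdx.y) - Radius;

   bool testIndex = (xIndex < width) && (yIndex < height);
   const int Sz = 2*Radius+1;
   const int SzSz = Sz*Sz;
   float res = 0;

   // The picture stored in texture memory is copied into shared memory
   LocalMem[threadIdx.x][threadIdx.y] = tex2D( TexRef, xIndex, yIndex );

   // Wait until the whole tile is loaded before any thread reads its neighbours
   __syncthreads();

   // Only the interior threads (not the halo) that fall inside the picture compute an output
   if(testIndex && threadIdx.x >= Radius && threadIdx.x <= (blockDim.x-1-Radius) && threadIdx.y >= Radius && threadIdx.y <= (blockDim.y-1-Radius))
   {
      // Calculation of the convolution
      for(int i=threadIdx.x; i<threadIdx.x+Sz; i++)
      {
         for(int j=threadIdx.y; j<threadIdx.y+Sz; j++)
         {
            res += LocalMem[i-Radius][j-Radius];
         }
      }

      // Storage of the result in the output picture
      Image[ xIndex + yIndex*width ] = res / SzSz;
   }
}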
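One way to check the index arithmetic without a GPU is to replay it on the host. The sketch below is plain C++, not CUDA; the function name `countWrites` and its loop structure are assumptions made for illustration. It walks over every block and thread, applies the same xIndex/yIndex formulas and the same interior test as the kernel, adds a bounds check against the picture size, and counts how many times each output pixel would be written.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side replay of the kernel's indexing (illustration only).
// bx, by are the block dimensions. Returns, for each pixel, how many threads
// would write to it.
std::vector<int> countWrites(int width, int height, int bx, int by, int radius)
{
    const int tileX = bx - 2 * radius;              // output pixels per block in x
    const int tileY = by - 2 * radius;              // output pixels per block in y
    assert(tileX > 0 && tileY > 0);
    const int gridX = (width + tileX - 1) / tileX;  // ceiling division
    const int gridY = (height + tileY - 1) / tileY;

    std::vector<int> writes(static_cast<std::size_t>(width) * height, 0);
    for (int blockY = 0; blockY < gridY; ++blockY)
        for (int blockX = 0; blockX < gridX; ++blockX)
            for (int ty = 0; ty < by; ++ty)
                for (int tx = 0; tx < bx; ++tx) {
                    // Same formulas as xIndex / yIndex in the kernel
                    const int x = blockX * tileX + tx - radius;
                    const int y = blockY * tileY + ty - radius;
                    // Interior (non-halo) threads only, as in the kernel's if()
                    const bool interior = tx >= radius && tx <= bx - 1 - radius &&
                                          ty >= radius && ty <= by - 1 - radius;
                    // Interior threads always have x >= 0 and y >= 0
                    const bool inside = x < width && y < height;
                    if (interior && inside)
                        ++writes[x + y * width];
                }
    return writes;
}
```

With the values from this post (width 352, height 288, a 16x32 block, radius 5), every entry of the returned vector is 1: each pixel is written exactly once. Dropping the `inside` test in this replay triggers out-of-range writes, because 59 blocks of 6 pixels cover 354 columns and 14 blocks of 22 pixels cover 308 rows.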



The picture I use is 352*288 pixels and the Radius is 5.
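For a picture this small, a CPU reference is cheap to compute and gives something to compare the kernel output against. The sketch below is plain C++ and is only an illustration: `boxFilterReference` is a hypothetical name, and it assumes that reads outside the picture are clamped to the nearest edge pixel. My post does not say which addressing mode TexRef uses, so that part may need adjusting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: mean filter over a (2*radius+1)^2 window.
// ASSUMPTION: out-of-range reads are clamped to the nearest edge pixel;
// the texture addressing mode actually used by TexRef is not stated.
std::vector<float> boxFilterReference(const std::vector<float>& in,
                                      int width, int height, int radius)
{
    assert(in.size() == static_cast<std::size_t>(width) * height);
    const int sz = 2 * radius + 1;
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    // Clamp the sample position to the picture (assumed behaviour)
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    const int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += in[sx + sy * width];
                }
            }
            out[x + y * width] = sum / (sz * sz);  // same normalisation as res / SzSz
        }
    }
    return out;
}
```

Comparing the kernel's output against such a reference, pixel by pixel with a small tolerance, can help separate an indexing bug from a hardware problem.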

I can’t figure out what is wrong with my code. If anyone could help me with this convolution, that would be great! :)

Thanks in advance!

EDIT : It seems to work properly with a square picture (512*512)… How come?

OK, I have the answer… The code works fine, but the 8800Ultra is dead! It shows a lot of artifacts in the ATItool scan, while the code works well on an FX1700 and an 8800GT.