Need help coding a convolution

Hello!

I have written a convolution kernel that uses shared memory to get good performance.

The problem is that it’s not working as I would like…

Here is the code I've written:

In the main function, I have declared my grid and threads like this:

dim3 threads(16, 32, 1);

dim3 grid( (int)ceil((float)w / (threads.x - 2*radius)), (int)ceil((float)h / (threads.y - 2*radius)), 1);
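The same block count can also be computed with integer arithmetic only. This is a minimal host-side sketch, not taken from my program: the helper names `ceilDiv` and `blocksFor` are hypothetical, and it assumes that each block of threads produces only (threads - 2*radius) output pixels per dimension, with the remaining threads loading the border.

```cpp
// Hypothetical helper: integer ceiling division, valid for a >= 0 and b > 0.
constexpr int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// Hypothetical helper: number of blocks needed along one dimension when a
// block of `threadsPerDim` threads yields (threadsPerDim - 2*radius) output
// pixels. Assumes threadsPerDim > 2*radius, otherwise no pixel is produced.
constexpr int blocksFor(int size, int threadsPerDim, int radius) {
    return ceilDiv(size, threadsPerDim - 2 * radius);
}
```

For example, a 352*288 picture with 16x32 threads and a radius of 5 needs 59 x 14 blocks. Those blocks cover 354 x 308 output positions, so some threads in the last blocks fall outside the picture and have to be masked out in the kernel.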

Now here is the kernel:

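__global__ void Conv(float *Image, int Radius, int width, int height)
{
   // One tile per block, sized for blockDim = (16, 32).
   // The +1 pads each row so that neighbouring threads hit different banks.
   __shared__ float LocalMem[16][32+1];

   // Each block produces (blockDim - 2*Radius) output pixels per dimension;
   // the outer Radius threads on each side only load the border.
   // The casts keep the arithmetic signed so that border indices can be negative.
   int xIndex = (int)blockIdx.x * ((int)blockDim.x - 2*Radius) + (int)threadIdx.x - Radius;
   int yIndex = (int)blockIdx.y * ((int)blockDim.y - 2*Radius) + (int)threadIdx.y - Radius;

   bool testIndex = (xIndex < width) && (yIndex < height);
   const int Sz = 2*Radius+1;
   const int SzSz = Sz*Sz;
   float res = 0;

   // The picture stored in texture memory is copied in shared memory.
   // TexRef is the 2D texture reference bound to the input picture (declared elsewhere).
   // Every thread loads, even outside the picture, so no shared memory entry is left
   // uninitialized; this assumes TexRef uses clamp addressing.
   LocalMem[threadIdx.x][threadIdx.y] = tex2D( TexRef, xIndex, yIndex );

   __syncthreads();

   // Only inner threads that map to a pixel inside the picture compute and store
   if(testIndex && threadIdx.x >= Radius && threadIdx.x <= (blockDim.x-1-Radius) && threadIdx.y >= Radius && threadIdx.y <= (blockDim.y-1-Radius))
   {
      // Calculation of the convolution (mean over a Sz x Sz window)
      for(int i=threadIdx.x; i<threadIdx.x+Sz; i++)
      {
         for(int j=threadIdx.y; j<threadIdx.y+Sz; j++)
         {
            res += LocalMem[i-Radius][j-Radius];
         }
      }

      // Storage of the result in the output picture
      Image[ xIndex + yIndex*width ] = res / SzSz;
   }
}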
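To check the kernel's output independently of the GPU, a CPU reference can help. This is a minimal sketch and not part of my original code: `boxFilterReference` is a hypothetical name, and it assumes a row-major float image and clamp-to-edge handling at the borders, which is what a texture with clamp addressing would return.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical CPU reference: mean over a (2*radius+1) x (2*radius+1) window.
// Coordinates outside the image are clamped to the nearest edge pixel.
std::vector<float> boxFilterReference(const std::vector<float>& in,
                                      int width, int height, int radius) {
    std::vector<float> out(in.size());
    const int sz = 2 * radius + 1;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    // Clamp the sample position into the valid image range
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    const int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += in[sy * width + sx];
                }
            }
            out[y * width + x] = sum / (sz * sz);
        }
    }
    return out;
}
```

Comparing this result pixel by pixel, within a small tolerance, against the array copied back from the device should make it easier to tell a logic error in the kernel from a problem with the card itself.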

The picture I use is 352*288 pixels and the Radius is 5.

I can't figure out what is wrong with my code. If anyone could kindly help me with this convolution, that would be great! :)

Thanks in advance!

EDIT : It seems to work properly with a square picture (512*512)… How come?

OK, I have the answer… The code works fine, but the 8800 Ultra is dead! It shows a lot of artifacts in the ATITool scan, and the code works well on an FX1700 and an 8800GT.