Hello!
I have written a convolution kernel that uses shared memory to get good performance.
The problem is that it does not work the way I would like…
Here is my code.
In the main function, I declare the grid and block dimensions like this:
dim3 threads(16, 32, 1);
dim3 grid( ceil((float)(w/(threads.x - 2*radius)))+1, ceil((float)(h/(threads.y - 2*radius)))+1, 1);
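A note on the grid computation above: `w/(threads.x - 2*radius)` is an integer division, so the value is already truncated before the cast to float, and `ceil` has no effect on it. The `+1` is what actually covers the remainder. A minimal host-side sketch of an integer ceiling division follows; `tilesNeeded` is a hypothetical helper name, not part of the original code, and it assumes the block dimension is larger than `2*radius`.

```cpp
// Hypothetical helper (not in the original post): number of blocks needed
// along one axis when each block produces (blockDim - 2*radius) output pixels.
// Assumes blockDim > 2*radius.
inline int tilesNeeded(int size, int blockDim, int radius)
{
    const int tile = blockDim - 2 * radius;   // outputs per block on this axis
    return (size + tile - 1) / tile;          // integer ceiling of size / tile
}
```

It could be used as `dim3 grid(tilesNeeded(w, threads.x, radius), tilesNeeded(h, threads.y, radius), 1);`. The two expressions differ only when the picture size is an exact multiple of the tile size, in which case the original one launches one extra block per axis.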
Now here is the kernel:
__global__ void Conv(float *Image, int Radius, int width, int height)
{
    __shared__ float LocalMem[16][32+1];

    long xIndex = blockIdx.x * (blockDim.x - 2*Radius) + threadIdx.x - Radius;
    long yIndex = blockIdx.y * (blockDim.y - 2*Radius) + threadIdx.y - Radius;
    bool testIndex = (xIndex < width) && (yIndex < height);
    const int Sz = 2*Radius+1;
    const int SzSz = Sz*Sz;
    float res = 0;

    if(testIndex)
    {
        // Copy the picture from texture memory into shared memory
        LocalMem[threadIdx.x][threadIdx.y] = tex2D( TexRef, xIndex, yIndex );
    }
    __syncthreads();

    if(threadIdx.x >= Radius && threadIdx.x <= (blockDim.x-1-Radius) && threadIdx.y >= Radius && threadIdx.y <= (blockDim.y-1-Radius))
    {
        // Compute the convolution (sum over a Sz x Sz window)
        for(int i=threadIdx.x; i<threadIdx.x+Sz; i++)
        {
            for(int j=threadIdx.y; j<threadIdx.y+Sz; j++)
            {
                res += LocalMem[i-Radius][j-Radius];
            }
        }
        // Store the result in the output picture
        Image[ xIndex + yIndex*width ] = res / SzSz;
    }
}
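To check the kernel's output, it can help to compare it against a plain CPU version of the same averaging filter. The sketch below is only a reference under stated assumptions: `boxFilterReference` is a hypothetical name, the input is a row-major `std::vector<float>`, and out-of-range reads are clamped to the picture border, which matches the kernel only if `TexRef` uses clamp addressing (the post does not say).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: mean over a (2*radius+1)^2 window,
// with reads outside the picture clamped to the nearest border pixel.
std::vector<float> boxFilterReference(const std::vector<float>& in,
                                      int width, int height, int radius)
{
    const int sz = 2 * radius + 1;
    std::vector<float> out(static_cast<std::size_t>(width) * height, 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    const int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += in[sy * width + sx];
                }
            }
            out[y * width + x] = sum / (sz * sz);
        }
    }
    return out;
}
```

Comparing this result pixel by pixel with the GPU output (within a small tolerance) should show which rows or columns of the picture are affected.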
The picture I use is 352*288 pixels and the Radius is 5.
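With these numbers, each block writes a tile of 6 by 22 outputs (16 - 2*5 by 32 - 2*5), and neither 352 nor 288 is a multiple of the tile size, so the last column and the last row of blocks extend past the picture. The host-side sketch below replays the kernel's index arithmetic and counts the writing threads whose `xIndex` or `yIndex` falls outside the picture. `countOutOfRangeWriters` is a hypothetical name, and the sketch only mirrors the index formulas from the kernel above; whether this explains the observed behaviour is a guess.

```cpp
// Hypothetical host-side check: replays xIndex/yIndex from the kernel for
// every thread that passes the "interior" test (the threads that write to
// Image) and counts those whose target lies outside the picture.
int countOutOfRangeWriters(int width, int height, int radius,
                           int blockX, int blockY, int gridX, int gridY)
{
    int count = 0;
    for (int by = 0; by < gridY; ++by) {
        for (int bx = 0; bx < gridX; ++bx) {
            for (int ty = radius; ty <= blockY - 1 - radius; ++ty) {
                for (int tx = radius; tx <= blockX - 1 - radius; ++tx) {
                    const long x = (long)bx * (blockX - 2 * radius) + tx - radius;
                    const long y = (long)by * (blockY - 2 * radius) + ty - radius;
                    if (x >= width || y >= height) {
                        ++count;   // this thread would write outside the picture
                    }
                }
            }
        }
    }
    return count;
}
```

For the 352*288 picture the grid expression in the post gives 59 by 14 blocks, and the count is non-zero. It is also non-zero for a 512*512 picture with its 86 by 24 grid, so on its own it does not explain why the square picture looks right.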
I can’t figure out what is wrong with my code. If anyone could kindly help me with this convolution, that would be great! :)
Thanks in advance!
EDIT: It seems to work properly with a square picture (512*512)… How come?