Texture memory does not improve performance

Hi!
I am using CUDA to accelerate a 2D convolution of an image (512x384) with a 5x5 tap kernel.
My kernel is shown below:


__global__ void kernel(unsigned char *img, int *wrkx_1d_gpu)
{
    // One thread per output pixel: hh is the column, gg is the row.
    int hh = threadIdx.x + blockIdx.x * blockDim.x;
    int gg = threadIdx.y + blockIdx.y * blockDim.y;

    // Skip the 2-pixel border so the 5x5 window stays inside the 512x384 image.
    if ((hh >= 2) && (hh < 510) && (gg >= 2) && (gg < 382))
    {
        int N = 512;  // image width in pixels (row pitch)
        wrkx_1d_gpu[gg*N + hh] =
            36 * (img[gg*N + hh+1] - img[gg*N + hh-1]) +
            18 * (img[(gg+1)*N + hh+1] + img[(gg-1)*N + hh+1] - img[(gg-1)*N + hh-1] - img[(gg+1)*N + hh-1]) +
            12 * (img[gg*N + hh+2] - img[gg*N + hh-2]) +
             6 * (img[(gg+1)*N + hh+2] + img[(gg-1)*N + hh+2] - img[(gg+1)*N + hh-2] - img[(gg-1)*N + hh-2]) +
             3 * (img[(gg+2)*N + hh+1] + img[(gg-2)*N + hh+1] - img[(gg+2)*N + hh-1] - img[(gg-2)*N + hh-1]) +
             1 * (img[(gg+2)*N + hh+2] + img[(gg-2)*N + hh+2] - img[(gg-2)*N + hh-2] - img[(gg+2)*N + hh-2]);
    }
}

In my kernel, each thread computes one output pixel.
The tap kernel I use is:

 -1   -3   0   +3   +1
 -6  -18   0  +18   +6
-12  -36   0  +36  +12
 -6  -18   0  +18   +6
 -1   -3   0   +3   +1
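
A side observation on these taps: every row appears to be a multiple of (-1 -3 0 +3 +1), with multipliers 1, 6, 12, 6, 1, which would make the 5x5 kernel separable into a vertical pass and a horizontal pass. The small host-side check below verifies that. The names kCol, kRow, kTaps and taps_are_separable are made up for this illustration and are not part of the original code.

```cuda
// Assumed 1D factors of the 5x5 taps (hypothetical names).
static const int kCol[5] = {1, 6, 12, 6, 1};    // vertical weights, one per row
static const int kRow[5] = {-1, -3, 0, 3, 1};   // horizontal weights, one per column

// The 5x5 taps exactly as listed in the post.
static const int kTaps[5][5] = {
    { -1,  -3, 0,  3,  1},
    { -6, -18, 0, 18,  6},
    {-12, -36, 0, 36, 12},
    { -6, -18, 0, 18,  6},
    { -1,  -3, 0,  3,  1},
};

// Returns true when kTaps[r][c] == kCol[r] * kRow[c] for every entry,
// i.e. when the 2D kernel is the outer product of the two 1D factors.
bool taps_are_separable()
{
    for (int r = 0; r < 5; ++r)
        for (int c = 0; c < 5; ++c)
            if (kTaps[r][c] != kCol[r] * kRow[c])
                return false;
    return true;
}
```

If the check holds, the same result could in principle be computed as two 1D passes (5 + 5 taps) instead of one 2D pass, though whether that is faster for an image this small would have to be measured.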

With this naive implementation, the kernel takes 0.3 ms to execute.
I have seen that texture memory is recommended for 2D images. However, when my kernel reads the input image from texture memory instead, the execution time is 0.5 ms. That is slower than the version without texture memory, and I don't know why.
Is it because my image is small? Or because the tap kernel is small?
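
The texture version is not shown in the post, so the following is only a sketch of what such a variant might look like, assuming the texture object API (cudaTextureObject_t) with point sampling and unnormalized coordinates. The names conv_tex, px and make_texture are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Fetch one pixel as int; +0.5f addresses the texel centre (point sampling).
__device__ inline int px(cudaTextureObject_t tex, int x, int y)
{
    return (int)tex2D<unsigned char>(tex, x + 0.5f, y + 0.5f);
}

// Same arithmetic as the global-memory kernel, but every read goes through the texture.
__global__ void conv_tex(cudaTextureObject_t tex, int *out, int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < 2 || x >= width - 2 || y < 2 || y >= height - 2) return;

    out[y * width + x] =
        36 * (px(tex, x+1, y) - px(tex, x-1, y)) +
        18 * (px(tex, x+1, y+1) + px(tex, x+1, y-1) - px(tex, x-1, y-1) - px(tex, x-1, y+1)) +
        12 * (px(tex, x+2, y) - px(tex, x-2, y)) +
         6 * (px(tex, x+2, y+1) + px(tex, x+2, y-1) - px(tex, x-2, y+1) - px(tex, x-2, y-1)) +
         3 * (px(tex, x+1, y+2) + px(tex, x+1, y-2) - px(tex, x-1, y+2) - px(tex, x-1, y-2)) +
         1 * (px(tex, x+2, y+2) + px(tex, x+2, y-2) - px(tex, x-2, y-2) - px(tex, x-2, y+2));
}

// Copies a host image into a CUDA array and wraps it in a texture object.
// The caller owns *arr_out and the returned texture object.
cudaTextureObject_t make_texture(const unsigned char *h_img, int width, int height,
                                 cudaArray_t *arr_out)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaMallocArray(arr_out, &desc, width, height);
    // Source pitch and row width are both `width` bytes (1 byte per pixel).
    cudaMemcpy2DToArray(*arr_out, 0, 0, h_img, width, width, height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeArray;
    res.res.array.array = *arr_out;

    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;        // no interpolation
    td.readMode = cudaReadModeElementType;      // return raw unsigned char values
    td.normalizedCoords = 0;                    // coordinates in pixels

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```

A launch such as conv_tex<<<dim3(512/16, 384/16), dim3(16, 16)>>>(tex, d_out, 512, 384) with the same block shape as the global-memory kernel would keep the comparison between the two variants like for like; the 16x16 block size is an assumption, since the post does not state the launch configuration.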

I would appreciate your answers and opinions on this. I would also like to know whether 0.3 ms is a good time for this kernel. Could I achieve a shorter execution time?
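
To put 0.3 ms in context, one rough yardstick is effective bandwidth: a 512x384 image with 1 byte read per input pixel and 4 bytes written per output pixel is 983,040 bytes, which at 0.3 ms works out to about 3.3 GB/s. The sketch below shows one way the timing and that figure could be obtained with CUDA events. The names time_ms and effective_gbps are hypothetical, the byte count ignores the repeated reads of neighbouring pixels, and whether the result counts as good depends on the GPU, which the post does not name.

```cuda
#include <cuda_runtime.h>

// Average time in milliseconds of `launch()` over `reps` runs, measured with CUDA events.
// `launch` is any callable that enqueues the kernel on the default stream.
template <typename Launch>
float time_ms(Launch launch, int reps = 100)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();                      // warm-up run, not timed
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// Effective bandwidth in GB/s, counting one unsigned char read and one int
// written per pixel (an assumption that ignores the 5x5 neighbourhood reuse).
double effective_gbps(int width, int height, float ms)
{
    double bytes = double(width) * height * (sizeof(unsigned char) + sizeof(int));
    return bytes / (ms * 1e-3) / 1e9;
}
```

For example, effective_gbps(512, 384, 0.3f) gives roughly 3.28, and the same helper applied to the 0.5 ms texture run gives a directly comparable number.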

Thank you!

Any ideas, please?
I’ve been confused.