2D texture fetch performance when interpolating

I have a kernel which is doing a 2D texture fetch on an unsigned int array.

delta_x and delta_y store precomputed pixel coordinates, from which the tex2D function automatically interpolates the new pixel values.

The kernel takes around 10 ms to finish on a 3008 by 2000 image on an 8800 GTX.

I was just wondering why the performance is so bad.

Does the interpolation take so long?
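
To put the 10 ms in context, here is a rough host-side estimate of the effective memory bandwidth the kernel reaches. It is only a sketch: the function name and the byte count per pixel are my own assumptions (two float reads, one texel fetch and one write per pixel, with no help from the texture cache).

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per output pixel, assuming every access goes to device memory:
//   2 x float         read  (d_map_x, d_map_y)
//   1 x unsigned int  fetch (tex2D on the source image)
//   1 x unsigned int  write (g_odata)
constexpr std::size_t kBytesPerPixel =
    2 * sizeof(float) + 2 * sizeof(unsigned int);

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a given image size
// and measured kernel time in milliseconds.
double effectiveBandwidthGBs(int width, int height, double milliseconds)
{
    assert(width > 0 && height > 0 && milliseconds > 0.0);
    const double bytes =
        static_cast<double>(width) * height * kBytesPerPixel;
    return bytes / (milliseconds * 1.0e-3) / 1.0e9;
}
```

Calling effectiveBandwidthGBs(3008, 2000, 10.0) gives roughly 9.6 GB/s, which is far below the 86.4 GB/s theoretical peak of the 8800 GTX. That gap is why the 10 ms looks slow to me.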

Here's the kernel code:

texture <unsigned int, 2, cudaReadModeElementType> tex_srcImage_RGBA;

__global__ void interpolate(unsigned int* g_odata, float* d_map_x, float* d_map_y, int width, int height)
{
	// x and y coordinates of the output pixel handled by this thread
	int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;

	// skip threads outside the image (the grid size is rounded up)
	if (x >= width || y >= height)
		return;

	unsigned int pixel_index = __mul24(y, width) + x;

	// load corrected coordinates
	float delta_x = d_map_x[pixel_index];
	float delta_y = d_map_y[pixel_index];

	// texture fetch from source image
	unsigned int rgb_pixel = tex2D(tex_srcImage_RGBA, delta_x, delta_y);

	// write to global memory
	g_odata[pixel_index] = rgb_pixel;
}

And the kernel call:

// 16 x 16 threads per block
dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y);
dim3 dimGrid(iDivUp(image_width, BLOCKSIZE_X),
             iDivUp(image_height, BLOCKSIZE_Y));

// Kernel call
interpolate<<< dimGrid, dimBlock >>>((unsigned int*)d_dst_rgba_imageData, d_map_dx, d_map_dy,
                                     width, height);

BLOCKSIZE_X and BLOCKSIZE_Y are both 16.
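
For completeness, a minimal sketch of what iDivUp is assumed to do here, since its definition is not shown: integer division rounded up. The body and the launchedThreads helper are hypothetical stand-ins, not the actual code from my project.

```cpp
#include <cassert>

// Hypothetical stand-in for the iDivUp helper used in the launch code:
// integer division of a by b, rounded up.
int iDivUp(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Number of threads launched along one dimension for a given image
// extent and block size (illustrative helper, not in the original code).
int launchedThreads(int extent, int blockSize)
{
    return iDivUp(extent, blockSize) * blockSize;
}
```

With 16 by 16 blocks the 3008 by 2000 image gets a 188 by 125 grid that covers it exactly. For sizes that are not multiples of 16, the last row and column of blocks contain threads outside the image, which is where a bounds check on x and y in the kernel matters.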