I have a kernel that does a 2D texture fetch on an unsigned int array.
delta_x and delta_y store some precalculated pixel coordinates, and the tex2D function automatically interpolates the new pixel value at those coordinates.
The kernel takes around 10 ms on a 3008 by 2000 image on an 8800 GTX.
I was just wondering why the performance is so bad.
Does the interpolation really take that long?
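For context, the texture is bound to a cudaArray on the host before the kernel launch. A rough sketch of that setup (h_srcImage and the filter/address modes below are placeholders, not copied verbatim from my actual code):

// Host-side texture setup (sketch only, not my exact code)
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<unsigned int>();

cudaArray* a_srcImage = 0;
cudaMallocArray(&a_srcImage, &channelDesc, width, height);
cudaMemcpyToArray(a_srcImage, 0, 0, h_srcImage,
                  width * height * sizeof(unsigned int),
                  cudaMemcpyHostToDevice);

// d_map_x / d_map_y hold unnormalized texel coordinates
tex_srcImage_RGBA.normalized     = false;
tex_srcImage_RGBA.addressMode[0] = cudaAddressModeClamp;
tex_srcImage_RGBA.addressMode[1] = cudaAddressModeClamp;
tex_srcImage_RGBA.filterMode     = cudaFilterModePoint;  // cudaFilterModeLinear is only supported for float-returning textures

cudaBindTextureToArray(tex_srcImage_RGBA, a_srcImage, channelDesc);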
Here’s the kernel code:
texture<unsigned int, 2, cudaReadModeElementType> tex_srcImage_RGBA;

__global__ void interpolate(unsigned int* g_odata, float* d_map_x, float* d_map_y, int width, int height)
{
    // x coordinate
    int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    // y coordinate
    int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;

    unsigned int pixel_index = __mul24(y, width) + x;

    // load corrected coordinates
    float delta_x = d_map_x[pixel_index];
    float delta_y = d_map_y[pixel_index];

    // texture fetch from source image
    unsigned int rgb_pixel = tex2D(tex_srcImage_RGBA, delta_x, delta_y);

    // write to global memory
    g_odata[pixel_index] = rgb_pixel;
}
and the kernel call:
dim3 dimGrid(iDivUp(image_width,  BLOCKSIZE_X),
             iDivUp(image_height, BLOCKSIZE_Y));
dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y);

// Kernel call
interpolate<<<dimGrid, dimBlock>>>((unsigned int*)d_dst_rgba_imageData,
                                   d_map_dx, d_map_dy, width, height);
BLOCKSIZE_X and BLOCKSIZE_Y are both 16.
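In case it matters, iDivUp is just the usual round-up integer division helper from the SDK samples; mine looks roughly like this:

// Round-up integer division (as in the SDK samples)
int iDivUp(int a, int b)
{
    return (a % b != 0) ? (a / b + 1) : (a / b);
}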