Very slow texture reads.

Below is a code snippet containing the kernel and the code that sets up the textures. The kernel is launched once per disparity value, from 0 … d. On each launch it compares every pixel in the left image with the pixel in the right image shifted by that disparity.

I tried binding the data from the two sub images to textures and reading it with tex2D() in the hope of getting some speedup, but the texture reads seem to slow things down instead. Am I setting up the textures incorrectly, or is this just not a good place to use textures? (According to the CUDA Visual Profiler, I can't get most of the memory reads to coalesce, which is why I turned to texture memory.)

dataL and dataR are float pointers to the image data I obtained using OpenCV; step, width, and height come from OpenCV as well. layer is an offset I compute for indexing into integralImg, a large array that stores the absolute difference at each pixel for every disparity value from 0 to d.
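To make the expected result concrete, here is a minimal CPU sketch of what I expect a single kernel call to produce for one disparity. The name reference_layer is only for illustration, and the sketch assumes that step is the row stride in floats (as the kernel's indexing implies) and that layer is the starting offset of that disparity's slice in integralImg.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for one kernel call (one disparity value).
// Assumptions: step is the row stride in floats, and layer is the
// offset of this disparity's slice inside integralImg.
void reference_layer(std::vector<float>& integralImg,
                     const float* dataL, const float* dataR,
                     int disp, int step, int width, int height,
                     std::size_t layer)
{
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            const std::size_t sublayer =
                static_cast<std::size_t>(i) * step + j;
            // Pixels with no partner in the right image are set to zero.
            integralImg[layer + sublayer] =
                (j - disp < 0)
                    ? 0.0f
                    : std::fabs(dataL[sublayer] - dataR[sublayer - disp]);
        }
    }
}
```

Comparing the kernel output against this on a small image should at least confirm whether the texture path returns the same values as the plain global memory reads.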

[codebox]// Global variables
texture<float, 2, cudaReadModeElementType> leftTex;
texture<float, 2, cudaReadModeElementType> rightTex;
cudaArray *left_array, *right_array;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

__global__ void kernel_mat(float *integralImg, float *dataL, float *dataR, int win, int disp, int step, int width, int height, int layer){
    // Column (j) and row (i) of the pixel handled by this thread
    int j = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    int i = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    float left, right;

    if(i < height && j < width){
        // Offset of pixel (i, j) within one disparity layer; step is the row stride in floats
        int sublayer = __umul24(i, step) + j;

        if(j - disp < 0){
            // No matching pixel in the right image at this disparity
            integralImg[layer + sublayer] = 0.0f;
        }
        else{
            //left = dataL[sublayer]; //This is what I was doing before I tried using textures
            //right = dataR[sublayer-disp];
            left = tex2D(leftTex, j, i);
            right = tex2D(rightTex, j - disp, i);
            integralImg[layer + sublayer] = fabsf(left - right);
        }
    }
}

// Set up cudaArrays and bind the textures
cudaMallocArray(&left_array, &channelDesc, width, height);
cudaMallocArray(&right_array, &channelDesc, width, height);

// cudaMemcpy2DToArray takes the source pitch and the copy width in bytes.
// This assumes step is the row stride in floats, as the kernel indexing implies.
cudaMemcpy2DToArray(left_array, 0, 0, dataL, step * sizeof(float), width * sizeof(float), height, cudaMemcpyHostToDevice);
cudaMemcpy2DToArray(right_array, 0, 0, dataR, step * sizeof(float), width * sizeof(float), height, cudaMemcpyHostToDevice);

cudaBindTextureToArray(leftTex, left_array);
cudaBindTextureToArray(rightTex, right_array);[/codebox]

Thanks.