Why does using a texture make my kernel slower?

Hi!

I’m working on some code where each thread has to read a different pixel of an image.

The pixels that contiguous threads read are not contiguous, but they are usually near each other.

I don’t know why, but my code runs faster when I keep the image in global memory than when I read it through a texture.

Can anyone help me?

Here’s my kernel code (global memory case):

__global__ void my_kernel(int3 voxelsDim, float voxel_size, int blocks_x, int blocks_y, unsigned char *p_im_data, float *vox_n_cams_dev)
{
    // Determine the block's real x and y position
    // (the 2D arrangement of blocks is flattened into blockIdx.x)
    __shared__ int block_x_pos;
    __shared__ int block_y_pos;
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        block_x_pos = blockIdx.x % blocks_x;
        block_y_pos = blockIdx.x / blocks_x;
    }
    __syncthreads();

    // Real index of the thread in x and y
    int2 thread = make_int2(blockDim.x * block_x_pos + threadIdx.x, blockDim.y * block_y_pos + threadIdx.y);

    // Test whether the thread is inside the working zone
    if (thread.x < voxelsDim.x && thread.y < voxelsDim.y && blockIdx.y < voxelsDim.z) {

        // Index into the 1D output array
        int index = thread.x + thread.y * voxelsDim.x + blockIdx.y * voxelsDim.x * voxelsDim.y;

        // (some register-only computations that produce "unadjacent_index")

        // Non-adjacent read, adjacent write.
        // With texture memory, this line becomes:
        //     if (tex2D(tex_image, unadj_index_x, unadj_index_y) > 128)
        if (p_im_data[unadjacent_index] > 128) {
            vox_n_cams_dev[index]++;
        }
    }
}
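To make the comparison concrete, here is a minimal, self-contained sketch of the two read paths side by side. It is not my real kernel: it assumes the newer texture object API instead of the `tex_image` texture reference used above, and `count_global`, `count_tex`, `compare_read_paths`, the 256x256 image and the `coords` pattern are all hypothetical stand-ins for the elided index computation. It only checks that both paths see the same pixels.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Global-memory path: one scattered byte read per thread.
__global__ void count_global(const unsigned char *img, int width,
                             const int2 *coords, int n, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && img[coords[i].y * width + coords[i].x] > 128) out[i] += 1.0f;
}

// Texture path: same test, fetched through a texture object.
__global__ void count_tex(cudaTextureObject_t tex, const int2 *coords, int n, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Point filtering + unnormalized coordinates: (x + 0.5, y + 0.5) addresses texel (x, y).
    if (i < n && tex2D<unsigned char>(tex, coords[i].x + 0.5f, coords[i].y + 0.5f) > 128)
        out[i] += 1.0f;
}

// Runs both kernels on the same synthetic image. Returns the number of
// mismatching outputs, or -1 if the kernels failed to run.
int compare_read_paths()
{
    const int W = 256, H = 256, N = W * H;
    std::vector<unsigned char> img(N);
    std::vector<int2> coords(N);
    for (int i = 0; i < N; ++i) {
        img[i] = (unsigned char)((i * 37) % 256);
        coords[i] = make_int2((3 * i) % W, (i / 64) % H);  // near, but not contiguous (assumed pattern)
    }

    unsigned char *d_img; int2 *d_coords; float *d_a, *d_b;
    cudaMalloc(&d_img, N);
    cudaMalloc(&d_coords, N * sizeof(int2));
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMemcpy(d_img, img.data(), N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_coords, coords.data(), N * sizeof(int2), cudaMemcpyHostToDevice);
    cudaMemset(d_a, 0, N * sizeof(float));
    cudaMemset(d_b, 0, N * sizeof(float));

    // Copy the image into a cudaArray and wrap it in a texture object.
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &fmt, W, H);
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), W, W, H, cudaMemcpyHostToDevice);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    const int threads = 256, blocks = (N + threads - 1) / threads;
    count_global<<<blocks, threads>>>(d_img, W, d_coords, N, d_a);
    count_tex<<<blocks, threads>>>(tex, d_coords, N, d_b);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<float> a(N), b(N);
    cudaMemcpy(a.data(), d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), d_b, N * sizeof(float), cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (a[i] != b[i]) ++mismatches;

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(d_img); cudaFree(d_coords); cudaFree(d_a); cudaFree(d_b);
    return ok ? mismatches : -1;
}

int main()
{
    printf("mismatches: %d\n", compare_read_paths());
    return 0;
}
```

The sketch says nothing about which path is faster; it only isolates the one line that differs between my two versions so the access pattern can be varied and timed separately.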

Is there any explanation for what’s going wrong?

With global memory, the kernel takes 5.979680 ms.
With texture memory, it takes 6.449344 ms.
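In case the measurement method matters, the following is a minimal sketch of how a kernel could be timed with CUDA events, averaging over several launches after a warm-up. `dummy_kernel` and `time_kernel_ms` are hypothetical stand-ins, not the code that produced the numbers above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace with the kernel under test.
__global__ void dummy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns the mean time per launch in milliseconds, or -1 on error.
float time_kernel_ms(int n, int repeats)
{
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time setup cost is not included in the timing.
    dummy_kernel<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        dummy_kernel<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    bool ok = (cudaEventElapsedTime(&ms, start, stop) == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ok ? ms / repeats : -1.0f;
}

int main()
{
    printf("mean kernel time: %f ms\n", time_kernel_ms(1 << 20, 10));
    return 0;
}
```

Averaging over repeated launches helps when the gap between two variants is small, as it is here (roughly 0.47 ms).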

Thanks in advance!

Enrique oriol