Hi!
I’m working on some code where I have to read from different pixels of an image.
The pixels that contiguous threads read are not contiguous, but they’re usually close to each other.
I don’t know why, but my code runs faster when I put the image in global memory than when I put it in a texture.
Can anyone help me?
Here’s my kernel code (global memory case):
__global__ void my_kernel(int3 voxelsDim, float voxel_size, int blocks_x, int blocks_y, unsigned char *p_im_data, float *vox_n_cams_dev)
{
    // determine the real index of the thread in x and y
    // (the x/y block grid is flattened into blockIdx.x; blockIdx.y is the z slice)
    __shared__ int block_x_pos;
    __shared__ int block_y_pos;
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        block_x_pos = blockIdx.x % blocks_x;
        block_y_pos = blockIdx.x / blocks_x;
    }
    __syncthreads();
    int2 thread = make_int2(blockDim.x * block_x_pos + threadIdx.x, blockDim.y * block_y_pos + threadIdx.y);

    // test whether the thread is inside the working zone
    if (thread.x < voxelsDim.x && thread.y < voxelsDim.y && blockIdx.y < voxelsDim.z) {
        // determine the index into the 1D array
        int index = thread.x + thread.y * voxelsDim.x + blockIdx.y * voxelsDim.x * voxelsDim.y;

        // some functions using only registers that return "unadjacent_index"

        // non-adjacent read, adjacent write
        // with texture memory, here I use: if (tex2D(tex_image, unadj_index_x, unadj_index_y) > 128)
        if (p_im_data[unadjacent_index] > 128) {
            vox_n_cams_dev[index]++;
        }
    }
}
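For anyone who wants to reproduce the texture case, below is a minimal, self-contained sketch of a 2D texture fetch with the same "> 128" test. It is not my actual kernel: it assumes the newer texture object API (cudaTextureObject_t, compute capability 3.0 or higher) instead of the tex_image texture reference mentioned in the comment above, and the names count_bright_tex and count_bright_with_texture are hypothetical, made up for this example. It also reads pixel (x, y) directly rather than a computed non-adjacent position.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread fetches one pixel through the texture
// and increments a global counter if the value is above 128.
__global__ void count_bright_tex(cudaTextureObject_t tex_image,
                                 int width, int height, unsigned int *count)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Unnormalized coordinates, point filtering: +0.5f addresses the texel centre.
        unsigned char v = tex2D<unsigned char>(tex_image, x + 0.5f, y + 0.5f);
        if (v > 128) {
            atomicAdd(count, 1u);
        }
    }
}

// Hypothetical host helper: uploads an 8-bit image to a CUDA array, binds it
// to a texture object, runs the kernel and returns the count (-1 on failure).
int count_bright_with_texture(const unsigned char *h_image, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t array = nullptr;
    if (cudaMallocArray(&array, &desc, width, height) != cudaSuccess) return -1;
    cudaMemcpy2DToArray(array, 0, 0, h_image, width * sizeof(unsigned char),
                        width * sizeof(unsigned char), height, cudaMemcpyHostToDevice);

    cudaResourceDesc res_desc;
    memset(&res_desc, 0, sizeof(res_desc));
    res_desc.resType = cudaResourceTypeArray;
    res_desc.res.array.array = array;

    cudaTextureDesc tex_desc;
    memset(&tex_desc, 0, sizeof(tex_desc));
    tex_desc.addressMode[0] = cudaAddressModeClamp;
    tex_desc.addressMode[1] = cudaAddressModeClamp;
    tex_desc.filterMode = cudaFilterModePoint;      // no interpolation
    tex_desc.readMode = cudaReadModeElementType;    // return raw unsigned char
    tex_desc.normalizedCoords = 0;                  // coordinates in pixels

    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &res_desc, &tex_desc, nullptr) != cudaSuccess) {
        cudaFreeArray(array);
        return -1;
    }

    unsigned int *d_count = nullptr;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    count_bright_tex<<<grid, block>>>(tex, width, height, d_count);

    unsigned int h_count = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(d_count);
    return static_cast<int>(h_count);
}
```

The image is stored in a cudaArray rather than in linear memory, since that layout is what lets the 2D texture cache exploit reads that are near each other in both x and y.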
Is there any explanation for what’s going wrong?
with global memory, it takes: 5.979680 ms
with texture memory, it takes: 6.449344 ms
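In case someone wants to compare timings on their own hardware, here is a minimal sketch of timing a kernel with CUDA events, with a warm-up launch and an average over several runs. It is only an illustration and not necessarily how the numbers above were obtained; dummy_kernel and time_dummy_kernel are hypothetical names standing in for the real kernel and its launch.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void dummy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

// Returns the mean time per launch in milliseconds, or -1.0f on failure.
float time_dummy_kernel(int n, int repeats)
{
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization cost is not measured.
    dummy_kernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r) {
        dummy_kernel<<<blocks, threads>>>(d_data, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until all timed launches have finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return total_ms / repeats;
}
```

Averaging over several launches helps tell a real difference of a few percent from run-to-run noise.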
Thanks in advance!
Enrique Oriol