Maybe someone can help

OK, I'm currently trying to implement an area-search block matcher using CUDA.

I'm trying to find blocks from a reference image within defined search areas in another image.

The core idea is to iterate over the blocks to find using the grid index, and over the candidate blocks in the search area using the index within the block (the thread index). Each thread then compares two blocks and writes its result to shared memory.

Because of the larger search areas, I decided to build four kernels, one per direction (-x/y, x/y, -x/-y, x/-y), so that I can use the thread index with different signs as the displacement within the search area.
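As a side note on the four-kernel layout, the sign flip could also be passed in as a parameter, so that one kernel source serves all four directions and the timings compare exactly the same code. The following is only a minimal host-side sketch of that mapping. The names `Displacement` and `quadrantDisplacement` are hypothetical and not part of the posted code, and `tx`/`ty` stand in for `threadIdx.x`/`threadIdx.y`.

```cpp
#include <cassert>

// Hypothetical helper type: the displacement one thread would test.
struct Displacement {
    int dx;
    int dy;
};

// tx, ty play the role of threadIdx.x / threadIdx.y.
// signX, signY select the quadrant and must each be +1 or -1.
inline Displacement quadrantDisplacement(int tx, int ty, int signX, int signY) {
    assert(signX == 1 || signX == -1);
    assert(signY == 1 || signY == -1);
    return Displacement{signX * tx, signY * ty};
}
```

In a kernel, the same arithmetic would replace the hard-coded signs before threadIdx, with the two sign values passed as kernel arguments for each of the four launches.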

My problem is that these four kernels have noticeably different execution times. Maybe I'm doing something completely wrong, or I'm just missing something. Here is my code:

The block-match kernel for the upper-right direction; for the other directions I just changed the signs before threadIdx:

__global__ void blockmatchur(
    int *blkOrigins,
    int blockSize,
    int dspl,
    int4 *results)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // linear index of this thread block within the grid
    int curPosInGrid = by*gridDim.x + bx;
    // linear index of this thread within its block
    int curPosInBlock = ty*blockDim.x + tx;

    int posX = blkOrigins[curPosInGrid];
    int posY = blkOrigins[curPosInGrid+1];

    // one entry per thread: (sum, candidate x, candidate y, unused)
    __shared__ int4 values[256];
    values[curPosInBlock] = make_int4(matchblock3(blockSize, posX, posY, tx, ty), posX+tx, posY+ty, 0);
}



The comparator:

__device__ int matchblock3(
    int blockSize,
    int posX,
    int posY,
    int dsplX,
    int dsplY)
{
    int sum = 0;
    int tx = threadIdx.x;
    int psX = posX + tx;
    int pX = posX + dsplX;
    int pY = posY + dsplY;

    // iterate over block
    for (int y = 0; y < blockSize; y++) {
        for (int x = 0; x < blockSize; x++) {
            // squares are written out because ^ is bitwise XOR in C, not a power
            int a = tex2D(img, pY + y, pX + x);
            int b = tex2D(imgRef, posY + y, psX + x);
            sum += a*a - b*b;
        }
    }
    return sum;
}
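For checking the comparator's results, a plain CPU reference can help. The sketch below assumes 8-bit, row-major images; the `Image` struct and `matchBlockReference` are hypothetical names, not part of the posted code. It computes the difference of squared pixel values that the comparator appears to intend. In C and C++, `v ^ 2` is a bitwise XOR rather than a square, so the squares are written as products here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical row-major 8-bit image standing in for the textures img / imgRef.
struct Image {
    int width;
    int height;
    std::vector<unsigned char> pixels;

    // Bounds-checked pixel read (a texture fetch would clamp instead).
    int at(int x, int y) const {
        assert(x >= 0 && x < width && y >= 0 && y < height);
        return pixels[y * width + x];
    }
};

// Sum over a blockSize x blockSize window of (candidate^2 - reference^2).
// (candX, candY) is the window origin in img, (refX, refY) the origin in imgRef.
inline int matchBlockReference(const Image& img, const Image& imgRef,
                               int blockSize,
                               int refX, int refY,
                               int candX, int candY) {
    int sum = 0;
    for (int y = 0; y < blockSize; ++y) {
        for (int x = 0; x < blockSize; ++x) {
            int a = img.at(candX + x, candY + y);
            int b = imgRef.at(refX + x, refY + y);
            sum += a * a - b * b;  // explicit squares, not a ^ 2
        }
    }
    return sum;
}
```

If a sum of squared differences was meant instead, the inner line would be `sum += (a - b) * (a - b);`, which cannot cancel to zero for blocks that differ.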


And I call it like this:

dim3 threads(dspl, dspl);
dim3 grid(20, 15);
blockmatchur<<<grid, threads>>>(org_Data, blockW, dspl, res_Data);

The code is not yet tuned for good memory access.

Your blockmatchur() kernel produces no useful output, so I guess the compiler just removes everything. Try writing the computed values to global memory at the end of your kernel.

And please show us the actual values for the thread block dimensions.

The kernel is not finished yet. At the end I have to find the block that matches best (the entry in values with the lowest sum). So I think I will sort the array and then write back only the int4 struct with the lowest sum. That gives me four int4 structs for each block, and then I have to find the one with the lowest sum among these four.
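Since only the entry with the lowest sum is needed, a pairwise minimum reduction may be enough and a full sort may not be necessary. The following is a host-side sketch of that reduction pattern, not the poster's implementation. `Candidate` is a hypothetical stand-in for the int4 entries, and `reduceMin` is a hypothetical name. On the GPU, each iteration of the inner loop would be one thread, with a `__syncthreads()` between strides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one int4 entry: cost plus candidate position.
struct Candidate {
    int cost;
    int x;
    int y;
};

// Tree-style minimum reduction over a copy of the input.
// After the last stride, element 0 holds the candidate with the lowest cost.
inline Candidate reduceMin(std::vector<Candidate> values) {
    assert(!values.empty());
    const std::size_t n = values.size();

    // Start from the smallest power of two that covers all n entries.
    std::size_t stride = 1;
    while (stride < n) {
        stride *= 2;
    }

    for (stride /= 2; stride > 0; stride /= 2) {
        // Each i corresponds to one thread in a GPU version.
        for (std::size_t i = 0; i < stride; ++i) {
            if (i + stride < n && values[i + stride].cost < values[i].cost) {
                values[i] = values[i + stride];
            }
        }
    }
    return values[0];
}
```

The same function could then be applied to the four per-direction winners to pick the overall best match.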

The blocks are actually 32x32 and the displacement is 16. If I am right, I could go up to 22 with the displacement and still stay under the maximum of 512 threads per block.
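The arithmetic behind that limit can be checked with a small helper. This is only a sketch, and `maxSquareDisplacement` is a hypothetical name. For a square dspl x dspl thread block it returns the largest dspl whose square fits the per-block thread limit.

```cpp
#include <cassert>

// Largest d such that a d x d thread block stays within maxThreadsPerBlock.
inline int maxSquareDisplacement(int maxThreadsPerBlock) {
    assert(maxThreadsPerBlock >= 1);
    int d = 1;
    while ((d + 1) * (d + 1) <= maxThreadsPerBlock) {
        ++d;
    }
    return d;
}
```

With a limit of 512 this gives 22, since 22 * 22 = 484 and 23 * 23 = 529. Note that the shared array in the posted kernel is declared as values[256], which covers only 16 * 16 threads. A displacement of 22 would need 484 entries.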