Block Matching algorithm

Hi!

I’d like to develop a fast block matching algorithm with CUDA. I already wrote a program, but it isn’t efficient enough: it takes 35 ms to compute the motion vectors (MV) for 4*4 blocks of 720 * 522 frames.
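
To make clear what kind of computation I mean, here is a minimal CPU sketch in plain C of a full-search block matching for one 4*4 block, using the sum of absolute differences (SAD) as the cost. It is not my CUDA code, only an illustration: the names `sad_4x4`, `full_search_4x4` and `MotionVector` are made up, and I assume 8-bit luma frames stored row by row with the same width for both frames.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type: displacement of the best match and its SAD. */
typedef struct { int dx, dy; unsigned sad; } MotionVector;

/* SAD between two 4x4 blocks. cur and ref point to the top-left pixel of
   each block; stride is the frame width in pixels (assumed equal for both). */
static unsigned sad_4x4(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            sad += (unsigned)abs((int)cur[y * stride + x] - (int)ref[y * stride + x]);
    return sad;
}

/* Exhaustive search: try every displacement in [-range, range]^2 around the
   block at (bx, by) and keep the one with the smallest SAD. Candidates that
   would fall outside the reference frame are skipped. */
static MotionVector full_search_4x4(const uint8_t *cur, const uint8_t *ref,
                                    int width, int height,
                                    int bx, int by, int range)
{
    MotionVector best = { 0, 0, UINT_MAX };
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + 4 > width || ry + 4 > height)
                continue;
            unsigned sad = sad_4x4(cur + by * width + bx,
                                   ref + ry * width + rx, width);
            if (sad < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    }
    return best;
}
```

Calling `full_search_4x4` once per 4*4 block of the frame gives one motion vector per block; on the GPU the idea would be to spread the blocks and the search positions over threads.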

So I found and read articles on the topic, and they all refer to the same SAD program, which I finally found at this address: http://code.google.com/p/gpuocelot/source/…vn324&r=223.

But there are some portions of the code that I don’t understand.

For example:

[codebox]/* Allocate SAD data on the device */
cudaMalloc((void **)&d_sads, 41 * MAX_POS_PADDED * image_size_macroblocks *
           sizeof(unsigned short));[/codebox]

Why do they use the number 41? And:

[codebox]mb_sad_calc<<<dim3(CEIL(ref_image->width / 4, THREADS_W),
                   CEIL(ref_image->height / 4, THREADS_H)),
              dim3(CEIL(MAX_POS, POS_PER_THREAD) * THREADS_W * THREADS_H),
              SAD_LOC_SIZE_BYTES>>>
  (d_sads,
   (unsigned short *)d_cur_image,
   image_width_macroblocks,
   image_height_macroblocks);
CUDA_ERRCK[/codebox]

How can they compute all the SADs for the 4*4 blocks if the array d_sads has been allocated with MAX_POS_PADDED * image_size_macroblocks?
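
My only guess so far, which I have not confirmed against the source: in H.264 a 16*16 macroblock can be split into 16*16, 16*8, 8*16, 8*8, 8*4, 4*8 and 4*4 partitions, and counting all of them gives exactly 41. If image_size_macroblocks counts 16*16 macroblocks and the kernel stores one SAD per partition and per search position, that would explain both the factor 41 and where the sixteen 4*4 SADs of each macroblock go. The small C sketch below only checks that arithmetic; the function names are made up.

```c
/* Number of w x h partitions that tile one 16x16 macroblock. */
static int blocks_per_macroblock(int w, int h)
{
    return (16 / w) * (16 / h);
}

/* Total over the seven H.264 partition sizes:
   1 + 2 + 2 + 4 + 8 + 8 + 16. */
static int total_partitions(void)
{
    static const int sizes[7][2] = {
        {16, 16}, {16, 8}, {8, 16}, {8, 8}, {8, 4}, {4, 8}, {4, 4}
    };
    int total = 0;
    for (int i = 0; i < 7; ++i)
        total += blocks_per_macroblock(sizes[i][0], sizes[i][1]);
    return total;
}
```

If this is right, 16 of the 41 entries per macroblock and per position would be the 4*4 SADs, but I would like someone to confirm it.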

I think understanding this code would be very helpful to me.

Could somebody explain it to me? Thanks in advance.

P.S.: Sorry for my bad English, I’m French ;)