I’d like to develop a fast block matching algorithm with CUDA. I already wrote a program, but it isn’t efficient enough (it takes 35 ms to compute the motion vectors for the 4*4 blocks of a 720 * 522 frame).
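To make clear what I am computing: for each 4*4 block of the current frame, I search a window of the reference frame for the displacement with the smallest sum of absolute differences (SAD). Here is a minimal CPU sketch of that per-block search in plain C. The names (`sad4x4`, `best_mv`, `SEARCH_RANGE`) and the search range of 2 are placeholders chosen for illustration, not taken from my CUDA program.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define SEARCH_RANGE 2 /* hypothetical: displacements in [-2, 2] on each axis */

/* SAD between the 4x4 block of cur at (cx, cy) and the 4x4 block of ref at (rx, ry).
   Both frames are 8-bit, row-major, with the same stride. */
static int sad4x4(const unsigned char *cur, const unsigned char *ref,
                  int stride, int cx, int cy, int rx, int ry)
{
    assert(cur != NULL && ref != NULL);
    int sad = 0;
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            sad += abs((int)cur[(cy + y) * stride + cx + x] -
                       (int)ref[(ry + y) * stride + rx + x]);
    return sad;
}

/* Exhaustive search for the block at (bx, by): tries every displacement in the
   window that keeps the candidate inside the frame, returns the smallest SAD
   and writes the matching motion vector to (*mvx, *mvy). */
static int best_mv(const unsigned char *cur, const unsigned char *ref,
                   int width, int height, int bx, int by,
                   int *mvx, int *mvy)
{
    assert(mvx != NULL && mvy != NULL);
    int best = INT_MAX;
    *mvx = 0;
    *mvy = 0;
    for (int dy = -SEARCH_RANGE; dy <= SEARCH_RANGE; ++dy) {
        for (int dx = -SEARCH_RANGE; dx <= SEARCH_RANGE; ++dx) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + 4 > width || ry + 4 > height)
                continue; /* candidate block would leave the frame */
            int sad = sad4x4(cur, ref, width, bx, by, rx, ry);
            if (sad < best) {
                best = sad;
                *mvx = dx;
                *mvy = dy;
            }
        }
    }
    return best;
}
```

The GPU version has to do this for every 4*4 block and every search position, which is where the time goes.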
So I found and read articles on the topic, and they all refer to the same program for computing SADs, which I finally found at this address: http://code.google.com/p/gpuocelot/source/…vn324&r=223.
But there are some portions of the code that I don’t understand.
For example:
[codebox]/* Allocate SAD data on the device */
cudaMalloc((void **)&d_sads, 41 * MAX_POS_PADDED * image_size_macroblocks * sizeof(unsigned short));[/codebox]
Why do they use the number 41? And:
[codebox]mb_sad_calc<<<dim3(CEIL(ref_image->width / 4, THREADS_W),
                   CEIL(ref_image->height / 4, THREADS_H)),
              dim3(CEIL(MAX_POS, POS_PER_THREAD) * THREADS_W * THREADS_H),
              SAD_LOC_SIZE_BYTES>>>
    (d_sads, (unsigned short *)d_cur_image,
     image_width_macroblocks, image_height_macroblocks);
CUDA_ERRCK[/codebox]
How can they compute all the SADs for the 4*4 blocks if the array d_sads has been allocated with MAX_POS_PADDED * image_size_macroblocks?
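My only guess so far about the 41 in the allocation, which I have not confirmed against the linked source: an H.264 16*16 macroblock can be partitioned into 1 block of 16*16, 2 of 16*8, 2 of 8*16, 4 of 8*8, 8 of 8*4, 8 of 4*8 and 16 of 4*4, which makes 41 blocks in total. If that reading is right, d_sads would hold one SAD per partition and per search position for each macroblock, and the 4*4 SADs would be 16 of those 41 entries. The small C check below only does that arithmetic; the table and the function names are mine, not from the program.

```c
#include <assert.h>
#include <stddef.h>

/* Partition shapes of a 16x16 macroblock in H.264 (width, height). */
struct shape { int w, h; };

static const struct shape shapes[] = {
    {16, 16}, {16, 8}, {8, 16}, {8, 8}, {8, 4}, {4, 8}, {4, 4}
};

/* Number of w x h blocks needed to tile one 16x16 macroblock. */
static int blocks_per_macroblock(struct shape s)
{
    assert(s.w > 0 && s.h > 0 && 16 % s.w == 0 && 16 % s.h == 0);
    return (16 / s.w) * (16 / s.h);
}

/* Sum over all shapes: 1 + 2 + 2 + 4 + 8 + 8 + 16. */
static int total_partitions(void)
{
    int total = 0;
    for (size_t i = 0; i < sizeof shapes / sizeof shapes[0]; ++i)
        total += blocks_per_macroblock(shapes[i]);
    return total;
}
```

But I still don't see where the code decides which of those entries a given thread writes.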
I think understanding this code will be very helpful for me.
Could somebody explain it to me? Thanks in advance.
P.S.: Sorry for my bad English, I’m French ;)