Hello,
First of all, I am french and my english is not very good, I apologize for that in advance.
I am working on a block matching program written in CUDA. I focused on the code developped by the Illinois group that is available at this address : http://code.google.com/p/gpuocelot/source/…rboil/tests/sad.
In this program, they have the gpu generates the SADs for all block candidates within a search area of (216+1)² which leads to 1086 candidate positions for one 44 block.
[codebox]/* Search offsets within 16 pixels of (0,0) */
#define SEARCH_RANGE 16
/* The total search area is 33 pixels square */
#define SEARCH_DIMENSION (2*SEARCH_RANGE+1)
/* The total number of search positions is 33^2 */
#define MAX_POS 1089
/* This is padded to a multiple of 8 when allocating memory */
#define MAX_POS_PADDED 1096
[/codebox]
here is the declaration of the whole SADs vector (it contains all the sads for the 7 different kinds of blocks) :
[codebox]/* Allocate SAD data on the device */
cudaMalloc((void **)&d_sads, 41 * MAX_POS_PADDED * image_size_macroblocks * sizeof(unsigned short));
[/codebox]
I’d like to succeed in reducing all the values obtained for all the 44 blocks to get the best motion vector for each 44 block.
I also read the article dealing about H.264 motion estimation implementation in cuda (http://vc…H.264%20ME%20implementation%20on%20CUDA.pdf). To achieve the reduction, they follow the SDK reduction example with non-divergent branch strategy.
Thera are as many thread blocks than 4*4 blocks. Each thread within a block reads 4 SADs from global memory and compares them. Then the best SAD is stored in shared memory with its index, the number of threads is halved, SADs compared, the process repeats until the last two best SADs.
My questions are about the way to implement the reduction. My vector is size of (img_width / 4 * img_height */ 4 * MAX_POS_PADDED) * sizeof(unsigned short).
If I want to follow the method described in the pdf, I have to create a kernel with these parameters :
-
grid(img_width / 4 , img_height */ 4 )
-
block(MAX_POS_PADDED / 4)
But, I don’t really know how to declare shared memory. I’ll have to put in it the indexes of the SADs which are signed short. So, I have thought that I can declare a structure of 1 unsigned short and 2 signed short and fill my shared memory with it.
The numbers of threads in a block is not a power of two. I can then allocate more threads and fill the over shared memory with high SADs values and follow the same process for the reduction as in the SDK.
i would like to have your advice on this strategy of implementation. Is there any way faster to implement this program ?
Maybe some articles dealing with the subject.
Thanks in advance.