Motion Estimation Fast search

Hello, Im trying to implement on cuda the motion estimation, now I have the full search of 16x16 using shared memory but I would like to implement some of the fast searchs too using shared memory the problem its I dont now If load the search window in shared memory and then compute the fast search (like diamond) is a good idea or not… could you help me please :)

Fast search means a little bit degradation in quality/bitrate performance. Full search is simpler and slightly more parallel than fast search. Why don’t you implement both?

I have already implemented the full search but I have to implement the diamond search or some fast search to compare it. The problem is that Im not sure if using shared memory in fast search is going to be useful or not …

It’s dependent on how you organize your memory reads. Note that reading the texture cache are faster than reading the shared memory.

I know in fact Im using texture all the time because is the best way to work with images… For the moment I have 4 versions:

-motion estimation doing full search of 16x16 only with texture

-motion estimation doing diamond search with texture

-motion estimation doing full search with texture and shared memory

-motion estimation doing pyramid method with texture and shared memory

I would like to improve the diamond search because using branchs and conditionals is quite slow the algorithm, but load the search window in shared memory Im not going to improve the speedup I think… so now I dont know what can I do

Can you read the image into scratch variables before actually doing fast search? That will make use of coalescent read and ensure the pixels are already in cache. Memory reads in fast search are not coalescent, I think.

I dont know how to read the image into scratch variables… Ive never done before…

I meant reading the image pixels into temporary variables and ignoring them. The first read would place the pixel into the cache. The consequence reads (in your fast search algorithm) will take data from the cache and not need to be coalescent.

Now Im doing something like this, I know that the code is very bad…

global void diaSAD(t_SADMV* tsad, t_SADMV* osad, int width, int height, int MB, int SW){

    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;ing very stupid like this:

int xleft, yleft, xright, yright, xbottom, ybottom, xtop, ytop;

int diamond[4][2] = {{-1,0},{1,0},{0,1},{0,-1}};

int flag = 0;
if((x>(width-1))||(y>(height-1))) return;

int  inisad = (y*width + x);

unsigned int xind  = MB*x;
unsigned int yind  = MB*y;

xleft  = xind + diamond[0][0] + tempmin.mv.x;
yleft  = yind + diamond[0][1] + tempmin.mv.y;

xright = xind + diamond[1][0] + tempmin.mv.x;
yright = yind + diamond[1][1] + tempmin.mv.y;

xtop   = xind + diamond[3][0] + tempmin.mv.x;
ytop   = yind + diamond[3][1] + tempmin.mv.y;

xbottom= xind + diamond[2][0] + tempmin.mv.x;
ybottom= yind + diamond[2][1] + tempmin.mv.y;

temp1.sad    = sad(xind, yind, xleft, yleft,MB);
temp1.mv.x   = diamond[0][0];
temp1.mv.y   = diamond[0][1];   

temp2.sad    = sad(xind, yind, xright, yright,MB);
temp2.mv.x   = diamond[1][0];
temp2.mv.y   = diamond[1][1];

temp3.sad    = sad(xind, yind, xtop, ytop,MB);
temp3.mv.x   = diamond[3][0];
temp3.mv.y   = diamond[3][1];

temp4.sad    = sad(xind, yind, xbottom, ybottom,MB);
temp4.mv.x   = diamond[2][0];
temp4.mv.y   = diamond[2][1];

                                               X

because in each iteration I need X o X the values of the X in the image …
X

I think it will be difficult to make good use of shared memory, and also use a fast algorithm(usualy not regular memory access pattern) and also avoid considerable nr of branches.

How fast are your algorithms so far?

Well now with

-the full search using shared memory, search window of 16x16 and a image of 704x576 pixels and one level in the pyramid: 103.24 ms

-the full search using shared memory, search window of 16x16 and a image of 704x576 pixels and two levels in the pyramid:

                 *- first level of subsampling:           38.292198 ms

                 *- second level highest resolution: 112.57 ms

If u want the time of the algorithms let me know… I think they are not very fast I would like to improve it …

First of all. Do you have both small and large diamonds, or only small?

What would come to my mind, for a small diamond(or maybe for both):

have a block of 16x16, each thread loads on value from mem(maybe use shared mem, because the same data will be accessed many times), and do a parallel reduction to obtain SAD for whole MB.

(load 5 values for each thread, the center and each of the other 4 values required for DS. Then do a parallel reduction)

What do you mean by level of subsampling in the case of full search?

Because Im using a pyramidal method --> I downsample the original images by step of 2 then I compute the SAD of both image and I compute the motion vector predictors with these vectors in the image with the original size I put the search window in the position that the previous motion vector tell me.

  • Downsample image–> centered search window --> Motion vectors predictors

    - Motion vector predictors -->Original size  --> not centered search window  --> compute the motion vector

Ok, I was a little confusing.

So, you will have to load inside shared memory the whole area that you would need subsequently. (in this case (1+1+16)*(1+1+16) pixels for one MB. (for small diamond).

But since you have a 16x16 block, this is not very straightforward(you’ll need to do some tricks with the threads in the warp, to load the extra 2*18 pixels in each direction) which might end up slower then using textures).

Or maybe think of some other configuration of threads/block that suits better.

That the point so finally I decided not to use shared memory to do the diamond search because for sure is slower than using texture!! Anyway do you know about other fast search that I can use to my algorithm? It could be very nice have more opinions about this …

Are you familiar with H.264 motion estimation? There are good stuffs in it:

[6] M.-C. Hong, H. H. Oh, “Range Decision for Motion Estimation of VCEG-N33”, 2nd JVT Meeting, JVT-B022, Geneva, Feb 2002.
[7] Z.-B. Chen, P. Zhou, Y. He, “Fast Motion Estimation for JVT”, 7th JVT Meeting, JVT-G016, Pattaya, Mar 2003.
[8] Z.-B. Chen, P. Zhou, Y. He, “Fast Integer Pel and Fractional Pel Motion Estimation for JVT”, 6th JVT Meeting, JVT-F017, Awaji, Dec 2002.
[9] I. Choi, W. I. Choi, J. Lee, B. Jeon, “The Fast Mode Decision with Fast Motion Estimation”, 14th JVT Meeting, JVT-N013, Hong Kong, Jan 2005.
[10] B. Jeon, J. Lee, “Fast mode decision for H.264”, 10th JVT Meeting, JVT-J033, Hawaii, Dec 2003.
[11] K. P. Lim, S. Wu, D. J. Wu, S. Rahardja, X. Lin, F. Pan, Z. G. Li, “Fast INTER Mode Selection”, 9th JVT Meeting, JVT-I020, San Diego, Sep 2003.

In fact Im doing the maste thesis about CUDA and motion estimation so thanks for your help and the information maybe I can find some fast search better for my master thesis :)