Full-search motion estimation with OpenCL (how to optimize with local memory)

Hello everyone,

I have finished a full-search motion estimation (ME) for YUV 4:2:0 video with OpenCL. I downloaded reference code for ME with OpenCL and modified it to process YUV420 data.



The kernel is as follows:


#define BLOCK_SIZE 16
#define width 352
#define height 288
// SEARCH_WINDOW_SIZE is not defined in this listing; it is assumed to be
// supplied when the program is built (for example with a -D build option).

__kernel void motion_estimation(const __global uchar *frame_original,
                                const __global uchar *frame_ref,
                                __global int2 *mv_output)
{
    // We were handed the (x-th, y-th) block; we want its top-left pixel.
    int2 block = (int2)(get_global_id(0), get_global_id(1));
    block *= BLOCK_SIZE;

    int2 test_block;
    int2 best_block = block;   // zero motion vector if no candidate is valid
    uint best_value = UINT_MAX;

    // Scan the rows from block.y - window_size until block.y + window_size.
    for (test_block.y = block.y - SEARCH_WINDOW_SIZE; test_block.y < block.y + SEARCH_WINDOW_SIZE; test_block.y++)
    {
        // Scan the columns from block.x - window_size until block.x + window_size.
        for (test_block.x = block.x - SEARCH_WINDOW_SIZE; test_block.x < block.x + SEARCH_WINDOW_SIZE; test_block.x++)
        {
            // Only test candidates that lie entirely inside the frame.
            if (test_block.y >= 0 && test_block.x >= 0 &&
                test_block.x + BLOCK_SIZE <= width &&
                test_block.y + BLOCK_SIZE <= height)
            {
                // Find the difference (SAD) of this choice.
                uint sum = 0;
                const __global uchar *a = frame_original + block.y * width + block.x;
                const __global uchar *b = frame_ref + test_block.y * width + test_block.x;

                for (int m = 0; m < BLOCK_SIZE; m++)
                {
                    for (int n = 0; n < BLOCK_SIZE; n++)
                    {
                        sum += abs_diff(a[n], b[n]);
                    }
                    // Move to the beginning of the next line in the current block.
                    a += width;
                    b += width;
                }

                if (sum < best_value)
                {
                    best_value = sum;
                    best_block = test_block;
                }
            }
        }
    }

    // Return our match.
    int2 motion_vector = best_block - block;
    mv_output[get_global_id(1) * get_global_size(0) + get_global_id(0)] = motion_vector;
}


It works well, but compared with the C code the speedup is only 3x.

I know of two things that reduce the performance:

  1. I read the YUV420 data through an 8-bit (uchar) pointer, which makes coalesced access to global memory difficult.

  2. I do not use local memory to optimize the kernel. (More precisely, I don't know how to use local memory to compute the SAD of every block in the search range.)
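For point 2, the direction I have in mind is roughly the following. It is an untested sketch, not verified on any device: one 16x16 work-group handles one block, the work-items cooperatively copy the current block and the surrounding reference area into local memory, each work-item then computes the SAD for a few candidate positions from local memory, and a reduction in local memory picks the smallest SAD. The kernel name motion_estimation_local, the macros RANGE, TILE and NUM_WI, the default SEARCH_WINDOW_SIZE of 16 and the launch configuration (local size 16x16, global size 352x288) are all my assumptions and not part of the reference code.

```opencl
// Untested sketch. Assumed launch: local size (16, 16), global size (352, 288),
// i.e. one work-group per 16x16 block. Standalone: not meant to share a file
// with the kernel above.
#define BLOCK_SIZE 16
#define WIDTH 352
#define HEIGHT 288
#ifndef SEARCH_WINDOW_SIZE
#define SEARCH_WINDOW_SIZE 16            // assumed value, not from the original code
#endif
#define RANGE (2 * SEARCH_WINDOW_SIZE)   // candidate positions per axis
#define TILE (BLOCK_SIZE + RANGE)        // side of the cached reference area
#define NUM_WI (BLOCK_SIZE * BLOCK_SIZE) // work-items per work-group

__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void motion_estimation_local(const __global uchar *frame_original,
                             const __global uchar *frame_ref,
                             __global int2 *mv_output)
{
    __local uchar cur[NUM_WI];       // current 16x16 block
    __local uchar ref[TILE * TILE];  // reference area around the block
    __local uint best_sad[NUM_WI];   // per-work-item best SAD
    __local int best_idx[NUM_WI];    // per-work-item best candidate index

    const int lx = (int)get_local_id(0);
    const int ly = (int)get_local_id(1);
    const int lid = ly * BLOCK_SIZE + lx;
    const int block_x = (int)get_group_id(0) * BLOCK_SIZE;
    const int block_y = (int)get_group_id(1) * BLOCK_SIZE;
    const int origin_x = block_x - SEARCH_WINDOW_SIZE; // top-left of the area
    const int origin_y = block_y - SEARCH_WINDOW_SIZE;

    // Cooperative load: neighbouring work-items read neighbouring bytes.
    cur[lid] = frame_original[(block_y + ly) * WIDTH + block_x + lx];
    for (int i = lid; i < TILE * TILE; i += NUM_WI) {
        // Clamp so that no read leaves the frame; clamped pixels are only
        // touched by candidates that are skipped below.
        int x = clamp(origin_x + i % TILE, 0, WIDTH - 1);
        int y = clamp(origin_y + i / TILE, 0, HEIGHT - 1);
        ref[i] = frame_ref[y * WIDTH + x];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item tests candidates lid, lid + NUM_WI, ... in raster order.
    uint my_sad = UINT_MAX;
    int my_idx = INT_MAX;
    for (int c = lid; c < RANGE * RANGE; c += NUM_WI) {
        int dx = c % RANGE, dy = c / RANGE;
        int x = origin_x + dx, y = origin_y + dy;
        // Skip candidates that are not entirely inside the frame.
        if (x < 0 || y < 0 || x + BLOCK_SIZE > WIDTH || y + BLOCK_SIZE > HEIGHT)
            continue;
        uint sum = 0;
        for (int m = 0; m < BLOCK_SIZE; m++)
            for (int n = 0; n < BLOCK_SIZE; n++)
                sum += abs_diff(cur[m * BLOCK_SIZE + n], ref[(dy + m) * TILE + dx + n]);
        if (sum < my_sad) { my_sad = sum; my_idx = c; }
    }
    best_sad[lid] = my_sad;
    best_idx[lid] = my_idx;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: smallest SAD wins, ties go to the lower candidate index.
    for (int s = NUM_WI / 2; s > 0; s >>= 1) {
        if (lid < s) {
            uint other_sad = best_sad[lid + s];
            int other_idx = best_idx[lid + s];
            if (other_sad < best_sad[lid] ||
                (other_sad == best_sad[lid] && other_idx < best_idx[lid])) {
                best_sad[lid] = other_sad;
                best_idx[lid] = other_idx;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 writes the motion vector (zero if no candidate was valid).
    if (lid == 0) {
        int2 mv = (int2)(0, 0);
        if (best_sad[0] != UINT_MAX)
            mv = (int2)(best_idx[0] % RANGE - SEARCH_WINDOW_SIZE,
                        best_idx[0] / RANGE - SEARCH_WINDOW_SIZE);
        mv_output[get_group_id(1) * get_num_groups(0) + get_group_id(0)] = mv;
    }
}
```

With the assumed search range of 16 this needs about 4.6 KB of local memory per work-group (256 + 48*48 bytes of pixels plus two 1 KB reduction arrays). Each reference pixel is read from global memory once per block instead of once per candidate, and the cooperative load also helps with point 1, because consecutive work-items read consecutive bytes. I have not measured whether this is actually faster.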

Some people have already done this with CUDA, but the papers do not give details of the code or the kernels.

See these:


Any suggestions about optimizing ME would be appreciated.

Discussion is welcome; I am an OpenCL beginner.

Thank you!