Code optimization: how to optimize my kernel?

Hello All.

I’m trying to use CUDA to solve my task: I have a video stream, and on every frame from the stream I have to find a reference fragment within some frame area. I tried to explain it graphically in the attachment.

The algorithm is quite simple: I have two buffers of the same size (the reference fragment and a fragment from the video frame), and I build an array of sums of pixel-to-pixel differences. The sum in the array that is closest to zero indicates the best match between the two fragments. The index of that sum gives me the fragment’s position on the current video frame.
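For clarity, here is a minimal host-side sketch of what I compute for one candidate position (just an illustration with made-up names, assuming 8-bit grayscale frames stored row-major):

// Illustrative host-side sketch (not my kernel): sum of absolute differences
// between the reference fragment and one candidate position on the video frame.
int fragmentSAD( const unsigned char* pReference,   // reference frame, row-major
                 const unsigned char* pFrame,       // current video frame, row-major
                 int frameWidth,
                 int refX, int refY,                // fragment position on the reference frame
                 int candX, int candY,              // candidate position on the video frame
                 int rgnWidth, int rgnHeight )      // fragment size
{
    int sum = 0;
    for( int y = 0; y < rgnHeight; y++ )
        for( int x = 0; x < rgnWidth; x++ )
        {
            int d = pReference[ (refY + y) * frameWidth + refX + x ]
                  - pFrame[ (candY + y) * frameWidth + candX + x ];
            sum += ( d < 0 ) ? -d : d;
        }
    return sum;    // the candidate position with the smallest sum is the best match
}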

First I tried to optimize the algorithm that computes the sum of pixel-to-pixel differences of two buffers; then I added a “sliding” of the second buffer to get all the sums over some region.

First of all I put the reference and camera frames into global memory on my card. The array of sums I also allocate on the card. The block size (threads per block) I set to the fragment width. The grid size I set to fragment height × sums count (sums count = search-region width × search-region height). Why so? I tried the maximum block size (1024 threads), but performance got worse… I think this is because of the atomicAdd calls on shared memory in each thread.
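To make the launch configuration concrete, this is roughly how I invoke the kernel (a sketch of the description above; the d_-prefixed pointers are my device allocations and the names are illustrative):

// sketch of the launch configuration described above (names are illustrative)
int sumsCount = searchRgnWidth * searchRgnHeight;   // one sum per candidate position
dim3 block( RgnWidth );                             // threads per block = fragment width
dim3 grid( RgnHeight * sumsCount );                 // blocks = fragment height * sums count

// pSum must be zeroed before the launch, because the kernel accumulates into it
cudaMemset( d_pSum, 0, sumsCount * sizeof(int) );

kernel_MultiplyAndSum<<<grid, block>>>( d_pFirstFrame, d_pCurrentFrame,
                                        FrameWidth, FrameHeight,
                                        XStartPoint, YStartPoint,
                                        RgnWidth, RgnHeight,
                                        searchLeft, searchTop,
                                        searchRgnWidth, searchRgnHeight,
                                        d_pSum );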

At the moment the disadvantages of my kernel code are:

  1. Search region and fragment size are limited, because of the shared memory per block limit (48 KB): I use shared memory to store the sum of each row’s pixel-to-pixel differences.

  2. I have to decrease the kernel time. Currently, with a 30×30 fragment and a 20×20 search region (400 sums), the results are: debug mode ~5.5 ms, release mode ~2.8 ms. I need to cut the time in half…

Here is my code:

__global__ void kernel_MultiplyAndSum( unsigned char* pFirstFrame,   // reference frame, global memory
                                       unsigned char* pCurrentFrame, // video frame, global memory
                                       int FrameWidth,
                                       int FrameHeight,
                                       int XStartPoint,     // fragment position on reference frame
                                       int YStartPoint,     // fragment position on reference frame
                                       int RgnWidth,        // fragment size on reference frame
                                       int RgnHeight,       // fragment size on reference frame
                                       int searchLeft,      // search region position on video frame
                                       int searchTop,       // search region position on video frame
                                       int searchRgnWidth,  // search region size on video frame
                                       int searchRgnHeight, // search region size on video frame
                                       int* pSum )          // results of all calculations: array of sums, global memory
{
    __shared__ int rowSum[12100];   // shared memory for all row sums

    // calculate pixel shifts from the thread index and block index
    int y1 = blockIdx.x;
    int x1 = threadIdx.x;
    int FramePosIndex = blockIdx.x / RgnHeight;
    int y = blockIdx.x - FramePosIndex * RgnHeight;
    int x = threadIdx.x;
    int yStart = searchTop + (y1 / RgnHeight) / searchRgnWidth;
    int xStart = searchLeft + FramePosIndex - searchRgnWidth * (FramePosIndex / searchRgnWidth);

    rowSum[y1] = 0;                 // initialize the current row sum
    __syncthreads();

    int r = pFirstFrame[ (YStartPoint + y) * FrameWidth + XStartPoint + x ]
          - pCurrentFrame[ (yStart + y) * FrameWidth + xStart + x ];

    // we need the absolute value of the difference
    if( r < 0 )
        r = -r;

    atomicAdd( &rowSum[y1], r );    // atomically add the value to the row sum
    __syncthreads();

    if( x1 == 0 )
    {
        atomicAdd( &pSum[FramePosIndex], rowSum[y1] );  // add the row sum to the frame's result sum
    }
}

Video card: GTX 570. CUDA version 4.0, Visual Studio 2008.

Maybe one of the gurus here can help me optimize the code…

Thanks for your attention.

algorithm.bmp (665 KB)

I haven’t looked closely at the code, but I think I’ve got three suggestions:

    Reuse the data you are loading. If it fits, place the fragment in shared memory. If not, fragment the fragment to make it fit. :smile: For every pixel of the image that you load, do multiple comparisons with different fragment locations.

    Replace the atomic addition with a reduction scheme. First sum for each thread individually, then do a reduction in shared memory (see the sketch after this list).

    Use the __sad()/__usad() intrinsics for the sum of absolute differences.
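To illustrate the second suggestion, here is a minimal, generic sketch of the pattern (not your kernel: the names are illustrative, it assumes a power-of-two block size and a launch with blockDim.x * sizeof(int) bytes of dynamic shared memory):

// Sketch of a per-block reduction replacing atomicAdd() (assumes blockDim.x
// is a power of two; partialSum[] holds blockDim.x ints of dynamic shared memory).
__global__ void reduceExample( const int* pPerThreadValues, int* pBlockResult )
{
    extern __shared__ int partialSum[];

    int tid = threadIdx.x;
    // 1) each thread first accumulates its own value(s) into a register
    int mySum = pPerThreadValues[ blockIdx.x * blockDim.x + tid ];
    partialSum[tid] = mySum;
    __syncthreads();

    // 2) tree reduction in shared memory instead of atomics
    for( int stride = blockDim.x / 2; stride > 0; stride >>= 1 )
    {
        if( tid < stride )
            partialSum[tid] += partialSum[tid + stride];
        __syncthreads();
    }

    // 3) one thread writes the block's result
    if( tid == 0 )
        pBlockResult[blockIdx.x] = partialSum[0];
}

In your kernel the per-thread value would be the absolute difference (or row sum) each thread already computes, and the reduction would replace both atomicAdd() calls.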

tera, thanks for your answer. After reading it I have a few more questions:

1. What do you mean by “reusing the data”? The reference frame I load once for the whole program execution. The camera frame I have to refresh at the camera frame rate (as quickly as possible). Copying the camera frame takes only a small amount of time compared to the kernel execution time.

2. Currently one thread performs one pixel-to-pixel subtraction and atomically adds it to the corresponding row sum. Only one thread per block does the atomic add of the corresponding row sum to the corresponding frame sum in global memory.

3. Could you give me the definition of __sad()/__usad()? Unfortunately my Visual Studio doesn’t show me the definition of any function mentioned in a *.cu module (maybe I have to configure it, but I don’t know how).

Once again, thanks for any answer/advice.

I meant reusing the data after loading it from device memory into registers or shared memory. I fully expected you to copy the data onto the device only once. :smile:

Check the good old Programming Guide. It’s in appendix C.2.3 of the 4.0 version.
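From memory, the relevant signatures look roughly like this (please double-check them against the guide):

// From the Programming Guide (see appendix C.2.3 for the exact wording):
// both compute |x - y| + z, i.e. an absolute difference accumulated onto z.
unsigned int __sad ( int x,          int y,          unsigned int z );
unsigned int __usad( unsigned int x, unsigned int y, unsigned int z );

// typical use for a sum of absolute differences over a row:
// sum = __usad( a[i], b[i], sum );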

Thanks. I tried __usad(), but there was no time benefit.

I have made some changes to reduce the number of atomicAdds. I made the block size equal to the fragment height and the grid size equal to the search region dimensions. I also moved the row-sum calculation into each thread (for(i = 0 … fragment width)). Now it works 5 times faster. Here is the code:

__global__ void kernel_SubtractAndSum( unsigned char* pFirstFrame,
                                       unsigned char* pCurrentFrame,
                                       int FrameWidth,
                                       int FrameHeight,
                                       int XStartPoint,
                                       int YStartPoint,
                                       int RgnWidth,
                                       int RgnHeight,
                                       int searchLeft,
                                       int searchTop,
                                       int searchRgnWidth,
                                       int searchRgnHeight,
                                       int* pSum )
{
    int FramePosIndex = blockIdx.x;                 // one block per candidate position
    int RowIndex = threadIdx.x;                     // one thread per fragment row
    int yStart = searchTop + blockIdx.x / searchRgnWidth + RowIndex;
    int xStart = searchLeft + FramePosIndex - searchRgnWidth * (FramePosIndex / searchRgnWidth);
    int RowSum = 0;

    // each thread accumulates the sum of absolute differences of its row
    for( int i = 0; i < RgnWidth; i++ )
        RowSum = __usad( pFirstFrame[ (YStartPoint + RowIndex) * FrameWidth + XStartPoint + i ],
                         pCurrentFrame[ yStart * FrameWidth + xStart + i ],
                         RowSum );

    atomicAdd( &pSum[FramePosIndex], RowSum );      // accumulate the row sums of this candidate
}

That code looks a lot better! __usad() probably had no influence because the code is still memory-bandwidth limited.

There are still quite a few opportunities for optimization though:

    Note your memory accesses are completely uncoalesced, as you use threadIdx.x as the y-index and loop over the x-index into the image. While that is less of an issue on compute capability 2.x devices due to their cache, it still has a noticeable impact on performance.

    Even though the cache helps a lot, doing the atomicAdd() in shared memory for a block and then having one thread do an atomicAdd() in global memory should still be faster. Even better would be to use a reduction scheme.

    In the next step, don’t invoke a separate kernel for each searchTop and searchLeft value, as kernel invocations are relatively expensive.

    If you loop over the different offsets inside a single block, you can achieve a good memory access pattern with locality in the memory accesses, maximizing the benefit of the cache. You can even take advantage of some data reuse explicitly by keeping values in registers or shared memory (a rough sketch of this structure follows after this list).

    Once the kernel isn’t memory-bandwidth limited anymore, throwing in a #pragma unroll 16 (or even larger) before the innermost loop should speed up the kernel further.
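To make the coalescing and offset-loop points more concrete, here is a rough, untested sketch of the structure I have in mind. It is not your code: the names and indexing are illustrative, it assumes one block per candidate y-offset (gridDim.x == searchRgnHeight), a power-of-two block size at least as large as the fragment width, and a launch with blockDim.x * sizeof(int) bytes of dynamic shared memory.

__global__ void kernel_SadSearchSketch( const unsigned char* pFirstFrame,
                                        const unsigned char* pCurrentFrame,
                                        int FrameWidth,
                                        int XStartPoint, int YStartPoint,
                                        int RgnWidth, int RgnHeight,
                                        int searchLeft, int searchTop,
                                        int searchRgnWidth,
                                        int* pSum )   // searchRgnWidth * searchRgnHeight results
{
    extern __shared__ int partial[];   // blockDim.x ints; blockDim.x = power of two >= RgnWidth

    int tx = threadIdx.x;              // x inside the fragment -> coalesced loads along rows
    int dy = blockIdx.x;               // candidate y-offset; launch with gridDim.x == searchRgnHeight

    for( int dx = 0; dx < searchRgnWidth; dx++ )   // all x-offsets handled by the same block
    {
        int mySum = 0;
        if( tx < RgnWidth )
            for( int y = 0; y < RgnHeight; y++ )
            {
                unsigned int ref = pFirstFrame  [ (YStartPoint + y)    * FrameWidth + XStartPoint + tx ];
                unsigned int cur = pCurrentFrame[ (searchTop + dy + y) * FrameWidth + searchLeft + dx + tx ];
                mySum = __usad( ref, cur, mySum );  // repeated reference reads mostly hit the cache
            }

        // tree reduction in shared memory instead of atomicAdd()
        partial[tx] = mySum;
        __syncthreads();
        for( int stride = blockDim.x / 2; stride > 0; stride >>= 1 )
        {
            if( tx < stride )
                partial[tx] += partial[tx + stride];
            __syncthreads();
        }
        if( tx == 0 )
            pSum[dy * searchRgnWidth + dx] = partial[0];
        __syncthreads();               // don't overwrite partial[] before everyone is done
    }
}

With this layout, consecutive threads read consecutive bytes from both frames (coalesced), the repeated reads of the reference fragment mostly come from the cache, and the only global write per candidate is a single store by thread 0.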

Thanks a lot for your advice. It really helps me.

In order of your suggestions:

As far as I understood, memory coalescing can be achieved by accessing global memory in blocks of 4 bytes (for example), with all accesses properly aligned. I ran into a problem: if I try to read a 4-byte block but the start address of the read is not aligned to 4 bytes, I have to read the two neighbouring aligned 4-byte blocks that contain the bytes I need. That is why my code became more complex and bigger. I made it work, but I’m not sure it is still well optimized, because of all the extra math, “if” statements and loops. Anyway, it gave me approximately a 200% time benefit.

I removed all atomicAdds on global memory. That gave me approximately a 7-9% time benefit.

I still post my code here. Maybe there will be some more advice…

extern "C" __global__ void kernel_MultiplyAndSum(  unsigned char* pFirstFrame,

									    unsigned char* pCurrentFrame,

									    int FrameWidth,

										int FrameHeight,

										int XStartPoint,//i set it in main func in a way to align start address to 4

										int YStartPoint,//i set it in main func in a way to align start address to 4

										int RgnWidth,

										int RgnHeight,

										int searchLeft,

										int searchTop,

										int searchRgnWidth,

										int searchRgnHeight,

										int* pSum )

{

	__shared__ int BlockSum; //since one block makes caculation of one full fragment to fragment correlation, i put thread result to shared mem

	BlockSum = 0;

	__syncthreads();

	int FramePosIndex = blockIdx.x;

	int RowIndex = threadIdx.x;

	int yStart = searchTop + (FramePosIndex)/searchRgnWidth + RowIndex;

	int xStart = searchLeft + FramePosIndex - searchRgnWidth*(FramePosIndex/searchRgnWidth);

	int RowSum = 0;

	int iFirst;

	int iCurrent;

	int loopCount = RgnWidth/sizeof(int);//row loop size

	unsigned char *pFirst;

	unsigned char *pCurr;

	int mem_shift = (yStart)*FrameWidth + xStart;

	int rem_4 = mem_shift%4;

	int k;

	for( int i = 0; i < loopCount; i++ )

	{

            //here is a lot of calculation to find out memory shifts and

		pFirst = (unsigned char*)pFirstFrame + (YStartPoint+RowIndex)*FrameWidth + XStartPoint;

		iFirst = *((int*)pFirst + i);

		unsigned char* pFirstByte = (unsigned char*)&iFirst;

		if( rem_4 > 0 )

		{

			//start pos not aligned so we need to read 2 4bytes blocks

pCurr = (unsigned char*)pCurrentFrame + mem_shift + ( 4 - rem_4 );

    		        //read first 4byte block

                        iCurrent = *((int*)pCurr + i);

			

			unsigned char* pCurrByte = (unsigned char*)&iCurrent;

			k = 0;

			while( k < rem_4 )

			{

				RowSum = __usad( *(pFirstByte + (4-rem_4)), *pCurrByte, RowSum );

				pFirstByte++;

				pCurrByte++;

				k++;

			}

		

			pCurr = (unsigned char*)pCurrentFrame + mem_shift - rem_4;

    		        //read second 4byte block

                        iCurrent = *((int*)pCurr + i);

		

			pCurrByte = (unsigned char*)&iCurrent;

			while( k < 4 )

			{

				RowSum = __usad( *(pFirstByte-rem_4), *(pCurrByte+rem_4), RowSum );

				pFirstByte++;

				pCurrByte++;

				k++;

			}

		}

		else

		{

                        //start pos is aligned so we need to read only one 4byte block

			pCurr = (unsigned char*)pCurrentFrame + mem_shift;

    		        iCurrent = *((int*)pCurr + i);

			

			unsigned char* pCurrByte = (unsigned char*)&iCurrent;

			k = 0;

			while( k < 4 )

			{

				RowSum = __usad( *(pFirstByte), *pCurrByte, RowSum );

				pFirstByte++;

				pCurrByte++;

				k++;

			}

		}

	}

	atomicAdd( &BlockSum, RowSum );

	__syncthreads();

	if( RowIndex == RgnHeight-1 )

		pSum[FramePosIndex] = BlockSum;

}