Serial op in GPU

Hi,

Following is a sample CPU loop I’ve managed to convert successfully to the GPU. Each thread calculates a different sample (there are between 1000 and 5000 samples).

/// Calculate iStart, iEnd, iOut, iIn and w1 and w2 somehow.....

  for ( int iSample = iStart; iSample < iEnd; iSample++ )
  {
	 pOutA[ iOut + iSample ] += pInA[ iIn + iSample ] * w1 + pInA[ iIn + iSample + 1 ] * w2;
	 pOutB[ iOut + iSample ] += pInB[ iIn + iSample ] * w1 + pInB[ iIn + iSample + 1 ] * w2;
	 pOutC[ iOut + iSample ] += pInC[ iIn + iSample ] * w1 + pInC[ iIn + iSample + 1 ] * w2;
  }

The problem is that I then have to do some post processing on the result. This requires going serially over the pOut arrays, sample by sample, using a sliding window to find the maximum and updating the pOut array at the current position. The code looks something like this (on the CPU!):

for ( int iSample = 0; iSample < NumSamples; iSample++ )
{
	for ( int iSampleWindow = max( iSample - WINDOW_SIZE, 0 ); iSampleWindow < min( NumSamples, iSample + WINDOW_SIZE ); iSampleWindow++ )
	{
	   if ( pNewOut[ iSampleWindow ] > maximum value so far )
		   pNewOut[ iSampleWindow ] = pOutA[ iSampleWindow ] + pOutB[ iSampleWindow ] + pOutC[ iSampleWindow ];
	}
}

Though this post processing looks like a serial job, I’d still like to do it on the GPU, mainly for 2 reasons:

  1. Maybe it will still be faster on the GPU than on the CPU.

  2. More important: the output after this “serial” post processing is many times smaller than without it. Thus if I do it on the GPU, I need to copy a much smaller amount of data back (over PCI) to the CPU, and this will probably save me a lot of time.

Hope I was clear enough :) Any suggestions are more than welcome :)

Actually, here’s a simpler CPU code:

// Note: the loop writes dSumValue[ nSamples ], so dSumValue needs room for nSamples + 1 floats.
memset( dSumValue, 0, nSamples * sizeof(float) );
float* pTemp = pOutA;
for ( int iTime = 1; iTime <= nSamples; iTime++ )
   dSumValue[ iTime ] = dSumValue[ iTime - 1 ] + pTemp[ iTime - 1 ];

Thanks

eyal

Check out the scan example in the SDK. I think this is what you want.
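
To show why a prefix sum like the dSumValue loop above does not have to be serial, here is a minimal CPU-side sketch in C++ of the data-parallel pattern behind a scan. It is not the SDK sample's code: the function names inclusiveScan and prefixSums are made up for this example, and it uses the simple step-doubling (Hillis-Steele) scheme rather than whatever the SDK implements. The point is only that every iteration of the inner loop is independent, so on a GPU each one could be handled by its own thread, with a barrier between passes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive prefix sum using the step-doubling (Hillis-Steele) pattern.
// Each pass reads only from the previous pass's buffer (src) and writes
// to dst, so the inner loop has no loop-carried dependency.
std::vector<float> inclusiveScan(const std::vector<float>& in)
{
    std::vector<float> src(in), dst(in.size());
    for (std::size_t offset = 1; offset < src.size(); offset *= 2)
    {
        // Independent iterations: conceptually one GPU thread per i.
        for (std::size_t i = 0; i < src.size(); ++i)
            dst[i] = (i >= offset) ? src[i] + src[i - offset] : src[i];
        std::swap(src, dst);  // stands in for the barrier between passes
    }
    return src;
}

// Same layout as the dSumValue loop: out[0] = 0 and
// out[i] = in[0] + ... + in[i - 1], so out has in.size() + 1 entries.
std::vector<float> prefixSums(const std::vector<float>& in)
{
    const std::vector<float> inc = inclusiveScan(in);
    assert(inc.size() == in.size());
    std::vector<float> out(in.size() + 1, 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i + 1] = inc[i];
    return out;
}
```

This version does about n log n additions instead of n, which is the usual trade-off of the simple scheme; a work-efficient scan avoids that at the cost of more bookkeeping.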

I haven’t looked at the scan sample from the SDK (so maybe that would solve your problem), but if it doesn’t, you could implement a simple ‘reduction’ (like the SDK sample), only with max(a,b) as the operation instead of a+b.
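
A rough C++ sketch of that idea, run on the CPU just to show the structure. The names maxReduce and windowMax are hypothetical, and windowMax assumes that the post-processing amounts to taking the maximum of an input array over the clamped window [i - window, i + window), which is only one reading of the sliding-window loop above. In both functions the marked loop has independent iterations, so it is the part that would map to GPU threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tree-style reduction with max as the operator (instead of +).
// Each pass folds the upper half of the active range onto the lower half.
float maxReduce(std::vector<float> data)
{
    assert(!data.empty());
    for (std::size_t active = data.size(); active > 1; )
    {
        const std::size_t half = (active + 1) / 2;  // round up for odd sizes
        // Independent iterations: reads come from [half, active),
        // writes go to [0, active - half), so they never overlap.
        for (std::size_t i = 0; i + half < active; ++i)
            data[i] = std::max(data[i], data[i + half]);
        active = half;
    }
    return data[0];
}

// Assumed interpretation: out[i] is the maximum of in over the window
// [i - window, i + window), clamped to the array bounds.
std::vector<float> windowMax(const std::vector<float>& in, std::size_t window)
{
    assert(window > 0);
    std::vector<float> out(in.size());
    // Independent iterations: conceptually one GPU thread per output sample.
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const std::size_t lo = (i >= window) ? i - window : 0;
        const std::size_t hi = std::min(in.size(), i + window);
        out[i] = *std::max_element(in.begin() + lo, in.begin() + hi);
    }
    return out;
}
```

With a small window, the per-sample version is probably the simpler starting point, since each output only reads a handful of neighbouring inputs; the tree reduction matters more when a single maximum over a long range is needed.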

Thanks guys for the input :) I’ll try those out…

eyal