Hi all,

I’m fairly new to CUDA development, but I believe I have a good grasp of how it works.

I have implemented an N-point moving average filter in my system, and I get different results depending on how many threads I use. The code snippets are as follows.

```
/**
* Performs a moving average using as many threads as possible on a single block
*
* The moving average is calculated by first building a running-sum vector (think of it as a CDF) of the input data.
* The moving average result at i is then (sum[i+fSize] - sum[i-fSize-1]) / (2*fSize + 1).
* @param out - Output data or result
* @param in - Input data
* @param size - Input data size
* @param fSize - half window size; the full moving average window is 2*fSize + 1
* @param pMemory - pointer to memory used for in-house calculations; needs to hold 'size' floats.
*                  If NULL, memory will be allocated from the device heap.
*/
__device__ static void movingAverage(float *out, float *in, int size, int fSize, void *pMemory)
{
    int idx = threadIdx.x;
    int i;
    printf("Moving average threadIdx-%i blockDim-%i, size-%i fSize-%i\n", idx, blockDim.x, size, fSize);
    if(fSize <= 0)
    {
        memcpy(out, in, size * sizeof(float));
        return;
    }
    __shared__ float *sum;
    // only thread 0 assigns the shared pointer, so the other threads cannot race on it
    if(idx == 0)
        sum = (pMemory == NULL) ? new float[size] : (float*)pMemory;
    __syncthreads();
    if(sum == NULL)
    {
        printf("Error allocating memory in moving average\n");
        return;
    }
    if(idx == 0)
    {
        // calculates the sum vector; this is sequential so it has to be done by a single thread
        sum[0] = in[0];
        for(i = 1; i < size; i++) sum[i] = sum[i-1] + in[i];
        // performs the start of the filter, where the window is clipped at the left edge
        for(i = 0; i <= fSize; i++) out[i] = sum[i+fSize] / (i + fSize + 1);
    }
    __syncthreads();
    // performs the middle part of the filter, this is the grunt part
    // calculates the number of loops required for each thread
    int loops = (size - fSize*2 - 1) / blockDim.x;
    if((loops * blockDim.x + idx + 1) >= (size - fSize*2))
        loops--;
    __syncthreads();
    for(i = 0; i <= loops; i++)
    {
        int ii = i*blockDim.x + idx + fSize + 1;
        out[ii] = (sum[ii + fSize] - sum[ii - fSize - 1]) / (fSize*2 + 1);
    }
    __syncthreads();
    // performs the end of the filter, where the window is clipped at the right edge
    if(idx == 0)
    {
        for(i = 0; i < fSize; i++) out[size-1-i] = (sum[size-1] - sum[size-i-fSize-2]) / (i + fSize + 1);
    }
    __syncthreads();
    if(pMemory == NULL && idx == 0)
        delete[] sum;
}
```
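For anyone wanting to verify the numbers, an equivalent serial reference of the same edge-clipped filter (written without the prefix-sum trick) would be something like this sketch; it is not part of the kernel, just what I compare against:

```
// serial host-side reference for the same edge-clipped moving average
void movingAverageRef(float *out, const float *in, int size, int fSize)
{
    for(int i = 0; i < size; i++)
    {
        int lo = (i - fSize < 0) ? 0 : i - fSize;               // clip window at the left edge
        int hi = (i + fSize > size - 1) ? size - 1 : i + fSize; // clip window at the right edge
        float s = 0.0f;
        for(int j = lo; j <= hi; j++) s += in[j];
        out[i] = s / (hi - lo + 1);                             // divide by the actual window length
    }
}
```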

I have tested this function in its own kernel numerous times with no problems: the test simply copies the input data to the GPU, runs the filter, then copies the result back from the GPU. That result is always correct.
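For reference, the standalone test harness looks roughly like this (the kernel wrapper and host driver names here are placeholders, and error checking is trimmed):

```
// minimal standalone test: wraps movingAverage() in its own kernel
// (must live in the same translation unit, since movingAverage is static)
__global__ void movingAverageTestKernel(float *out, float *in, int size, int fSize, float *pool)
{
    movingAverage(out, in, size, fSize, pool);
}

// host-side driver (error checking omitted for brevity)
void runTest(const float *hostIn, float *hostOut, int size, int fSize)
{
    float *dIn, *dOut, *dPool;
    cudaMalloc((void**)&dIn,   size * sizeof(float));
    cudaMalloc((void**)&dOut,  size * sizeof(float));
    cudaMalloc((void**)&dPool, size * sizeof(float)); // scratch for the sum vector
    cudaMemcpy(dIn, hostIn, size * sizeof(float), cudaMemcpyHostToDevice);
    movingAverageTestKernel<<<1, 256>>>(dOut, dIn, size, fSize, dPool); // single block
    cudaMemcpy(hostOut, dOut, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut); cudaFree(dPool);
}
```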

I then try to run this function within a kernel that does other things. The relevant snippet is…

```
...
__shared__ float *pMagnitude, *pEnvelope; // working data buffers
__shared__ void *pPool; // remainder of the large global memory pool
// carve the magnitude and smoothed-envelope buffers out of the pool
if(idx == 0)
{
    pMagnitude = (float*)pMemory;
    pMemory = ((float*)pMemory + size);
    pEnvelope = (float*)pMemory;
    pMemory = ((float*)pMemory + size);
    pPool = pMemory;
}
__syncthreads();
if(idx == 0)
{
    // gets the magnitude from the input data
    for(int n = 0; n < size; n++)
    {
        pMagnitude[n] = ... calculation here ...
    }
}
// half moving-average filter length
int Nlx = 2;
__syncthreads();
// performs the moving average
movingAverage(pEnvelope, pMagnitude, size, Nlx, pPool);
__syncthreads();
// copies test data back for debugging
if(blockIdx.x == 0 && idx == 0 && pMemoryTest != NULL)
{
    memcpy(pMemoryTest, pMagnitude, size * sizeof(float));
    memcpy(((float*)pMemoryTest + size), pEnvelope, size * sizeof(float));
}
__syncthreads();
int peakIndex;
float peakAmp;
if(idx == 0)
{
    peakIndex = maxIndex(pEnvelope, size); // gets the index of the maximum value
    peakAmp = pEnvelope[peakIndex];
    printf("peakIndex=%i peakAmp=%f\n", peakIndex, peakAmp);
}
...
```
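For context, pMemory points into one big pool allocated up front on the host. Assuming only these three buffers live in it, the allocation is along these lines (hypothetical, names are placeholders):

```
// hypothetical host-side allocation of the pool: pMagnitude, pEnvelope,
// plus 'size' floats of scratch for movingAverage's sum vector
void *dPool = NULL;
cudaMalloc(&dPool, 3 * size * sizeof(float));
```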

If I launch the kernel with only one thread per block, I always get the correct results.

If I launch the kernel with more than one thread per block, I always get the same incorrect result. The printed peakIndex and peakAmp are wrong, and they don’t even correspond to values that actually occur in pEnvelope. However, the pMemoryTest data that gets copied back to the host is correct. So movingAverage is producing the right result, but when I read that result on the GPU afterwards I get a wrong answer.
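To be concrete, the two launch configurations differ only in block size (the kernel name and arguments below are placeholders):

```
// hypothetical launches; only the block size differs
processKernel<<<numBlocks, 1>>>(dIn, dPool, dMemoryTest, size);   // correct results
processKernel<<<numBlocks, 128>>>(dIn, dPool, dMemoryTest, size); // wrong peakIndex/peakAmp
```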

Does anyone have any insight into this problem? I believe it is something small that I have overlooked.

EDIT: I’m using Windows 7 64-bit, Visual Studio 2010, Tesla C2050, CUDA 4.0.17, driver 270.81, compute architecture 2.0 (sm_20).

Cheers,

Dave