Hi all,

I’m fairly new to CUDA development, but I believe I have a good grasp of how it works.

I have implemented an N-point moving average filter in my system, and I get different results depending on how many threads I use. The code snippets are as follows.

```
/**
* Performs a moving average using as many threads as possible on a single block
*
* The moving average is calculated by first building a running-sum vector (think of it as a CDF) of the input data.
* The moving average result at i is then (sum[i+fSize] - sum[i-fSize-1]) / (2*fSize + 1).
* @param out - Output data or result
* @param in - Input data
* @param size - Input data size
* @param fSize - half window size; the full moving average window is 2*fSize + 1
* @param pMemory - pointer to memory used for in-house calculations; needs to hold 'size' floats.
*                  If NULL, memory will be allocated from the device heap.
*/
__device__ static void movingAverage(float *out, float *in, int size, int fSize, void *pMemory)
{
    int idx = threadIdx.x;
    int i;
    printf("Moving average threadIdx-%i blockDim-%i, size-%i fSize-%i\n", idx, blockDim.x, size, fSize);
    if(fSize <= 0)
    {
        memcpy(out, in, size * sizeof(float));
        return;
    }
    __shared__ float *sum;
    // only thread 0 assigns the shared pointer, so the other threads cannot race on it
    if(idx == 0)
        sum = (pMemory == NULL) ? new float[size] : (float*)pMemory;
    __syncthreads();
    if(sum == NULL)
    {
        printf("Error allocating memory in moving average\n");
        return;
    }
    if(idx == 0)
    {
        // calculates the sum vector; this is sequential so it has to be done by a single thread
        sum[0] = in[0];
        for(i = 1; i < size; i++) sum[i] = sum[i-1] + in[i];
        // performs the start of the filter, where the window is clipped at the left edge
        for(i = 0; i <= fSize; i++) out[i] = sum[i+fSize] / (i + fSize + 1);
    }
    __syncthreads();
    // performs the middle part of the filter, this is the grunt part
    // calculates the number of loops required for each thread
    int loops = (size - fSize*2 - 1) / blockDim.x;
    if((loops * blockDim.x + idx + 1) >= (size - fSize*2))
        loops--;
    __syncthreads();
    for(i = 0; i <= loops; i++)
    {
        int ii = i*blockDim.x + idx + fSize + 1;
        out[ii] = (sum[ii + fSize] - sum[ii - fSize - 1]) / (fSize*2 + 1);
    }
    __syncthreads();
    // performs the end of the filter, where the window is clipped at the right edge
    if(idx == 0)
    {
        for(i = 0; i < fSize; i++) out[size-1-i] = (sum[size-1] - sum[size-i-fSize-2]) / (i + fSize + 1);
    }
    __syncthreads();
    if(pMemory == NULL && idx == 0)
        delete[] sum;
}
```
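For anyone wanting to verify the numbers, an equivalent serial reference of the same edge-clipped filter (written without the prefix-sum trick) would be something like this sketch; it is not part of the kernel, just what I compare against:

```
// serial host-side reference for the same edge-clipped moving average
void movingAverageRef(float *out, const float *in, int size, int fSize)
{
    for(int i = 0; i < size; i++)
    {
        int lo = (i - fSize < 0) ? 0 : i - fSize;               // clip window at the left edge
        int hi = (i + fSize > size - 1) ? size - 1 : i + fSize; // clip window at the right edge
        float s = 0.0f;
        for(int j = lo; j <= hi; j++) s += in[j];
        out[i] = s / (hi - lo + 1);                             // divide by the actual window length
    }
}
```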

I have tested this function in its own kernel numerous times with no problems: the test simply copies the input data to the GPU, runs the filter, then copies the result back from the GPU. That result is always correct.
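For reference, the standalone test harness looks roughly like this (the kernel wrapper and host driver names here are placeholders, and error checking is trimmed):

```
// minimal standalone test: wraps movingAverage() in its own kernel
// (must live in the same translation unit, since movingAverage is static)
__global__ void movingAverageTestKernel(float *out, float *in, int size, int fSize, float *pool)
{
    movingAverage(out, in, size, fSize, pool);
}

// host-side driver (error checking omitted for brevity)
void runTest(const float *hostIn, float *hostOut, int size, int fSize)
{
    float *dIn, *dOut, *dPool;
    cudaMalloc((void**)&dIn,   size * sizeof(float));
    cudaMalloc((void**)&dOut,  size * sizeof(float));
    cudaMalloc((void**)&dPool, size * sizeof(float)); // scratch for the sum vector
    cudaMemcpy(dIn, hostIn, size * sizeof(float), cudaMemcpyHostToDevice);
    movingAverageTestKernel<<<1, 256>>>(dOut, dIn, size, fSize, dPool); // single block
    cudaMemcpy(hostOut, dOut, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut); cudaFree(dPool);
}
```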

I then try to run this function within a kernel that does other things. The relevant snippet is…

```
...
__shared__ float *pMagnitude, *pEnvelope; // working data buffers
__shared__ void *pPool; // remainder of the large global memory pool
// carve the magnitude and smoothed-envelope buffers out of the pool
if(idx == 0)
{
    pMagnitude = (float*)pMemory;
    pMemory = ((float*)pMemory + size);
    pEnvelope = (float*)pMemory;
    pMemory = ((float*)pMemory + size);
    pPool = pMemory;
}
__syncthreads();
if(idx == 0)
{
    // gets the magnitude from the input data
    for(int n = 0; n < size; n++)
    {
        pMagnitude[n] = ... calculation here ...
    }
}
// half moving-average filter length
int Nlx = 2;
__syncthreads();
// performs the moving average
movingAverage(pEnvelope, pMagnitude, size, Nlx, pPool);
__syncthreads();
// copies test data back for debugging
if(blockIdx.x == 0 && idx == 0 && pMemoryTest != NULL)
{
    memcpy(pMemoryTest, pMagnitude, size * sizeof(float));
    memcpy(((float*)pMemoryTest + size), pEnvelope, size * sizeof(float));
}
__syncthreads();
int peakIndex;
float peakAmp;
if(idx == 0)
{
    peakIndex = maxIndex(pEnvelope, size); // gets the index of the maximum value
    peakAmp = pEnvelope[peakIndex];
    printf("peakIndex=%i peakAmp=%f\n", peakIndex, peakAmp);
}
...
```
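For context, pMemory points into one big pool allocated up front on the host. Assuming only these three buffers live in it, the allocation is along these lines (hypothetical, names are placeholders):

```
// hypothetical host-side allocation of the pool: pMagnitude, pEnvelope,
// plus 'size' floats of scratch for movingAverage's sum vector
void *dPool = NULL;
cudaMalloc(&dPool, 3 * size * sizeof(float));
```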

If I launch the kernel with only one thread per block, I always get the correct results.

If I launch the kernel with more than one thread per block, I always get the same incorrect result. The printed peakIndex and peakAmp are wrong, and they don’t even correspond to values that actually occur in pEnvelope. However, the pMemoryTest data that gets copied back to the host is correct. So movingAverage is producing the right result, but when I read that result on the GPU afterwards I get a wrong answer.
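To be concrete, the two launch configurations differ only in block size (the kernel name and arguments below are placeholders):

```
// hypothetical launches; only the block size differs
processKernel<<<numBlocks, 1>>>(dIn, dPool, dMemoryTest, size);   // correct results
processKernel<<<numBlocks, 128>>>(dIn, dPool, dMemoryTest, size); // wrong peakIndex/peakAmp
```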

Does anyone have any insight into this problem? I believe it is something small that I have overlooked.

EDIT: I’m using Windows 7 64-bit, Visual Studio 2010, Tesla C2050, CUDA 4.0.17, driver 270.81, compute architecture 2.0 (sm_20).

Cheers,

Dave