Precision error - CUDA reduction kernel vs CPU code with fabs and difference operations on float


I have written a reduction kernel in which a constant value is subtracted from each element of a huge array and the sum of the absolute differences is taken. The core of the reduction kernel looks like this:

s_flData1[tid] += fabs(pucInBuff1[idx]-dSum1);
s_flData2[tid] += fabs(pucInBuff2[idx]-dSum2);
where ‘pucInBuff1’ and ‘pucInBuff2’ are unsigned char buffers, and ‘dSum1’ and ‘dSum2’ are float values.

The problem is that the sum obtained does not match the one produced by the corresponding CPU code. I think this is due to accumulated precision error. The reduction logic itself seems fine, since I get the expected result when I leave out the ‘fabs’ and difference (‘-’) operations.
Please post some suggestions for solving this issue.

Thanks in advance,

Start by showing the full routines for both the CPU and GPU versions, and clarify what you mean by ‘the sum obtained is not matching.’ You should also state which card and CUDA version you’re using.

The CPU code for my operation is just one ‘for’ loop, as given below.

for ( int itr = 0; itr < nSize3D; itr++, pTemp8++, pSuch8++ )
{
    dSqSum1 += fabs( *pTemp8 - dSum1 );
    dSqSum2 += fabs( *pSuch8 - dSum2 );
}
By ‘sum’, I mean the dSqSum1 and dSqSum2 values from the above code.

To match this, I have written a reduction kernel that performs the same computation.

I am using a Tesla C1060 card with CUDA 2.3.

So you are comparing a sequential accumulation sum to a parallel reduction sum. Floating-point addition is not associative, so the two summation orders are obviously not going to produce identical results. How different are they?