I have written a reduction kernel in which a constant value is subtracted from each element of a huge array and the absolute differences are summed. The code inside the reduction kernel looks like:
s_flData1[tid] += fabs(pucInBuff1[idx]-dSum1);
s_flData2[tid] += fabs(pucInBuff2[idx]-dSum2);
where ‘pucInBuff1’ and ‘pucInBuff2’ are unsigned char buffers and ‘dSum1’ and ‘dSum2’ are float values.
The problem is that the sum obtained does not match the result of the corresponding CPU code. I think this is due to accumulated precision error. The reduction logic itself seems OK, since I get the expected result when I leave out the ‘fabs’ and subtraction (-) operations.
Please post some suggestions to solve this issue.
Start by showing the full routines for both the CPU and the GPU, and clarify what you mean by ‘the sum obtained is not matching.’ You should also state which card and CUDA version you’re using.
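Until the full code is posted, here is a minimal host-side sketch of one possible cause: single-precision addition is not associative, so a left-to-right CPU loop and a tree-ordered parallel reduction can produce different float sums from the same data. This is not the poster's kernel. The function names (absDiff, sumSequentialFloat, sumPairwiseFloat, sumSequentialDouble, nearlyEqual) are made up for illustration, and the recursive pairwise sum is only a stand-in for the summation order of a GPU reduction.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: |v - mean| in single precision, mirroring the
// expression fabs(pucInBuff[idx] - dSum) from the question.
static float absDiff(unsigned char v, float mean) {
    return std::fabs(static_cast<float>(v) - mean);
}

// Left-to-right float accumulation, as a plain CPU loop would do it.
float sumSequentialFloat(const std::vector<unsigned char>& in, float mean) {
    float acc = 0.0f;
    for (unsigned char v : in) acc += absDiff(v, mean);
    return acc;
}

// Pairwise (tree-ordered) float accumulation over the range [lo, hi).
// Assumption: this only imitates the order of a parallel reduction.
float sumPairwiseFloat(const std::vector<unsigned char>& in, float mean,
                       std::size_t lo, std::size_t hi) {
    if (hi == lo) return 0.0f;
    if (hi - lo == 1) return absDiff(in[lo], mean);
    std::size_t mid = lo + (hi - lo) / 2;
    return sumPairwiseFloat(in, mean, lo, mid) +
           sumPairwiseFloat(in, mean, mid, hi);
}

// Double-precision accumulation of the same float terms, used as a reference.
double sumSequentialDouble(const std::vector<unsigned char>& in, float mean) {
    double acc = 0.0;
    for (unsigned char v : in) acc += static_cast<double>(absDiff(v, mean));
    return acc;
}

// Relative-tolerance comparison; exact equality is rarely a fair test
// when two float sums were accumulated in different orders.
bool nearlyEqual(double a, double b, double relTol) {
    double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= relTol * scale;
}
```

As a usage example, filling a buffer with 2^20 bytes of value 255 and calling all three sums with a mean of 0.0f should, on a standard IEEE-754 build, give a pairwise float sum that agrees with the double reference while the sequential float sum drifts noticeably once the running total exceeds 2^24. If the CPU routine accumulates in float, the GPU result may actually be the more accurate of the two, so comparing both against a double-precision sum with something like nearlyEqual is a fairer check than expecting identical values.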