I have written a reduction kernel in which a constant value is subtracted from each element of a huge array and the sum of the absolute differences is taken. The code inside the reduction kernel looks like:
s_flData1[tid] += fabs(pucInBuff1[idx]-dSum1);
s_flData2[tid] += fabs(pucInBuff2[idx]-dSum2);
where 'pucInBuff1' and 'pucInBuff2' are unsigned char buffers and 'dSum1' and 'dSum2' are float values.
The problem is that the sum obtained does not match the result of the corresponding CPU code. I think this is because of accumulated precision error. The reduction logic seems OK, since I get the expected result when I leave out the 'fabs' and subtraction (-) operations.
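To make the suspected effect concrete, here is a minimal host-side C sketch, not the actual kernel or CPU code. The function names are hypothetical, and it assumes the CPU version accumulates sequentially in float while the GPU reduction adds partial sums in a tree-like order. Float addition is not associative, so the two orders can round differently once the terms have fractional parts and the running sum gets large; the double-precision version can serve as a reference for both.

```c
#include <math.h>
#include <stddef.h>

/* Sequential float accumulation, as a simple CPU loop might do it
   (assumption: the real CPU code is not shown in the post). */
float sum_abs_diff_float(const unsigned char *buf, size_t n, float mean)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += fabsf((float)buf[i] - mean);
    return acc;
}

/* Same sum accumulated in double, usable as a reference value. */
double sum_abs_diff_double(const unsigned char *buf, size_t n, float mean)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += fabs((double)buf[i] - (double)mean);
    return acc;
}

/* Pairwise (tree-order) float summation, roughly the order in which a
   parallel reduction combines partial sums. */
float sum_abs_diff_pairwise(const unsigned char *buf, size_t n, float mean)
{
    if (n == 0)
        return 0.0f;
    if (n == 1)
        return fabsf((float)buf[0] - mean);
    size_t half = n / 2;
    return sum_abs_diff_pairwise(buf, half, mean)
         + sum_abs_diff_pairwise(buf + half, n - half, mean);
}
```

Running all three on the same input and comparing each float result against the double result should show whether the mismatch is only a matter of summation order. That would also fit the observation that the sums agree without the subtraction: sums of small integers are exact in float up to 2^24, so the order of addition does not matter there.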
Please post some suggestions to solve this issue.
Thanks in advance,