Accuracy problem - I'd even say an inaccuracy problem ...

I have the following expression being evaluated on both CPU and GPU (in order to be able to compare results):

(x + 0.00915003f) * (-2.15891f + 4.15652f) * ADF0(-6.04954f), where ADF0 is a function of one argument arg0 (described in C for clarity):

float ADF0(float arg0)
{
    return arg0 / (arg0 - arg0 * arg0 / arg0) - arg0 * (arg0 - arg0 * arg0 / (arg0 + arg0 - arg0));
}

x has only one possible value: 10.0f. ADF0 is a vacuous function that should simply return infinity for any input value of arg0, because of the division by zero.

Consequently, the correct result of the source expression is -1.#INF for any x (and the CPU produces it successfully).

However, the GPU returns -2.53664e+008.
It is necessary to say that on both CPU and GPU the expression is parsed using reverse Polish notation, not as a direct C expression (say, float fResult = (x + 0.00915003f) * (-2.15891f + 4.15652f) * ADF0(-6.04954f)). So the expression is evaluated in a loop with pushes/pops to/from a stack and the corresponding math operations.

On the GPU, FMADs are eliminated using __fadd_rz and __fmul_rz; the loops are absolutely identical on CPU and GPU. This strange behaviour is not too frequent: as I have lots of different expressions with lots of possible values, I can say that such a significant difference occurs rarely, but it nevertheless occurs.

This fact is very annoying, as I can't rely on the GPU 100% - a couple of times the result from the GPU was not obviously incorrect (like a huge or abnormally small number) but, say, 127 - a number that I can hardly identify as abnormal. What the GPU should actually return is still infinity or NaN.

How is it possible to make sure that the GPU will work correctly? I'd like to point out that arguments like "floating point is not even associative and the sequence of math operations plays its role" are correct but do not apply to my case - the sequence of operations is absolutely the same on both CPU and GPU, and FMADs are disabled.

I’d really appreciate any help on this subject.

Thanks in advance,

My guess is that the non-IEEE compliant single precision division on the GPU is causing problems here. The guide says that division has 2 ulps of error, though I haven’t traced that through the ADF0 function to see how big an error that translates into. It certainly means that (arg0 - arg0 * arg0/arg0) != 0.0f in general.

Double precision division is IEEE-compliant on the new GTX 280/260 cards, although effectively 8 times slower.

I also think that division is the problem … but when I try to repeat the calculations that produced bad results on the GPU, I get correct results.

To illustrate this I did the following: I ran a simple kernel whose only task is to return the results of all intermediate steps of the overall calculation (v1 = arg0/arg0; v2 = arg0*v1; v3 = arg0 - v2, etc.). Examination of all these v1…vi gives the correct result of -1.#INF. Only during the actual run, with lots of threads and blocks, does the crucial floating-point error sometimes appear for some values.

My guess was FMADs - but eliminating them changes nothing; the results are still sometimes incorrect … Simplifying the input values (making all of them integers and eliminating division altogether) solves the problem - so the algorithm in general, and the kernel in particular, looks correct.

For my task the overall estimated speedup on a GTX 280 compared to a C2Q 9650 is about 5-6 times, which means the double precision of the GTX 280 can hardly be utilized.

Did you take into account that you have 140 GB/s of bandwidth on the GTX 280? When you need to process more data than fits in the CPU cache, you have much lower bandwidth on the CPU.

Also, you might only need to convert a small part of your algorithm to double.

I’m not sure how double support works in CUDA 2.0 … Will it be possible to do all math in float, but each time I encounter a division, do something like this?

double dTemp1 = (double)fVal1;
double dTemp2 = (double)fVal2;
double dTempResult = dTemp1 / dTemp2;
float fResult = (float)dTempResult;

If so, this may help … on the other hand, I have no idea how this division wrapper will affect performance.
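A sketch of such a wrapper as a plain C function (on the GPU it would be a __device__ function; the name div_via_double is made up here for illustration):

```c
/* Promote to double, divide, demote back to float. On GT200 the
 * double-precision division is IEEE-compliant, so a denominator that
 * is exactly zero reliably produces infinity instead of a ulp-sized
 * garbage value. */
float div_via_double(float fVal1, float fVal2)
{
    double dTemp1 = (double)fVal1;
    double dTemp2 = (double)fVal2;
    double dTempResult = dTemp1 / dTemp2;
    return (float)dTempResult;
}
```

Note that the demotion back to float can itself round, but special values such as infinity survive the cast unchanged.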

Yes, this should work. (though a peek at the PTX output to confirm it would be good)

Edit: Double precision operations are handled by a separate unit in each multiprocessor, which is why you take a factor of 8 hit in performance. Everything else, including single precision operations and index calculations, will continue to be handled by the normal stream processors.

Yeah, I think that in practice it will be a bit like hiding global memory latency: the ‘long-latency’ double calculations will be hidden by the calculations on the ‘normal’ SPs, as long as there are not too many of them, if I understand the architecture correctly. That, I think, is the biggest benefit of having separate units rather than doing doubles on the normal processors.