Is there a difference between GPU double precision and CPU double precision?

Hi folks,

I noticed a weird issue while testing a CUDA implementation of a numerical algorithm. It seems like the architecture's precision is less than normal double precision. (Before you ask: I HAVE a compute capability 1.3 device and I DID compile the code with -arch sm_13.) The algorithm does a few numerically critical steps, like a finite-difference discretization of a PDE, so the computation has to handle quite big values (approx. 1e+7) and quite small values (approx. 1e-7), and additionally has to iterate until a given precision has been reached.
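To make the setup concrete, here is a minimal, self-contained sketch (not my real code; all names and values are made up) of the kind of loop I mean: a sweep, then a norm, then a termination test against a tolerance. It's the iteration count of a loop like this that changes between runs on the GPU.

```
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int    n   = 100;            // interior grid points
    const double h   = 1.0 / (n + 1);  // mesh width
    const double tol = 1e-8;           // the "given precision"

    std::vector<double> u(n + 2, 0.0), unew(n + 2, 0.0), f(n + 2, 1.0);

    int it = 0;
    double diff;
    do {
        diff = 0.0;
        for (int i = 1; i <= n; ++i) {
            // one Jacobi sweep for -u'' = f with zero boundary values
            unew[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
            diff   += (unew[i] - u[i]) * (unew[i] - u[i]);  // order-dependent accumulation
        }
        u.swap(unew);
        ++it;
    } while (std::sqrt(diff) > tol && it < 1000000);

    std::printf("terminated after %d iterations\n", it);
    return 0;
}
```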

I have successfully translated the source code to CUDA… and now the issue:

If the algorithm has to do many operations before the precision-based termination, it iterates much longer than when executed with fewer operations… the accumulated computation error is much bigger. If I reduce the range of values (e.g. from 1e±7 to 1e±3), it needs fewer iterations to terminate. AND this effect changes from execution to execution: sometimes the algorithm needs, let's say, 10 iterations… another time 100 iterations, with the same value range! How could this happen?

The same algorithm implemented serially in “classic” C++ doesn’t act like that. It always takes the same number of iterations, as it should.

To cut a long story short: is GPU computation less accurate? How could the iteration count change that drastically with exactly the same starting conditions?

x87 floating point registers are 80-bit, and values will be treated as 80-bit until you copy the result back to non-x87 registers, at which point they are rounded to 64-bit. There are some compiler options to deal with this (I think if you force floating point through SSE it will always be 64-bit), but they vary from compiler to compiler.
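As a small illustration (gcc on x86 assumed; the flags are gcc-specific and the behaviour varies across compilers and platforms), here is a case where the result depends on whether an intermediate stays in an 80-bit x87 register or gets rounded to a 64-bit double:

```
// 1e16 is larger than 2^53, so the "+ 1.0" is lost once the sum is rounded to
// a 64-bit double, but it can survive if the intermediate stays in an 80-bit
// x87 register.  Typical gcc invocations to compare (gcc-specific flags):
//   g++ -O2 -mfpmath=387 x87demo.cpp        -> may print 1
//   g++ -O2 -mfpmath=sse -msse2 x87demo.cpp -> should print 0
// -ffloat-store is another commonly suggested workaround.
#include <cstdio>

volatile double a = 1e16, b = 1.0;

int main() {
    double s = (a + b) - a;   // result depends on the intermediate precision
    std::printf("%g\n", s);
    return 0;
}
```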

Hmmm… so what’s your point? Are you saying that double computation on the CPU is done internally with higher precision, which is stripped when moving from the computation register to a double variable (or pointer), while the GPU always does the computation in 64-bit registers? Sounds like GPU double precision is a kind of fake ;-)

…but that would explain the behaviour. Do you know a link or some literature that I can reference? (I need that for my diploma thesis.)

The GPU double precision is IEEE-754 compliant. There isn’t anything fake about it. If anything, the GPU can be a little more accurate than 64-bit floating point on the CPU, because the GPU can do fused multiply-adds without intermediate rounding. Some of the math library functions have slightly inferior ULP accuracy compared to reference CPU C standard libraries, but the differences are predictable and explainable. I think 386-style 80-bit FPU arithmetic has been deprecated in the x86_64 instruction set in favour of SSE, but IA-32 can and will still do 80-bit internal arithmetic with final rounding (incidentally, that 80-bit FPU was the source of the infamous Pentium FDIV bug).
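To make the fused multiply-add point concrete, here is a small host-side illustration using the standard fma() function; on compute 1.3 hardware the compiler can emit the equivalent fused double-precision instruction for a*b+c patterns.

```
#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0 + std::ldexp(1.0, -27);   // 1 + 2^-27, exactly representable
    double separate = a * a - 1.0;           // product rounded to 64 bit, then subtract
    double fused    = std::fma(a, a, -1.0);  // exact product, one single rounding
    std::printf("separate: %.20e\nfused:    %.20e\n", separate, fused);
    return 0;
}
```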

You should keep in mind that floating point arithmetic isn’t associative. Even executing the same algorithm, to the same precision, on the same machine can produce different results if the order of operations is changed. If you break an operation into a series of data-parallel partial operations (like a summation across the 30 multiprocessors of a C1060, for example), you can get a different answer than if you do the summation serially. When and how rounding is applied can have a noticeable effect on final results, depending on the magnitude of the quantities and the operations involved.
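A toy illustration of the ordering point (the data and the 30-way split are arbitrary, purely for demonstration):

```
#include <cstdio>
#include <vector>

int main() {
    // a mix of large (1e+7) and small (1e-7) values, as in the original problem
    std::vector<double> x;
    for (int i = 0; i < 3000; ++i)
        x.push_back((i % 2 ? 1.0e-7 : 1.0e+7) * (1.0 + 1e-15 * i));

    // serial accumulation
    double serial = 0.0;
    for (int i = 0; i < (int)x.size(); ++i) serial += x[i];

    // 30 partial sums, then a final reduction (roughly what a 30-SM GPU might do)
    const int nparts = 30;
    double parts[nparts] = {0.0};
    for (int i = 0; i < (int)x.size(); ++i)
        parts[i % nparts] += x[i];
    double parallel = 0.0;
    for (int p = 0; p < nparts; ++p) parallel += parts[p];

    std::printf("serial:   %.17g\nparallel: %.17g\ndiff:     %g\n",
                serial, parallel, serial - parallel);
    return 0;
}
```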

I have already read about that and have modified the code so that the computation order is exactly the same as in the serial version. There’s just one norm computation that could be done in a different order, but that one alone cannot possibly cause such a big error.

Does anyone know why the error varies from run to run by a factor of 10 without the starting values changing?

If you mean you are running the same code on the GPU, with the same initial conditions, and getting an order of magnitude difference in the result, then you have a program problem, not a floating point problem.

What is the typical run-time of one kernel? Would it be close to 5 seconds?

I have an extra NVIDIA card configured without a kernel execution timeout. Some set-valued analysis has been done with kernel execution times of about 5-6 minutes. I also checked the computation with thousands of debug outputs. The intermediate results are the same as in the serial execution, but only to within about ±1e-13. So even though it’s always possible, maybe there is no failure in the program. It’s really weird.

If you moved from x87 to a 64-bit register after every floating point operation, you would get identical (or mostly identical) results. If you used SSE, same thing. If you used a different architecture that uses 64-bit floating point (PPC does, I assume?) instead of 80-bit, same thing. So no, it’s not fake at all; it’s just that x87 is not the be-all and end-all of floating point support (especially since, if you wanted decent performance from your CPU, you’d be using SSE anyway, and that’s 64-bit).

But avidday is right, and you’re far more likely to have a bug in your app than to just be seeing floating point differences.

Surely possible ;-).

Another idea: during the algorithm the data is moved back and forth between global memory, shared memory and registers a lot. One iteration step consists of a few kernel executions because of a synchronisation issue with the large-data computation. So there are about 40-50 kernel executions, each reading data from global memory, computing stuff and writing back to global memory, before the precision test is done. Could THAT harm the values?

Nope. There must be a bug somewhere in your code. Floating point is non-associative and the order of computations varies, so small differences between CPU (SSE, -ffloat-store in gcc, whatever you use to enforce 64-bit) and GPU are expected. But I have never seen cancellation severe enough to cause the behaviour you describe.

If you are asking whether the act of writing and re-writing numbers in and out of memory can subtly change their values, then the answer is obviously no.

If you are asking whether a subtle correctness problem in parallel, iterative calculations can affect the results, well, of course. And given that the execution order of parallel calculations within a given CUDA kernel is non-deterministic and will change from compile to compile or run to run, that sort of correctness problem could, theoretically, produce different errors from apparently identical inputs.

The answer is in your code. Only you amongst us have seen it. Only you amongst us can fix it. The obvious things to do are:

  • look at every math library function you use and check its ULP accuracy, range and domain, and make sure you aren’t assuming something you shouldn’t
  • build your code without optimization and check the results, to make sure you are not facing either a compiler bug or an optimization side effect
  • use something like GPU Ocelot or valgrind to double-check you don’t have any out-of-bounds memory addressing issues.

Double precision floating point in CUDA has been around for 18 months, and a lot of people have done a lot of serious crunching with it in that time. If there were something as obviously wrong as what you are implying, it would be public knowledge by now.

Is there another compiler switch than -O0 I can try?

Thanks for all the answers so far. Let me annoy you with one last question: does the compiler do any algebraic transformations on the terms? When I write a = b*(c+d), does it compile it like that, or does it transform it into something like a = b*c + b*d in order to use MAD operations? And if so, how can I disable that?

There are some intrinsics that start with “__f”… I think __fadd_rn() and __fmul_rn() – which will make sure MAD optimizations are NOT performed… Check the programming guide…
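A sketch of what that looks like for double precision, assuming the __dmul_rn()/__dadd_rn() intrinsics described in the guide (they keep the multiply and the add as separately rounded operations, so the compiler can’t contract them into a MAD):

```
// Illustrative kernel (not from the original code): computes a = b*c + d with
// the multiply and the add kept as separate, individually rounded operations,
// so the compiler cannot fuse them into a MAD.
__global__ void no_mad(const double* b, const double* c,
                       const double* d, double* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = __dadd_rn(__dmul_rn(b[i], c[i]), d[i]);
}
```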

We have been asking NVIDIA to provide a #pragma to disable MADs for GPU kernels. Not sure if this was included in 2.3

I salute you… of course it was a programming failure :">

It was noticeable that the instability grew at the junctions of the thread blocks. The algorithm took data from the borders of the adjoining blocks. If the block count was greater than the number of blocks actually executing in parallel, some blocks read already-modified data. A really typical error in CUDA programming.
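For anyone who finds this thread later, a simplified sketch of the mistake and the usual fix (illustrative names, not my actual code):

```
// The broken pattern: an in-place stencil update lets late-scheduled blocks
// read halo values that earlier blocks have already overwritten, so the result
// depends on block scheduling.
__global__ void stencil_in_place_buggy(double* u, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);   // neighbours may already hold new values
}

// The usual fix: read from one buffer, write to another, and swap the pointers
// on the host between launches (a kernel launch boundary is the only grid-wide
// synchronisation point).
__global__ void stencil_double_buffered(const double* u_in, double* u_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        u_out[i] = 0.5 * (u_in[i - 1] + u_in[i + 1]);
}
```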

Thanks to everybody who spent time solving my problem… it wasn’t wasted, and it has also made me wiser… hehe