Possible bug in arithmetic? Difference in emulated and "normal" mode

Dear All,

I have a strange and urgent problem with CUDA. I’m writing a CFD simulation program, and I’m using double-precision computations (sm_13). I tested several functions in emulation and “normal” (running on the GPU) mode, and everything worked fine; the values were exactly the same as in the program’s CPU version.

Then comes the problem with the iterative solver. In emulation mode the program produces exactly the same results as the CPU version, but as soon as I switch to the GPU, by the 1000th iteration every value overflows and is meaningless, even though it should work fine.

I have no idea why this happens, because even with single precision it cannot overflow. I searched the code, and if I disable just one line (a division), it won’t crash (the problem is that then I don’t solve the equations).

I really have no idea what causes the problem, because in emulation mode this division works fine, and in the tested GPU kernels even divisions with smaller numbers work with full precision.

[codebox]GPU_FLOAT Flag3=2.0*(Alpha+(Gamma+Sigma)/3.0);[/codebox]
Has anybody seen something like this?

Yours sincerely:

                                            Laszlo Daroczy

Are you running it on hardware with native support for doubles?

Perhaps you’re running into denormalized numbers, which are not supported by CUDA, or maybe it’s because round-to-nearest-even is the only IEEE rounding mode supported for division in CUDA.


It’s running on a GTX 295, so with native support.

On the other hand, I don’t think the problem is denormalized numbers, as the results of the division in the first iteration are around ~1.2.

The problem is why the algorithm converges in emulation mode and on the CPU but diverges fast on the GPU. I don’t understand why, as all the other kernel functions match the CPU to within 1e-15 (and the algorithm doesn’t diverge even with single precision), so the only explanation left would be that the arithmetic is simply imprecise???

I tried it with CUDA 2.3 and 3.0b and with several drivers, but always the same problem.

Perhaps it’s some kind of race condition. In emulation mode each thread is processed sequentially, so maybe you need to put in some __syncthreads() calls…


Also, does adding “-Xopencc -O0” to the nvcc command line make any difference?

I can only agree with the other suggestions - it is either a subtle opencc optimization bug, or you have a correctness problem in your CUDA version which doesn’t show up when executed serially (as in emulation mode). I have double-precision codes which I can verify down to accuracies of about 1e-13 when comparing the SSE and CUDA versions.

Dear all, I finally found the problem. I use a chess-board algorithm for updating the data in the iteration, but I forgot to define the rule in the boundary condition.