I noticed a weird issue while testing a CUDA implementation of a numerical algorithm. It seems as if the arithmetic is less precise than normal double precision. (Before you ask: I DO have a compute capability 1.3 device and I DID compile the code with -arch sm_13.) The algorithm performs a few numerically critical steps, such as a finite-difference discretization of a PDE, so the computation has to handle both quite large values (approx. 1e+7) and quite small values (approx. 1e-7), and it additionally has to iterate until a given precision is reached.
I have successfully translated the source code to CUDA… and now the issue:
If the algorithm has to perform many operations before the precision criterion terminates it, it iterates much longer than when executed with fewer operations… the accumulated computation error is much larger. If I reduce the range of values (e.g. from 1e±7 to 1e±3), it needs fewer iterations to terminate. AND this effect changes from execution to execution: sometimes the algorithm needs… let's say 10 iterations… another time 100 iterations, with the same value range! How can this happen?
The same algorithm implemented serially in “classic” C++ doesn’t behave like that. It always takes the same number of iterations, as it should.
To cut a long story short: is GPU computation less accurate? How can the iteration count vary that drastically with exactly the same starting conditions?