I’m working on a CUDA version of a lossless image encoder. Overall accuracy is not as important as you’d think, but what is important is that a function returns the same output whether it runs on the CPU or the GPU. I managed to achieve this by doing all calculations in double precision, plus a few limitations on some functions (pow and tan, for example, caused problems when the output approached infinity, but since very large numbers aren’t useful to our algorithm, we capped them at 10) and some funky rounding magic. While this works, the algorithm is evolutionary (it runs until you stop it, searching for the best result), so speed is key. I’m running on two different systems, one with GTX 480s and one with Tesla C2050s. The Tesla cards are about 15% faster when using DP, but the 480s with SP are about twice as fast as that. Unfortunately, if I use the float versions on both, there are some small differences in the values (the final value is cast to an int, so it just has to stay close). The programming guide says the maximum error is around 2-4 ULP for the SP trig functions (8 for powf), so am I essentially stuck with using DP?
I think you are trying to do the impossible. In principle, you may never even recompile your code, as this might rearrange floating point operations.
If you really need bit-for-bit identical results, convert your code to fixed-point arithmetic (i.e., scaled integers).
Hi,
I’ve also seen differences in floating point results between a GTX 280 and a C1060, either a bug on my part or some precision issue. So using doubles should indeed solve it.
It’s a bit weird that the C2050 with DP is only 15% faster. There was a lot of discussion here about how NVIDIA crippled DP badly in the GTX line and that the C2050 should be much, much
faster than the Fermi GTX cards when it comes to DP. Maybe there is a lot of other overhead in your kernel preventing you from getting the full DP performance on the C2050?
my one cent :)
eyal
Depending on your CPU compiler, bit-for-bit accuracy may be achievable, even on floating-point numbers.

For basic operations:
On the GPU side, you need to use __fadd_rn and __fmul_rn to prevent the compiler from emitting (more accurate) fused multiply-adds.
On the CPU side, you need to make sure that the compiler uses the SSE instruction set exclusively and avoids any unsafe math optimization.
For transcendentals, you have two options:
- Use the exact same implementation on both the CPU and the GPU. Performance will be suboptimal on at least one platform. But results should be the same, as long as the implementation only uses basic arithmetic operations.
- Enforce precise rounding rules that define the result of transcendental functions unambiguously. Then make sure that both the CPU and the GPU implementations follow these rules. The most reasonable set of rules is “return the FP number closest to the exact result”, or correct rounding.
Unfortunately, implementations of correctly-rounded transcendentals are quite involved, even on CPUs. So I suggest option 1 instead.
Using fixed-point as tera suggests is a good solution for basic operations, but you will still need to roll your own transcendentals…
I think tera was envisioning a scenario where, for the expression d = a + b + c, the CPU compiler decides to do d = (a + b) + c and the GPU compiler decides to do d = a + (b + c), which could give different results even for identical a, b, c. I guess the takeaway is that to enforce bit-for-bit accuracy you would need to specify the order of operations exactly for every arithmetic expression.