I’m working on a CUDA version of a lossless image encoder. Overall accuracy is not as important as you’d think, but what is important is that a function returns the same output whether it runs on the CPU or the GPU. I managed to achieve this by doing all calculations in double precision, plus a few limitations on some functions (pow and tan, for example, caused problems when the output approached infinity, but since very large numbers aren’t useful to our algorithm, we capped them at 10) and some funky rounding magic. While this works, the algorithm is evolutionary (it runs until you stop it, searching for the best result), so speed is key. I’m running on two different systems, one with GTX 480s and one with Tesla C2050s. The Tesla cards are about 15% faster when using DP, but the 480s with SP are about twice as fast as that. Unfortunately, if I use the float versions on both, there are some small differences in the values (the final value is cast to an int, so it just has to stay close). The programming guide says the maximum error is around 2-4 ULP for the SP trig functions (8 for powf), so am I essentially stuck with using DP?
I think you are trying to do the impossible. In principle, you may never even recompile your code, as this might rearrange floating point operations.
If you really need bit-for-bit identical results, convert your code to fixed-point arithmetic (i.e., scaled integers).
Hi,
I’ve also seen differences in floating point results between a GTX 280 and a C1060, either a bug on my part or some precision issue. So using doubles should indeed solve it.
It’s a bit weird that the C2050 with DP is only 15% faster. There was a lot of discussion here about how NVIDIA crippled DP badly in the GTX line and that the C2050 should be much, much
faster than the Fermi GTX cards when it comes to DP. Maybe there is a lot of other overhead in your kernel preventing you from getting the full DP performance on the C2050?
my one cent :)
eyal
Depending on your CPU compiler, bit-for-bit accuracy may be achievable, even on floating-point numbers.

For basic operations:
On the GPU side, you need to use __fadd_rn and __fmul_rn to prevent the compiler from emitting (more accurate) fused multiply-adds.
On the CPU side, you need to make sure that the compiler uses the SSE instruction set exclusively and avoids any unsafe math optimization.
For transcendentals, you have two options:
- Use the exact same implementation on both the CPU and the GPU. Performance will be suboptimal on at least one platform. But results should be the same, as long as the implementation only uses basic arithmetic operations.
- Enforce precise rounding rules that define the result of transcendental functions unambiguously. Then make sure that both the CPU and the GPU implementations follow these rules. The most reasonable set of rules is “return the FP number closest to the exact result”, or correct rounding.
Unfortunately, implementations of correctly-rounded transcendentals are quite involved, even on CPUs. So I suggest option 1 instead.
Using fixed-point as tera suggests is a good solution for basic operations, but you will still need to roll your own transcendentals…
I think tera was envisioning a scenario where, for the expression d = a + b + c, the CPU compiler decides to do d = (a + b) + c and the GPU compiler decides to do d = a + (b + c), which could give different results even for identical a, b, c. I guess the takeaway is that to enforce bit-for-bit accuracy you would need to specify the order of operations exactly for every arithmetic expression.