Use float rather than double in a kernel?

Hello,

I am developing a kernel function which currently takes a long time to compute. All variables are double (64-bit), and I want to know if the computation time will decrease if I replace the type with float. If yes, will it be reduced by a ratio of 1:32 (float/double)?

Also, will using the float type (32-bit) open up the opportunity to get more resources (larger block sizes, more registers available, …)?

Thanks in advance for a detailed answer!

Abdoulaye

A 64-bit double variable takes 2 registers, while a 32-bit float can be stored in 1 register. That affects how many blocks can simultaneously execute on a multiprocessor and what the maximum block size can be.

As for the speed-up: depends entirely on your hardware.

Assuming that you’re on a GeForce or Titan consumer device with a 32:1 SP/DP ratio, the speed-up can be anything between 1 and 32, depending on how much the DP ALU was the bottleneck in your kernel’s performance.

If you’re on a Tesla card with a 2:1 SP/DP ratio, then expect a speed-up of up to a factor of about 2.

Christian

Thanks, Christian, for the explanation. It makes sense now.

Anyway, since the speed-up can be as high as 32, it may be better to use float (despite the fact that the rounding error will increase). So it is just a matter of trading off accuracy against calculation speed.

Sometimes mathematical calculations can be rewritten into an equivalent that reduces the magnitude of error propagation.

Also, there is the possibility of combining two floats into something called a “double single”. Its accuracy is closer to double, with roughly a 4x performance penalty over plain float. Google DSFUN (Fortran code) and the C/C++ ports of this library.

As for the speed-up: depends entirely on your application.

For example, your application may be completely compute bound, with heavy use of transcendental functions. When you change from double precision to single-precision, your speed-up may be higher than what the DP to SP FLOPS ratio suggests, because not only is the operations throughput higher, but the code also needs fewer operations. Then you discover that some of the transcendental functions can utilize the less accurate built-in hardware transcendentals and that you don’t need denormal support. So you add the -use_fast_math compiler switch and, presto, you get another nice speedup.

It is definitely a good idea to let the CUDA profiler help you identify bottlenecks in your code before jumping into a double/float change. Your bottlenecks may not be (exactly) where you think they are.

By the way, I recently migrated from a Quadro P2000 to an RTX 2060 and I did not see a significant improvement in terms of speed-up (I admit that I was frustrated after spending hundreds of dollars on it). So I think I have to improve my application code a lot. (To be honest, all my variables are double, and there are many, many of them, maybe close to a thousand.)

[Corrected after I got my reading glasses and noticed that Wikipedia says “164” instead of the “104” DP GFLOPS that I had read before]

That is somewhat surprising, because the DP performance of the RTX 2060 is listed in Wikipedia as 164 DP GFLOPS, vs the 109 DP GFLOPS I measure on my Quadro P2000. The RTX 2060 also has the advantage of providing more than 2x the memory bandwidth and about 50% more SP TFLOPS compared to the Quadro P2000. So theoretically you should have observed about a 50% speedup when switching from the Quadro P2000 to the RTX 2060. What did you actually observe?

But if your code is limited by DP throughput, none of the other improvements will matter.

Hello njuffa,

According to this site (https://www.techpowerup.com/gpu-specs/geforce-rtx-2060.c3310), before deciding to buy, I had already verified that the FP64 throughput of the RTX 2060 is around 200 GFLOPS (about 2 times faster) and that the number of SMs is 30 (vs 8 for the Quadro P2000). Also, the latter will allow me to schedule more threads in flight at once.

It’s about 200 DP GFLOPS at maximum boost clock, which an app may or may not reach, or sustain.

Given that you researched the specs before the purchase, I am not sure why there is now frustration with this purchase.

Lol. I mean that I was frustrated by the results (no significant computation speed-up) until I discovered through this topic that the problem may come from the heavy use of DP variables and math functions in my code. So from now on, I am hopeful that addressing this will remove the bottleneck. Otherwise, as you advised, I'll try to use the profiler.

Thanks