Use float rather than double in a kernel?

Hello,

I am developing a kernel function which currently takes a long time to compute. All variables are double (64-bit), and I want to know if the computation time will decrease if I replace the type with float. If yes, will it be reduced by a ratio of 1:32 (float/double)?

Also, will using the float type (32-bit) open up the opportunity to get more resources (larger block sizes, more registers available, …)?

Thanks in advance for a detailed answer!

Abdoulaye

A 64-bit double variable takes 2 registers, while a 32-bit float can be stored in 1 register. That affects how many blocks can simultaneously execute on a multiprocessor and what the maximum block size can be.

As for the speed-up: depends entirely on your hardware.

Assuming that you’re on a GeForce or Titan consumer device with a 32:1 SP/DP ratio, the speed-up can be anything between 1 and 32, depending on how much the DP ALU was the bottleneck in your kernel’s performance.

If you’re on a Tesla card with a 2:1 SP/DP ratio, then expect a speed-up of up to a factor of about 2.

Christian

Thanks, Christian, for the explanation. It makes sense now.

Anyway, since the speed-up can be as high as 32, it may be better to use float (despite the fact that the rounding error will increase). So it is just a matter of trading off accuracy against calculation speed.

Sometimes mathematical calculations can be rewritten into an equivalent that reduces the magnitude of error propagation.

Also, there is the possibility of combining two floats into something called a “double single”. Its accuracy is closer to double, with roughly a 4x performance penalty over plain float. Google DSFUN (Fortran code) and the C/C++ ports of this library.

As for the speed-up: depends entirely on your application.

For example, your application may be completely compute bound, with heavy use of transcendental functions. When you change from double precision to single-precision, your speed-up may be higher than what the DP to SP FLOPS ratio suggests, because not only is the operations throughput higher, but the code also needs fewer operations. Then you discover that some of the transcendental functions can utilize the less accurate built-in hardware transcendentals and that you don’t need denormal support. So you add the -use_fast_math compiler switch and, presto, you get another nice speedup.

It is definitely a good idea to let the CUDA profiler help you identify bottlenecks in your code before jumping into a double/float change. Your bottlenecks may not be (exactly) where you think they are.

By the way, I recently migrated from a Quadro P2000 to an RTX 2060 and I did not see a significant improvement in terms of speed-up (I admit that I was frustrated after spending hundreds of dollars on it). So I think I have to improve my application code a lot. (To be honest, all my variables are double, and there are many, many of them, maybe close to a thousand.)

[Corrected after I got my reading glasses and noticed that Wikipedia says “164” instead of the “104” DP GFLOPS that I had read before]

That is somewhat surprising, because the DP performance of the RTX 2060 is listed in Wikipedia as 164 DP GFLOPS, vs the 109 DP GFLOPS I measure on my Quadro P2000. The RTX 2060 also has the advantage of providing more than 2x the memory bandwidth and about 50% more SP TFLOPS compared to the Quadro P2000. So theoretically you should have observed about a 50% speedup when switching from the Quadro P2000 to the RTX 2060. What did you actually observe?

But if your code is limited by DP throughput, none of the other improvements will matter.

Hello njuffa,

According to this site (https://www.techpowerup.com/gpu-specs/geforce-rtx-2060.c3310), before deciding to buy, I had already verified that the FP64 throughput of the RTX 2060 is around 200 GFLOPS (about 2 times faster) and that the number of SMs is 30 (vs 8 for the Quadro P2000). Also, the latter will allow me to schedule more threads in flight at once.

It’s about 200 DP GFLOPS at maximum boost clock, which an app may or may not reach, or sustain.

Given that you researched the specs before the purchase, I am not sure why there is now frustration with this purchase.

Lol. I mean that I was frustrated by the results (no significant computation speed-up) until I discovered through this topic that the problem may come from the heavy use of DP variables and math functions in my code. So from now on, I am hopeful that addressing this will remove the bottleneck. Otherwise, as you advised, I'll try to use the profiler.

Thanks