Expected performance of double precision arithmetic

I’m thinking about porting a project to CUDA that requires double precision. However, I can’t really afford to buy a GTX2xx or Tesla just to find out that it isn’t really what I need performance-wise.

I would be glad if anyone who has first-hand experience with it could shed some light on the relative performance of double precision compared to single precision for:

- standard FP arithmetic (add, mul, fma)
- trigonometric functions

and especially

- the CUFFT library.

Is it true that double precision arithmetic is handled by separate double precision units in the hardware? If so, is the number of these likely to increase in future Tesla systems?
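
For concreteness, the kind of kernel I would want to time in each precision looks roughly like the sketch below (just a sketch; the recurrence, iteration count and launch configuration are arbitrary, and each launch would be timed with cudaEvent timers):

[codebox]
// Sketch: the same multiply-add loop in float and in double, so the two
// kernels can be timed against each other. The recurrence is arbitrary;
// it just keeps the arithmetic units busy and stays finite.
__global__ void madLoopFloat(float *out, int iters)
{
    float x = 0.999f;
    float y = 0.001f * threadIdx.x;
    for (int i = 0; i < iters; ++i)
        y = y * x + 0.001f;                  // should map to a single mad/fma
    out[blockIdx.x * blockDim.x + threadIdx.x] = y;
}

__global__ void madLoopDouble(double *out, int iters)
{
    double x = 0.999;
    double y = 0.001 * threadIdx.x;
    for (int i = 0; i < iters; ++i)
        y = y * x + 0.001;                   // double-precision multiply-add
    out[blockIdx.x * blockDim.x + threadIdx.x] = y;
}
[/codebox]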

Thanks a lot for your help!

Sorry for pushing this thread back to the top, but I'd really appreciate your opinions on this matter!

Evaluation of cuFFT on a 3-D FFT (complex-to-complex, in-place)

Platform: VC2005, icpc 11.1.035, -O2, CUDA 2.3, GTX 295

fftpack column: F77 package (translated to C by f2c); fftpack is run on a single thread here

cuFFT column: forward C2C transform, in-place

device --> host column: time to transfer the data from device to host

single precision

[codebox]
      N       | fftpack (cpu) |  cuFFT  | device --> host
--------------+---------------+---------+-----------------
 64,64,64     |        47 ms  |    0 ms |           0 ms
 80,80,80     |        63 ms  |   16 ms |           0 ms
 108,108,108  |       156 ms  |   16 ms |           0 ms
 128,128,128  |       297 ms  |   16 ms |           0 ms
 210,210,210  |      1578 ms  |  156 ms |          47 ms
 256,256,256  |      3000 ms  |   15 ms |          78 ms
[/codebox]

double precision

[codebox]
      N       | fftpack (cpu) |  cuFFT  | device --> host
--------------+---------------+---------+-----------------
 64,64,64     |        47 ms  |    0 ms |          15 ms
 80,80,80     |        94 ms  |   47 ms |           0 ms
 108,108,108  |       172 ms  |   94 ms |          15 ms
 128,128,128  |       359 ms  |   16 ms |          15 ms
 210,210,210  |      1391 ms  |  750 ms |          78 ms
 256,256,256  |      2156 ms  |   78 ms |         141 ms
[/codebox]

Roughly speaking, the "float" FFT is about 3x faster than the "double" FFT.
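
In case it is useful, here is a stripped-down sketch of how the cuFFT column can be measured (not the exact benchmark code; the size N, the zeroed input and the event-based timing are just one way to do it):

[codebox]
// Sketch: time a single-precision 3-D C2C in-place forward FFT with CUDA events.
// For the double-precision run, swap cufftComplex / CUFFT_C2C / cufftExecC2C
// for cufftDoubleComplex / CUFFT_Z2Z / cufftExecZ2Z.
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int N = 128;                                   // cube edge, e.g. 128,128,128
    size_t bytes = sizeof(cufftComplex) * N * N * N;

    cufftComplex *d_data;
    cudaMalloc((void**)&d_data, bytes);
    cudaMemset(d_data, 0, bytes);                        // real input data would be copied in here

    cufftHandle plan;
    cufftPlan3d(&plan, N, N, N, CUFFT_C2C);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward transform
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d^3 C2C forward: %f ms\n", N, ms);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
[/codebox]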

Thank you very much!

I am running into a situation where I need double precision or else the calculations come out wrong, but from reading the guides, float is the preferred data type for CUDA.

I tried using floats for my arrays, but the results are all wrong; if I use double, the results turn out OK. However, with double precision the CUDA version is about 10x slower than the CPU. I also tried varying the thread/block counts, but that does not really help.

Has anyone had success using doubles in CUDA with better performance than the CPU? Is it possible?

bump

anyone?

Can you describe your application or your coding style?

Maybe you are wasting a lot of time on data transfers.
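
One quick way to check is to time the transfer and the kernel separately with CUDA events (just a sketch; the buffer size and the dummy kernel are placeholders for your own code):

[codebox]
// Sketch: time a host->device copy and a kernel launch separately with CUDA
// events to see which one dominates. dummyKernel stands in for the real work.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                             // placeholder computation
}

int main()
{
    const int n = 1 << 20;
    float *h = (float*)calloc(n, sizeof(float));
    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0, 0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(t2, 0);
    cudaEventSynchronize(t2);

    float copyMs = 0.0f, kernelMs = 0.0f;
    cudaEventElapsedTime(&copyMs, t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    printf("copy: %f ms, kernel: %f ms\n", copyMs, kernelMs);

    cudaFree(d);
    free(h);
    return 0;
}
[/codebox]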

I've had (relatively) great success using double precision. My double-precision code (which only uses double for a portion of the algorithm, because that's all that is needed) is only half as fast as the all-single-precision code. The double version is still about 10x as fast as the CPU.

However, my algorithm is clearly bandwidth limited.

Alright, my CUDA code executes as follows:

from main function,

  1. I create and initialize a 2-D array with values (not big, only 3x3).

  2. Then I create a similar array on the device using cudaMalloc, and cudaMemcpy the contents of the first array into the one allocated on the device.

  3. Then I cudaMalloc several int and float variables on the device.

  4. The first loop starts; I cycle through 21 SNR values:

    • on each iteration I cudaMemcpy the new SNR value, an int from rand(), and 2 other ints
    • after the cudaMemcpy, I call the kernel that begins the simulation
  5. In the kernel that begins the simulation:

    • I allocate 6 arrays of 3x3 size in shared memory
    • there is a loop in the kernel that executes until a certain variable reaches a condition;
      within this loop there is also another small loop that fills one of the 6 arrays with random values. The RNG is a complicated Gaussian RNG (with loops); I tried commenting it out and the code ran in under 1 sec… but of course that's not a solution. And there don't seem to be any built-in RNGs on CUDA…
  6. The kernel calls another kernel which does the core of the computation. The computation requires exp and log, and as my SNR values get high I end up having to compute values on the order of exp(20) or exp(30), which is where the overflow occurs. The computing kernel has no loops, only sets of calculations (kind of like an unrolled loop).

  7. After the computation, the results are fed into another kernel, which uses a small loop to add up all the results to make a prediction.

That's about it. So basically: 1 loop on the CPU that cycles 21 times, 3 small loops on CUDA (7 iterations), and 1 big loop on CUDA that runs however many times it takes to reach the condition. One thing to note: my simulation kernel gets called 21 times. Is this where I am going wrong? Should I just put the 21 SNR values into an array, along with 21 rand() seeds, and call the simulation kernel only once?
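
Something like the sketch below is what I have in mind for the batched version (only a sketch; simulateAllSnr and the placeholder body stand in for my actual simulation kernel):

[codebox]
// Sketch: one kernel launch covering all 21 SNR points instead of a CPU loop
// that launches the simulation kernel once per SNR value.
#define NUM_SNR 21

__global__ void simulateAllSnr(const float *snrValues,
                               const unsigned int *seeds,
                               float *results)
{
    int s = blockIdx.x;                       // one block per SNR value
    float snr = snrValues[s];
    unsigned int seed = seeds[s];

    // ... the body of the current simulation kernel would go here,
    //     using threadIdx.x for the work it already parallelises ...

    if (threadIdx.x == 0)
        results[s] = snr + (float)(seed & 1); // placeholder result
}

// Host side: copy the 21 SNR values and 21 rand() seeds once, then launch
//   simulateAllSnr<<<NUM_SNR, THREADS_PER_BLOCK>>>(d_snr, d_seeds, d_results);
// and copy the 21 results back in a single cudaMemcpy at the end.
[/codebox]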

Sorry about the length, but I think I need to describe the whole picture for this to be helpful.

Thanks