Expected performance of double precision arithmetic

I’m thinking about porting a project that requires double precision to CUDA. However, I can’t really afford to buy a GTX2xx or a Tesla just to find out that it isn’t what I need performance-wise.

I would be glad if anyone who has first-hand experience with it could shed some light on the relative performance of double precision compared to single precision for:

-standard FP arithmetic (add, mul, fma)
-trigonometric functions

and especially

-the CUFFT library.

Is it true that double precision arithmetic is handled by separate double precision units in the hardware? If so, is the number of these likely to increase in future Tesla systems?

Thanks a lot for your help!

Sorry for pushing this thread back to the top, but I’d really appreciate your opinions on this matter!

Evaluation of cuFFT on a 3D FFT (complex-to-complex, in-place)

platform: vc2005, icpc 11.1.035, -O2, cuda 2.3, GTX295

fftpack column: F77 package (translated to C code by f2c); fftpack is run on a single thread

cuFFT column: forward C2C, in-place

device --> host column: time to transfer the data from device to host

single precision


N           | fftpack (CPU) | cuFFT  | device --> host |
64,64,64    | 47 ms         | 0 ms   | 0 ms            |
80,80,80    | 63 ms         | 16 ms  | 0 ms            |
108,108,108 | 156 ms        | 16 ms  | 0 ms            |
128,128,128 | 297 ms        | 16 ms  | 0 ms            |
210,210,210 | 1578 ms       | 156 ms | 47 ms           |
256,256,256 | 3000 ms       | 15 ms  | 78 ms           |

double precision


N           | fftpack (CPU) | cuFFT  | device --> host |
64,64,64    | 47 ms         | 0 ms   | 15 ms           |
80,80,80    | 94 ms         | 47 ms  | 0 ms            |
108,108,108 | 172 ms        | 94 ms  | 15 ms           |
128,128,128 | 359 ms        | 16 ms  | 15 ms           |
210,210,210 | 1391 ms       | 750 ms | 78 ms           |
256,256,256 | 2156 ms       | 78 ms  | 141 ms          |

Roughly speaking, the “float” FFT is about 3x faster than the “double” FFT on the GPU, although the ratio varies considerably with N.

Thank you very much!

I am in a situation where I need double precision, or else the calculations come out wrong, but according to the guides I have read, float is the preferred data type for CUDA.

I tried using floats for my arrays, but the results are all wrong; if I use double, the results turn out OK. But with double precision, the CUDA version is about 10x slower than the CPU version. I also tried varying the thread/block numbers, and that does not really help either.

Has anyone had success using doubles in CUDA with performance better than the CPU? Is it possible?



Can you describe your application or your coding style?

Maybe you are spending too much time on data transfers.

I’ve had (relatively) great success using double precision. My double-precision code (which only uses double for a portion of the algorithm, because that’s all that is needed) is only 1/2 as fast as the all-single-precision code. The double version is still 10x as fast as the CPU.

However, my algorithm is clearly bandwidth limited.

Alright, my CUDA code executes as follows:

From the main function:

  1. I create a 2-D array (not big, only 3x3) and initialize it with values.

  2. Then I allocate a similar array on the device using cudaMalloc and cudaMemcpy the contents of the first array into it.

  3. Then I cudaMalloc several int and float variables on the device.

  4. I enter the first loop, which cycles through 21 SNR values:

    • in each iteration I cudaMemcpy the new SNR value, an int from rand(), and 2 other ints
    • after the cudaMemcpy, I call the kernel to begin the simulation
  5. In the kernel that begins the simulation:

    • I allocate 6 arrays of size 3x3 in shared memory
    • There is a loop in the kernel that executes until a certain variable reaches a condition
      - within this loop there is another small loop that fills one of the 6 arrays with random values. The RNG is a complicated Gaussian RNG (with loops). I tried commenting it out and the code ran in under 1 sec… but of course that’s not a solution. But there are no RNGs on CUDA…
  6. The kernel calls another kernel, which does the core of the computation. The computation requires exp and log, and as my SNR values get high, I end up having to compute values on the order of exp(20) or exp(30), which is where the overflow is occurring. The computing kernel has no loops, only sets of calculations (kind of like an unrolled loop).

  7. After the computation, the results are fed into another kernel, which uses a small loop to add up all the results and make a prediction.
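For the exp/log overflow in step 6, one common remedy that keeps the arithmetic in single precision is to shift the arguments before exponentiating. The sketch below assumes (this is not stated above) that the computation takes the log of a sum of exponentials; the function names are hypothetical and the code is plain host-side C++ rather than the actual kernel. For reference, a float can represent values up to about 3.4e38, so exp() only overflows to infinity for arguments above roughly 88.7; exp(30) is about 1.07e13 and fits on its own, so the overflow presumably arises when such terms are combined.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Naive form: exponentiate first, then sum and take the log.
// In float, std::exp returns +inf once the argument exceeds about 88.7,
// and the infinity propagates through the sum and the log.
inline float log_sum_exp_naive(const std::vector<float>& x) {
    float sum = 0.0f;
    for (float v : x) sum += std::exp(v);
    return std::log(sum);
}

// Shifted form: subtract the largest argument before exponentiating, so
// every exp() argument is <= 0 and every term lies in [0, 1].
// Uses the identity log(sum_i exp(x_i)) = m + log(sum_i exp(x_i - m)).
inline float log_sum_exp_shifted(const std::vector<float>& x) {
    if (x.empty()) return -std::numeric_limits<float>::infinity();
    const float m = *std::max_element(x.begin(), x.end());
    float sum = 0.0f;
    for (float v : x) sum += std::exp(v - m);
    return m + std::log(sum);
}
```

With the inputs {100, 100}, the naive version returns infinity in float, while the shifted version returns 100 + log(2). The same rearrangement should carry over to a kernel using expf and logf, but whether it applies depends on the actual formula being evaluated.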

That’s about it. So basically: 1 loop on the CPU that cycles 21 times, 3 small loops on CUDA (7 iterations), and 1 big loop on CUDA that runs however many times it takes to reach the condition. One thing to note is that, I guess, my simulation kernel gets called 21 times. Is this where I am going wrong? Should I just put the 21 SNR values into an array, along with 21 rand() seeds, and call the simulation kernel only once?

Sorry about the length, but I think I need to describe the whole picture for this to be helpful.