Evaluation of cuFFT on 3D FFT (complex-to-complex, in-place)
platform: vc2005, icpc 11.1.035, -O2, cuda 2.3, GTX295
fftpack field: F77 package (converted to C code by f2c), run on a single thread
cuFFT field: forward C2C transform, in-place
device --> host field: time to transfer the data from device to host
single precision
[codebox]------------+---------------+-----------+-------------------+
N           | fftpack (cpu) | cuFFT     | device --> host   |
------------+---------------+-----------+-------------------+
64,64,64    |         47 ms |      0 ms |              0 ms |
------------+---------------+-----------+-------------------+
80,80,80    |         63 ms |     16 ms |              0 ms |
------------+---------------+-----------+-------------------+
108,108,108 |        156 ms |     16 ms |              0 ms |
------------+---------------+-----------+-------------------+
128,128,128 |        297 ms |     16 ms |              0 ms |
------------+---------------+-----------+-------------------+
210,210,210 |       1578 ms |    156 ms |             47 ms |
------------+---------------+-----------+-------------------+
256,256,256 |       3000 ms |     15 ms |             78 ms |
------------+---------------+-----------+-------------------+
[/codebox]
double precision
[codebox]------------+---------------+-----------+-------------------+
N           | fftpack (cpu) | cuFFT     | device --> host   |
------------+---------------+-----------+-------------------+
64,64,64    |         47 ms |      0 ms |             15 ms |
------------+---------------+-----------+-------------------+
80,80,80    |         94 ms |     47 ms |              0 ms |
------------+---------------+-----------+-------------------+
108,108,108 |        172 ms |     94 ms |             15 ms |
------------+---------------+-----------+-------------------+
128,128,128 |        359 ms |     16 ms |             15 ms |
------------+---------------+-----------+-------------------+
210,210,210 |       1391 ms |    750 ms |             78 ms |
------------+---------------+-----------+-------------------+
256,256,256 |       2156 ms |     78 ms |            141 ms |
------------+---------------+-----------+-------------------+[/codebox]
Roughly speaking, on the GPU (cuFFT column) the "float" FFT is about 3x faster than the "double" FFT.
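
For anyone who wants to check that estimate per size, here is a small sketch that recomputes the double/single ratio from the cuFFT columns of the two tables. The numbers are copied from the tables; the names `single_ms`, `double_ms` and `double_over_single` are just made up for this illustration, and sizes that timed as 0 ms are skipped because no ratio can be formed from them.

```python
# cuFFT timings in ms, copied from the cuFFT column of the two tables
# (key = N for an N x N x N transform).
single_ms = {64: 0, 80: 16, 108: 16, 128: 16, 210: 156, 256: 15}
double_ms = {64: 0, 80: 47, 108: 94, 128: 16, 210: 750, 256: 78}


def double_over_single(single, double):
    """Return {N: double_time / single_time}, skipping sizes measured as 0 ms."""
    return {
        n: double[n] / single[n]
        for n in single
        if single[n] > 0 and double[n] > 0
    }


ratios = double_over_single(single_ms, double_ms)
for n, r in ratios.items():
    print(f"{n}^3: double/single = {r:.1f}")
```

The per-size ratios vary a lot (from 1.0 at 128^3 up to about 5.9 at 108^3), and the smaller entries are only 15 or 16 ms, so the individual ratios are coarse and "3x" should be read as a rough figure.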