Note that the previously posted results used CUDA 2.0 with driver 180.06. Following another suggestion by mfatica, using the
new 180.22 driver with CUDA 2.0 and the suggested compiler flags yields the following:
[codebox]
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.
              --------CUFFT-------  ------This prototype--------
   N    Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  speedup
   8  2097152     8.0   8.5    1.8     78.1  83.3    1.6     9.77
  16  1048576    17.4  13.9    2.1     98.2  78.6    1.5     5.64
  64   262144    49.6  26.4    2.5    163.7  87.3    2.6     3.30
 256    65536    88.1  35.3    2.2    146.5  58.6    2.0     1.66
 512    32768    58.8  20.9    2.9    194.8  69.3    2.5     3.32
1024    16384    76.1  24.4    2.6    165.1  52.8    2.5     2.17
Errors are supposed to be of order of 1 (ULPs).
[/codebox]
Wow! 194.8 GFlop/s for N=512! :)
With CUDA 2.1 Beta (same card), the results are:
[codebox]
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.
              --------CUFFT-------  ------This prototype--------
   N    Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  speedup
   8  2097152     8.0   8.5    1.8     78.1  83.3    1.7     9.77
  16  1048576    17.3  13.8    2.1     98.8  79.0    1.5     5.71
  64   262144    49.3  26.3    2.5    162.9  86.9    2.5     3.30
 256    65536    88.1  35.2    2.2    147.3  58.9    2.0     1.67
 512    32768    58.8  20.9    2.9    194.5  69.1    2.5     3.31
1024    16384    76.1  24.3    2.6    165.0  52.8    2.5     2.17
Errors are supposed to be of order of 1 (ULPs).
[/codebox]
So CUDA 2.1 Beta is good to go for Volkov’s implementation. :)
Verification files attached.
PS: Are the 256-, 512- and 1024-point FFT results compute bound? The prototype reaches only about 53 to 69 GB/s at those sizes, while the memory bandwidth of the GTX 260 (216 SPs) is ~112 GB/s.
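For anyone who wants to poke at the PS question, here is a rough back-of-envelope sketch in Python. It is my own illustration, not part of the benchmark. It assumes the usual 5·N·log2(N) flop count per transform and that each single-precision complex element (8 bytes) is read once and written once. Those assumptions reproduce the Gflop/s column from the GB/s column in the tables above to within rounding, but they are still assumptions. The helper names (`flops_per_byte`, `bandwidth_ceiling_gflops`) are made up for this example, and the 112 GB/s peak is just the approximate figure quoted in the post.

```python
import math

# Approximate GTX 260 (216 SP) memory bandwidth quoted in the post (assumption).
PEAK_GBS = 112.0

# (N, prototype Gflop/s, prototype GB/s) copied from the CUDA 2.1 Beta run.
RESULTS = [
    (8, 78.1, 83.3),
    (16, 98.8, 79.0),
    (64, 162.9, 86.9),
    (256, 147.3, 58.9),
    (512, 194.5, 69.1),
    (1024, 165.0, 52.8),
]


def flops_per_byte(n):
    """Assumed arithmetic intensity of one N-point FFT."""
    flops = 5.0 * n * math.log2(n)   # conventional FFT flop count (assumed)
    bytes_moved = 2 * 8 * n          # one read + one write of 8-byte complex
    return flops / bytes_moved


def bandwidth_ceiling_gflops(n, peak_gbs=PEAK_GBS):
    """Gflop/s the transform would report if it streamed at peak bandwidth."""
    return peak_gbs * flops_per_byte(n)


if __name__ == "__main__":
    for n, gflops, gbs in RESULTS:
        ceiling = bandwidth_ceiling_gflops(n)
        print(f"N={n:5d}  measured {gflops:6.1f} Gflop/s  "
              f"bandwidth ceiling {ceiling:6.1f} Gflop/s  "
              f"fraction of peak bandwidth {gbs / PEAK_GBS:.2f}")
```

Under these assumptions the small sizes (N = 8 to 64) run at roughly 70 to 78% of the quoted peak bandwidth, while N = 256, 512 and 1024 sit at roughly half to 60% of it. That is consistent with the larger transforms being limited by something other than raw memory bandwidth, but it does not prove they are compute bound: latency, shared-memory traffic or occupancy could equally be the limit.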