My speedy FFT: 3x faster than CUFFT

To test the FFT with double precision in CUDA, I made a simple change to the 090808 code, and the result is really bad. With N=1024, batch=16384, I get only 8 Gflop/s on a Tesla C1060 system, while the single-precision version gets about 200 Gflop/s.

Did anyone get better results using double precision? BTW, I use cos(phi) and cosf(phi) instead of __cosf(phi).
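For context, the change described above swaps the fast single-precision twiddle intrinsic for the double-precision software routines. A minimal sketch of the difference (hypothetical helper names, not the actual 090808 kernel code):

[codebox]
// Sketch only: twiddle factor exp(-i*phi) in single vs. double precision.
// The single-precision intrinsic maps to the special-function hardware;
// double-precision sin/cos is a software routine (no hardware support),
// which is a large part of why the DP version is so much slower.
__device__ float2 twiddle_sp(float phi)
{
    float2 w;
    __sincosf(-phi, &w.y, &w.x);   // fast single-precision intrinsic
    return w;
}

__device__ double2 twiddle_dp(double phi)
{
    double2 w;
    sincos(-phi, &w.y, &w.x);      // double-precision software routine
    return w;
}
[/codebox]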

Hey Everyone,

I got a BFG GeForce GTX 260 OC for Xmas :), installed it in my somewhat old system (PCI-E x16 v1.1), and am seeing the following performance numbers for FFT_090808:

[font="Courier New"]
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

               --------CUFFT-------  ---This prototype---
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error
   8  2097152      8.0   8.5    1.8     78.0  83.2    1.7
  16  1048576     17.4  13.9    2.1      5.9   4.7    1.4
  64   262144     30.3  16.2    2.5      3.1   1.6    1.6
 256    65536     23.3   9.3    2.2      4.1   1.7    1.7
 512    32768     58.7  20.9    2.9      8.0   2.9    2.1
1024    16384     47.2  15.1    2.6      3.8   1.2    2.3

Errors are supposed to be of order of 1 (ULPs).
[/font]
I.e. for FFT sizes greater than 8 elements, the FFT_090808 performance results are 3-8 GFlop/s, not the expected 100s of GFlop/s. I believe that CUFFT is behaving as advertised …

I am running CUDA 2.1 Beta, v. 180.06 driver, Fedora Core 9. I have run the same code on a 280 GTX card at work and it reports GFlop numbers similar to those reported by Volkov - so I am a little mystified by these numbers.

Any suggestions???

Thanks!

PS I have also obtained similar FFT_090808 performance numbers on my old 8800 GTS card, which is really disappointing. :(

Performance of this code is highly sensitive to the compiler version. Try compiling it with CUDA 2.0 on the same system.

Double precision cos and sin are not supported in hardware. Therefore, a close-to-optimal double precision FFT may look very different.

I think Apple’s version should run well on NVIDIA. So, I have very little motivation to continue this little project.

Vasily

I have tried Vasily Volkov's suggestion (thanks!) of using CUDA 2.0, i.e. the 2.0 nvcc compiler, and I have seen a performance improvement for FFT sizes greater than 8 elements. However, performance decreases as the FFT size increases, and CUFFT 2.0 is slightly faster than or equal to this prototype for N >= 256. Throughput is ~25 GFlop/s for the larger FFT sizes.

The performance I obtained is below:

Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

               --------CUFFT-------  ---This prototype---
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error
   8  2097152      7.9   8.5    1.8     78.1  83.3    1.6
  16  1048576     17.4  13.9    2.1     99.0  79.2    1.5
  64   262144     30.6  16.3    2.5     55.8  29.8    2.6
 256    65536     23.3   9.3    2.2     22.8   9.1    2.0
 512    32768     26.1   9.3    2.9     26.3   9.3    2.5
1024    16384     28.9   9.3    2.6     28.6   9.1    2.5

Errors are supposed to be of order of 1 (ULPs).

So again, any suggestions about how to proceed? I am still using the CUDA 2.1 Beta driver for Fedora 9, but installed the CUDA 2.0 toolkit (Fedora 8). I would prefer not to have to keep reinstalling/switching drivers.

Thanks!

  • dpe

System info:

nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2007 NVIDIA Corporation
Built on Thu_Jun_19_04:48:21_PDT_2008
Cuda compilation tools, release 2.0, V0.2.1221

/proc/nvidia/driver:
NVRM version: NVIDIA UNIX x86 Kernel Module 180.06 Sat Nov 8 12:13:58 PST 2008
GCC version: gcc version 4.3.0 20080428 (Red Hat 4.3.0-8) (GCC)

uname -a:
Linux manzano 2.6.27.9-73.fc9.i686 #1 SMP Tue Dec 16 15:25:05 EST 2008 i686 athlon i386 GNU/Linux

Thanks for reporting this bug; we are working on it.

[codebox]Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

               --------CUFFT-------  ---This prototype---
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error
   8  2097152      7.9   8.5    1.8     78.1  83.3    1.6
  16  1048576     17.4  13.9    2.1     99.0  79.2    1.5
  64   262144     30.6  16.3    2.5     55.8  29.8    2.6
 256    65536     23.3   9.3    2.2     22.8   9.1    2.0
 512    32768     26.1   9.3    2.9     26.3   9.3    2.5
1024    16384     28.9   9.3    2.6     28.6   9.1    2.5
[/codebox]

It's surprising that both FFT implementations run at the same 9 GB/s for all N >= 256.

Could you please compile the .cu files with the -keep option and check whether local memory is used? It's in the .cubin files; look for "lmem = ". The value is supposed to be 0.

Vasily,

I have tried your suggestion and obtained the following results when running with nvcc -keep:

[codebox]
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

               --------CUFFT-------  ------This prototype--------
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error  speedup
   8  2097152      8.0   8.5    1.8     77.4  82.6    1.6     9.69
  16  1048576     17.4  13.9    2.1     98.9  79.1    1.5     5.68
  64   262144     30.6  16.3    2.5     55.8  29.7    2.6     1.82
 256    65536     22.9   9.2    2.2     22.8   9.1    2.0     0.99
 512    32768     26.1   9.3    2.9     26.3   9.3    2.5     1.01
1024    16384     28.9   9.3    2.6     29.1   9.3    2.5     1.00

Errors are supposed to be of order of 1 (ULPs).
Data written to FFT_010809_ver.txt
[/codebox]

Checking the lmem value in the cubin yields the following results:

[codebox]
grep -r lmem *.cubin
FFT1024.cubin: lmem = 0
FFT16.cubin: lmem = 0
FFT256.cubin: lmem = 0
FFT512.cubin: lmem = 0
FFT64.cubin: lmem = 0
FFT8.cubin: lmem = 0
[/codebox]

Thanks for your previous suggestions … I too am disturbed by the poor memory bandwidth exhibited for N >= 256 by BOTH FFT implementations. Any thoughts?

I have also attached FFT_010809_ver.txt to this post, which contains nvcc, driver, and OS information along with the performance data listed above.

Regards,

  • dpe

This compiler flag will fix the performance problem in 2.1:

-Xopencc -OPT:unroll_size=200000

1) CUDA 2.1 with -Xopencc -OPT:unroll_size=200000:

Device: Tesla T10 Processor, 1296 MHz clock, 4096 MB memory.

               --------CUFFT-------  ---This prototype---
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error
   8  2097152      8.8   9.4    1.9     64.6  68.9    1.6
  16  1048576     19.2  15.3    2.2     68.1  54.5    1.5
  64   262144     56.7  30.3    2.4    140.1  74.7    2.5
 256    65536    100.2  40.1    2.2    176.8  70.7    1.9
 512    32768     67.2  23.9    2.9    219.0  77.9    2.5
1024    16384     89.0  28.5    2.6    202.4  64.8    2.5

Errors are supposed to be of order of 1 (ULPs).

2) CUDA 2.0:

Device: Tesla T10 Processor, 1296 MHz clock, 4096 MB memory.

               --------CUFFT-------  ---This prototype---
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error
   8  2097152      8.8   9.4    1.9     64.3  68.5    1.6
  16  1048576     19.2  15.3    2.2     68.8  55.0    1.5
  64   262144     56.7  30.3    2.4    140.1  74.7    2.5
 256    65536    100.1  40.1    2.2    176.8  70.7    1.9
 512    32768     67.2  23.9    2.9    218.8  77.8    2.5
1024    16384     89.0  28.5    2.6    201.9  64.6    2.5

Errors are supposed to be of order of 1 (ULPs).
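As an aside on why the unroll limit matters: kernels of this kind keep each thread's data in a small per-thread array that must stay in registers, and that only happens when every loop indexing the array is fully unrolled; otherwise the array spills to local memory (the "lmem" value checked earlier in this thread). An illustrative sketch of the pattern (not the posted FFT kernels):

[codebox]
// Illustration only: a per-thread array indexed inside a loop stays in
// registers only if the loop is fully unrolled; otherwise the compiler
// places it in local memory and performance collapses. Raising the
// compiler's internal unroll limit (-Xopencc -OPT:unroll_size=...) keeps
// such loops unrolled.
__global__ void scale_in_registers(float2 *data)
{
    float2 r[8];                              // intended to live in registers
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;

    #pragma unroll
    for (int i = 0; i < 8; i++)               // register-indexed: must unroll
        r[i] = data[base + i];

    #pragma unroll
    for (int i = 0; i < 8; i++)
    {
        r[i].x *= 2.0f;                       // stand-in for the butterfly math
        r[i].y *= 2.0f;
    }

    #pragma unroll
    for (int i = 0; i < 8; i++)
        data[base + i] = r[i];
}
[/codebox]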

Thanks for the suggestion … but when running with a maxcore GTX 260 as opposed to a GTX 280/T10 (C1060) …

[codebox]
g++ -I. -I/usr/local/cuda2.0/include -c main.cpp
nvcc -Xopencc -OPT:unroll_size=200000 -I. -I/usr/local/cuda2.0/include -c FFT8.cu
nvcc -Xopencc -OPT:unroll_size=200000 -I. -I/usr/local/cuda2.0/include -c FFT64.cu
nvcc -Xopencc -OPT:unroll_size=200000 -I. -I/usr/local/cuda2.0/include -c FFT512.cu
nvcc -Xopencc -OPT:unroll_size=200000 -I. -I/usr/local/cuda2.0/include -c FFT16.cu
nvcc -Xopencc -OPT:unroll_size=200000 -I. -I/usr/local/cuda2.0/include -maxrregcount 40 -c FFT256.cu
nvcc -Xopencc -OPT:unroll_size=200000 -I. -I/usr/local/cuda2.0/include -c FFT1024.cu
g++ -fPIC -L/usr/local/cuda2.0/lib -lcufft -lcuda -o FFT main.o FFT8.o FFT64.o FFT512.o FFT16.o FFT256.o FFT1024.o
[/codebox]

this optimization flag did nothing to improve performance for N >= 256:

[codebox]
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

               --------CUFFT-------  ------This prototype--------
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error  speedup
   8  2097152      8.0   8.5    1.8     77.3  82.5    1.6     9.68
  16  1048576     17.2  13.8    2.1     98.9  79.1    1.5     5.75
  64   262144     30.6  16.3    2.5     55.2  29.4    2.6     1.80
 256    65536     22.9   9.2    2.2     23.2   9.3    2.0     1.01
 512    32768     25.7   9.1    2.9     26.3   9.4    2.5     1.02
1024    16384     28.9   9.3    2.6     28.6   9.2    2.5     0.99

Errors are supposed to be of order of 1 (ULPs).
Data written to FFT_011109_ver.txt
[/codebox]

Do you think a different loop unroll parameter may make a difference? The largest architectural difference between the GTX 280/T10 and the maxcore GTX 260 is the number of stream processors: 216 vs. 240, i.e. one less TPC. The memory bandwidth is slightly lower as well.

  • dpe

Note that the previously posted results used CUDA 2.0 with the 180.06 driver … another suggestion from mfatica, using the new 180.22 driver with CUDA 2.0 and the suggested compiler flags, yields the following:

[codebox]
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

               --------CUFFT-------  ------This prototype--------
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error  speedup
   8  2097152      8.0   8.5    1.8     78.1  83.3    1.6     9.77
  16  1048576     17.4  13.9    2.1     98.2  78.6    1.5     5.64
  64   262144     49.6  26.4    2.5    163.7  87.3    2.6     3.30
 256    65536     88.1  35.3    2.2    146.5  58.6    2.0     1.66
 512    32768     58.8  20.9    2.9    194.8  69.3    2.5     3.32
1024    16384     76.1  24.4    2.6    165.1  52.8    2.5     2.17

Errors are supposed to be of order of 1 (ULPs).
[/codebox]

Wow! 194.8 GFlop/s for N=512! :)

When using CUDA 2.1 Beta,

[codebox]
Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

               --------CUFFT-------  ------This prototype--------
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error  speedup
   8  2097152      8.0   8.5    1.8     78.1  83.3    1.7     9.77
  16  1048576     17.3  13.8    2.1     98.8  79.0    1.5     5.71
  64   262144     49.3  26.3    2.5    162.9  86.9    2.5     3.30
 256    65536     88.1  35.2    2.2    147.3  58.9    2.0     1.67
 512    32768     58.8  20.9    2.9    194.5  69.1    2.5     3.31
1024    16384     76.1  24.3    2.6    165.0  52.8    2.5     2.17

Errors are supposed to be of order of 1 (ULPs).
[/codebox]

So CUDA 2.1 Beta is good to go for Volkov’s implementation. :)

Verification files attached.

  • dpe

PS Are the 256, 512 & 1024 point FFT results compute bound? The memory bandwidth of the GTX 260 (216 SPs) is ~112 GB/sec.

This throughput result reported by mfatica (219 GFlop/s) is greater than the peak single-precision performance of the currently available IBM PowerXCell 8i (see the PowerXCell 8i product brief). Pretty impressive …

Can the recently released GTX 295 attain 0.5 TFlop/s??? That would be incredible! The first machine to produce a teraflop result on an actual computational kernel was the ASCI Red machine at Sandia National Laboratory in 1996. Slightly less than 13 years later, one can get ~2 teraflops of peak single-precision performance on a single card for $500!

  • dpe

PS I won’t begin to speculate about a GTX 295 SLi configuration throughput capability! :)

PPS A rough back-of-the-envelope calculation for 512-element batched 1D single-precision FFTs on the original ASCI Red system (9326 Pentium Pro cores) yields a throughput of ~537 GFlop/s. The GTX 295 can probably do better than this …

Hello,
I've tried this FFT and obtained great results, but I need a 4096-point FFT. How can I get it? Mr. Volkov, can you help me?

Please see the updated post above.

Yes, it is compute bound at large N. Also, there is something wrong with the performance of radix-16 FFTs (N = 16, 256, 1024) on the GT200 architecture. On earlier GPUs the Gflop/s rate of single-pass FFTs is strictly increasing with N.
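To put numbers behind the compute-bound point: with the flop and byte accounting used by the benchmark output (5·N·log2 N flops per transform, which matches the Gflop/s : GB/s ratios in the tables, plus one float2 read and one float2 write per point), the arithmetic intensity at N = 512 is about 2.8 flop/byte. At the ~112 GB/s quoted above for the GTX 260, the bandwidth-limited ceiling would be roughly 315 Gflop/s, well above the observed ~195 Gflop/s, so the kernel is not bandwidth-limited. A small sketch of that arithmetic (the bandwidth figure is an assumption, not a measurement):

[codebox]
// Back-of-the-envelope memory-bound ceiling, assuming the benchmark's flop
// convention (5*N*log2(N) per transform) and one load + one store of an
// 8-byte float2 per point.
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double mem_bw = 112e9;                 /* assumed GTX 260 bandwidth, bytes/s */
    for (int N = 256; N <= 1024; N *= 2)
    {
        double flops_per_point = 5.0 * log2((double)N);
        double bytes_per_point = 2.0 * 8.0;      /* read + write of a float2 */
        double intensity = flops_per_point / bytes_per_point;
        printf("N=%4d  flop/byte=%.2f  bandwidth ceiling ~%.0f Gflop/s\n",
               N, intensity, intensity * mem_bw * 1e-9);
    }
    return 0;
}
[/codebox]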

Great, thanks!

Thanks vvolkov for your really nice and useful implementation!

Here are my results on 9800GT (aka 8800GT):

[codebox]Device: GeForce 9800 GT, 1512 MHz clock, 512 MB memory.

Compiled with CUDA 2000.

               --------CUFFT-------  ---This prototype---  ---two way---
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s  error
   8  2097152      4.7   5.0    1.8     40.8  43.5    1.6     40.9    2.0
  16  1048576     10.1   8.1    2.1     53.5  42.8    1.5     53.4    1.9
  64   262144     29.8  15.9    2.4     81.6  43.5    2.3     81.6    2.8
 256    65536     50.0  20.0    2.2     93.8  37.5    2.0     94.2    3.0
 512    32768     34.4  12.2    2.9    100.7  35.8    2.5    100.4    3.7
1024    16384     46.6  14.9    2.7     98.8  31.6    2.5     94.8    3.9
2048     8192     25.4   7.4    3.7     69.2  20.1    3.0     69.2    4.5
4096     4096     24.1   6.4    4.0     74.0  19.7    3.3     73.5    4.9
8192     2048     22.9   5.6    4.4     77.0  19.0    3.4     77.0    5.2
[/codebox]

77 GFlop/s for the 8K FFT, very impressive!!

What I want is really big FFTs (1M, 4M and 16M FFTs). Any plans to support those?

Thanks once again for such a nice program.

Not on my agenda. I hope NVIDIA will do that :)

Volkov's implementation only went up to 1K-point FFTs … Kanishk, can you suggest a good reference and/or provide guidance on how to extend Volkov's implementation to FFT sizes > 1K?

Thanks!

  • dpe

PS Yes, those results are impressive … Govindaraju demonstrated at most 15 GFlop/s with 4 threads using MKL on 32K FFTs.
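For reference, the textbook way to go past the single-pass sizes (not something from the posted code) is the Cooley-Tukey split N = N1*N2, which turns one large transform into two batched passes of the sizes this code already handles, plus a twiddle multiplication and a transpose:

$$
X[k_1 + N_1 k_2] = \sum_{n_2=0}^{N_2-1} \left[ e^{-2\pi i\, n_2 k_1 / N} \sum_{n_1=0}^{N_1-1} x[N_2 n_1 + n_2]\, e^{-2\pi i\, n_1 k_1 / N_1} \right] e^{-2\pi i\, n_2 k_2 / N_2}, \qquad N = N_1 N_2 .
$$

For example, a 4096-point transform can be composed from two batched 64-point passes (4096 = 64 x 64), and in principle the same split covers the 1M-16M sizes asked about above using two 1K/4K passes.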

Some additional variable-batch-size results for 8 to 1K FFTs, and a comparison to SSE-compiled FFTW on my system …

Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

             --------FFTW--------  --------CUFFT-------  ---This prototype---
   N  Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s  GB/s  error  FFTWsp  CUFFTsp
   8     64      0.8   0.8    0.7      0.4   0.4    0.9      0.8   0.8    0.8    1.02     1.86
   8    128      0.8   0.8    0.7      0.8   0.9    1.1      2.2   2.3    1.0    2.84     2.59
   8    256      1.5   1.6    0.6      2.6   2.8    1.0      4.3   4.6    0.9    2.83     1.65
   8    512      2.0   2.2    0.8      4.0   4.3    1.3      8.6   9.2    1.1    4.21     2.14
   8   1024      2.0   2.2    0.8      5.4   5.8    1.5     15.7  16.8    1.0    7.67     2.89
   8   2048      1.9   2.0    0.8      6.4   6.8    1.3     27.2  29.1    1.1   14.41     4.28
   8   8384      1.4   1.5    0.9      7.6   8.2    1.4     53.1  56.6    1.2   37.98     6.94
  16     64      2.0   1.6    0.6      1.8   1.4    1.4      2.0   1.6    1.0    0.99     1.13
  16    128      2.0   1.6    0.7      3.6   2.9    1.7      4.1   3.3    1.1    2.00     1.14
  16    256      2.0   1.6    0.6      6.4   5.1    1.4      8.2   6.5    1.0    3.99     1.28
  16    512      1.8   1.5    0.7      9.5   7.6    1.4     16.2  13.0    1.3    8.92     1.71
  16   1024      2.0   1.6    0.7     12.7  10.1    1.7     30.3  24.2    1.1   14.78     2.39
  16   2048      2.0   1.6    0.7     14.4  11.5    1.7     49.3  39.4    1.1   24.82     3.42
  16   8384      1.7   1.3    0.8     17.0  13.6    1.8     79.6  63.7    1.2   47.49     4.69
  64     16      1.5   0.8    0.6      2.7   1.4    2.0      3.1   1.6    1.6    2.00     1.15
  64     32      2.0   1.1    0.6      5.4   2.9    1.9      6.1   3.3    1.9    2.98     1.14
  64     64      1.8   0.9    0.6     10.8   5.8    2.2     12.2   6.5    1.8    6.96     1.13
  64    128      1.9   1.0    0.6     20.5  10.9    2.0     23.3  12.4    1.9   12.33     1.14
  64    256      1.9   1.0    0.7     25.4  13.5    2.1     41.9  22.3    1.9   22.14     1.65
  64    512      1.8   1.0    0.7     34.2  18.3    2.0     66.5  35.5    2.0   37.19     1.94
  64   1024      1.4   0.7    0.7     41.1  21.9    2.1     91.6  48.9    2.1   66.64     2.23
  64   2048      1.4   0.7    0.7     44.4  23.7    2.1    106.8  57.0    2.1   76.61     2.41
  64   8384      1.4   0.7    0.7     48.6  25.9    2.3    144.6  77.1    2.2  103.39     2.97
 256      4      1.4   0.5    0.7      3.0   1.2    2.0      2.2   0.9    1.8    1.64     0.74
 256      8      1.6   0.7    0.7      6.0   2.4    2.0      4.5   1.8    1.8    2.77     0.75
 256     16      1.6   0.7    0.7     11.9   4.7    2.0      9.1   3.6    1.8    5.55     0.77
 256     32      1.6   0.7    0.7     22.2   8.9    2.1     18.2   7.3    1.8   11.13     0.82
 256     64      1.6   0.6    0.7     37.6  15.0    2.1     34.6  13.8    1.9   21.64     0.92
 256    128      1.6   0.6    0.7     54.6  21.8    2.1     62.1  24.8    1.9   39.81     1.14
 256    256      1.4   0.5    0.7     62.7  25.1    2.2     94.1  37.6    1.8   68.94     1.50
 256    512      1.4   0.5    0.7     72.6  29.1    2.1     79.8  31.9    2.0   58.12     1.10
 256   1024      1.4   0.6    0.8     80.5  32.2    2.1    118.0  47.2    1.9   85.78     1.47
 256   2048      1.4   0.5    0.8     84.0  33.6    2.1    132.1  52.8    1.9   96.17     1.57
 256   8384      1.4   0.6    0.8     87.6  35.0    2.2    145.0  58.0    1.9  102.72     1.66
 512      4      1.5   0.5    0.7      5.3   1.9    2.8      6.4   2.3    2.4    4.15     1.20
 512      8      1.7   0.6    0.7     10.6   3.8    2.8     14.7   5.2    2.4    8.78     1.39
 512     16      1.7   0.6    0.7     17.2   6.1    2.8     28.5  10.1    2.3   17.02     1.66
 512     32      1.7   0.6    0.7     26.4   9.4    2.8     52.0  18.5    2.4   31.02     1.97
 512     64      1.6   0.6    0.7     36.0  12.8    2.8     83.9  29.8    2.4   54.04     2.33
 512    128      1.4   0.5    0.7     41.1  14.6    2.9     84.0  29.9    2.4   59.50     2.04
 512    256      1.4   0.5    0.7     48.9  17.4    2.9    133.3  47.4    2.5   94.73     2.73
 512    512      1.4   0.5    0.7     54.2  19.3    2.9    157.5  56.0    2.4  112.66     2.90
 512   1024      1.4   0.5    0.8     56.4  20.0    2.9    174.5  62.1    2.4  125.45     3.10
 512   2048      1.4   0.5    0.8     49.4  17.5    2.9    160.5  57.1    2.5  116.14     3.25
 512   8384      1.4   0.5    0.8     56.2  20.0    2.9    182.0  64.7    2.5  130.29     3.24
1024      4      1.6   0.5    0.8      7.7   2.5    2.5     10.2   3.3    2.3    6.48     1.33
1024      8      1.6   0.5    0.8     15.7   5.0    2.6     15.1   4.8    2.4    9.57     0.96
1024     16      1.6   0.5    0.8     16.0   5.1    2.6     31.6  10.1    2.4   19.65     1.97
1024     32      1.4   0.5    0.8     38.4  12.3    2.6     68.9  22.0    2.4   47.52     1.79
1024     64      1.3   0.4    0.8     51.0  16.3    2.6    104.1  33.3    2.4   78.77     2.04
1024    128      1.3   0.4    0.8     65.4  20.9    2.6    122.4  39.2    2.4   91.86     1.87
1024    256      1.3   0.4    0.8     68.8  22.0    2.6    131.8  42.2    2.4   99.04     1.92
1024    512      1.3   0.4    0.8     73.8  23.6    2.6    148.3  47.5    2.5  110.61     2.01
1024   1024      1.3   0.4    0.8     67.1  21.5    2.6    158.0  50.6    2.5  117.72     2.36
1024   2048      1.3   0.4    0.8     75.3  24.1    2.6    162.6  52.0    2.4  120.75     2.16
1024   8384      1.3   0.4    0.8     75.1  24.0    2.6    161.7  51.7    2.5  119.88     2.15

Errors are supposed to be of order of 1 (ULPs).

Enjoy!

  • dpe

PS Results are also attached …

IFFT-011209 running unaltered under Windows XP 64 bit, 2GHz Xeon, Tesla C1060, CUDA 2.0, driver 178.28:

Device: Tesla C1060, 1296 MHz clock, 4096 MB memory.

Compiled with CUDA 2000.

               --------CUFFT-------  ---This prototype---  ---two way---
   N    Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s  error
   8  1048576      6.5   6.9    1.7     61.2  65.2    1.6     61.1    2.0
  16   524288     16.7  13.3    2.1     65.5  52.4    1.4     66.5    1.8
  64   131072     52.4  27.9    2.4    132.6  70.7    2.3    131.9    3.0
 256    32768     97.0  38.8    2.2    166.0  66.4    2.0    168.8    3.0
 512    16384     65.5  23.3    3.0    213.1  75.8    2.5    205.8    3.8
1024     8192     86.8  27.8    2.6    192.7  61.7    2.4    192.0    3.9
2048     4096     48.9  14.2    3.7    132.6  38.6    3.0    130.5    4.5
4096     2048     44.9  12.0    4.0    142.7  38.1    3.3    141.9    4.9
8192     1024     43.0  10.6    4.4    152.5  37.5    3.4    152.6    5.2

Errors are supposed to be of order of 1 (ULPs).

Do you use a Hanning Window in your FFT?
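For what it's worth, the kernels benchmarked in this thread compute the plain transform; a Hann (Hanning) window, if desired, would be an element-wise multiply applied to each input record before the FFT. A hypothetical sketch (kernel name and indexing are illustrative, not from the posted code):

[codebox]
// Hypothetical pre-processing kernel: multiply each length-N record of a
// batched input by a Hann window, w[n] = 0.5 * (1 - cos(2*pi*n/(N-1))).
__global__ void apply_hann(float2 *x, int N, int batch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N * batch) return;

    int n = i % N;                               // sample index within one record
    float w = 0.5f * (1.0f - __cosf(2.0f * 3.14159265f * n / (N - 1)));
    x[i].x *= w;                                 // real part
    x[i].y *= w;                                 // imaginary part
}
[/codebox]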