FFT Performance

I’m working on some Xeon machines running Linux, each with a C1060. I’m trying to verify the performance I see on some PPT slides on the NVIDIA site that show 150+ GFLOPS for a 256-point SP C2C FFT, but I only seem to be getting about 30 GFLOPS. I’m timing only the FFT, with thread synchronization around the FFT and timer calls. I also double-checked the timer by calling both the CUDA timer and the Linux getTime; both give me the same elapsed time. I tried both the 1D and 2D FFTs, with the 2D giving slightly better performance. The 30 GFLOPS is very close to the number I would expect from the Intel processor with the MKL libraries. deviceQuery seems to indicate that the Tesla board is working. I’m calculating flops as 5n log2(n) for an n-point 1D transform (times 2n for the 2D n x n FFT, i.e. 10n^2 log2(n), using 256x256). Is it possible that the FFT is really being done on the Xeon? Or is there something else I may be doing wrong?
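For reference, my measurement looks roughly like this (a minimal sketch, not my exact code — assume CUDA event timing and an in-place 2D plan; build with nvcc -lcufft):

```c
#include <cuda_runtime.h>
#include <cufft.h>
#include <math.h>
#include <stdio.h>

/* Sketch of the measurement: time one 2D 256x256 C2C FFT with CUDA
   events and convert to GFLOPS using 5*n*log2(n), with n = 256*256. */
int main(void)
{
    const int N = 256;
    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * N * N);
    cudaMemset(d_data, 0, sizeof(cufftComplex) * N * N); /* contents don't matter for timing */

    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_C2C);   /* plan creation kept outside the timed region */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);            /* block until the FFT has finished */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double n = (double)N * N;              /* total points in the 2D transform */
    double flops = 5.0 * n * log2(n);      /* = 10 * N^2 * log2(N) */
    printf("%dx%d C2C FFT: %.3f ms, %.1f GFLOPS\n", N, N, ms, flops / (ms * 1e6));

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```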

Results of deviceQuery:
Device 0: “Tesla C1060”
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Did I understand correctly that you do an FFT for only a 256x256 array? Did you try varying the array size to see what’s going on? I don’t remember the exact numbers at the moment, but I compared FFTW with CUFFT at some point, and the difference in performance was significant only for large transforms. Below I attach one of the plots I made for a presentation, which also includes 3D CUFFT timing on a GTX 285. Ignore everything else in it and just look at the CUFFT line. The horizontal axis is the N in an N^3 FFT, and the vertical axis is the FFT time / N^3. Hope this is of some help.

Here I dug out the numbers from the plot (N, and Time_of_FFT / N^3 in microseconds):

N     Time/N^3 (us)
8     0.448793
16    0.550607
32    0.137392
48    0.273981
64    0.0339002
96    0.0752926
128   4.21127e-05
144   0.0391407
192   0.0395601
196   0.0419164
256   6.87637e-06

What batch size are you using (not the FFT size)? CUFFT works best when you run a large number of FFTs simultaneously, so that it can schedule the calculations across more of the multiprocessors.
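For example, the batch count is the last argument to cufftPlan1d. A sketch, with an arbitrary batch of 4096 just to illustrate the API:

```c
#include <cuda_runtime.h>
#include <cufft.h>

/* Sketch of a batched 1D plan: one call runs `batch` independent
   256-point FFTs, which is what keeps all 30 multiprocessors busy. */
int run_batched_fft(void)
{
    const int n = 256;        /* points per transform */
    const int batch = 4096;   /* number of independent transforms (arbitrary here) */

    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * n * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);   /* batch is the 4th argument */

    /* transforms are laid out contiguously: transform i starts at d_data + i*n */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();   /* cudaThreadSynchronize on CUDA 2.x */

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```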

My expectation of 150 GFLOPS is based on the NVIDIA slides (http://gpgpu.org/wp/wp-content/uploads/200…Tools_Cohen.pdf, page 11 graph). I picked the 256 size to play with. I first tried a 2D 256x256 and got about 30 GFLOPS. I then tried a single 256-point 1D FFT and only got about 1 GFLOP, which seems reasonable, since there is much less compute time to amortize the overhead. I did the 1D with a batch of 256, and that got me back up to the high 20s, but still nothing close to the 150 on the slides. It looks like the 2D transform is probably making the batching decisions itself, which is what I would expect. I’m pretty much using the same code that is on page 10 of the slides.

256x256 is not big enough to achieve max performance. Try 1024x1024.
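A quick sweep over sizes will show where the curve flattens out. A sketch, reusing the event-timing idea from above (sizes picked arbitrarily, one warm-up run per size):

```c
#include <cuda_runtime.h>
#include <cufft.h>
#include <math.h>
#include <stdio.h>

/* Sketch of a size sweep: time one 2D C2C FFT per size and print GFLOPS,
   to see where CUFFT stops being overhead-bound. */
int main(void)
{
    const int sizes[] = { 256, 512, 1024, 2048 };
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < (int)(sizeof(sizes) / sizeof(sizes[0])); ++i) {
        const int N = sizes[i];
        cufftComplex *d_data;
        cudaMalloc((void **)&d_data, sizeof(cufftComplex) * N * N);

        cufftHandle plan;
        cufftPlan2d(&plan, N, N, CUFFT_C2C);

        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); /* warm-up run, not timed */
        cudaDeviceSynchronize();

        cudaEventRecord(start, 0);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double n = (double)N * N;
        printf("%4dx%-4d: %8.3f ms  %7.1f GFLOPS\n",
               N, N, ms, 5.0 * n * log2(n) / (ms * 1e6));

        cufftDestroy(plan);
        cudaFree(d_data);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```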