my speedy FFT 3x faster than CUFFT


thanks for this great contribution.

Does it work for 2D transforms, too?


with “FFT_061408”

with “FFT_090808”

I don't know how to test my other GPU; it's not hooked up to anything and isn't running X like this GPU is.

Is there a switch I can use with './FFT' to force it to use my 2nd GPU?

No, it is 1D only.

It should be “./FFT -device 1” in FFT_090808. The earlier version requires modifying the source code: change “int idevice = 0” in main() to “int idevice = 1”.

I only did the test with 090808 this time, but on each GPU (GPU 1 runs at a slightly lower frequency, but both are at factory clocks).

Thanks for letting me know about the “-device 1” option. I tried “./FFT --help” and there was none, so I didn't try -device 1 :( Stupid me.

any option to put the test in multigpu mode?

Would it be possible to run this code in CUDA 1.0? I don’t know exactly what kind of changes were made in CUDA 2.0… :mellow:

Why are you running CUDA 1.0? Isn’t that well over a year old?

I’m using a mac.

edit: I’ll just upgrade to the next version.

Cuda 2.0 is out for Mac.
The compiler in 1.1 will not do a good job on this code, you need to use CUDA 2.0 to get good performance.

Nope, multigpu mode is not currently implemented.

I’m sorry about the poor functionality and limited user interface. It was supposed to be a simple, easy-to-read technical demo, not a useful and versatile library.

I agree - great work!

Is there any chance of any of these improvements making their way into a new version of cuFFT?

Although this is now a bit old: CUFFT didn't really seem to get up to full speed until it was using sizes much bigger than those this library handles. Do you have any plans to increase the sizes it can handle, to enable further comparisons?

As far as I can see, all plots on that page correspond to non-batched FFTs. In that case CUFFT runs slowly due to large kernel launch overhead (~5 µs). For batched FFTs, as far as I recall, CUFFT is faster for smaller sizes, such as <1024.

Although I may release code for a few other sizes, I do not plan extensive updates, as similar work has recently been done by Microsoft. They produced an FFT in CUDA that is as fast as the one posted here but can handle arbitrary sizes. See…20on%20GPUs.pdf

Very interesting paper. Did I miss the link to the cuda code/library so I can have a play on my own system or has it not been made available?

I don't see it available yet. Maybe they'll announce the release at SC08 (next week).

I've been in touch with one of the authors of the MS paper, and the bad news is that MS will not release the CUDA versions of their libraries. Instead they plan to release them as DX11 compute shaders (i.e. Vista only :thumbsdown: ).

Does anyone fancy implementing the algorithms in Cuda from the paper?

That’s really very annoying. Hopefully nVidia will do the work for us…

Not very surprising really, but encouraging that they may be planning to release a good set of APIs with the compute shader stuff.

Perhaps vvolkov could now continue his work, if this is never going to see the light of day on CUDA.

Using the 061408 and cuda 2.1 on 8800GT I got

Device: GeForce 8800 GT, 1620 MHz clock, 255 MB memory.
              --------CUFFT--------    ----This prototype----
    N    Batch   Gflop/s  GB/s  error    Gflop/s  GB/s  error
    8   524288       5.0   5.3    1.7       41.5  44.2    1.6
   64    65536      31.8  17.0    2.4        5.9   3.1    1.6
  512     8192      36.8  13.1    2.9        5.6   2.0    2.0

with the 090808 code and cuda 2.1 I got

Device: GeForce 8800 GT, 1620 MHz clock, 255 MB memory.
              --------CUFFT--------    -----This prototype-----
    N    Batch   Gflop/s  GB/s  error    Gflop/s   GB/s      error
    8  2097152       4.7   5.0    1.8       41.4   44.1        1.7
   16  1048576      10.4   8.3    2.1        5.0    4.0        1.4
   64   262144      30.5  16.3    2.5        5.9    3.2        1.6
  256    65536      53.4  21.4    2.2    23748.8 9499.5  8502756.0
cufft: ERROR: /root/cuda-stuff/sw/rel/gpgpu/toolkit/r2.1/cufft/src/, line 1070
cufft: ERROR: /root/cuda-stuff/sw/rel/gpgpu/toolkit/r2.1/cufft/src/, line 151
512 32768
FAILURE in main.cpp, line 218

Any suggestions on what is wrong?

I've been thinking about the upcoming OpenCL and wondering where useful libraries like BLAS and FFT are going to come from. Obviously Nvidia has good reason to provide CUDA ones, but perhaps won't be so keen on providing OpenCL ones, and I imagine Apple's versions are going to be platform specific. So I was wondering, from what you've read of the OpenCL spec, whether you have any plans to implement the FFT here in OpenCL?