FFT Speed vs. x86

Hey everyone,

Does anyone have any tips for speeding up the FFT routine? I have C FFT code that runs on a normal x86-based CPU and CUDA FFT code that runs on the video card. The x86 version is roughly 1.3 to 1.5 times as fast for a 1024x1000 array. The CUDA code is set up as a batch FFT: it copies the entire 1024x1000 array to the video card, performs a batch FFT on all the data, then copies the data back off. The only difference between the two versions is the FFT routine; all other aspects are identical. Are these FFT sizes too small to see any gains vs. an x86 CPU?


A couple of questions:

  1. You mention that you run a “batch” FFT. So, are you running a 2D FFT on data that is 1024x1000, or 1D FFTs on 1024 vectors of 1000 elements each?

  2. Do your times include the data transfer from CPU to GPU and back? For smaller input sizes the transfer time can actually dominate total execution time (you can check by timing only the FFT computation). In general, you want to perform as much computation as possible on the GPU: not only the FFT, but any processing that comes afterwards as well, minimizing the overhead of transfers over PCIe.
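To separate transfer cost from compute cost, you can bracket just the FFT call with CUDA events. A minimal sketch (assuming `plan` is an already-created CUFFT handle and `d_data` already holds the input on the device):

```c
/* Time only the batch FFT, excluding the PCIe transfers.
   plan and d_data are placeholders for your existing handle/buffer. */
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);      /* wait until the FFT has finished */

cudaEventElapsedTime(&ms, start, stop);
printf("FFT only: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Comparing that number against the full copy-in/FFT/copy-out time tells you how much of the gap is pure PCIe overhead.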


Sorry about the laggy response time.

  1. By batch FFT, I meant a 1D FFT, taking 1000 vectors, with 1024 elements each.

  2. The times include the data transfer time. Basically, 1,024,000*sizeof(float) bytes are transferred each time the FFT code is executed.



Thanks for the update. Another question, how do you allocate memory on the GPU? If you allocate with cuMemAllocHost, you should get a substantial speedup when transferring to/from the GPU (approx. 2.5X). I believe the programming guide has some guidelines for using “pinned” memory.


The new syntax for pinned memory is cudaMallocHost and cudaFreeHost.
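For reference, a minimal sketch of the pinned-memory allocation (sizes taken from the post above; error checking omitted):

```c
/* Page-locked ("pinned") host memory: the GPU can DMA directly
   from/to this buffer, which is what makes the transfers faster
   than from pageable memory obtained with malloc(). */
cufftComplex *h_data;
cudaMallocHost((void **)&h_data, 1024 * 1000 * sizeof(cufftComplex));

/* ... fill h_data, cudaMemcpy to the device, run the batch FFT ... */

cudaFreeHost(h_data);   /* must be released with cudaFreeHost, not free() */
```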

FFT is an O(N log N) algorithm; you are adding an O(N) transfer to/from the host, and you are in a range (N = 1024) where the data fits in the CPU's L2 cache.

The best way to achieve a speed-up is to do more work with the data before transferring it back (as in one of the MATLAB examples). Can you move more processing to the GPU?


Hi AustinMcElroy,

Just out of curiosity… which CPU and GPU did you get these results on?

Hey guys,

mfatica and paulius:

I am currently using cudaMalloc and cudaMemcpy to do the allocation and transfer, respectively. I will look into the faster allocation and transfer functions. As to pushing more processing to the GPU, I am having a lot of trouble getting things to work on the GPU. Though this should be brought up in a different thread, the 2 issues I am currently having when building my DLL are:

1.) I get a compile error when I try to compile a .cu file containing the <<<x,y,z>>> launch syntax in VS2005 using the custom compile rules.

2.) Accessing GPU memory through pointers: for example, ptrGPUData[512] = ptrGPUData[512]*ptrGPUData[513] crashes the program.

Again, I am just doing something wrong that is probably in the documentation or forums that I haven’t found yet.
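On the second issue: a pointer returned by cudaMalloc is a device address, so dereferencing it from host code (as in ptrGPUData[512] = ...) will crash exactly as described. The element-wise arithmetic has to run in a kernel, or on a host copy fetched with cudaMemcpy. A sketch, with the kernel name made up for illustration:

```c
/* Device pointers are only valid in device code; touching them
   from the CPU crashes. Do the element-wise work in a kernel. */
__global__ void multiplyPair(float *d_data)
{
    d_data[512] = d_data[512] * d_data[513];
}

/* host side:
     multiplyPair<<<1, 1>>>(ptrGPUData);
   or copy the elements back with cudaMemcpy and compute on the CPU. */
```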


The CPU vs. GPU comparison was done using a P4 D running at 3.06 GHz (I think).

We were seeing if we could beat the performance of the example MATLAB FFT2 MEX provided by NVIDIA, which was about a 4x improvement over MATLAB's 2D FFT. Because MATLAB stores its real/imaginary data as separate matrices in column-major format, while CUFFT uses an interleaved row-major format, we thought we would be clever and use the CUBLAS functions to move the data across directly. The idea was (assuming single- rather than double-precision MATLAB data) to copy the MATLAB real and imaginary matrices into the device matrix with cublasGetVector/cublasSetVector, using their increment argument to create the interleaved device matrix. The code works and produces the correct result, but it is slow (running at the same speed as the MATLAB code). The functions in our MEX code that take the longest to run are cublasGetVector/cublasSetVector, which end up using 95% of the execution time. Does anyone know why cublasGetVector/cublasSetVector are so much slower than the CUDA memcpy commands?
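The approach described presumably looks something like the sketch below (a reconstruction, not the actual code; variable names assumed). The device-side increment of 2 is what hurts: a strided device write cannot be performed as one large contiguous DMA transfer, so the copy degenerates into many small transactions over PCIe, which would explain the 95% figure.

```c
/* Reconstruction (assumed names): interleave while copying by
   writing every other device element. n = number of elements. */
cublasSetVector(n, sizeof(float), h_real, 1, (float *)d_complex,     2);
cublasSetVector(n, sizeof(float), h_imag, 1, (float *)d_complex + 1, 2);
```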


If your data is single precision, you could do the interleaving on the card.

I would use the standard cudaMemcpy to transfer the real and imaginary parts to the card and then call this kernel:

static __global__ void interleave(float *ar, float *ai, cufftComplex *c, int N)
{
  unsigned int idx = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
  unsigned int idy = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
  float2 *c2 = (float2 *) c;

  if (idx < N && idy < N)
  {
    unsigned int index = idx + __umul24(idy, N);
    c2[index].x = ar[index];
    c2[index].y = ai[index];
  }
}
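Host-side, launching such a kernel for an N x N array could look like this (the block size is an assumption; tune as needed):

```c
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x,    /* round up to cover all of N */
          (N + block.y - 1) / block.y);

interleave<<<grid, block>>>(d_real, d_imag, d_complex, N);
```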

Thanks for the idea we’ll try it and post our results!


[quote name=‘mfatica’ date=‘Aug 6 2007, 11:26 PM’]

If your data is single precision, you could do the interleaving on the card.

I would use the standard cudaMemCpy to transfer real and imaginary part to the card and then call the kernel

static __global__ void  interleave(float *ar , float *ai, cufftComplex *c, int N)


this is very interesting.

I’m using (“trying to use” is more appropriate) CUDA FFT to perform 2D complex-to-complex forward and inverse FFTs on desktop-size images.

I’m trying to squeeze out all the performance I can from my G92 but it’s always too slow vs. a Core2Quad running MIT’s FFTW.

So, this interleaving hint is tempting.

But: this would require

  1. cudaMalloc the real part, imaginary part, and complex array on the device (takes time)

  2. memory transfer from host real part to device real part, and from host Im part to device Im part (takes even more time)

  3. executing the kernel

and then the same in reverse to get the transformed Re and Im parts back on the host.

Do you think I would get a speed gain from this?

Thanks a lot


Hi All,

I am trying to learn as much as possible before diving into CUDA and investing in a good graphics card. One benchmark I am really interested in is 3D CUFFT vs. FFTW 3.x or Intel’s FFT on 20^3 (16^3, 24^3) complex-to-real and real-to-complex transforms. I have CPU benchmarks of FFTW and Intel’s FFT on an E6750 (2.66 GHz Core 2 Duo) running 32-bit Linux RHEL 5, so I was wondering how anything decent on the GPU side would compare. In principle I am interested in double-precision calculations, which require a GTX 260 or 280, but I would be grateful for any information even on single-precision data. I would be interested in both transfer + execution time and execution-only time. I would really appreciate any help, as I need a real speed-up for my research project to be viable, and CUDA looks more and more attractive to me :)


You’ve probably seen this already…


I’m doing CUDA FFT vs. FFTW comparisons; I work with single precision, Complex2Complex 2D arrays.

I use multithreading for FFTW, with NThreads = 4 (quad core CPU).

I use the “float” (single-precision) version of FFTW, which is considerably faster than the double-precision one.

I was getting uninspiring benchmarks (CUDA was faster only for very large arrays, i.e. over 1024x1024); but then I tweaked the code a bit: now I do the interleaving and de-interleaving on the GPU (from Re and Im to complex, and back) and use pinned memory on the host.

Here are the benchmarks; please note that the benchmark performs a forward FFT followed by a reverse FFT. Each time (for both F and R) I transfer the data to and from the device.

Times are evaluated by performing 4 cycles in a row and averaging the total time.

CPU: Intel Core2 Quad @ 3.0 GHz, FSB 1333MHz

GPU: GeForce G92 @ 650/1625/1944


FFTW = 11ms, CUDA = 9ms


FFTW = 46ms, CUDA = 23ms


FFTW = 291ms, CUDA = 109ms

Hope this helps.

Please note I still use CUDA 1.1, and my motherboard is a PCIe 1.1 (quite slower data transfers: about 2.9GB/s Host2Dev, about 2.1GB/s Dev2Host).


Thanks a lot for the information. Eventually I will need double-precision FFTs combined with intensive algebraic transformations, but I now have a positive enough impression of CUDA to start experimenting with it, and maybe to convince my boss to get a GTX 280 for the double precision.

Best Regards