Does anyone have any tips for speeding up the FFT routine? I have C FFT code that runs on a normal x86-based CPU and the cudaFFT code that runs on the video card. The x86 is roughly 1.3 to 1.5 times as fast for a 1024x1000 array. The FFT code for CUDA is set up as a batch FFT; that is, it copies the entire 1024x1000 array to the video card, performs a batch FFT on all the data, and copies the data back off. The only difference in the code is the FFT routine; all other aspects are identical. Are these FFT sizes too small to see any gains vs. an x86 CPU?

You mention that you run a “batch” FFT. So, are you running a 2D FFT on data that is 1024x1000, or 1D FFTs on 1024 vectors of 1000 elements each?

Do your times include the data transfer from CPU to GPU and back? For smaller input sizes the transfer time actually dominates execution time (you can check that by timing only the FFT computation part). In general, you want to perform as many computations as possible on the GPU. So, not only the FFT, but any processing that comes afterwards as well, minimizing the overhead due to transfers over PCIe.

Thanks for the update. Another question: how do you allocate memory on the host? If you allocate the host buffers with cuMemAllocHost, you should get a substantial speedup when transferring to/from the GPU (approx. 2.5X). I believe the programming guide has some guidelines for using “pinned” memory.

The new syntax for pinned memory is cudaMallocHost and cudaFreeHost.

FFT is an O(N log N) algorithm, and you are adding an O(N) transfer to and from the host. On top of that, you are in a range (N=1024) where the data fits in the L2 cache on the CPU.
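To put rough numbers on the compute-versus-transfer argument, here is a minimal sketch in C. It assumes the conventional 5*N*log2(N) operation-count estimate for a complex FFT of power-of-two length, and single-precision complex samples (8 bytes each) copied once in each direction. The function names are made up for illustration.

```c
#include <assert.h>

/* Integer log2 of a power-of-two length n. */
int log2_pow2(unsigned n)
{
    int k = 0;
    assert(n != 0 && (n & (n - 1)) == 0);  /* power of two only */
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* Estimated floating-point operations for one complex FFT of length n
   (assumption: the usual 5*n*log2(n) rule of thumb). */
double fft_flops(unsigned n)
{
    return 5.0 * n * log2_pow2(n);
}

/* Bytes crossing the bus for one transform: n complex floats (8 bytes
   each) to the device, and n back to the host. */
double transfer_bytes(unsigned n)
{
    return 2.0 * 8.0 * n;
}

/* FFT work available per byte that has to be transferred. */
double flops_per_byte(unsigned n)
{
    return fft_flops(n) / transfer_bytes(n);
}
```

For N = 1024 this gives 3.125 floating-point operations per byte moved, and each doubling of N adds only 5/16 more, so longer transforms on their own barely change the ratio.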

The best way to achieve a speedup is to do something more with the data before transferring it back (as in one of the MATLAB examples). Can you move more processing to the GPU?

I am currently using cudaMalloc and cudaMemcpy to do the allocation and transfer, respectively. I will look into the faster allocation and transfer functions. As to pushing more processing to the GPU, I am having a lot of trouble getting things to work on the GPU. Though this should be brought up in a different thread, the 2 issues I am currently having when building my DLL are:

1.) I get a compile error when I try to compile a .cu file with the <<<x,y,z>>> notation in VS2005 using the custom compile rules.

2.) Accessing the memory on the GPU using ptrs. For example, trying to do ptrGPUData[512] = ptrGPUData[512]*ptrGPUData[513] is crashing the program.

Again, I am just doing something wrong that is probably in the documentation or forums that I haven’t found yet.

Edgardz:

The CPU vs. GPU comparison was done using a P4 D running at 3.06 GHz (I think).

We were seeing if we could beat the speed of the example MATLAB FFT2 MEX provided by NVIDIA, which was about a 4x improvement over MATLAB's 2D FFT. Because MATLAB stores its real/imaginary data as separate matrices in column-major format and CUFFT uses an interleaved row-major format, we thought we would be clever and use the CUBLAS functions to move the data across directly. The idea was (assuming single and not double MATLAB data types) to copy the MATLAB real and imaginary matrices into the device matrix using cublasGetVector/cublasSetVector, using their increment argument to create the interleaved device matrix. The code works and produces the correct result, but it is slow (running at the same speed as the MATLAB code). The functions in our MEX code that take the longest to run are cublasGetVector/cublasSetVector, which end up using 95% of the execution time. We were wondering if anyone knew why cublasGetVector/cublasSetVector is so much slower than the CUDA memcpy commands?
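For reference, the layout conversion described above can be sketched as a plain host-side loop in C. This is not the MEX code from the post, only an illustration of the index mapping from split column-major real/imaginary arrays to one interleaved row-major array; the function name and argument order are made up here. With the data already interleaved on the host, a single contiguous copy to the device would be enough.

```c
#include <assert.h>
#include <stddef.h>

/* Convert split, column-major complex data (separate re[] and im[] arrays,
   element (r, c) stored at index c*rows + r) into a single interleaved,
   row-major array (element (r, c) stored as the pair out[2*(r*cols + c)],
   out[2*(r*cols + c) + 1]). out must hold 2*rows*cols floats. */
void interleave_colmajor_to_rowmajor(const float *re, const float *im,
                                     size_t rows, size_t cols, float *out)
{
    size_t r, c;

    assert(re != NULL && im != NULL && out != NULL);

    for (r = 0; r < rows; ++r) {
        for (c = 0; c < cols; ++c) {
            size_t src = c * rows + r;        /* column-major source index */
            size_t dst = 2 * (r * cols + c);  /* interleaved row-major pair */
            out[dst]     = re[src];
            out[dst + 1] = im[src];
        }
    }
}
```

The reverse direction uses the same two index expressions with the assignments swapped.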

I am trying to learn as much as possible before diving into the CUDA business and investing in a good graphics card. One benchmark that I am really interested in is 3D CUFFT vs. FFTW 3.x or Intel’s FFT on 20^3 (16^3, 24^3) Complex-To-Real and Real-To-Complex transforms. I have the CPU benchmarks of FFTW and Intel FFT for Intel’s E6750 (2.66GHz Core 2 Duo) running on 32-bit Linux RHEL 5, so I was wondering how anything decent on the GPU side would compare. In principle I am interested in double-precision calculations, which require a GTX 260 or 280, but I would be grateful if anyone could give me any information, even on single-precision data. I would be interested in both data transfer + execution time and execution-only time. I would really appreciate any help, as I really need a speedup of my calculations for a research project to be viable, and CUDA looks more and more attractive to me :)

I’m doing CUDA FFT vs. FFTW comparisons; I work with single precision, Complex2Complex 2D arrays.

I use multithreading for FFTW, with NThreads = 4 (quad core CPU).

I use the “float” (single precision) version of FFTW, which is considerably faster than the double-precision one.

I was getting uninspiring benchmarks (CUDA was faster only for very large arrays, that is, over 1024x1024); but then, I tweaked the code a bit: now, I do interleaving and de-interleaving on the GPU (from Re and Im to Complex, and back) and I use pinned memory on the host.

Here are the benchmarks; please note that the benchmark performs a forward FFT followed by a reverse FFT. Each time (for both F and R) I transfer the data to and from the device.

Times are evaluated by performing 4 cycles in a row and averaging the total time.
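A timing loop of that kind might look like the following sketch in C. It assumes a C11 library (for timespec_get) and a workload passed in as a callback; average_ms and the stand-in workload count_call are made-up names. If the callback launches GPU work, it has to wait for that work to finish before returning, because CUDA calls can return to the host before the device is done.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in milliseconds (C11 timespec_get). */
double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Run fn(ctx) `cycles` times in a row and return the mean wall-clock
   time per call in milliseconds. */
double average_ms(void (*fn)(void *), void *ctx, int cycles)
{
    double start, stop;
    int i;

    assert(fn != 0 && cycles > 0);

    start = now_ms();
    for (i = 0; i < cycles; ++i)
        fn(ctx);
    stop = now_ms();

    return (stop - start) / cycles;
}

/* Stand-in workload for illustration: counts how often it was called.
   A real benchmark would run the forward and reverse FFT here. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

Calling average_ms(count_call, &calls, 4) with an int counter runs the workload four times and returns the mean time per run.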

CPU: Intel Core2 Quad @ 3.0 GHz, FSB 1333MHz

GPU: GeForce G92 @ 650/1625/1944

512x512: FFTW = 11 ms, CUDA = 9 ms
1024x1024: FFTW = 46 ms, CUDA = 23 ms
2048x2048: FFTW = 291 ms, CUDA = 109 ms

Hope this helps.

Please note I still use CUDA 1.1, and my motherboard is PCIe 1.1 (considerably slower data transfers: about 2.9GB/s Host2Dev, about 2.1GB/s Dev2Host).
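As a back-of-the-envelope check on how much of the CUDA time is transfer, here is a small sketch in C. It assumes single-precision complex data (8 bytes per element), that each of the two transforms moves the full array once in each direction, and that 1 GB means 1e9 bytes; the function names are made up for illustration.

```c
#include <assert.h>

/* Milliseconds needed to move `bytes` at `gb_per_s`
   (assumption: 1 GB = 1e9 bytes). */
double copy_ms(double bytes, double gb_per_s)
{
    assert(gb_per_s > 0.0);
    return bytes / (gb_per_s * 1.0e9) * 1.0e3;
}

/* Estimated transfer time for one forward plus one reverse FFT of a
   width x height complex float array, where each transform copies the
   array host-to-device and then device-to-host. */
double round_trip_ms(unsigned width, unsigned height,
                     double h2d_gb_per_s, double d2h_gb_per_s)
{
    double bytes = 8.0 * width * height;  /* 8 bytes per complex float */
    return 2.0 * (copy_ms(bytes, h2d_gb_per_s) +
                  copy_ms(bytes, d2h_gb_per_s));
}
```

With the 2.9 and 2.1 GB/s figures above, round_trip_ms(1024, 1024, 2.9, 2.1) comes to roughly 13.8 ms, which would be more than half of the 23 ms measured for the 1024x1024 case.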

Thanks a lot for the information. Eventually I will need double-precision FFTs combined with intensive algebraic transformations, but I have a positive enough impression of CUDA by now to start experimenting with it and maybe convince my boss to get a GTX 280 for the double precision.