FFT Speed vs. x86

AustinMcElroy · June 29, 2007, 8:41pm

Hey everyone,

Does anyone have any tips for speeding up the FFT routine? I have C FFT code that runs on a normal x86-based CPU and CUDA FFT (CUFFT) code that runs on the video card. The x86 version is roughly 1.3 to 1.5 times as fast for a 1024x1000 array. The CUDA code is set up as a batch FFT: it copies the entire 1024x1000 array to the video card, performs a batch FFT on all the data, and copies the data back off. The only difference in the code is the FFT routine; all other aspects are identical. Are these FFT sizes too small to see any gains vs. an x86 CPU?

Thanks,
Austin

paulius · June 29, 2007, 11:40pm

A couple of questions:

You mention that you run a “batch” FFT. So, are you running a 2D FFT on data that is 1024x1000, or 1D FFTs on 1024 vectors of 1000 elements each?
Do your times include the data transfer from CPU to GPU and back? For smaller input sizes the transfer time actually dominates execution time (you can check that by timing only the FFT computation part). In general, you want to perform as many computations as possible on the GPU. So, not only the FFT, but any processing that comes afterwards as well, minimizing the overhead due to transfers over PCIe.

Paulius

AustinMcElroy · July 2, 2007, 7:02pm

Sorry about the laggy response time.

By batch FFT, I meant a 1D FFT, taking 1000 vectors, with 1024 elements each.
The times include the data transfer time. Basically, 1,024,000*sizeof(float) bytes are transferred each time the FFT code is executed.
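To see how much of the measured time is transfer and how much is the FFT itself, the three stages can be timed separately with CUDA events. The following is only a minimal sketch: it assumes single-precision complex-to-complex transforms done in place, uses the 1024-point by 1000-batch sizes from this thread, and the buffer names are made up for illustration.

```cuda
// Sketch: time the host-to-device copy, the batched 1D FFT, and the
// device-to-host copy separately. Assumes in-place C2C transforms.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int nx = 1024;     // points per transform (from the thread)
    const int batch = 1000;  // number of transforms (from the thread)
    const size_t bytes = sizeof(cufftComplex) * nx * batch;

    cufftComplex *h_data = (cufftComplex *)calloc((size_t)nx * batch, sizeof(cufftComplex));
    cufftComplex *d_data = NULL;
    if (h_data == NULL || cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_C2C, batch) != CUFFT_SUCCESS) {
        fprintf(stderr, "plan creation failed\n");
        return 1;
    }

    // Warm-up pass so one-time initialization is not charged to the timings.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; ++i)
        cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0], 0);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1], 0);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(ev[2], 0);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3], 0);
    cudaEventSynchronize(ev[3]);

    float h2d_ms, fft_ms, d2h_ms;
    cudaEventElapsedTime(&h2d_ms, ev[0], ev[1]);
    cudaEventElapsedTime(&fft_ms, ev[1], ev[2]);
    cudaEventElapsedTime(&d2h_ms, ev[2], ev[3]);
    printf("H2D %.3f ms, FFT %.3f ms, D2H %.3f ms\n", h2d_ms, fft_ms, d2h_ms);

    for (int i = 0; i < 4; ++i)
        cudaEventDestroy(ev[i]);
    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

If the two copy times together are larger than the FFT time, the comparison against the CPU is mostly measuring the PCIe bus.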

Thanks,

Austin

paulius · July 2, 2007, 7:40pm

Thanks for the update. Another question, how do you allocate memory on the GPU? If you allocate with cuMemAllocHost, you should get a substantial speedup when transferring to/from the GPU (approx. 2.5X). I believe the programming guide has some guidelines for using “pinned” memory.

Paulius

mfatica · July 2, 2007, 8:08pm

The new syntax for pinned memory is cudaMallocHost and cudaFreeHost.
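A small sketch of what switching to pinned memory looks like, comparing one host-to-device copy from an ordinary malloc buffer with one from a cudaMallocHost buffer. The buffer size matches the 1024x1000 float array from this thread; the helper name time_h2d is hypothetical, and the actual speedup depends on the system.

```cuda
// Sketch: compare a host-to-device copy from pageable and from pinned memory.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Time one host-to-device copy of `bytes` bytes; returns milliseconds.
static float time_h2d(void *dst, const void *src, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t bytes = (size_t)1024 * 1000 * sizeof(float);

    float *pageable = (float *)malloc(bytes);  // ordinary host memory
    float *pinned = NULL;                      // page-locked host memory
    float *dev = NULL;
    if (pageable == NULL ||
        cudaMallocHost((void **)&pinned, bytes) != cudaSuccess ||
        cudaMalloc((void **)&dev, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(pageable, 0, bytes);
    memset(pinned, 0, bytes);

    time_h2d(dev, pinned, bytes);  // warm-up copy, result discarded

    float ms_pageable = time_h2d(dev, pageable, bytes);
    float ms_pinned = time_h2d(dev, pinned, bytes);
    printf("pageable: %.3f ms (%.2f GB/s)\n", ms_pageable, bytes / (ms_pageable * 1.0e6));
    printf("pinned:   %.3f ms (%.2f GB/s)\n", ms_pinned, bytes / (ms_pinned * 1.0e6));

    cudaFreeHost(pinned);  // pinned memory must be released with cudaFreeHost
    cudaFree(dev);
    free(pageable);
    return 0;
}
```

Pinned allocations are a limited resource and are slower to allocate than ordinary memory, so the usual pattern is to allocate them once and reuse them across FFT calls.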

FFT is an O(N log N) algorithm, and you are adding an O(N) transfer to and from the host. You are also in a range (N=1024) where the data fits in the L2 cache on the CPU.

The best way to achieve a speedup is to do something more with the data before transferring it back (as in one of the MATLAB examples). Can you move more processing to the GPU?
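As an illustration of doing more work before copying back, the sketch below follows the batched FFT with a kernel that reduces each complex output to its squared magnitude, so only half as many bytes return over PCIe. The power-spectrum step and the kernel name power_spectrum are assumptions chosen for the example; the thread does not say what processing follows the FFT.

```cuda
// Sketch: keep a post-processing step on the GPU so less data is copied back.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical post-processing: |X[k]|^2 for every FFT output sample.
__global__ void power_spectrum(const cufftComplex *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i].x * in[i].x + in[i].y * in[i].y;
}

int main(void)
{
    const int nx = 1024, batch = 1000;  // sizes from the thread
    const int n = nx * batch;

    cufftComplex *d_spec = NULL;
    float *d_power = NULL;
    float *h_power = (float *)malloc(n * sizeof(float));
    if (h_power == NULL ||
        cudaMalloc((void **)&d_spec, n * sizeof(cufftComplex)) != cudaSuccess ||
        cudaMalloc((void **)&d_power, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    cudaMemset(d_spec, 0, n * sizeof(cufftComplex));  // stand-in for real input data

    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_C2C, batch) != CUFFT_SUCCESS) {
        fprintf(stderr, "plan creation failed\n");
        return 1;
    }
    cufftExecC2C(plan, d_spec, d_spec, CUFFT_FORWARD);

    // Post-process on the device, then copy back floats instead of complex values.
    const int threads = 256;
    power_spectrum<<<(n + threads - 1) / threads, threads>>>(d_spec, d_power, n);
    cudaMemcpy(h_power, d_power, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("first output value: %f\n", h_power[0]);

    cufftDestroy(plan);
    cudaFree(d_spec);
    cudaFree(d_power);
    free(h_power);
    return 0;
}
```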

Massimiliano

Edgardz · July 2, 2007, 8:09pm

Hi AustinMcElroy,

Just out of curiosity: on which CPU and GPU did you get these results?

AustinMcElroy · July 2, 2007, 8:52pm

Hey guys,

mfatica and paulius:

I am currently using cudaMalloc and cudaMemcpy to do the allocation and transfer, respectively. I will look into the faster allocation and transfer functions. As to pushing more processing to the GPU, I am having a lot of trouble getting things to work on the GPU. Though this should be brought up in a different thread, the 2 issues I am currently having when building my DLL are:

1.) I get a compile error when I try to compile a .cu file with the <<<x,y,z>>> notation in VS2005 using the custom compile rules.

2.) Accessing the memory on the GPU using ptrs. For example, trying to do ptrGPUData[512] = ptrGPUData[512]*ptrGPUData[513] is crashing the program.

Again, I am just doing something wrong that is probably in the documentation or forums that I haven’t found yet.

Edgardz:

The CPU vs. GPU comparison was done using a P4 D running at 3.06 GHz (I think).

potto216 · August 6, 2007, 9:03pm

We were trying to beat the performance of the example MATLAB FFT2 MEX file provided by NVIDIA, which gave about a 4x improvement over MATLAB’s 2D FFT. Because MATLAB stores its real and imaginary data as separate matrices in column-major format and CUFFT uses an interleaved row-major format, we thought we would be clever and use the CUBLAS functions to move the data across directly. The idea was (assuming single rather than double MATLAB data types) to copy the MATLAB real and imaginary matrices into the device matrix with cublasGetVector/cublasSetVector, using their increment arguments to create the interleaved device matrix. The code works and produces the correct result, but it is slow (it runs at the same speed as the MATLAB code). The calls that take the longest in our MEX code are cublasGetVector/cublasSetVector, which account for 95% of the execution time. Does anyone know why cublasGetVector/cublasSetVector are so much slower than the CUDA memcpy functions?

Cheers,
Paul

mfatica · August 6, 2007, 9:26pm

If your data is single precision, you could do the interleaving on the card.

I would use the standard cudaMemcpy to transfer the real and imaginary parts to the card and then call this kernel:

static __global__ void interleave(float *ar, float *ai, cufftComplex *c, int N)
{
  // One thread per element of the N x N matrix.
  unsigned int idx = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
  unsigned int idy = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

  if (idx < N && idy < N)
  {
    float2 *c2 = (float2 *) c;
    unsigned int index = idx + __umul24(idy, N);

    // Pack the separate real and imaginary values into one complex element.
    c2[index].x = ar[index];
    c2[index].y = ai[index];
  }
}
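For completeness, here is one way the kernel above might be launched from the host. This is a sketch under assumptions: the matrix size, the 16x16 block shape, and all buffer names are invented for the example, and the kernel is restated with plain integer multiplication in place of __umul24 for readability.

```cuda
// Sketch: copy separate Re/Im arrays to the device, interleave them there,
// and verify the result on the host.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

__global__ void interleave(const float *re, const float *im, cufftComplex *c, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n) {
        int index = y * n + x;
        c[index].x = re[index];
        c[index].y = im[index];
    }
}

int main(void)
{
    const int n = 512;  // hypothetical matrix size
    const size_t count = (size_t)n * n;

    float *h_re = (float *)malloc(count * sizeof(float));
    float *h_im = (float *)malloc(count * sizeof(float));
    cufftComplex *h_c = (cufftComplex *)malloc(count * sizeof(cufftComplex));
    if (h_re == NULL || h_im == NULL || h_c == NULL)
        return 1;
    for (size_t i = 0; i < count; ++i) {
        h_re[i] = (float)i;
        h_im[i] = -(float)i;
    }

    float *d_re = NULL, *d_im = NULL;
    cufftComplex *d_c = NULL;
    cudaMalloc((void **)&d_re, count * sizeof(float));
    cudaMalloc((void **)&d_im, count * sizeof(float));
    cudaMalloc((void **)&d_c, count * sizeof(cufftComplex));

    cudaMemcpy(d_re, h_re, count * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_im, h_im, count * sizeof(float), cudaMemcpyHostToDevice);

    // Round the grid up so every element is covered when n is not a multiple of 16.
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    interleave<<<grid, block>>>(d_re, d_im, d_c, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_c, d_c, count * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

    size_t mismatches = 0;
    for (size_t i = 0; i < count; ++i)
        if (h_c[i].x != h_re[i] || h_c[i].y != h_im[i])
            ++mismatches;
    printf("mismatches: %zu\n", mismatches);

    cudaFree(d_re);
    cudaFree(d_im);
    cudaFree(d_c);
    free(h_re);
    free(h_im);
    free(h_c);
    return 0;
}
```

In a real pipeline the copy back would be replaced by the CUFFT call operating on d_c directly.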

potto216 · August 9, 2007, 5:04pm

Thanks for the idea! We’ll try it and post our results.

XFer · July 13, 2008, 10:54am

Hello,

[quote name=‘mfatica’ date=‘Aug 6 2007, 11:26 PM’]

If your data is single precision, you could do the interleaving on the card.

I would use the standard cudaMemcpy to transfer the real and imaginary parts to the card and then call the kernel

static __global__ void  interleave(float *ar , float *ai, cufftComplex *c, int N)

[/quote]

this is very interesting.

I’m using (“trying to use” is more appropriate) CUDA FFT to perform 2D Complex2Complex direct-and-inverse FFTs, on desktop-size images.

I’m trying to squeeze out all the performance I can from my G92 but it’s always too slow vs. a Core2Quad running MIT’s FFTW.

So, this interleaving hint is tempting.

But this would require:

1.) allocating the real part, the imaginary part, and the complex array on the device with cudaMalloc (takes time)
2.) transferring the host real part to the device real part, and the host imaginary part to the device imaginary part (takes even more time)
3.) executing the kernel

and then the same (backward) to get the transformed Re and Im parts back on the host.

Are you saying I would get a speed gain from this?

Thanks a lot

Fernando

Demq · July 17, 2008, 1:02am

Hi All,

I am trying to learn as much as possible before diving into the CUDA business and investing in a good graphics card. One benchmark that I am really interested in is 3D CUFFT vs. FFTW 3.x or Intel’s FFT on 20^3 (16^3, 24^3) complex-to-real and real-to-complex transforms. I have CPU benchmarks of FFTW and Intel’s FFT on an Intel E6750 (2.66 GHz Core 2 Duo) running 32-bit Linux (RHEL 5), so I was wondering how anything decent on the GPU side would compare. In principle I am interested in double-precision calculations, which require a GTX 260 or 280, but I would be grateful for any information, even on single-precision data. I am interested in both data transfer + execution time and execution-only time. I would really appreciate any help, as I need to speed up my calculations for a research project to be viable, and CUDA looks more and more attractive to me :)

Cheers,
Demq

_Big_Mac · July 17, 2008, 9:38am

You’ve probably seen this already…
http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/

XFer · July 21, 2008, 8:59pm

Hi,

I’m doing CUDA FFT vs. FFTW comparisons; I work with single precision, Complex2Complex 2D arrays.

I use multithreading for FFTW, with NThreads = 4 (quad core CPU).

I use the “float” (single precision) version of FFTW, which is considerably faster than the double-precision one.

I was getting uninspiring benchmarks (CUDA was faster only for very large arrays, that is, over 1024x1024); but then, I tweaked the code a bit: now, I do interleaving and de-interleaving on the GPU (from Re and Im to Complex, and back) and I use pinned memory on the host.
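The return path described here can be sketched as a de-interleave kernel that splits the complex result into separate Re and Im arrays on the GPU, with pinned host buffers on both ends. This is not the poster's code: the kernel name deinterleave, the array size, and the test data are assumptions, and the FFT calls that would sit between the upload and the kernel are left out to keep the example short.

```cuda
// Sketch: split interleaved complex data into Re and Im arrays on the GPU,
// using pinned host buffers for the transfers.
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical inverse of the interleave step.
__global__ void deinterleave(const cufftComplex *c, float *re, float *im, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        re[i] = c[i].x;
        im[i] = c[i].y;
    }
}

int main(void)
{
    const int count = 1024 * 1024;  // one 1024x1024 image, as in the benchmark
    const size_t cbytes = count * sizeof(cufftComplex);
    const size_t fbytes = count * sizeof(float);

    // Pinned host buffers.
    cufftComplex *h_c = NULL;
    float *h_re = NULL, *h_im = NULL;
    if (cudaMallocHost((void **)&h_c, cbytes) != cudaSuccess ||
        cudaMallocHost((void **)&h_re, fbytes) != cudaSuccess ||
        cudaMallocHost((void **)&h_im, fbytes) != cudaSuccess) {
        fprintf(stderr, "pinned allocation failed\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        h_c[i].x = (float)(i % 1000);
        h_c[i].y = -(float)(i % 1000);
    }

    cufftComplex *d_c = NULL;
    float *d_re = NULL, *d_im = NULL;
    cudaMalloc((void **)&d_c, cbytes);
    cudaMalloc((void **)&d_re, fbytes);
    cudaMalloc((void **)&d_im, fbytes);

    cudaMemcpy(d_c, h_c, cbytes, cudaMemcpyHostToDevice);
    // (A forward or inverse CUFFT call on d_c would go here.)
    const int threads = 256;
    deinterleave<<<(count + threads - 1) / threads, threads>>>(d_c, d_re, d_im, count);
    cudaMemcpy(h_re, d_re, fbytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_im, d_im, fbytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < count; ++i)
        if (h_re[i] != h_c[i].x || h_im[i] != h_c[i].y)
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_c);
    cudaFree(d_re);
    cudaFree(d_im);
    cudaFreeHost(h_c);
    cudaFreeHost(h_re);
    cudaFreeHost(h_im);
    return 0;
}
```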

Here are the benchmarks; please note that the benchmark performs a forward FFT followed by a reverse FFT. Each time (for both F and R) I transfer the data to and from the device.

Times are evaluated by performing 4 cycles in a row and averaging the total time.

CPU: Intel Core2 Quad @ 3.0 GHz, FSB 1333MHz

GPU: GeForce G92 @ 650/1625/1944

Size         FFTW      CUDA
512x512      11ms      9ms
1024x1024    46ms      23ms
2048x2048    291ms     109ms

Hope this helps.

Please note that I still use CUDA 1.1, and my motherboard is PCIe 1.1 (considerably slower data transfers: about 2.9GB/s Host2Dev, about 2.1GB/s Dev2Host).
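As a rough check on how much of the CUDA time is PCIe transfer, the quoted bandwidths can be applied to the 2048x2048 case. The estimate assumes 8 bytes per single-precision complex sample, 1 GB = 10^9 bytes, and that the forward and the reverse transform each move the full array once in each direction, as described in the post.

```latex
\begin{aligned}
\text{bytes per array} &= 2048 \times 2048 \times 8 = 33{,}554{,}432 \\
t_{\mathrm{H2D}} &\approx \frac{33.55 \times 10^{6}\ \mathrm{B}}{2.9 \times 10^{9}\ \mathrm{B/s}} \approx 11.6\ \mathrm{ms} \\
t_{\mathrm{D2H}} &\approx \frac{33.55 \times 10^{6}\ \mathrm{B}}{2.1 \times 10^{9}\ \mathrm{B/s}} \approx 16.0\ \mathrm{ms} \\
t_{\mathrm{transfer}} &\approx 2 \times (11.6 + 16.0)\ \mathrm{ms} \approx 55\ \mathrm{ms}
\end{aligned}
```

Under those assumptions, roughly half of the 109ms measured for 2048x2048 would be bus time rather than FFT time.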

Fernando

Demq · July 27, 2008, 10:35pm

Thanks a lot for the information. Eventually I will need double-precision FFTs combined with intensive algebraic transformations, but I have a positive enough impression of CUDA by now to start experimenting with it and maybe convince my boss to get a GTX 280 for the double precision.

Best regards,
Hrayr