Estimating FFT Performance

Can anyone tell me how to estimate, fairly accurately, the time required to do an FFT in CUDA?

If I calculate (within a factor of 2 or so) the number of floating-point operations required to do a 512x512 FFT, then implement it in CUDA and time it, it takes almost 100 times as long as my estimate. Of course, my estimate does not include the operations required to move things around in memory or any other overhead taken care of by cufftExecR2C.

This is JUST for the cufftExecR2C call; it does not include copying data from host to device, calculating the plan, etc.

I am new to CUDA, so please excuse me if this is a totally stupid question.

I meant to ask a couple of other questions in my post:

  1. Is it reasonable that cufftExecR2C would take 100x longer than a simple estimate, just due to overhead and memory access?

  2. If not, any ideas what I could be doing wrong?

I am not doing anything except what appears in many cuFFT examples: allocate and initialize real data on the device, create a plan, and call cufftExecR2C().
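Roughly, the sequence looks like this (a minimal sketch only, with placeholder sizes, a zeroed stand-in for the input data, and no error checking):

#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int NX = 512, NY = 512;              // placeholder transform size

    cufftReal    *d_data;
    cufftComplex *d_spectrum;

    // R2C output has NX * (NY/2 + 1) complex values
    cudaMalloc((void **)&d_data,     sizeof(cufftReal)    * NX * NY);
    cudaMalloc((void **)&d_spectrum, sizeof(cufftComplex) * NX * (NY / 2 + 1));
    cudaMemset(d_data, 0, sizeof(cufftReal) * NX * NY);   // stand-in for real input data

    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_R2C);     // create the plan once

    cufftExecR2C(plan, d_data, d_spectrum);    // launch the forward transform
    cudaDeviceSynchronize();                   // wait for the device to finish

    cufftDestroy(plan);
    cudaFree(d_spectrum);
    cudaFree(d_data);
    return 0;
}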

Yes, it is reasonable. This one gets a lot of people. The first CUDA call creates a driver context and binds it to a GPU device, which is a very heavyweight operation. Subsequent calls should be much faster.
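One common way to keep that one-time cost out of a benchmark is to force context creation and do a throwaway transform before starting the timer. A sketch, continuing the names from the snippet above:

cudaFree(0);                               // any CUDA call here forces context creation up front
cufftExecR2C(plan, d_data, d_spectrum);    // throwaway warm-up transform (the first call can be slower)
cudaDeviceSynchronize();

// ... start timing cufftExecR2C only after this point ...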

Every call exhibits this problem (still not sure it IS a problem). I’m running this in a host thread. At the beginning of the thread function, I allocate all of my memory on the host and device, set up my FFT plan, copy my kernel into the device, etc.

Then in each iteration through the thread loop, I simply copy data from the host to the device, call cufftExecR2C(), etc.

I don’t free anything up until the thread exits.

So the first call to cufftExecR2C should take about the same time as all the others, yes?
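In outline, the per-iteration work is something like this (a sketch only: NUM_ITERS and h_data are made up, and NX, NY, plan, d_data, and d_spectrum are the names from the earlier sketch):

const int NUM_ITERS = 100;                                              // made-up iteration count
cufftReal *h_data = (cufftReal *)calloc(NX * NY, sizeof(cufftReal));   // stand-in host buffer

for (int i = 0; i < NUM_ITERS; ++i)
{
    cudaMemcpy(d_data, h_data, sizeof(cufftReal) * NX * NY,
               cudaMemcpyHostToDevice);         // host -> device copy each iteration
    cufftExecR2C(plan, d_data, d_spectrum);     // reuse the plan created at thread start
    cudaDeviceSynchronize();
    // ... use d_spectrum ...
}

// the plan, the device memory, and h_data are freed only when the thread exits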

Would you mind posting the source code, assuming that this is a simple benchmark and not something sensitive?

It’s easiest to start with the CUDA SDK example convolutionFFT2D and change just a couple of lines…

If you look at the attached file, line 179, all I did was move the cutStopTimer() call to just after the cufftExecR2C() call, to time JUST the FFT:

cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedData, (cufftComplex *)d_DataSpectrum) );  // the call being timed
cutilCheckError( cutStopTimer(hTimer) );       // timer stopped immediately after the FFT call
double gpuTime = cutGetTimerValue(hTimer);     // elapsed time in ms

On my GeForce 9800 GTX this takes around 1 ms for a 512x512. My estimate of the total work is around 2.6 Mflop. At the ~400 GFLOP/s spec (which I extrapolated from a different set of specs for the ION), this should take around 10 us. I’ve also timed this with a Windows multimedia timer, with the same results.

I did a lot of rounding here, but I think the order of magnitude should be close.
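Incidentally, the same call can also be timed with CUDA events instead of the SDK cut timers; a sketch, reusing the names from the snippet above:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedData, (cufftComplex *)d_DataSpectrum) );
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                    // block until the FFT has actually finished

float fftMs = 0.0f;
cudaEventElapsedTime(&fftMs, start, stop);     // device-side elapsed time, in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);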

Thanks very much for your help!

Oops…

I just tested the file that I uploaded, and that implementation runs about as I would expect: ~10 us scaled to a 512x512 (it actually uses a 1kx1k).

So, the difference is in MY code, which I hesitated to upload not because it is sensitive, but because it has a bunch of infrastructure wrapped around it.

I will try to upload something a little later.

Shouldn’t this be more like 5 GFLOP/s, not 2.5 Mflop? For a 2D 512x512 FFT you have ~N log N ops (512 x 512 x 18) in ~1 ms = ~5 GFLOP/s. That is not as good as a 9800 should be able to do, but it is a bit closer. According to these results (http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/), we are missing about a factor of 5x, not 5000x.

Yes, those numbers look right, and agree with mine to within 2x (I used 512 x 512 x 9). My 2.6 Mflop was the total operation count; divide by 1 ms and you get 2.6 GFLOP/s.

So even if we go with 5 GFLOP/s, we’re off by almost 100x, not 5x. I could understand 5x.

Also, as I mentioned in my previous post, the CUDA SDK example code performs almost exactly as I would expect. So my problem definitely has to do with something else in my code, probably completely unrelated to CUDA.

Thanks for your help, Gregory!

After noting that the original CUDA SDK example performed as expected, I realized that I had added a cudaThreadSynchronize() call immediately after the cufftExecR2C() call.

I would THINK that even though this is probably not necessary, it should simply return immediately, taking no extra time. But it DOES take almost a millisecond to return.

Any idea why?

Anyway, the FFT now executes with pretty predictable performance.

1k x 1k x 10 x 2 ops / (0.15e-3 s) = ~130 GFLOP/s, compared to a ~400 GFLOP/s max for my GeForce 9800 GTX, which is roughly 1/3 efficiency. Quite acceptable.
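For reference, a sketch of one way to separate the two times involved here, reusing the names and cut timer calls from the SDK snippet above: cufftExecR2C only queues the transform and returns, while cudaThreadSynchronize blocks until the device has finished whatever was queued, so the two measurements below bracket the launch overhead and the full execution time respectively.

cutilCheckError( cutResetTimer(hTimer) );
cutilCheckError( cutStartTimer(hTimer) );
cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedData, (cufftComplex *)d_DataSpectrum) );
cutilCheckError( cutStopTimer(hTimer) );
double launchMs = cutGetTimerValue(hTimer);    // time for the call to return (work only queued)

cutilCheckError( cutResetTimer(hTimer) );
cutilCheckError( cutStartTimer(hTimer) );
cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedData, (cufftComplex *)d_DataSpectrum) );
cudaThreadSynchronize();                       // blocks until the device has finished the FFT
cutilCheckError( cutStopTimer(hTimer) );
double execMs = cutGetTimerValue(hTimer);      // launch plus actual execution time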