Estimating FFT Performance

Can anyone tell me how to estimate, fairly accurately, the time required to do an FFT in CUDA?

If I calculate (within a factor of 2 or so) the number of floating-point operations required to do a 512x512 FFT, then implement it in CUDA and time it, it takes almost 100 times as long as my estimate. Of course, my estimate does not include the operations required to move things around in memory or any other overhead taken care of by cufftExecR2C.

This is JUST for the cufftExecR2C call; it does not include copying data from host to device, calculating the plan, etc.

I am new to CUDA, so please excuse me if this is a totally stupid question.

I meant to ask a couple of other questions in my post:

  1. Is it reasonable that cufftExecR2C would take 100x longer than a simple estimate, just due to overhead and memory access?

  2. If not, any ideas what I could be doing wrong?

I am not doing anything except what appears in many cuFFT examples: allocate and initialize real data on the device, create a plan, and call cufftExecR2C().
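Roughly, the sequence looks like this (a minimal sketch only, with placeholder sizes, a zeroed stand-in for the input data, and no error checking):

#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int NX = 512, NY = 512;              // placeholder transform size

    cufftReal    *d_data;
    cufftComplex *d_spectrum;

    // R2C output has NX * (NY/2 + 1) complex values
    cudaMalloc((void **)&d_data,     sizeof(cufftReal)    * NX * NY);
    cudaMalloc((void **)&d_spectrum, sizeof(cufftComplex) * NX * (NY / 2 + 1));
    cudaMemset(d_data, 0, sizeof(cufftReal) * NX * NY);   // stand-in for real input data

    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_R2C);     // create the plan once

    cufftExecR2C(plan, d_data, d_spectrum);    // launch the forward transform
    cudaDeviceSynchronize();                   // wait for the device to finish

    cufftDestroy(plan);
    cudaFree(d_spectrum);
    cudaFree(d_data);
    return 0;
}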

Yes, it is reasonable. This one gets a lot of people. The first CUDA call creates a driver context and binds it to a GPU device, which is a very heavyweight operation. Subsequent calls should be much faster.
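One common way to keep that one-time cost out of a benchmark is to force context creation and do a throwaway transform before starting the timer. A sketch, continuing the names from the snippet above:

cudaFree(0);                               // any CUDA call here forces context creation up front
cufftExecR2C(plan, d_data, d_spectrum);    // throwaway warm-up transform (the first call can be slower)
cudaDeviceSynchronize();

// ... start timing cufftExecR2C only after this point ...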

Every call exhibits this problem (still not sure it IS a problem). I’m running this in a host thread. At the beginning of the thread function, I allocate all of my memory on the host and device, set up my FFT plan, copy my kernel into the device, etc.

Then in each iteration through the thread loop, I simply copy data from the host to the device, call cufftExecR2C(), etc.

I don’t free anything up until the thread exits.

So the first call to cufftExecR2C should take about the same time as all the others, yes?
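In outline, the per-iteration work is something like this (a sketch only: NUM_ITERS and h_data are made up, and NX, NY, plan, d_data, and d_spectrum are the names from the earlier sketch):

const int NUM_ITERS = 100;                                              // made-up iteration count
cufftReal *h_data = (cufftReal *)calloc(NX * NY, sizeof(cufftReal));   // stand-in host buffer

for (int i = 0; i < NUM_ITERS; ++i)
{
    cudaMemcpy(d_data, h_data, sizeof(cufftReal) * NX * NY,
               cudaMemcpyHostToDevice);         // host -> device copy each iteration
    cufftExecR2C(plan, d_data, d_spectrum);     // reuse the plan created at thread start
    cudaDeviceSynchronize();
    // ... use d_spectrum ...
}

// the plan, the device memory, and h_data are freed only when the thread exits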

Would you mind posting the source code, assuming that this is a simple benchmark and not something sensitive?

It’s easiest to start with the CUDA SDK example convolutionFFT2D and change just a couple of lines…

If you look at the attached file, line 179, all I did was move the cutStopTimer() call to just after the cufftExecR2C() call, to time JUST the FFT:

cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedData, (cufftComplex *)d_DataSpectrum) );  // the call being timed
cutilCheckError( cutStopTimer(hTimer) );       // timer stopped immediately after the FFT call
double gpuTime = cutGetTimerValue(hTimer);     // elapsed time in ms

On my GeForce 9800 GTX this takes around 1 ms for a 512x512. My estimate of the total work is around 2.6 Mflop. At the ~400 GFLOP/s spec (which I extrapolated from a different set of specs for the ION), this should take around 10 us. I’ve also timed this with a Windows multimedia timer, with the same results.

I did a lot of rounding here, but I think the order of magnitude should be close.
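Incidentally, the same call can also be timed with CUDA events instead of the SDK cut timers; a sketch, reusing the names from the snippet above:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedData, (cufftComplex *)d_DataSpectrum) );
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                    // block until the FFT has actually finished

float fftMs = 0.0f;
cudaEventElapsedTime(&fftMs, start, stop);     // device-side elapsed time, in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);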

Thanks very much for your help!

Oops…

I just tested the file that I uploaded, and that implementation runs about as I would expect: ~10 us scaled to a 512x512 (it actually uses a 1kx1k).

So, the difference is in MY code, which I hesitated to upload not because it is sensitive, but because it has a bunch of infrastructure wrapped around it.

I will try to upload something a little later.

Shouldn’t this be more like 5 GFLOP/s, not 2.5 Mflop? For a 2D 512x512 FFT you have ~N log N ops (512 x 512 x 18) in ~1 ms = ~5 GFLOP/s. That is not as good as a 9800 should be able to do, but it is a bit closer. According to these results (http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/), we are missing about a factor of 5x, not 5000x.

Yes, those numbers look right, and agree with mine to within 2x (I used 512 x 512 x 9). My 2.6 Mflop was the total operation count; divide by 1 ms and you get 2.6 GFLOP/s.

So even if we go with 5 GFLOP/s, we’re off by almost 100x, not 5x. I could understand 5x.

Also, as I mentioned in my previous post, the CUDA SDK example code performs almost exactly as I would expect. So my problem definitely has to do with something else in my code, probably completely unrelated to CUDA.

Thanks for your help, Gregory!

After noting that the original CUDA SDK example performed as expected, I realized that I had added a cudaThreadSynchronize() call immediately after the cufftExecR2C() call.

I would THINK that even though this is probably not necessary, it should simply return immediately, taking no extra time. But it DOES take almost a millisecond to return.

Any idea why?

Anyway, the FFT now executes with pretty predictable performance.

1k x 1k x 10 x 2 ops / (0.15e-3 s) = ~130 GFLOP/s, compared to a ~400 GFLOP/s max for my GeForce 9800 GTX, which is roughly 1/3 efficiency. Quite acceptable.
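For reference, a sketch of one way to separate the two times involved here, reusing the names and cut timer calls from the SDK snippet above: cufftExecR2C only queues the transform and returns, while cudaThreadSynchronize blocks until the device has finished whatever was queued, so the two measurements below bracket the launch overhead and the full execution time respectively.

cutilCheckError( cutResetTimer(hTimer) );
cutilCheckError( cutStartTimer(hTimer) );
cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedData, (cufftComplex *)d_DataSpectrum) );
cutilCheckError( cutStopTimer(hTimer) );
double launchMs = cutGetTimerValue(hTimer);    // time for the call to return (work only queued)

cutilCheckError( cutResetTimer(hTimer) );
cutilCheckError( cutStartTimer(hTimer) );
cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedData, (cufftComplex *)d_DataSpectrum) );
cudaThreadSynchronize();                       // blocks until the device has finished the FFT
cutilCheckError( cutStopTimer(hTimer) );
double execMs = cutGetTimerValue(hTimer);      // launch plus actual execution time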