Can anyone tell me how to fairly accurately estimate the time required to do an fft in CUDA?
If I calculate (within a factor of 2 or so) the number of floating-point operations required to do a 512x512 fft, implement it in CUDA, and time it, it takes almost 100 times as long as my estimate. Of course, my estimate does not include the operations required to move things around in memory, or any other overhead taken care of inside cufftExecR2C.
This is JUST for the cufftExecR2C call; it does not include copying data from the host to the device, creating the plan, etc.
I am new to CUDA, so please excuse me if this is a totally stupid question.
Yes, it is reasonable. This one gets a lot of people. The first CUDA call creates a driver context and binds it to a GPU device, which is a very heavyweight operation. Subsequent calls should be much faster.
Every call exhibits this problem (still not sure it IS a problem). I’m running this in a host thread. At the beginning of the thread function, I allocate all of my memory on the host and device, set up my fft plan, copy my kernel into the device, etc.
Then in each iteration through the thread loop, I simply copy data from the host to the device, call cufftExecR2C(), etc.
I don’t free anything up until the thread exits.
So the first call to cufftExecR2C should take the same time as all the others, yes?
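For what it’s worth, here is a minimal sketch of how that measurement is usually made (sizes match the 512x512 case discussed here; names are illustrative and error checking is omitted). The key point is that cufftExecR2C launches its kernels asynchronously, so a host-side timer read right after the call can measure the wrong thing unless you synchronize first; CUDA events bracket the transform on the device itself:

```c
/* Sketch: timing cufftExecR2C with CUDA events (illustrative only). */
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>

int main(void)
{
    const int NX = 512, NY = 512;

    cufftReal    *d_in;
    cufftComplex *d_out;
    cudaMalloc((void **)&d_in,  sizeof(cufftReal)    * NX * NY);
    cudaMalloc((void **)&d_out, sizeof(cufftComplex) * NX * (NY / 2 + 1));

    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_R2C);

    /* One warm-up transform so context creation and first-launch
       overhead stay out of the measurement. */
    cufftExecR2C(plan, d_in, d_out);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecR2C(plan, d_in, d_out);  /* launch is asynchronous */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       /* block until the transform finishes */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cufftExecR2C: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

A plain host timer (like the Windows multimedia timer) only gives an equivalent answer if you call cudaDeviceSynchronize() before reading it; otherwise you may be timing just the launch, or accidentally folding in unrelated work that the synchronization point waits on.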
On my GeForce 9800 GTX this takes around 1 ms for a 512x512. My estimate is around 2.6 MFLOP of total operations. At the ~400 GFLOP/s spec (which I extrapolated from a different set of specs for the ION), this should take around 10 us. I’ve also timed this with a Windows multimedia timer, with the same results.
I did a lot of rounding here, but I think the order of magnitude should be close.
Shouldn’t this be more like 5 GFLOP/s, not 2.5 MFLOP/s? For a 2D 512x512 fft you have ~N·log2(N) ops (512·512·18) in ~1 ms = ~5 GFLOP/s? That is not as good as a 9800 should be able to do, but it is a bit closer. According to these results, http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/, we are missing about a factor of 5x, not 5000x.
Yes, those numbers look right, and agree with mine to within 2x (I used 512x512x9). My 2.6 MFLOP figure was total operations; divide by 1 ms and you get 2.6 GFLOP/s.
So even if we go with 5 GFLOP/s, we’re off by almost 100x, not 5x. I could understand 5x.
Also, as I mentioned in my previous post, the CUDA SDK example code performs almost exactly as I would expect. So my problem definitely has something to do with something else in my code, probably completely unrelated to CUDA.