fft2 times

I need to run a large number of 512 x 512 complex fft2s, and I'd like to verify my running times, because I hope I'm timing them wrong: the numbers work out to only about 20% of the device's peak GFLOPS. Does anyone have benchmarks for the GTX 570 and/or the Tesla 2075 or 2090?

I'm currently seeing 50 ms for 512 FFTs, or about 0.1 ms per FFT (this is with CUDA 4.1 RC2).
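
For anyone who wants to check the arithmetic behind the "about 20%" figure, here is a rough sketch. It assumes the usual benchmark convention of counting 5 * N * log2(N) flops for a complex FFT of N total points, and it assumes single precision with a peak of roughly 1405 GFLOPS (my understanding of the GTX 570 spec; swap in the right number for your card). The function names are just made up for illustration.

```python
import math

def fft2_flops(n_rows, n_cols):
    """Nominal flop count for one complex 2D FFT.

    Assumption: the conventional 5 * N * log2(N) estimate, with N the
    total number of points. This is a benchmarking convention, not the
    exact operation count of any particular library.
    """
    n = n_rows * n_cols
    return 5.0 * n * math.log2(n)

def achieved_gflops(n_rows, n_cols, batch, elapsed_s):
    """Nominal GFLOPS for `batch` transforms completed in `elapsed_s` seconds."""
    return fft2_flops(n_rows, n_cols) * batch / elapsed_s / 1e9

# Numbers from the post: 512 transforms of 512 x 512 in 50 ms.
gflops = achieved_gflops(512, 512, batch=512, elapsed_s=50e-3)

# Assumption: approximate single-precision peak, placeholder value.
ASSUMED_PEAK_GFLOPS = 1405.0

print(f"achieved: {gflops:.1f} GFLOPS")
print(f"fraction of assumed peak: {gflops / ASSUMED_PEAK_GFLOPS:.1%}")
```

With these assumptions the result comes out near 240 GFLOPS, which is in the same ballpark as the 20% estimate. Keep in mind that FFTs at this size tend to be limited by memory bandwidth rather than arithmetic, so a fraction of peak flops well below 100% is not necessarily a sign of a timing mistake.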

Thanks