2D FFT performance on K40/K80

Can anyone point me in the direction of performance figures (specifically wall time) for doing 4K (3840 x 2160) and 8K (7680×4320) 2D FFTs in 8 bit and single precision with cuFFT, ideally on the Tesla K40 or K80?

This may be of interest:

[url]http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf[/url]

What do you mean by “8 bit”? CUFFT offers single-precision and double-precision FFTs, both in real and complex variants. While performance data for 1D FFTs for K40 is readily available from NVIDIA ([url]http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf[/url]), I don’t know of a document that gives a good overview of 2D FFT performance with CUFFT.

Thanks for the replies.
The 8-bit figures aren’t that important to me. I was just wondering how the performance of a single-precision FFT changes when the input is 8-bit rather than single precision. However, I can restrict my interest to the performance of single-precision complex->complex FFTs. I’ve been struggling to find performance data for 2D FFTs at these resolutions too.

So far I have only found performance data for small, batched, 2D FFTs in a 2015 paper. Nothing regarding large 2D FFTs yet.

I think large 2D FFTs should approach the performance of a single batched 1D FFT. The previously linked performance report indicates on slide 4:

“1D Complex, Batched FFTs
Used in Signal Processing and as a Foundation for 2D and 3D FFTs”

Am I correct to assume that the 1D batched performance gives an upper bound on the 2D performance I can expect?

Yes, I think that is what it means. Since a 2D FFT is built from batched 1D FFTs along each dimension, 1D batched performance provides an upper bound for 2D performance.
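
To make that relationship concrete, here is a minimal Python sketch (not cuFFT code) of the row-column decomposition: a 2D DFT computed as one batch of 1D transforms over the rows followed by one batch over the columns. The function names are made up for illustration, and a naive O(n^2) DFT stands in for a real FFT. How cuFFT implements 2D transforms internally is not stated in this thread; the only point is that a 2D transform does at least the work of its 1D batches, and the column pass typically adds strided memory access on top.

```python
import cmath


def dft_1d(x):
    """Naive O(n^2) forward DFT of one sequence (stand-in for a real 1D FFT)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_2d_row_column(a):
    """2D DFT as two batched 1D passes: all rows first, then all columns."""
    rows_done = [dft_1d(row) for row in a]         # batch of 1D DFTs over rows
    cols = [list(col) for col in zip(*rows_done)]  # transpose so columns become rows
    cols_done = [dft_1d(col) for col in cols]      # batch of 1D DFTs over columns
    return [list(row) for row in zip(*cols_done)]  # transpose back to row-major


def dft_2d_direct(a):
    """Direct 2D DFT from the definition, used only as a reference result."""
    h, w = len(a), len(a[0])
    return [
        [
            sum(
                a[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            for v in range(w)
        ]
        for u in range(h)
    ]


def max_abs_diff(p, q):
    """Largest element-wise difference between two equally sized 2D lists."""
    return max(
        abs(p[i][j] - q[i][j]) for i in range(len(p)) for j in range(len(p[0]))
    )


# Small hypothetical 3x4 complex input, just to compare the two routes.
image = [[complex(r * 4 + c, (r - c) % 3) for c in range(4)] for r in range(3)]
print(max_abs_diff(dft_2d_row_column(image), dft_2d_direct(image)))
```

The printed difference should be at floating-point rounding level, showing that the two batched 1D passes reproduce the full 2D transform.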

I can provide some numbers generated with gearshifft (an FFT benchmark suite for accelerators, still in development, see GitHub) for each transform type, in both single and double precision.

[url]http://pastebin.com/cA4hGGpE[/url] (results, average of 5 benchmark runs)
[url]http://pastebin.com/tPYHcQ0X[/url] (raw csv with all benchmark runs)

It looks like the out-of-place and in-place real transforms are the fastest ones in terms of TimeToSolution, which includes allocation, memory copies, FFT and inverse FFT, and cuFFT setup and cleanup: ~15 ms for 3840x2160 and ~60 ms for 7680x4320.

Let me know if you need more information.

Edit: Run on K80/CUDA 7.5, device initialization time excluded.
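
As a rough sanity check on those two timings, the Python sketch below works out the element counts and buffer sizes for both resolutions. It assumes 8 bytes per element (single-precision complex, two 32-bit floats); the helper name is made up, and the times are the approximate values quoted above, not new measurements.

```python
BYTES_PER_COMPLEX64 = 8  # assumption: two 32-bit floats per complex element


def frame_stats(width, height, bytes_per_element=BYTES_PER_COMPLEX64):
    """Return (number of points, buffer size in bytes) for one frame."""
    points = width * height
    return points, points * bytes_per_element


points_4k, bytes_4k = frame_stats(3840, 2160)
points_8k, bytes_8k = frame_stats(7680, 4320)

size_ratio = points_8k / points_4k  # how much more data the 8K frame holds
time_ratio = 60 / 15                # approximate times quoted above, in ms

print(f"4K: {points_4k} points, {bytes_4k / 2**20:.1f} MiB as complex64")
print(f"8K: {points_8k} points, {bytes_8k / 2**20:.1f} MiB as complex64")
print(f"size ratio: {size_ratio}, time ratio: {time_ratio}")
```

The 8K frame has four times as many points as the 4K frame, which matches the roughly fourfold gap between the two quoted times.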

Brilliant - thanks. That’s exactly the information I was looking for!