2D FFT performance on K40/K80

cash4alex · September 3, 2016, 7:51pm

Can anyone point me in the direction of performance figures (specifically wall time) for doing 4K (3840 x 2160) and 8K (7680×4320) 2D FFTs in 8 bit and single precision with cuFFT, ideally on the Tesla K40 or K80?

Robert_Crovella · September 3, 2016, 8:09pm

This may be of interest:

[url]http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf[/url]

njuffa · September 3, 2016, 8:09pm

What do you mean by “8 bit”? CUFFT offers single-precision and double-precision FFTs, both in real and complex variants. While performance data for 1D FFTs for K40 is readily available from NVIDIA ([url]http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf[/url]), I don’t know of a document that gives a good overview of 2D FFT performance with CUFFT.

cash4alex · September 3, 2016, 8:27pm

Thanks for the replies.
The 8-bit figures aren’t that important to me. I was just wondering how performance changes when a single-precision FFT is given 8-bit input rather than single-precision input. However, I can restrict my interest to the performance of single-precision complex->complex FFTs. I’ve been struggling to find performance data for 2D FFTs at these resolutions too.

njuffa · September 3, 2016, 8:32pm

So far I have only found performance data for small, batched, 2D FFTs in a 2015 paper. Nothing regarding large 2D FFTs yet.

Robert_Crovella · September 4, 2016, 3:43am

I think large 2D FFTs should approach the performance of a single batched 1D FFT. The previously linked performance report indicates on slide 4:

“1D Complex, Batched FFTs
Used in Signal Processing and as a Foundation for 2D and 3D FFTs”
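To illustrate why batched 1D FFTs are the foundation for 2D FFTs, here is a minimal sketch of the row-column decomposition in plain Python. It is not how cuFFT is implemented internally; the function names (dft_1d, dft_2d_row_column, dft_2d_direct) are made up for this example, and a naive O(n^2) DFT stands in for a real FFT so that the code stays self-contained.

```python
import cmath

def dft_1d(x):
    """Naive 1D DFT of a sequence (stand-in for a real 1D FFT)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]

def dft_2d_row_column(a):
    """2D DFT computed as two batches of 1D DFTs: all rows, then all columns."""
    rows = [dft_1d(row) for row in a]                 # batch 1: one transform per row
    cols = [dft_1d(list(col)) for col in zip(*rows)]  # batch 2: one transform per column
    return [list(row) for row in zip(*cols)]          # transpose back to row-major order

def dft_2d_direct(a):
    """2D DFT straight from the definition, used only as a reference."""
    h, w = len(a), len(a[0])
    return [
        [
            sum(
                a[y][x] * cmath.exp(-2j * cmath.pi * (u * x / w + v * y / h))
                for y in range(h)
                for x in range(w)
            )
            for u in range(w)
        ]
        for v in range(h)
    ]

if __name__ == "__main__":
    image = [[1, 2, 3, 4], [0, -1, 2, 5], [7, 0, 0, 1]]
    fast = dft_2d_row_column(image)
    ref = dft_2d_direct(image)
    worst = max(abs(f - r) for fr, rr in zip(fast, ref) for f, r in zip(fr, rr))
    print("largest deviation from the direct 2D DFT:", worst)
```

The second batch walks the data along columns, which on a GPU means strided access or an explicit transpose. That extra data movement is one reason a large 2D transform may fall short of the pure 1D batched figures.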

njuffa · September 4, 2016, 7:45am

Am I correct to assume that estimating 2D performance from 1D performance provides an upper bound for the estimated 2D performance?

Robert_Crovella · September 4, 2016, 2:11pm

Yes, I think that is what it means. In fact, 1D batched performance provides an upper bound for 2D performance.
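A rough sketch of how such an estimate could be made: treat the W x H transform as H transforms of length W followed by W transforms of length H, and divide the nominal flop count by the 1D batched throughput for each length. The 5*N*log2(N) flop count is the usual convention for reporting complex FFT throughput. The throughput numbers passed in below are placeholders, not measured values; real figures for lengths 3840 and 2160 would have to be read from a performance report or measured. Since the estimate ignores transposes and strided access, the resulting time is a lower bound (equivalently, an upper bound on performance).

```python
import math

def fft_flops_1d(n, batch):
    """Nominal flop count for `batch` complex 1D FFTs of length n (5*n*log2(n) each)."""
    return 5.0 * n * math.log2(n) * batch

def estimate_2d_time_ms(width, height, gflops_rows, gflops_cols):
    """Lower-bound time for a width x height complex 2D FFT, in milliseconds.

    gflops_rows: 1D batched throughput for transforms of length `width`
    gflops_cols: 1D batched throughput for transforms of length `height`
    """
    row_pass_s = fft_flops_1d(width, height) / (gflops_rows * 1e9)
    col_pass_s = fft_flops_1d(height, width) / (gflops_cols * 1e9)
    return (row_pass_s + col_pass_s) * 1e3

if __name__ == "__main__":
    PLACEHOLDER_GFLOPS = 100.0  # hypothetical value, NOT a measured K40/K80 figure
    for w, h in [(3840, 2160), (7680, 4320)]:
        t = estimate_2d_time_ms(w, h, PLACEHOLDER_GFLOPS, PLACEHOLDER_GFLOPS)
        print(f"{w}x{h}: at least {t:.2f} ms at {PLACEHOLDER_GFLOPS} GFLOP/s")
```

Note that the two passes together add up to 5*N*log2(N) flops with N = W*H, so the estimate is consistent with the flop count of a single transform of the full 2D size.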

tdd11235813 · September 5, 2016, 7:26pm

I can provide some numbers generated with gearshifft (an FFT benchmark suite for accelerators, currently in development; see GitHub) for each transform type, in single and double precision.

[url]http://pastebin.com/cA4hGGpE[/url] (results, average of 5 benchmark runs)
[url]http://pastebin.com/tPYHcQ0X[/url] (raw csv with all benchmark runs)

It looks like the out-of-place and in-place real transforms are the fastest with respect to TimeToSolution, which includes allocation, memcopy, FFT and inverse FFT, and cuFFT setup and cleanup: ~15 ms for 3840x2160 and ~60 ms for 7680x4320.

Let me know if you need more information.

Edit: Run on K80/CUDA 7.5, device initialization time excluded.
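For context on the sizes involved, the short sketch below works out element counts and buffer sizes for the two resolutions. It is plain arithmetic, assuming 8 bytes per single-precision complex element and 4 bytes per single-precision real element; the helper name buffer_bytes is made up for this example. The 8K frame has four times as many elements as the 4K frame, which is in line with the roughly fourfold difference between the two timings quoted above.

```python
def buffer_bytes(width, height, bytes_per_element):
    """Size in bytes of one width x height buffer."""
    return width * height * bytes_per_element

SIZES = {"4K": (3840, 2160), "8K": (7680, 4320)}

if __name__ == "__main__":
    for name, (w, h) in SIZES.items():
        elements = w * h
        complex_mb = buffer_bytes(w, h, 8) / 1e6  # single-precision complex
        real_mb = buffer_bytes(w, h, 4) / 1e6     # single-precision real
        print(f"{name}: {elements} elements, "
              f"{complex_mb:.1f} MB complex, {real_mb:.1f} MB real")
```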

cash4alex · September 5, 2016, 7:39pm

Brilliant - thanks. That’s exactly the information I was looking for!
