To start, I’m a physicist who’s new to GPU computing. Also, this is my first ever forum post, so please have mercy. I’m working on simulations that involve taking many FFTs and iFFTs of 2D arrays of complex numbers, in Python. I adopted PyTorch for this as it seemed like the easiest choice at the time (last year), and its FFT functions have proven substantially faster. Currently I’m using an Nvidia GeForce 980 Ti, but my research group recently received some funds to potentially upgrade this.
To be specific, PyTorch’s FFT functions require input arrays of shape (N, N, 2), with the third dimension holding the real and imaginary parts of the data, respectively. These arrays contain double-precision floats in our case, and typically N <= 1024.
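For reference, here is a minimal sketch of the call pattern I mean, assuming the torch.fft/torch.ifft functions from PyTorch versions before 1.8, which take tensors packed this way plus a signal_ndim argument:

```python
import torch

# Complex data packed as (N, N, 2): the last dimension holds (real, imaginary).
# Assumes the pre-1.8 torch.fft API, which takes a signal_ndim argument.
N = 1024
x = torch.randn(N, N, 2, dtype=torch.float64, device="cuda")

X = torch.fft(x, 2)    # forward 2D FFT over the first two dimensions
y = torch.ifft(X, 2)   # inverse 2D FFT; recovers x up to rounding error
```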
The Question
What I’m wondering is: which GPUs are best suited for this type of computation? I’m not well versed in computing hardware, so which specifications should I be paying the most attention to here? Naively I’d assume the number of CUDA cores and a large cache are useful, but I feel like I’m missing something. For instance, how substantial would the performance gain be from one of the GPUs with full double-precision throughput, given that the 980 Ti’s double-precision rate is heavily cut down?
This is a bit tricky. Generally speaking, large FFTs on GPUs tend to be limited by memory bandwidth.
However, in recent GPU generations the double-precision throughput of non-compute (consumer) GPUs is reduced to such an extent that double-precision FFTs should be compute-limited, though I have not verified that. The question is: how severely compute-limited? If memory serves, cuFFT on compute GPUs can achieve about 1/8 of the theoretical DP (double-precision) compute throughput for its fastest FFTs; that is with DP running at half the throughput of SP (single precision).
However, DP throughput on modern non-compute GPUs is only 1/32 of SP throughput. This suggests that moving from a GTX 980 Ti to a compute GPU with full DP support (at 1/2 the SP rate) could roughly double FFT throughput, assuming the same memory bandwidth as the GTX 980 Ti (336 GB/sec). Alternatively, you could simply install a newer consumer GPU with its accompanying increase in bandwidth and DP throughput.
For example, an RTX 2080 Ti (a US$ 1200 investment) would clock in at 420 DP GFLOPS and 616 GB/sec vs 190 DP GFLOPS and 336 GB/sec for the GTX 980 Ti.
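To make the hand-waving slightly more concrete, here is a back-of-envelope roofline sketch. The bandwidth and DP-GFLOPS figures are the ones quoted above; the two-sweep memory model and the 1/8-of-peak cuFFT efficiency factor are my assumptions (the latter extrapolated from compute GPUs), so treat the output as an illustration of the reasoning, not a prediction:

```python
import math

# Back-of-envelope roofline bounds for a single 1024x1024 complex-double FFT.
# Bandwidth/DP-GFLOPS figures are those quoted above; the two-sweep memory
# model and the 1/8-of-peak cuFFT efficiency are assumptions, not measurements.

N = 1024
array_bytes = N * N * 16                  # complex128 = 8 B real + 8 B imaginary
fft_flops = 5 * N * N * math.log2(N * N)  # classic 5*n*log2(n) operation count

def bounds_us(bw_gbs, dp_gflops, sweeps=2, efficiency=1 / 8):
    t_mem = sweeps * 2 * array_bytes / (bw_gbs * 1e9)   # read + write per sweep
    t_cmp = fft_flops / (efficiency * dp_gflops * 1e9)  # fraction of DP peak
    return t_mem * 1e6, t_cmp * 1e6

for name, bw, gflops in [("GTX 980 Ti", 336, 190), ("RTX 2080 Ti", 616, 420)]:
    mem_us, cmp_us = bounds_us(bw, gflops)
    print(f"{name}: >= {mem_us:.0f} us (memory), >= {cmp_us:.0f} us (compute)")
```

On both cards the compute bound dwarfs the memory bound under these assumptions, which is the sense in which DP FFTs on consumer GPUs are compute-limited rather than bandwidth-limited.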
My reasoning here is hand-wavy; it also neglects the overhead of using GPU libraries via Python rather than directly, and the question of how severely your application is actually bottlenecked by the GPU-based FFTs. I would suggest doing detailed performance analysis at the application level, and trying some loaner GPU(s) if possible to gauge the performance impact of more capable hardware in real-life use cases.
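As a starting point for that analysis, a micro-benchmark of the FFT step in isolation might look like the sketch below (again assuming the pre-1.8 torch.fft API; substitute whatever call your application actually uses):

```python
import time
import torch

# Micro-benchmark sketch for the FFT step in isolation.
N = 1024
x = torch.randn(N, N, 2, dtype=torch.float64, device="cuda")

torch.fft(x, 2)              # warm-up: cuFFT plan creation, memory allocation
torch.cuda.synchronize()

reps = 100
t0 = time.perf_counter()
for _ in range(reps):
    y = torch.fft(x, 2)
torch.cuda.synchronize()     # kernel launches are asynchronous; wait before stopping the clock
t1 = time.perf_counter()

print(f"average per 2D DP FFT: {(t1 - t0) / reps * 1e6:.1f} us")
```

Comparing this number across a loaner GPU and your 980 Ti, alongside a whole-application timing, would tell you how much of any speedup actually reaches your simulation.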
I am assuming that your additional research funds are not so copious that you could afford to spend them on something like a Quadro GV100 (7400 DP GFLOPS, 870 GB/sec, ~$9000).