To start, I’m a physicist who’s new to GPU computing. Also, this is my first ever forum post, so please have mercy. I’m working on simulations that involve taking many FFTs and iFFTs of 2D arrays of complex numbers, in Python. I adopted PyTorch for this as it seemed like the easiest choice at the time (last year), and its FFT functions have proven substantially faster. Currently I’m using an Nvidia GeForce 980 Ti, but my research group recently received some funds to potentially upgrade this.
To be specific, PyTorch’s FFT functions require input arrays of shape (N, N, 2), with the third dimension holding the real and imaginary parts of the data, respectively. These arrays contain double-precision floats in our case, and typically N <= 1024.
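For reference, here is a minimal sketch of the call pattern I mean, assuming the torch.fft/torch.ifft functions from PyTorch versions before 1.8, which take tensors packed this way plus a signal_ndim argument:

```python
import torch

# Complex data packed as (N, N, 2): the last dimension holds (real, imaginary).
# Assumes the pre-1.8 torch.fft API, which takes a signal_ndim argument.
N = 1024
x = torch.randn(N, N, 2, dtype=torch.float64, device="cuda")

X = torch.fft(x, 2)    # forward 2D FFT over the first two dimensions
y = torch.ifft(X, 2)   # inverse 2D FFT; recovers x up to rounding error
```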
The Question
What I’m wondering is: which GPUs are best suited for this type of computation? I’m not well versed in computing hardware, so which specifications should I be paying the most attention to here? Naively I’d assume the number of CUDA cores and a large cache are useful, but I feel like I’m missing something. For instance, how substantial would the performance gain be from one of the GPUs with full double-precision throughput, given that the 980 Ti’s double-precision rate is heavily cut down?
This is a bit tricky. Generally speaking, large FFTs on GPUs tend to be limited by memory bandwidth.
However, in recent GPU generations the double-precision throughput of non-compute (consumer) GPUs is reduced to such an extent that double-precision FFTs should be compute-limited, though I have not verified that. The question is: how severely compute-limited? If memory serves, cuFFT on compute GPUs can achieve about 1/8 of the theoretical DP (double-precision) compute throughput for its fastest FFTs; that is with DP running at half the throughput of SP (single precision).
However, DP throughput on modern non-compute GPUs is only 1/32 of SP throughput. This suggests that moving from a GTX 980 Ti to a compute GPU with full DP support (at 1/2 the SP rate) could roughly double FFT throughput, assuming the same memory bandwidth as the GTX 980 Ti (336 GB/sec). Alternatively, you could simply install a newer consumer GPU with its accompanying increase in bandwidth and DP throughput.
For example, an RTX 2080 Ti (a US$ 1200 investment) would clock in at 420 DP GFLOPS and 616 GB/sec vs 190 DP GFLOPS and 336 GB/sec for the GTX 980 Ti.
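To make the hand-waving slightly more concrete, here is a back-of-envelope roofline sketch. The bandwidth and DP-GFLOPS figures are the ones quoted above; the two-sweep memory model and the 1/8-of-peak cuFFT efficiency factor are my assumptions (the latter extrapolated from compute GPUs), so treat the output as an illustration of the reasoning, not a prediction:

```python
import math

# Back-of-envelope roofline bounds for a single 1024x1024 complex-double FFT.
# Bandwidth/DP-GFLOPS figures are those quoted above; the two-sweep memory
# model and the 1/8-of-peak cuFFT efficiency are assumptions, not measurements.

N = 1024
array_bytes = N * N * 16                  # complex128 = 8 B real + 8 B imaginary
fft_flops = 5 * N * N * math.log2(N * N)  # classic 5*n*log2(n) operation count

def bounds_us(bw_gbs, dp_gflops, sweeps=2, efficiency=1 / 8):
    t_mem = sweeps * 2 * array_bytes / (bw_gbs * 1e9)   # read + write per sweep
    t_cmp = fft_flops / (efficiency * dp_gflops * 1e9)  # fraction of DP peak
    return t_mem * 1e6, t_cmp * 1e6

for name, bw, gflops in [("GTX 980 Ti", 336, 190), ("RTX 2080 Ti", 616, 420)]:
    mem_us, cmp_us = bounds_us(bw, gflops)
    print(f"{name}: >= {mem_us:.0f} us (memory), >= {cmp_us:.0f} us (compute)")
```

On both cards the compute bound dwarfs the memory bound under these assumptions, which is the sense in which DP FFTs on consumer GPUs are compute-limited rather than bandwidth-limited.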
My reasoning here is hand-wavy; it also neglects the overhead of using GPU libraries via Python rather than directly, and the question of how severely your application is actually bottlenecked by the GPU-based FFTs. I would suggest doing detailed performance analysis at the application level, and trying some loaner GPU(s) if possible to gauge the performance impact of more capable hardware in real-life use cases.
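As a starting point for that analysis, a micro-benchmark of the FFT step in isolation might look like the sketch below (again assuming the pre-1.8 torch.fft API; substitute whatever call your application actually uses):

```python
import time
import torch

# Micro-benchmark sketch for the FFT step in isolation.
N = 1024
x = torch.randn(N, N, 2, dtype=torch.float64, device="cuda")

torch.fft(x, 2)              # warm-up: cuFFT plan creation, memory allocation
torch.cuda.synchronize()

reps = 100
t0 = time.perf_counter()
for _ in range(reps):
    y = torch.fft(x, 2)
torch.cuda.synchronize()     # kernel launches are asynchronous; wait before stopping the clock
t1 = time.perf_counter()

print(f"average per 2D DP FFT: {(t1 - t0) / reps * 1e6:.1f} us")
```

Comparing this number across a loaner GPU and your 980 Ti, alongside a whole-application timing, would tell you how much of any speedup actually reaches your simulation.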
I am assuming that your additional research funds are not so copious that you could afford to spend them on something like a Quadro GV100 (7400 DP GFLOPS, 870 GB/sec, ~$9000).