I am having trouble selecting the appropriate GPU for my application, which is to take FFTs on streaming input data at high throughput. The marketing info for high-end GPUs claims >10 TFLOPS of performance and >600 GB/s of memory bandwidth, but what does a real streaming cuFFT look like? That is, how do these marketing numbers relate to real performance when you include overhead?
You don’t say what kind of FFTs you plan to perform. Generally speaking, FFTs on GPUs are bound by memory bandwidth (due to low computational intensity). The most recent FFT performance numbers reported by NVIDIA themselves that I could find are on slide 18 of this presentation, for a Tesla P100:
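To make the "low computational intensity" point concrete, here is a rough back-of-the-envelope sketch. It assumes the conventional 5·N·log2(N) flop estimate for a complex FFT, single-precision complex samples (8 bytes), and an idealized single pass over memory (one read and one write of the data); the function name and the `passes` parameter are illustrative only. The 10 TFLOPS and 600 GB/s figures are the ones quoted in the question.

```python
import math

def fft_arithmetic_intensity(n, bytes_per_sample=8, passes=1):
    """Estimated flops per byte of memory traffic for one complex FFT of length n.

    Assumptions (not measured values):
      - 5 * n * log2(n) flops, the usual convention for complex FFTs
      - each pass reads and writes every sample once
    """
    flops = 5 * n * math.log2(n)
    traffic_bytes = 2 * n * bytes_per_sample * passes
    return flops / traffic_bytes

# Flops the GPU can do per byte it can move, using the figures from the question.
machine_balance = 10e12 / 600e9

for log_n in (10, 16, 20, 24):
    ai = fft_arithmetic_intensity(2 ** log_n)
    print(f"N = 2^{log_n}: {ai:.2f} flop/byte "
          f"(machine balance {machine_balance:.1f} flop/byte)")
```

Under these assumptions the FFT's intensity stays well below the machine balance for all of these sizes, which is why memory bandwidth, not peak TFLOPS, is the number to look at. Large transforms that need more than one pass over memory (`passes > 1`) are even more bandwidth-bound.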
Slide 18 has many exclusions (see list of *'s). Input and output data are on the device, and plan creation is excluded. I intend to perform a single N-length 1D FFT on an incoming data stream. Since FFTs are limited by memory bandwidth, does that mean the Titan V can do FFTs at its total memory bandwidth of 652 GB/s?
Those are not exclusions. This is configuration data documented to aid in reproducibility. When you say “incoming data stream”, where is that data coming from exactly? If it is coming in over PCIe, you will be limited to 12.5 GB/sec (PCIe gen3 x16 link), assuming large transfer sizes. Keep in mind you will need sufficient GPU memory to store all the FFT data (input, output, FFT temporary storage).
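As a rough illustration of both points, the sketch below converts the 12.5 GB/sec PCIe figure into a sample rate and estimates the GPU memory needed for input, output, and temporary storage. It assumes single-precision complex samples (8 bytes each). The `workspace_factor` parameter is purely hypothetical: the real work-area size depends on the plan, and cuFFT can report it (for example via `cufftGetSize`).

```python
def pcie_limited_sample_rate(link_gb_per_s=12.5, bytes_per_sample=8):
    """Samples per second that can cross the link in one direction."""
    return link_gb_per_s * 1e9 / bytes_per_sample

def fft_memory_bytes(n, batch=1, bytes_per_sample=8, workspace_factor=1.0):
    """Rough GPU memory footprint for an out-of-place FFT.

    Counts one input buffer, one output buffer, and a temporary work area
    sized as workspace_factor times one buffer (an assumption, not a cuFFT
    guarantee).
    """
    buffer_bytes = n * batch * bytes_per_sample
    return buffer_bytes * (2 + workspace_factor)

rate = pcie_limited_sample_rate()
print(f"PCIe-limited input rate: {rate / 1e9:.4f} Gsamples/s")

footprint = fft_memory_bytes(2 ** 20, batch=64)
print(f"Estimated footprint for 64 FFTs of 2^20 points: {footprint / 2**20:.0f} MiB")
```

If the data arrives over PCIe, this sample rate is the ceiling for a streaming pipeline regardless of how fast the FFT itself runs on the device.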
Excluding plan creation from benchmarking data makes sense because most applications do more than a single FFT during their entire run time. Rather, they create a plan once and then run multiple FFTs with that plan. This usage model is really no different from what you would do with FFTW, for example. If you need fast plan creation make sure to use a fast host system (high single-thread performance).
Note that the Tesla P100, for which NVIDIA provides performance data at the above link, has very high memory bandwidth, about 10% higher than the Titan V. The GPU bandwidth numbers listed in specifications (720 GB/s for Tesla P100, 653 GB/s for Titan V) are theoretical numbers based on multiplying signalling speed by interface width; the same applies to bandwidth specifications you find on Intel’s website, for example. In practice you should be able to achieve about 80% of that.
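For a sense of scale, the sketch below applies the 80% rule of thumb to the two theoretical figures and turns the result into an upper bound on FFT rate. It assumes single-precision complex data (8 bytes per sample) and that each pass reads and writes the whole array once; the `passes` parameter and function names are illustrative, and real cuFFT performance will depend on the transform size and configuration.

```python
THEORETICAL_GB_PER_S = {"Tesla P100": 720.0, "Titan V": 653.0}
EFFICIENCY = 0.8  # rule-of-thumb fraction of theoretical bandwidth

def achievable_gb_per_s(theoretical_gb_per_s, efficiency=EFFICIENCY):
    """Bandwidth one can realistically expect, per the 80% rule of thumb."""
    return theoretical_gb_per_s * efficiency

def max_ffts_per_second(n, bandwidth_gb_per_s, bytes_per_sample=8, passes=1):
    """Bandwidth-bound upper limit on FFTs per second for data already on the device."""
    traffic_bytes = 2 * n * bytes_per_sample * passes  # read + write per pass
    return bandwidth_gb_per_s * 1e9 / traffic_bytes

n = 2 ** 20
for gpu, theoretical in THEORETICAL_GB_PER_S.items():
    bw = achievable_gb_per_s(theoretical)
    print(f"{gpu}: ~{bw:.0f} GB/s achievable, "
          f"at most {max_ffts_per_second(n, bw):.0f} FFTs/s of {n} points")
```

That works out to roughly 576 GB/s for the Tesla P100 and 522 GB/s for the Titan V. Either figure is far above the 12.5 GB/sec PCIe limit mentioned earlier in the thread, so for data streamed in from the host the link, not device memory, is the likely bottleneck.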