Realistic Throughput for cuFFT

Hello all,

I am having trouble selecting the appropriate GPU for my application, which is to take FFTs of streaming input data at high throughput. The marketing info for high-end GPUs claims >10 TFLOPS of compute performance and >600 GB/s of memory bandwidth, but what does a real streaming cuFFT workload look like? That is, how do these marketing numbers relate to real performance once you include overhead?

Thanks in advance.


I use CUDA cores for FFT. What is your input data rate (samples / second)? With this info I can probably offer some advice.

You don’t say what kind of FFTs you plan to perform. Generally speaking, FFTs on GPUs are bound by memory bandwidth (due to low computational intensity). The most recent FFT performance numbers reported by NVIDIA themselves that I could find are on slide 18 of this presentation, for a Tesla P100.

Thank you, HB9DRV. We have different use cases of interest, but as an example: can modern GPUs handle 50 GSPS at 16 bits per sample (i.e., 100 GB/s)?
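Just to make the units in that figure explicit, a quick sanity check in plain Python:

```python
# 50 GSPS at 16 bits/sample -> bytes per second
samples_per_sec = 50e9       # 50 GSPS
bytes_per_sample = 2         # 16 bits = 2 bytes
bytes_per_sec = samples_per_sec * bytes_per_sample
print(bytes_per_sec / 1e9)   # 100.0 (GB/s)
```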

Hello njuffa,

Slide 18 has many exclusions (see the list of asterisks). Input and output data are on the device, and plan creation is excluded. I intend to perform a single N-length 1D FFT on an incoming data stream. Since FFTs are memory-bandwidth limited, does that mean the Titan V can do FFTs at the full memory bandwidth rate of 652 GB/s?


Those are not exclusions. This is configuration data documented to aid in reproducibility. When you say “incoming data stream”, where is that data coming from exactly? If it is coming in over PCIe, you will be limited to 12.5 GB/sec (PCIe gen3 x16 link), assuming large transfer sizes. Keep in mind you will need sufficient GPU memory to store all the FFT data (input, output, FFT temporary storage).
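To put that PCIe limit in sample terms, here is a back-of-envelope sketch (assuming the 16-bit samples mentioned earlier in the thread):

```python
# How many samples/second fit through a PCIe gen3 x16 link?
pcie_bytes_per_sec = 12.5e9   # practical rate for large transfers
bytes_per_sample = 2          # 16-bit samples
max_samples_per_sec = pcie_bytes_per_sec / bytes_per_sample
print(max_samples_per_sec / 1e9)  # 6.25 (GSPS) -- far below a 50 GSPS target
```

In other words, if the stream arrives over PCIe, the bus, not the GPU, is the bottleneck.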

Excluding plan creation from benchmark data makes sense because most applications do more than a single FFT over their entire run time. Instead, they create a plan once and then run many FFTs with that plan. This usage model is really no different from what you would do with FFTW, for example. If you need fast plan creation, make sure to use a fast host system (one with high single-thread performance).
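The plan-once, execute-many model can be illustrated with a toy analogue in Python. This is not the cuFFT or FFTW API; a dense DFT matrix stands in for the one-time setup work a real plan does (twiddle tables, workspace, kernel selection):

```python
import numpy as np

class DFTPlan:
    """Toy analogue of a cuFFT/FFTW plan: pay setup cost once, reuse per call.

    Illustrative only -- real plans do not build a dense matrix.
    """
    def __init__(self, n):
        k = np.arange(n)
        # expensive one-time setup happens at plan creation
        self.matrix = np.exp(-2j * np.pi * np.outer(k, k) / n)

    def execute(self, x):
        # cheap per-call work reuses the precomputed setup
        return self.matrix @ x

plan = DFTPlan(64)              # create the plan once...
for _ in range(3):              # ...then execute many transforms with it
    x = np.random.rand(64)
    assert np.allclose(plan.execute(x), np.fft.fft(x))
```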

Note that the Tesla P100, for which performance data is provided by NVIDIA at the above link, has very high bandwidth, about 10% higher than the Titan V. The GPU bandwidth numbers listed in specifications (720 GB/s for the Tesla P100, 653 GB/s for the Titan V) are theoretical, based on multiplying signalling speed by interface width; the same applies to bandwidth specifications you find on Intel’s website, for example. In practice you should be able to achieve about 80% of that.
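One way to turn that ~80% figure into an FFT-rate estimate. The model below assumes each out-of-place single-precision complex FFT moves at least input plus output once (8 bytes per complex sample each way); real cuFFT kernels for large N may make multiple passes over memory, so treat this as an optimistic upper bound, not a measurement:

```python
def max_ffts_per_sec(n, bandwidth_bytes_per_sec):
    # minimum memory traffic: read N complex floats, write N complex floats
    bytes_moved = 2 * n * 8
    return bandwidth_bytes_per_sec / bytes_moved

# ~80% of the Titan V's 653 GB/s theoretical bandwidth
titan_v_achievable = 0.8 * 653e9
print(max_ffts_per_sec(2**20, titan_v_achievable))  # 1M-point FFTs per second
```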

Hello njuffa,

This is exactly the information I was looking for, thank you very much.