I tried to evaluate the performance of cuFFT on a K80 using the provided sample.
I slightly modified the sample so that I could time forward transforms. It reported 450 ms on average, which is not the number I expected.
Using the API from the sample, I wrote my own test with batch sizes of 2 or more (for example, the canonical problem cif512x512*2).
That test showed 3-5 ms across the different test cases, which is much more acceptable than the results for batch = 1.
I also used my test to re-evaluate the batch = 1 cases, but saw the same performance problem plus a second issue, incorrect results: the actual and expected results differed at half of the points (it looked as if the second GPU did not send its data).
Could you help me understand why the batch = 1 case is so slow and incorrect for all dimensions?
Thank you for the example!
Could you please point us to performance charts for single 1D/2D/3D FFTs on the K80?
Or at least give us an idea of the performance improvement to expect for such FFTs, K80 vs. K40?
I’m not sure a perf report for CUDA 7.5 is currently available. If the CUFFT data you’re looking for is not in the CUDA 6.5 or CUDA 7.0 perf reports, then it may not be available, or at least I’m not sure where to look for it.
Note that the memory throughput of the K80 is stated as a cumulative number. The K80 comprises two GPUs, each with its own attached memory delivering 240 GB/sec. Large FFTs are typically limited by memory bandwidth, and the K40 is specified at 288 GB/sec, so one would expect each half of a K80 to provide about 83% (240/288) of the FFT throughput of a K40. I am not aware of any publicly posted CUFFT performance data comparing the K40 and the K80.
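For what it's worth, the 83% estimate can be reproduced with a quick back-of-the-envelope calculation. The snippet below is only a sketch: it assumes the FFT is purely memory-bandwidth bound and takes the K40's spec-sheet figure of 288 GB/sec as the baseline. The function name `bandwidth_ratio` is made up for illustration, and measured cuFFT throughput may well differ from this ratio.

```python
# Rough estimate of relative FFT throughput from peak memory bandwidth.
# Assumption: the transform is memory-bandwidth bound, so throughput
# scales with bandwidth. Real cuFFT results may deviate from this.

K80_PER_GPU_GBPS = 240.0  # each of the two GPUs on a K80 (figure from this thread)
K40_GBPS = 288.0          # assumption: K40 spec-sheet peak bandwidth


def bandwidth_ratio(candidate_gbps, baseline_gbps):
    """Return candidate/baseline bandwidth as a crude throughput ratio."""
    return candidate_gbps / baseline_gbps


ratio = bandwidth_ratio(K80_PER_GPU_GBPS, K40_GBPS)
print(f"One K80 GPU vs. K40: {ratio:.0%}")  # prints "One K80 GPU vs. K40: 83%"
```

The same ratio applied to both halves of the card gives roughly 1.67x a K40 in aggregate, but only if the work is actually split across the two GPUs.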
Generally speaking, to fully utilize a K80, multi-GPU programming techniques have to be used. While it is a single card, the two GPUs on the board appear to CUDA as two devices.