I tried to evaluate the performance of cuFFT on a K80 using the provided sample.
I slightly modified the sample so that I could measure forward transforms. The sample reported 450 ms on average, which was not the number I expected.
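One way to check whether an average like this is dominated by one-time setup cost is to time the first call separately from the remaining ones. The sketch below is host-side only and makes several assumptions: `run_once` is a hypothetical stand-in for whatever performs one forward transform (including waiting for the device to finish, since GPU work is normally launched asynchronously), and the helper names `time_calls` and `summarize` are illustrative and not part of the sample.

```python
import time
from statistics import mean, median


def time_calls(run_once, iterations=20, warmup=1):
    """Call run_once() warmup + iterations times.

    Returns two lists of per-call wall-clock times in milliseconds:
    the warm-up calls and the measured calls.
    run_once is a hypothetical callable that performs one transform
    and blocks until it has completed.
    """
    samples_ms = []
    for _ in range(warmup + iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return samples_ms[:warmup], samples_ms[warmup:]


def summarize(samples_ms):
    """Return min, median, mean and max of a list of timings."""
    return {
        "min": min(samples_ms),
        "median": median(samples_ms),
        "mean": mean(samples_ms),
        "max": max(samples_ms),
    }
```

Comparing the warm-up time with the median of the measured calls shows whether a single slow first call is inflating the mean, or whether every call is slow.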
Using the API from the sample, I wrote my own test with test cases for batch >= 2 (for example, the canonical problem cif512x512*2).
The test showed 3-5 ms across the different test cases, which is more acceptable than the results for batch = 1.
I also used my test to re-evaluate the batch = 1 cases, but got the same performance problem plus one more issue, incorrect results: the actual and expected results differed in half of the points (it seemed that the second card didn’t send its data).
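To make the "half of the points" observation more precise, the mismatches can be counted per chunk of the output. This is a minimal sketch under stated assumptions: the output is treated as a flat sequence split into equal contiguous chunks (one per GPU), which may not match how the library actually distributes or orders the data, and `mismatch_report` with its tolerance `tol` is a hypothetical helper, not part of the sample.

```python
def mismatch_report(actual, expected, parts=2, tol=1e-3):
    """Count mismatching points in each of `parts` contiguous chunks.

    Returns a list of (mismatches, chunk_size) tuples, one per chunk.
    Assumes a contiguous split; the real per-GPU layout may differ.
    Works for real or complex values.
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    n = len(actual)
    counts = [0] * parts
    sizes = [0] * parts
    for i, (a, e) in enumerate(zip(actual, expected)):
        part = i * parts // n  # index of the chunk that holds point i
        sizes[part] += 1
        # Relative tolerance, with an absolute floor for values near zero.
        if abs(a - e) > tol * max(1.0, abs(e)):
            counts[part] += 1
    return list(zip(counts, sizes))
```

If all mismatches fall into a single chunk, that supports the idea that one device's portion is missing; mismatches spread evenly across chunks would point more toward an ordering or layout difference.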
Could you help me understand why the batch = 1 case is so slow and gives incorrect results for all dimensions?