I benchmarked an out-of-place complex 2D cuFFT and found that the TX1 outperforms the TX2 at the larger matrix sizes.

These are the numbers I got for matrix sizes NxN where N = 2^n.

n     TX2 GFLOPS    TX1 GFLOPS
7     21.2          14.33
8     57.3          47.5
9     53.1          67.27
10    63.5          131.6
11    70.77         212
12    76.1          124

Can anyone explain why the TX1 outperforms the TX2 for n > 8? To me this is puzzling.

I am running 1000 iterations of an out-of-place complex cuFFT. I made sure the GPU clock speeds were maximized, which should give the TX2 30% more performance.
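
In case it helps anyone sanity-check figures like these, here is a minimal sketch of how a GFLOPS number can be derived from a measured total time. It assumes the commonly used estimate of 5 * M * log2(M) floating-point operations for a complex FFT of M points (M = N * N here); that convention, the function name fft2d_gflops, and the example timing mentioned afterwards are assumptions for illustration, not details taken from the runs above.

```cpp
#include <cmath>

// Estimated GFLOPS for `iterations` complex 2D FFTs of size N x N (N = 2^n)
// that took `seconds` in total.
// Assumption: the conventional 5 * M * log2(M) flop estimate with M = N * N.
// This is a nominal figure used for comparisons, not an exact operation count.
double fft2d_gflops(int n, int iterations, double seconds) {
    const double points = std::ldexp(1.0, 2 * n);       // M = N * N = 2^(2n)
    const double log2_points = 2.0 * n;                  // log2(M) = 2n
    const double flops_per_fft = 5.0 * points * log2_points;
    return flops_per_fft * iterations / seconds / 1e9;   // flop/s -> GFLOPS
}
```

For example, with a hypothetical total of 1.0 s for 1000 iterations at n = 10, fft2d_gflops(10, 1000, 1.0) comes out at roughly 105 GFLOPS. If the timed region also includes plan creation or host-to-device copies, the resulting figure will be lower than the pure transform rate, so it is worth stating what was timed.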