My K40 performs only ~1.7 times better than a GTX 690 for double-precision complex FFTs. I expected closer to 6 times. Have I overlooked something?

Hello,

I have been using a GTX 690 and have recently acquired 4 Tesla K40 cards. Comparing the two cards, the 690 has more CUDA cores and a higher clock rate, so I expect it to perform single-precision (SP) calculations more quickly than a single K40. However, it is my understanding that for double precision (DP), the 690 delivers 1/24 of its SP throughput whilst the K40 delivers 1/3. Consequently, I would expect the K40 to comfortably outperform the 690 for DP.

The GTX 690 is quoted at 2 x 2810.88 = 5621.76 GFLOPS (FMA, single precision). This would naively correspond to 5621.76/24 = 234.24 GFLOPS in double precision.

The K40, meanwhile, is quoted at 1430 GFLOPS DP.

So the K40 should be roughly 6 times faster at DP (1430/234.24 ≈ 6.1).
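For anyone who wants to check my arithmetic, here is the estimate above as a small Python sketch. The constants are just the quoted figures from this post; the function and variable names are my own, and the result is a ratio of peak numbers, not a prediction of real FFT performance.

```python
# Peak-throughput estimate using only the figures quoted in this post.
# All names here are illustrative; nothing is measured on hardware.

GTX690_SP_GFLOPS_PER_GPU = 2810.88  # quoted SP FMA peak per GPU
GTX690_GPU_COUNT = 2                # the GTX 690 board carries two GPUs
GTX690_DP_TO_SP_RATIO = 1 / 24      # DP rate as a fraction of SP rate
K40_DP_GFLOPS = 1430.0              # quoted DP peak for one K40


def gtx690_dp_gflops():
    """Naive DP peak of the whole GTX 690 board."""
    sp_total = GTX690_GPU_COUNT * GTX690_SP_GFLOPS_PER_GPU
    return sp_total * GTX690_DP_TO_SP_RATIO


def expected_speedup():
    """Ratio of K40 DP peak to GTX 690 DP peak."""
    return K40_DP_GFLOPS / gtx690_dp_gflops()


if __name__ == "__main__":
    print(f"GTX 690 DP peak: {gtx690_dp_gflops():.2f} GFLOPS")
    print(f"Expected K40 / GTX 690 ratio: {expected_speedup():.2f}")
```

This only compares arithmetic peaks; it says nothing about memory bandwidth or how CUFFT behaves on either card.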

To assess the performance of my new K40 cards, I have written a simple test program that performs 1 million double-precision complex FFTs of 256x256 elements using CUFFT.
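To give a sense of scale for this test, here is a rough sizing sketch in Python. It assumes the usual benchmarking convention of 5 N log2(N) floating-point operations per complex FFT of N points, which is a nominal count and not necessarily what CUFFT actually executes. The names are my own, and the peak figure is the quoted K40 number from above.

```python
import math

# Rough sizing of the test workload described above.
# Assumption: nominal cost of 5 * N * log2(N) flop per complex FFT,
# a common benchmarking convention, not CUFFT's real operation count.

NX, NY = 256, 256           # transform dimensions from the post
NUM_TRANSFORMS = 1_000_000  # number of FFTs from the post
BYTES_PER_ELEMENT = 16      # double complex = two 8-byte doubles
K40_DP_GFLOPS = 1430.0      # quoted DP peak for one K40


def nominal_fft_flop(num_points):
    """Nominal flop count for one complex FFT of num_points points."""
    return 5 * num_points * math.log2(num_points)


points = NX * NY
flop_per_fft = nominal_fft_flop(points)
total_flop = flop_per_fft * NUM_TRANSFORMS
bytes_per_fft = points * BYTES_PER_ELEMENT

# Lower bound on run time if the card sustained its quoted DP peak.
ideal_seconds = total_flop / (K40_DP_GFLOPS * 1e9)

if __name__ == "__main__":
    print(f"Nominal flop per FFT: {flop_per_fft:.0f}")
    print(f"Data per FFT: {bytes_per_fft} bytes")
    print(f"Ideal time at K40 DP peak: {ideal_seconds:.2f} s")
```

Comparing a measured run time against `ideal_seconds` gives a rough idea of how far the test is from the arithmetic peak on each card.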

The GTX 690 is running with CUDA 6.5; the K40 is on CUDA 7.

Whilst a single K40 does outperform the GTX 690 for my test code, it is only by a factor of 1.7 to 1.8.

The CUFFT library in CUDA 7 offers some improvements with respect to CUDA 6.5, though perhaps this is not relevant to this case.

Nonetheless, I expected better results from the K40. My question is, do you find this to be a reasonable result?

I remember seeing that the Titan needed DP to be switched on in the settings to achieve good DP performance, but I assume this is not the case for the K40.