Tesla K40 vs GTX 690


My K40 performs ~1.7 times better than GTX 690 for double precision complex FFTs. I expected closer to 6 times. Have I overlooked something?

Hello,

I have been using a GTX 690 and have recently acquired 4 Tesla K40 cards. Comparing the two, the 690 has more CUDA cores and a higher clock rate, so I expect it to perform single-precision (SP) calculations more quickly than a single K40. However, it is my understanding that for double precision (DP) the 690 delivers 1/24 of its SP throughput, whilst the K40 delivers 1/3. Consequently, I would expect the K40 to comfortably outperform the 690 for DP.

The GTX690 is quoted to give 2x2810.88=5621.76 GFLOPS (FMA, single). This would naively correspond to 5621.76/24=234.24 GFLOPS double precision.

The K40, meanwhile, is quoted at 1430 GFLOPS DP.

So the K40 should be roughly a factor of 6 times faster at DP.
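For reference, the arithmetic behind that estimate can be written out as a short Python sketch. It uses only the peak figures quoted above; the variable names are just for illustration, and peak FLOPS are of course an upper bound rather than a prediction of FFT performance.

```python
# Peak figures as quoted in this post (GFLOPS).
GTX690_SP_GFLOPS = 2 * 2810.88   # both GPUs on the card, FMA, single precision
GTX690_DP_RATIO = 1 / 24         # DP rate relative to SP, as stated above
K40_DP_GFLOPS = 1430.0           # quoted DP peak for one K40

# Naive DP peak of the whole GTX 690 card.
gtx690_dp_gflops = GTX690_SP_GFLOPS * GTX690_DP_RATIO

# Ratio of quoted DP peaks: one K40 against the whole GTX 690.
expected_speedup = K40_DP_GFLOPS / gtx690_dp_gflops

print(f"GTX 690 DP peak (both GPUs): {gtx690_dp_gflops:.2f} GFLOPS")
print(f"Expected K40 / GTX 690 ratio: {expected_speedup:.2f}")
```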

To assess the performance of my new K40 cards I have written a simple code that performs 1 million double complex FFTs of 256x256 elements using CUFFT.

The GTX690 is running with CUDA 6.5, the K40 is on CUDA 7.

Whilst a single K40 does outperform the GTX690 for my test code, it is only by a factor of 1.7-1.8.

The CUFFT library in CUDA 7 offers some improvements with respect to CUDA 6.5, though perhaps this is not relevant to this case.

Nonetheless, I expected better results from the K40. My question is, do you find this to be a reasonable result?

I remember reading that the Titan needs to have DP switched on in its settings to achieve good DP performance, but I assume this is not the case for the K40.

In most cases, FFT computations are actually limited by memory throughput, not by FLOPS. In aggregate across its two GPUs, the GTX 690 provides significantly more bandwidth than the K40:

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-690/specifications
Memory Bandwidth (GB/sec) 384

http://www.nvidia.com/object/tesla-servers.html
Memory bandwidth (ECC off) 288 GB/sec

However, since the GTX 690 has, like many other consumer cards, anemic double precision throughput, this provides an alternate bottleneck when double-precision FFTs are used. So in this case the two limiting factors combine, resulting in the performance ratio you observe. Also note that the GTX 690 is a dual GPU card, while the K40 is a single GPU card. I forget whether recent versions of CUFFT can automatically split the large FFTs you are performing across both GPUs. Check the CUFFT documentation.
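To illustrate how the two bottlenecks interact, here is a rough roofline-style sketch in Python. Everything in it beyond the quoted peak numbers is an assumption: the conventional 5·N·log2(N) flop count for a complex FFT, two passes over the data per 2D transform (each reading and writing every element once), and an even split of the GTX 690's quoted figures between its two GPUs. It is far too crude to reproduce the measured ratio; it only shows why the ratio of peak DP FLOPS overstates the gap.

```python
import math

def fft2d_intensity(n, bytes_per_element=16, passes=2):
    """Flops per byte for an n x n complex FFT (assumed model, see above)."""
    points = n * n
    flops = 5 * points * math.log2(points)            # conventional FFT flop count
    bytes_moved = passes * 2 * points * bytes_per_element  # read + write per pass
    return flops / bytes_moved

def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * intensity)

intensity = fft2d_intensity(256)   # double complex = 16 bytes per element

# One GPU of the GTX 690: half of the card's quoted figures (assumption).
gtx690_one_gpu = attainable_gflops(2810.88 / 24, 384 / 2, intensity)
# One K40 with ECC off, using the quoted 288 GB/s.
k40 = attainable_gflops(1430.0, 288.0, intensity)

print(f"intensity: {intensity:.2f} flop/byte")
print(f"GTX 690 (one GPU) bound: {gtx690_one_gpu:.1f} GFLOPS")
print(f"K40 bound: {k40:.1f} GFLOPS")
print(f"model ratio: {k40 / gtx690_one_gpu:.2f}")
```

Under these assumptions the GTX 690 GPU hits its DP compute limit while the K40 hits its memory bandwidth limit, so the two cards are held back by different ceilings.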

To boost the K40’s FFT performance, you can try the following:

(1) Turn off ECC. This will boost memory bandwidth by around 10% - 12%. Whether this is an option depends on your use case, specifically on how tolerant the application is to random memory errors. The tool nvidia-smi allows setting the ECC state. A reboot may be required for the new setting to take effect.

(2) Set application clocks that are faster than the default clock rate. This will boost memory throughput as well as FLOPS. Many applications can run at the K40’s highest application clock. nvidia-smi can show you the available application clocks for a given device. If the faster clock causes the card to exceed its power or thermal thresholds, the clock will be throttled, which you want to avoid. I would suggest trying the fastest supported application clock first, then backing off to the next lower one if clock throttling is observed. For the K40, you would set the highest clock with something like this (adjust for device number if necessary):

nvidia-smi -i 0 -ac 3004,875
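The two steps above could be scripted along these lines. This is only a sketch: the helper names are made up for illustration, the query and ECC flags are the ones I know from nvidia-smi's help output, and you should check them against `nvidia-smi -h` for your driver version. Changing ECC or application clocks normally requires root. By default the script only prints the commands instead of running them.

```python
import subprocess

def tuning_commands(device=0, mem_mhz=3004, graphics_mhz=875):
    """Build nvidia-smi argument lists for the ECC and clock steps (hypothetical helper)."""
    dev = ["nvidia-smi", "-i", str(device)]
    return [
        dev + ["-q", "-d", "ECC"],                   # show current and pending ECC mode
        dev + ["-e", "0"],                           # request ECC off; may need a reboot
        dev + ["-q", "-d", "SUPPORTED_CLOCKS"],      # list supported memory/graphics clocks
        dev + ["-ac", f"{mem_mhz},{graphics_mhz}"],  # set application clocks (memory,graphics)
        dev + ["-q", "-d", "PERFORMANCE"],           # check whether clocks are being throttled
    ]

def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Dry run: print the commands so they can be reviewed before use.
    for cmd in tuning_commands():
        print(" ".join(cmd))
```

Call `run_all(tuning_commands())` only after reviewing the printed commands, and pass lower clock values if throttling shows up in the performance query.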

Hi,

Thanks for the response, it seems to explain the relative performances well.

I believe the CUFFT version included with CUDA 6.5 does not split the FFTs over both GPUs, as nvidia-smi reports one half to be idle for my simple test application. This means that the K40 gives 0.85 times the performance of the full GTX690 for double complex FFTs.

This certainly matches up with my experience with my FFT-heavy multi GPU application in which I see four K40 cards give ~3.2 times the performance of the whole of a GTX690 card.
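As a quick sanity check, the numbers above can be combined in a few lines of Python. The sketch assumes the lower end of my measured range (1.7) and that the GTX690 would scale perfectly across its two GPUs, which is an idealisation.

```python
# Measured: one K40 against the single GTX690 GPU that CUFFT actually used.
k40_vs_one_690_gpu = 1.7
gpus_on_gtx690 = 2

# One K40 against the whole GTX690, assuming the 690 scales perfectly over both GPUs.
k40_vs_full_690 = k40_vs_one_690_gpu / gpus_on_gtx690

# Ideal figure for four K40s, compared with what my multi-GPU application shows.
four_k40_ideal = 4 * k40_vs_full_690
four_k40_observed = 3.2
scaling_efficiency = four_k40_observed / four_k40_ideal

print(f"one K40 vs full GTX690: {k40_vs_full_690:.2f}")
print(f"four K40s, ideal: {four_k40_ideal:.1f}, observed: {four_k40_observed}")
print(f"implied multi-GPU efficiency: {scaling_efficiency:.2f}")
```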