Tesla K40 vs GTX 690

RichCUDA · July 10, 2015, 7:35am

My K40 performs ~1.7 times better than a GTX 690 for double precision complex FFTs. I expected closer to 6 times. Have I overlooked something?

Hello,

I have been using a GTX 690 and have recently acquired 4 Tesla K40 cards. Comparing the two cards, the 690 has more CUDA cores and a higher clock rate, so I expect it to perform single-precision calculations more quickly than a single K40. However, it is my understanding that for DP the 690 delivers 1/24 of its SP performance whilst the K40 delivers 1/3. Consequently, I would expect the K40 to comfortably outperform the 690 for DP.

The GTX690 is quoted to give 2x2810.88=5621.76 GFLOPS (FMA, single precision). This would naively correspond to 5621.76/24 = 234.24 GFLOPS double precision.

The K40, meanwhile, is quoted at 1430 GFLOPS DP.

So the K40 should be roughly a factor of 6 times faster at DP.
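As a quick check of the arithmetic above, here is a minimal sketch that uses only the peak figures quoted in this post. The variable names are mine, and peak FLOPS say nothing about what a real FFT workload achieves.

```python
# Peak-throughput figures quoted in the post, in GFLOPS.
gtx690_sp_gflops = 2 * 2810.88   # two GPUs on one card, single precision (FMA)
gtx690_dp_ratio = 1 / 24         # DP:SP ratio stated in the post for the 690
k40_dp_gflops = 1430.0           # quoted K40 double precision peak

# Naive double precision estimate for the whole GTX 690 card.
gtx690_dp_gflops = gtx690_sp_gflops * gtx690_dp_ratio

# Speedup expected if the workload were purely limited by DP FLOPS.
expected_speedup = k40_dp_gflops / gtx690_dp_gflops

print(f"GTX 690 DP estimate: {gtx690_dp_gflops:.2f} GFLOPS")
print(f"Expected K40 speedup: {expected_speedup:.2f}x")
```

This gives a ratio of a little over 6, which is where the "factor of 6" expectation comes from.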

To assess the performance of my new K40 cards I have written a simple code that performs 1 million double complex FFTs of 256x256 elements using CUFFT.

The GTX690 is running with CUDA 6.5, the K40 is on CUDA 7.

Whilst a single K40 does outperform the GTX690 for my test code, it is only by a factor of 1.7-1.8.

The CUFFT library in CUDA 7 offers some improvements with respect to CUDA 6.5, though perhaps this is not relevant to this case.

Nonetheless, I expected better results from the K40. My question is, do you find this to be a reasonable result?

I remember seeing that the Titan needs DP to be switched on in its settings to achieve good DP performance, but I assume this is not the case for the K40.

njuffa · July 10, 2015, 8:34am

In most cases, FFT computations are actually limited by memory throughput, not by FLOPS. The GTX 690 provides significantly more bandwidth than the K40:

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-690/specifications
Memory Bandwidth (GB/sec) 384

http://www.nvidia.com/object/tesla-servers.html
Memory bandwidth (ECC off) 288 GB/sec

However, since the GTX 690 has, like many other consumer cards, anemic double precision throughput, this provides an alternate bottleneck when double-precision FFTs are used. So in this case the two limiting factors combine, resulting in the performance ratio you observe. Also note that the GTX 690 is a dual GPU card, while the K40 is a single GPU card. I forget whether recent versions of CUFFT can automatically split the large FFTs you are performing across both GPUs. Check the CUFFT documentation.
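To make the bandwidth comparison concrete, here is a rough sketch using the two figures quoted above. It assumes the 384 GB/sec figure is the aggregate for both GPUs on the GTX 690 and that it splits evenly between them; that split, and the variable names, are assumptions of this sketch rather than something stated in the linked specifications.

```python
# Memory bandwidth figures quoted above, in GB/sec.
gtx690_card_bw = 384.0   # whole GTX 690 card (assumed: sum over both GPUs)
k40_bw = 288.0           # K40 with ECC off
gpus_on_gtx690 = 2       # the GTX 690 is a dual-GPU card

# Assumption: bandwidth is shared evenly between the two GPUs.
gtx690_gpu_bw = gtx690_card_bw / gpus_on_gtx690

# Ratios a purely bandwidth-bound FFT would see.
bw_ratio_vs_card = k40_bw / gtx690_card_bw   # K40 vs. both 690 GPUs in use
bw_ratio_vs_gpu = k40_bw / gtx690_gpu_bw     # K40 vs. a single 690 GPU

print(f"K40 vs. full GTX 690: {bw_ratio_vs_card:.2f}x")
print(f"K40 vs. one GTX 690 GPU: {bw_ratio_vs_gpu:.2f}x")
```

If the test only keeps one of the 690's GPUs busy, the bandwidth-bound ratio under these assumptions is 1.5, which is much closer to the observed 1.7-1.8 than the FLOPS-based estimate of about 6.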

To boost the K40’s FFT performance, you can try the following:

(1) Turn off ECC. This will boost memory bandwidth by around 10% - 12%. Whether this is an option depends on your use case, specifically on how tolerant the application is to random memory errors. The tool nvidia-smi allows setting the ECC state. A reboot may be required for the new setting to take effect.

(2) Set application clocks that are faster than the default clock rate. This will boost memory throughput as well as FLOPS. Many applications can run at the K40’s highest application clock. nvidia-smi can show you the available application clocks for a given device. If the faster clock leads to exceeding the power or thermal thresholds, this will cause clock throttling, which you would want to avoid. I would suggest trying the fastest supported application clock first, then backing off to the next lower one if clock throttling is observed. For the K40, you would set the highest clock with something like this (adjust for device number if necessary):

nvidia-smi -i 0 -ac 3004,875

RichCUDA · July 10, 2015, 9:04am

Hi,

Thanks for the response, it seems to explain the relative performances well.

I believe the CUFFT version included with CUDA 6.5 does not split the FFTs over both GPUs, as nvidia-smi reports one half to be idle for my simple test application. This means that the K40 gives 0.85 times the performance of the full GTX690 for double complex FFTs.

This certainly matches up with my experience with my FFT-heavy multi GPU application in which I see four K40 cards give ~3.2 times the performance of the whole of a GTX690 card.
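For reference, the numbers in this post work out as follows. This is a small sketch that takes the lower measured ratio (1.7) and assumes the GTX 690 would scale perfectly across its two GPUs; the variable names are illustrative.

```python
# Measured: one K40 vs. one GPU of the GTX 690 (the other half sat idle).
k40_vs_one_690_gpu = 1.7

# Assumption: using both 690 GPUs would double the 690's throughput.
k40_vs_full_690 = k40_vs_one_690_gpu / 2

# Ideal throughput of four K40 cards relative to a full GTX 690.
ideal_four_k40 = 4 * k40_vs_full_690

# Observed in the multi-GPU application.
observed_four_k40 = 3.2

# Fraction of the ideal four-card figure that is actually achieved.
scaling_efficiency = observed_four_k40 / ideal_four_k40

print(f"One K40 vs. full GTX 690: {k40_vs_full_690:.2f}x")
print(f"Ideal for four K40s: {ideal_four_k40:.2f}x")
print(f"Scaling efficiency: {scaling_efficiency:.0%}")
```

Under these assumptions the observed ~3.2x sits a little below the ideal of 3.4x, so the two measurements are consistent with each other.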
