Poor SP cuFFT performance on the Tesla K10

Has anyone else noticed poor single precision cuFFT performance on the new Tesla K10 cards? The single precision floating point performance of this card should be quite high, so it surprised me that, both in my own benchmarks and in the SHOC benchmark, it was actually slower than the consumer GTX580 and only comparable to the M2090.

Tesla K10
SP C2C FFT at 2^22 points : 2.18 msec
SP C2C iFFT at 2^22 points : 2.26 msec

M2090
SP C2C FFT at 2^22 points : 2.08 msec
SP C2C iFFT at 2^22 points : 2.11 msec

GTX580
SP C2C FFT at 2^22 points : 1.32 msec
SP C2C iFFT at 2^22 points : 1.34 msec

The results are averaged over 10,000 trials.

I would like to consider using these cards for scientific analysis, but the single precision performance (we don’t need double precision) is quite limiting. Our analysis is FFT bound, so any improvement here would be a direct gain. Is this just a case of the cuFFT libraries not being optimized yet for these cards?

Are you using CUDA 5.0?

Best I know, FFTs (in particular large ones as used here) are memory bound. You may want to compare the effective memory bandwidth of these three platforms.

Note that use of ECC for GPU memory (enabled by default for the professional cards M2090 and K10, not supported on the consumer card GTX 580) leads to a slight reduction in the memory bandwidth available to user applications. You will likely observe an effective bandwidth of 80%-85% of the theoretical bandwidth when ECC is enabled.

The Tesla K10 test machine is running CUDA 5.0, but the M2090 and GTX580 machines are still running CUDA 4.1. We have a symmetric cluster where each node has a GPU so we will likely stay with CUDA 4.1 until there are some fairly significant advantages.

I’ve found some references for the memory bandwidth.

The GTX580 is listed at 192 GB/s.
http://www.nvidia.com/object/tesla-servers.html

The M2090 and K10 are listed at 177 GB/s and 160 GB/s (per GPU) respectively.
http://www.nvidia.com/object/tesla-servers.html

Considering the 20% hit in bandwidth due to ECC being enabled, the memory bandwidth does indeed scale very closely with the timing results. Thank you for pointing me in the right direction. It is unfortunate that the best performance is from the consumer cards. It would be nice if there were an option with similar or better performance that had the ECC and support of the professional cards.
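As a rough check of that scaling, the sketch below multiplies each measured forward-FFT time by an effective bandwidth. The timings and spec-sheet bandwidths are the ones quoted in this thread; the flat 80% ECC factor and the helper names are assumptions made for illustration. If the FFT were purely bandwidth bound, the product (implied memory traffic per FFT) would be about the same on all three cards.

```python
# Spec-sheet bandwidths (GB/s) and measured forward-FFT times (ms) from this thread.
cards = {
    "GTX580": {"peak_gbs": 192.0, "ecc": False, "fft_ms": 1.32},
    "M2090":  {"peak_gbs": 177.0, "ecc": True,  "fft_ms": 2.08},
    "K10":    {"peak_gbs": 160.0, "ecc": True,  "fft_ms": 2.18},
}

ECC_FACTOR = 0.80  # assumption: ECC leaves about 80% of peak bandwidth usable


def effective_bandwidth(card):
    """Peak bandwidth in GB/s, derated by ECC_FACTOR when ECC is enabled."""
    return card["peak_gbs"] * (ECC_FACTOR if card["ecc"] else 1.0)


def implied_traffic_mb(card):
    """Time x bandwidth: MB moved per FFT if the run were purely bandwidth bound."""
    return card["fft_ms"] * effective_bandwidth(card)  # ms * GB/s = MB


for name, card in cards.items():
    print(f"{name}: {effective_bandwidth(card):6.1f} GB/s effective, "
          f"{implied_traffic_mb(card):6.1f} MB implied traffic per FFT")
```

Under these assumptions the three products come out within roughly 15-20% of each other, which is much closer than the raw timings, so the bandwidth explanation looks plausible even though the match is not exact.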

Is there any word on what the memory bandwidth of the K20 cards is, or better yet is there a place where cuFFT is benchmarked for it?

Keep in mind that the K10 comprises two GPUs, each coupled to 4GB of memory. So as long as your app is multi-GPU capable (always a good idea), the aggregate bandwidth, and therefore CUFFT performance, should double compared to your current measurements.

For different GPUs, different tradeoffs are being made depending on target application area. Tesla products are designed to deliver high performance for continuous HPC workloads with excellent reliability, within specific power envelopes. ECC support is a big part of that reliability which is crucial in cluster environments that may contain hundreds or thousands of GPU-accelerated nodes. Users may turn off ECC with the nvidia-smi utility.

Detailed specifications for the K20 are not yet available on NVIDIA’s website. I would expect that information to appear at http://www.nvidia.com/object/tesla-servers.html once the product becomes generally available.

Note that K20 requires CUDA 5.0 or higher; I believe K10 requires CUDA 4.2 or higher. I do not use CUFFT myself, so I cannot say anything about possible performance differences between CUDA 4.1 and CUDA 5.0. I would suggest doing a quick performance comparison yourself, using the specific cases you plan to run in production. In general, lengths that are a power of two will give the best performance, and you are already using that.

As of this morning, http://www.nvidia.com/object/tesla-servers.html has been updated with data for the K20 and K20X modules.

Hi,
There is a performance penalty in cufft in CUDA 5.0 (the release version).
Try to either downgrade to CUDA 4.1 or the PRE-release version of CUDA 5.0 and run
again on the K10. We’ve noticed a drop of ~20-25% in the final version of CUDA 5.0.
This has been reported and confirmed by nvidia.

Actually the issue is only in the cufft lib, so you could actually install CUDA 5.0
release version and just take the cufft.so from the PRE-release and override the one
from the release version.

eyal

If I am looking at the correct bug report and reading it correctly, this performance regression between the CUFFT from CUDA 4.2 versus CUDA 5.0 only applies to certain combinations of platform, lengths, etc., according to your experiments. Thanks for pointing this out, it is good to be aware of that.

Since my memory is not always super accurate, I did a quick experiment to measure effective bandwidth with ECC. With ECC, usable bandwidth seems to be around 75-80% of theoretical, not 80-85% as I said above. Here is a worked example using an M2050. The memory clock reported by nvidia-smi -q for this module is 1546 MHz, thus theoretical bandwidth is 148.416e9 bytes/second. Running STREAM with arrays of 10 million 8-byte elements, on a system running CUDA 5.0 I see:

Function   Rate (bytes/s) 
Copy:      117.734e9     79.3% of theoretical
Scale:     117.487e9     79.2% of theoretical
Add:       111.736e9     75.3% of theoretical 
Triad:     111.736e9     75.3% of theoretical
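The arithmetic behind those figures can be sketched as follows. The 1546 MHz clock and the STREAM rates are taken from the post; the 384-bit bus width and the factor of two transfers per reported clock are assumptions chosen because they reproduce the 148.416e9 bytes/second figure, and the function name is illustrative.

```python
MEM_CLOCK_HZ = 1546e6     # memory clock reported by nvidia-smi -q (from the post)
BUS_WIDTH_BITS = 384      # assumption: width of the M2050 memory interface
TRANSFERS_PER_CLOCK = 2   # assumption: two transfers per reported clock cycle


def theoretical_bandwidth(clock_hz, bus_width_bits, transfers_per_clock):
    """Peak memory bandwidth in bytes/second."""
    return clock_hz * transfers_per_clock * bus_width_bits / 8


# STREAM rates in bytes/second, as measured in the post with ECC enabled.
stream_rates = {
    "Copy": 117.734e9,
    "Scale": 117.487e9,
    "Add": 111.736e9,
    "Triad": 111.736e9,
}

peak = theoretical_bandwidth(MEM_CLOCK_HZ, BUS_WIDTH_BITS, TRANSFERS_PER_CLOCK)
percent_of_peak = {k: 100.0 * v / peak for k, v in stream_rates.items()}

print(f"theoretical: {peak:.3e} bytes/s")
for kernel, pct in percent_of_peak.items():
    print(f"{kernel:6s} {pct:5.1f}% of theoretical")
```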

Thanks for all the suggestions. Unfortunately, I was not able to get any better performance by using a different cuFFT library. In fact it was only worse. I tried cuFFT version 5.0.7 and 4.1.28. I also have timing for when ECC is turned off.

These are the results for the SP iFFT at 2^22 points, averaged over 10 runs of 10,000 successive calls on one GPU of the Tesla K10:

Library Version : Execution Time

5.0.35 (ECC off) : 1.92 ms
5.0.35 (current) : 2.26 ms
5.0.7 (pre-release) : 2.41 ms
4.1.28 : 2.43 ms
4.0.7 : 5.72 ms

The performance difference you observe between running with ECC on and ECC off looks about right to me, and should be proportional to the bandwidth difference. I have been running exclusively on ECC-enabled Tesla platforms for years, but when ECC was a new feature I measured the bandwidth difference between ECC on and off and it was on the order of 15%.
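To put a number on that, here is a minimal sketch using the two 5.0.35 timings reported above. It assumes that throughput is inversely proportional to execution time; the function name is illustrative.

```python
MS_ECC_ON = 2.26   # iFFT time with ECC enabled (cuFFT 5.0.35, from the post)
MS_ECC_OFF = 1.92  # iFFT time with ECC disabled (cuFFT 5.0.35, from the post)


def ecc_throughput_loss(ms_ecc_on, ms_ecc_off):
    """Fraction of throughput lost when ECC is on, taking throughput as 1/time."""
    return 1.0 - ms_ecc_off / ms_ecc_on


loss = ecc_throughput_loss(MS_ECC_ON, MS_ECC_OFF)
print(f"throughput loss with ECC on: {100.0 * loss:.1f}%")
```

With these timings the loss works out to about 15%, in line with the bandwidth difference quoted above.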

K10 is designed to provide maximum single-precision throughput at comparatively low power, i.e. the target is power efficiency. To utilize this potential fully, one really would want to use both GPUs inside the module. If that is not possible, the K10 may not be the best fit, and some other Tesla device could be a better choice.

Based on your data, utilizing both GPUs of the K10 will double the FFT throughput from 520 FFTs/sec to 1040 FFTs/sec, compared to 757 FFTs/sec on the GTX580. Either number should be significantly higher than what could be achieved on the CPU, or is that not so (I am not familiar with FFT performance data in general)? Of course the use of both GPUs of the K10 does not improve the latency of each FFT. Does your application have specific latency requirements?
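Those throughput numbers follow directly from the per-FFT times reported earlier (1.92 ms on one K10 GPU with ECC off, 1.32 ms on the GTX580). The sketch below assumes back-to-back FFTs and perfect linear scaling across the two K10 GPUs, which is an idealization; the function name is illustrative.

```python
def ffts_per_second(ms_per_fft, gpus=1):
    """FFTs per second for back-to-back transforms.

    Assumes every GPU sustains the same rate (perfect linear scaling).
    """
    return gpus * 1000.0 / ms_per_fft


k10_one_gpu = ffts_per_second(1.92)           # one K10 GPU, ECC off
k10_two_gpus = ffts_per_second(1.92, gpus=2)  # both K10 GPUs, idealized
gtx580 = ffts_per_second(1.32)                # GTX580

print(f"K10, one GPU:  {k10_one_gpu:7.1f} FFTs/sec")
print(f"K10, two GPUs: {k10_two_gpus:7.1f} FFTs/sec")
print(f"GTX580:        {gtx580:7.1f} FFTs/sec")
```

In practice, transfers over the shared PCIe link and host-side scheduling may keep the two-GPU figure below the ideal doubling.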

It is significantly faster than what can be achieved on a CPU, at least for long FFTs. Short FFTs that can fit within the CPU cache are more competitive. Our application does not have latency requirements, so that is not really an issue, and we can concentrate on throughput. We also have a very parallel workflow and generally run on large clusters, so taking advantage of multiple GPUs simultaneously is not an issue. We will definitely be considering the ratio (ability to do analysis) / (total ownership cost).