Double precision performance

I’m just about to get started with CUDA and am looking at which graphics card to buy. My first thought was the GeForce range (lots of cores for cheap!), but then I found that double precision is throttled in this series, and also in some other models. So before I make my decision, I just want to make sure I’ve got everything right. AFAIK this is the DP performance as a fraction of SP performance:

Tesla series: 1/2 (full performance)
Quadro 4000-6000: 1/2 (full performance)
Quadro 600-2000: 1/12
GTX 5xx: 1/8

It has been remarkably hard to find these figures (or I’ve been looking in the wrong places…)!
If anyone could confirm them, it would be greatly appreciated!


Unless you know you are limited by the double precision performance of a single GPU, I’d recommend buying a consumer card first to familiarize yourself with CUDA and to find out your specific needs. There are remarkably few problems that are actually limited by double precision throughput, because double precision data also needs twice the memory bandwidth, so many kernels run into the bandwidth limit before the arithmetic limit. And even then, the double precision performance per $ is still better for the consumer cards.
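To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch in Python. The peak and bandwidth figures are my own approximate assumptions for a GTX 580-class card (they are not from this thread), and `axpy_times_ms` is just a hypothetical helper name. It estimates how long a simple streaming update `y[i] = a * x[i] + y[i]` would spend on arithmetic versus on memory traffic, in single and in double precision.

```python
# Assumed, approximate figures for a GTX 580-class card (illustrative only).
SP_PEAK_GFLOPS = 1581.0
DP_PEAK_GFLOPS = SP_PEAK_GFLOPS / 8   # 1/8 ratio quoted for GTX 5xx
BANDWIDTH_GBS = 192.4


def axpy_times_ms(n, bytes_per_value, peak_gflops):
    """Return (compute_ms, memory_ms) for y[i] = a * x[i] + y[i] over n elements."""
    flops = 2 * n                        # one multiply and one add per element
    traffic = 3 * n * bytes_per_value    # read x, read y, write y
    compute_ms = flops / (peak_gflops * 1e9) * 1e3
    memory_ms = traffic / (BANDWIDTH_GBS * 1e9) * 1e3
    return compute_ms, memory_ms


n = 100_000_000
sp_compute, sp_memory = axpy_times_ms(n, 4, SP_PEAK_GFLOPS)
dp_compute, dp_memory = axpy_times_ms(n, 8, DP_PEAK_GFLOPS)

print(f"SP: compute {sp_compute:.2f} ms, memory {sp_memory:.2f} ms")
print(f"DP: compute {dp_compute:.2f} ms, memory {dp_memory:.2f} ms")
```

Under these assumptions the memory time is far larger than the compute time in both cases, even with DP arithmetic throttled to 1/8. The DP version takes about twice as long simply because it moves twice as many bytes.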

Ultimately I will use CUDA for scientific computing, so at some point double precision will be needed. But for my purposes right now, I’m only interested in learning CUDA, so I’m definitely leaning towards a consumer card. I guess for single precision performance, the GTX 580 will be similar to the Quadro 6000?

Actually the GTX 580 achieves about 50% more single precision GFLOP/s and 33% higher bandwidth than the Quadro 6000. It has only a quarter of the memory though. It’s definitely more than enough for learning CUDA.
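For anyone who wants to check those ratios, here is a small sketch. The core counts, shader clocks, bandwidths and memory sizes are the published specs as I remember them, so treat them as assumptions and check the spec sheets. `peak_sp_gflops` is a hypothetical helper that counts one fused multiply-add (2 flops) per core per shader clock.

```python
def peak_sp_gflops(cuda_cores, shader_clock_ghz):
    """Theoretical single precision peak: 2 flops (one FMA) per core per clock."""
    return cuda_cores * shader_clock_ghz * 2


# Assumed specs, recalled from public spec sheets (not taken from this thread).
cards = {
    "GTX 580": {"cores": 512, "clock_ghz": 1.544,
                "bandwidth_gbs": 192.4, "memory_gb": 1.5},
    "Quadro 6000": {"cores": 448, "clock_ghz": 1.148,
                    "bandwidth_gbs": 144.0, "memory_gb": 6.0},
}

gtx, quadro = cards["GTX 580"], cards["Quadro 6000"]

flops_ratio = (peak_sp_gflops(gtx["cores"], gtx["clock_ghz"])
               / peak_sp_gflops(quadro["cores"], quadro["clock_ghz"]))
bandwidth_ratio = gtx["bandwidth_gbs"] / quadro["bandwidth_gbs"]
memory_ratio = gtx["memory_gb"] / quadro["memory_gb"]

print(f"SP GFLOP/s ratio: {flops_ratio:.2f}")
print(f"Bandwidth ratio:  {bandwidth_ratio:.2f}")
print(f"Memory ratio:     {memory_ratio:.2f}")
```

With these assumed specs the ratios come out at roughly 1.5 for GFLOP/s, 1.3 for bandwidth and 0.25 for memory, which matches the figures above.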

Yeah, the memory won’t be a problem. Is memory bandwidth throttled as well for DP on the GTX series?

No, you get the full memory bandwidth (33% more than the Quadro 6000). So double precision calculations that are limited by memory bandwidth are actually faster on the GTX 580.
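A simple roofline-style estimate shows where each card would come out ahead. This is only a sketch: the peak and bandwidth numbers are approximate assumptions on my part, and `attainable_dp_gflops` and `faster_card` are hypothetical helper names. Arithmetic intensity here means double precision flops performed per byte moved to or from device memory.

```python
def attainable_dp_gflops(dp_peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: the lower of the arithmetic limit and the bandwidth limit."""
    return min(dp_peak_gflops, bandwidth_gbs * flops_per_byte)


# Assumed, approximate figures (illustrative only).
CARDS = {
    "GTX 580": {"dp_peak": 1581.0 / 8, "bandwidth": 192.4},      # DP at 1/8 of SP
    "Quadro 6000": {"dp_peak": 1029.0 / 2, "bandwidth": 144.0},  # DP at 1/2 of SP
}


def faster_card(flops_per_byte):
    """Name of the card with the higher estimated DP throughput at this intensity."""
    return max(
        CARDS,
        key=lambda name: attainable_dp_gflops(
            CARDS[name]["dp_peak"], CARDS[name]["bandwidth"], flops_per_byte
        ),
    )


for intensity in (0.1, 1.0, 10.0):
    print(f"{intensity:5.1f} flop/byte -> {faster_card(intensity)}")
```

Under these assumptions the GTX 580 wins at low arithmetic intensity, where bandwidth is the limit, and the Quadro 6000 only pulls ahead once a kernel does enough double precision work per byte to reach the GTX 580's lower DP peak.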

I don’t think it would even be technically possible to throttle memory bandwidth depending on single or double precision, as the memory controller has no information about what the data is used for.