Double precision: GTX 465, GTX 480 and C2050

Hi !

I have some questions about the hardware. I would like to run computations in double precision, so I need some information about the DP support on various devices.

The GTX 465 has 11 MPs, but how many of them are DP-capable?
Same question for the GTX 480: among its 15 MPs, how many are DP-capable?

If I understand the numbers correctly (http://en.wikipedia.org/wiki/Comparison_of_NVIDIA_Graphics_Processing_Units#GeForce_400_Series), the GTX 480 is 1.5x faster for memory reads/writes and 1.6x faster for computations. It also has more global memory. Is this all correct?

Looking at the C2050, I find that it has a lower number of MPs and I suppose that all of them are DP-capable. Am I right? The card has a lower memory bandwidth and is also slower at doing computations than the GTX 480.

So, if I don’t need 3 GB of memory and if all 15 MPs of the GTX 480 are DP-capable, the GTX 480 would be the better choice. Or am I missing something? This seems weird, as the GTX 480 looks more powerful than the “computing card” C2050.

Many thanks in advance for your answers.

All of them, AFAIK.

The C2050 is marginally slower in single precision, but up to four times faster in double precision than the consumer versions. The consumer GF100 cards have a limit on the rate of double precision instruction issue and retirement, which reduces their DP throughput compared to the Fermi Tesla cards. For the consumer cards, the SP:DP performance ratio is nominally 8:1; for the Tesla it is 2:1.
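
To put rough numbers on those ratios, here is a minimal sketch of the peak arithmetic. The core counts and shader clocks are assumptions taken from public spec sheets, not from this thread, and the result is a theoretical peak only, not a measured rate.

```python
# Back-of-the-envelope peak throughput for the three cards discussed here.
# Core counts and shader clocks are ASSUMED from public spec sheets;
# the SP:DP ratios are the nominal 8:1 and 2:1 mentioned above.
CARDS = {
    # name: (CUDA cores, shader clock in GHz, nominal SP:DP ratio)
    "GTX 465": (352, 1.215, 8),
    "GTX 480": (480, 1.401, 8),
    "C2050": (448, 1.150, 2),
}


def peak_gflops(cores, clock_ghz, sp_dp_ratio):
    """Return (SP, DP) peak in Gflop/s, counting one FMA per core per clock as 2 flops."""
    sp = cores * clock_ghz * 2
    return sp, sp / sp_dp_ratio


for name, spec in CARDS.items():
    sp, dp = peak_gflops(*spec)
    print(f"{name}: SP {sp:.0f} Gflop/s, DP {dp:.0f} Gflop/s")
```

Under these assumptions the DP advantage of the C2050 over a consumer card is the 4x from the ratio, scaled by the differences in core count and clock.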

OK. So going from a GTX 465 to a C2050 (which has 14 MPs), I should see a boost of about 4*(14/11) ≈ 5.1 in double-precision computing power.

However, as the memory bandwidth is only ~1.4 times higher, I will only see this boost if my code is not memory bound, which as far as I know is very unlikely.

Is this correct ?

It might be, but that would depend on whether your code is really memory bandwidth bound on the GTX 465 or not. Your current card should do about 106 Gflop/s double precision peak, and has 102 GB/s peak memory bandwidth. My experience with the older GT200 (78 Gflop/s double and 100 GB/s main memory bandwidth) was that it was usually compute bound in double precision for the sort of stuff I do in CUDA. The only way to be sure is to do some benchmarking.
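
As a rough illustration of that comparison, here is a minimal roofline-style sketch using the GTX 465 peak figures quoted above. The two kernel intensities are hypothetical examples, and real kernels rarely reach either peak, so treat the output as a nominal bound rather than a prediction.

```python
def attainable_gflops(intensity, peak_gflops, peak_gbs):
    """Roofline-style upper bound; intensity is flops per byte of device memory traffic."""
    return min(peak_gflops, intensity * peak_gbs)


def limiter(intensity, peak_gflops, peak_gbs):
    """Name the nominal limit for a kernel with the given flop/byte intensity."""
    balance = peak_gflops / peak_gbs  # flops the card can perform per byte moved
    return "compute" if intensity >= balance else "memory bandwidth"


# Peak double-precision figures quoted above for the GTX 465.
GTX465_DP_GFLOPS = 106.0
GTX465_GBS = 102.0

# Hypothetical kernels (assumptions for illustration only):
# a DAXPY-like update does 2 flops per 24 bytes (read x, read y, write y);
# the second intensity is simply a made-up compute-heavy value.
KERNELS = [("daxpy-like", 2 / 24), ("compute-heavy (made up)", 4.0)]

for name, intensity in KERNELS:
    bound = attainable_gflops(intensity, GTX465_DP_GFLOPS, GTX465_GBS)
    kind = limiter(intensity, GTX465_DP_GFLOPS, GTX465_GBS)
    print(f"{name}: at most {bound:.1f} Gflop/s, limited by {kind}")
```

With these peaks the card's balance point is close to 1 flop per byte, so a kernel doing much less arithmetic per byte than that is nominally bandwidth limited, and one doing much more is nominally compute limited.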

Ok. And how can I benchmark this ?

The “traditional” way would be to use the profiler to identify where your code spends most of its time, and analyze the operation count and memory transaction count of that code. Combined with timings, that should tell you whether the code is memory bandwidth bound or compute bound.

A novel alternative someone on these boards suggested involves timing your code while independently underclocking the memory and core clock of your card. The slowdown versus memory clock and core clock relationships should show where the throughput limit of the code lies.
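
The underclocking idea can be reduced to a small calculation once you have the timings. This is only a sketch: the function names, the 0.2 margin, and the timings below are made up for illustration, and how you actually change the clocks depends on your driver and tools.

```python
def clock_sensitivity(t_base, t_slow, clock_base, clock_slow):
    """Fraction of the ideal slowdown actually observed when one clock is lowered.

    1.0 means runtime scaled inversely with that clock (fully bound by it);
    0.0 means runtime did not change at all.
    """
    ideal = clock_base / clock_slow - 1.0
    observed = t_slow / t_base - 1.0
    return observed / ideal


def classify(core_sens, mem_sens, margin=0.2):
    """Crude verdict from the two sensitivities; the margin is an arbitrary choice."""
    if core_sens - mem_sens > margin:
        return "compute bound"
    if mem_sens - core_sens > margin:
        return "memory bandwidth bound"
    return "mixed or latency bound"


# Hypothetical timings in ms (NOT measurements): baseline, then the core clock
# at 80% with memory unchanged, then the memory clock at 80% with core unchanged.
t_base, t_core_slow, t_mem_slow = 10.0, 12.3, 10.2

core = clock_sensitivity(t_base, t_core_slow, 1.0, 0.8)
mem = clock_sensitivity(t_base, t_mem_slow, 1.0, 0.8)
print(f"core sensitivity {core:.2f}, memory sensitivity {mem:.2f}: {classify(core, mem)}")
```

Timing at several clock steps rather than just two would make the trend easier to trust, since a single pair of measurements can be dominated by noise.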

Mmmmh. So on Linux, you cannot use the “traditional” way. Let’s see if I can try the other one.

I am not sure I understand. I use Linux, and the “traditional” way is how I do it too. Just run the CUDA Visual Profiler to get timing and profiling data for your application.

Damn. You are right. I was thinking about the new Visual C++ tool and not about the visual profiler.
