Double precision: GTX 465, GTX 480 and C2050

Magorath · September 9, 2010, 7:14am

Hi !

I have some questions about the hardware. I would like to run computations in double precision, so I need some information about the DP support on various devices.

The GTX 465 has got 11 MPs, but how many of them are DP-capable ?
Same question for the GTX 480. Among the 15 MPs, how many are DP-capable ?

If I understand the numbers correctly (List of Nvidia graphics processing units - Wikipedia), the GTX 480 is 1.5x faster than the GTX 465 for memory reads/writes and 1.6x faster for computations. It also has more global memory. Is this all correct ?

Looking at the C2050, I find that it has a lower number of MPs, and I suppose that all of them are DP-capable. Am I right ? The card has a lower memory bandwidth and is also slower at doing computations than the GTX 480.

So, if I don’t need 3GB of memory and if all 15 of the GTX 480’s MPs are DP-capable, the GTX 480 would be the better choice. Or am I missing something ? This seems odd, as the GTX 480 looks more powerful than the “computing card” C2050.

Many thanks in advance for your answers.

avidday · September 9, 2010, 8:10am

All of them, AFAIK.

The C2050 is marginally slower in single precision, but anything up to four times faster in double precision than the consumer versions. The consumer GF100 cards have a limit on the rate of double precision instruction issue and retirement which reduces their DP throughput compared to the Fermi Tesla cards. For the consumer cards, the SP:DP performance ratio is nominally 8:1; for the Tesla it is 2:1.
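
For reference, the ratios quoted above can be turned into rough peak numbers. The sketch below is only an estimate: the 32 cores per MP, the C2050's MP count and the shader clocks are assumptions taken from public spec sheets rather than from this thread, and the 8:1 and 2:1 SP:DP ratios are the nominal ones mentioned above.

```python
# Rough peak-throughput estimate for Fermi (GF100) cards.
# Assumed (not from this thread): 32 CUDA cores per MP, the C2050's
# MP count, and the shader clocks below, all from public spec sheets.
CORES_PER_MP = 32

CARDS = {
    # name: (MP count, shader clock in GHz, nominal SP:DP ratio)
    "GTX 465": (11, 1.215, 8),
    "GTX 480": (15, 1.401, 8),
    "C2050": (14, 1.150, 2),
}


def peak_gflops(mps, clock_ghz, sp_dp_ratio):
    """Return (SP, DP) peak in Gflop/s, counting one FMA as 2 flops."""
    sp = mps * CORES_PER_MP * clock_ghz * 2.0
    return sp, sp / sp_dp_ratio


for name, spec in CARDS.items():
    sp, dp = peak_gflops(*spec)
    print(f"{name}: SP ~{sp:.0f} Gflop/s, DP ~{dp:.0f} Gflop/s")
```

Under these assumptions the C2050 comes out at roughly 3x the GTX 480 and a little under 5x the GTX 465 in double precision peak. Real kernels will usually see less.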

Magorath · September 9, 2010, 8:19am

Ok. So going from GTX 465 to C2050 I should experience a boost of 4*(15/11) = 5.4 in computing power.

However, as the memory bandwidth is only ~1.4 times higher, I will not see this boost unless my code is not memory bound, which as far as I know is very unlikely.

Is this correct ?
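
One way to frame this estimate is as two bounds: a purely compute-bound kernel would scale with the compute ratio, a purely bandwidth-bound one with the bandwidth ratio, and real code should land somewhere in between. A minimal sketch using only the figures from this post (the function name is made up for illustration, and the model ignores caches and latency effects):

```python
def speedup_bounds(compute_ratio, bandwidth_ratio):
    """Return (low, high) bounds on the expected speedup.

    A kernel limited purely by memory bandwidth scales with
    bandwidth_ratio; one limited purely by arithmetic scales with
    compute_ratio. Mixed kernels fall between the two.
    """
    return (min(compute_ratio, bandwidth_ratio),
            max(compute_ratio, bandwidth_ratio))


# Figures from the post: 4x DP rate times the 15/11 MP ratio,
# and ~1.4x memory bandwidth.
compute_ratio = 4 * (15 / 11)
low, high = speedup_bounds(compute_ratio, 1.4)
print(f"expected speedup between {low:.2f}x and {high:.2f}x")
```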

avidday · September 9, 2010, 8:50am

It might be, but that would depend on whether your code is really memory bandwidth bound on the GTX 465 or not. Your current card should do about 106 Gflop/s double precision peak, and has 102 GB/s peak memory bandwidth. My experience with the older GT200 (78 Gflop/s double and 100 GB/s main memory bandwidth) was that it was usually compute bound in double precision for the sort of stuff I do in CUDA. The only way to be sure is to do some benchmarking.
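
A quick back-of-the-envelope check along these lines is to compare a kernel's arithmetic intensity (flops per byte moved) with the card's balance point (peak flops divided by peak bandwidth). The sketch below uses the GTX 465 and GT200 figures quoted above; the DAXPY-style kernel is an illustrative assumption, not something from this thread, and the model ignores caching.

```python
def balance_point(peak_gflops, peak_gbytes_per_s):
    """Flops per byte at which compute and memory limits coincide."""
    return peak_gflops / peak_gbytes_per_s


def likely_bound(flops_per_byte, peak_gflops, peak_gbytes_per_s):
    """Classify a kernel by comparing its intensity to the balance point."""
    if flops_per_byte < balance_point(peak_gflops, peak_gbytes_per_s):
        return "memory"
    return "compute"


# Peak figures quoted in the post: (DP Gflop/s, GB/s).
GTX465 = (106.0, 102.0)
GT200 = (78.0, 100.0)

# Hypothetical kernel: y[i] = a * x[i] + y[i] in double precision.
# 2 flops per element; 2 loads + 1 store of 8 bytes each = 24 bytes.
daxpy_intensity = 2 / 24

print("GTX 465 balance:", round(balance_point(*GTX465), 2), "flop/byte")
print("GT200 balance:", round(balance_point(*GT200), 2), "flop/byte")
print("DAXPY on GTX 465 is likely",
      likely_bound(daxpy_intensity, *GTX465), "bound")
```

With a balance point near 1 flop per byte in double precision, a kernel needs only around 8 flops per double fetched from main memory before arithmetic becomes the limit, which fits the experience described above.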

Magorath · September 9, 2010, 9:01am

Ok. And how can I benchmark this ?

avidday · September 9, 2010, 9:42am

The “traditional” way would be to use the profiler to identify where your code spends most of its time, and analyse the operation count and memory transaction count of that code. Combined with timings, it should tell you whether the code is memory bandwidth bound or compute bound.
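
As a rough illustration of that bookkeeping, the sketch below converts an operation count, a byte count and a kernel time into achieved throughput as a fraction of peak. The counts and timing are made-up placeholders, not measurements; the peaks are the GTX 465 figures quoted earlier in the thread.

```python
def achieved_fraction(flop_count, bytes_moved, seconds,
                      peak_gflops, peak_gbytes_per_s):
    """Return (compute, memory) throughput as fractions of peak."""
    gflops = flop_count / seconds / 1e9
    gbytes = bytes_moved / seconds / 1e9
    return gflops / peak_gflops, gbytes / peak_gbytes_per_s


# Hypothetical kernel: 4e9 DP flops and 1.6e9 bytes of traffic in 50 ms.
compute_frac, memory_frac = achieved_fraction(
    flop_count=4.0e9, bytes_moved=1.6e9, seconds=0.05,
    peak_gflops=106.0, peak_gbytes_per_s=102.0)

print(f"compute: {compute_frac:.0%} of peak, memory: {memory_frac:.0%} of peak")
# Whichever fraction is closer to 1 is the more likely bottleneck.
print("likely", "compute" if compute_frac > memory_frac else "memory", "bound")
```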

A novel alternative someone on these boards suggested involves timing your code while independently underclocking the memory and core clock of your card. The slowdown versus memory clock and core clock relationships should show where the throughput limit of the code lies.
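
The underclocking idea can be reduced to a single number per clock domain. Assuming runtime scales roughly as clock^(-s), two timed runs give an estimate of s: close to 1 means the code tracks that clock, close to 0 means it barely notices. The clocks and timings below are invented for illustration only.

```python
import math


def clock_sensitivity(clock_a, time_a, clock_b, time_b):
    """Estimate s in time ~ clock**(-s) from two timed runs.

    s near 1: runtime tracks this clock; s near 0: insensitive to it.
    """
    return -math.log(time_b / time_a) / math.log(clock_b / clock_a)


# Hypothetical runs: (clock in MHz, kernel time in ms).
# Core clock lowered with memory clock fixed, then the reverse.
core_s = clock_sensitivity(1215, 10.0, 1000, 12.0)
mem_s = clock_sensitivity(1600, 10.0, 1300, 10.3)

print(f"core clock sensitivity:   {core_s:.2f}")
print(f"memory clock sensitivity: {mem_s:.2f}")
print("likely", "compute" if core_s > mem_s else "memory", "bound")
```

More than two clock settings per domain would make the estimate more robust, since a kernel can switch from one limit to the other partway through the range.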

Magorath · September 9, 2010, 9:48am

Mmmmh. So on Linux, you cannot use the “traditional” way. Let’s see if I can try the other one.

avidday · September 9, 2010, 9:53am

I am not sure I understand. I use Linux, and the “traditional” way is how I also do it. Just run the CUDA Visual Profiler to get timing and profiling data for your application.

Magorath · September 9, 2010, 9:55am

Damn. You are right. I was thinking about the new Visual C++ tool and not about the visual profiler.
