Buying Advice C2050/C2070

Hi all!

I’m currently running CUDA on a GTX 465 and I’m planning to build a small cluster of cards. The best choice currently available is the C2050/C2070, but this card is almost 10 times more expensive than the one I own. My question is thus quite simple: is the difference really worth it?

The code I have to run on these cards uses double precision only and is limited by the number of accesses to global device memory. Could these two points be improved with a “better” card?

Thanks in advance for any advice you could give.

Well, from the description of your code, I’d say yes. The C2050 has a 384-bit memory bus (vs. the 256-bit bus on the GTX 465), so you’ll get faster transfers to/from global memory. Also, the Fermi-based Tesla cards have much faster (by 4x, IIRC) double-precision support, so if that’s all you’re using, you might be better off with the more expensive card.

Or, if you’re planning on building a cluster anyway, then maybe just go with a bunch of GTX 465 cards (if your code scales well across multiple GPUs/nodes). Also, depending on your specific algorithm, you might be able to modify your code to use mixed precision (so you’d still get double-precision accuracy even though part of your code would run in single precision).
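To make the mixed-precision idea concrete, here’s a minimal sketch in NumPy (CPU code, not CUDA; the function name and the test matrix are made up for illustration): do the expensive solve in single precision, then refine the residual in double precision.

```python
import numpy as np

# Hypothetical sketch of mixed-precision iterative refinement:
# do the expensive solves in float32, accumulate corrections in float64.
def mixed_precision_solve(A, b, iters=5):
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                               # residual in double precision
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x += dx.astype(np.float64)                  # correction applied in double
    return x

np.random.seed(0)
A = np.random.rand(100, 100) + 100 * np.eye(100)    # well-conditioned test matrix
b = np.random.rand(100)
x = mixed_precision_solve(A, b)
print(np.allclose(x, np.linalg.solve(A, b)))
```

On a GPU the same structure pays off because the float32 solves run at the card’s much higher single-precision rate, while accuracy comes from the cheap double-precision residual updates.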

Well, I’m planning to build a cluster of 3 or 4 cards.

Do you know how many double-precision cores the GTX 465 possesses? And the C2050? I could not find these numbers.

The GTX 465 processes double precision floating point operations at 1/8 the single precision speed, whereas the C2050 does so at 1/2 the single precision speed. For fused multiply-add operations, that’s 106 GFLOPS for the GTX 465 and 512 GFLOPS for the C2050. So if your problem scales linearly across multiple cards, the GTX 465 is better “bang-for-the-buck”.
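For what it’s worth, those figures fall out of the published core counts and shader clocks (the numbers below are from the public Fermi specs; treat them as approximate):

```python
# Back-of-envelope peak double-precision GFLOPS.
# An FMA counts as 2 FLOPs per core per clock; scale by the DP throughput ratio.
def dp_gflops(cores, shader_clock_ghz, dp_ratio):
    return cores * 2 * shader_clock_ghz * dp_ratio

gtx465 = dp_gflops(352, 1.215, 1 / 8)  # GTX 465: 352 cores @ 1.215 GHz, 1/8 rate
c2050 = dp_gflops(448, 1.15, 1 / 2)    # C2050: 448 cores @ 1.15 GHz, 1/2 rate
print(round(gtx465), round(c2050))     # close to the ~106 and ~512 quoted above
```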

Wow. So if I’m right, the same kernel using double precision should run ~5 times faster on a C2050 than on my GTX 465 if I just consider the “computation time”. But this number will change due to the memory reads, which are only ~1.5 times faster. Is this correct?
Or maybe the larger number of cores on the C2050 will allow me to get a better figure than this 1.5. It is really difficult to make predictions…

Yes, for a memory-bandwidth-limited problem, you should look at the peak global memory bandwidth (which is never achieved in practice, but gives you a sense of memory speed) and not worry about the GFLOPS. The GTX 465 is 102 GB/sec and the C2050 is 144 GB/sec, so you won’t see a big improvement. In that case (again, assuming you can partition your problem across multiple cards easily), many GTX 465s would be even more cost-effective.
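Putting the two ratios side by side (same peak numbers as above) makes the bounds explicit:

```python
# Rough upper bounds on C2050-over-GTX-465 speedup, assuming the kernel is
# purely compute-bound or purely bandwidth-bound (real kernels sit in between).
compute_speedup = 512 / 106    # peak DP GFLOPS ratio
bandwidth_speedup = 144 / 102  # peak global memory bandwidth ratio
print(round(compute_speedup, 1), round(bandwidth_speedup, 1))
```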

Partitioning my problem isn’t an issue. Let me summarize.

Bandwidth limited problem: No big improvement if I build a cluster of C2050 instead of a cluster of GTX 465.
Computation speed limited problem: Factor ~5 improvement if I choose the C2050.

Is there a way to conclude safely about the kind of limitation I’m facing? Until now, I’ve been looking at the occupancy, which is very low in the case of my kernel.
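One common way to answer this on paper is a roofline-style check: estimate your kernel’s FLOPs per byte of global memory traffic and compare it to the card’s balance point. The sketch below uses a made-up intensity value for illustration; occupancy alone won’t tell you which resource you’re bound by.

```python
# Roofline-style classification: a kernel whose arithmetic intensity is below
# the card's balance point (peak GFLOPS / peak GB/s) is bandwidth-bound.
def limiting_factor(flops_per_byte, peak_gflops, peak_gbps):
    balance = peak_gflops / peak_gbps   # FLOP/byte where the two limits cross
    return "compute-bound" if flops_per_byte > balance else "bandwidth-bound"

# Hypothetical kernel doing ~0.5 DP FLOPs per byte of global memory traffic:
print(limiting_factor(0.5, 106, 102))  # GTX 465
print(limiting_factor(0.5, 512, 144))  # C2050
```

Note the C2050’s balance point (~3.6 FLOP/byte) is higher than the GTX 465’s (~1.0), so a kernel can move further into bandwidth-bound territory on the faster card.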

If determining FLOPS vs. bandwidth bottleneck isn’t easy to do from inspecting the code, another trick is to find a tool that lets you change your GPU clocks. Turn the shader clock down 10%, benchmark your code, then put it back and turn the memory clock down 10% and benchmark again. See which change correlates to the largest performance drop.
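Interpreting that experiment can be made a bit more quantitative. The helper below is a sketch with made-up timings: it reports what fraction of the worst-case slowdown you actually observed for each clock.

```python
# Fraction of the theoretical worst-case slowdown observed after reducing a
# clock by `clock_drop`; values near 1.0 mean you're bound by that clock.
def sensitivity(t_base, t_reduced, clock_drop=0.10):
    worst_case = 1 / (1 - clock_drop) - 1   # 10% clock drop -> ~11.1% slower
    return (t_reduced / t_base - 1) / worst_case

# Made-up benchmark timings in ms:
print(sensitivity(10.0, 10.2))  # shader clock -10%: barely slower
print(sensitivity(10.0, 11.0))  # memory clock -10%: nearly the full slowdown
```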

It’s worth optimizing against price as well. The GTX 465 runs for $250, whereas it looks like the C2050 is ~$2300. In the case where the problem partitions linearly, the GTX 465 wins on cost. I don’t think it wins on power usage or physical space, though. Many dimensions to consider… :)
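Using the prices and peak numbers from this thread (street prices, so treat them as approximate), the price-normalized comparison looks like this:

```python
# Dollars per peak DP GFLOPS and per GB/s of peak global memory bandwidth.
cards = {
    "GTX 465": {"price": 250, "dp_gflops": 106, "gbps": 102},
    "C2050": {"price": 2300, "dp_gflops": 512, "gbps": 144},
}
for name, c in cards.items():
    print(name,
          f"${c['price'] / c['dp_gflops']:.2f}/GFLOPS",
          f"${c['price'] / c['gbps']:.2f}/(GB/s)")
```

The gap is largest on the bandwidth axis: the GTX 465 delivers a GB/s of peak bandwidth for roughly a sixth of what the C2050 charges, which is exactly the bandwidth-limited case discussed above.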
