Buying Advice C2050/C2070

Hi all!

I’m currently running CUDA on a GTX 465 and I’m planning to build a small cluster of cards. The best choice currently available is the C2050/C2070, but this card is almost 10 times more expensive than the one I own. My question is thus quite simple: is the difference really worth it?

The code I have to run on these cards uses double precision only and is limited by the number of accesses to global device memory. Could these two points be improved with a “better” card?

Thanks in advance for any advice you could give.

Well, from the description of your code, I’d say yes. The C2050 has a 384-bit memory bus (vs. the 256-bit bus on the GTX 465), so you’ll get faster transfers to/from global memory. Also, the Fermi-based Tesla cards have much faster (by 4x, IIRC) double-precision support, so if that’s all you’re using, you might be better off with the more expensive card.

Or, if you’re planning on building a cluster anyway, then maybe just go with a bunch of GTX 465 cards (if your code scales well across multiple GPUs/nodes). Also, depending on your specific algorithm, you might be able to modify your code to use mixed precision (so you’d still get double-precision accuracy even though part of your code would run in single precision).
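To make the mixed-precision idea concrete, here’s a minimal sketch in NumPy (CPU code, not CUDA; the function name and the test matrix are made up for illustration): do the expensive solve in single precision, then refine the residual in double precision.

```python
import numpy as np

# Hypothetical sketch of mixed-precision iterative refinement:
# do the expensive solves in float32, accumulate corrections in float64.
def mixed_precision_solve(A, b, iters=5):
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                               # residual in double precision
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x += dx.astype(np.float64)                  # correction applied in double
    return x

np.random.seed(0)
A = np.random.rand(100, 100) + 100 * np.eye(100)    # well-conditioned test matrix
b = np.random.rand(100)
x = mixed_precision_solve(A, b)
print(np.allclose(x, np.linalg.solve(A, b)))
```

On a GPU the same structure pays off because the float32 solves run at the card’s much higher single-precision rate, while accuracy comes from the cheap double-precision residual updates.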

Well, I’m planning to build a cluster of 3 or 4 cards.

Do you know how many double-precision cores the GTX 465 possesses? And the C2050? I could not find these numbers.

The GTX 465 processes double precision floating point operations at 1/8 the single precision speed, whereas the C2050 does so at 1/2 the single precision speed. For fused multiply-add operations, that’s 106 GFLOPS for the GTX 465 and 512 GFLOPS for the C2050. So if your problem scales linearly across multiple cards, the GTX 465 is better “bang-for-the-buck”.
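For what it’s worth, those figures fall out of the published core counts and shader clocks (the numbers below are from the public Fermi specs; treat them as approximate):

```python
# Back-of-envelope peak double-precision GFLOPS.
# An FMA counts as 2 FLOPs per core per clock; scale by the DP throughput ratio.
def dp_gflops(cores, shader_clock_ghz, dp_ratio):
    return cores * 2 * shader_clock_ghz * dp_ratio

gtx465 = dp_gflops(352, 1.215, 1 / 8)  # GTX 465: 352 cores @ 1.215 GHz, 1/8 rate
c2050 = dp_gflops(448, 1.15, 1 / 2)    # C2050: 448 cores @ 1.15 GHz, 1/2 rate
print(round(gtx465), round(c2050))     # close to the ~106 and ~512 quoted above
```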

Wow. So if I’m right, the same kernel using double precision should run ~5 times faster on a C2050 than on my GTX 465 if I just consider the “computation time”. But this number will change due to the memory reads, which are only ~1.5 times faster. Is this correct?
Or maybe the larger number of cores on the C2050 will allow me to get a better figure than this 1.5. It is really difficult to make predictions…

Yes, for a memory-bandwidth-limited problem, you should look at the peak global memory bandwidth (which is never achieved in practice, but gives you a sense of memory speed) and not worry about the GFLOPS. The GTX 465 is 102 GB/sec and the C2050 is 144 GB/sec, so you won’t see a big improvement. In that case (again, assuming you can partition your problem across multiple cards easily), many GTX 465s would be even more cost-effective.
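Putting the two ratios side by side (same peak numbers as above) makes the bounds explicit:

```python
# Rough upper bounds on C2050-over-GTX-465 speedup, assuming the kernel is
# purely compute-bound or purely bandwidth-bound (real kernels sit in between).
compute_speedup = 512 / 106    # peak DP GFLOPS ratio
bandwidth_speedup = 144 / 102  # peak global memory bandwidth ratio
print(round(compute_speedup, 1), round(bandwidth_speedup, 1))
```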

Partitioning my problem isn’t an issue. Let me summarize.

Bandwidth limited problem: No big improvement if I build a cluster of C2050 instead of a cluster of GTX 465.
Computation speed limited problem: Factor ~5 improvement if I choose the C2050.

Is there a way to conclude safely about the kind of limitation I’m facing? Until now, I’ve been looking at the occupancy, which is very low in the case of my kernel.
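One common way to answer this on paper is a roofline-style check: estimate your kernel’s FLOPs per byte of global memory traffic and compare it to the card’s balance point. The sketch below uses a made-up intensity value for illustration; occupancy alone won’t tell you which resource you’re bound by.

```python
# Roofline-style classification: a kernel whose arithmetic intensity is below
# the card's balance point (peak GFLOPS / peak GB/s) is bandwidth-bound.
def limiting_factor(flops_per_byte, peak_gflops, peak_gbps):
    balance = peak_gflops / peak_gbps   # FLOP/byte where the two limits cross
    return "compute-bound" if flops_per_byte > balance else "bandwidth-bound"

# Hypothetical kernel doing ~0.5 DP FLOPs per byte of global memory traffic:
print(limiting_factor(0.5, 106, 102))  # GTX 465
print(limiting_factor(0.5, 512, 144))  # C2050
```

Note the C2050’s balance point (~3.6 FLOP/byte) is higher than the GTX 465’s (~1.0), so a kernel can move further into bandwidth-bound territory on the faster card.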

If determining FLOPS vs. bandwidth bottleneck isn’t easy to do from inspecting the code, another trick is to find a tool that lets you change your GPU clocks. Turn the shader clock down 10%, benchmark your code, then put it back and turn the memory clock down 10% and benchmark again. See which change correlates to the largest performance drop.
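Interpreting that experiment can be made a bit more quantitative. The helper below is a sketch with made-up timings: it reports what fraction of the worst-case slowdown you actually observed for each clock.

```python
# Fraction of the theoretical worst-case slowdown observed after reducing a
# clock by `clock_drop`; values near 1.0 mean you're bound by that clock.
def sensitivity(t_base, t_reduced, clock_drop=0.10):
    worst_case = 1 / (1 - clock_drop) - 1   # 10% clock drop -> ~11.1% slower
    return (t_reduced / t_base - 1) / worst_case

# Made-up benchmark timings in ms:
print(sensitivity(10.0, 10.2))  # shader clock -10%: barely slower
print(sensitivity(10.0, 11.0))  # memory clock -10%: nearly the full slowdown
```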

It’s worth optimizing against price as well. The GTX 465 runs for $250, whereas it looks like the C2050 is ~$2300. In the case where the problem partitions linearly, the GTX 465 wins on cost. I don’t think it wins on power usage or physical space, though. Many dimensions to consider… :)
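Using the prices and peak numbers from this thread (street prices, so treat them as approximate), the price-normalized comparison looks like this:

```python
# Dollars per peak DP GFLOPS and per GB/s of peak global memory bandwidth.
cards = {
    "GTX 465": {"price": 250, "dp_gflops": 106, "gbps": 102},
    "C2050": {"price": 2300, "dp_gflops": 512, "gbps": 144},
}
for name, c in cards.items():
    print(name,
          f"${c['price'] / c['dp_gflops']:.2f}/GFLOPS",
          f"${c['price'] / c['gbps']:.2f}/(GB/s)")
```

The gap is largest on the bandwidth axis: the GTX 465 delivers a GB/s of peak bandwidth for roughly a sixth of what the C2050 charges, which is exactly the bandwidth-limited case discussed above.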
