Which is better: 2x GTX460 in SLI or 1x GTX480?

Hi All,

I have to buy a GPU card and I wonder which option will be better.
What do you think: 2x GTX460 in SLI or 1x GTX480? The price is the same.
Are there any additional problems when writing programs for SLI cards?

SLI doesn’t help CUDA, so the question is whether you can partition your code to run kernels on both cards (from different host threads).
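A minimal host-side sketch of that partitioning, assuming the work is an array of `n` independent elements that can be cut into contiguous chunks. The `Chunk` struct and `split_across_devices` helper are hypothetical names for illustration. With the CUDA 3.x toolkits of this era, each host thread could drive only one device, so in real code each chunk would go to its own host thread, which calls `cudaSetDevice(d)` before allocating memory and launching kernels on card `d`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of element indices assigned to one GPU.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: split n independent elements into `devices`
// contiguous chunks whose sizes differ by at most one element.
// In a multi-GPU CUDA program, chunk d would be processed by a dedicated
// host thread that first calls cudaSetDevice(d).
std::vector<Chunk> split_across_devices(std::size_t n, std::size_t devices) {
    assert(devices > 0);
    std::vector<Chunk> chunks;
    chunks.reserve(devices);
    const std::size_t base = n / devices;   // minimum elements per device
    const std::size_t extra = n % devices;  // the first `extra` devices get one more
    std::size_t begin = 0;
    for (std::size_t d = 0; d < devices; ++d) {
        const std::size_t size = base + (d < extra ? 1 : 0);
        chunks.push_back({begin, begin + size});
        begin += size;
    }
    return chunks;
}
```

For example, 7 elements over two cards give the ranges [0, 4) and [4, 7). Contiguous chunks keep each card's host-to-device copy a single transfer.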

Correct, don’t SLI them. Use them as two independent cards.

On a gigaflop-to-gigaflop basis, two 460s (1 GB) will significantly outperform one 480.

http://en.wikipedia.org/wiki/GeForce_400_Series#400_Series
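The gigaflop comparison can be checked with the usual peak single-precision estimate: CUDA cores × shader clock × 2, counting one fused multiply-add as two flops. The core counts and shader clocks below are the reference specifications as I understand them (GTX460: 336 cores at 1350 MHz; GTX480: 480 cores at 1401 MHz), so treat them as assumptions. Factory-overclocked boards will differ, and a peak number says nothing about memory bandwidth or multi-GPU overhead.

```cpp
#include <cassert>

// Theoretical peak single-precision throughput in GFLOPS:
// cores * shader clock (GHz) * 2 flops per fused multiply-add.
double peak_sp_gflops(int cuda_cores, double shader_clock_mhz) {
    assert(cuda_cores > 0 && shader_clock_mhz > 0.0);
    return cuda_cores * (shader_clock_mhz / 1000.0) * 2.0;
}

// Assumed reference specs (not stated in the thread).
constexpr int kGtx460Cores = 336;
constexpr double kGtx460ShaderMhz = 1350.0;
constexpr int kGtx480Cores = 480;
constexpr double kGtx480ShaderMhz = 1401.0;

// Ratio of two GTX460s' combined peak to one GTX480's peak.
double dual_460_vs_480() {
    const double two_460 = 2.0 * peak_sp_gflops(kGtx460Cores, kGtx460ShaderMhz);
    const double one_480 = peak_sp_gflops(kGtx480Cores, kGtx480ShaderMhz);
    return two_460 / one_480;
}
```

With these numbers one GTX460 peaks at about 907 GFLOPS and a GTX480 at about 1345 GFLOPS, so the pair has roughly 35% more peak compute. Real applications will see less, depending on how well they split across cards.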

Though if you can’t split the data set between the two cards, you’ll end up doing twice the host/device (d2h/h2d) transfer per compute, and that might be a consideration.
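To make the transfer cost concrete: if the data set can be partitioned, the two cards together receive it once; if every card needs the whole data set, the host-to-device volume grows with the number of cards. A rough sketch of the host-to-device side, where the function names are hypothetical and the effective bandwidth is an input you would measure on your own machine (for example with the CUDA SDK's bandwidthTest sample). The time estimate pessimistically assumes copies to different cards share one effective host bandwidth.

```cpp
#include <cassert>
#include <cstddef>

// Total bytes copied host-to-device for one pass over the data set.
// partitioned == true: each card receives only its slice, so the total
// equals the data set size. partitioned == false: every card needs a full copy.
std::size_t h2d_bytes(std::size_t dataset_bytes, std::size_t devices, bool partitioned) {
    assert(devices > 0);
    return partitioned ? dataset_bytes : dataset_bytes * devices;
}

// Hypothetical copy-time estimate, assuming all copies share one effective
// host bandwidth given in GB/s (measure this on your own system).
double h2d_seconds(std::size_t bytes, double effective_gb_per_s) {
    assert(effective_gb_per_s > 0.0);
    return static_cast<double>(bytes) / (effective_gb_per_s * 1e9);
}
```

If the kernels are short relative to the copies, doubling the transferred bytes can eat much of the second card's advantage.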

Also, 460s run a lot cooler, so you can overclock them pretty well (I’m using an MSI Cyclone OC’d to 900 MHz) and thus approach the horsepower of a 480 with only a single 460.
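A quick check of the overclocking claim, assuming the 900 MHz figure is the core clock and that, as on other Fermi parts, the shaders run at twice the core clock (so 1800 MHz). It uses the same peak estimate (cores × shader clock × 2) with assumed reference specs: 336 cores for the GTX460, and 480 cores at 1401 MHz for the GTX480. The function names are illustrative.

```cpp
#include <cassert>

// Assumption: Fermi shaders run at twice the core ("graphics") clock.
double shader_mhz_from_core(double core_mhz) { return 2.0 * core_mhz; }

// Peak single-precision GFLOPS: cores * shader clock (GHz) * 2 flops per FMA.
double peak_gflops(int cores, double shader_mhz) {
    assert(cores > 0 && shader_mhz > 0.0);
    return cores * shader_mhz / 1000.0 * 2.0;
}

// Fraction of a stock GTX480's peak reached by one GTX460 at a given core clock.
double oc_460_fraction_of_480(double core_mhz_460) {
    const double gtx460 = peak_gflops(336, shader_mhz_from_core(core_mhz_460));
    const double gtx480 = peak_gflops(480, 1401.0);  // assumed stock GTX480 shader clock
    return gtx460 / gtx480;
}
```

At a 900 MHz core clock this gives about 1210 GFLOPS, roughly 90% of a stock GTX480's peak, versus about 67% at the assumed stock 675 MHz.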

Long story short, on a price-performance basis I’d take two 460s over one 480 any day.

Though I don’t blame NVIDIA for milking the big spenders. Designing these beasts can’t be cheap.

It appears Tom’s Hardware already did the comparison, and the conclusions are rather unambiguous.

I can confirm on two of my CUDA apps that two GTX460s are faster than a GTX480 by about 25%.
There are many caveats with this, though. First, my app itself scales linearly with GPUs (I usually run with three GTX480s!), so there are no multi-GPU inefficiencies. Second, the app is mildly memory-bandwidth limited; a GTX295 slightly beats the GTX480 on the same app for this reason. Two GTX460s will have slightly more bandwidth but much more compute than a GTX295.
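The bandwidth side of that comparison follows from bus width × effective memory data rate. The configurations below are the reference specifications as I understand them (GTX460 1 GB: 256-bit at 3600 MT/s; GTX480: 384-bit at 3696 MT/s; GTX295: two GPUs, each 448-bit at 1998 MT/s), so treat them as assumptions. The function names are illustrative.

```cpp
#include <cassert>

// Peak DRAM bandwidth in GB/s: (bus width in bytes) * effective transfers per second.
double mem_bandwidth_gbs(int bus_width_bits, double effective_mts) {
    assert(bus_width_bits > 0 && effective_mts > 0.0);
    return (bus_width_bits / 8.0) * effective_mts / 1000.0;
}

// Assumed reference configurations, summed over all GPUs involved.
double two_gtx460_gbs() { return 2.0 * mem_bandwidth_gbs(256, 3600.0); }  // ~230.4
double gtx480_gbs()     { return mem_bandwidth_gbs(384, 3696.0); }        // ~177.4
double gtx295_gbs()     { return 2.0 * mem_bandwidth_gbs(448, 1998.0); }  // ~223.8
```

These numbers line up with the observation: the GTX295 has more aggregate bandwidth than the GTX480, and two GTX460s edge slightly past the GTX295.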

I still prefer a single GTX480 since I’m limited by the size of the cards in a rig… it’s always hard to put more than 3 double-width cards in a cost-effective machine.

But we may see a non-reference dual GF104 card soon from Zotac!

(And there’s also the upcoming GF110 GTX580…)

Any word on whether GF110 will be compute capability 2.1, and therefore have this weird third instruction pipeline per multiprocessor?

No, it’s all speculation. Only the names GF110 and GTX580 are certain (there are references to them in the very latest GeForce drivers).

It could be a respin of GF100 with silicon tweaks, but I like your theory of a 2.1 device, scaled up to more SMs than GF104.

BTW, compute 2.1 has four execution units, not three: two for each scheduler. Each execution unit picks among 7 resources: one set of 16 FP+int32+DP SPs, two smaller sets of 16 FP-only SPs, LD/ST, texture, math SFU, and interpolation SFU. Ideally the superscalar execution will keep all 4 execution units busy every clock. GF100 has only two execution units.

(Actually, compute 2.1 only lists 48 SPs per SM as a feature. The superscalar scheduling is not documented and could always be changed by NVIDIA. But it worked great for GF104, so I bet they’ll keep it.)
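These SM layouts show up in the core count if you combine the compute capability with the SM count that `cudaGetDeviceProperties` reports (the `major`, `minor` and `multiProcessorCount` fields of `cudaDeviceProp`). A host-side sketch covering only the two Fermi variants in this thread; the lookup function is hypothetical, the 32 and 48 cores-per-SM values and the 15-SM GTX480 / 7-SM GTX460 configurations are my assumptions from the published specs, and the dispatch counts just encode the description above, which NVIDIA does not document as a guaranteed feature.

```cpp
#include <cassert>

// Fermi SM parameters by compute capability (only 2.0 and 2.1 covered).
struct SmLayout {
    int cores_per_sm;    // single-precision "CUDA cores" per SM
    int dispatch_units;  // instructions dispatched per clock per SM (2 schedulers)
};

// Hypothetical lookup; returns {0, 0} for capabilities not covered here.
SmLayout fermi_sm_layout(int major, int minor) {
    if (major == 2 && minor == 0) return {32, 2};  // GF100: 2 schedulers x 1 dispatch
    if (major == 2 && minor == 1) return {48, 4};  // GF104: 2 schedulers x 2 dispatch
    return {0, 0};
}

// Total cores = SM count (cudaDeviceProp::multiProcessorCount) * cores per SM.
int total_cuda_cores(int major, int minor, int sm_count) {
    assert(sm_count > 0);
    return fermi_sm_layout(major, minor).cores_per_sm * sm_count;
}
```

Under these assumptions a 15-SM compute 2.0 part (GTX480) gives 480 cores and a 7-SM compute 2.1 part (GTX460) gives 336.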

ptxas knows about compute 2.2, 2.3 and 3.0, and it can generate code for them. The instructions are the same for all 2.x devices and even for 3.0; only the instruction order and register allocation differ (see the attachment _Z16reorderParticlesv.zip).

So I think the GTX580 will be a 2.2 or 2.3 device.
_Z16reorderParticlesv.zip (7.3 KB)
