Hardware for a high-end development system

So, knowing what we know about the performance of the 600 series, is there a consensus on what to get when you’re building a high-performance GPU development system on a reasonable budget?

Options:
2x 580 ($800)
2x 680 ($1000)
1x 690 ($1000)

I’d love to get 2x 590s and I expected them to come down to this price range by now, but apparently, instead of going down, they went up (most sellers want $800+). Are those out of production by now? If I wait some more, will they come down, or will they disappear from the market completely?

Personally, I’m waiting for K20 before making any more purchases. K10 is mostly for the CAD/viz guys who use the workstation graphics that the Kepler GK104-based K10 excels at.

But yes, the Fermi 580s are the best bang-for-buck cards for single-precision CUDA apps right now. The 590 is better if you’re trying to max out GPUs per box, and of course there are Fermi-based Quadros/Teslas for the apps that need large RAM, ECC, DP throughput, etc.

The 680 and 690 aren’t going to give you the CUDA horsepower the Fermis will. They’ll play Battlefield 3 better though!

But again, this could all change with K20 this fall.

You should get cards with as much VRAM per GPU as possible.
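
If you want to sanity-check how much memory the CUDA runtime actually reports per GPU, a minimal device-query sketch (standard runtime API calls only) is enough:

    // Print name, total global memory and compute capability of every CUDA device.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s, %.1f GB global memory, compute capability %d.%d\n",
                   dev, prop.name, prop.totalGlobalMem / 1073741824.0,
                   prop.major, prop.minor);
        }
        return 0;
    }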

I could be misinformed, but it seems to me that K20 (GK110) was announced as a Tesla-only offering, the big brother of Tesla K10. And the only place I could find that sells Tesla K10 is asking $4300. So K20 will be way, way out of my and almost anyone else’s price range. But we’ll see what happens.

Edit: I read the GK110 article on techreport.com. There is a chance that GK110 will appear in consumer-grade cards, but it will almost certainly not happen before Q1 '13, and it may take longer than that…

I wouldn’t necessarily agree with all of that. The GTX 600 series has lots of memory bandwidth, lots of memory and lots of texturing throughput. There are some CUDA applications where it really shines. Personally, I think the GTX 670 4GB is a great buy.

Update: I just did some benchmarks for my application. I’m achieving 50-60 GTexels/s of texturing operations and 100-120 GB/s of (uncached) global memory operations simultaneously on my GTX 670. Even the specifications for the GTX 570 and 580 are only 43.9 and 49.4 GTexels/s respectively, and real-world measurements will be somewhat lower. (The global-memory side of that kind of measurement is sketched below.)
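
This is roughly what the global-memory half of that measurement boils down to: a generic streaming-copy kernel timed with CUDA events. It is only a sketch, not my actual application, and the buffer size and launch configuration are arbitrary:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stream n floats from one buffer to another (one read + one write per element).
    __global__ void copyKernel(const float* __restrict__ in, float* __restrict__ out, size_t n)
    {
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
             i < n;
             i += (size_t)gridDim.x * blockDim.x)
            out[i] = in[i];
    }

    int main()
    {
        const size_t n = 64 * 1024 * 1024;              // 64M floats = 256 MB per buffer
        float *d_in = 0, *d_out = 0;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        copyKernel<<<1024, 256>>>(d_in, d_out, n);      // warm-up launch
        cudaEventRecord(start);
        copyKernel<<<1024, 256>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbytes = 2.0 * n * sizeof(float) / 1e9;  // bytes read + bytes written
        printf("Effective bandwidth: %.1f GB/s\n", gbytes / (ms / 1e3));

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }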

Update: For me the perfect card would be a pair of GTX 670 4GB on a single card. The K10 is pretty much that but of course at Tesla prices.

The best choice depends on what you plan to do. I have both a 580 and a 680 in the same computer and find performance differences in CUDA programs that I use ranging from the GTX 680 being 10% slower to 100% faster than the GTX 580, with the most common case being about the same speed or slightly slower. Now that I have a GK104 card to experiment with, I also don’t plan to buy any more until the GK110-based GeForce cards come out in early 2013.

I find that in the 680 vs. 580 comparison things are very mixed:

    [*] Best power efficiency - GTX 680: At roughly the same speed as the GTX 580, but with 20% less power usage, at least based on the TDP. I have not yet tried benchmarking the power usage of the 580 vs. the 680 individually since my development system has 4 cards installed. In addition, the idle power draw of the GTX 680 is lower, which is also nice for workstations where you intersperse CUDA and non-CUDA usage.

    [*] Best memory bandwidth - GTX 580: Although both devices have nearly identical theoretical memory bandwidth, pretty much every test I’ve run or read about indicates the practical throughput on the GTX 580 is 7-10% better.

    [*] Best host-to-device (and vice versa) bandwidth - GTX 680: If you are building a new system, then you are probably getting a motherboard with PCI-Express 3.0 support. In this case (but be careful that your motherboard is actually supported!), you’ll see roughly double the host-to-device and device-to-host bandwidth (see the copy-bandwidth sketch at the end of this post).

    [*] Best price - GTX 580: Assuming you can still find them in stock, the GTX 580 is 20% cheaper, as you noted.

    [*] Most device memory in a GeForce card - GTX 680: Now you can get a 4 GB card, whereas the GTX 580 only went up to 3 GB.

    [*] Fastest raw double precision and integer shift performance - GTX 580: The new SMX design feels a lot more like the SM in compute capability 2.1, and so double precision performance is hurt significantly.

    [*] Fastest raw single precision and special function performance - GTX 680: The program I mentioned that is 100% faster? It is basically all single precision and special functions. However, most programs are not limited only by single precision floating point throughput.

    [*] Most cache, cache per active thread, and shared memory per active thread - GTX 580: I suspect that many programs seeing a large drop in performance on the GTX 680 are running into problems with the drop in cache size (or instruction scheduling, see below). The effective drop in shared memory per active thread might also hurt programs that use this space for per-thread scratch space that needs to be shared with the block.

    [*] Fastest atomics - GTX 680: The latest atomic performance is pretty impressive.

    [*] Most “predictable” performance - GTX 580: (This is fairly subjective, so you should feel free to disregard it.) I think compute capability 2.1 made a lot of CUDA developers uneasy because it made the ability to dual-issue instructions from the same warp very important to achieving full throughput. Since that is very code (and compiler) dependent, the theoretical performance numbers for compute capability 2.1 rarely lined up with actual experience, which was in general worse than compute capability 2.0 once you scaled by clock rate and # of CUDA cores. Kepler, with its massive number of instruction pipelines in each SMX, now really depends on dual-issue for full throughput. Compilers are getting better, but the lack of control and predictability can be annoying for the programmer. I think the backlash we saw when Kepler came out was driven by the huge mismatch between actual performance and the device parameters themselves. Clock rate * # of CUDA cores does not extrapolate between architectures.

tl;dr: If any of those GTX 680 benefits jump out as being really important for your program, I would get the GTX 680. Otherwise, I would go with a GTX 580 and save the money. That said, when the GTX 580 disappears from the retail sales channels, it won’t be a total disaster.
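
As mentioned in the PCI-Express item above, the copy-bandwidth measurement I have in mind is just a pinned-memory cudaMemcpy timed with CUDA events. A minimal sketch (transfer size arbitrary) looks like this; on a PCIe 3.0 x16 link you would hope to see roughly double the PCIe 2.0 figure:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 256 * 1024 * 1024;     // 256 MB transfer
        float *h_buf = 0, *d_buf = 0;
        cudaMallocHost((void**)&h_buf, bytes);      // pinned (page-locked) host memory
        cudaMalloc((void**)&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // warm-up copy

        cudaEventRecord(start);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Host-to-device: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }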

That’s a pretty good analysis.

Edit: The power efficiency is pretty astounding. On my Ivy Bridge system I’m running a pair of GTX 670 4GB from a passive 460W PSU without problems (it’s only drawing about 370W from the wall!).

Regarding cost, the GTX 670 has most of the performance (and crucially the same memory bandwidth) of the GTX 680 at a significantly lower cost.

I was going to mention PCI-E 3.0 but bear in mind that nVidia only officially supports this on Ivy Bridge which currently limits you to a single CPU, dual channel memory and a handful of PCI-E lanes. It doesn’t work on my C606 dual Xeon E5 motherboard.

I haven’t tried the atomics yet but I have an application in mind so it will be interesting to see how that works out.
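
Something as simple as a contended global-memory histogram should already show the difference. This is just a generic atomicAdd sketch (sizes and launch configuration arbitrary), not my actual application:

    #include <cstdio>
    #include <cuda_runtime.h>

    // 256-bin histogram using global memory atomics.
    __global__ void histogram(const unsigned char* data, size_t n, unsigned int* bins)
    {
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
             i < n;
             i += (size_t)gridDim.x * blockDim.x)
            atomicAdd(&bins[data[i]], 1u);          // contended global atomics
    }

    int main()
    {
        const size_t n = 64 * 1024 * 1024;
        unsigned char* d_data = 0;
        unsigned int*  d_bins = 0;
        cudaMalloc((void**)&d_data, n);
        cudaMalloc((void**)&d_bins, 256 * sizeof(unsigned int));
        cudaMemset(d_data, 0, n);                   // worst case: every thread hits bin 0
        cudaMemset(d_bins, 0, 256 * sizeof(unsigned int));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        histogram<<<1024, 256>>>(d_data, n, d_bins);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%.0f million atomics/s\n", n / 1e6 / (ms / 1e3));

        cudaFree(d_data);
        cudaFree(d_bins);
        return 0;
    }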

A lot of people have been comparing Kepler to the GTX 460. I personally found the GTX 460 to be a complete disaster yet I have no problems (apart from PCI-E 3.0 support) with the GTX 670.

Assuming that you don’t need to optimise your code for older hardware, I would personally go with a GTX 670 or 680. Otherwise you’ll probably end up writing code that is optimal for Fermi, and at some point down the road you’ll have to rewrite it for newer architectures.

Thanks everyone. All things considered, I think I’ll go with a pair of second-hand 590s, which should be feasible on my budget. Hopefully, by the time they burn out, the consumer-grade version of GK110 will be out. PCI-E 3.0 is potentially nice, but I’m starting with a Sandy Bridge-E processor. 3 GB (x2) should be enough.

Edit: just grabbed an EVGA GTX 590 on eBay for $480.

“Regarding cost, the GTX 670 has most of the performance (and crucially the same memory bandwidth) of the GTX 680 at a significantly lower cost.”

How so?

In terms of GFLOPS/$:

GTX 570: 1405.4 / 349 = 4.027 GFLOPS/$

GTX 580: 1581.1 / 499 = 3.169 GFLOPS/$

GTX 590: 2488.3 / 699 = 3.560 GFLOPS/$

GTX 670: 2460.0 / 399 = 6.165 GFLOPS/$

GTX 680: 3090.4 / 499 = 6.193 GFLOPS/$

GTX 690: 5621.76 / 999 = 5.627 GFLOPS/$

680 is the best deal for us compute folks.
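
For reference, those peak numbers are just CUDA cores × shader clock × 2 FLOP per cycle (one FMA). A quick host-side sketch using the usual reference-card figures (double-check against your own card’s specs; Fermi’s shader clock runs at twice the core clock, while Kepler dropped the hot clock):

    // Peak single-precision GFLOPS = CUDA cores * shader clock (GHz) * 2 (FMA).
    #include <cstdio>

    struct Card { const char* name; int cores; double clock_ghz; double price_usd; };

    int main()
    {
        const Card cards[] = {
            { "GTX 570", 480,  1.464, 349.0 },
            { "GTX 580", 512,  1.544, 499.0 },
            { "GTX 670", 1344, 0.915, 399.0 },
            { "GTX 680", 1536, 1.006, 499.0 },
        };
        const int count = sizeof(cards) / sizeof(cards[0]);
        for (int i = 0; i < count; ++i) {
            double gflops = cards[i].cores * cards[i].clock_ghz * 2.0;
            printf("%s: %7.1f peak GFLOPS, %.3f GFLOPS/$\n",
                   cards[i].name, gflops, gflops / cards[i].price_usd);
        }
        return 0;
    }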

The main problem is that fewer programs seem to achieve the peak GFLOPS with Kepler than with Fermi. As has been the case for a long time, GFLOPS is not a great predictor of program performance…

Well, I’m in the UK and the numbers work out a bit differently over here. Excluding VAT:

GTX 680 2GB 3090.4/339 = 9.12 GFLOP/£

GTX 680 4GB 3090.4/408 = 7.57 GFLOP/£

GTX 670 2GB 2460.0/250 = 9.84 GFLOP/£

GTX 670 4GB (with ~6% factory overclock) 2607.6/330 = 7.90 GFLOP/£ (okay, the overclocked card is a bit of a cheat but you can’t buy a stock GTX 670 4GB over here)

And that isn’t even factoring in the fact that the memory bandwidth is exactly the same on the GTX 670 and GTX 680.

I agree with Seibert. $/FLOP is likely not the metric you want to optimize.
Performance tests should use actual benchmarks of tools similar (if not identical) to what you need to run, not proxy estimates.

That’s probably the PRIMARY decision maker. But RAM size, PCIe speed, wattage, availability, graphics performance, ECC, SM 3.0, etc., all make their own contributions, to which you’ll assign your own personal and unique weights.