The best choice depends on what you plan to do. I have both a 580 and a 680 in the same computer and find performance differences in CUDA programs that I use ranging from the GTX 680 being 10% slower to 100% faster than the GTX 580, with the most common case being about the same speed or slightly slower. Now that I have a GK104 card to experiment with, I also don’t plan to buy any more until the GK110-based GeForce cards come out in early 2013.
I find that in the 680 vs. 580 comparison things are very mixed:
[*] Best power efficiency - GTX 680: It runs at roughly the same speed as the GTX 580 but uses about 20% less power, at least going by the TDP. I have not yet benchmarked the power usage of the 580 and the 680 individually, since my development system has 4 cards installed. In addition, the idle power draw of the GTX 680 is lower, which is also nice for workstations where you intersperse CUDA and non-CUDA usage.
[*] Best memory bandwidth - GTX 580: Although both devices have nearly identical theoretical memory bandwidth, pretty much every test I’ve run or read about indicates the practical throughput on the GTX 580 is 7-10% better.
[*] Best host-to-device (and vice-versa) bandwidth - GTX 680: If you are building a new system, then you are probably getting a motherboard with PCI-Express 3.0 support. In this case (but be careful that your motherboard is supported!), you’ll see roughly double the host-to-device and device-to-host bandwidth.
[*] Best price - GTX 580: Assuming you can still find them in stock, the GTX 580 is 20% cheaper, as you noted.
[*] Most device memory in a GeForce card - GTX 680: You can now get a 4 GB card, whereas the GTX 580 only went up to 3 GB.
[*] Fastest raw double precision and integer shift performance - GTX 580: The new SMX design feels a lot more like the SM in compute capability 2.1, and so double precision performance is hurt significantly.
[*] Fastest raw single precision and special function performance - GTX 680: The program I mentioned that is 100% faster? It is basically all single precision and special functions. However, most programs are not limited only by single precision floating point throughput.
[*] Most cache, cache per active thread, and shared memory per active thread - GTX 580: I suspect that many programs seeing a large drop in performance on the GTX 680 are running into problems with the drop in cache size (or instruction scheduling, see below). The effective drop in shared memory per active thread might also hurt programs that use this space for per-thread scratch space that needs to be shared with the block.
[*] Fastest atomics - GTX 680: Atomic operation performance on the new architecture is pretty impressive.
[*] Most “predictable” performance - GTX 580: (This is fairly subjective, so you should feel free to disregard it.) I think compute capability 2.1 made a lot of CUDA developers uneasy because it made the ability to dual-issue instructions from the same warp very important to achieving full throughput. Since that is very code (and compiler) dependent, the theoretical performance numbers for compute capability 2.1 rarely lined up with actual experience, and real-world performance was in general worse than on compute capability 2.0 once you scaled by clock rate and number of CUDA cores. Kepler, with its massive number of instruction pipelines in each SMX, now really depends on dual-issue for full throughput. Compilers are getting better, but the lack of control and predictability can be annoying for the programmer. I think the backlash we saw when Kepler came out was driven by the huge mismatch between actual performance and the device parameters themselves. Clock rate * number of CUDA cores does not extrapolate between architectures.
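To make the "device parameters" point concrete, here is a rough sketch of the on-paper arithmetic. It assumes NVIDIA's published reference specs for both cards (factory-overclocked boards will differ), and the `GpuSpec` class and `ratio` helper are hypothetical names made up for this illustration. It computes only theoretical peaks, not measured performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Paper specs for one card (hypothetical helper for this post)."""
    name: str
    cuda_cores: int
    core_clock_mhz: float  # shader clock on Fermi, base clock on Kepler
    mem_bus_bits: int
    mem_rate_mtps: float   # effective GDDR5 transfer rate in MT/s
    tdp_watts: float

    def peak_sp_gflops(self) -> float:
        # Assumes one fused multiply-add (2 FLOPs) per core per clock.
        return self.cuda_cores * self.core_clock_mhz * 2 / 1000.0

    def peak_mem_gbps(self) -> float:
        # Bus width in bytes times effective transfer rate.
        return self.mem_bus_bits / 8 * self.mem_rate_mtps / 1000.0


# Assumed reference-board numbers; check your own card's spec sheet.
GTX_580 = GpuSpec("GTX 580", 512, 1544.0, 384, 4008.0, 244.0)
GTX_680 = GpuSpec("GTX 680", 1536, 1006.0, 256, 6008.0, 195.0)


def ratio(metric) -> float:
    """GTX 680 value divided by GTX 580 value for the given metric."""
    return metric(GTX_680) / metric(GTX_580)


if __name__ == "__main__":
    for gpu in (GTX_580, GTX_680):
        print(f"{gpu.name}: {gpu.peak_sp_gflops():.0f} GFLOPS SP peak, "
              f"{gpu.peak_mem_gbps():.1f} GB/s peak, {gpu.tdp_watts:.0f} W TDP")
    print(f"680/580 SP peak ratio:   {ratio(GpuSpec.peak_sp_gflops):.2f}")
    print(f"680/580 bandwidth ratio: {ratio(GpuSpec.peak_mem_gbps):.2f}")
    print(f"680/580 TDP ratio:       {ratio(lambda g: g.tdp_watts):.2f}")
```

On paper the GTX 680 comes out at nearly twice the single precision peak of the GTX 580, with essentially equal memory bandwidth and about 80% of the TDP. Compare that with the measured range above (10% slower to 100% faster) and the gap between paper numbers and real programs is obvious.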
tl;dr: If any of those GTX 680 benefits jump out as being really important for your program, I would get the GTX 680. Otherwise, I would go with a GTX 580 and save the money. That said, when the GTX 580 disappears from the retail sales channels, it won’t be a total disaster.