I’m a currently enrolled freshman undergraduate student looking at purchasing a CUDA enabled card.

I’m trying to keep the price down, so I think I’ll have to stick with the GeForce series (unless there are special promotions for being a student?). The GTX690 seems to be out of my price range too, but aside from those limitations, which card would you recommend for just CUDA performance vs. price (not gaming)?

I’m also thinking about waiting for the 700 series, which hopefully has a newer architecture, but would like to purchase one sooner than later. If you really have a compelling argument for waiting, please explain!

I know this probably isn’t the correct thread, but I couldn’t find a more specific one, so I decided this would be best because of “performance”.

If you are just starting out with CUDA, I would recommend just getting a ‘top range’ single-GPU card. We use GTX 580s primarily here, but have trialled the GTX 680s with some success (the 680s have some parts ‘crippled for compute’, whereas the 580s were the full beans). One of the important things is to try to determine the memory needs of your software, and make sure to get a card able to provide that.

A dual-GPU card (such as the 690) is probably not worth the hassle for studying, as there is extra overhead associated with handling them. Similarly, multiple cards require extra work, so I would not recommend that at first either, unless you particularly need it.

If you are not looking for the real top of the range, then a 660 Ti or 670 will give you most of the performance at a noticeably reduced cost.

The good thing with CUDA is that, for starters, any card will do. I started with a GeForce 8500, one of the lowest-end cards supporting CUDA.
However, if you plan to do something other than just studying the CUDA language, I would go with Tiomat’s advice: if you can afford one, get a GTX 580. That’s the one I have now, and 3 GB of GDDR5 coupled with 512 CUDA cores, a maximum grid of 65535x65535x65535, up to 1024 threads per block, and 48 KB of shared memory are more than enough for testing and studying.
After you get the hang of it, maybe you can get access to some Tesla cards, maybe even a dual-GPU system; then it would be a lot easier.
If in doubt, set a price limit, take a look at the top cards you can actually buy within it, and see which would be best for CUDA (I was torn between the 680 and the 580, and thank god I chose the 580).
Also, don’t forget that almost all high-end GPUs need a good, strong power supply (I had to replace mine when installing the GTX 580, which increased the cost).

For just starting out, I would strongly suggest a newer GeForce (500 or 600 series) within your budget. You want a recent compute capability, if only to be able to experiment with more recent language features like shared memory atomic operations and the L1/L2 cache. At the very low end, NVIDIA model numbering does not always indicate the compute capability, so be sure to take a look at this page to figure out what is actually inside the card (the codename column):
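If it helps, here is a rough C++ lookup of the chip codenames you’ll see in that column, mapped to compute capability (values are from NVIDIA’s public CUDA GPUs list as I remember them; worth double-checking for low-end parts, which often reuse older silicon):

```cpp
#include <map>
#include <string>

// Chip codename -> compute capability (major.minor encoded as major*10+minor).
// Note: low-end 600-series cards sometimes ship with rebadged Fermi chips,
// which is exactly why the codename column matters more than the model number.
inline std::map<std::string, int> computeCapability() {
    return {
        {"G80",   10},  // GeForce 8800 GTX era
        {"G92",   11},
        {"GT200", 13},  // GTX 280
        {"GF100", 20},  // GTX 480
        {"GF110", 20},  // GTX 580
        {"GF104", 21},  // GTX 460
        {"GF114", 21},  // GTX 560 Ti
        {"GK107", 30},  // low-end Kepler
        {"GK104", 30},  // GTX 680
    };
}
```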

Once you have a better feel for the CUDA language and how to solve problems, then you can more accurately decide what kind of high end card you want to purchase.

Looking at the 680 vs. the 580, I cannot find a reason to purchase the 580 over the 680. If someone could clear up why the 580 is “better”, that would be great.

For starters, the 680 has 3 times as many CUDA cores as the 580, the standard clock is about 230 MHz higher, and it uses less power. I see that the “boost speed” for the 580 is higher, but the fact that the 680 is running 3 times as many CUDA cores seems to make up for this. Again, any explanation would be greatly appreciated.

This is how NVIDIA prefers to handle the supply chain for their GeForce cards. They design the chips, have TSMC fabricate them, and sell the chips to their card manufacturing partners (EVGA, ASUS, PNY, etc.) along with a reference card design. The manufacturing partners then handle any changes to the reference design (usually just for power management, cooling, or factory overclocking), card manufacturing, packaging, and sales. NVIDIA doesn’t have to deal with sales, RMAs, or end-user hardware support.

For the professional lines, like Quadro or Tesla, the manufacturing process is basically the same, with a 3rd party making the cards. However, they use the NVIDIA brand on the cards and exert more control over card design and testing. (Ex: I seem to recall one of the Tesla models was built by PNY for NVIDIA some years ago.)

Hmm… I don’t think NVIDIA sacrificed the CUDA capabilities; it’s probably just that the Kepler architecture is quite different compared to Fermi. Compare the SM in Fermi with the SMX in Kepler.

The 580 has 16 SMs with 32 cores each.
The 680 has 8 SMXs with 192 cores each.

This means that, from a CUDA programming perspective, you may need to modify the thread configuration (gridDim, blockDim) in your CUDA code when you move from Fermi to Kepler in order to get maximum performance. I think this is why some existing CUDA applications run slower on Kepler.
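One thing that stays the same across architectures is the usual host-side ceiling-division helper for picking a grid size, so retuning for Kepler is mostly a matter of changing the block size you pass in. A minimal sketch (the helper name is mine, not from anyone’s actual code):

```cpp
// Host-side helper: pick a grid size that covers n elements for a given
// block size, via ceiling division. The same kernel then runs on Fermi
// and Kepler; only blockSize needs retuning (e.g. 192 or 256 threads may
// map better onto Kepler's 192-core SMX than the 128 you tuned on Fermi).
inline unsigned gridSize(unsigned n, unsigned blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```

You would then launch as `kernel<<<gridSize(n, block), block>>>(...)` and sweep `block` on each architecture.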

So I prefer the 680; at least I can play around with some of the new features in CUDA 5 which are only available on Kepler :D

The 580 and the 680 cards have different tradeoffs due to a different balance of resources. For the GTX 580, the peak floating point throughput is proportional to the shader clock (1544 MHz) times the number of CUDA cores (512). The “core clock” on a Fermi GPU is really not important for anything. On the Kepler GPU used in the GTX 680, they improved power efficiency by eliminating the fast shader clock and running everything on a much slower core clock, but then increased the number of cores by a lot. The net result is that raw floating point throughput on the GTX 680 should be twice as fast as the GTX 580.

But, as you will soon discover, CUDA performance is often not limited by raw floating point throughput. Other things can become the bottleneck, like device memory bandwidth. The GTX 580 has almost exactly the same bandwidth as the 680, so for purely memory bound applications, they should be equivalent.

But… floating point and memory bandwidth are not the whole story either. Many other chip resources on the GTX 680 were not scaled up proportionally with the massive increase in CUDA cores, so things like registers, shared memory, L1 and L2 cache per thread have gone down compared to the GTX 580. But atomic memory instruction performance has gone way up on the GTX 680…
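To put rough numbers on that per-core squeeze, assuming the commonly quoted register file sizes (32K 32-bit registers per Fermi SM, 64K per Kepler SMX):

```cpp
// Back-of-envelope: on-chip registers per CUDA core, Fermi vs Kepler.
// Register file sizes are per SM/SMX; these are published figures but
// worth verifying against the programming guide for your exact chip.
constexpr int fermiRegsPerCore  = 32768 / 32;   // GF110: 1024 registers/core
constexpr int keplerRegsPerCore = 65536 / 192;  // GK104: ~341 registers/core
```

So the register file doubled while the core count per SM(X) went up 6x, which is roughly a 3x drop per core; shared memory stayed at 48 KB per SM(X), so a similar squeeze applies there.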

So, the short version is that the 580 vs the 680 is complicated. Floating point dominated calculations go WAY faster, applications that depend heavily on non-FPU resources have gotten a little slower, and memory bandwidth dominated calculations are about the same speed. I have applications in all three of these categories, so I keep both a GTX 580 and a GTX 680 around.

That said, the Kepler architecture is the future, so there is also value in learning to live with the new performance tradeoffs.

After taking a deeper look at the compute ability of both the 680 and the 580, it appears that the 680 is not as good as the 580 for double-precision calculations, although, from my understanding, it is better at single-precision. Correct me if I’m wrong.

I was able to find better specs on the exact SP and DP throughput.

NVIDIA GTX 680: 3090 GFLOPS SP and 129 GFLOPS DP
I can’t seem to find the exact specs for the 580…

Back to deciding… I’m going to be using the card for research in CS + neuroscience, so basically AI - working with neural nets and other things. My point is, I’m not sure if I NEED to do these calculations in DP… I could do them in SP. This may put the 680 on top… but I don’t have the 580 specs to make a good decision.

As mteguhsat says, if you want to play with the new Kepler features, then by all means go with the 680.
I went with the 580 because I’m actually doing some GA stuff, mostly with floats and the occasional bit of DP, but it’s basically heavy memory bandwidth work. Since the two cards would be almost the same there, I chose the 580 to keep some performance IF the need for DP comes up.
If I really need some heavy stuff, I can always pay a few dollars and use an Amazon instance with the Tesla C2070 (if I’m not mistaken). Of course it wouldn’t be top performance since it’s virtualized, but it is a dual-GPU system ;).

Ah, right. I generally avoid double precision in CUDA, so I forgot about the double precision performance issue. When I said “floating point” in my post above, I should have said “single precision floating point.”

The single precision throughput of the GTX 580 is 1544 MHz * 512 * 2 [a fused multiply-add is two FLOPs in one instruction] = 1.58 TFLOPS. The double precision throughput is 1/8 of that, or 198 GFLOPS. The GTX 680 gives 1006 MHz * 1536 * 2 = 3.09 TFLOPS of single precision, and double precision is 1/24 of that, so 129 GFLOPS.
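The same arithmetic, written out as a checkable snippet (just the formula from the paragraph above, nothing more):

```cpp
// Peak throughput = clock * cores * 2 (FMA counts as 2 FLOPs/instruction).
constexpr double gtx580_sp = 1544e6 * 512 * 2;   // ~1.58 TFLOPS
constexpr double gtx580_dp = gtx580_sp / 8;      // ~198 GFLOPS (1/8 SP rate)
constexpr double gtx680_sp = 1006e6 * 1536 * 2;  // ~3.09 TFLOPS
constexpr double gtx680_dp = gtx680_sp / 24;     // ~129 GFLOPS (1/24 SP rate)
```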

So if you are limited by double precision throughput, then the GTX 680 will give 2/3 the performance of the GTX 580. But be careful: unless you are doing a lot of arithmetic operations per double read from memory, your kernel will actually be memory bandwidth bound, and neither the GTX 580 nor the 680 will achieve its maximum DP throughput.
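A back-of-envelope way to see where that crossover sits, assuming roughly 192 GB/s of device memory bandwidth on both cards (the published figures are within a GB/s of each other):

```cpp
// A kernel doing fewer FLOPs per byte than (peak FLOPS / bandwidth) is
// bandwidth bound. Doubles are 8 bytes, so per double loaded:
constexpr double bandwidth = 192e9;              // bytes/s, approximate
constexpr double dp580 = 198e9, dp680 = 129e9;   // peak DP FLOPS from above
constexpr double flopsPerDouble580 = dp580 / (bandwidth / 8);  // ~8.3
constexpr double flopsPerDouble680 = dp680 / (bandwidth / 8);  // ~5.4
```

In other words, below roughly 8 DP operations per double streamed from memory, both cards are just waiting on the memory bus and perform about the same.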

After doing some more research and thinking about actual implementations of calculations for my research, I think the 680 will be my GPGPU of choice. This is for a number of reasons.

More CUDA cores. More raw compute power. This is a biggy.

Double Precision isn’t required.

In the case that more precision is required, I can always implement my own data type or combine multiple existing types to get a similar result. In fact, I’m honestly not sure why people are complaining about the 680’s DP performance. The solution is to use two single precision numbers to closely represent a double precision number. A calculation on a DP number emulated with two SP numbers takes roughly double the operations (a little more, depending on the number), triple at most, compared to doing the same thing in SP. The result is being able to compute ~DP, in a worst-case scenario, at 3.09 TFLOPS / 3 + 129 GFLOPS ≈ 1.159 TFLOPS… Worst case. Not bad. At all.
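The two-floats idea is the “double-single” (float-float) technique. Its basic building block is Knuth’s error-free two-sum; a minimal sketch (real double-single libraries build addition, multiplication, and division on top of this, and the 2-3x cost estimate above is optimistic for the latter two):

```cpp
// Knuth's two-sum: after the call, s + err exactly equals a + b, with err
// capturing the rounding error lost in the float addition. This pair of
// floats is the representation double-single arithmetic works with.
// Note: compile without -ffast-math, which would optimize the error
// terms away.
inline void twoSum(float a, float b, float &s, float &err) {
    s = a + b;
    float bv = s - a;
    float av = s - bv;
    err = (a - av) + (b - bv);
}
```

For example, `twoSum(1.0f, 1e-8f, s, e)` yields `s == 1.0f` and `e == 1e-8f`: the pair still represents 1 + 1e-8 even though a plain float sum would lose the small term entirely.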

Oh, and one other advantage to the GTX 680 is a big increase in the performance of the hardware-accelerated special functions, like reciprocals, reciprocal square root, __cosf(), __sinf(), etc. The overall throughput is about 2.5x faster than the GTX 580.

If you just want to develop, and price is the critical factor, then any GTX 400, 500, or 600 series card would be fine. Note that the GTX 600 series has 192 single / 8 double ops per SMX vs. the 32 single / 4 double ops per SM in Fermi (the GTX 560 Ti has 48 single / 4 double per SM).

The Tesla and Quadro parts have 1:2 or 1:3 double-to-single ratios, but you don’t need those parts to debug your double precision code, just to do performance tuning. Also, you generally do far more single precision work in support of each double operation, so don’t let that bother you if you are interested in double precision.

One thing of bigger concern with the GK107/GK104 chips is the large increase in CUDA cores per SMX without a proportional increase in registers (vs. Fermi SMs), which causes more register pressure / occupancy issues when optimizing. They solve that in the GK110 chip, which is rumored to power some of the GTX 700 series (other rumors say the GK114 chip will power them - both rumors may be true).