GTX280 vs Tesla C870

Hi all. I’m looking at the list of CUDA hardware: http://www.nvidia.com/object/cuda_learn_products.html
and wondering how performance varies across the spectrum. Are there any performance/price sweet spots at the moment? How much CUDA performance do you gain by bumping from a "consumer"-level card such as the GTX 280 to a Tesla card?
How about the FX products…do they make good CUDA cards?

I understand that the Tesla cards have no video output. If I were to get a GTX 280 as the only card in a machine, would using it to produce monitor output significantly hamper CUDA performance?

Eager to hear your thoughts/opinions.

The GTX 280 is great for CUDA, unless you need the extra memory that Tesla provides (in that case, wait for the Tesla 1000 series). All the new features in the G200 chip (double the registers, nearly double the SPs, more memory bandwidth…) make choosing any of the older cards difficult if you want top performance.

Edit: I didn’t address one of your questions. Tesla cards offer no more performance than a consumer level card. The Tesla 800 series is a re-packaged and better tested 8800 GTX with more memory. When the Tesla 1000 series comes, it will be a re-packaged and better tested GTX 280 with a lot more memory (up to 4GiB total on one card).

If you truly want to optimize performance/price: see http://forums.nvidia.com/index.php?showtopic=72769&hl=

There are no major problems with running CUDA on the display device, unless you want single kernel launches to run for more than 5 seconds. Also, any time the display is updated it will slow your CUDA app down, so if you want maximum performance, don't switch desktops or move windows around while your app is running. Fancy desktop managers like Compiz/Vista Aero can also slow CUDA apps down some.
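If you want to check whether a given card is subject to that watchdog, the CUDA runtime exposes it as a device property; a minimal sketch, assuming your toolkit version's cudaDeviceProp has the kernelExecTimeoutEnabled field:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // kernelExecTimeoutEnabled is nonzero when the OS watchdog
            // (the ~5 second limit mentioned above) applies to this device.
            printf("Device %d (%s): watchdog %s\n", dev, prop.name,
                   prop.kernelExecTimeoutEnabled ? "on" : "off");
        }
        return 0;
    }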

Many thanks, that was helpful.

Thanks from me also. That was most helpful.

I am building up a suitable system to run CUDA development and had been wondering whether it would be worthwhile to install a separate simple 2-D video card to take care of the display tasks while the main, non-graphics task was running on the GPU.

Clearly, from what you say there is no point in doing that. Other posts seem to indicate that there may have been problems in getting a separate card to work anyway!

One question though. You say:
“There are no major problems with running CUDA on the display device, unless you want to run single kernel launches for more than 5 seconds.”

Unfortunately I have no experience in CUDA yet and so don’t know whether I am likely to come up against that problem. Any comments would be most appreciated.

Thanks and best regards,
John

In CUDA, a kernel is a function that is called on a grid of threads (with potentially millions of individual threads in a single kernel call). It is where you perform calculations on the GPU.
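To make that concrete, here is a trivial, made-up example of a kernel and its launch (addOne, d_data, and n are just illustration names):

    __global__ void addOne(float *data, int n)
    {
        // One thread handles one element; the grid can span millions of threads.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;
    }

    // Host side: a single kernel call covering all n elements, 256 threads per block.
    // addOne<<<(n + 255) / 256, 256>>>(d_data, n);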

So, if you plan on doing extremely compute-intensive calculations all in one kernel that might take longer than 5 s, you'll run into the watchdog timer. Basically, the windowing system (X11 or Windows) thinks the GPU crashed and resets it. With a single GPU in the system, there is no way around the watchdog except to not run a windowing system on the GPU (feasible in Linux only). Windowing systems accessed remotely via VNC or NoMachine will not affect the watchdog timer.

To give you some sense of scale: In my application the slowest and most computationally intensive kernels run for ~30 milliseconds before finishing, but the whole simulation can run for days by repeatedly calling the kernels (it’s an iterative algorithm).
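In practice that just means a host-side loop of short launches. A sketch of the idea, as a fragment with made-up names (updateState, d_state, numSteps, numBlocks, threadsPerBlock):

    // Each launch runs for tens of milliseconds, so no single launch
    // ever comes near the 5 second watchdog limit.
    for (int step = 0; step < numSteps; ++step)
        updateState<<<numBlocks, threadsPerBlock>>>(d_state);

    // Launches in the same stream execute in order, so one synchronization
    // at the end is enough before reading results back to the host.
    cudaThreadSynchronize();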

That clears up the problem.

Thanks very much for your very helpful reply.

Regards,
John

If you are looking at a commercial installation or anything – always go for a TESLA – they are the ones officially certified for computation (NVIDIA guys, correct me if I am wrong here).

Certified for charging more… Did I say that out loud?

I have two Tesla cards in my workstation and I am looking into switching because aside from the RAM, the Tesla card doesn’t really impress me at all… aside from the cost.
I would definitely go with the 280 at that price point… unless you really need the 1.5GB

As a follow up to my recent post: I purchased a GTX 280 to try and speed up the rate-limiting portion of my code, which was taking approx 35 sec on the Tesla card… I figured I could come close to cutting the run time in half. At any rate, I am not sure how this happened, but my 35 sec code now runs in 7 sec.

I am talking about a 5x speed up just from going from the G80 Tesla to the GTX 280 at standard clock speed. That of course is only for my specific/poorly written code.

Assuming you’re running the same code (e.g., you’re not recompiling with -arch sm_13, because that could change register usage or whatever), I’d bet that it’s because of the improved behavior of nearly-coalesced accesses. Check the 2.0 programming guide for specifics. Different clocks and more SMs will play a role, of course, but your speedup is way above the theoretical maximum those alone could account for.

Most programmers rely on logic;
Few programmers rely on Magic :-)

Enjoy!

Wow, nice. Could it be that your code doesn’t coalesce memory accesses on compute 1.0 hardware, and takes advantage of the new relaxed coalescing rules on compute 1.3? That is the only thing I could think of that would make such a big difference in performance.
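To illustrate the kind of access I mean, here is a made-up kernel doing a shifted read (copyShifted and the parameter names are just for the example):

    __global__ void copyShifted(float *out, const float *in, int n, int offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i + offset];  // misaligned read when offset != 0
    }

On compute 1.0/1.1 hardware a nonzero offset breaks coalescing and each thread's read is issued as a separate transaction; on compute 1.2/1.3 the half-warp's reads are combined into a few larger transactions, so the same code can run several times faster.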

Well, doubling the number of registers also helps a lot. Add to that the fact that you have 240 SPs vs. 128…

Well, nothing was changed in the actual code. The only difference was the new card and the associated driver update. I am still using CUDA 1.1.
I am pretty sure that the extra, unaccounted-for speedup is related mostly to memory coalescing, since that particular hunk of code was written pretty early in my CUDA days and so is not really optimized in any particular manner.
That being said, I have no idea why the performance jump was so large… I am not what you would call a “gifted programmer”… or a “programmer” for that matter. I am just a guy with more money than sense who goes around dropping money on GPUs like they are going out of style.

Mouli,

When you say 35 seconds – is that for a single kernel OR the whole application OR a bunch of kernels?

If it is a single kernel - how did you manage the 5-sec watchdog timer stuff?

Best Regards,
Sarnath

The 35 sec was for multiple kernels, though that really doesn’t get to the issue that you are asking about. The 5 sec watchdog issue is only for display-attached devices. I have an 8600, which I am using to run my displays, and separate GPUs for the CUDA code.
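For reference, picking the dedicated card is just a cudaSetDevice call before any other CUDA work; a minimal sketch, assuming the display card (the 8600) enumerates as device 0 and the compute card as device 1 on your system:

    #include <cuda_runtime.h>

    int main()
    {
        // Assumption: device 0 drives the display, device 1 is the
        // dedicated CUDA GPU with no display attached (no watchdog).
        cudaSetDevice(1);

        // ... allocate memory and launch kernels here; they run on device 1.
        return 0;
    }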
Hope that answers your question.

I was told that even TESLA suffers this 5-sec limitation, because it is still the graphics driver that controls the TESLA and other CUDA-capable cards that are not attached to the display!

Thanks for the info!

Whoever told you this is wrong.

We have been over this before: http://forums.nvidia.com/index.php?showtopic=62434&hl=tesla

Well, it is my own thread out there :-)

I had this thread in mind when I was writing this reply. Somehow, it had registered in my head that TESLA would still have to put up with the graphics driver limitation – at least in Windows!

Maybe Linux can survive that… but not Windows… is that right?