I am in need of some advice on a new GPU (or multiple GPUs) for CUDA programming. The work I will be doing is based on computational fluid dynamics. At the moment I am running two GeForce 7950 GTs in SLI on my motherboard, which obviously do not support CUDA, so I am looking for a suitable upgrade. Here are the specifications of my current system:
Needless to say, I am very new to CUDA programming, but am very much looking forward to harnessing the immense power that the CUDA platform can offer.
I was considering a Quadro card for this work, but after looking at the prices and the supported options, I simply feel that it would be a huge waste of my money. Is this accurate, or is there something the Quadro cards will offer (besides OpenGL and the other usual stuff) over the GeForce cards? I don’t need any of the fancy technical support.
Is SLI worth it? My board supports three cards in SLI, but I am not sure whether going with three cheaper cards would be a better decision than going with one more expensive card.
Any recommendations anyone can offer me would be much much appreciated. Thanks for your help!
It’s hard to give specific advice, much of it depends on your budget.
But likely a GTX 460 at $200 or a GTX 480 at $500 is the right choice. They’re modern Fermi cards, so you get all the latest features.
You don’t need multiple cards… even the GTX 460 will be far more powerful than your pair of 7950s in SLI.
You can certainly get multiple cards, but you can always start with one and add another later.
Maybe this changed with the introduction of the Fermi architecture, but until then SLI did not bring any advantage to CUDA programming. SLI simply lets the cards share the rendering work by splitting each frame across the available cards (very roughly), which requires mirroring the GPU memory to all of them.
Hierarchically, CUDA programming sits a layer below SLI. You have to access every card separately and allocate each card’s memory separately… in other words, you simply have multiple cards and have to deal with the extra effort of managing them.
I have worked successfully with the GeForce GTX 285 over the last 1.5 years and gained large speed improvements. But I think I will switch to a Fermi card like the GTX 460 to benefit from the Fermi advancements (it should be about 3 times faster even without Fermi-specific instructions).
As far as budget goes, I will get what I need. At this point, here is what I am looking at:
GTX 470 for $272.50
GTX 480 for $414.00
I am not a gamer at all (gave that up a long time ago), so this card would serve double duty as my primary GPU (interface rendering) and my CUDA processor (computation). At this point, I think anything is going to be more powerful than my 7950s in SLI, simply because the 7950 does not support CUDA :).
Is there any real advantage to the GTX 480 that would warrant a $141.50 premium? It seems like quite a price jump for a few extra cores and slightly more memory, not to mention the higher power consumption, etc.
Basically, what I am taking away at this point is that SLI, as far as general-purpose computing goes, is simply not necessary. I was not sure how the SLI layer fit into CUDA (whether it simply added more cores, did nothing, etc.). I suppose I will worry about SLI later, if I need to.
That’s basically it. More cores, higher clock rate, more memory, more memory bandwidth. Since you are getting started with CUDA, going with the GTX 470 is probably a better choice.
Note that a CUDA kernel has total control of the GPU while it is running. This means that for the duration of the call, your display will not update, and if the kernel runs for more than a few seconds, the driver will abort it. To work around this, you either have to turn off the GUI (not possible in Windows) or install a cheap second card to drive your primary display, freeing up the nice card for computation. A separate display card is also required if you want to use either of the debuggers (cuda-gdb for Linux or Parallel Nsight for Windows). The easiest choice for a display card is a very cheap GeForce 8-series or later card.
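If you want to check whether that watchdog (the “run time limit”) is active on a particular card, you can query it through the runtime API. A minimal sketch (compile with nvcc; it just prints the flag for each device):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // kernelExecTimeoutEnabled is 1 when the display watchdog applies,
        // i.e. long-running kernels on this device will be aborted.
        printf("Device %d (%s): run time limit %s\n",
               dev, prop.name,
               prop.kernelExecTimeoutEnabled ? "ENABLED" : "disabled");
    }
    return 0;
}
```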
Ah, I wasn’t aware that CUDA took over the card like that. So, if I understand you correctly, the only way to let a kernel run for longer than a few seconds is to have a secondary (non-CUDA) card rendering the interface? That way, the CUDA card has nothing plugged into it and is used only for computations. This would also allow the debuggers to be run, as you mentioned, and leave the interface running for Windows. Is my understanding correct?
That raises the question, then: suppose you had three open PCI-Express 2.0 x16 slots and two 24" LCD monitors (as I do). How would you fill each slot, and with what, knowing the type of work I wish to do?
On a side note, is programming for two GPUs any more difficult than programming for one? I mean, if I have to run two GPUs (one for CUDA, one for the interface), I might as well run three (two for CUDA, one for the interface)? I suppose a Tesla card is what I really need… but they seem to be sold only as part of a workstation or a rack-mounted server.
Perhaps this belongs more in the “programming” section, but is CUDA code typically pretty scalable? I would hate to set up my code for one GPU and then need to rewrite it because I put in a second GPU.
Thanks again for all of your advice. It is really flattening my learning curve (with the hardware, at least). I really appreciate it.
Correct. In practice, this limitation is not a significant obstacle for running code, as many CUDA programs make large numbers of short (tens or hundreds of milliseconds) kernel calls. However, once the kernel calls hit hundreds of milliseconds each, you will probably notice the GUI start to become annoyingly jerky. Debugging, however, is a more significant issue since “device emulation” (which was a big misnomer) has been removed from CUDA.
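If you want to see how close your kernels get to that range, CUDA events are the usual way to time them. A rough sketch; myKernel and its data are just placeholders for your own code:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel -- substitute your own.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel took %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```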
I would find the cheapest GeForce card with two DVI ports (looks like a GeForce 9500 GT for $55 on Newegg does the trick) and get one GTX 470. If CUDA works out well for your application, add a second GTX 470 for multi-GPU CUDA.
Tesla cards work basically the same as GeForce cards from the CUDA perspective. (Rackmount Teslas are four Tesla cards in a 1U enclosure with power, cooling, and a PCI-Express interface that you can plug into an adjacent server.) There are some capability differences between the two, but they are not as significant as people assume.
Multi-GPU programming is fairly simple. You spawn a CPU thread for each CUDA device you want to use, and call cudaSetDevice() with a different value in each thread. Then each CPU (“host”) thread is free to configure and use the CUDA device it is bound to. You have to manually partition your task between the cards, so it is not as automatic as SLI is for 3D rendering.
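A minimal sketch of that pattern, with one host thread per device (the kernel is a stand-in for real work, and std::thread is used here just for brevity):

```cpp
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel -- stands in for your real computation.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Each host thread binds itself to one device and processes its own partition.
void worker(int device, int n) {
    cudaSetDevice(device);              // bind this thread to one GPU
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("device %d finished its partition\n", device);
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    std::vector<std::thread> threads;
    for (int dev = 0; dev < count; ++dev)
        threads.emplace_back(worker, dev, 1 << 20);
    for (auto &t : threads)
        t.join();
    return 0;
}
```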
Due to the need to manually operate each CUDA device from a separate host thread, scaling from one to two devices tends to be a lot of work for most programs, unless they were written with multiple devices in mind. Scaling from 2 to N is pretty easy after that, assuming you have enough work to keep N cards busy in parallel. The actual scalability depends on your task and how much data you need to exchange between cards and the host. There is no direct GPU-to-GPU communication, so data exchange between cards has to be bounced through the host memory.
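To make the “bounce through the host” step concrete, here is a rough sketch. It assumes two CUDA devices are installed, and it switches devices from a single host thread with cudaSetDevice(), which newer CUDA runtimes allow; in the thread-per-device model above, each thread would instead do its half of the exchange against a shared host buffer:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host staging buffer (pinned memory makes the two copies faster).
    float *h_buf;
    cudaMallocHost(&h_buf, bytes);

    // Source buffer on device 0.
    cudaSetDevice(0);
    float *d_src;
    cudaMalloc(&d_src, bytes);

    // Destination buffer on device 1.
    cudaSetDevice(1);
    float *d_dst;
    cudaMalloc(&d_dst, bytes);

    // "GPU-to-GPU" transfer, staged through host memory.
    cudaSetDevice(0);
    cudaMemcpy(h_buf, d_src, bytes, cudaMemcpyDeviceToHost);
    cudaSetDevice(1);
    cudaMemcpy(d_dst, h_buf, bytes, cudaMemcpyHostToDevice);

    printf("copied %zu bytes from device 0 to device 1 via the host\n", bytes);

    cudaSetDevice(0); cudaFree(d_src);
    cudaSetDevice(1); cudaFree(d_dst);
    cudaFreeHost(h_buf);
    return 0;
}
```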
The third issue is related to Flash. When watching Hulu, if other windows are moving around (Aero) or menus are appearing/disappearing, etc., Flash Player drops frames. This didn’t happen on my other GPUs. I am not sure if it’s a Quadro issue or what. Any ideas?