Need help choosing between the GTX 295 and the GTX 480 for massive Lattice Boltzmann simulations

Hi guys,

I'm new to the forum. I'm having trouble choosing between the GTX 295 and the GTX 480 (the GTX 580 is too expensive to buy several units). I plan to put two cards in one machine to run Lattice Boltzmann simulations for fluid dynamics.
I had decided to buy two GTX 295s for the raw power (1.788 TFLOPS each), but after reading some threads I see people claiming the GTX 480 is superior, though almost always for gaming. What about simulation using CUDA? I was firmly convinced that the GTX 295 had the lead. Does the new Fermi architecture deliver more processing power even with fewer single-precision GFLOPS? On several online benchmarking sites I've seen the GTX 295 beat the GTX 480 in almost every respect!

I just want to avoid picking the GTX 480 if the GTX 295 is still superior.
Could you please tell me in which respects the GTX 480 is better than the GTX 295, and vice versa?
One last question: I intend to use the GPUs only for HPC, nothing else (I’m not a gamer!). If I choose the GTX 295, will it soon be outdated or unsupported because of the new Fermi instruction set? Which should I choose?

Thanks a lot, guys. Please consider helping me out with this.

The first thing to keep in mind is that the GTX295 is really a pair of discrete GPUs sharing a PCI-e bus. Per-GPU performance is considerably inferior to a Fermi-based card. Because CUDA doesn’t have any sort of automatic multi-GPU capability, using more than one GPU requires you to write your own code to spread your algorithms over multiple GPUs. That isn’t necessarily all that easy, and the performance you can achieve depends heavily on how well your algorithms can be split over multiple cards and how sensitive they are to the bandwidth and latency constraints of the PCI-e bus.
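To get a feel for how sensitive a split lattice might be to the PCI-e bus, here is a rough sketch. It assumes things that are not stated in this thread: a D3Q19 lattice cut into slabs along one axis, single-precision storage, a 256 x 256 cross-section, and a placeholder bus rate of 5 GB/s. Treat the numbers as illustrative only.

```python
# Rough estimate of halo traffic when a 3D lattice is split into slabs
# across GPUs. Assumptions (not from the thread): D3Q19 model, slab
# decomposition along x, 4-byte floats, illustrative bus bandwidth.

def halo_bytes_per_step(ny, nz, crossing_populations=5, bytes_per_value=4):
    """Bytes one GPU sends across ONE slab face per time step.

    In D3Q19, 5 of the 19 populations have a positive x-velocity
    component, so only those cross a face normal to x.
    """
    return ny * nz * crossing_populations * bytes_per_value


def halo_time_ms(ny, nz, bus_gb_per_s, hops=2):
    """Transfer time in ms; hops=2 models device -> host -> device."""
    nbytes = halo_bytes_per_step(ny, nz)
    return hops * nbytes / (bus_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    face = halo_bytes_per_step(256, 256)
    print(f"{face / 1e6:.2f} MB per face per step")
    print(f"{halo_time_ms(256, 256, bus_gb_per_s=5.0):.3f} ms per face per step")
```

Comparing that transfer time with the time one GPU needs to update its slab gives a first hint of whether the bus or the kernel would dominate.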

In my experience, Fermi is a superior HPC architecture in just about every respect compared to the GT200. I certainly wouldn’t contemplate buying any more GT200 GPUs to use in my applications (mostly linear algebra problems stemming from discretization of non-linear PDEs).

Thanks a lot for the quick reply. So, as expected, the architecture puts the GTX 480 on a higher level; it’s definitely the 480. To get more bandwidth I’ve been looking at the dual-Xeon EVGA SR-2, which would also let me keep running nonlinear PDE solvers in COMSOL efficiently on two Xeon processors. Is CPU-GPU transfer significantly faster on a dual Xeon than on an i7?

I think the dual-processor setup is fully supported, since the code is compiled with gcc or Visual Studio. Does the host-device communication layer handle the dual CPU automatically? In other words, is it harder to write CUDA code on a dual-CPU machine, or is it supported naturally?

great thanks

That board doesn’t have any more bandwidth than a standard X58 board, just more slots (and if you fill them all with GPUs, the per-GPU bandwidth is lower than on a standard board with a pair of cards).

No, there is no automatic multithreading or multi-GPU support in CUDA. The APIs themselves are thread-safe, but you have to do all the multithreaded programming explicitly yourself.
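As an illustration of what "explicitly yourself" can look like, here is a minimal sketch of the common one-host-thread-per-GPU pattern. It makes no real GPU calls: `run_on_device` and `split_into_slabs` are hypothetical names, and the per-device work is only a stand-in for selecting a device, copying a slab in, launching kernels, and copying results out.

```python
# Pattern sketch: one host thread per GPU, each owning one slab of the
# lattice. No real GPU calls are made; run_on_device is a hypothetical
# stand-in for "select device, copy slab in, launch kernels, copy out".
import threading


def split_into_slabs(nx, num_devices):
    """Return (start, stop) x-ranges, spreading any remainder evenly."""
    base, extra = divmod(nx, num_devices)
    slabs, start = [], 0
    for d in range(num_devices):
        stop = start + base + (1 if d < extra else 0)
        slabs.append((start, stop))
        start = stop
    return slabs


def run_on_device(device_id, slab, results):
    # Stand-in for the real per-device work; here we only record
    # how many x-planes this device would own.
    start, stop = slab
    results[device_id] = stop - start


def run_all(nx, num_devices):
    results = [None] * num_devices
    threads = [
        threading.Thread(target=run_on_device, args=(d, slab, results))
        for d, slab in enumerate(split_into_slabs(nx, num_devices))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # explicit synchronisation is the programmer's job
    return results


if __name__ == "__main__":
    print(run_all(nx=130, num_devices=4))
```

The point of the sketch is only the structure: the decomposition, the thread per device, and the join are all code you write and maintain yourself.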

The GTX480 gave me only a ~40% performance boost over the GTX280. This is to be expected: although Fermi has twice the compute power,
it improved bandwidth by only ~40% over the GTX280. Since my production code is bandwidth-bound rather than compute-bound,
I only saw a 40% performance boost, hardly worth the price difference between the cards.
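The reasoning above can be sketched with a simple roofline-style estimate. The two "cards" below are normalised, illustrative values (2x compute, 1.4x bandwidth), not real specifications, and the function names are my own.

```python
# Simple roofline-style estimate: attainable rate is capped either by
# arithmetic throughput or by memory bandwidth times the kernel's
# arithmetic intensity (flops per byte moved).

def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def predicted_speedup(old, new, flops_per_byte):
    """old/new are (peak_gflops, bandwidth_gb_s) tuples."""
    return (attainable_gflops(*new, flops_per_byte)
            / attainable_gflops(*old, flops_per_byte))


if __name__ == "__main__":
    # Normalised, illustrative cards: the new one has 2x the compute
    # and 1.4x the bandwidth of the old one.
    old_card = (100.0, 100.0)
    new_card = (200.0, 140.0)
    # Low arithmetic intensity: both cards are limited by bandwidth.
    print(predicted_speedup(old_card, new_card, flops_per_byte=0.25))
    # High arithmetic intensity: both cards are limited by compute.
    print(predicted_speedup(old_card, new_card, flops_per_byte=4.0))
```

Under this model a kernel that is bandwidth-bound on both cards speeds up by the bandwidth ratio, while one that is compute-bound on both speeds up by the compute ratio.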

I think that each user should first try out their code on a small test case, i.e., buy a GTX280/GTX295 and a GTX480 and see which
is faster. Although avid is correct that the GTX295 shares the PCI-e bus and that writing multi-GPU code is not trivial, you
will probably see that the GTX295 is faster than the GTX480. As for writing multi-GPU code, you’ll probably have to do it anyway
for any type of card you choose, since you want to pack as many cards as possible into a host machine (2, 4, 8…) to get the best performance/$.

One thing to note about the GT200 generation, and the GTX295 in particular: I am not sure they are still being made, or that you can even buy
a new GTX295 nowadays, or how long you will be able to find new cards/parts if you decide to extend your initial GPU farm.


It’s better to get a GTX480, just for versatility… more memory and more capabilities. It may NOT be faster.

I have code which is roughly 15% faster on the GTX295 than the GTX480 (mostly due to the GTX295’s better aggregate memory bandwidth). The GTX480 is still a better card and more future proof.

I’m certainly not a hardware expert, so if you choose to consider my advice, please do seek a second opinion.

I think the 480 is NOT the optimal GPU at the moment. If I were you, I would consider either the 470 or the 580. Here are the guidelines that have stuck in my head:

  • For the price-performance of a single/dual GPU system you want the 470.

  • If you want to build a multi-GPU box, then 580 might be better from the price-performance standpoint. Indeed, for each 3-4 GPUs you need to build yet another box (motherboard, CPU, power supply, etc), which of course costs money, probably somewhere under $1K, plus maintenance headache. Therefore, you might choose to spend more money on faster GPUs and end up having fewer boxes. So, even though 580 per se is less cost-efficient than 470, the complete solution based on multiple 580s might turn out to be more cost-efficient.

  • Another consideration, applicable mostly to multi-GPU setups, is power consumption. As far as I’ve heard, the 580 is more energy efficient (NEED TO CONFIRM) than the 470/480. This could matter for cooling, for the load on your electrical circuit, and for your electricity bill.

Best of luck and tell us how it went!

The 580 has approximately the same power consumption (~250W) as the GTX480 but with 25% more compute power.

The 570, to be released today, has the same power envelope as the GTX470 (~220W) but performs at the level of a GTX480, which is roughly 25% above the 470.


Agree with ceearem. The GTX 5XX series is much more power efficient. I guess GTX570 would be the best option - you get performance of GTX 480 with power consumption even lower than GTX470 and it’s much cheaper ($350 in US).

Dear Quantum,
Your question is so fundamental that I’m afraid you are going to spend a few k$ without having much experience in GPU computing.
Obvious questions: do you already have software that handles multi-GPU LBM simulations efficiently, or do you hope to write it yourself? If not, buy 2 cheap GTXes and wait for new hardware while you develop the software.
Have you performed any real numerical simulations with GPUs, or do you just take your knowledge from “the press”?

My experience says that the single factor that determines computational efficiency is the memory bandwidth and how efficiently it is used.
Thus, even if my GTX480 is capable of about 100GFLOP/s in double precision, I’m pleased to see it achieve 20GFLOP/s, as this is 75% of
the number derived from peak memory bandwidth considerations. My Fermi is faster than my 6-core AMD CPU by about the same factor as its memory system is.
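For readers who want the arithmetic behind those figures spelled out, here is a small worked example. It uses only the numbers quoted in this post (20 GFLOP/s achieved, 75% of the bandwidth-derived limit, roughly 100 GFLOP/s double-precision peak); the variable names are mine.

```python
# Worked arithmetic for the figures quoted in the post.
achieved_gflops = 20.0        # measured rate quoted in the post
fraction_of_bound = 0.75      # share of the bandwidth-derived limit
peak_dp_gflops = 100.0        # approximate double-precision peak quoted

# Limit implied by memory bandwidth for this particular kernel.
bandwidth_bound_gflops = achieved_gflops / fraction_of_bound
# How much of the arithmetic peak is actually being used.
share_of_compute_peak = achieved_gflops / peak_dp_gflops

print(f"bandwidth-derived limit: {bandwidth_bound_gflops:.1f} GFLOP/s")
print(f"share of arithmetic peak: {share_of_compute_peak:.0%}")
```

So the kernel sits close to what the memory system allows while using only a fifth of the arithmetic peak, which is the signature of a bandwidth-limited code.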

Thus, my advice is:

  1. Try to find out whether your problem is bandwidth- or computation-limited. The answer should come from experiment rather than theory, though :-(
    Getting full bandwidth utilisation on GPUs may not be that obvious.
  2. Then choose the card that will have the required bandwidth or GFLOPS characteristics.
    2a) Remember that many Fermi-based cards out there on the market do not meet the specification found on the NVidia website.
    For example, I was stupid enough to buy an expensive Fermi GTX 480 card only to see that its bandwidth is only 120GB/s, whereas NVidia claims it should be capable of 177.4 GB/s.
    2b) Don’t buy old cards for GPGPU purposes. New architecture is more flexible; more shared memory, more registers, more instructions, more ways to play tricks.
    2c) My experience with the GTX 285 and GTX 480, both with the same measured bandwidth of 120GB/s, says that Fermi is about 50% faster in my linear-algebra applications. And I managed to speed it up by a further 50% by rewriting the kernel to take advantage of the fact that Fermi has far more shared memory!
    2d) The GTX 480 at full load is far noisier than the GTX 285 :-(
    2e) A year ago my friend bought GTX 295 with GPGPU in mind; only very recently did he start to use it as a multi-GPU system - it’s not that easy!
  3. For a multi-GPU system, the next most important factor is the motherboard and how it handles the PCI-E bus(es). When I got my first CUDA card and tested the PCI-E bandwidth, I obtained only 0.5 GB/s, horror!
    Now I have 3 GB/s. In theory PCI-E x16 gen. 2 is capable of 8 GB/s in each direction. Remember that you’ll have to transfer your data twice: to and from the CPU! I guess this will be the narrowest bottleneck, although with a lot of cleverness and effort it is (allegedly, I have never tried it myself) possible to compute and transfer data simultaneously.
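To put the PCI-E figures from point 3 in perspective, here is a small sketch of the round-trip cost at each rate. The 512 MiB payload is a hypothetical example size, not something from this thread, and the model ignores latency and any overlap of transfer with computation.

```python
# Time to move a data set to the GPU and back at different effective
# PCI-E rates. The 512 MiB payload is a hypothetical example size.

def round_trip_seconds(nbytes, gb_per_s):
    """Host -> device plus device -> host at the same rate."""
    return 2 * nbytes / (gb_per_s * 1e9)


if __name__ == "__main__":
    payload = 512 * 1024**2  # hypothetical: 512 MiB of lattice data
    rates = [("first test", 0.5), ("current", 3.0), ("x16 gen 2 theory", 8.0)]
    for label, rate in rates:
        print(f"{label:>18}: {round_trip_seconds(payload, rate):.3f} s")
```

If the whole data set had to cross the bus every time step, these times would dwarf the kernel time, which is why keeping data resident on the card and moving only what is needed matters so much.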

In summary, the only choice is between the GTX 580 (192 GB/s) and the GTX 570 :-) plus the best motherboard with true PCI-E x16 gen 2 support in all slots. Also consider the total memory each card should have, and don’t forget to consult your electrician about your power supply :-).

Straight linear algebra kernels can get close to 150 GB/s on a GTX 480 with appropriate tuning. While this is short of the advertised 177 GB/s, it’s still considerably better than the 120 GB/s you claim. It’s possible to get 120 GB/s on a GTX 280. Reaching about 85% of the peak bandwidth is typical across platforms; I doubt you’ll see much over 90% bandwidth utilization on an Intel architecture.
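As a quick check of those percentages, the sketch below computes bandwidth utilisation for the GTX 480 figures quoted in this thread (120 and 150 GB/s achieved against the advertised 177.4 GB/s). The function name is my own.

```python
# Fraction of advertised bandwidth actually achieved, using the
# GTX 480 figures quoted in this thread.

def utilisation(achieved_gb_s, advertised_gb_s):
    return achieved_gb_s / advertised_gb_s


if __name__ == "__main__":
    advertised = 177.4
    for achieved in (120.0, 150.0):
        print(f"{achieved:.0f} GB/s -> {utilisation(achieved, advertised):.1%}")
```

The tuned figure lands at roughly 85% of the advertised rate, in line with the "typical" utilisation mentioned above, while 120 GB/s is closer to two thirds.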

The burden with multi-GPU is that you have to consider another parallel programming paradigm as well as CUDA - either threads (e.g., OpenMP) or message passing (MPI). This isn’t

On decent motherboards (Intel X58 chipset) I’ve typically got 5.5 - 6 GB/s over the PCIe bus in either direction. There are good reasons why one cannot exceed this on PCIe v2.0 (which will be rectified in v3.0).