GTX 480 / 470 Double Precision Reduced?

The real question is: “If I want to do heavy double precision computation on the cheap, using a standard graphics card, is a GTX 480/470 my best choice?”

How much did they reduce the double precision performance?

Does this fall below that of competing products?

Roughly 1/4 of a Tesla:

[attachment=16631:cuda_z.png]

I think NVIDIA could be generous after all the delays and suffering and enable full DP throughput on GeForce.

Okay, that means that applications targeted at consumer hardware, such as physics APIs in games, will have to find a reasonable balance between single and double precision computations: using DP only where it is really unavoidable, and single precision otherwise. Not really any different from code targeted at the GTX 260 and better.

Actually, I can’t think of any scenario where you would really want DP in game physics. Who cares whether the numerical errors of an already very crude physics model are a bit smaller in a game, where physics is only a gimmick? You would use relatively simple physics models anyway. I doubt, for example, that anyone would use a real TIP4P molecular dynamics simulation for water in a game (considering that you usually only simulate cubes of a few nm ;-).

In fact, this reduced DP performance might have one good side effect: it gives you one more reason to support single precision in your code, which makes it easier to write software that still runs well on GT200 hardware, which many people, companies, and universities still own.
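As a rough illustration of that “DP only where unavoidable” approach, here is a minimal sketch of a precision-agnostic kernel, templated on the floating-point type. The kernel and names are made up for illustration, not taken from any real physics API:

```cuda
// Minimal sketch (illustrative only): the same kernel compiles for float and
// double, so the precision decision becomes a per-call-site choice instead of
// a rewrite.
#include <cuda_runtime.h>

template <typename Real>
__global__ void integrate_positions(Real *pos, const Real *vel, Real dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pos[i] += vel[i] * dt;   // simple Euler step; single precision is fine here
}

// Usage (hypothetical): float for game-style physics on consumer cards,
// double only for the few spots where round-off actually matters.
//   integrate_positions<float ><<<grid, block>>>(pos_f, vel_f, 0.016f, n);
//   integrate_positions<double><<<grid, block>>>(pos_d, vel_d, 0.016,  n);
```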

It’s kind of a Catch-22 situation. The gaming industry won’t make use of double precision until there’s a compelling reason to do so, and keeping the DP performance crippled is a good way of stopping them from finding a compelling reason. At the same time, NVIDIA probably won’t make high performance DP available on consumer cards until there’s a strong push from the gaming industry for them to do so.

I would not be surprised if they made full DP available in a refresh of this hardware in two possible scenarios:

  1. More consumer software starts to use GPUs at all (even without double precision), since that would give more reason to have all compute features available.
  2. If AMD/ATI gets its act together and finally produces a software/developer infrastructure that is reasonably good compared to the already rather mature CUDA ecosystem, they would have a very distinct advantage in DP performance, which might compel NVIDIA to unleash all of Fermi’s capabilities on consumer products as well.

So it’s as I thought: Tesla (~672 GFLOPS double) > HD 5870 (554 GFLOPS double) > GTX 480 (168 GFLOPS double).

(The 554 GFLOPS double figure comes from http://techreport.com/articles.x/17618/5)

I do realise that these are peak rates and that the Fermi architecture will get closer to its peak on more complex problems.
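For reference, the GTX 480 figure follows from the published specs: 480 CUDA cores × 1401 MHz × 2 FLOPs per clock (FMA) ≈ 1345 GFLOPS single precision, and 1/8 of that is ≈ 168 GFLOPS double. By the same arithmetic, a shipping Tesla C2050 (448 cores @ 1.15 GHz) works out to ≈ 1030 GFLOPS single and ≈ 515 GFLOPS double at the 1/2 rate, so the ~672 number quoted above presumably comes from the earlier, higher-spec Fermi figures.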

OK, but peak performance on something nontrivial? If you’re doing SGEMM, then sure.

BTW, here are some SiSoft GPGPU results that show GTX 480 is roughly comparable to Radeon 5970 in single precision.

http://hothardware.com/Articles/NVIDIA-GeF…Landed/?page=15

This is backed up by games, too: NVIDIA is slightly faster (or about even) in games, even though ATI has roughly 3x the peak throughput.

Of course, it all depends on the exact application…

  • Matt

This reduction is sad, especially because it will be possible to buy a GTX 480 in the very near future while Tesla is still on the way.

A very interesting question is how exactly they reduce it. I can think of two possible ways:
1. Double precision is done on the SFUs.
2. The number of CUDA cores used for double precision is limited.

In addition, it is unclear: 1/4th in comparison with what?
With the Tesla C2050, which now has 448 CUDA cores @ 1.15 GHz,
or with the maximal capacity of a multiprocessor?

Any ideas?

Please see the post:

http://forums.nvidia.com/index.php?showtopic=165055

Double precision on GeForce GTX 4x0 is 1/8th of single precision.

On Tesla x20x0, double precision is 1/2 of single precision.
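If anyone wants to verify those ratios on their own card rather than trust the spec sheet, a crude microbenchmark along these lines should show the gap. This is just my own sketch (arbitrary grid and iteration counts, build with nvcc -arch=sm_20), not an official tool:

```cuda
// Rough sketch: time a chain of dependent multiply-adds in float and in
// double and compare. If DP runs at 1/8 the SP rate (GTX 4x0), the double run
// should take roughly 8x as long as the float run; at 1/2 (Tesla x20x0),
// roughly 2x.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void fma_chain(T *out, T seed, int iters)
{
    T x = seed + static_cast<T>(threadIdx.x);
    T a = static_cast<T>(1.000001);
    T b = static_cast<T>(0.000001);
    for (int i = 0; i < iters; ++i)
        x = x * a + b;                                   // one multiply-add per iteration
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;      // keep the result live
}

template <typename T>
static float time_kernel(int iters)
{
    const int blocks = 120, threads = 256;
    T *out;
    cudaMalloc(&out, blocks * threads * sizeof(T));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fma_chain<T><<<blocks, threads>>>(out, static_cast<T>(0.5), iters);  // warm-up
    cudaEventRecord(start);
    fma_chain<T><<<blocks, threads>>>(out, static_cast<T>(0.5), iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaFree(out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int iters = 1 << 20;
    float ms_sp = time_kernel<float>(iters);
    float ms_dp = time_kernel<double>(iters);
    printf("float: %.1f ms, double: %.1f ms, double/float time ratio: %.2f\n",
           ms_sp, ms_dp, ms_dp / ms_sp);
    return 0;
}
```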

It has been brought to my attention that the GTX 400 series may have had its double precision floating point capability capped at 1/8th of the single precision rate instead of the 1/2 the hardware is capable of. Can you verify whether this is true for us? I’m a number cruncher involved with BOINC and a member of the SETI.USA team. Nearly 2 million people worldwide crunch for BOINC, and many others for other projects (such as EVGA’s own Folding team). The answer will have a large impact on our decision making as far as upgrading GPUs, and on the rate at which answers are found for the various scientific projects we work on. Regardless of whether or not we have the money, we’re not going to be purchasing Tesla cards to get this performance; we do not need $2000 worth of tech support. If this issue isn’t fixed, we’ll have no choice but to focus on ATI’s offerings, which outperform Nvidia’s crippled GTX series. Please don’t force our hands like this. Thank you.

-John P. Myers

Can you refer us to any GPU accelerated BOINC projects that make use of double precision floating point (current or announced projects)? I am not aware of any.

Milkyway@home requires double precision. In fact, it will not work on cards below compute capability 1.3.
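For anyone unsure whether a given card clears that bar, a minimal check through the CUDA runtime API looks something like this (my own sketch, not Milkyway@home’s actual code):

```cuda
// List each CUDA device and whether it supports double precision
// (compute capability 1.3 or higher).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        bool has_fp64 = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("Device %d: %s, compute capability %d.%d, double precision: %s\n",
               dev, prop.name, prop.major, prop.minor, has_fp64 ? "yes" : "no");
    }
    return 0;
}
```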

Other projects would exist, but have slowed their production down by writing FP32 apps to potentially attract more people. With both ATI and NVIDIA offering strong FP64 support, that would change in short order.

To further elaborate, SETI.USA is the #1 crunching team. Not just #1 in the country, but #1 in the world, with 6.25 billion cobblestones of computation completed (1 cobblestone = 864,000,000,000 floating point operations). The major French team (technically the French speakers of the world) is 2nd, with the major German team 3rd. We take advancing the scientific knowledge of the world very seriously. Therefore, we take Nvidia crippling our possible production very seriously as well. We simply cannot use Nvidia 400 series GPUs if FP64 is going to be intentionally crippled while ATI’s GPUs are left wide open, and we will not use Tesla. We have our own tech support. Personally, I currently use and always have used Nvidia GPUs. After this, though… it’s just unacceptable.

Worldwide team rankings: Home | BOINCstats/BAM!

We are adamant about upgrading our computers to the best possible hardware we can get our hands on. The competition between us and the other nations of the world is fierce and often stressful. We have to be on top of things in order to remain #1 in the world. If using the 400 series will hinder our ability to maintain the #1 spot, then we simply cannot use them. Others around the world will follow suit. Everyone wants to be #1.

We had been anticipating that the GTX 480 would outperform the ATI HD5870 in FP64 computations. If you’re going to cripple the 400 series, obviously this will not be the case and will be a big disappointment to everyone. Seems Nvidia is happy to be in 2nd place in this regard. A very distant 2nd.

We need confirmation one way or the other about the 400 series. If it is currently crippled but you’ve decided that would be a horrible business move and you change your mind, that’s fine. If you’re going to leave it crippled, say so. If it’s all just a rumor, say so. If you say nothing, I (we) will have no choice but to assume your products will remain inferior. We will not lose the #1 spot to the French because they went with ATI and we didn’t.

Official confirmation, please.

-John P. Myers

I think no one here has a problem if you prefer ATI over NVIDIA for your particular use case. As for official confirmation, we’re all waiting.
I am actually rooting for the French and German teams, though ;)

This post was written by Sumit Gupta, who is a Senior Product Manager in the CUDA group at NVIDIA:

http://forums.nvidia.com/index.php?showtopic=165055

I think that’s pretty official.

Just to check: You’re sure that you are compute bound, and not memory bandwidth bound on these applications? Is your work unit throughput proportional to the shader/core clock on the card if you scale it up and down (while leaving the memory clock fixed) with an overclocking tool?

You should definitely use whatever hardware runs your code the fastest, but there can be other limiting factors on performance besides pure FLOPS. It’s worth being certain before you pull out the wallet. :)

Bandwidth is not a problem with these apps. A PCIe x16 2.0 GPU put in a PCIe x1 1.1 slot causes no performance decrease when crunching numbers, and overclocking only the shaders increases performance directly. Bandwidth isn’t even close to being a limiting factor; our performance is directly related to pure FLOPS. This has been a known fact for us for several years.

Memory bandwidth isn’t the same as PCIe bandwidth, and the former was the subject of the question, not the latter.
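To make that distinction concrete, here is a rough sketch (mine, with arbitrary sizes) that measures the two separately: the host-to-device copy goes over the PCIe bus, while the in-kernel copy exercises the card’s own memory bus:

```cuda
// Measure PCIe (host-to-device) bandwidth vs. device memory bandwidth.
// Uses pageable host memory, so the PCIe number will be on the low side.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float *in, float *out, int n)
{
    // Grid-stride loop keeps the launch within the 65535-block grid limit.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        out[i] = in[i];                     // one global load + one global store
}

int main()
{
    const int n = 1 << 24;                  // 16M floats = 64 MB
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // PCIe bandwidth: host -> device copy
    cudaEventRecord(start);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("PCIe host->device: %.1f GB/s\n", bytes / ms / 1e6);

    // Device memory bandwidth: read + write entirely on the card
    copy_kernel<<<1024, 256>>>(d_in, d_out, n);   // warm-up
    cudaEventRecord(start);
    copy_kernel<<<1024, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Device memory:     %.1f GB/s\n", 2.0 * bytes / ms / 1e6);

    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}
```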