According to this article, the GTX 580 has limitations compared to the GTX 480. Citation:
"
According to sources at NordicHardware, it could be as many as 300 million transistors that NVIDIA has been able to cut in this way. The effect is that GF110 will be a GPU targeting only retail and will not be as efficient for GPGPU applications as the older siblings of the Fermi Tesla family.
"
Is that true? And if it is, what is the difference between GF100 and GF110?
Even if it is true, the question is how much that actually affects most CUDA applications. If you look at the benchmarks people have been posting, the consumer products are faster than the Tesla ones in single precision and often even in double precision. For example, if NVIDIA was able to save some transistors by cutting back on potential double precision performance, that probably won’t hurt usability much, since it was hard to saturate those units in many real-world applications anyway.
In the end we will have to wait until the products are out to actually measure how good they are.
Now with the new cards, all NVIDIA has to do is tell the game companies to make some PC-only games that run on such high-end cards, and that goes for AMD/ATI as well. Why waste big money to play ported console games? It’s so lame… CHEERS
If GF110 continues the pattern of the compute capability 2.1 designs (GF104–GF108), then NVIDIA has simplified the multiprocessors and CUDA cores relative to GF100 to save die area. The 48 CUDA cores per multiprocessor in GF104–GF108 can have mixed (and difficult to predict) throughput, since the scheduler depends more on instruction-level parallelism than GF100’s does. In the best case for a compute-bound kernel you get the throughput of 48 CUDA cores, but in the worst case it can be more like 32. The old days of scaling by [# of CUDA cores] * [shader clock] are pretty much over, as compute capability 2.1 throughput behaves very differently from compute capability 2.0. In addition, the ratio of double precision to single precision throughput will probably be more like 1/12 than 1/8.
So I imagine there will be kernels which run slower on the GTX 580, just as people found the performance of the GTX 460 did not extrapolate down from the GTX 470 performance based on paper GFLOPS.
That said, if these changes make the GTX 580 cheaper than it otherwise would be, I’m OK with that. “Efficiency” has to be defined relative to something, and I don’t care about efficiency relative to theoretical paper GFLOPS so much. Performance per watt or performance per dollar usually matter more. (Unless you are space constrained, and then you might actually care about “performance per PCI-E slot”.)
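To make the “paper GFLOPS” comparison concrete, here is the kind of back-of-the-envelope calculation I mean. It is only a sketch: the core counts and shader clocks are the published figures as I remember them, and the 32/48 “worst case” factor is just the pessimistic scheduling scenario described above.

```c
/* Rough "paper GFLOPS" estimates: peak SP = cores * shader clock (GHz) * 2
 * (an FMA counts as 2 flops), consumer DP = SP/8 on CC 2.0 and SP/12 on CC 2.1.
 * Spec numbers are from memory -- double-check against the data sheets. */
#include <stdio.h>

int main(void)
{
    struct gpu { const char *name; int cores; double shader_ghz; int dp_div; int cc21; };
    struct gpu cards[] = {
        { "GTX 470 (CC 2.0)", 448, 1.215,  8, 0 },
        { "GTX 480 (CC 2.0)", 480, 1.401,  8, 0 },
        { "GTX 460 (CC 2.1)", 336, 1.350, 12, 1 },
    };

    for (int i = 0; i < 3; ++i) {
        double sp_paper = cards[i].cores * cards[i].shader_ghz * 2.0;
        /* On CC 2.1, a purely dependent instruction stream may only keep
         * 32 of the 48 cores per SM busy. */
        double sp_worst = cards[i].cc21 ? sp_paper * 32.0 / 48.0 : sp_paper;
        printf("%s: %4.0f SP GFLOP/s paper, %4.0f SP worst case, %3.0f DP GFLOP/s\n",
               cards[i].name, sp_paper, sp_worst, sp_paper / cards[i].dp_div);
    }
    return 0;
}
```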
It’s unlikely that the GF110-based GTX 580 will use the innovative 48-SPs-per-SM superscalar execution design of GF104.
The GTX 580 has 512 SPs, which is not a multiple of 48. So it’s likely that GF110 is a silicon layout respin of GF100 for better clocks, defect rates, and power, not an architectural change.
This is all just supposition, though… NVIDIA hasn’t even officially announced GF110 or the GTX 580. We may be surprised.
We should know early next week when the rumors say the announcement and launch will happen.
The consensus is we’ll see about 15-20% performance improvement over GTX480, half from higher clocks and half from more active SMs.
Power use will be lower as well.
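Once cards are in people’s hands, the layout question answers itself with a deviceQuery-style check: compute capability 2.0 parts have 32 CUDA cores per SM and 2.1 parts have 48, so the reported compute capability and SM count tell you which design GF110 uses. A minimal sketch of my own (nothing GF110-specific):

```cpp
// Query compute capability and SM count with the CUDA runtime API.
// 512 SPs as 16 SMs x 32 cores fits a CC 2.0 (GF100-style) layout;
// 512 is not a multiple of 48, so a pure CC 2.1 (GF104-style) layout would not add up.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // Cores per SM for Fermi-era parts: 32 for CC 2.0, 48 for CC 2.1
        // (8 for the old CC 1.x devices).
        int cores_per_sm = (prop.major < 2) ? 8 : (prop.minor == 0 ? 32 : 48);
        printf("Device %d: %s, CC %d.%d, %d SMs, ~%d CUDA cores, %.0f MHz clock\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount,
               prop.multiProcessorCount * cores_per_sm, prop.clockRate / 1000.0);
    }
    return 0;
}
```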
Even if they were to use the GF104 architecture, they could simply remove their artificial crippling to make the ratio 1/6 instead of 1/12. I believe the table below to be correct, but I am happy to be proven wrong…
If they continue with an artificially limited DP/SP ratio of 1/8 or 1/12, that will strengthen my resolve to abandon the CUDA monopoly and learn CAL.
WARNING, INTERNET RAGE AHEAD:
I know that they are within their rights to cripple their own product in order to segment the market and increase profit. Yes, I am aware that they do not run a charity, and no, I don’t have an especially high sense of entitlement (these are the two most common arguments defending this ridiculous move). There are a lot of things in this world that you are legally allowed to do but that still make you a bad person if you do them. There is such a thing as corporate ethics, and looking after stakeholder interests can bring business advantages despite appearing financially disadvantageous on first inspection. Obviously NVIDIA’s management structure is not anywhere near as capable as their technical people.
I’ve devoted a significant amount of time to learning the intricacies of CUDA and to writing two relatively large (by my standards) CFD codes using CUDA over the past two years. I’m currently using a box with two second-hand GTX 295s. A lot of my work was done under the premise that consumer-grade GPUs would continue to offer increasing performance and that I (we) would be able to leverage the popularity of 3D gaming to conduct scientific research. The main benefit would have been encouraging small-scale independent research rather than needing larger research grants.
How was I to predict that cheap GPU computing would be offered using the “drug dealer” business model? The product is initially offered at a cheap price and then withdrawn once you have become dependent. Now the consumer’s only choice is to move to the 10x more expensive product. The problem here for NVIDIA is three letters: AMD.
Due to NVIDIA’s corporate ethics, I may be spending the next few months of my research porting my codes from CUDA to CAL, and depending on the results I will most probably be purchasing a number of 6970s. I am also holding off on the next graphics card purchase for my gaming rig (currently 2 x GTX 285s) until this experiment is done.
The other option is to go more superscalar and add yet another ALU pipeline, for 64 CUDA cores per multiprocessor. :) But yeah, you are probably right.
That would be nice, especially if they can actually drop below the magic 225W number and go back to dual 6-pin PCI-E power.
Not to deflect your tantrum too much, but this fist fight was had back when the GTX 480 came out. Feel free to read this thread and all the arguments:
I’m still hoping someone will actually demonstrate that a realistic double precision benchmark (like DGEMM) runs faster on consumer AMD cards than consumer NVIDIA cards. Practical differences and competition between vendors will trump righteous indignation any day.
(And I’m being serious here! I don’t need maximum double precision throughput now, and there’s no way I’ll be in the market segment that can afford a Tesla. If for some reason I do need serious DP in the future, and people have discovered that AMD is delivering it in vast quantities, then I’ll go buy a Radeon 5870 and deal with OpenCL for that project. Technologies come and go, and I don’t need One To Rule Them All.)
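For concreteness, this is roughly the kind of test I have in mind: a minimal cuBLAS DGEMM timing sketch (v2 API, error checking stripped, matrix size arbitrary), using the usual 2·N³ flop count. The AMD side would of course need the equivalent library routine.

```cpp
// Minimal DGEMM throughput sketch with cuBLAS. Error checking omitted for
// brevity; N just needs to be big enough to keep the GPU busy.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int N = 4096;
    const size_t bytes = (size_t)N * N * sizeof(double);

    std::vector<double> h((size_t)N * N, 1.0);
    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;

    // Warm-up call, then a timed run.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double gflops = 2.0 * N * (double)N * N / (ms * 1.0e6);  // 2*N^3 flops
    printf("DGEMM %dx%d: %.1f ms, %.1f GFLOP/s (double precision)\n", N, N, ms, gflops);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```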
I have to admit I don’t feel as strongly about the issue as my post indicates. I’ve enjoyed learning to use CUDA so far and am getting great results using the GT200 architecture. I tend to get wound up during the post writing process :D
I have to agree with you that not many problems require the full DP arithmetic performance, because they are memory bandwidth bound. Incidentally, there is a whole class of problems that may appear bandwidth bound when you first count memory operations and arithmetic operations but will still benefit from faster arithmetic because of data dependency issues. The whole GPU computing premise that memory access times can be hidden behind calculations may not be 100% achievable for all algorithms. Consider an algorithm that repeatedly launches a bandwidth-bound kernel followed by a compute-bound kernel. Even if the total ratio of arithmetic operations to memory operations suggests the problem is bandwidth bound, the algorithm will still benefit from faster arithmetic during the compute-bound kernel phase. In fact, I believe that any kernel, no matter how bandwidth bound, will benefit slightly from faster arithmetic during the tail end of execution where no more memory transfers are required (and vice versa for a compute-bound kernel during the initial launch).
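A toy example of what I mean, with completely made-up bandwidth and throughput numbers, just to show the arithmetic:

```c
/* Toy model of an algorithm that alternates a bandwidth-bound kernel with a
 * compute-bound kernel. All numbers are invented purely for illustration. */
#include <stdio.h>

int main(void)
{
    double bytes_moved = 4.0e9;    /* kernel A: streams 4 GB per iteration  */
    double bandwidth   = 150.0e9;  /* ~150 GB/s                             */
    double dp_flops    = 1.0e9;    /* kernel B: 1 GFLOP of double precision */
    double dp_slow     = 80.0e9;   /* DP throughput with a low DP/SP ratio  */
    double dp_fast     = 160.0e9;  /* DP throughput with twice the ratio    */

    double t_mem  = bytes_moved / bandwidth;
    double t_slow = t_mem + dp_flops / dp_slow;
    double t_fast = t_mem + dp_flops / dp_fast;

    /* Aggregate intensity is only 0.25 flops/byte, below this made-up
     * machine's ~0.5 flops/byte balance point, so a naive count calls the
     * whole algorithm bandwidth bound -- yet doubling DP throughput still
     * shaves roughly 16% off each iteration, because the two kernels run
     * back to back and cannot overlap. */
    printf("slow DP: %.1f ms per iteration\n", t_slow * 1e3);
    printf("fast DP: %.1f ms per iteration\n", t_fast * 1e3);
    return 0;
}
```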
I also agree that fast double precision on AMD may or may not be a reality. I too have not seen any benchmarks, and I have to admit that I don’t understand their architecture to the extent that I do NVIDIA’s.
Hear, hear! The main reason I will begin experimenting with AMD is that I want an equal understanding of both technologies. Reading back my previous post, it sounds like I’m “jumping ship” to AMD, but really I just want to understand both in order to pick the better one later in my research, when I may be able to purchase four or so cards (or half a Tesla card).
Can anyone shed any light on GF104 double precision performance? I can’t understand how they arrive at the 1/12 DP/SP ratio.
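In the meantime, here is the rough micro-benchmark I’d use to measure the ratio empirically once I have a card to test on. It is a throwaway sketch of my own, not an official benchmark; the launch configuration is arbitrary, and the dependent FMA chain only approaches peak if enough warps are in flight.

```cuda
// Throwaway sketch: time a long chain of dependent FMAs in float and double
// to estimate the DP:SP throughput ratio of whatever card is installed.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void fma_chain(T *out, int iters)
{
    T a = (T)threadIdx.x * (T)0.001;
    T b = (T)1.000001;
    T c = (T)0.000001;
    for (int i = 0; i < iters; ++i)
        a = a * b + c;                                   // one FMA per iteration
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;      // keep the compiler from removing the loop
}

template <typename T>
double run(int blocks, int threads, int iters)
{
    T *d_out;
    cudaMalloc(&d_out, (size_t)blocks * threads * sizeof(T));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fma_chain<T><<<blocks, threads>>>(d_out, iters);     // warm-up
    cudaEventRecord(start);
    fma_chain<T><<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaFree(d_out);

    double flops = 2.0 * blocks * threads * (double)iters;  // FMA = 2 flops
    return flops / (ms * 1.0e6);                            // GFLOP/s
}

int main()
{
    const int blocks = 120, threads = 256, iters = 1 << 20;
    double sp = run<float>(blocks, threads, iters);
    double dp = run<double>(blocks, threads, iters);
    printf("SP: %.0f GFLOP/s, DP: %.0f GFLOP/s, ratio 1/%.1f\n", sp, dp, sp / dp);
    return 0;
}
```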