GTX-480 cards have DP performance at 1/8 of their SP performance. For GF-104 cards, it is 1/6. Since there are 3 16-core pipelines in a GF-104 SM, enabling the DP capability of only 1 of the 3 pipelines would result in 1/6 DP performance compared with SP. But how is it done in GTX-480 cards? Since they have only 2 16-core pipelines, how can disabling DP capabilities result in a 1/8 ratio? Thanks.

i don’t know but offhand it seems intuitive that when you increase the precision of a floating point multiplication, the pipeline depth grows linearly and the circuitry area grows with the square of the precision. so even if you’re sharing circuitry w/single precision, double precision would be 2^2 * 2 = 8x of single precision’s spatial-temporal resources. so 1/8 seems about right. then i’d imagine w/3 times the fpu’s for 2 threads you just share the circuitry differently and end up w/ (8x+8x)*3/2 = 6x.

again i don’t know anything about the actual circuitry. this is all just educated guessing.

I thought the DP rate on GF-104 was 1/12th of the SP rate, assuming you can get SP instructions issued on all 3 pipelines. (That is, GF104 is even less efficient at double precision than GF100 in GeForce cards.)

If it really is 1/12, then you can think of a GF104 multiprocessor as a GF100 multiprocessor with an extra pipeline attached. The GF100 multiprocessor completes 32 SP instructions per clock, and 32 DP instructions per 8 clocks, giving the 1/8th rate. If you attach an additional pipeline with no DP capability at all, then (at peak) you complete 48 SP instructions per clock, but still 32 DP instructions per 8 clocks. That gives you the 1/12 factor that I remember from the GF104 reviews.
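The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name `dp_sp_ratio` is made up for illustration, and the per-clock figures are the ones quoted in this post, not measured values.

```python
from fractions import Fraction

def dp_sp_ratio(sp_per_clock, dp_per_batch, clocks_per_batch):
    """Peak DP/SP instruction-rate ratio for one multiprocessor (hypothetical helper)."""
    dp_per_clock = Fraction(dp_per_batch, clocks_per_batch)
    return dp_per_clock / sp_per_clock

# GF100 (GeForce): 32 SP instructions per clock, 32 DP instructions per 8 clocks.
gf100 = dp_sp_ratio(32, 32, 8)

# GF104 under the "extra SP-only pipeline" assumption: 48 SP per clock, same DP rate.
gf104 = dp_sp_ratio(48, 32, 8)

print(gf100, gf104)  # 1/8 1/12
```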

(As far as we know, the GTX 470 and 480 are built with the same die (“GF100”) as the Tesla cards, so the limit on double precision is enforced in firmware or some kind of after-the-fact hardware modification.)

Sorry, then I may have misremembered the ratio and mistaken 1/12 for 1/6. Does the fact that adding another pipeline with no DP capability takes the ratio from 1/8 to 1/12 imply that only 1 of the 2 pipelines in GF-100 is capable of DP? That might translate to 1 DP-capable pipeline issuing 4 DP instructions per clock and completing all 32 DP MADs in 8 cycles.

Also, why does the C2050 have DP performance at 1/2 of SP? It must be that both pipelines are capable of DP and more circuitry is added to support a higher DP issue rate.

Sometimes GF106 in practice shows more than a 1/12 DP/SP rate (from http://nvworld.ru/articles/gigabyte-gts450/page3/ )

I suspect the problem here is failure to fully utilize the third pipeline in the SP test. It sounds like reaching peak SP is non-trivial with compute capability 2.1.
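To put rough numbers on that, here is a small sketch. It assumes the model from earlier in the thread (32 DP instructions per 8 clocks, 32 SP per clock from two pipelines plus up to 16 more from the third); the utilization parameter and the function name are hypothetical, not anything measured on real hardware.

```python
from fractions import Fraction

def observed_dp_sp_ratio(third_pipeline_utilization):
    """DP/SP ratio when the SP test fills the third pipeline only partly (assumed model)."""
    u = Fraction(third_pipeline_utilization)
    dp_per_clock = Fraction(32, 8)   # 32 DP instructions per 8 clocks
    sp_per_clock = 32 + 16 * u       # two full pipelines plus a fraction of the third
    return dp_per_clock / sp_per_clock

for u in (Fraction(0), Fraction(1, 2), Fraction(1)):
    print(u, observed_dp_sp_ratio(u))
```

Under these assumptions the measured ratio slides from 1/8 (third pipeline idle) down to 1/12 (third pipeline fully used), so an SP benchmark that falls short of peak would make the DP/SP rate look better than 1/12.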

The difference is due to market segmentation rather than technology. The previous generation of Tesla was so similar to the GeForce in capability that there were few reasons to buy it. (For example, DP was slightly faster with the GTX 285 as compared to the Tesla C1060.) By reserving the highest DP performance for the Tesla in the current generation, NVIDIA hopes to encourage more high-performance computing types to buy it.

(Basically, I’m saying that there probably is very little difference between the GTX 480/470 and the Tesla C2050/70. The DP is limited on the GeForce for sales reasons.)

My 2c,

As seibert said, it has become clear (although I haven’t seen an official statement) that NVIDIA is artificially crippling the DP performance of GF100 consumer cards in order to segment the market and encourage sales of its extremely high-priced Tesla cards.

What is not clear to me is why the DP/SP ratio is 1/12 instead of 1/6 for GF104.

Let me elaborate on the table below…

**DP/SP # execution units:** The old GT200 architecture has a single separate DP execution unit for each cluster of 8 SP units. Each execution unit in GF100 is SP and DP capable, while the GF104 has one DP- and SP-capable unit and 2 SP-only units.

**DP/SP ops per clock:** The GT200 had a separate unit that could perform a DP operation each clock cycle. The DP and SP unit are the same unit in the GF100; however, for DP it operates on a path twice as wide and hence takes 2 clocks for each DP operation. Same deal with the GF104 DP/SP-capable unit.

**Artificial crippling:** The mysterious factor missing from consumer GF100 DP performance.

**DP/SP total FLOPS:** When you multiply the other factors together you get the total ratio of DP/SP performance.

```
            | DP/SP       | DP/SP   | Artificial   | DP/SP
ARCH        | # execution | ops per | crippling    | total
            | units       | clock   |              | FLOPS
------------+-------------+---------+--------------+------
GT200       | 1/8         | 1       | 1            | 1/8
GF100       | 1           | 1/2     | 1/4          | 1/8
GF100 Tesla | 1           | 1/2     | 1            | 1/2
GF104       | 1/3         | 1/2     | 1/2 (maybe?) | 1/12
```
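For anyone who wants to check the multiplication, here is a minimal Python sketch. The dictionary just restates the table rows; the "crippling" column is a guess rather than a documented figure, and the names `table` and `total_ratio` are made up for this example.

```python
from fractions import Fraction as F

# (execution-unit ratio, ops-per-clock ratio, assumed crippling factor) per row.
# The crippling factors are guesses from the table, not official numbers.
table = {
    "GT200":       (F(1, 8), F(1),    F(1)),
    "GF100":       (F(1),    F(1, 2), F(1, 4)),
    "GF100 Tesla": (F(1),    F(1, 2), F(1)),
    "GF104":       (F(1, 3), F(1, 2), F(1, 2)),
}

def total_ratio(units, ops_per_clock, crippling):
    """Total DP/SP ratio as the product of the three columns."""
    return units * ops_per_clock * crippling

totals = {arch: total_ratio(*row) for arch, row in table.items()}
for arch, ratio in totals.items():
    print(f"{arch:12s} {ratio}")

# With no crippling factor at all, the GF104 row would come out at 1/6.
gf104_uncrippled = total_ratio(F(1, 3), F(1, 2), F(1))
```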

**QUESTION:**

Why is the DP/SP ratio 1/12 instead of 1/6 for GF104? Have I got something wrong in the table above or is nvidia also crippling the DP performance of the GF104 chips but to a lesser extent than GF100?

I think the GF 104 numbers are 2/3, 1/2, 1/4. The 2/3 would come from the superscalar nature of the multiprocessors, which sometimes act as if they had 48 cores (hard to reach) and sometimes only 32. The 1/4 “crippling factor” might actually not be a crippling factor at all, but the omission of the necessary circuitry to let two single-precision cores act as one double-precision core on most of the cores. This might also be a factor in reducing the power consumption (and might be done on the upcoming GF110 as well).
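A quick check, as a sketch only: both sets of GF 104 factors (the 1/3, 1/2, 1/2 from the table and the 2/3, 1/2, 1/4 suggested here) multiply out to the same total, so the overall DP/SP figure alone cannot tell the two explanations apart. The variable names are just for illustration.

```python
from fractions import Fraction as F
from math import prod

# Two guessed decompositions of the GF 104 DP/SP ratio from this thread.
table_guess = (F(1, 3), F(1, 2), F(1, 2))
alt_guess = (F(2, 3), F(1, 2), F(1, 4))

print(prod(table_guess), prod(alt_guess))  # 1/12 1/12
```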

Cheers

Ceearem
