Please help by answering ANY of the questions at the bottom!

Compute device: 1.3, Tesla C1060, compiled WITHOUT 1.3 architecture flag (sm_arch)

My original belief was that mad.f32 was a faster, less accurate (not IEEE 754) multiply-add combination, and that __fadd_rn() prevented this combination, which made the result IEEE-compliant but slower. The PTX manual appears to disagree.

I have the following line of code:

```
float Px = imgShiftX - (startX + (float) i*device_stepX[klm]) * reci_imgpixD;
```

Compiling that returns (in PTX):

```
.loc 18 91 0
cvt.rn.f32.s32 %f23, %r26;
.loc 18 65 0
ld.param.f32 %f19, [__cudaparm__Z16gpu_forward_viewPtS_S_S_PfS0_S0_S_S_S_iiiiiiififfffffiiii_startX];
.loc 18 91 0
mad.f32 %f24, %f23, %f1, %f19;
.loc 18 65 0
ld.param.f32 %f20, [__cudaparm__Z16gpu_forward_viewPtS_S_S_PfS0_S0_S_S_S_iiiiiiififfffffiiii_reci_imgpixD];
.loc 18 91 0
mul.f32 %f25, %f20, %f24;
sub.f32 %f26, %f4, %f25;
```

PTX manual says:

Questions:

- What is this about “product of a and b at double precision…”? It is listed for compute 1.x, which includes devices that may not have double precision (x<3).

1a. If “double precision” was not a typo, does this mean the mantissa is truncated and extra exponent bits are kept, just so the intermediate multiply result can be larger than a float, in the hope that the add will be a negative number that brings it back into range?

- When I compile using:

```
float Px = imgShiftX - (__fadd_rn(startX, (float)i*device_stepX[klm]))*reci_imgpixD;
```

I get the PTX:

```
cvt.rn.f32.s32 %f31, %r99;
mul.f32 %f32, %f31, %f3;
.loc 18 63 0
ld.param.f32 %f27, [__cudaparm__Z16gpu_forward_viewPtS_S_S_PfS0_S0_S_S_S_iiiiiiififfffffiiiiS__startX];
.loc 18 87 0
add.rn.f32 %f33, %f27, %f32;
.loc 18 63 0
ld.param.f32 %f28, [__cudaparm__Z16gpu_forward_viewPtS_S_S_PfS0_S0_S_S_S_iiiiiiififfffffiiiiS__reci_imgpixD];
.loc 18 87 0
mul.f32 %f34, %f28, %f33;
sub.f32 %f35, %f12, %f34;
```

The resulting value is different (depending on input values). Does this imply that “mad.f32 is identical to the result computed using separate mul and add” is incorrect? Also, why would that statement be correct if there is some kind of double precision going on in the mad.f32?

- Only the __fadd_rn version matches the CPU.
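On question 1a, a host-side C sketch can show what a wider intermediate exponent range would buy. It does not reproduce the GPU. It only assumes, as the question supposes, that the product is held with more exponent range than a float before the add. The names `mad_wide` and `mul_then_add` are made up for illustration, and mantissa truncation is deliberately left out.

```c
#include <math.h>

/* ASSUMPTION (hypothetical model, not confirmed hardware behavior):
   the product keeps a wider exponent range until the add is done. */
float mad_wide(float a, float b, float c)
{
    /* float * float is exact in double and cannot overflow double */
    double p = (double)a * (double)b;
    return (float)(p + (double)c);      /* one rounding back to float */
}

/* Separate multiply and add, each result stored as a float. */
float mul_then_add(float a, float b, float c)
{
    /* volatile forces each step to be stored as a float, so the
       compiler cannot fuse the two operations or keep extra precision */
    volatile float p = a * b;           /* becomes +inf if the product exceeds FLT_MAX */
    volatile float s = p + c;
    return s;
}
```

With a = b = 2^64 and c = -2^127, `mul_then_add` overflows to +inf in the multiply and stays there, while `mad_wide` returns the finite value 2^127. Under this assumption the extra exponent bits only matter in the narrow case where the product overflows a float and the addend pulls it back into range.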

Finally, the programming guide says:

This leads me to believe that the “double precision” in the PTX manual is a mistake and should say single precision, and that the difference I’m seeing comes from truncation rather than rounding.

The __fadd_rn version indeed tests significantly slower.