Please explain mad.f32 vs. mul & add

Please help by answering ANY of the questions at the bottom!

Compute device: 1.3, Tesla C1060, compiled WITHOUT the 1.3 architecture flag (-arch sm_13)

My original belief was that mad.f32 was a faster, less accurate (non-IEEE 754) multiply-add combination, and that __fadd_rn() prevented this contraction, making the result IEEE-compliant but slower. The PTX manual appears to disagree.

I have the following line of code:

float Px = imgShiftX - (startX + (float)i * device_stepX[klm]) * reci_imgpixD;

Compiling that produces the following PTX:

.loc	18	91	0
	cvt.rn.f32.s32 	%f23, %r26;
	.loc	18	65	0
	ld.param.f32 	%f19, [__cudaparm__Z16gpu_forward_viewPtS_S_S_PfS0_S0_S_S_S_iiiiiiififfffffiiii_startX];
	.loc	18	91	0
	mad.f32 	%f24, %f23, %f1, %f19;
	.loc	18	65	0
	ld.param.f32 	%f20, [__cudaparm__Z16gpu_forward_viewPtS_S_S_PfS0_S0_S_S_S_iiiiiiififfffffiiii_reci_imgpixD];
	.loc	18	91	0
	mul.f32 	%f25, %f20, %f24;
	sub.f32 	%f26, %f4, %f25;
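
For reference, my reading of how those registers map back to the source line (%f1 and %f4 are presumably loaded earlier with device_stepX[klm] and imgShiftX):

	// %f23 = (float)i                                (cvt from %r26)
	// %f19 = startX                                  (kernel parameter)
	// %f24 = (float)i * device_stepX[klm] + startX   (contracted into mad.f32)
	// %f20 = reci_imgpixD                            (kernel parameter)
	// %f25 = reci_imgpixD * %f24
	// %f26 = imgShiftX - %f25 = Px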

PTX manual says:

"mad.f32 computes the product of a and b at double precision […] The exception for mad.f32 is when c = +/-0.0, mad.f32 is identical to the result computed using separate mul and add instructions."

Questions:

  1. What is this about "product of a and b at double precision…"? It is listed as applying to compute 1.x, meaning devices that may not even have double-precision hardware (x < 3)?

1a. If the double precision was not a typo, is the idea to truncate the mantissa but keep the extra exponent bits, so that the intermediate multiply result can be larger than a float, in the hope that the add (e.g. a negative term) brings it back into range? (A sketch of this reading follows question 3 below.)

  2. When I compile using:
float Px = imgShiftX - (__fadd_rn(startX, (float)i*device_stepX[klm]))*reci_imgpixD;

I get the PTX:

cvt.rn.f32.s32 	%f31, %r99;
	mul.f32 	%f32, %f31, %f3;
	.loc	18	63	0
	ld.param.f32 	%f27, [__cudaparm__Z16gpu_forward_viewPtS_S_S_PfS0_S0_S_S_S_iiiiiiififfffffiiiiS__startX];
	.loc	18	87	0
	add.rn.f32 	%f33, %f27, %f32;
	.loc	18	63	0
	ld.param.f32 	%f28, [__cudaparm__Z16gpu_forward_viewPtS_S_S_PfS0_S0_S_S_S_iiiiiiififfffffiiiiS__reci_imgpixD];
	.loc	18	87	0
	mul.f32 	%f34, %f28, %f33;
	sub.f32 	%f35, %f12, %f34;

The resulting value is different (depending on the input values), implying that "mad.f32 is identical to the result computed using separate mul and add" is incorrect? Also, how could "mad.f32 is identical to the result computed using separate mul and add" be correct if there is some kind of double precision going on inside the mad.f32?

  3. Only the __fadd_rn version matches the CPU result.
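
As referenced in 1a above, here is a small host-side sketch of how I read the manual's description: compute the product at double precision, truncate its mantissa to float's 24-bit significand while keeping double's exponent range, then do the add. This is only my interpretation of the documented behavior, not the actual sm_1x datapath, and it assumes the host evaluates float arithmetic in single precision (e.g. SSE, FLT_EVAL_METHOD == 0):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Truncate a double's significand to float's 24 bits while keeping the
 * double's exponent range (zero the low 29 of the 52 mantissa bits).
 * Normal numbers only; sketch ignores subnormal/inf/NaN edge cases. */
static double truncate_to_float_mantissa(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= ~((UINT64_C(1) << 29) - 1);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* My reading of mad.f32 on sm_1x: the product of two 24-bit significands
 * is exact in double (needs at most 48 bits), gets its mantissa truncated,
 * and only then is added to c with round-to-nearest. */
static float emulated_mad_f32(float a, float b, float c)
{
    double p = truncate_to_float_mantissa((double)a * (double)b);
    return (float)(p + (double)c);
}

int main(void)
{
    /* Values chosen so the product's discarded bits would round up under
     * round-to-nearest(-even) but are simply dropped under truncation. */
    float a = 1.5f;
    float b = nextafterf(1.0f, 2.0f);   /* 1 + 2^-23 */
    float c = 1.0f;

    float separate = (a * b) + c;       /* mul.rn + add.rn, like the __fadd_rn version */
    float emulated = emulated_mad_f32(a, b, c);

    printf("separate mul+add: %.9g\n", separate);   /* 2.50000024 */
    printf("emulated mad    : %.9g\n", emulated);   /* 2.5 */
    return 0;
}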

Finally, programming guide says:

"Addition and multiplication are often combined into a single multiply-add instruction (FMAD), which truncates the intermediate result of the multiplication."

This leads me to believe that the "double precision" in the PTX manual is a mistake and should say single precision, and that the difference I'm seeing is due to truncation rather than rounding.

The __fadd_rn version indeed tests significantly slower.

Your understanding is correct (as far as Compute 1.x GPUs are concerned). The PTX manual just describes more precisely the (non-IEEE 754) behavior of mad.f32.

The PTX manual describes behavior, not a specific hardware implementation. It means the operation behaves exactly as if the exact result were computed, and then rounded.

Yes, this is exactly the case where it can make a difference between mad.f32 and mul.f32.rz followed by add.f32.rn.
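
For instance (hypothetical values, host-side sketch rather than actual sm_1x output): when the product alone overflows float range but the final sum does not, the wider intermediate exponent is what saves the result. This assumes the host compiler does not itself contract the expression into an FMA (e.g. build with -ffp-contract=off):

#include <stdio.h>

int main(void)
{
    /* a*b = 4e38 exceeds FLT_MAX (~3.4e38), but a*b + c fits. */
    float a = 2e19f, b = 2e19f, c = -1e38f;

    float prod = a * b;             /* mul.f32 path: overflows to +inf */
    float mul_then_add = prod + c;  /* +inf + c = +inf */

    /* mad-like: keep the product at double's exponent range, add, then round */
    float wide_intermediate = (float)((double)a * (double)b + (double)c);

    printf("separate mul+add : %g\n", mul_then_add);      /* inf */
    printf("wide intermediate: %g\n", wide_intermediate); /* ~3e38 */
    return 0;
}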

Note the first part of the sentence:

"The exception for mad.f32 is when c = +/-0.0, mad.f32 is identical to the result computed using separate mul and add instructions."

This is an exception. In the general case, the result may be different, as the programming guide says.

Appreciate the help. The main source of confusion was that the last sentence is actually part of the exception sentence. :)