Prevent nvcc from applying MADD optimization (precision, IEEE 754)

Hi,

I'm currently trying to evaluate the IEEE 754 compliance of compute capability 1.x cards. I know the programming guide states that add, sub, mul and div are standard compliant, but I need empirical proof of this.

To test this I wrote two simple kernels that evaluate the squared length of a vector.

__global__ void kernel_vector_length( float* vector, unsigned dimensions, float* length )
{
    float len = 0;
    for( int i = 0; i < dimensions; i++ )
        len += vector[i] * vector[i];
    *length = len;
}

This is the one that evaluates the precision of MADD. The other implementation is split into two kernels: the first computes the squared value of each vector component, and the second sums the squared values. I had to split it up to keep the compiler from using MADD.

__global__ void kernel_dot( float* vector, unsigned dimensions )
{
    for( int i = 0; i < dimensions; i++ )
        vector[i] = vector[i] * vector[i];
}

__global__ void kernel_sum( float* vector, unsigned dimensions, float* sum )
{
    float s = 0;
    for( int i = 0; i < dimensions; i++ )
        s += vector[i];
    *sum = s;
}

As expected, I found significant differences between the MADD-based and non-MADD-based implementations.

Calculating the vector length is part of a k-means implementation I wrote some time ago (not in exactly the form above, of course), and that calculation also uses MADD. To make a long story short:

Is there a way to force the compiler not to use the MADD optimization? I tried various flags (-Ox etc.), but all yielded the same result. I would like to evaluate the speed penalty introduced by replacing the MADD with a separate multiplication and addition, but I couldn't find a way to do this without resorting to hand-writing PTX assembly. Splitting up the calculation as above is of course not an option.

Thanks in advance.

Can you define "significant differences"? What kind of relative "error" are we talking about here? And how does it compare to the "error" introduced by a change in summation order?

I ask the second question because any efficient kernel performing this operation will certainly be doing a reduction, which sums the values in a different order.

I cannot yet give concrete empirical ULP measures for MADD-based summation compared to IEEE-compliant single-precision summation. What I have done so far is randomly generate vectors within range bounds that would potentially give divergent results, i.e. high dimensionality plus a wide value range, but still low enough not to overflow. The difference is significant in my case: even small errors that could be neglected in other algorithms lead to different clustering results with k-means. The worst case I found so far was a difference of about 10% when classifying a test dataset based on centroids found via k-means on the GPU.

I did not investigate issues arising from different summation orders. Given a dataset with a small mean difference per dimension and low standard deviation, the algorithm will hardly suffer any noticeable errors. In the case of k-means, the order employed on the CPU is the same as on the GPU, since each thread essentially performs the distance measure for two vectors on its own; there is no reduction involved.

When comparing vs. the CPU, make sure that you’re not “accidentally” using the 80-bit floating point unit on the CPU. You can force most compilers to use the 32-bit floating point unit by passing extra compiler flags. The summation order can be a big deal if you have a wide range of values, particularly if those values are then squared. :-) Beyond that, if you sum large vectors of numbers, you’ll get better precision if you sum them in subgroups rather than going through the whole vector at once.
I’ve written a couple of CUDA templates for compensated summation and for native-pair arithmetic, if you’d like to play with those in addition to straight summation.
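For illustration only (this is not the attached template code), the "sum in subgroups" idea might look something like the sketch below; the function name and block size are arbitrary choices:

// Illustrative sketch of summing in subgroups: each fixed-size block is
// accumulated into its own partial sum, and the partial sums are then added
// together. This keeps each addend closer in magnitude to its running sum,
// which is where most of the round-off comes from.
__device__ float blocked_sum( const float* v, unsigned n )
{
    const unsigned BLOCK = 64;          // subgroup size, chosen arbitrarily
    float total = 0.0f;
    for( unsigned base = 0; base < n; base += BLOCK )
    {
        unsigned end = ( base + BLOCK < n ) ? base + BLOCK : n;
        float partial = 0.0f;
        for( unsigned i = base; i < end; i++ )
            partial += v[i];
        total += partial;
    }
    return total;
}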

Cheers,
John Stone

The compiler aggressively merges single-precision adds and multiplies into single-precision FMADs; the FMAD truncates the intermediate result of the multiplication instead of rounding it.

There is currently no way for users to control FMAD merging. The next version of CUDA will provide __fadd_rn(), __fmul_rn() device functions that get translated into single-precision additions and multiplications that are guaranteed not to be merged into FMADs.
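Once those intrinsics are available, the squared-length loop from the first post could presumably stay in a single kernel while still avoiding FMAD merging. A rough, untested sketch (the functions are not in the current release yet):

__global__ void kernel_vector_length_no_fmad( float* vector, unsigned dimensions, float* length )
{
    float len = 0;
    // __fadd_rn/__fmul_rn round to nearest and are not contracted into an FMAD
    for( int i = 0; i < dimensions; i++ )
        len = __fadd_rn( len, __fmul_rn( vector[i], vector[i] ) );
    *length = len;
}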

@tachyon_john: I compare the results not against extended double precision but against single-precision CPU computations in round-to-nearest mode. I would love to play around with the CUDA templates you mentioned.

@mfatica: I couldn't find any information on the floating-point properties of FMAD, so your information is perfect. You don't happen to know of a citable source for this, do you? Edit: well, again I overlooked this in the guide; there's my source…

Thank you all very much for your helpful responses.

I know this is not your primary concern, but speaking of div, the Programming Guide v1.1 states: "Division is implemented via the reciprocal in a non-standard-compliant way" (Ch. A.2, p. 74). Also, the accuracy of the reciprocal is 1 ulp, which is larger than round-to-nearest would give.

Hi,

I haven't had time to really clean these up yet, but I've got two templates/classes for device code (limited to what one can get away with in CUDA 1.[01] anyway) that implement compensated summation and native-pair arithmetic. The native-pair arithmetic is based on Bailey's DSFUN90 implementation, along with a few CUDA-specific tweaks I gleaned from one of the functions in the SDK Mandelbrot example that may help when going to global memory. Feedback is welcome. Hopefully I'll have more time to work on it in the next few weeks. To be completely safe, the adds and multiplies in my code will need to be converted to use the __fadd_rn()/__fmul_rn() functions described by Massimiliano, so that future CUDA compilers don't over-optimize the tricky parts of these codes.

Cheers,

John Stone
dsfuncuda.tar.gz (5.32 KB)
compsum.tar.gz (4.63 KB)
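
To give a flavor of the native-pair approach without opening the archives, here is a rough sketch of a DSFUN90-style double-single add (an illustration, not the attached templates; ds_add and the float2 hi/lo layout are just one possible convention):

// A double-single value is stored as an unevaluated sum hi + lo of two floats
// (hi in .x, lo in .y). The error terms below rely on each add/subtract being
// rounded individually; aggressive re-association or FMAD contraction in the
// surrounding code can break the trick, hence the need for the __fadd_rn()/
// __fmul_rn() functions mentioned above.
__device__ float2 ds_add( float2 a, float2 b )
{
    float t1 = a.x + b.x;                                           // high-order sum
    float e  = t1 - a.x;                                            // what was actually added from b.x
    float t2 = ( ( b.x - e ) + ( a.x - ( t1 - e ) ) ) + a.y + b.y;  // rounding error plus low parts
    float2 c;
    c.x = t1 + t2;                                                  // renormalize into (hi, lo)
    c.y = t2 - ( c.x - t1 );
    return c;
}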

I have a somewhat related question:

I need to perform the following operation: sum += a*b + c*d + e*f + g*h;

Now this can be coded in different ways, a few of them listed below in order of the accuracy of the result:

  1. sum = sum + __fmul_rn(a,b) + __fmul_rn(c,d) + __fmul_rn(e,f) + __fmul_rn(g,h);

  2. sum += a*b + c*d + e*f + g*h;

  3. sum = __fadd_rn(sum, __fadd_rn(__fadd_rn(__fmul_rn(a,b), __fmul_rn(c,d)), __fadd_rn(__fmul_rn(e,f), __fmul_rn(g,h))));

__fadd_rn and __fmul_rn were used to increase accuracy by avoiding FMAD. Why does the third way give the least accuracy? I had understood that both __fadd_rn and __fmul_rn were required. What could be the reason?

Thanks,

Aditi

It is quite likely that #2 is evaluated left to right: (((a*b + c*d) + e*f) + g*h)

However, #3 is evaluated in this order: ((a*b + c*d) + (e*f + g*h))

Floating point arithmetic is only approximately associative, so the order in which you add things can change your accuracy.

If you are really worried about error accumulation, Kahan summation is a convenient way to limit round-off error in sums.
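
For reference, a minimal Kahan summation sketch (illustrative only; kahan_sum is a made-up name, and on the device the adds may need to go through __fadd_rn so the compiler cannot simplify the compensation away):

__device__ float kahan_sum( const float* v, unsigned n )
{
    float s = 0.0f;   // running sum
    float c = 0.0f;   // running compensation (lost low-order bits)
    for( unsigned i = 0; i < n; i++ )
    {
        float y = v[i] - c;     // apply the correction from the previous step
        float t = s + y;        // low-order bits of y are lost here...
        c = ( t - s ) - y;      // ...and recovered here
        s = t;
    }
    return s;
}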

Hey, thanks so much… this was very useful input. However, I am still facing some trouble. I am using a GeForce 9800 GTX+ GPU, which supports only single-precision floating-point operations. I do only around 64 multiplies and 320 accumulates per pixel.

Earlier my problem was that my results from the CPU and device emulation mode were far apart. I tried different sequences of arithmetic operations, used __fadd_rn and __fmul_rn, and used Kahan summation for at least 256 of these accumulates per pixel… but nothing seemed to reduce the average relative error with respect to the CPU single-precision results. When I normalized my input (image) to [0,1) instead of [0,256), the problem was solved and I got accurate, identical results from the CPU and device (GPU) emulation.

Now the problem I am facing is that the results from device emulation and device mode are far apart, while the results from device emulation and the CPU are identical. Normalization to [0,1) is not helping. I checked the intermediate results and saw that the error grows gradually and accumulates into a large one at the end. Now I want to check whether a device that supports double-precision arithmetic would help. I don't have such a device and wondered whether I could emulate one to test my code on, but I cannot find a way to emulate a higher-end device (say a GTX 280, which supports double precision) to compile and run my .cu code. Can you please suggest how to do that, or some other way of getting the results from CPU/device emulation and the GPU as close as possible on my device?

Thanks,

Aditi

Modern Intel CPUs carry extra precision bits and can give results that are more accurate than single precision should be. One way to prevent this is to explicitly cast intermediate expressions to float, as described here: http://msdn.microsoft.com/en-us/library/aa289157.aspx#floapoint_topic17. Try casting all the intermediate results and see if it makes a difference.
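
For instance, a host-side version of the squared-length loop with explicit casts might look like the hypothetical sketch below; each cast forces the intermediate back to float precision so an x87 build cannot keep it in an 80-bit register:

float squared_length_cpu( const float* v, unsigned n )
{
    float len = 0.0f;
    for( unsigned i = 0; i < n; i++ )
        len = (float)( len + (float)( v[i] * v[i] ) );
    return len;
}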