# Prevent nvcc from applying MADD optimization: precision, IEEE 754

hi,

i’m currently trying to evaluate IEEE 754 compliance of 1.x cards. i know the programmer’s guide states that add, sub, mul and div are standard compliant, i need to have empirical proof of this though.

in order to test this i wrote two simple squared vector length evaluating kernels.

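```cuda
// Squared vector length in a single loop; the compiler merges the
// multiply and the add into one MADD.
__global__ void kernel_vector_length( float* vector, unsigned dimensions, float* length )
{
    float len = 0;
    for( unsigned i = 0; i < dimensions; i++ )
        len += vector[i] * vector[i];
    *length = len;
}
```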

this is the one evaluating the precision of madd. the other one is split up into two kernels. i first calculate the squared value for each vector component. based on this i sum up the squared values in another kernel. i had to do this in order to get the compiler to not use madd.

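```cuda
// Step 1: square every component in place (multiply only, no add to merge with).
__global__ void kernel_dot( float* vector, unsigned dimensions )
{
    for( unsigned i = 0; i < dimensions; i++ )
        vector[i] = vector[i] * vector[i];
}

// Step 2: sum the squared components (add only, no multiply to merge with).
__global__ void kernel_sum( float* vector, unsigned dimensions, float* sum )
{
    float s = 0;
    for( unsigned i = 0; i < dimensions; i++ )
        s += vector[i];
    *sum = s;
}
```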

i found significant differences between the madd-based and non-madd-based implementation as expected.

calculating the vector length is part of a k-means implementation i wrote some time ago. of course not in the form as above. however the calculation uses madd. making a long story short:

is there a way to force the compiler not to use madd optimizations? i tried various flags ( -Ox etc. ) but all yielded the same result. i would like to evaluate the speed penalty introduced by replacing the madd with a multiplication and addition but couldn’t find a way without resorting to handwriting ptx assembly. splitting up the calculation as above is of course not an option.

Can you define “significant differences”, what kind of relative “error” are we talking about here? And how does it compare to the “error” introduced by a change in summation order?

I ask the 2nd question because any efficient kernel performing this operation will certainly be performing a reduction which will sum values in a different order.

i cannot yet give concrete empirical ulp measures for madd summation compared to ieee compliant single precision summation. what i did so far is generate a vector randomly within certain range bounds that would potentially give divergent results, i.e. high dimensionality plus high value range, but still low enough to not overflow. it is significant in my case as even small differences that can be neglected for other algorithms lead to different clustering results in the case of k-means. the worst case i found so far was a difference of about 10% when classifying a test dataset based on centroids found via k-means on the gpu.
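
One way to put a number on the gap between two single-precision results is to count how many representable floats lie between them. The following is only a host-side sketch: the helper names `ordered_bits` and `ulp_distance` are made up for illustration, and it assumes both inputs are finite (no NaN or infinity handling).

```cuda
#include <cstdint>
#include <cstring>

// Hypothetical helper: map a float's bit pattern to an integer that grows
// monotonically with the float's value. Assumes a finite input.
static int64_t ordered_bits(float x)
{
    int32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // well-defined way to read the bits
    // Floats are sign-magnitude: mirror negative values onto negative integers.
    return (bits < 0) ? -(int64_t)(bits & 0x7fffffff) : (int64_t)bits;
}

// Hypothetical helper: distance between a and b in representable floats.
// 0 means bit-identical (or +0 versus -0), 1 means adjacent floats.
static int64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return (d < 0) ? -d : d;
}
```

Copying the GPU result and the CPU reference back to the host and calling `ulp_distance(gpu_len, cpu_len)` on each pair would give a per-vector error figure instead of a yes/no comparison.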

i did not investigate issues arising from different summation orders. given a dataset with small mean difference per dimension and low standard deviation the algorithm will hardly suffer from any noticeable errors. in the case of k-means the order employed on the cpu is the same as on the gpu as each thread is essentially performing a distance measure for 2 vectors on its own. there’s no reduction involved.

When comparing vs. the CPU, make sure that you’re not “accidentally” using the 80-bit floating point unit on the CPU. You can force most compilers to use the 32-bit floating point unit by passing extra compiler flags. The summation order can be a big deal if you have a wide range of values, particularly if those values are then squared. :-) Beyond that, if you sum large vectors of numbers, you’ll get better precision if you sum them in subgroups rather than going through the whole vector at once.
I’ve written a couple of CUDA templates for compensated summation and for native-pair arithmetic, if you’d like to play with those in addition to straight summation.

Cheers,
John Stone
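
The suggestion to sum in subgroups can be sketched roughly like this. It is not the template code mentioned in the post; the function name `blocked_sum` and the subgroup size of 64 are arbitrary choices for illustration.

```cuda
// Sum n floats in fixed-size subgroups: every subgroup gets its own partial
// sum, and the partial sums are then added together. Small values are
// therefore added to other small values before they meet a large total.
__host__ __device__ inline float blocked_sum(const float* v, unsigned n)
{
    const unsigned BLOCK = 64;   // illustrative subgroup size, not a tuned value
    float total = 0.0f;
    for (unsigned start = 0; start < n; start += BLOCK) {
        // Last subgroup may be shorter than BLOCK.
        unsigned end = (n - start < BLOCK) ? n : start + BLOCK;
        float partial = 0.0f;
        for (unsigned i = start; i < end; i++)
            partial += v[i];
        total += partial;
    }
    return total;
}
```

Because the function is marked `__host__ __device__`, the same source can be run on the CPU and inside a kernel, which keeps the summation order identical on both sides when comparing results.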

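The compiler aggressively merges single-precision adds and multiplies into single-precision FMADs. An FMAD truncates the intermediate result of the multiplication instead of rounding it to nearest, which is why its result can differ from a separate multiply followed by an add.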

There is currently no way for users to control FMAD merging. The next version of CUDA will provide __fadd_rn(), __fmul_rn() device functions that get translated into single-precision additions and multiplications that are guaranteed not to be merged into FMADs.
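
Once those device functions are available, the original kernel could be rewritten along these lines. This is a sketch under that assumption: only `__fmul_rn()` and `__fadd_rn()` come from the post above, while the kernel name `kernel_vector_length_no_fmad` and the host wrapper `vector_length_no_fmad` are hypothetical, and error checking is left out.

```cuda
#include <cuda_runtime.h>

// Same loop as kernel_vector_length, but the multiply and the add go through
// intrinsics, so the compiler may not merge them into one FMAD.
__global__ void kernel_vector_length_no_fmad(const float* vector,
                                             unsigned dimensions,
                                             float* length)
{
    float len = 0.0f;
    for (unsigned i = 0; i < dimensions; i++) {
        float sq = __fmul_rn(vector[i], vector[i]); // product, rounded to nearest
        len = __fadd_rn(len, sq);                   // separate add, rounded to nearest
    }
    *length = len;
}

// Hypothetical host wrapper: single-thread launch, no error checks.
float vector_length_no_fmad(const float* host_vec, unsigned n)
{
    float* d_vec = 0;
    float* d_len = 0;
    float result = 0.0f;
    cudaMalloc((void**)&d_vec, n * sizeof(float));
    cudaMalloc((void**)&d_len, sizeof(float));
    cudaMemcpy(d_vec, host_vec, n * sizeof(float), cudaMemcpyHostToDevice);
    kernel_vector_length_no_fmad<<<1, 1>>>(d_vec, n, d_len);
    cudaMemcpy(&result, d_len, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_vec);
    cudaFree(d_len);
    return result;
}
```

Timing this kernel against the plain `len += vector[i] * vector[i]` version would give the speed penalty the original question asks about, without splitting the work into two kernels.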

@tachyon_john: i compare the results not to extended double precision but against single precision CPU computations with round to nearest mode. i would love to play around with the CUDA templates you’ve mentioned.

@mfatica: i couldn’t find any information on the fp-properties of FMAD, so your information is perfect. You don’t know of a citable source for this, do you? edit: well, again i overlooked this in the guide. there’s my source…

I know this is not your primary concern, but speaking of div: Programming Guide v. 1.1 states “Division is implemented via the reciprocal in a non-standard-compliant way” (Ch. A.2, p. 74). Also, the accuracy of reciprocals is 1 ulp, which is a larger error bound than the 0.5 ulp of round-to-nearest.

Hi,

I haven’t had time to really clean these up yet, but I’ve got two template/classes for device code (limited to what one can get away with in CUDA 1.[01] anyway) that implement compensated summation and native-pair arithmetic. The native-pair arithmetic is based on Bailey’s DSFUN90 implementation, along with a few CUDA-specific tweaks I gleaned from one of the functions in the SDK Mandelbrot example that may help when going to global memory. Feedback is welcome. Hopefully I’ll have more time to work on it in the next few weeks. In order to be completely safe, the multiplies and adds in my code will need to be converted to use the __fmul_rn() and __fadd_rn() functions described by Massimiliano, so that future CUDA compilers don’t over-optimize the tricky parts of these codes.

Cheers,

John Stone
dsfuncuda.tar.gz (5.32 KB)
compsum.tar.gz (4.63 KB)
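
For readers without the attachments, here is a rough idea of what native-pair addition looks like. It is modeled on the double-single add routine in DSFUN90 and is not the code from the archives; the names `dsfloat` and `ds_add` are made up for this sketch.

```cuda
// A "native pair": the represented value is hi + lo, where lo holds the
// low-order bits that do not fit into hi.
struct dsfloat { float hi, lo; };

// Add two native pairs, keeping the rounding error of the high-part sum.
__host__ __device__ inline dsfloat ds_add(dsfloat a, dsfloat b)
{
    float t1 = a.hi + b.hi;   // rounded sum of the high parts
    float e  = t1 - a.hi;     // the part of b.hi that actually arrived in t1
    // Rounding error of t1, plus both low parts.
    float t2 = ((b.hi - e) + (a.hi - (t1 - e))) + a.lo + b.lo;
    dsfloat c;
    c.hi = t1 + t2;           // renormalize so that hi carries the leading bits
    c.lo = t2 - (c.hi - t1);
    return c;
}
```

There are no multiplies here, so FMAD merging is not a concern for this routine; it matters for the multiplication counterpart, where the exact product has to be split into a high and a low part.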

I have a slightly relevant question:

I need to perform the following operation: sum += a * b + c * d + e * f + g * h;

Now this can be coded in different ways, few of them listed below in the order of accuracy of result:

1. sum = sum + __fmul_rn(a,b) + __fmul_rn(c,d) + __fmul_rn(e,f) + __fmul_rn(g,h);

2. sum += a * b + c * d + e * f + g * h;

;

__fadd_rn and __fmul_rn were used to increase accuracy by avoiding FMAD. Why does the third way give the least accuracy? I had understood that both __fadd_rn and __fmul_rn were required. What could be the reason?

Thanks,

It is quite likely that #2 is evaluated left to right: (((a * b + c * d) + e * f) + g * h)

However #3 is evaluated in this order: ((a * b + c * d) + (e * f + g * h))

Floating point arithmetic is only approximately associative, so the order in which you add things can change your accuracy.
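
A small sketch of the two groupings makes the effect visible. The function names are invented for this example, and the terms `t0` to `t3` stand in for the four products.

```cuda
// Left-to-right grouping: ((t0 + t1) + t2) + t3
__host__ __device__ inline float sum_left_to_right(float t0, float t1,
                                                   float t2, float t3)
{
    return ((t0 + t1) + t2) + t3;
}

// Pairwise grouping: (t0 + t1) + (t2 + t3)
__host__ __device__ inline float sum_pairwise(float t0, float t1,
                                              float t2, float t3)
{
    return (t0 + t1) + (t2 + t3);
}
```

With t0 = 16777216 (2^24) and t1 = t2 = t3 = 1, and assuming every addition is rounded to nearest in single precision, the left-to-right version returns 16777216 while the pairwise version returns 16777218. The exact sum is 16777219, so neither is exact, and which grouping is closer depends on the data.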

If you are really worried about error accumulation, Kahan summation is a convenient way to limit round-off error in sums.
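
A minimal sketch of Kahan summation, usable from both host and device code. The name `kahan_sum` is chosen for this example, and the code assumes the compiler does not reassociate floating-point operations.

```cuda
// Kahan (compensated) summation: carry the rounding error of each addition
// in a separate variable and feed it back into the next step.
__host__ __device__ inline float kahan_sum(const float* v, unsigned n)
{
    float sum = 0.0f;
    float c = 0.0f;             // running compensation for lost low-order bits
    for (unsigned i = 0; i < n; i++) {
        float y = v[i] - c;     // apply the correction from the previous step
        float t = sum + y;      // low-order bits of y may be lost here
        c = (t - sum) - y;      // recover what was lost (zero in exact arithmetic)
        sum = t;
    }
    return sum;
}
```

For the input {16777216, 1, 1, 1} a plain loop returns 16777216, while this version returns 16777220, which is one of the two floats nearest to the exact sum 16777219. The loop contains only additions and subtractions, so FMAD merging does not affect it.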

Hey thanks so much…this was a very useful input. However, I am still facing some trouble. I am using GeForce 9800GTX+ GPU that supports only single precision floating point operations. I do only around 64 multiplies and 320 accumulates per pixel.

Earlier my problem was that my results from CPU and Device Emulation mode were way apart. I tried different sequences of arithmetic operations, used __fadd_rn and __fmul_rn, and used Kahan summation for at least 256 of these accumulates per pixel, but nothing seemed to reduce the average relative error with respect to the CPU single-precision results. When I normalized my input (image) to [0,1) instead of [0,256), the problem was solved and I got accurate, identical results from the CPU and Device (GPU) emulation.

Now the problem I am facing is that the results from Device Emulation and Device mode are way apart, while the results from Device Emulation and the CPU are identical. Normalization to [0,1) is not working. I checked the intermediate results and saw that the error increases gradually and accumulates to a big one in the end. Now I want to check whether a device that supports double precision arithmetic would help. I don’t have such a device and wondered whether I could emulate one to test my code on. But I cannot find a way to emulate a higher-end device (say a GTX280, which supports double precision arithmetic) to compile and run my .cu code. Can you please suggest how to do it? Or can you suggest some other way of getting the results from CPU/Device emulation and the GPU as close as possible on my device?

Thanks,