Did you also initialize the high-order bits of arg1 and arg2 before the loop starts?
Here is what I get (normalized by add=1 clock, CUDA 2.3, 9800GX2):
L01 1.0 a^12345
L02 1.0 a>>20
L03 2.0 a+1
L07 2.0 a*0x12345678753121LL
L15 2.0 a^0x123456712345671LL
L16 19.9 a*b
L17 19.9 a>>(int)b
The first xor just operates on the lower bits (the constant fits in 32 bits, so the upper word is untouched), so 1 clock.
Additions need an add instruction on the lower bits, then an add-with-carry on the higher bits, so 2 clocks.
64-bit xors can be split into 2 independent 32-bit xors.
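To make the two decompositions above concrete, here is a minimal host-side C sketch. It is only an illustration of the idea, not the code the compiler actually emits, and the helper names add64_via_32 and xor64_via_32 are made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: 64-bit add built from two 32-bit adds. */
uint64_t add64_via_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo    = a_lo + b_lo;          /* plain add on the low words       */
    uint32_t carry = (lo < a_lo);          /* 1 if the low add wrapped around  */
    uint32_t hi    = a_hi + b_hi + carry;  /* add-with-carry on the high words */

    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: 64-bit xor as two 32-bit xors with no dependency
   between them, since xor never propagates anything across bit positions. */
uint64_t xor64_via_32(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a ^ (uint32_t)b;
    uint32_t hi = (uint32_t)(a >> 32) ^ (uint32_t)(b >> 32);

    return ((uint64_t)hi << 32) | lo;
}
```

The add needs the carry from the low half before the high half can finish, while the two xor halves are independent. Both still cost two 32-bit instructions, which matches the 2 clocks measured.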
The right shift by a constant is more complex, but can be explained: you xor together the results of all shifts. A shift by a fixed amount distributes over xor, i.e. (x>>20)^(y>>20) == (x^y)>>20, so you can just xor the unshifted values together and perform the shift only once, after the loop.
So it’s actually surprising the result is not 0.
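The shift/xor identity used above can be checked with a small C sketch. This is a host-side illustration only: the function names are hypothetical, it uses unsigned values (logical shift) to stay within well-defined C, and it assumes a shift amount below 64.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift every element first, then xor the shifted values together. */
uint64_t xor_of_shifts(const uint64_t *v, size_t n, unsigned k)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc ^= v[i] >> k;
    return acc;
}

/* Xor the unshifted values together, then shift once at the end. */
uint64_t shift_of_xors(const uint64_t *v, size_t n, unsigned k)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc ^= v[i];
    return acc >> k;
}
```

Both functions return the same value for any input, which is why a compiler is free to hoist the shift out of a loop like this.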
Then it seems multiplications by constants can be really efficient. Looking at the code, it seems to use the mad-with-carry extensively.
Didn’t try with other numbers, such as numbers specially crafted to trigger a carry-propagation chain.
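For reference, this is roughly how a 64-bit multiply breaks down into 32-bit pieces. It is a plain C sketch of the arithmetic, not the instruction sequence I saw in the generated code, and mul64_via_32 is a made-up name.

```c
#include <stdint.h>

/* Hypothetical helper: low 64 bits of a*b, built only from
   32x32->64 multiplies, adds and shifts. */
uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t low   = a_lo * b_lo;               /* full 64-bit partial product   */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo; /* only its low 32 bits survive  */

    /* a_hi*b_hi would land entirely above bit 63, so it is dropped. */
    return low + (cross << 32);
}
```

Each partial product followed by an accumulate is the kind of step a multiply-add with carry can cover in one instruction.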
Yes. The standard defines exactly what a / b is: basically it’s the floating-point value closest to the exact result of the division.
If you multiply by the reciprocal, you will suffer from an extra intermediate rounding.
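An easy example is 14.0/7.0. The answer is trivially 2.0, which is exactly representable in FP and is what IEEE-754 requires the division to return.
But 1.0/7.0 is not representable in FP, so it gets rounded (slightly down in double precision, slightly up in single).
Then when you multiply back by 14.0, the exact product is no longer 2.0 and has to be rounded a second time. With 7 you get lucky and the second rounding lands back on 2.0, but that is not guaranteed: in double precision, 49.0*(1.0/49.0) gives 0.9999999999999999 instead of 1.0. Oops…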
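If you want to see the double rounding for yourself, here is a minimal host-side C sketch. It assumes IEEE-754 double arithmetic with round-to-nearest and no extended-precision intermediates (e.g. x86-64 with SSE2); the helper names are made up. Whether a particular pair of operands shows a difference depends on how the two roundings fall, so try several; 49.0 is a well-known case where they do differ in double precision.

```c
#include <stdio.h>

/* Hypothetical helper: "division" done as multiplication by the reciprocal. */
double div_via_reciprocal(double a, double b)
{
    volatile double r = 1.0 / b;  /* first rounding; volatile keeps r a stored double */
    return a * r;                 /* second rounding */
}

/* Print both results with enough digits to tell neighbouring doubles apart. */
void compare(double a, double b)
{
    printf("%g/%g = %.17g   %g*(1/%g) = %.17g\n",
           a, b, a / b, a, b, div_via_reciprocal(a, b));
}
```

Calling compare(49.0, 49.0) shows a true division result of exactly 1, while the reciprocal version comes out one ulp below 1.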
Good to know, thanks.
Are you sure it doesn’t get too wild, namely, +infinity?
Roughly speaking, each iteration doubles the values of arg1 and arg2 by adding them together. So after just around 110 iterations you are probably overflowing your float variables.
So I strongly suspect that what you are testing is the code:
if( isinf(x) ) return NANF;
at the beginning of the sinf function… ;)
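The overflow estimate is easy to check on the host. This is a minimal C sketch that assumes IEEE-754 single precision; the function name and the simple x = x + x update are stand-ins for whatever your loop really does, so treat the exact counts as illustrative.

```c
#include <math.h>

/* Hypothetical helper: count how many doublings take a float from
   `start` to +infinity. */
int doublings_until_inf(float start)
{
    volatile float x = start;   /* volatile forces rounding to float each step */
    int n = 0;

    while (n < 1000) {          /* cap guards against start == 0 or NaN */
        float cur = x;
        if (isinf(cur))
            break;
        x = cur + cur;          /* one doubling */
        ++n;
    }
    return n;
}
```

Starting from 1.0f this returns 128, and starting from 262144.0f (2^18) it returns 110. Any benchmark loop longer than that is feeding infinity to sinf for almost all of its iterations.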
Can’t reproduce it here. I get instead:
I52 33.5 a/b
I53 75.1 (int)(((long long)a)/((long long)b ))
(same setup as above)