Problem: puzzled by double operations - another mysterious DOUBLE behavior :( or CUDA awesomeness #2

Hello,

I have the following code in my kernel.

double value1 = (dStake/nWinners);
double value2 = dStake*(dPrice - 1.0);
int nIndex = 0;

for (int jjk = 0; jjk < nWinners; jjk++)
{
    nIndex = ((combinations >> (((WinnerLegVariation >> (jjk*4)) & 15ULL) * 4)) & 15ULL);
    localresults[nIndex*2] += value1;
    localresults[nIndex*2 + 1] += (value1 + value2*aPrices[nIndex*c_nSelections]/dSumP);
}

If I rearrange value2 to look like this:

double value2 = dStake*(dPrice - 1.0)/dSumP;

then localresults[nIndex*2 + 1] becomes

localresults[nIndex*2 + 1] += (value1 + value2*aPrices[nIndex*c_nSelections]);

and the whole code after that looks like this:

double value1 = (dStake/nWinners);
double value2 = dStake*(dPrice - 1.0)/dSumP; // <<<<<<<<<<<
int nIndex = 0;

for (int jjk = 0; jjk < nWinners; jjk++)
{
    nIndex = ((combinations >> (((WinnerLegVariation >> (jjk*4)) & 15ULL) * 4)) & 15ULL);
    localresults[nIndex*2] += value1;
    localresults[nIndex*2 + 1] += (value1 + value2*aPrices[nIndex*c_nSelections]);
}

However, I do not get the same results in the latter case?!

Some results differ at the 10th decimal digit, and I have no idea why…

Now the funniest part of all :D This really makes me laugh :)

If value2 gets a little rearrangement,

double value2 = dStake/dSumP*(dPrice - 1.0); instead of double value2 = dStake*(dPrice - 1.0)/dSumP;

I get a third result set, different from the previous two (in the 10th and 9th decimal digits). However, the big CUDA awesomeness is that the version

double value2 = dStake/dSumP*(dPrice - 1.0);

RUNS TWO SECONDS FASTER :D :D :D than double value2 = dStake*(dPrice - 1.0)/dSumP; :D

So, basically I have two questions:

Q1: Why does folding dSumP into the value2 calculation produce different results?

Q2: How on earth can such a simple change in the order of operations (division and multiplication) make my kernel run faster???

It is all tested on a Tesla C1060, CUDA 4.0 32-bit, Windows Server 2008 R2 64-bit, with Visual Studio 2010.

many thanks

Mirko

Regarding the precision issue:

Floating point numbers are not real numbers. The order of operations can produce small differences in the results even when they would not for real numbers. For example: a * (b + c) != a * b + a * c in general, but they are close. By changing the multiplication order, the rounding operations will happen in a different order and give you a slightly different answer. Then, when you perform many additions, these small differences can accumulate and produce a final answer that is off by 10^-10, even if the terms in the sum are only off by 10^-15.
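To make this concrete, here is a minimal standalone sketch (the values are made up for illustration, not taken from your kernel) showing that the two mathematically identical expressions for value2 round in a different order:

#include <cstdio>

int main()
{
    // Illustrative values only - not the actual data from the post.
    double dStake = 10.0, dPrice = 1.37, dSumP = 3.3;

    // Mathematically identical, but rounded in a different order:
    double a = dStake * (dPrice - 1.0) / dSumP;  // ((dStake * (dPrice - 1.0)) / dSumP)
    double b = dStake / dSumP * (dPrice - 1.0);  // ((dStake / dSumP) * (dPrice - 1.0))

    // On many inputs a and b differ in the last bit or two (~1e-16 relative);
    // accumulated over many additions, such differences can grow to ~1e-10.
    printf("a = %.17g\nb = %.17g\ndiff = %g\n", a, b, a - b);
    return 0;
}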

Regarding the speed issue:

This is speculation, but since you are doing a lot of double precision operations in your kernel, I will assume that you are mostly compute bound. Floating point division is generally quite a bit slower than addition or multiplication. By moving the division out of the loop, you removed the most expensive operation from the loop body, speeding it up significantly.
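To illustrate the hoisting idea, a small standalone sketch (the names mirror your post, but the prices and values are made up); note that multiplying by a precomputed reciprocal is not bit-identical to dividing, which ties back to the precision point above:

#include <cstdio>

int main()
{
    double dStake = 10.0, dPrice = 1.37, dSumP = 3.3;
    double aPrices[4] = {1.5, 2.0, 2.5, 3.0};   // made-up prices

    double value2  = dStake * (dPrice - 1.0);
    double invSumP = 1.0 / dSumP;               // one division, hoisted out of the loop

    double sumDiv = 0.0, sumMul = 0.0;
    for (int jjk = 0; jjk < 4; jjk++)
    {
        sumDiv += value2 * aPrices[jjk] / dSumP;   // a division on every iteration
        sumMul += value2 * aPrices[jjk] * invSumP; // multiplications only
    }

    // The two sums agree mathematically but may differ in the last bits.
    printf("%.17g\n%.17g\n", sumDiv, sumMul);
    return 0;
}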

seibert, many thanks for the fast response. I agree on Q1 and will accept it as it is.

Regarding Q2: I have already moved the division out of the loop so it is performed only once.

With this code the kernel takes 171 seconds to compute ~500 000 000 000 doubles:

double value2 = dStake*(dPrice - 1.0)/dSumP;
int nIndex = 0;

for (int jjk = 0; jjk < nWinners; jjk++)
{
    ...
}

However, with very similar code - the only change being that I moved /dSumP -

double value2 = dStake/dSumP*(dPrice - 1.0);
int nIndex = 0;

for (int jjk = 0; jjk < nWinners; jjk++)
{
    ...
}

the very same calculation takes 169 seconds on the Tesla C1060. That's the thing that puzzles me: why does the latter code give a 2-second improvement?

I will test this on Fermi as well once I get back home.

many thanks

Mirko

I'm not quite sure, it is only an assumption:

I don't know the definitions of your variables, but maybe the result of (dStake/dSumP) can be cached, since the compiler will evaluate expression 1 as ((dStake*(dPrice - 1.0))/dSumP) and expression 2 as ((dStake/dSumP)*(dPrice - 1.0)). In expression 1, the dividend is itself the result of a calculation. As I mentioned, that's only an assumption. How many times did you run the program?

Any source code change can cause slightly different machine code to be generated, which then dynamically interacts with the instruction scheduling inside the GPU to produce small performance differences. For example, the loop following the modified code (which presumably accounts for a large part of the execution time) may be aligned differently in the two versions. As a rule of thumb, I consider performance differences below 2% as not significant, i.e. “noise”.

Thanks njuffa. This performance difference does not appear on the GTX 580, only on the Tesla C1060, so I will pay no attention to it. However, I stumbled upon one more performance issue: if I compile the code for SM20, it runs slower on the GTX 580 than when it is compiled with the SM13 flag. Compiled with SM20 it executes in 54 seconds; compiled with SM13 it executes in 44 seconds, so the difference is significant. I tried adding the compiler switches from the programming guide (chapter 5.4.1), -ftz=true -prec-div=false, but in the end I stayed with SM13.
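For reference, the two builds look roughly like this (mykernel.cu and the output names are placeholders, not my real project files):

nvcc -arch=sm_13 -o app_sm13 mykernel.cu
nvcc -arch=sm_20 -ftz=true -prec-div=false -o app_sm20 mykernel.cu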