Hello,
I have the following code in my kernel:
double value1 = (dStake/nWinners);
double value2 = dStake*(dPrice - 1.0);
int nIndex = 0;
for (int jjk=0; jjk < nWinners; jjk++)
{
nIndex = ((combinations >> (((WinnerLegVariation >> (jjk*4)) & 15ULL) * 4)) & 15ULL);
localresults[nIndex*2] += value1;
localresults[nIndex*2 + 1] += (value1 + value2*aPrices[nIndex*c_nSelections]/dSumP);
}
If I rearrange value2 to look like this:
double value2 = dStake*(dPrice - 1.0)/dSumP;
then localresults[nIndex*2 + 1] looks like this:
localresults[nIndex*2 + 1] += (value1 + value2*aPrices[nIndex*c_nSelections]);
and the whole code after that looks like this:
double value1 = (dStake/nWinners);
double value2 = dStake*(dPrice - 1.0)/dSumP; <<<<<<<<<<<
int nIndex = 0;
for (int jjk=0; jjk < nWinners; jjk++)
{
nIndex = ((combinations >> (((WinnerLegVariation >> (jjk*4)) & 15ULL) * 4)) & 15ULL);
localresults[nIndex*2] += value1;
localresults[nIndex*2 + 1] += (value1 + value2*aPrices[nIndex*c_nSelections]);
}
However, I do not get the same results in the latter case?!
Some results differ at the 10th decimal digit, and I have no idea why…
Now the funniest part of all :D This really makes me laugh :)
If value2 gets a little rearrangement,
<b>double value2 = dStake/dSumP*(dPrice - 1.0);</b> instead of <b>double value2 = dStake*(dPrice - 1.0)/dSumP;</b>
I get a 3rd result set, also different from the previous two (at the 10th and 9th decimal digits). However, the big CUDA awesomeness is that the version
double value2 = dStake/dSumP*(dPrice - 1.0)
RUNS TWO SECONDS FASTER :D :D :D than double value2 = dStake*(dPrice - 1.0)/dSumP; :D
So, basically I have two questions:
Q1: Why does folding dSumP into the value2 calculation produce different results?
Q2: How on earth can that simple change in the order of operations (division and multiplication) make my kernel run faster???
It is all tested on a Tesla C1060, CUDA 4.0 32-bit, Windows Server 2008 R2 64-bit, Visual Studio 2010.
Many thanks,
Mirko