Something quite strange appeared in a loop.
For example, consider computing the sum of several values from a global-memory array g_idata (assume it is large enough). I first define a register variable temp = 0, then run the following loop (assume it executes in thread 0):
for (int i = 0; i < 16; i++)
{
    temp = temp + g_idata[i];
}
Here the global data is fetched from contiguous addresses in g_idata, and the result is exactly the same as the one computed on the CPU.
However, when the indices into g_idata are spaced apart, that is:
for (int i = 0; i < 16; i++)
{
    temp = temp + g_idata[i + stride];
}
where stride is a positive integer larger than 1, the resulting value in temp is slightly different from the previous result.
Are there any optimizations in the nvcc compiler that treat these two cases differently?
If so, how can I avoid such optimizations, since I need the exact result?
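For reference, a CPU-side check for the two loops above could look like the sketch below. The element type float, the function name reference_sum, and the offset parameter are assumptions made for illustration; the post does not state the data type or show the host code. The function adds the same 16 elements in the same order as the device loop, so passing offset = 0 mirrors the first loop and offset = stride mirrors the second.

```c
#include <stddef.h>

/* Hypothetical host-side reference (not from the original post).
 * Sums 16 consecutive elements of data, starting at index `offset`,
 * in the same left-to-right order as the device loop.
 * The float element type is an assumption. */
static float reference_sum(const float *data, size_t offset)
{
    float temp = 0.0f;
    for (int i = 0; i < 16; i++)
    {
        temp = temp + data[i + offset];
    }
    return temp;
}
```

Each device result should be compared against the call with the matching offset, because the two loops read different elements whenever stride is nonzero.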