# Why do these two scripts give different results?

Hi folks,

I am confused about why these two scripts give different results. They are basically the same double loop: one is the CUDA version, the other is the C version. The CUDA version was adapted from the C version. Here they are:

C version:

``````
for (j = 0; j < p->Ns; j++) {
    /* initialize */
    arr = 1; ari = 0; brr = 0; bri = 0;
    afr = p->afr[j]; afi = p->afi[j]; bfr = p->bfr[j]; bfi = p->bfi[j];
    auxar = p->auxar[j]; auxai = p->auxai[j];
    auxbr = p->auxbr[j]; auxbi = p->auxbi[j];
    x = p->x[j];
    if (Ndim > 1) y = p->y[j];
    if (Ndim > 2) z = p->z[j];

    for (i = p->Nt-1; i >= 0; i--) {
        // pre-setup
        phs = x*gx[i];
        if (Ndim > 1) phs += y*gy[i];
        if (Ndim > 2) phs += z*gz[i];
        zr = cos(phs/2.0);
        zi = sin(phs/2.0);

        // here is what makes the difference in results
        art = afr*zr + afi*zi;
        afr = art;

        // db1
        brbfr = -(brr*bfi + bri*bfr)/2.0;
        brbfi =  (brr*bfr - bri*bfi)/2.0;
        db1r = brbfr*auxbr + brbfi*auxbi;
        db1i = brbfr*auxbi - brbfi*auxbr;

        // db2
        afarr = -(afi*arr - afr*ari)/2.0;
        afari =  (afr*arr + afi*ari)/2.0;
        db2r = afarr*auxbr + afari*auxbi;
        db2i = afarr*auxbi - afari*auxbr;

        // da1
        brafr = (brr*afi - bri*afr)/2.0;
        brafi = (brr*afr + bri*afi)/2.0;
        da1r = brafr*auxar + brafi*auxai;
        da1i = brafr*auxai - brafi*auxar;

        // da2
        arbfr = -(arr*bfi + ari*bfr)/2.0;
        arbfi =  (arr*bfr - ari*bfi)/2.0;
        da2r = arbfr*auxar - arbfi*auxai;
        da2i = arbfr*auxai + arbfi*auxar;

        // drfr, drfi
        drfr[i] += da2r + db2r + da1r + db1r;
        drfi[i] += da2i + db2i - da1i - db1i;

        art = arr*c - (brr*sr + bri*si);
        ait = ari*c - (brr*si - bri*sr);
        brr =  arr*sr + ari*si + brr*c;
        bri = -ari*sr + arr*si + bri*c;
        arr = art; ari = ait;
    } /* i */
} /* j */
``````

And here is the CUDA version:

``````
for (int i = Nt-1; i >= 0; i--) {
    /* initialize */
    j = blockIdx.x*blockSize + tid;
    gridSize = blockSize*gridDim.x;
    drfrdata[tid] = 0;
    drfidata[tid] = 0;
    gx = g[i];
    if (Ndim > 1) gy = g[Nt+i];
    if (Ndim > 2) gz = g[2*Nt+i];

    while (j < Ns) {
        // pre-setup
        arr = 1; ari = 0; brr = 0; bri = 0;
        afrt = afr[j]; afit = afi[j]; bfrt = bfr[j]; bfit = bfi[j];
        auxart = auxar[j]; auxait = auxai[j];
        auxbrt = auxbr[j]; auxbit = auxbi[j];
        phs = gx*x[j];
        if (Ndim > 1) phs += gy*x[Ns+j];
        if (Ndim > 2) phs += gz*x[2*Ns+j];
        zr = cos(phs/2.0);
        zi = sin(phs/2.0);

        // here is what makes the difference in results
        art  = afrt*zr + afit*zi;
        afrt = art;

        // db1
        brbfr = -(brr*bfit + bri*bfrt)/2.0;
        brbfi =  (brr*bfrt - bri*bfit)/2.0;
        db1r = brbfr*auxbrt + brbfi*auxbit;
        db1i = brbfr*auxbit - brbfi*auxbrt;

        // db2
        afarr = -(afit*arr - afrt*ari)/2.0;
        afari =  (afrt*arr + afit*ari)/2.0;
        db2r = afarr*auxbrt + afari*auxbit;
        db2i = afarr*auxbit - afari*auxbrt;

        // da1
        brafr =  (brr*afit - bri*afrt)/2.0;
        brafi = -(brr*afrt + bri*afit)/2.0;
        da1r = brafr*auxart - brafi*auxait;
        da1i = brafr*auxait + brafi*auxart;

        // da2
        arbfr = -(arr*bfit + ari*bfrt)/2.0;
        arbfi =  (arr*bfrt - ari*bfit)/2.0;
        da2r = arbfr*auxart - arbfi*auxait;
        da2i = arbfr*auxait + arbfi*auxart;

        // drfr, drfi
        drfrdata[tid] += da2r + db2r + da1r + db1r;
        drfidata[tid] += da2i + db2i - da1i - db1i;

        art = arr*c - (brr*sr + bri*si);
        ait = ari*c - (brr*si - bri*sr);
        brr =  arr*sr + ari*si + brr*c;
        bri = -ari*sr + arr*si + bri*c;
        arr = art; ari = ait;

        j += gridSize;  // grid-stride advance (missing from the post as quoted)
    } /* j */

    if (blockSize >= 512) { if (tid < 256) { drfrdata[tid] += drfrdata[tid + 256]; drfidata[tid] += drfidata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { drfrdata[tid] += drfrdata[tid + 128]; drfidata[tid] += drfidata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  { drfrdata[tid] += drfrdata[tid + 64];  drfidata[tid] += drfidata[tid + 64];  } __syncthreads(); }

    if (tid < 32) {
        warpReduce(drfrdata, tid);
        warpReduce(drfidata, tid);
    }
    if (tid == 0) {
        /* write-back of the block's partial sums was elided in the post */
    }
} /* i */
``````

I did not show the declarations and definitions of all the variables here, but they are declared in both versions and assigned the same values.

The code is basically a double loop over the i and j dimensions. Using this double loop I fill a 2D matrix, and then sum along the j dimension to get a vector of length Nt.

In the C file, the i loop counts down from Nt-1 to 0, and for each j a new value is generated and added to the running total for that i. That is how the final vector is accumulated.

In the CUDA file, a parallel reduction is used instead to sum all the elements along the j dimension.

The problem is that the following lines cause the difference in results.

In the C file:

``````
// here is what makes the difference in results
art = afr*zr + afi*zi;
afr = art;
``````

In CUDA:

``````
// here is what makes the difference in results
art  = afrt*zr + afit*zi;
afrt = art;
``````

If these lines are removed, the two versions generate the same results; with the lines in place, the results differ.

Both versions are fed the same input and are intended to be functionally equivalent, with only the variable names changed.

Could anyone give me a hint about this? I am trying to make the CUDA code give the same result as the C code, and I don't understand why it differs. How should I change it?

I am not sure whether I have provided enough detail; if you need more, let me know. Thank you! I really appreciate any suggestions.

Serial (CPU) and parallel (GPU) reductions of floating-point values are not guaranteed to give exactly the same results, because floating-point addition is not associative and the two versions add the terms in different orders.

So if I want the CUDA code to give the same result as the C code, do you have any suggestions on how I should modify it?

It may be an effect of FMA (fused multiply-add) instructions.
If you are using CUDA 4.2, there is a compiler flag to disable them (I don't remember it off the top of my head; something like -nofma).

The relevant nvcc flag is -fmad=false (the default is -fmad=true, which allows the compiler to merge a floating-point multiply with a dependent floating-point add into a single FMAD, FFMA, or DFMA instruction for better performance). Please consult the NVIDIA whitepaper "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs" for general background on the question "Why are my results on the GPU different from my results on the CPU?"