hi Folks;

I am very confused on why these two scripts give different results. The are basically a double loop. One is the CUDA version, the other is the c version. The CUDA version is modified from the c version. Here is what they are:

c version:

```
for(j = 0;j < p->Ns;j++)
{ /* initialize */
arr = 1;ari = 0;brr = 0;bri = 0;
afr = p->afr[j];afi = p->afi[j];bfr = p->bfr[j];bfi = p->bfi[j];
auxar = p->auxar[j];auxai = p->auxai[j];
auxbr = p->auxbr[j];auxbi = p->auxbi[j];
x = p->x[j];
if(Ndim > 1) y = p->y[j];
if(Ndim > 2) z = p->z[j];
for(i = p->Nt-1;i >= 0;i--){
// pre-setup;
phs = x*gx[i];
if(Ndim > 1) phs += y*gy[i];
if(Ndim > 2) phs += z*gz[i];
zr = cos(phs/2.0);
zi = sin(phs/2.0);
// here is what makes the difference in results
art = afr*zr + afi*zi;
afr = art;
// db1
brbfr = -(brr*bfi + bri*bfr)/2.0;
brbfi = (brr*bfr - bri*bfi)/2.0;
db1r = brbfr*auxbr + brbfi*auxbi;
db1i = brbfr*auxbi - brbfi*auxbr;
// db2
afarr = -(afi*arr - afr*ari)/2.0;
afari = (afr*arr + afi*ari)/2.0;
db2r = afarr*auxbr + afari*auxbi;
db2i = afarr*auxbi - afari*auxbr;
// da1
brafr = (brr*afi - bri*afr)/2.0;
brafi = (brr*afr + bri*afi)/2.0;
da1r = brafr*auxar + brafi*auxai;
da1i = brafr*auxai - brafi*auxar;
// da2
arbfr = -(arr*bfi + ari*bfr)/2.0;
arbfi = (arr*bfr - ari*bfi)/2.0;
da2r = arbfr*auxar - arbfi*auxai;
da2i = arbfr*auxai + arbfi*auxar;
// drfr drfi
drfr[i] += da2r + db2r + da1r + db1r;
drfi[i] += da2i + db2i - da1i - db1i;
// updates
art = arr*c - (brr*sr + bri*si);
ait = ari*c - (brr*si - bri*sr);
brr = arr*sr + ari*si + brr*c;
bri = -ari*sr + arr*si + bri*c;
arr = art;ari = ait;
} /* i */
} /* j */
```

And here is the CUDA verision:

```
for (int i = Nt-1; i>=0; i--)
{
/* initialize */
tid = threadIdx.x;
j = blockIdx.x*(blockSize) + tid;
gridSize = blockSize*gridDim.x;
drfrdata[tid] = 0;
drfidata[tid] = 0;
gx = g[i];
if(Ndim > 1) gy = g[Nt+i];
if(Ndim > 2) gz = g[2*Nt+i];
__syncthreads();
while (j < Ns)
{ // pre-setup;
arr = 1; ari = 0; brr = 0; bri = 0;
afrt = afr[j]; afit = afi[j]; bfrt = bfr[j]; bfit = bfi[j];
auxart = auxar[j]; auxait = auxai[j];
auxbrt = auxbr[j]; auxbit = auxbi[j];
phs = gx*x[j];
if(Ndim > 1) phs += gy*x[Ns+j];
if(Ndim > 2) phs += gz*x[2*Ns+j];
zr = cos(phs/2.0);
zi = sin(phs/2.0);
// here is what makes the difference in results
art = afrt*zr + afit*zi;
afrt = art;
// db1
brbfr = -(brr*bfit + bri*bfrt)/2.0;
brbfi = (brr*bfrt - bri*bfit)/2.0;
db1r = brbfr*auxbrt + brbfi*auxbit;
db1i = brbfr*auxbit - brbfi*auxbrt;
// db2
afarr = -(afit*arr - afrt*ari)/2.0;
afari = (afrt*arr + afit*ari)/2.0;
db2r = afarr*auxbrt + afari*auxbit;
db2i = afarr*auxbit - afari*auxbrt;
// da1
brafr = (brr*afit - bri*afrt)/2.0;
brafi = -(brr*afrt + bri*afit)/2.0;
da1r = brafr*auxart - brafi*auxait;
da1i = brafr*auxait + brafi*auxart;
// da2
arbfr = -(arr*bfit + ari*bfrt)/2.0;
arbfi = (arr*bfrt - ari*bfit)/2.0;
da2r = arbfr*auxart - arbfi*auxait;
da2i = arbfr*auxait + arbfi*auxart;
// drfr drfi
drfrdata[tid] += da2r + db2r + da1r + db1r;
drfidata[tid] += da2i + db2i - da1i - db1i;
// updates
art = arr*c - (brr*sr + bri*si);
ait = ari*c - (brr*si - bri*sr);
brr = arr*sr + ari*si + brr*c;
bri = -ari*sr + arr*si + bri*c;
arr = art;ari = ait;
} /* j */
__syncthreads();
if (blockSize >= 512) { if (tid < 256) { drfrdata[tid] += drfrdata[tid + 256]; drfidata[tid] += drfidata[tid + 256];} __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { drfrdata[tid] += drfrdata[tid + 128]; drfidata[tid] += drfidata[tid + 128];} __syncthreads(); }
if (blockSize >= 128) { if (tid < 64) { drfrdata[tid] += drfrdata[tid + 64]; drfidata[tid] += drfidata[tid + 64];} __syncthreads(); }
if (tid < 32)
{
warpReduce(drfrdata, tid);
warpReduce(drfidata, tid);
}
__syncthreads();
if (tid == 0)
{ // atomicAdd
atomicAdd(&drfr[i],drfrdata[0]);
atomicAdd(&drfi[i],drfidata[0]);
}
} /* i */
```

First I did not show all the definition and declaration of all variables here, but they are defined and assigned with same value.

It is basically a double loop (i and j dimension); Using this double I am trying to fill a 2D matrix, and then sum up along one of the dimensions, j, to get a vector of length Nt;

In the c file, the outside loop down-counts from Nt to 1, and the inner loop generates a new value in j dimension for each j. And this new value will be added to its previous value. That is how the final vector is got.

In the CUDA file, the reduction algorithm, which I did not show here, is used to sum all elements in j dimension.

The problem is that the following code causes the difference in results;

in c file:

```
// here is what makes the difference in results
art = afr*zr + afi*zi;
afr = art;
```

in CUDA:

```
// here is what makes the difference in results
art = afrt*zr + afit*zi;
afrt = art;
```

If these lines are removed, the two scripts generate same results, but the results are different if the lines are added.

The two scripts are fed with same input, and they are functionally equivalent with each other with only parameter name changed.

Any one could give some hint about this? I am trying to make the CUDA give same result as the c file does. I don’t understand why the CUDA gives a different result. How do I change it?

I am not sure if I provided enough detail for you to figure out. If you need, let me know. Thank you!

Really appreciate any suggestions.