Why these two scripts give different results?

hi Folks;

I am very confused on why these two scripts give different results. The are basically a double loop. One is the CUDA version, the other is the c version. The CUDA version is modified from the c version. Here is what they are:

c version:

for(j = 0;j < p->Ns;j++)

	{	/* initialize */

		arr = 1;ari = 0;brr = 0;bri = 0;

		afr = p->afr[j];afi = p->afi[j];bfr = p->bfr[j];bfi = p->bfi[j];

		 

		auxar = p->auxar[j];auxai = p->auxai[j];		

		auxbr = p->auxbr[j];auxbi = p->auxbi[j];

		

		x = p->x[j];

		if(Ndim > 1) y = p->y[j];

		if(Ndim > 2) z = p->z[j];

		for(i = p->Nt-1;i >= 0;i--){

			// pre-setup; 

			phs = x*gx[i];

			if(Ndim > 1) phs += y*gy[i];

			if(Ndim > 2) phs += z*gz[i];

			zr = cos(phs/2.0);

			zi = sin(phs/2.0);

			

			// here is what makes the difference in results

			art = afr*zr + afi*zi;

			afr = art;	

	

			// db1

			brbfr = -(brr*bfi + bri*bfr)/2.0;

			brbfi =  (brr*bfr - bri*bfi)/2.0;

			db1r = brbfr*auxbr + brbfi*auxbi;

			db1i = brbfr*auxbi - brbfi*auxbr;

			// db2

			afarr = -(afi*arr - afr*ari)/2.0;

			afari =  (afr*arr + afi*ari)/2.0;

			db2r = afarr*auxbr + afari*auxbi;

			db2i = afarr*auxbi - afari*auxbr;		

			// da1

			brafr = (brr*afi - bri*afr)/2.0;

			brafi = (brr*afr + bri*afi)/2.0;

			da1r = brafr*auxar + brafi*auxai;

			da1i = brafr*auxai - brafi*auxar;

			// da2

			arbfr = -(arr*bfi + ari*bfr)/2.0;

			arbfi =  (arr*bfr - ari*bfi)/2.0;

			da2r = arbfr*auxar - arbfi*auxai;

			da2i = arbfr*auxai + arbfi*auxar;

			

			// drfr drfi

			drfr[i] += da2r + db2r + da1r + db1r;

			drfi[i] += da2i + db2i - da1i - db1i;

	  

			// updates

			art = arr*c - (brr*sr + bri*si);

			ait = ari*c - (brr*si - bri*sr);				  

			brr = arr*sr + ari*si + brr*c;

			bri = -ari*sr + arr*si + bri*c;

			arr = art;ari = ait;     

		} /* i */

	} /* j */

And here is the CUDA verision:

for (int i = Nt-1; i>=0; i--)

{

    /* initialize */               

    tid = threadIdx.x;

    j = blockIdx.x*(blockSize) + tid;

    gridSize = blockSize*gridDim.x;

    drfrdata[tid] = 0;   

	drfidata[tid] = 0;			

		

    gx = g[i]; 

	if(Ndim > 1) gy = g[Nt+i];

	if(Ndim > 2) gz = g[2*Nt+i];

    __syncthreads();

while (j < Ns)

    {	// pre-setup; 

		arr = 1; ari = 0; brr = 0; bri = 0;

		afrt = afr[j]; afit = afi[j]; bfrt = bfr[j]; bfit = bfi[j];

			

		auxart = auxar[j]; auxait = auxai[j]; 

		auxbrt = auxbr[j]; auxbit = auxbi[j];

			

		phs = gx*x[j];		

		if(Ndim > 1) phs += gy*x[Ns+j];

		if(Ndim > 2) phs += gz*x[2*Ns+j];

			

		zr = cos(phs/2.0);

		zi = sin(phs/2.0);

		

		// here is what makes the difference in results

		art =  afrt*zr + afit*zi;

		afrt = art;	

		// db1 

		brbfr = -(brr*bfit + bri*bfrt)/2.0;

		brbfi =  (brr*bfrt - bri*bfit)/2.0;

		db1r = brbfr*auxbrt + brbfi*auxbit;

		db1i = brbfr*auxbit - brbfi*auxbrt;

		// db2 

		afarr = -(afit*arr - afrt*ari)/2.0;

		afari =  (afrt*arr + afit*ari)/2.0;

		db2r = afarr*auxbrt + afari*auxbit;

		db2i = afarr*auxbit - afari*auxbrt;

		// da1 

		brafr =  (brr*afit - bri*afrt)/2.0;

		brafi = -(brr*afrt + bri*afit)/2.0;

		da1r = brafr*auxart - brafi*auxait;

		da1i = brafr*auxait + brafi*auxart;

		// da2 

		arbfr = -(arr*bfit + ari*bfrt)/2.0;

		arbfi =  (arr*bfrt - ari*bfit)/2.0;

		da2r = arbfr*auxart - arbfi*auxait;

		da2i = arbfr*auxait + arbfi*auxart;

// drfr drfi

		drfrdata[tid] += da2r + db2r + da1r + db1r;

		drfidata[tid] += da2i + db2i - da1i - db1i;	

				

		// updates

		art = arr*c - (brr*sr + bri*si);

		ait = ari*c - (brr*si - bri*sr);

		brr =  arr*sr + ari*si + brr*c;

		bri = -ari*sr + arr*si + bri*c;

		arr = art;ari = ait;

} /* j */

    __syncthreads();

if (blockSize >= 512) { if (tid < 256) { drfrdata[tid] += drfrdata[tid + 256]; drfidata[tid] += drfidata[tid + 256];} __syncthreads(); }

    if (blockSize >= 256) { if (tid < 128) { drfrdata[tid] += drfrdata[tid + 128]; drfidata[tid] += drfidata[tid + 128];} __syncthreads(); }

    if (blockSize >= 128) { if (tid < 64)  { drfrdata[tid] += drfrdata[tid + 64]; drfidata[tid] += drfidata[tid + 64];}  __syncthreads(); }

    if (tid < 32)

    {   

		warpReduce(drfrdata, tid);  

		warpReduce(drfidata, tid);				

    }

	__syncthreads();

		

    if (tid == 0) 

	{ // atomicAdd 

		atomicAdd(&drfr[i],drfrdata[0]); 

		atomicAdd(&drfi[i],drfidata[0]);		

	}

} /* i */

First I did not show all the definition and declaration of all variables here, but they are defined and assigned with same value.

It is basically a double loop (i and j dimension); Using this double I am trying to fill a 2D matrix, and then sum up along one of the dimensions, j, to get a vector of length Nt;

In the c file, the outside loop down-counts from Nt to 1, and the inner loop generates a new value in j dimension for each j. And this new value will be added to its previous value. That is how the final vector is got.

In the CUDA file, the reduction algorithm, which I did not show here, is used to sum all elements in j dimension.

The problem is that the following code causes the difference in results;

in c file:

// here is what makes the difference in results

			art = afr*zr + afi*zi;

			afr = art;

in CUDA:

// here is what makes the difference in results

		art =  afrt*zr + afit*zi;

		afrt = art;

If these lines are removed, the two scripts generate same results, but the results are different if the lines are added.

The two scripts are fed with same input, and they are functionally equivalent with each other with only parameter name changed.

Any one could give some hint about this? I am trying to make the CUDA give same result as the c file does. I don’t understand why the CUDA gives a different result. How do I change it?

I am not sure if I provided enough detail for you to figure out. If you need, let me know. Thank you!

Really appreciate any suggestions.