Why do these two scripts give different results?

Hi folks,

I am very confused about why these two scripts give different results. They are basically a double loop; one is the CUDA version and the other is the C version, and the CUDA version was adapted from the C version. Here is what they are:

C version:

for(j = 0; j < p->Ns; j++)
{   /* initialize */
    arr = 1; ari = 0; brr = 0; bri = 0;
    afr = p->afr[j]; afi = p->afi[j]; bfr = p->bfr[j]; bfi = p->bfi[j];

    auxar = p->auxar[j]; auxai = p->auxai[j];
    auxbr = p->auxbr[j]; auxbi = p->auxbi[j];

    x = p->x[j];
    if(Ndim > 1) y = p->y[j];
    if(Ndim > 2) z = p->z[j];

    for(i = p->Nt-1; i >= 0; i--){
        // pre-setup
        phs = x*gx[i];
        if(Ndim > 1) phs += y*gy[i];
        if(Ndim > 2) phs += z*gz[i];
        zr = cos(phs/2.0);
        zi = sin(phs/2.0);

        // here is what makes the difference in results
        art = afr*zr + afi*zi;
        afr = art;

        // db1
        brbfr = -(brr*bfi + bri*bfr)/2.0;
        brbfi =  (brr*bfr - bri*bfi)/2.0;
        db1r = brbfr*auxbr + brbfi*auxbi;
        db1i = brbfr*auxbi - brbfi*auxbr;
        // db2
        afarr = -(afi*arr - afr*ari)/2.0;
        afari =  (afr*arr + afi*ari)/2.0;
        db2r = afarr*auxbr + afari*auxbi;
        db2i = afarr*auxbi - afari*auxbr;
        // da1
        brafr = (brr*afi - bri*afr)/2.0;
        brafi = (brr*afr + bri*afi)/2.0;
        da1r = brafr*auxar + brafi*auxai;
        da1i = brafr*auxai - brafi*auxar;
        // da2
        arbfr = -(arr*bfi + ari*bfr)/2.0;
        arbfi =  (arr*bfr - ari*bfi)/2.0;
        da2r = arbfr*auxar - arbfi*auxai;
        da2i = arbfr*auxai + arbfi*auxar;

        // drfr drfi
        drfr[i] += da2r + db2r + da1r + db1r;
        drfi[i] += da2i + db2i - da1i - db1i;

        // updates
        art = arr*c - (brr*sr + bri*si);
        ait = ari*c - (brr*si - bri*sr);
        brr = arr*sr + ari*si + brr*c;
        bri = -ari*sr + arr*si + bri*c;
        arr = art; ari = ait;
    } /* i */
} /* j */

And here is the CUDA version:

for (int i = Nt-1; i >= 0; i--)
{
    /* initialize */
    tid = threadIdx.x;
    j = blockIdx.x*(blockSize) + tid;
    gridSize = blockSize*gridDim.x;
    drfrdata[tid] = 0;
    drfidata[tid] = 0;

    gx = g[i];
    if(Ndim > 1) gy = g[Nt+i];
    if(Ndim > 2) gz = g[2*Nt+i];
    __syncthreads();

    while (j < Ns)
    {   // pre-setup
        arr = 1; ari = 0; brr = 0; bri = 0;
        afrt = afr[j]; afit = afi[j]; bfrt = bfr[j]; bfit = bfi[j];

        auxart = auxar[j]; auxait = auxai[j];
        auxbrt = auxbr[j]; auxbit = auxbi[j];

        phs = gx*x[j];
        if(Ndim > 1) phs += gy*x[Ns+j];
        if(Ndim > 2) phs += gz*x[2*Ns+j];

        zr = cos(phs/2.0);
        zi = sin(phs/2.0);

        // here is what makes the difference in results
        art  = afrt*zr + afit*zi;
        afrt = art;
        // db1
        brbfr = -(brr*bfit + bri*bfrt)/2.0;
        brbfi =  (brr*bfrt - bri*bfit)/2.0;
        db1r = brbfr*auxbrt + brbfi*auxbit;
        db1i = brbfr*auxbit - brbfi*auxbrt;
        // db2
        afarr = -(afit*arr - afrt*ari)/2.0;
        afari =  (afrt*arr + afit*ari)/2.0;
        db2r = afarr*auxbrt + afari*auxbit;
        db2i = afarr*auxbit - afari*auxbrt;
        // da1
        brafr =  (brr*afit - bri*afrt)/2.0;
        brafi = -(brr*afrt + bri*afit)/2.0;
        da1r = brafr*auxart - brafi*auxait;
        da1i = brafr*auxait + brafi*auxart;
        // da2
        arbfr = -(arr*bfit + ari*bfrt)/2.0;
        arbfi =  (arr*bfrt - ari*bfit)/2.0;
        da2r = arbfr*auxart - arbfi*auxait;
        da2i = arbfr*auxait + arbfi*auxart;

        // drfr drfi
        drfrdata[tid] += da2r + db2r + da1r + db1r;
        drfidata[tid] += da2i + db2i - da1i - db1i;

        // updates
        art = arr*c - (brr*sr + bri*si);
        ait = ari*c - (brr*si - bri*sr);
        brr =  arr*sr + ari*si + brr*c;
        bri = -ari*sr + arr*si + bri*c;
        arr = art; ari = ait;

        j += gridSize;   // grid-stride: advance to the next j handled by this thread
    } /* j */
    __syncthreads();

    // block-level tree reduction of the per-thread partial sums in shared memory
    if (blockSize >= 512) { if (tid < 256) { drfrdata[tid] += drfrdata[tid + 256]; drfidata[tid] += drfidata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { drfrdata[tid] += drfrdata[tid + 128]; drfidata[tid] += drfidata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  { drfrdata[tid] += drfrdata[tid + 64];  drfidata[tid] += drfidata[tid + 64];  } __syncthreads(); }
    if (tid < 32)
    {
        warpReduce(drfrdata, tid);
        warpReduce(drfidata, tid);
    }
    __syncthreads();

    if (tid == 0)
    {   // atomicAdd the block's partial sum into the global result for this i
        atomicAdd(&drfr[i], drfrdata[0]);
        atomicAdd(&drfi[i], drfidata[0]);
    }
} /* i */

First, I did not show the definitions and declarations of all the variables here, but they are defined and assigned the same values in both versions.

It is basically a double loop (over the i and j dimensions). With this double loop I am trying to fill a 2D matrix and then sum along one of its dimensions, j, to get a vector of length Nt.
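Schematically, ignoring the actual arithmetic, both versions are supposed to compute something like this (value(i, j) is just a placeholder for the da/db terms in the code above):

/* schematic only: value(i, j) stands for the da2r + db2r + da1r + db1r terms */
for (j = 0; j < Ns; j++)          /* conceptual Nt x Ns matrix, one column per j */
    for (i = Nt-1; i >= 0; i--)
        drf[i] += value(i, j);    /* summing over j gives a vector of length Nt  */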

In the C file, the outer loop runs over j, and the inner loop down-counts over i from Nt-1 to 0; each new value is added into drfr[i]/drfi[i], which is how the final vector is obtained.

In the CUDA file, the reduction algorithm (the warpReduce function is not shown here) is used to sum all the elements along the j dimension.
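(For reference, warpReduce follows the standard unrolled last-warp reduction from the NVIDIA reduction example. The sketch below is that standard pattern; the float element type is an assumption, and it is not necessarily character-for-character what is in my kernel.)

// Standard unrolled last-warp reduction (sketch).
// The pointer must be volatile so the compiler reloads shared memory between steps.
// Assumes blockSize >= 64, so that sdata[tid + 32] is in range, as in the checks above.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}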

The problem is that the following lines cause the difference in results.

In the C file:

// here is what makes the difference in results
art = afr*zr + afi*zi;
afr = art;

In the CUDA file:

// here is what makes the difference in results
art  = afrt*zr + afit*zi;
afrt = art;

If these lines are removed, the two scripts generate the same results, but the results differ when the lines are present.

The two scripts are fed the same input, and they are meant to be functionally equivalent, with only variable names changed.

Could anyone give me a hint about this? I am trying to make the CUDA version give the same result as the C file does, and I don’t understand why it gives a different one. How should I change it?

I am not sure whether I have provided enough detail for you to figure this out; if you need more, let me know. Thank you!

Really appreciate any suggestions.

Serial (CPU) and parallel (GPU) reductions of float variables are not guaranteed to give (exactly) the same results, because floating-point addition is not associative: summing the same values in a different order can round differently.
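For example, here is a minimal host-side illustration (the values are arbitrary, picked only to make the rounding visible):

#include <stdio.h>

int main(void)
{
    /* The same three values summed in two different orders. */
    float a = 1.0e8f, b = -1.0e8f, c = 3.14f;
    float s1 = (a + b) + c;   /* 0 + 3.14f -> 3.14f                   */
    float s2 = a + (b + c);   /* b + c rounds back to -1.0e8f -> 0.0f */
    printf("%.7f  %.7f\n", s1, s2);   /* prints 3.1400001  0.0000000  */
    return 0;
}

A serial loop adds the terms in index order, while a tree reduction adds them in a different order, so differences on the order of the rounding error are expected.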

So if I want the CUDA code to give the same result as the C code, do you have any suggestions on how I should modify it?

It may be an effect of FMA instructions.
If you are using CUDA 4.2, there is a flag to disable them (I don’t remember it off the top of my head, something like -nofma).

The relevant nvcc flag is -fmad=false (the default is -fmad=true, which allows the compiler to merge a floating-point multiply with a dependent floating-point add into a single FMAD, FFMA, or DFMA instruction for better performance). Please consult the following whitepaper for general background on the question “Why are my results on the GPU different from my results on the CPU?”:

http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
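For example, the flag just goes on the nvcc command line along with your usual options (the file name and target architecture below are placeholders):

nvcc -O2 -arch=sm_20 -fmad=false -o mykernel mykernel.cu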

Hey:

Thanks a lot for your replies.

I tried what you suggested, but it did not work; the CUDA file gives the same results as before.

I actually don’t think the problem is caused by floating-point arithmetic. I think something is wrong in my CUDA program.

I mean, something goes wrong once zr and zi are brought into the program. But aren’t they computed in each j (thread) in parallel? So shouldn’t they give the same result as the C file, in which zr and zi are also computed for every i and j? The only difference is that CUDA processes the j values simultaneously, whereas the CPU processes them sequentially.

I am very confused. Any other suggestions? Thank you.