Why do these two scripts give different results?

Hi folks,

I am very confused about why these two scripts give different results. They are basically a double loop; one is the CUDA version and the other is the C version, and the CUDA version was adapted from the C version. Here is what they are:

C version:

for(j = 0; j < p->Ns; j++)
{   /* initialize */
    arr = 1; ari = 0; brr = 0; bri = 0;
    afr = p->afr[j]; afi = p->afi[j]; bfr = p->bfr[j]; bfi = p->bfi[j];

    auxar = p->auxar[j]; auxai = p->auxai[j];
    auxbr = p->auxbr[j]; auxbi = p->auxbi[j];

    x = p->x[j];
    if(Ndim > 1) y = p->y[j];
    if(Ndim > 2) z = p->z[j];

    for(i = p->Nt-1; i >= 0; i--){
        // pre-setup
        phs = x*gx[i];
        if(Ndim > 1) phs += y*gy[i];
        if(Ndim > 2) phs += z*gz[i];
        zr = cos(phs/2.0);
        zi = sin(phs/2.0);

        // here is what makes the difference in results
        art = afr*zr + afi*zi;
        afr = art;

        // db1
        brbfr = -(brr*bfi + bri*bfr)/2.0;
        brbfi =  (brr*bfr - bri*bfi)/2.0;
        db1r = brbfr*auxbr + brbfi*auxbi;
        db1i = brbfr*auxbi - brbfi*auxbr;
        // db2
        afarr = -(afi*arr - afr*ari)/2.0;
        afari =  (afr*arr + afi*ari)/2.0;
        db2r = afarr*auxbr + afari*auxbi;
        db2i = afarr*auxbi - afari*auxbr;
        // da1
        brafr = (brr*afi - bri*afr)/2.0;
        brafi = (brr*afr + bri*afi)/2.0;
        da1r = brafr*auxar + brafi*auxai;
        da1i = brafr*auxai - brafi*auxar;
        // da2
        arbfr = -(arr*bfi + ari*bfr)/2.0;
        arbfi =  (arr*bfr - ari*bfi)/2.0;
        da2r = arbfr*auxar - arbfi*auxai;
        da2i = arbfr*auxai + arbfi*auxar;

        // drfr drfi
        drfr[i] += da2r + db2r + da1r + db1r;
        drfi[i] += da2i + db2i - da1i - db1i;

        // updates
        art = arr*c - (brr*sr + bri*si);
        ait = ari*c - (brr*si - bri*sr);
        brr = arr*sr + ari*si + brr*c;
        bri = -ari*sr + arr*si + bri*c;
        arr = art; ari = ait;
    } /* i */
} /* j */

And here is the CUDA version:

for (int i = Nt-1; i >= 0; i--)
{
    /* initialize */
    tid = threadIdx.x;
    j = blockIdx.x*(blockSize) + tid;
    gridSize = blockSize*gridDim.x;
    drfrdata[tid] = 0;
    drfidata[tid] = 0;

    gx = g[i];
    if(Ndim > 1) gy = g[Nt+i];
    if(Ndim > 2) gz = g[2*Nt+i];
    __syncthreads();

    while (j < Ns)
    {   // pre-setup
        arr = 1; ari = 0; brr = 0; bri = 0;
        afrt = afr[j]; afit = afi[j]; bfrt = bfr[j]; bfit = bfi[j];

        auxart = auxar[j]; auxait = auxai[j];
        auxbrt = auxbr[j]; auxbit = auxbi[j];

        phs = gx*x[j];
        if(Ndim > 1) phs += gy*x[Ns+j];
        if(Ndim > 2) phs += gz*x[2*Ns+j];

        zr = cos(phs/2.0);
        zi = sin(phs/2.0);

        // here is what makes the difference in results
        art  = afrt*zr + afit*zi;
        afrt = art;
        // db1
        brbfr = -(brr*bfit + bri*bfrt)/2.0;
        brbfi =  (brr*bfrt - bri*bfit)/2.0;
        db1r = brbfr*auxbrt + brbfi*auxbit;
        db1i = brbfr*auxbit - brbfi*auxbrt;
        // db2
        afarr = -(afit*arr - afrt*ari)/2.0;
        afari =  (afrt*arr + afit*ari)/2.0;
        db2r = afarr*auxbrt + afari*auxbit;
        db2i = afarr*auxbit - afari*auxbrt;
        // da1
        brafr =  (brr*afit - bri*afrt)/2.0;
        brafi = -(brr*afrt + bri*afit)/2.0;
        da1r = brafr*auxart - brafi*auxait;
        da1i = brafr*auxait + brafi*auxart;
        // da2
        arbfr = -(arr*bfit + ari*bfrt)/2.0;
        arbfi =  (arr*bfrt - ari*bfit)/2.0;
        da2r = arbfr*auxart - arbfi*auxait;
        da2i = arbfr*auxait + arbfi*auxart;

        // drfr drfi
        drfrdata[tid] += da2r + db2r + da1r + db1r;
        drfidata[tid] += da2i + db2i - da1i - db1i;

        // updates
        art = arr*c - (brr*sr + bri*si);
        ait = ari*c - (brr*si - bri*sr);
        brr =  arr*sr + ari*si + brr*c;
        bri = -ari*sr + arr*si + bri*c;
        arr = art; ari = ait;

        j += gridSize;   // grid-stride: advance to the next j handled by this thread
    } /* j */
    __syncthreads();

    // block-level tree reduction of the per-thread partial sums in shared memory
    if (blockSize >= 512) { if (tid < 256) { drfrdata[tid] += drfrdata[tid + 256]; drfidata[tid] += drfidata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { drfrdata[tid] += drfrdata[tid + 128]; drfidata[tid] += drfidata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  { drfrdata[tid] += drfrdata[tid + 64];  drfidata[tid] += drfidata[tid + 64];  } __syncthreads(); }
    if (tid < 32)
    {
        warpReduce(drfrdata, tid);
        warpReduce(drfidata, tid);
    }
    __syncthreads();

    if (tid == 0)
    {   // atomicAdd the block's partial sum into the global result for this i
        atomicAdd(&drfr[i], drfrdata[0]);
        atomicAdd(&drfi[i], drfidata[0]);
    }
} /* i */

First, I did not show the definitions and declarations of all the variables here, but they are defined and assigned the same values in both versions.

It is basically a double loop (over the i and j dimensions). With this double loop I am trying to fill a 2D matrix and then sum along one of its dimensions, j, to get a vector of length Nt.
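Schematically, ignoring the actual arithmetic, both versions are supposed to compute something like this (value(i, j) is just a placeholder for the da/db terms in the code above):

/* schematic only: value(i, j) stands for the da2r + db2r + da1r + db1r terms */
for (j = 0; j < Ns; j++)          /* conceptual Nt x Ns matrix, one column per j */
    for (i = Nt-1; i >= 0; i--)
        drf[i] += value(i, j);    /* summing over j gives a vector of length Nt  */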

In the C file, the outer loop runs over j, and the inner loop down-counts over i from Nt-1 to 0; each new value is added into drfr[i]/drfi[i], which is how the final vector is obtained.

In the CUDA file, the reduction algorithm (the warpReduce function is not shown here) is used to sum all the elements along the j dimension.
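(For reference, warpReduce follows the standard unrolled last-warp reduction from the NVIDIA reduction example. The sketch below is that standard pattern; the float element type is an assumption, and it is not necessarily character-for-character what is in my kernel.)

// Standard unrolled last-warp reduction (sketch).
// The pointer must be volatile so the compiler reloads shared memory between steps.
// Assumes blockSize >= 64, so that sdata[tid + 32] is in range, as in the checks above.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}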

The problem is that the following lines cause the difference in results.

In the C file:

// here is what makes the difference in results
art = afr*zr + afi*zi;
afr = art;

In the CUDA file:

// here is what makes the difference in results
art  = afrt*zr + afit*zi;
afrt = art;

If these lines are removed, the two scripts generate the same results, but the results differ when the lines are present.

The two scripts are fed the same input, and they are meant to be functionally equivalent, with only variable names changed.

Could anyone give me a hint about this? I am trying to make the CUDA version give the same result as the C file does, and I don’t understand why it gives a different one. How should I change it?

I am not sure whether I have provided enough detail for you to figure this out; if you need more, let me know. Thank you!

Really appreciate any suggestions.

Serial (CPU) and parallel (GPU) reductions of float variables are not guaranteed to give (exactly) the same results, because floating-point addition is not associative: summing the same values in a different order can round differently.
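For example, here is a minimal host-side illustration (the values are arbitrary, picked only to make the rounding visible):

#include <stdio.h>

int main(void)
{
    /* The same three values summed in two different orders. */
    float a = 1.0e8f, b = -1.0e8f, c = 3.14f;
    float s1 = (a + b) + c;   /* 0 + 3.14f -> 3.14f                   */
    float s2 = a + (b + c);   /* b + c rounds back to -1.0e8f -> 0.0f */
    printf("%.7f  %.7f\n", s1, s2);   /* prints 3.1400001  0.0000000  */
    return 0;
}

A serial loop adds the terms in index order, while a tree reduction adds them in a different order, so differences on the order of the rounding error are expected.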

So if I want the CUDA code to give the same result as the C code, do you have any suggestions on how I should modify it?

It may be an effect of FMA instructions.
If you are using CUDA 4.2, there is a flag to disable them (I don’t remember it off the top of my head, something like -nofma).

The relevant nvcc flag is -fmad=false (the default is -fmad=true, which allows the compiler to merge a floating-point multiply with a dependent floating-point add into a single FMAD, FFMA, or DFMA instruction for better performance). Please consult the following whitepaper for general background on the question “Why are my results on the GPU different from my results on the CPU?”:

http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
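For example, the flag just goes on the nvcc command line along with your usual options (the file name and target architecture below are placeholders):

nvcc -O2 -arch=sm_20 -fmad=false -o mykernel mykernel.cu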

Hey:

Thanks a lot for your replies.

I tried what you suggested, but it did not work; the CUDA file gives the same results as before.

I actually don’t think the problem is caused by floating-point arithmetic. I think something is wrong in my CUDA program.

I mean, something goes wrong once zr and zi are brought into the program. But aren’t they computed in each j (thread) in parallel? So shouldn’t they give the same result as the C file, in which zr and zi are also computed for every i and j? The only difference is that CUDA processes the j values simultaneously, whereas the CPU processes them sequentially.

I am very confused. Any other suggestions? Thank you.