Accumulation Difference between CUDA and CPU?

Hi all,

I’m currently learning to use CUDA for a project and I ran into the following problem:

When accumulating an array to get its sum, the result I get on the GPU differs from the result on the CPU. The following is the CPU code:

[codebox]int i, i1, i2;
struct atom *a1, *a2;
float ei, ej, eij, ri, rj, rij;
float dx, dy, dz, d2;
float Rd6, Rd12, dven, g;

for (i = 0; i < n03; i++)
{
    i1 = list03[2*i];
    a1 = &(ag->atoms[i1]);
    ei = f*(a1->eps03);
    ri = a1->rminh03;

    i2 = list03[2*i+1];
    a2 = &(ag->atoms[i2]);
    ej = a2->eps03;
    rj = a2->rminh03;

    eij = ei*ej;
    rij = ri+rj;
    rij *= rij;

    dx = ag->atoms[i1].X - ag->atoms[i2].X;
    dy = ag->atoms[i1].Y - ag->atoms[i2].Y;
    dz = ag->atoms[i1].Z - ag->atoms[i2].Z;
    d2 = dx*dx + dy*dy + dz*dz;

    Rd6 = rij/d2;
    Rd6 = Rd6*Rd6*Rd6;
    Rd12 = Rd6*Rd6;

    /* accumulate the pair energy */
    (*ven) += eij*(Rd12 - 2*Rd6);

    /* gradient contributions */
    dven = -eij*12*(-Rd12 + Rd6)/d2;
    g = dven*dx;
    (a1->GX) += g;
    (a2->GX) -= g;
    g = dven*dy;
    (a1->GY) += g;
    (a2->GY) -= g;
    g = dven*dz;
    (a1->GZ) += g;
    (a2->GZ) -= g;
}[/codebox]

Here the variable (*ven) is where the accumulated value is stored.

The following is the GPU version of the function above:

[codebox]__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void vdweng_assign(int numPairs, float* atom_X, float* atom_Y, float* atom_Z,
                              float* atom_GX, float* atom_GY, float* atom_GZ, float* evdw,
                              int* firstAtomIndex_assign, int* secondAtomIndex_assign,
                              float rc2, float* atom_eps, float* atom_rminh, float* partial_result)
{
    const int tid = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    const int threadN = IMUL(blockDim.x, gridDim.x);

    __shared__ float evdw_sh_first[NUM_THREADS_PER_BLOCK];

    int i, j;
    int i1, i2;
    int it, kt;
    float x1, y1, z1;
    float dx, dy, dz;
    float d2;
    int ij;
    int my_i1;
    float ri, rj, ei, ej, rij, eij;
    float Rd6, Rd12, Rr6, Rr12, dr6, dven, g;

    //float evdw_total;
    int blockId, threadId;
    int numBlocks, numThreadsPerBlock;
    blockId = blockIdx.x;
    numBlocks = gridDim.x;
    threadId = threadIdx.x;
    numThreadsPerBlock = blockDim.x;

    int shared_memory_index = threadId;

    for (i = 0; i < NUM_BLOCKS; i++)
        evdw_nop[i] = 0;

    for (i = tid; i < numPairs; i += threadN) {
        //printf("Thread %d getting pair %d of %d, i1=%d, i2=%d\n", tid, i, numPairs);
        my_i1 = firstAtomIndex_assign[i];
        //printf("thread %d getting i1 = %d, n2 = %d\n", tid, my_i1, my_i2);
        evdw_sh_first[shared_memory_index] = 0.0;

        if (my_i1 >= 0) {
            i1 = my_i1;
            i2 = secondAtomIndex_assign[i];

            x1 = atom_X[i1];                     //(1)
            y1 = atom_Y[i1];
            z1 = atom_Z[i1];

            dx = x1 - atom_X[i2];
            dy = y1 - atom_Y[i2];
            dz = z1 - atom_Z[i2];
            d2 = dx*dx + dy*dy + dz*dz;
            //printf("Thread id = %d, i1 = %d, i2 = %d\n", tid, i1, i2);

            if (d2 < rc2) {
                ri = atom_rminh[i1];
                rj = atom_rminh[i2];
                ei = atom_eps[i1];
                ej = atom_eps[i2];
                eij = ei*ej;
                rij = (ri+rj)*(ri+rj);
                Rd6 = (rij/d2)*(rij/d2)*(rij/d2);
                Rd12 = Rd6*Rd6;
                Rr6 = (rij/rc2)*(rij/rc2)*(rij/rc2);
                Rr12 = Rr6*Rr6;
                dr6 = (d2/rc2)*(d2/rc2)*(d2/rc2);

                // per-pair energy goes into this thread's shared-memory slot
                evdw_sh_first[shared_memory_index] = eij*(Rd12-2*Rd6+Rr6*(4.0-2*dr6)+Rr12*(2*dr6-3.0));
                //printf("Thread id = %d, i1 = %d, i2 = %d, evdw = %f\n", tid, i1, i2, eij*(Rd12-2*Rd6+Rr6*(4.0-2*dr6)+Rr12*(2*dr6-3.0)));
                printf("%f\n", eij*(Rd12-2*Rd6+Rr6*(4.0-2*dr6)+Rr12*(2*dr6-3.0)));

                dven = -eij*12*(-Rd12+Rd6+dr6*(Rr12-Rr6))/d2;
                g = dven*dx;
                atom_GX[i1] += g;
                atom_GX[i2] -= g;
                g = dven*dy;
                atom_GY[i1] += g;
                atom_GY[i2] -= g;
                g = dven*dz;
                atom_GZ[i1] += g;
                atom_GZ[i2] -= g;
            }
            else {
                //printf("Thread id = %d, i1 = %d, i2 = %d, evdw = 0\n", tid, i1, i2);
            }
        }

        __syncthreads();

        // @hxdai: first thread in each block is responsible for accumulating all the pairs in this block
        if (threadIdx.x == 0) {
            partial_result[blockIdx.x] = 0;
            for (i = 0; i < NUM_THREADS_PER_BLOCK; i++) {
                partial_result[blockIdx.x] += evdw_sh_first[i];
            }
            //printf("block %d's total is %f\n", blockIdx.x, partial_result[blockIdx.x]);

            // make this block's partial result visible to all blocks before signalling
            __threadfence();
            printf("partial result of block %d is %f\n", blockIdx.x, partial_result[blockIdx.x]);

            unsigned int value = atomicInc(&count, gridDim.x);
            isLastBlockDone = (value == gridDim.x - 1);
        }

        __syncthreads();

        // the last block to finish adds up all the partial results
        if (isLastBlockDone) {
            if (threadIdx.x == 0) {
                for (i = 0; i < numBlocks; i++)
                    *evdw += partial_result[i];
                count = 0;
            }
        }
        //__syncthreads();
    }       // end of for loop
}[/codebox]

Here *evdw is the final destination of the accumulated value.

I used the synchronization method from the programming guide (__threadfence() plus an atomic block counter) to coordinate the accumulation across blocks.

I then ran the code in EmuRelease mode, and the result from the GPU is larger than that from the CPU. Both runs use the same input data, so there shouldn't be a difference.

FYI, I'm using a Dell workstation with a dual-core 3.06 GHz Xeon CPU and VS2008.

Any ideas? Thanks!

Have you read that floating-point operations on the GPU can give slightly different results than the same calculations on the CPU?
Run the SDK example “Reduction” with float numbers and you'll see a small difference between the CPU and GPU results.
Hope this will help you.
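As a quick standalone illustration (made-up values, nothing to do with the SDK sample itself), even on the CPU alone the order of float additions changes the result:

[codebox]#include <stdio.h>

/* Adding many small terms to one big value one at a time loses them entirely,
   while summing the small terms first preserves them. */
int main(void)
{
    float big = 1.0e8f;
    float one_by_one = big;
    float small_first = 0.0f;

    for (int i = 0; i < 10000; i++) {
        one_by_one += 0.001f;      /* big + tiny: the tiny term is rounded away */
        small_first += 0.001f;     /* tiny + tiny: accumulates accurately */
    }

    printf("one by one : %f\n", one_by_one);         /* stays at 100000000 */
    printf("small first: %f\n", big + small_first);  /* close to 100000010 */
    return 0;
}[/codebox]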

Thank you tluisrs for the reply.

I did look at the Reduction code and found that, when using float, kernels 1 and 2 gave the same results on the CPU and GPU; starting from kernel 3, the results differ. However, I don't think adding 4~5k numbers should give a 10% error, which is what my code shows. Do you think Kahan summation would help? I'm about to try it out.

Could you tell us what device you are using? You should check whether even a simple float sum shows a small error, and don't forget error propagation. I don't know about Kahan summation. I suggest you check the specifications of your device and the error bounds of its float operations. Another suggestion is to use the f suffix in float assignments. Change

evdw_sh_first[shared_memory_index] = 0.0;

To

evdw_sh_first[shared_memory_index] = 0.0f;

This avoids an implicit double-to-float conversion on the device.

Hope this will help you.

[quote name=‘tluisrs’ post=‘580814’ date=‘Aug 20 2009, 05:35 PM’]

Could you tell us what device you are using? You should check whether even a simple float sum shows a small error, and don't forget error propagation. I don't know about Kahan summation. I suggest you check the specifications of your device and the error bounds of its float operations. Another suggestion is to use the f suffix in float assignments.

[/quote]

I also tried doing the accumulation in a separate kernel:

[codebox]__global__ void vdweng_accu(int numPairs, float* single_result, float* partial_result, float *total_evdw)
{
    const int tid = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    const int threadN = IMUL(blockDim.x, gridDim.x);

    __shared__ float tmp_partial[NUM_THREADS_PER_BLOCK];

    int i;
    int index;

    // load this thread's value into shared memory
    tmp_partial[threadIdx.x] = single_result[tid];
    __syncthreads();

    // @hxdai: add the numbers in this block together (tree reduction)
    for (i = 1; i < blockDim.x; i *= 2) {
        index = 2*i*threadIdx.x;
        if (index < blockDim.x)
            tmp_partial[index] += tmp_partial[index+i];
        __syncthreads();
    }

    // @hxdai: first thread in the block moves the block total to global memory
    if (threadIdx.x == 0) {
        //printf("from kernel block %d: final evdw = %f\n", blockIdx.x, tmp_partial[0]);
        partial_result[blockIdx.x] = tmp_partial[0];
        __threadfence();
    }
}[/codebox]

This is basically the reduction code from the SDK sample. Any idea what's going wrong?

The CPU is using 80-bit floating point, and the GPU isn’t. Try accumulating with double precision or something.
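For instance, on the host side that could look something like the sketch below (sum_in_double and terms are illustrative names, not from the code above):

[codebox]/* Sketch: sum float terms in a double accumulator. */
double sum_in_double(const float *terms, int n)
{
    double acc = 0.0;                /* 53-bit mantissa instead of 24 */
    for (int i = 0; i < n; i++)
        acc += (double)terms[i];     /* promote each term before adding */
    return acc;                      /* round back to float only at the very end, if needed */
}[/codebox]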

If there is a difference between CPU and GPU accumulations, quite often that difference is due to the GPU being more precise than the CPU. This is an algorithmic, not hardware, difference.
Basically, if you sum x0+x1+x2+x3… sequentially, the error accumulates, and by the time you're adding x10000 you are adding one large number to one small one, which loses floating-point precision.

But if you use a parallel algorithm, then as a byproduct of its calculation the values are combined in a tree, sort of like:

y0 = x0 + x1,  y1 = x2 + x3

y0 + y1

And this is actually ideal for dealing with round-off, since every addition combines two values built from a similar number of terms, which are therefore likely to have similar magnitudes, and therefore less precision is lost.

You can certainly rearrange your additions on the CPU to follow the same pattern and get a similar precision boost, but people usually don't bother because it's a pain.
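To make this concrete, a pairwise version of the CPU accumulation could look roughly like the sketch below. The function and array names (pairwise_sum, vals) are illustrative; in the real code the values would be the eij*(Rd12-2*Rd6) terms that currently go straight into (*ven).

[codebox]/* Sketch: recursive pairwise (tree) summation, mimicking the GPU reduction order. */
float pairwise_sum(const float *vals, int n)
{
    if (n <= 4) {                       /* small base case: plain sequential loop */
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += vals[i];
        return s;
    }
    int half = n / 2;
    /* each addition combines two sub-sums built from a similar number of terms */
    return pairwise_sum(vals, half) + pairwise_sum(vals + half, n - half);
}[/codebox]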

I tried with double precision and got basically the same result.

To be honest, I hadn't thought about this before, but it's definitely possible. The project I'm working on adds small numbers (mostly around the -1e-6 level) to an existing value (which is about -18000), so it's quite possible that the CPU code simply adds each calculated value to the big number one at a time, losing some precision, while in the GPU code all the small numbers are added together before being added to the big one. I'll try Kahan summation in the CPU code to see if it gets better. Thank you for the inspiration.

Update: This IS the problem. By using Kahan summation in the CPU code, the result from the CPU is the same as from the GPU (not EXACTLY, but within an acceptable error margin). Thanks to everyone who helped.
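For anyone who runs into the same thing, the compensated (Kahan) version of the CPU accumulation looks roughly like the sketch below. The names kahan_sum and terms are illustrative; the terms would be the eij*(Rd12-2*Rd6) values that the original loop adds into (*ven).

[codebox]/* Sketch: Kahan (compensated) summation of the per-pair energies. */
float kahan_sum(const float *terms, int n)
{
    float sum = 0.0f;   /* running total */
    float c   = 0.0f;   /* running compensation for lost low-order bits */
    for (int i = 0; i < n; i++) {
        float y = terms[i] - c;   /* subtract what was lost on the previous add */
        float t = sum + y;        /* sum is big, y is small: low-order bits of y are lost */
        c = (t - sum) - y;        /* (t - sum) recovers the rounded y; c is the lost part */
        sum = t;
    }
    return sum;
}[/codebox]

One caveat: aggressive floating-point optimizations (e.g. /fp:fast in Visual Studio or -ffast-math in gcc) can optimize the compensation away, so the compensated loop may need stricter floating-point settings.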