Accumulation Difference between CUDA and CPU?

Hi all,

I’m currently learning to use CUDA for a project and I ran into the following problem:

When accumulating an array to get its sum, the result I get on the GPU differs from the result on the CPU. The following is the CPU code:

[codebox]int i, i1, i2;
struct atom *a1, *a2;
float ei, ej, eij, ri, rj, rij;
float dx, dy, dz, d2;
float Rd6, Rd12, dven, g;

for (i = 0; i < n03; i++)
{
    i1 = list03[2*i];
    a1 = &(ag->atoms[i1]);
    ei = f*(a1->eps03);
    ri = a1->rminh03;

    i2 = list03[2*i+1];
    a2 = &(ag->atoms[i2]);
    ej = a2->eps03;
    rj = a2->rminh03;

    eij = ei*ej;
    rij = ri+rj;
    rij *= rij;

    dx = ag->atoms[i1].X - ag->atoms[i2].X;
    dy = ag->atoms[i1].Y - ag->atoms[i2].Y;
    dz = ag->atoms[i1].Z - ag->atoms[i2].Z;
    d2 = dx*dx + dy*dy + dz*dz;

    Rd6 = rij/d2;
    Rd6 = Rd6*Rd6*Rd6;
    Rd12 = Rd6*Rd6;

    /* accumulate the pair energy */
    (*ven) += eij*(Rd12 - 2*Rd6);

    /* gradient contributions */
    dven = -eij*12*(-Rd12 + Rd6)/d2;
    g = dven*dx;
    (a1->GX) += g;
    (a2->GX) -= g;
    g = dven*dy;
    (a1->GY) += g;
    (a2->GY) -= g;
    g = dven*dz;
    (a1->GZ) += g;
    (a2->GZ) -= g;
}[/codebox]

Here the variable (*ven) is where the accumulated value is stored.

The following is the GPU version of the function above:

[codebox]__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void vdweng_assign(int numPairs, float* atom_X, float* atom_Y, float* atom_Z,
                              float* atom_GX, float* atom_GY, float* atom_GZ, float* evdw,
                              int* firstAtomIndex_assign, int* secondAtomIndex_assign,
                              float rc2, float* atom_eps, float* atom_rminh, float* partial_result)
{
    const int tid = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    const int threadN = IMUL(blockDim.x, gridDim.x);

    __shared__ float evdw_sh_first[NUM_THREADS_PER_BLOCK];

    int i, j;
    int i1, i2;
    int it, kt;
    float x1, y1, z1;
    float dx, dy, dz;
    float d2;
    int ij;
    int my_i1;
    float ri, rj, ei, ej, rij, eij;
    float Rd6, Rd12, Rr6, Rr12, dr6, dven, g;

    //float evdw_total;
    int blockId, threadId;
    int numBlocks, numThreadsPerBlock;
    blockId = blockIdx.x;
    numBlocks = gridDim.x;
    threadId = threadIdx.x;
    numThreadsPerBlock = blockDim.x;

    int shared_memory_index = threadId;

    for (i = 0; i < NUM_BLOCKS; i++)
        evdw_nop[i] = 0;

    for (i = tid; i < numPairs; i += threadN) {
        //printf("Thread %d getting pair %d of %d, i1=%d, i2=%d\n", tid, i, numPairs);
        my_i1 = firstAtomIndex_assign[i];
        //printf("thread %d getting i1 = %d, n2 = %d\n", tid, my_i1, my_i2);
        evdw_sh_first[shared_memory_index] = 0.0;

        if (my_i1 >= 0) {
            i1 = my_i1;
            i2 = secondAtomIndex_assign[i];

            x1 = atom_X[i1];                     //(1)
            y1 = atom_Y[i1];
            z1 = atom_Z[i1];

            dx = x1 - atom_X[i2];
            dy = y1 - atom_Y[i2];
            dz = z1 - atom_Z[i2];
            d2 = dx*dx + dy*dy + dz*dz;
            //printf("Thread id = %d, i1 = %d, i2 = %d\n", tid, i1, i2);

            if (d2 < rc2) {
                ri = atom_rminh[i1];
                rj = atom_rminh[i2];
                ei = atom_eps[i1];
                ej = atom_eps[i2];
                eij = ei*ej;
                rij = (ri+rj)*(ri+rj);
                Rd6 = (rij/d2)*(rij/d2)*(rij/d2);
                Rd12 = Rd6*Rd6;
                Rr6 = (rij/rc2)*(rij/rc2)*(rij/rc2);
                Rr12 = Rr6*Rr6;
                dr6 = (d2/rc2)*(d2/rc2)*(d2/rc2);

                // per-pair energy goes into this thread's shared-memory slot
                evdw_sh_first[shared_memory_index] = eij*(Rd12-2*Rd6+Rr6*(4.0-2*dr6)+Rr12*(2*dr6-3.0));
                //printf("Thread id = %d, i1 = %d, i2 = %d, evdw = %f\n", tid, i1, i2, eij*(Rd12-2*Rd6+Rr6*(4.0-2*dr6)+Rr12*(2*dr6-3.0)));
                printf("%f\n", eij*(Rd12-2*Rd6+Rr6*(4.0-2*dr6)+Rr12*(2*dr6-3.0)));

                dven = -eij*12*(-Rd12+Rd6+dr6*(Rr12-Rr6))/d2;
                g = dven*dx;
                atom_GX[i1] += g;
                atom_GX[i2] -= g;
                g = dven*dy;
                atom_GY[i1] += g;
                atom_GY[i2] -= g;
                g = dven*dz;
                atom_GZ[i1] += g;
                atom_GZ[i2] -= g;
            }
            else {
                //printf("Thread id = %d, i1 = %d, i2 = %d, evdw = 0\n", tid, i1, i2);
            }
        }

        __syncthreads();

        // @hxdai: first thread in each block is responsible for accumulating all the pairs in this block
        if (threadIdx.x == 0) {
            partial_result[blockIdx.x] = 0;
            for (i = 0; i < NUM_THREADS_PER_BLOCK; i++) {
                partial_result[blockIdx.x] += evdw_sh_first[i];
            }
            //printf("block %d's total is %f\n", blockIdx.x, partial_result[blockIdx.x]);

            // make this block's partial result visible to all blocks before signalling
            __threadfence();
            printf("partial result of block %d is %f\n", blockIdx.x, partial_result[blockIdx.x]);

            unsigned int value = atomicInc(&count, gridDim.x);
            isLastBlockDone = (value == gridDim.x - 1);
        }

        __syncthreads();

        // the last block to finish adds up all the partial results
        if (isLastBlockDone) {
            if (threadIdx.x == 0) {
                for (i = 0; i < numBlocks; i++)
                    *evdw += partial_result[i];
                count = 0;
            }
        }
        //__syncthreads();
    }       // end of for loop
}[/codebox]

Here *evdw is the final destination of the accumulated value.

I used the synchronization method from the programming guide (__threadfence() plus an atomic block counter) to coordinate the accumulation across blocks.

I then ran the code in EmuRelease mode, and the result from the GPU is larger than that from the CPU. Both runs use the same input data, so there shouldn't be a difference.

FYI, I'm using a Dell workstation with a dual-core 3.06 GHz Xeon CPU and VS2008.

Any ideas? Thanks!

Have you read that floating-point operations on the GPU can give slightly different results than the same calculations on the CPU?
Run the SDK example “Reduction” with float numbers and you'll see a small difference between the CPU and GPU results.
Hope this will help you.
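As a quick standalone illustration (made-up values, nothing to do with the SDK sample itself), even on the CPU alone the order of float additions changes the result:

[codebox]#include <stdio.h>

/* Adding many small terms to one big value one at a time loses them entirely,
   while summing the small terms first preserves them. */
int main(void)
{
    float big = 1.0e8f;
    float one_by_one = big;
    float small_first = 0.0f;

    for (int i = 0; i < 10000; i++) {
        one_by_one += 0.001f;      /* big + tiny: the tiny term is rounded away */
        small_first += 0.001f;     /* tiny + tiny: accumulates accurately */
    }

    printf("one by one : %f\n", one_by_one);         /* stays at 100000000 */
    printf("small first: %f\n", big + small_first);  /* close to 100000010 */
    return 0;
}[/codebox]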

Thank you tluisrs for the reply.

I did look at the Reduction code and found that, when using float, kernels 1 and 2 gave the same results on the CPU and GPU; starting from kernel 3, the results differ. However, I don't think adding 4~5k numbers should give a 10% error, which is what my code shows. Do you think Kahan summation would help? I'm about to try it out.

Could you tell us what device you are using? You should check whether even a simple float sum shows a small error, and don't forget error propagation. I don't know about Kahan summation. I suggest you check the specifications of your device and the error bounds of its float operations. Another suggestion is to use the f suffix in float assignments. Change

evdw_sh_first[shared_memory_index] = 0.0;

To

evdw_sh_first[shared_memory_index] = 0.0f;

This avoids an implicit double-to-float conversion on the device.

Hope this will help you.

[quote name=‘tluisrs’ post=‘580814’ date=‘Aug 20 2009, 05:35 PM’]

Could you tell us what device you are using? You should check whether even a simple float sum shows a small error, and don't forget error propagation. I don't know about Kahan summation. I suggest you check the specifications of your device and the error bounds of its float operations. Another suggestion is to use the f suffix in float assignments.

[/quote]

I also tried doing the accumulation in a separate kernel:

[codebox]__global__ void vdweng_accu(int numPairs, float* single_result, float* partial_result, float *total_evdw)
{
    const int tid = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    const int threadN = IMUL(blockDim.x, gridDim.x);

    __shared__ float tmp_partial[NUM_THREADS_PER_BLOCK];

    int i;
    int index;

    // load this thread's value into shared memory
    tmp_partial[threadIdx.x] = single_result[tid];
    __syncthreads();

    // @hxdai: add the numbers in this block together (tree reduction)
    for (i = 1; i < blockDim.x; i *= 2) {
        index = 2*i*threadIdx.x;
        if (index < blockDim.x)
            tmp_partial[index] += tmp_partial[index+i];
        __syncthreads();
    }

    // @hxdai: first thread in the block moves the block total to global memory
    if (threadIdx.x == 0) {
        //printf("from kernel block %d: final evdw = %f\n", blockIdx.x, tmp_partial[0]);
        partial_result[blockIdx.x] = tmp_partial[0];
        __threadfence();
    }
}[/codebox]

This is basically the reduction code from the SDK sample. Any idea what's going wrong?

The CPU is using 80-bit floating point, and the GPU isn’t. Try accumulating with double precision or something.
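For instance, on the host side that could look something like the sketch below (sum_in_double and terms are illustrative names, not from the code above):

[codebox]/* Sketch: sum float terms in a double accumulator. */
double sum_in_double(const float *terms, int n)
{
    double acc = 0.0;                /* 53-bit mantissa instead of 24 */
    for (int i = 0; i < n; i++)
        acc += (double)terms[i];     /* promote each term before adding */
    return acc;                      /* round back to float only at the very end, if needed */
}[/codebox]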

If there is a difference between CPU and GPU accumulations, quite often that difference is due to the GPU being more precise than the CPU. This is an algorithmic, not hardware, difference.
Basically, if you sum x0+x1+x2+x3… sequentially, the error accumulates, and by the time you're adding x10000 you are adding one large number to one small one, which loses floating-point precision.

But if you use a parallel algorithm, then as a byproduct of its calculation the values are combined in a tree, sort of like:

y0 = x0 + x1,  y1 = x2 + x3

y0 + y1

And this is actually ideal for dealing with round-off, since every addition combines two values built from a similar number of terms, which are therefore likely to have similar magnitudes, and therefore less precision is lost.

You can certainly rearrange your additions on the CPU to follow the same pattern and get a similar precision boost, but people usually don't bother because it's a pain.
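To make this concrete, a pairwise version of the CPU accumulation could look roughly like the sketch below. The function and array names (pairwise_sum, vals) are illustrative; in the real code the values would be the eij*(Rd12-2*Rd6) terms that currently go straight into (*ven).

[codebox]/* Sketch: recursive pairwise (tree) summation, mimicking the GPU reduction order. */
float pairwise_sum(const float *vals, int n)
{
    if (n <= 4) {                       /* small base case: plain sequential loop */
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += vals[i];
        return s;
    }
    int half = n / 2;
    /* each addition combines two sub-sums built from a similar number of terms */
    return pairwise_sum(vals, half) + pairwise_sum(vals + half, n - half);
}[/codebox]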

I tried with double precision and got basically the same result.

To be honest, I hadn't thought about this before, but it's definitely possible. The project I'm working on adds small numbers (mostly around the -1e-6 level) to an existing value (which is about -18000), so it's quite possible that the CPU code simply adds each calculated value to the big number one at a time, losing some precision, while in the GPU code all the small numbers are added together before being added to the big one. I'll try Kahan summation in the CPU code to see if it gets better. Thank you for the inspiration.

Update: This IS the problem. By using Kahan summation in the CPU code, the result from the CPU is the same as from the GPU (not EXACTLY, but within an acceptable error margin). Thanks to everyone who helped.
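For anyone who runs into the same thing, the compensated (Kahan) version of the CPU accumulation looks roughly like the sketch below. The names kahan_sum and terms are illustrative; the terms would be the eij*(Rd12-2*Rd6) values that the original loop adds into (*ven).

[codebox]/* Sketch: Kahan (compensated) summation of the per-pair energies. */
float kahan_sum(const float *terms, int n)
{
    float sum = 0.0f;   /* running total */
    float c   = 0.0f;   /* running compensation for lost low-order bits */
    for (int i = 0; i < n; i++) {
        float y = terms[i] - c;   /* subtract what was lost on the previous add */
        float t = sum + y;        /* sum is big, y is small: low-order bits of y are lost */
        c = (t - sum) - y;        /* (t - sum) recovers the rounded y; c is the lost part */
        sum = t;
    }
    return sum;
}[/codebox]

One caveat: aggressive floating-point optimizations (e.g. /fp:fast in Visual Studio or -ffast-math in gcc) can optimize the compensation away, so the compensated loop may need stricter floating-point settings.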