Precision Problems

I am having issues with some sort of random error propagating through my results. My algorithm works on the CPU but the GPU results deviate. After the first time step, I notice a few errors on the least significant digit of a few elements when compared to the CPU implementation. The errors keep getting worse every time step. Worse yet, if I run the same code twice with the same number of time steps, the errors change! This leads me to believe that there is some sort of rounding error, although one would think it would round the same way every time.

I can’t figure out my problem after many hours and I can’t look objectively at my code anymore. I need some fresh eyes, and I appreciate your help!!

ALGORITHM DESCRIPTION

I am simulating a vibrating membrane using three time steps, t0,t1,t2. I use t0 and t1 to find t2. I shuffle the pointers around to effectively throw out t0 and to use the two most recent time steps to calculate the next one, and so on.

[b]

CALLING FUNCTION:[/b]

void timeStep_Wrap(int num_steps, REAL** t0, REAL **t1, REAL**t2, size_t pitch, int sx, int sy, REAL k1){

	dim3 dimBlock(BLOCKSIZE, BLOCKSIZE);

        dim3 dimGrid(sx/dimBlock.x+1, sy/dimBlock.y+1);

	GPUTimeStep<<<dimGrid, dimBlock>>>(num_steps, *t0, *t1, *t2, pitch, sx, sy, k1);

	sychronizePTR(num_steps, t0, t1, t2);

KERNAL

__global__ void GPUTimeStep(int num_steps, REAL* t0, REAL *t1, REAL*t2, size_t pitch, int sx, int sy, REAL k1){

	REAL k2;			// temp variable to make step look nicer

	REAL * temp;		// temp pointer used to shuffle arrays

	//Block index 

    int bidX = blockIdx.x;

    int bidY = blockIdx.y;

    int bDimX = blockDim.x;

    int bDimY = blockDim.y;

	//Thread index

    int tx = threadIdx.x;

    int ty = threadIdx.y;

size_t real_pitch = pitch/sizeof(REAL);

	

    //Row index of matrices a and c 

    int row = bidY * bDimY + ty;

	//Column index of matrices a and b 

    int col = bidX * bDimX + tx;

	int indice = row*real_pitch + col;

	for(int i=0; i<num_steps; i++){

		if( (row > 0) && (row < (sy-1)) && (col > 0) && (col < (sx-1) ) ){

		k2 =  t1[indice+1] + t1[indice-1] + t1[indice+real_pitch ] \

					  + t1[indice - real_pitch] - (REAL)4.0*t1[indice];

				

				t2[indice] = (REAL)2.0*t1[indice] - t0[indice] + k1*k2;

		}

		// SHuffle pointers

		temp = t0;

		t0 = t1;

		t1 = t2;

		t2 = temp;

	}

}

CPU IMPLEMENTATION

void CPUTimeStep(ARRAY3D &Array, REAL k1){

	double k2;

	//Apply0BC(Array.ARRAY[2], Array.sx, Array.sy);		// apply BCs

	for (int iy = 1; iy < (Array.sy-1); iy++){

		for (int ix = 1; ix < (Array.sx-1); ix++){

			k2 =  Array.Array[1][iy][ix+1] + Array.Array[1][iy][ix-1] + Array.Array[1][iy+1][ix] \

				  + Array.Array[1][iy-1][ix] - 4.0*Array.Array[1][iy][ix];

			Array.Array[2][iy][ix] = 2.0*Array.Array[1][iy][ix]- Array.Array[0][iy][ix] + k1*k2;

		}

	}

}

Ok… By randomly reading threads in this forum, I saw a syncthread() post and then it hit me! The threads are not synchronized!! The random error symptom fits! I quickly tried syncthreads() but that didn’t work (I must be implementing wrong). I tried a less efficient, one time step kernel which I call many times, and wham, the error is no longer random. I am getting “more correct” results. I think I have some small discrepancy that I can find now that my error isn’t random.

now to figure out why syncthreads() doesn’t work…

You are not using shared memory, so the synchthread will not work for you.

Your pointer swapping is probably the source of the error.
All the blocks may not run at the same time and after the first batch completes the kernel , the next batches will use inconsistent data.

You are absolutely correct. The implementation done here is perfect, except for the pointer swapping without synchronization. As it seems, there is no global memory synchronization. This implementation is for a class, and I need one global memory implementation and one shared. For the global implementation, I will use the host as synchronization. Repeatedly call the kernel to execute one time step from the host. As of now, this is the only way I know how to synchronize global memory. There may be other more efficient ways, but I don’t have time for that right now.

Thanks