I am having issues with some sort of random error propagating through my results. My algorithm works on the CPU but the GPU results deviate. After the first time step, I notice a few errors on the least significant digit of a few elements when compared to the CPU implementation. The errors keep getting worse every time step. Worse yet, if I run the same code twice with the same number of time steps, the errors change! This leads me to believe that there is some sort of rounding error, although one would think it would round the same way every time.

I can’t figure out my problem after many hours and I can’t look objectively at my code anymore. I need some fresh eyes, and I appreciate your help!!

**ALGORITHM DESCRIPTION**

I am simulating a vibrating membrane using three time steps, t0,t1,t2. I use t0 and t1 to find t2. I shuffle the pointers around to effectively throw out t0 and to use the two most recent time steps to calculate the next one, and so on.

[b]

CALLING FUNCTION:[/b]

```
void timeStep_Wrap(int num_steps, REAL** t0, REAL **t1, REAL**t2, size_t pitch, int sx, int sy, REAL k1){
dim3 dimBlock(BLOCKSIZE, BLOCKSIZE);
dim3 dimGrid(sx/dimBlock.x+1, sy/dimBlock.y+1);
GPUTimeStep<<<dimGrid, dimBlock>>>(num_steps, *t0, *t1, *t2, pitch, sx, sy, k1);
sychronizePTR(num_steps, t0, t1, t2);
```

**KERNAL**

```
__global__ void GPUTimeStep(int num_steps, REAL* t0, REAL *t1, REAL*t2, size_t pitch, int sx, int sy, REAL k1){
REAL k2; // temp variable to make step look nicer
REAL * temp; // temp pointer used to shuffle arrays
//Block index
int bidX = blockIdx.x;
int bidY = blockIdx.y;
int bDimX = blockDim.x;
int bDimY = blockDim.y;
//Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
size_t real_pitch = pitch/sizeof(REAL);
//Row index of matrices a and c
int row = bidY * bDimY + ty;
//Column index of matrices a and b
int col = bidX * bDimX + tx;
int indice = row*real_pitch + col;
for(int i=0; i<num_steps; i++){
if( (row > 0) && (row < (sy-1)) && (col > 0) && (col < (sx-1) ) ){
k2 = t1[indice+1] + t1[indice-1] + t1[indice+real_pitch ] \
+ t1[indice - real_pitch] - (REAL)4.0*t1[indice];
t2[indice] = (REAL)2.0*t1[indice] - t0[indice] + k1*k2;
}
// SHuffle pointers
temp = t0;
t0 = t1;
t1 = t2;
t2 = temp;
}
}
```

**CPU IMPLEMENTATION**

```
void CPUTimeStep(ARRAY3D &Array, REAL k1){
double k2;
//Apply0BC(Array.ARRAY[2], Array.sx, Array.sy); // apply BCs
for (int iy = 1; iy < (Array.sy-1); iy++){
for (int ix = 1; ix < (Array.sx-1); ix++){
k2 = Array.Array[1][iy][ix+1] + Array.Array[1][iy][ix-1] + Array.Array[1][iy+1][ix] \
+ Array.Array[1][iy-1][ix] - 4.0*Array.Array[1][iy][ix];
Array.Array[2][iy][ix] = 2.0*Array.Array[1][iy][ix]- Array.Array[0][iy][ix] + k1*k2;
}
}
}
```