# Same Code (really, it is) - Much Different Results

Hi All,

Smoke coming out of my ears; Have been struggling with a problem for over a month now.

I am implementing a (very) simple 3D finite difference algorithm using Cuda C on a Tesla C1070.

The code basically loops over all elements in a 3D array and performs some basic arithmetic for each element (5 adds, 1 mul).
This happens iteratively, about 8000 times, where each iteration is calculated based on the data calculated in the iteration before; and so on and so forth.

The values at a specific point in the array are recorded, and this is the output of the algorithm.

This is implemented once on the GPU, and for reference on the CPU. Both codes are double precision floating points.
(sm-13 is enabled, and I am aware of the fact that floating point calculations can deviate from symbolic arithmetic).

The CPU code works like a charm.
The GPU code, however, produces similar results but with a linearly growing error.
That is, a scalar value is somehow recursively being added to each sample in the output.

Imagine the same results, but with a constantly growing shift in magnitude in the GPU code.

Obviously, looking for some variable that is not cleared etc… is the first thing I did.
Nothing there. I promise. The codes are identical (didn’t want to make this post too heavy, but I can strip it down and upload if needed).

Does anybody have a clue why something like this could be happening?

Any help would be highly appreciated.

Cheers,

Jon.

You probably write somewhere a[i]=i instead of a[i]=c[i]

You probably write somewhere a[i]=i instead of a[i]=c[i]

Nope. Searched my code upside down… :)

Any other ideas?

Nope. Searched my code upside down… :)

Any other ideas?

OK.

I am processing a 3D array slice by slice.

Here’s the kernel:

``````__global__ void pKernel(double* p, double* pz, double* pzz, double* sourceFn, double* ir, int iDim, int jDim, int kDim, int si, int sj, int sk, int ri, int rj, int rk, int n, int k)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

double wR = 1.0;

double pCoef = 1.0/3.0;

// Solve for pressure

if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))

{

// We are inside the domain

p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];

}

if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; }		// Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; }		// Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; }		// Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary

// Inject Source

if ((i == si) && (j == sj) && (k == sk)) {

p[sourcePos] = p[sourcePos] + sourceFn[n] + sourceFn[n-2];

}

// Collect at Reciever

if ((i == ri) && (j == rj) && (k == rk)) {

ir[n] = p[recPos]; }

// Wait for all

}
``````

with the striding macros defined as:

``````#define ijk (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))

#define im (((i-1) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))

#define ip (((i+1) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))

#define jm (((i) * (jDim) * (kDim)) + ((j-1) * (kDim)) + (k))

#define jp (((i) * (jDim) * (kDim)) + ((j+1) * (kDim)) + (k))

#define km (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k-1))

#define kp (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k+1))

#define sourcePos (((si) * (jDim) * (kDim)) + ((sj) * (kDim)) + (sk))

#define recPos (((ri) * (jDim) * (kDim)) + ((rj) * (kDim)) + (rk))
``````

And on the host side:

``````// Loop through time

for (int n=2; n<ns; n++) {

// Execute Pressure Kernel

cudaPrintfInit();

// Loop through k-dimension tiles

for (int k=0; k<kDim; k++)

{

pKernel<<<Dg, Db>>>(pCuda, pzCuda, pzzCuda, sourceFnCuda, irCuda, iDim, jDim, kDim, si, sj, sk, ri, rj, rk, n, k);

cutilCheckMsg("Kernel execution failed\n");

cudaPrintfDisplay(stdout, true);

}

cudaPrintfEnd();

// Update temporal memory

cudaMemcpy(pzzCuda, pzCuda, dataSize, cudaMemcpyDeviceToDevice);

cudaMemcpy(pzCuda, pCuda, dataSize, cudaMemcpyDeviceToDevice);

cudaMemset((void**)pCuda, 0, dataSize);

// Status

printf("GPU Iteration n = %i out of %i \r", n, ns);

} // end n
``````

And this is the reference code for the CPU:

``````// Run the Simulation on CPU

for (int n=2; n<ns; n++)

{

for (int i=0; i<iDim; i++)

{

for (int j=0; j<jDim; j++)

{

for (int k=0; k<kDim; k++)

{

//// Where are we?

if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))

{

// We are inside the domain

p[i][j][k] = (pCoef*(pz[i+1][j][k] + pz[i-1][j][k] + pz[i][j+1][k] + pz[i][j-1][k] + pz[i][j][k+1] + pz[i][j][k-1])) - pzz[i][j][k];

}

if (k==0) { p[i][j][k] = ((1+wR)*pz[i][j][k+1]) - wR*pzz[i][j][k]; }		// Down Boundary

if (k==kDim-1) { p[i][j][k] = ((1+wR)*pz[i][j][k-1]) -wR*pzz[i][j][k]; } // Up Boundary

if (j==0) { p[i][j][k] = ((1+wR)*pz[i][j+1][k]) -wR*pzz[i][j][k]; }		// Front Boundary

if (j==jDim-1) { p[i][j][k] = ((1+wR)*pz[i][j-1][k]) -wR*pzz[i][j][k]; } // Back Boundary

if (i==0) { p[i][j][k] = ((1+wR)*pz[i+1][j][k]) -wR*pzz[i][j][k]; }		// Left Boundary

if (i==iDim-1) { p[i][j][k] = ((1+wR)*pz[i-1][j][k]) + (-wR*pzz[i][j][k]); } // Right Boundary

} // end k

} // end j

} // end i

// Insert Source

p[si][sj][sk] = p[si][sj][sk] + sourceFn[n] - sourceFn[n-2];

ir[n] = p[ri][rj][rk];

// Update temporal memory

for (int i=0; i<iDim; i++)

{

for (int j=0; j<jDim; j++)

{

for (int k=0; k<kDim; k++)

{

pzz[i][j][k] = pz[i][j][k];

pz[i][j][k] = p[i][j][k];

}

}

}

// Status

printf("Iteration n = %i out of %i \r", n, ns);

} // end n
``````

OK.

I am processing a 3D array slice by slice.

Here’s the kernel:

``````__global__ void pKernel(double* p, double* pz, double* pzz, double* sourceFn, double* ir, int iDim, int jDim, int kDim, int si, int sj, int sk, int ri, int rj, int rk, int n, int k)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

double wR = 1.0;

double pCoef = 1.0/3.0;

// Solve for pressure

if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))

{

// We are inside the domain

p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];

}

if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; }		// Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; }		// Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; }		// Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary

// Inject Source

if ((i == si) && (j == sj) && (k == sk)) {

p[sourcePos] = p[sourcePos] + sourceFn[n] + sourceFn[n-2];

}

// Collect at Reciever

if ((i == ri) && (j == rj) && (k == rk)) {

ir[n] = p[recPos]; }

// Wait for all

}
``````

with the striding macros defined as:

``````#define ijk (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))

#define im (((i-1) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))

#define ip (((i+1) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))

#define jm (((i) * (jDim) * (kDim)) + ((j-1) * (kDim)) + (k))

#define jp (((i) * (jDim) * (kDim)) + ((j+1) * (kDim)) + (k))

#define km (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k-1))

#define kp (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k+1))

#define sourcePos (((si) * (jDim) * (kDim)) + ((sj) * (kDim)) + (sk))

#define recPos (((ri) * (jDim) * (kDim)) + ((rj) * (kDim)) + (rk))
``````

And on the host side:

``````// Loop through time

for (int n=2; n<ns; n++) {

// Execute Pressure Kernel

cudaPrintfInit();

// Loop through k-dimension tiles

for (int k=0; k<kDim; k++)

{

pKernel<<<Dg, Db>>>(pCuda, pzCuda, pzzCuda, sourceFnCuda, irCuda, iDim, jDim, kDim, si, sj, sk, ri, rj, rk, n, k);

cutilCheckMsg("Kernel execution failed\n");

cudaPrintfDisplay(stdout, true);

}

cudaPrintfEnd();

// Update temporal memory

cudaMemcpy(pzzCuda, pzCuda, dataSize, cudaMemcpyDeviceToDevice);

cudaMemcpy(pzCuda, pCuda, dataSize, cudaMemcpyDeviceToDevice);

cudaMemset((void**)pCuda, 0, dataSize);

// Status

printf("GPU Iteration n = %i out of %i \r", n, ns);

} // end n
``````

And this is the reference code for the CPU:

``````// Run the Simulation on CPU

for (int n=2; n<ns; n++)

{

for (int i=0; i<iDim; i++)

{

for (int j=0; j<jDim; j++)

{

for (int k=0; k<kDim; k++)

{

//// Where are we?

if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))

{

// We are inside the domain

p[i][j][k] = (pCoef*(pz[i+1][j][k] + pz[i-1][j][k] + pz[i][j+1][k] + pz[i][j-1][k] + pz[i][j][k+1] + pz[i][j][k-1])) - pzz[i][j][k];

}

if (k==0) { p[i][j][k] = ((1+wR)*pz[i][j][k+1]) - wR*pzz[i][j][k]; }		// Down Boundary

if (k==kDim-1) { p[i][j][k] = ((1+wR)*pz[i][j][k-1]) -wR*pzz[i][j][k]; } // Up Boundary

if (j==0) { p[i][j][k] = ((1+wR)*pz[i][j+1][k]) -wR*pzz[i][j][k]; }		// Front Boundary

if (j==jDim-1) { p[i][j][k] = ((1+wR)*pz[i][j-1][k]) -wR*pzz[i][j][k]; } // Back Boundary

if (i==0) { p[i][j][k] = ((1+wR)*pz[i+1][j][k]) -wR*pzz[i][j][k]; }		// Left Boundary

if (i==iDim-1) { p[i][j][k] = ((1+wR)*pz[i-1][j][k]) + (-wR*pzz[i][j][k]); } // Right Boundary

} // end k

} // end j

} // end i

// Insert Source

p[si][sj][sk] = p[si][sj][sk] + sourceFn[n] - sourceFn[n-2];

ir[n] = p[ri][rj][rk];

// Update temporal memory

for (int i=0; i<iDim; i++)

{

for (int j=0; j<jDim; j++)

{

for (int k=0; k<kDim; k++)

{

pzz[i][j][k] = pz[i][j][k];

pz[i][j][k] = p[i][j][k];

}

}

}

// Status

printf("Iteration n = %i out of %i \r", n, ns);

} // end n
``````

The code does not include difinition for block size and block number. Are you usre you have not threads outside domain?

The code does not include difinition for block size and block number. Are you usre you have not threads outside domain?

I suggest to split cuda kernel on 3 parts, and debug not complete implementation, but partial. for example, stay only

``````if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; }        // Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; }        // Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; }        // Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary
``````

and compare to cpu version

I suggest to split cuda kernel on 3 parts, and debug not complete implementation, but partial. for example, stay only

``````if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; }        // Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; }        // Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; }        // Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary
``````

and compare to cpu version

Thanks for the tips. I will try splitting the code.

Here’s the grid/block definitions:

``````#define blockDimX 16

#define blockDimY 16

dim3 Db(blockDimX, blockDimY, 1);

dim3 Dg(ceil((float)iDim/blockDimX), ceil((float)jDim/blockDimY));
``````

Thanks for the tips. I will try splitting the code.

Here’s the grid/block definitions:

``````#define blockDimX 16

#define blockDimY 16

dim3 Db(blockDimX, blockDimY, 1);

dim3 Dg(ceil((float)iDim/blockDimX), ceil((float)jDim/blockDimY));
``````

If your problem size is not multiple of 16, you may spawn additional threads behind border. see ceil(idim)/blockdymx. And they may spoil data.

If your problem size is not multiple of 16, you may spawn additional threads behind border. see ceil(idim)/blockdymx. And they may spoil data.

I’m testing for redundant threads, but I tried adjusting the problem size to multiples of 16 just in case;
Still no luck…

The result (a calculated impulse response) looks like there is some accumulated error.

It fits the profile of numerical instability of an algorithm; but at least from a symbolic point of view - the algorithm should be stable (and is stable on the CPU)…

BTW Thanks for taking the time to help. Its appreciated.

I’m testing for redundant threads, but I tried adjusting the problem size to multiples of 16 just in case;
Still no luck…

The result (a calculated impulse response) looks like there is some accumulated error.

It fits the profile of numerical instability of an algorithm; but at least from a symbolic point of view - the algorithm should be stable (and is stable on the CPU)…

BTW Thanks for taking the time to help. Its appreciated.

if ((i == si) && (j == sj) && (k == sk)) {
p[sourcePos] = p[sourcePos] + sourceFn[n] + sourceFn[n-2];
}

``````        // Insert Source
p[si][sj][sk] = p[si][sj][sk] + sourceFn[n] - sourceFn[n-2];
``````

Look like it is not right.