Same Code (really, it is) - Much Different Results

if ((i == si) && (j == sj) && (k == sk)) {
p[sourcePos] = p[sourcePos] + sourceFn[n] + sourceFn[n-2];
}

        // Insert Source
        p[si][sj][sk] = p[si][sj][sk] + sourceFn[n] - sourceFn[n-2];

Look like it is not right.

Why not?

[sourcePos] is the defined stride address for [si][sj][sk] (see the #defines above).

Or is it because of the ‘if’ condition? (On the CPU it doesn’t make any difference with or without it).

Please do elaborate…

Why not?

[sourcePos] is the defined stride address for [si][sj][sk] (see the #defines above).

Or is it because of the ‘if’ condition? (On the CPU it doesn’t make any difference with or without it).

Please do elaborate…

(drive-by posting without reading the thread)

is this just 80-bit precision on the host and 64-bit precision on the GPU due to x87? or are you not defining -arch sm_13?

(drive-by posting without reading the thread)

is this just 80-bit precision on the host and 64-bit precision on the GPU due to x87? or are you not defining -arch sm_13?

+sourceFn and - sourceFn

+sourceFn and - sourceFn

What a fool I am… :">

How could I have missed that?

I have looked at this code over and over again, and never caught it!

Thank you Lev :). I’ll make the correction, and check again.

tmurray - I am defining sm13 architecture.

What a fool I am… :">

How could I have missed that?

I have looked at this code over and over again, and never caught it!

Thank you Lev :). I’ll make the correction, and check again.

tmurray - I am defining sm13 architecture.

Do not forget about additional threads.

Do not forget about additional threads.

Also consider

instead

if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))
{
    // We are inside the domain
    p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];

    
}

if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; }        // Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; }        // Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; }        // Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary

somelthing like this

if ((i>0) & (i<iDim-1) & (j>0) & (j<jDim-1) & (k>0) & (k<kDim-1))
{
    // We are inside the domain
    p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];

}
else
{
if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; }        // Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; }        // Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; }        // Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary
}

or

if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; goto noinside; }        // Down Boundary

etc

    p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];

for better performance

though compiler may do it itself

also no need of syncthread at the end of the block
and you may combine all slices in one kernell, without cycle by k.
There are some easy optimization options arround.

Also consider

instead

if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))
{
    // We are inside the domain
    p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];

    
}

if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; }        // Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; }        // Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; }        // Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary

somelthing like this

if ((i>0) & (i<iDim-1) & (j>0) & (j<jDim-1) & (k>0) & (k<kDim-1))
{
    // We are inside the domain
    p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];

}
else
{
if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; }        // Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; }        // Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; }        // Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary
}

or

if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; goto noinside; }        // Down Boundary

etc

    p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];

for better performance

though compiler may do it itself

also no need of syncthread at the end of the block
and you may combine all slices in one kernell, without cycle by k.
There are some easy optimization options arround.

If you use x87 on CPU, not SSE2 scalar or vector, result has a bit more precision, cause internal registers of x87 is 80bit wide.

If you use x87 on CPU, not SSE2 scalar or vector, result has a bit more precision, cause internal registers of x87 is 80bit wide.

Thank you much for the help, Lev.

Greatly appreciated.

Thank you much for the help, Lev.

Greatly appreciated.