Different Output on Device and in Emulation Mode — what makes the difference between Emu and Dev?

Several people have posted this problem before and there have been solutions, but my case doesn’t seem to match any of them (no shared memory, no kernel error, etc.). Any observation of a potential error source in the code below would be appreciated.

I suspect the last statement in the kernel function, but I have no idea.
Any suggestions are more than welcome. The variables whose declarations are not shown are proven to work fine, and the all-capital identifiers are just macros.

The output (d_xout) in emulation mode is what I expect, while the output on the device is not. Thanks,

============================================================
__global__ void __Test(float * d_dY, float * d_xout, float * d_ytemp, float * d_ZDtemp)
{
// each thread do the same job
int threads = (blockIdx.y*gridDim.x + blockIdx.x) * (blockDim.x*blockDim.y) + threadIdx.y*blockDim.x + threadIdx.x;

float xtemp = 0.0;
float t1, t2, t3, t4, t5, t6;
float Betai,Zi;
float ZS;
int i;
float Ziso;

int k2 = ( threads / (NX_ROI * NZ_ROI) ) % NY_ROI;
int k1 = ( threads / NZ_ROI ) % NX_ROI;
int k3 = ( threads ) % NZ_ROI;

float vx[2], vy[2];
unsigned int idx0, idx1;

unsigned int idxProj = 0;
vx[0] = __cosf( PI + d_alphaList[idxProj] );
vx[1] = __sinf( PI + d_alphaList[idxProj] );
vy[0] = -__cosf( PI / 2.0 + d_alphaList[idxProj] );
vy[1] = -__sinf( PI / 2.0 + d_alphaList[idxProj] );

for(idx0 = 0; idx0 < N_ROWS; idx0++) 
{
    d_ZDtemp[idxProj*N_ROWS + idx0] = d_ZD[idx0] + d_vSourceZ[idxProj];
    for(idx1 = 0; idx1 < N_CHNS; idx1++) 
    {
        d_ytemp[idxProj*N_CHNS*N_ROWS + idx1*N_ROWS + idx0] = d_dY[idx1*N_ROWS*N_PROJ + idx0*N_PROJ + idxProj];
    }
}

t1 = d_vROIx[k2] - d_vSourceX_FFS[idxProj];
t2 = d_vROIy[k1] - d_vSourceY_FFS[idxProj];
t3 = d_vROIz[k3] - d_vSourceZ_FFS[idxProj];
t4 = sqrtf(t1*t1 + t2*t2);
Zi = (RF + RD - d_FLY_dR[idxProj]) / t4 * t3 + d_vSourceZ_FFS[idxProj];
t5 = t1*vx[0] + t2*vx[1];
t6 = t1*vy[0] + t2*vy[1];
Betai = atan(t6/t5);

xtemp = BiInterp2Dpix(d_BD, d_ZDtemp, d_ytemp, Betai, Zi);

d_xout[ (k1*NY_ROI*NZ_ROI) + (k2*NZ_ROI) + (k3) ] += xtemp;

}

void main()
{

__Test<<<10000, 63>>>(d_dY, d_xout, d_ytemp, d_ZDtemp);

}

__global__ void __Test(float * d_dY, float * d_xout, float * d_ytemp, float * d_ZDtemp)
{
    int threads = (blockIdx.y*gridDim.x + blockIdx.x) * (blockDim.x*blockDim.y) + threadIdx.y*blockDim.x + threadIdx.x;

    float xtemp = 0.0;
    float t1, t2, t3, t4, t5, t6;
    float Betai, Zi;
    float ZS;
    int i;
    float Ziso;

    int k2 = ( threads / (NX_ROI * NZ_ROI) ) % NY_ROI;
    int k1 = ( threads / NZ_ROI ) % NX_ROI;
    int k3 = ( threads ) % NZ_ROI;

    float vx[2], vy[2];
    unsigned int idx0, idx1;

    unsigned int idxProj = 0;
    vx[0] = __cosf( PI + d_alphaList[idxProj] );
    vx[1] = __sinf( PI + d_alphaList[idxProj] );
    vy[0] = -__cosf( PI / 2.0 + d_alphaList[idxProj] );
    vy[1] = -__sinf( PI / 2.0 + d_alphaList[idxProj] );

    for(idx0 = 0; idx0 < N_ROWS; idx0++)
    {
        d_ZDtemp[idxProj*N_ROWS + idx0] = d_ZD[idx0] + d_vSourceZ[idxProj];
        for(idx1 = 0; idx1 < N_CHNS; idx1++)
        {
            d_ytemp[idxProj*N_CHNS*N_ROWS + idx1*N_ROWS + idx0] = d_dY[idx1*N_ROWS*N_PROJ + idx0*N_PROJ + idxProj];
        }
    }

    t1 = d_vROIx[k2] - d_vSourceX_FFS[idxProj];
    t2 = d_vROIy[k1] - d_vSourceY_FFS[idxProj];
    t3 = d_vROIz[k3] - d_vSourceZ_FFS[idxProj];
    t4 = sqrtf(t1*t1 + t2*t2);
    Zi = (RF + RD - d_FLY_dR[idxProj]) / t4 * t3 + d_vSourceZ_FFS[idxProj];
    t5 = t1*vx[0] + t2*vx[1];
    t6 = t1*vy[0] + t2*vy[1];
    Betai = atan(t6/t5);

    xtemp = BiInterp2Dpix(d_BD, d_ZDtemp, d_ytemp, Betai, Zi);

    d_xout[ (k1*NY_ROI*NZ_ROI) + (k2*NZ_ROI) + (k3) ] += xtemp;
}

void main()
{
.....
    __Test<<<10000, 63>>>(d_dY, d_xout, d_ytemp, d_ZDtemp);
.....
}

Just to be able to read it a bit better ;)

It looks like all your threads are doing the same computation, so the reason it works in emulation mode is that there the kernel actually runs to completion. In release mode you are probably hitting the 5 second kernel limitation, so no output is being generated at all.

Also, a block size of 63 threads is really not good for performance; as I understand it, it should be a multiple of 32 (or at least a multiple of 16) for good performance.

DenisR, thanks for the help.

I am using two graphics cards, one for display and the other for CUDA, so I am sure there is no 5 second limitation. Yes, all my threads do the same job, and the reason I made the thread block size 63 is that this is an initial version to test correctness, not performance. Thanks for the suggestion. Any idea what causes the different output between emulation mode and the device? Thanks,

Do you check for errors after the kernel runs? Either use CUT_CHECK_ERROR, or cudaThreadSynchronize(); cudaGetLastError(); cudaGetErrorString();
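For reference, a sketch of the check being suggested here, using the runtime API calls named above (the kernel launch line is taken from the poster's code; the printf wording is illustrative):

```cuda
// After the launch, synchronize and ask the runtime for the last error.
__Test<<<10000, 63>>>(d_dY, d_xout, d_ytemp, d_ZDtemp);

cudaThreadSynchronize();                  // wait for the kernel to finish
cudaError_t err = cudaGetLastError();     // pick up any launch/execution error
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));
```

A kernel that silently does nothing (e.g. killed by the watchdog or failed to launch) shows up here, whereas without the check the host just reads back whatever was in d_xout.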

I think such an easy kernel wouldn’t hit the 5 sec limitation.
I have the same question too. When I run in Emu mode it’s OK, but on the device it doesn’t produce results, or the machine even reboots.

MisterAnderson42,

I always appreciate your great help for others. Thanks,

Yes, I checked with CUT_CHECK_ERROR in debug mode and there are no errors after the kernel runs.

Any other possible rotten fish that smells bad? :) Anything is welcome.

byung,

You are writing to d_xout with a += at the end. Is d_xout initialized to the proper value before executing the kernel? It is such a simple thing, but I have made that error before, and the results may be correct in emulation mode because some compilers initialize memory to 0 for you in debug builds.
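A minimal sketch of that fix, assuming d_xout holds one float per ROI voxel (the size expression is a guess from the kernel's indexing, not from the poster):

```cuda
// Zero d_xout on the device before launching, so the += accumulates from 0
// rather than from whatever the allocation happened to contain.
cudaMemset(d_xout, 0, NX_ROI * NY_ROI * NZ_ROI * sizeof(float));
__Test<<<10000, 63>>>(d_dY, d_xout, d_ytemp, d_ZDtemp);
```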

A second thought: each thread is writing to d_xout element (k1*NY_ROI*NZ_ROI) + (k2*NZ_ROI) + (k3). It seems that this should have every thread writing to a different element of d_xout, but it never hurts to double check. You could create a tmp array the length of the number of threads and write d_tmparray[threads] = (k1*NY_ROI*NZ_ROI) + (k2*NZ_ROI) + (k3), then copy d_tmparray back to the host and check that there are no duplicate values.

MisterAnderson42,

I fixed the problem. The error was in an unexpected place. In the code, I used constant memory for d_vROIx, d_vROIy, d_vROIz. When I read them in emulation mode the values were fine, but when I read them in device mode they were not correct. I fixed the problem by declaring those arrays in global memory, but I am still wondering WHY IT GIVES DIFFERENT VALUES. I am using constant memory for many variables and arrays and everything else works fine. I checked the amount of constant memory in the .cubin file and it is well below 64K (actually, 34K is used).

Anybody know of any issues with this?

Check the Programming Guide section on running in emulation mode. This has come up in the forums several times: Intel CPUs execute float arithmetic at 80-bit precision (unless you use SSE). Intermediate values are also stored in that extended precision while in registers (unless you use a bunch of compiler flags to force stricter IEEE conformance). So if you’re seeing differences in floating point values, that’s one reason. Compile your emulation code with different fp compiler flags and you’ll most likely see your values vary too. I’ve seen code and data where the value changed from -8 to 40, the only difference being the compiler flags.

Paulius

Paulius,

It is not a problem of emulation versus device mode any more; the problem turns out to be global memory versus constant memory. You probably misunderstood — please take a look at my comment right above yours.

By the way, do you have any idea why declaring a variable (actually an array) in global memory versus constant memory makes a difference?

Thanks,

How do you copy data to constant memory?

I copied the data as follows:

CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_vROIx, vROIx, sizeof(float)*numY_ROI) );
CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_vROIy, vROIy, sizeof(float)*numX_ROI) );
CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_vROIz, vROIz, sizeof(float)*numZ_ROI) );

and the arrays are declared as:

__device__ __constant__ float d_vROIx[100];
__device__ __constant__ float d_vROIy[100];
__device__ __constant__ float d_vROIz[63];

Anything suspicious there?

Is it related to something like alignment?

Do you really need __device__ __constant__? When I have used constant memory, I have always declared it just __constant__.

You seem to have found the root cause, but your kernel is still sufficiently complicated that there could be an interplay of issues, possibly even CUDA bugs. Start removing parts of the kernel, but keeping the constant memory read and some kind of write to d_xout. If you get down to a simple copy of a value from constant memory to d_xout and it STILL doesn’t work, then there is something funny going on.
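A sketch of what that stripped-down repro might look like (the kernel name and sizes here are hypothetical; only the constant-memory read and the global write are kept):

```cuda
__device__ __constant__ float d_vROIx[100];

// Minimal kernel: copy one constant-memory value straight to global memory.
__global__ void ConstCopyTest(float *d_xout)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    d_xout[t] = d_vROIx[t % 100];
}
```

If the values copied back from d_xout already disagree with emulation for this kernel, the constant-memory path itself is suspect; if they agree, the problem lies in the parts that were removed.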

Reasonable idea. I will try that. Thanks,

You’re right, I did miss what the problem was exactly. I’ve posted a reply in your other thread. One guess is that the compiler may be generating slightly different operation orders, which can affect results due to finite precision.

Paulius