Different Output on Device and in Emulation Mode — what makes the difference between Emu and Dev?

Several people have posted this problem before and there have been solutions, but my case doesn’t seem to match any of them (no shared memory, no kernel error, etc.). Any observation of a potential error source in the code below would be appreciated.

I suspect the last statement in the kernel function, but I have no idea.
Any suggestions are more than welcome. The variables whose declarations are not shown are proven to work fine, and the all-capital identifiers are just macros.

The output (d_xout) in emulation mode is what I expect, while the output on the device is not. Thanks,

============================================================
__global__ void __Test(float * d_dY, float * d_xout, float * d_ytemp, float * d_ZDtemp)
{
// each thread do the same job
int threads = (blockIdx.y*gridDim.x + blockIdx.x) * (blockDim.x*blockDim.y) + threadIdx.y*blockDim.x + threadIdx.x;

float xtemp = 0.0;
float t1, t2, t3, t4, t5, t6;
float Betai,Zi;
float ZS;
int i;
float Ziso;

int k2 = ( threads / (NX_ROI * NZ_ROI) ) % NY_ROI;
int k1 = ( threads / NZ_ROI ) % NX_ROI;
int k3 = ( threads ) % NZ_ROI;

float vx[2], vy[2];
unsigned int idx0, idx1;

unsigned int idxProj = 0;
vx[0] = __cosf( PI + d_alphaList[idxProj] );
vx[1] = __sinf( PI + d_alphaList[idxProj] );
vy[0] = -__cosf( PI / 2.0 + d_alphaList[idxProj] );
vy[1] = -__sinf( PI / 2.0 + d_alphaList[idxProj] );

for(idx0 = 0; idx0 < N_ROWS; idx0++) 
{
    d_ZDtemp[idxProj*N_ROWS + idx0] = d_ZD[idx0] + d_vSourceZ[idxProj];
    for(idx1 = 0; idx1 < N_CHNS; idx1++) 
    {
        d_ytemp[idxProj*N_CHNS*N_ROWS + idx1*N_ROWS + idx0] = d_dY[idx1*N_ROWS*N_PROJ + idx0*N_PROJ + idxProj];
    }
}

t1 = d_vROIx[k2] - d_vSourceX_FFS[idxProj];
t2 = d_vROIy[k1] - d_vSourceY_FFS[idxProj];
t3 = d_vROIz[k3] - d_vSourceZ_FFS[idxProj];
t4 = sqrtf(t1*t1 + t2*t2);
Zi = (RF + RD - d_FLY_dR[idxProj]) / t4 * t3 + d_vSourceZ_FFS[idxProj];
t5 = t1*vx[0] + t2*vx[1];
t6 = t1*vy[0] + t2*vy[1];
Betai = atan(t6/t5);

xtemp = BiInterp2Dpix(d_BD, d_ZDtemp, d_ytemp, Betai, Zi);

d_xout[ (k1*NY_ROI*NZ_ROI) + (k2*NZ_ROI) + (k3) ] += xtemp;

}

void main()
{

__Test<<<10000, 63>>>(d_dY, d_xout, d_ytemp, d_ZDtemp);

}

__global__ void __Test(float * d_dY, float * d_xout, float * d_ytemp, float * d_ZDtemp)
{
    int threads = (blockIdx.y*gridDim.x + blockIdx.x) * (blockDim.x*blockDim.y) + threadIdx.y*blockDim.x + threadIdx.x;

    float xtemp = 0.0;
    float t1, t2, t3, t4, t5, t6;
    float Betai, Zi;
    float ZS;
    int i;
    float Ziso;

    int k2 = ( threads / (NX_ROI * NZ_ROI) ) % NY_ROI;
    int k1 = ( threads / NZ_ROI ) % NX_ROI;
    int k3 = ( threads ) % NZ_ROI;

    float vx[2], vy[2];
    unsigned int idx0, idx1;

    unsigned int idxProj = 0;
    vx[0] = __cosf( PI + d_alphaList[idxProj] );
    vx[1] = __sinf( PI + d_alphaList[idxProj] );
    vy[0] = -__cosf( PI / 2.0 + d_alphaList[idxProj] );
    vy[1] = -__sinf( PI / 2.0 + d_alphaList[idxProj] );

    for(idx0 = 0; idx0 < N_ROWS; idx0++)
    {
        d_ZDtemp[idxProj*N_ROWS + idx0] = d_ZD[idx0] + d_vSourceZ[idxProj];
        for(idx1 = 0; idx1 < N_CHNS; idx1++)
        {
            d_ytemp[idxProj*N_CHNS*N_ROWS + idx1*N_ROWS + idx0] = d_dY[idx1*N_ROWS*N_PROJ + idx0*N_PROJ + idxProj];
        }
    }

    t1 = d_vROIx[k2] - d_vSourceX_FFS[idxProj];
    t2 = d_vROIy[k1] - d_vSourceY_FFS[idxProj];
    t3 = d_vROIz[k3] - d_vSourceZ_FFS[idxProj];
    t4 = sqrtf(t1*t1 + t2*t2);
    Zi = (RF + RD - d_FLY_dR[idxProj]) / t4 * t3 + d_vSourceZ_FFS[idxProj];
    t5 = t1*vx[0] + t2*vx[1];
    t6 = t1*vy[0] + t2*vy[1];
    Betai = atan(t6/t5);

    xtemp = BiInterp2Dpix(d_BD, d_ZDtemp, d_ytemp, Betai, Zi);

    d_xout[ (k1*NY_ROI*NZ_ROI) + (k2*NZ_ROI) + (k3) ] += xtemp;
}

void main()
{
.....
    __Test<<<10000, 63>>>(d_dY, d_xout, d_ytemp, d_ZDtemp);
.....
}

Just to be able to read it a bit better ;)

It looks like all your threads are doing the same computation, so the reason it works in emulation mode is that there the kernel actually runs to completion. In release mode you are probably hitting the 5 second kernel limitation, so no output is being generated at all.

Also, a block size of 63 threads is really not good for performance; as I understand it, it should be a multiple of 32 (or at least a multiple of 16) for good performance.

DenisR, thanks for the help.

I am using two graphics cards, one for display and the other for CUDA, so I am sure there is no 5 second limitation. Yes, all my threads do the same job, and the reason I made the thread block size 63 is that this is an initial version to test correctness, not performance. Thanks for the suggestion. Any idea what causes the different output between emulation mode and the device? Thanks,

Do you check for errors after the kernel runs? Either use CUT_CHECK_ERROR, or cudaThreadSynchronize(); cudaGetLastError(); cudaGetErrorString();
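For reference, a sketch of the check being suggested here, using the runtime API calls named above (the kernel launch line is taken from the poster's code; the printf wording is illustrative):

```cuda
// After the launch, synchronize and ask the runtime for the last error.
__Test<<<10000, 63>>>(d_dY, d_xout, d_ytemp, d_ZDtemp);

cudaThreadSynchronize();                  // wait for the kernel to finish
cudaError_t err = cudaGetLastError();     // pick up any launch/execution error
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));
```

A kernel that silently does nothing (e.g. killed by the watchdog or failed to launch) shows up here, whereas without the check the host just reads back whatever was in d_xout.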

I think such an easy kernel wouldn’t hit the 5 sec limitation.
I have the same question too. When I run in Emu mode it’s OK, but on the device it doesn’t produce results, or the machine even reboots.

MisterAnderson42,

I always appreciate your great help for others. Thanks,

Yes, I checked with CUT_CHECK_ERROR in debug mode and there are no errors after the kernel runs.

Any other possible rotten fish that smells bad? :) Anything is welcome.

byung,

You are writing to d_xout with a += at the end. Is d_xout initialized to the proper value before executing the kernel? It is such a simple thing, but I have made that error before, and the results may be correct in emulation mode because some compilers initialize memory to 0 for you in debug builds.
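A minimal sketch of that fix, assuming d_xout holds one float per ROI voxel (the size expression is a guess from the kernel's indexing, not from the poster):

```cuda
// Zero d_xout on the device before launching, so the += accumulates from 0
// rather than from whatever the allocation happened to contain.
cudaMemset(d_xout, 0, NX_ROI * NY_ROI * NZ_ROI * sizeof(float));
__Test<<<10000, 63>>>(d_dY, d_xout, d_ytemp, d_ZDtemp);
```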

A second thought: each thread is writing to d_xout element (k1*NY_ROI*NZ_ROI) + (k2*NZ_ROI) + (k3). It seems that this should have every thread writing to a different element of d_xout, but it never hurts to double check. You could create a tmp array the length of the number of threads and write d_tmparray[threads] = (k1*NY_ROI*NZ_ROI) + (k2*NZ_ROI) + (k3), then copy d_tmparray back to the host and check that there are no duplicate values.

MisterAnderson42,

I fixed the problem. The error was in an unexpected place. In the code, I used constant memory for d_vROIx, d_vROIy, d_vROIz. When I read them in emulation mode the values were fine, but when I read them in device mode they were not correct. I fixed the problem by declaring those arrays in global memory, but I am still wondering WHY IT GIVES DIFFERENT VALUES. I am using constant memory for many variables and arrays and everything else works fine. I checked the amount of constant memory in the .cubin file and it is well below 64K (actually, 34K is used).

Anybody know of any issues with this?

Check the Programming Guide section on running in emulation mode. This has come up in the forums several times: Intel CPUs execute float arithmetic at 80-bit precision (unless you use SSE). Intermediate values are also stored in that extended precision while in registers (unless you use a bunch of compiler flags to force stricter IEEE conformance). So if you’re seeing differences in floating point values, that’s one reason. Compile your emulation code with different fp compiler flags and you’ll most likely see your values vary too. I’ve seen code and data where the value changed from -8 to 40, the only difference being the compiler flags.

Paulius

Paulius,

It is not a problem of emulation versus device mode any more; the problem turns out to be global memory versus constant memory. You probably misunderstood — please take a look at my comment right above yours.

By the way, do you have any idea why declaring a variable (actually an array) in global memory versus constant memory makes a difference?

Thanks,

How do you copy data to constant memory?

I copied the data as follows:

CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_vROIx, vROIx, sizeof(float)*numY_ROI) );
CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_vROIy, vROIy, sizeof(float)*numX_ROI) );
CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_vROIz, vROIz, sizeof(float)*numZ_ROI) );

and the arrays are declared as:

__device__ __constant__ float d_vROIx[100];
__device__ __constant__ float d_vROIy[100];
__device__ __constant__ float d_vROIz[63];

Anything suspicious there?

Is it related to something like alignment?

Do you really need __device__ __constant__? When I have used constant memory, I have always declared it just __constant__.

You seem to have found the root cause, but your kernel is still sufficiently complicated that there could be an interplay of issues, possibly even CUDA bugs. Start removing parts of the kernel, but keeping the constant memory read and some kind of write to d_xout. If you get down to a simple copy of a value from constant memory to d_xout and it STILL doesn’t work, then there is something funny going on.
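A sketch of what that stripped-down repro might look like (the kernel name and sizes here are hypothetical; only the constant-memory read and the global write are kept):

```cuda
__device__ __constant__ float d_vROIx[100];

// Minimal kernel: copy one constant-memory value straight to global memory.
__global__ void ConstCopyTest(float *d_xout)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    d_xout[t] = d_vROIx[t % 100];
}
```

If the values copied back from d_xout already disagree with emulation for this kernel, the constant-memory path itself is suspect; if they agree, the problem lies in the parts that were removed.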

Reasonable idea. I will try that. Thanks,

You’re right, I did miss what the problem was exactly. I’ve posted a reply in your other thread. One guess is that the compiler may be generating slightly different operation orders, which can affect results due to finite precision.

Paulius