emulation mode and debug mode gave me totally different results!

swhastan · May 21, 2009, 6:25pm

Hi,
I’m a CUDA beginner, and I’ve been struggling with a problem for several days.

Any help will be greatly appreciated.

The CUDA lecture notes from the University of Illinois include the comments below.

Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results.
Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.

Emulation mode gives me a correct result, but debug or release mode gives me a wrong result.

I think this is what causes the problem I’m having.

Here is my simple code.

In the main function, I wrote this:

int npx=4;
int npy=4;
int npz=2;

unsigned int mem_size2 = 3*(npx+1)*(npy+1)*(npz+2)*sizeof(float);

CUDA_SAFE_CALL(cudaMemcpy(d_elf, elf, mem_size2 ,cudaMemcpyHostToDevice) );

CalIntForce_kernel<<<dim3(1,1),dim3((npx+1),(npy+1),(npz+1))>>>(d_elf,npx,npy,npz);

CUDA_SAFE_CALL(cudaMemcpy(elf, d_elf, mem_size2 ,cudaMemcpyDeviceToHost) );

__global__ void CalIntForce_kernel (float *elf, int npx, int npy, int npz)

{

int i, j, n, n1, n2, n3;
int indx, nn;
int nx, ny, nz;   // grid dimensions; assigned below

int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;

__shared__ float lelf[128];

nx=npx+1;
ny=npy+1;
nz=npz+1;

indx = tx*ny*nz + ty*nz + tz;    

for(n1=0;n1<=npx;n1++)

{
n=n1*ny*nz+ty*nz+tz;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}
for(n2=0;n2<=npy;n2++)
{
n=tx*ny*nz+n2*nz+tz;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}
for(n3=0;n3<=npz;n3++)
{
n=tx*ny*nz+ty*nz+n3;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}

}

In the kernel, I use the elf array, which I believe is in global memory.

However, every thread accesses this array simultaneously, and I think that is what causes the problem.

Could you tell me whether my reasoning is right, and how to fix this problem?

Thank you,
swhastan

Jamie_K · May 21, 2009, 7:58pm

I would say you are right: all your threads are reading from and writing to elf, and they are stepping on each other.

It’s not clear what you are trying to do, but the simplest thing is to organize the threads so that each thread operates on a disjoint set of data. For instance, you might have three kernels, to replace the three loops, where for the first kernel, each thread loops over x, with a separate thread for each (y, z) combination. This assumes you need to implement some sort of dependency across different x values. Then the second kernel could loop over y, with a separate thread for each (x, z) combination, and a third kernel loops over z, with a separate thread for each (x, y) combination.
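The three-kernel layout described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the names sweepX, sweepY, sweepZ, elfIndex and runDisjointSweeps are made up for this example, and the `+= 1` update is the placeholder from the original post, not the real force calculation. Each pass here adds 1 to every element once, so the final value is 3 per element. The original kernel, run sequentially, would instead add nx + ny + nz, because every thread on a line repeats the increment.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Index of component c (0..2) at grid point (x, y, z), using the layout from
// the original post: 3 * (x*ny*nz + y*nz + z) + c. (Hypothetical helper.)
__host__ __device__ inline int elfIndex(int x, int y, int z, int ny, int nz, int c)
{
    return 3 * (x * ny * nz + y * nz + z) + c;
}

// Pass 1: one thread per (y, z); only that thread touches its line along x.
__global__ void sweepX(float *elf, int nx, int ny, int nz)
{
    int y = threadIdx.x, z = threadIdx.y;
    for (int x = 0; x < nx; ++x)
        for (int c = 0; c < 3; ++c)
            elf[elfIndex(x, y, z, ny, nz, c)] += 1.0f;  // placeholder update
}

// Pass 2: one thread per (x, z); each thread walks its line along y.
__global__ void sweepY(float *elf, int nx, int ny, int nz)
{
    int x = threadIdx.x, z = threadIdx.y;
    for (int y = 0; y < ny; ++y)
        for (int c = 0; c < 3; ++c)
            elf[elfIndex(x, y, z, ny, nz, c)] += 1.0f;  // placeholder update
}

// Pass 3: one thread per (x, y); each thread walks its line along z.
__global__ void sweepZ(float *elf, int nx, int ny, int nz)
{
    int x = threadIdx.x, y = threadIdx.y;
    for (int z = 0; z < nz; ++z)
        for (int c = 0; c < 3; ++c)
            elf[elfIndex(x, y, z, ny, nz, c)] += 1.0f;  // placeholder update
}

// Runs the three passes on a zero-filled array and checks that every element
// was updated exactly once per pass. Returns false on any CUDA error.
bool runDisjointSweeps(int npx, int npy, int npz)
{
    int nx = npx + 1, ny = npy + 1, nz = npz + 1;
    size_t count = 3 * static_cast<size_t>(nx) * ny * nz;
    std::vector<float> elf(count, 0.0f);

    float *d_elf = nullptr;
    if (cudaMalloc(&d_elf, count * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d_elf, elf.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    // Launches on the same stream run one after another, so the passes
    // cannot overlap with each other.
    sweepX<<<1, dim3(ny, nz)>>>(d_elf, nx, ny, nz);
    sweepY<<<1, dim3(nx, nz)>>>(d_elf, nx, ny, nz);
    sweepZ<<<1, dim3(nx, ny)>>>(d_elf, nx, ny, nz);

    cudaError_t err = cudaMemcpy(elf.data(), d_elf, count * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_elf);
    if (err != cudaSuccess) return false;

    for (float v : elf)
        if (v != 3.0f) return false;
    return true;
}
```

Within each pass no two threads write the same element, so the result no longer depends on how the hardware schedules the threads. Calling runDisjointSweeps(4, 4, 2) from main uses the sizes from the original post.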

swhastan · May 21, 2009, 8:48pm

Thank you for the reply.

The purpose of this code is to get the elf array back from the kernel.

The calculated elf values are different in emulation mode and debug mode.

I’d like to make the data disjoint, but that is very hard.

Is there a good trick that lets every thread read and write the elf array?
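One option that nobody in this thread suggested, shown here only as a hedged sketch, is to keep the overlapping accesses but make each update atomic. The statement `elf[nn] += 1` is a separate read, add and write, so two threads can read the same old value and one increment is lost. `atomicAdd` on a `float` in global memory performs the update as one indivisible step, but it requires a device of compute capability 2.0 or higher. The names calIntForceAtomic and runAtomicDemo are hypothetical, and `1.0f` is still the placeholder increment from the original post.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same access pattern as the kernel in the original post, but every
// read-modify-write on elf goes through atomicAdd.
__global__ void calIntForceAtomic(float *elf, int npx, int npy, int npz)
{
    int nx = npx + 1, ny = npy + 1, nz = npz + 1;
    int tx = threadIdx.x, ty = threadIdx.y, tz = threadIdx.z;
    (void)nx;  // nx is only needed as a loop bound below

    for (int n1 = 0; n1 < nx; ++n1) {           // line along x through (ty, tz)
        int nn = 3 * (n1 * ny * nz + ty * nz + tz);
        for (int c = 0; c < 3; ++c) atomicAdd(&elf[nn + c], 1.0f);
    }
    for (int n2 = 0; n2 < ny; ++n2) {           // line along y through (tx, tz)
        int nn = 3 * (tx * ny * nz + n2 * nz + tz);
        for (int c = 0; c < 3; ++c) atomicAdd(&elf[nn + c], 1.0f);
    }
    for (int n3 = 0; n3 < nz; ++n3) {           // line along z through (tx, ty)
        int nn = 3 * (tx * ny * nz + ty * nz + n3);
        for (int c = 0; c < 3; ++c) atomicAdd(&elf[nn + c], 1.0f);
    }
}

// Runs the kernel on a zero-filled array. With no lost updates, each element
// is incremented by nx + ny + nz threads in total. Returns false on any
// CUDA error or mismatch.
bool runAtomicDemo(int npx, int npy, int npz)
{
    int nx = npx + 1, ny = npy + 1, nz = npz + 1;
    size_t count = 3 * static_cast<size_t>(nx) * ny * nz;
    std::vector<float> elf(count, 0.0f);

    float *d_elf = nullptr;
    if (cudaMalloc(&d_elf, count * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d_elf, elf.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    calIntForceAtomic<<<dim3(1, 1), dim3(nx, ny, nz)>>>(d_elf, npx, npy, npz);

    cudaError_t err = cudaMemcpy(elf.data(), d_elf, count * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_elf);
    if (err != cudaSuccess) return false;

    const float expected = static_cast<float>(nx + ny + nz);
    for (float v : elf)
        if (v != expected) return false;
    return true;
}
```

Atomics serialize threads that hit the same address, so this is usually slower than giving each thread disjoint data. The order of the additions is also not fixed, which can change rounding when the real increments are not exact values like 1.0f.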
