emulation mode and debug mode gave me totally different results!

Hi,
I’m a CUDA beginner, and I’ve been struggling with a problem for several days.

Any help will be greatly appreciated.

The CUDA lecture notes from the University of Illinois include the following comments:

  1. Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results.
  2. Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.

Emulation mode gives me a correct result, but debug or release mode gives me a wrong result.

I think one of these points is the cause of the problem I’m having.

Here is my simple code.


In the main function, I wrote the following:

int npx=4;
int npy=4;
int npz=2;

unsigned int mem_size2 = 3*(npx+1)*(npy+1)*(npz+2)* sizeof( float );

CUDA_SAFE_CALL(cudaMemcpy(d_elf, elf, mem_size2 ,cudaMemcpyHostToDevice) );

CalIntForce_kernel<<<dim3(1,1),dim3((npx+1),(npy+1),(npz+1))>>>(d_elf,npx,npy,npz);

CUDA_SAFE_CALL(cudaMemcpy(elf, d_elf, mem_size2 ,cudaMemcpyDeviceToHost) );

__global__ void CalIntForce_kernel (float *elf, int npx, int npy, int npz)

{

int i, j, n, n1, n2, n3;
int indx, nn;

int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;

__shared__ float lelf[128];

int nx=npx+1;
int ny=npy+1;
int nz=npz+1;

indx = tx*ny*nz + ty*nz + tz;    

for(n1=0;n1<=npx;n1++)

{
n=n1*ny*nz+ty*nz+tz;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}
for(n2=0;n2<=npy;n2++)
{
n=tx*ny*nz+n2*nz+tz;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}
for(n3=0;n3<=npz;n3++)
{
n=tx*ny*nz+ty*nz+n3;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}

}


In the kernel, I used the elf array. I think it’s in global memory.

However, every thread accesses this array simultaneously, and I think that causes the problem.

Could you tell me whether my reasoning is right and how to fix this problem?

Thank you,
swhastan

I would say you are right: all your threads are reading from and writing to elf, and they are stepping on each other.
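To make the "stepping on each other" concrete, here is a minimal sketch, not taken from your code. Each `+=` on global memory is a read, an add, and a write, so two threads can read the same old value and one update gets lost. In emulation mode the threads run one after another, so this never shows up. The kernel and helper names below are hypothetical, and `atomicAdd` on `float` assumes a device of compute capability 2.0 or later.

```cuda
#include <cuda_runtime.h>

// Every thread does a plain read-modify-write on the same element.
// Threads may read the same old value, so some increments can be lost.
__global__ void racyIncrement(float *counter)
{
    *counter += 1.0f;
}

// Same work, but atomicAdd makes each read-modify-write indivisible.
// Assumption: float atomicAdd is available (compute capability 2.0+).
__global__ void atomicIncrement(float *counter)
{
    atomicAdd(counter, 1.0f);
}

// Hypothetical host helper: launches one block of `threads` threads
// on a zero-initialized counter and returns the final value.
float runIncrement(bool useAtomic, int threads)
{
    float h = 0.0f;
    float *d = NULL;
    cudaMalloc((void **)&d, sizeof(float));
    cudaMemcpy(d, &h, sizeof(float), cudaMemcpyHostToDevice);
    if (useAtomic)
        atomicIncrement<<<1, threads>>>(d);
    else
        racyIncrement<<<1, threads>>>(d);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

With the atomic version the result equals the number of threads. With the plain version the result is not defined and will usually be smaller. Atomics serialize the conflicting accesses, so they are a correctness fix rather than a fast path.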

It’s not clear what you are trying to do, but the simplest thing is to organize the threads so that each thread operates on a disjoint set of data. For instance, you might have three kernels, to replace the three loops, where for the first kernel, each thread loops over x, with a separate thread for each (y, z) combination. This assumes you need to implement some sort of dependency across different x values. Then the second kernel could loop over y, with a separate thread for each (x, z) combination, and a third kernel loops over z, with a separate thread for each (x, y) combination.
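A rough sketch of that three-kernel layout follows, assuming the same indexing as in your post (n = x*ny*nz + y*nz + z, three floats per point). The names `sweepX`, `sweepY`, `sweepZ`, `bump` and `runSweeps` are made up for illustration, and the `+= 1` update is only a placeholder for whatever the real calculation is. Within each kernel no two threads touch the same element, and kernels launched one after another on the default stream run in order.

```cuda
#include <cuda_runtime.h>

// Offset of the first of the three floats stored at grid point (x, y, z).
__device__ int elfIndex(int x, int y, int z, int ny, int nz)
{
    return 3 * (x * ny * nz + y * nz + z);
}

// Placeholder update copied from the question: add 1 to all three components.
__device__ void bump(float *elf, int nn)
{
    elf[nn] += 1;
    elf[nn + 1] += 1;
    elf[nn + 2] += 1;
}

// One thread per (y, z); each thread owns a whole line along x.
__global__ void sweepX(float *elf, int nx, int ny, int nz)
{
    int y = threadIdx.x, z = threadIdx.y;
    for (int x = 0; x < nx; x++) bump(elf, elfIndex(x, y, z, ny, nz));
}

// One thread per (x, z); each thread owns a whole line along y.
__global__ void sweepY(float *elf, int nx, int ny, int nz)
{
    int x = threadIdx.x, z = threadIdx.y;
    for (int y = 0; y < ny; y++) bump(elf, elfIndex(x, y, z, ny, nz));
}

// One thread per (x, y); each thread owns a whole line along z.
__global__ void sweepZ(float *elf, int nx, int ny, int nz)
{
    int x = threadIdx.x, y = threadIdx.y;
    for (int z = 0; z < nz; z++) bump(elf, elfIndex(x, y, z, ny, nz));
}

// Hypothetical host helper: copies elf to the device, runs the three
// sweeps in order, and copies the result back into h_elf.
void runSweeps(float *h_elf, int nx, int ny, int nz)
{
    size_t bytes = 3 * (size_t)nx * ny * nz * sizeof(float);
    float *d_elf = NULL;
    cudaMalloc((void **)&d_elf, bytes);
    cudaMemcpy(d_elf, h_elf, bytes, cudaMemcpyHostToDevice);
    sweepX<<<1, dim3(ny, nz)>>>(d_elf, nx, ny, nz);
    sweepY<<<1, dim3(nx, nz)>>>(d_elf, nx, ny, nz);
    sweepZ<<<1, dim3(nx, ny)>>>(d_elf, nx, ny, nz);
    cudaMemcpy(h_elf, d_elf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_elf);
}
```

With this placeholder update every element is touched exactly once per kernel, so an array that starts at zero ends with 3 in every entry. If each point really needs a contribution from every thread along its line, accumulate that sum inside the owning thread's loop rather than letting several threads write to the same element.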

Thank you for the reply.

The purpose of this code is to compute the elf array in the kernel and copy it back to the host.

The calculated elf values are different in emulation mode and debug mode.

I’d like to make the data disjoint, but that is very hard to do.

Is there any good trick that lets each thread read and write the elf array safely?