Hi,
I’m new to CUDA and have been stuck on a problem for several days.
Any help will be greatly appreciated.
The CUDA lecture notes from the University of Illinois include the following comments:
- Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results.
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
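The second comment can be illustrated with a minimal sketch. The names `set_value` and `read_back_from_device` are hypothetical and not part of the lecture notes; the sketch assumes memory allocated with plain `cudaMalloc` (not managed memory), so the host must copy the data back instead of dereferencing the device pointer.

```cuda
#include <cuda_runtime.h>

// Kernel: writing through a device pointer is valid on the device.
__global__ void set_value(int *d_out, int value)
{
    *d_out = value;
}

// Hypothetical helper: returns the value written by the kernel, or -1 on error.
int read_back_from_device(int value)
{
    int *d_out = NULL;
    int h_out = -1;

    if (cudaMalloc((void **)&d_out, sizeof(int)) != cudaSuccess)
        return -1;

    set_value<<<1, 1>>>(d_out, value);

    // Not valid in device execution mode: h_out = *d_out;
    // d_out holds a device address, so the data is copied to the host instead.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out : -1;
}
```

In emulation mode both pointers live in host memory, which is why the direct dereference can appear to work there.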
Emulation mode gives me the correct result, but debug and release modes give me a wrong one.
I think this is the cause of the problem I’m having.
Here is my simple code.
In the main function, I wrote the following:
int npx=4;
int npy=4;
int npz=2;
unsigned int mem_size2 = 3*(npx+1)*(npy+1)*(npz+2)*sizeof(float);
CUDA_SAFE_CALL(cudaMemcpy(d_elf, elf, mem_size2 ,cudaMemcpyHostToDevice) );
CalIntForce_kernel<<<dim3(1,1),dim3((npx+1),(npy+1),(npz+1))>>>(d_elf,npx,npy,npz);
CUDA_SAFE_CALL(cudaMemcpy(elf, d_elf, mem_size2 ,cudaMemcpyDeviceToHost) );
__global__ void CalIntForce_kernel (float *elf, int npx, int npy, int npz)
{
int i, j, n, n1, n2, n3, nx, ny, nz;
int indx, nn;
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;
__shared__ float lelf[128];
nx=npx+1;
ny=npy+1;
nz=npz+1;
indx = tx*ny*nz + ty*nz + tz;
for(n1=0;n1<=npx;n1++)
{
n=n1*ny*nz+ty*nz+tz;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}
for(n2=0;n2<=npy;n2++)
{
n=tx*ny*nz+n2*nz+tz;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}
for(n3=0;n3<=npz;n3++)
{
n=tx*ny*nz+ty*nz+n3;
nn=3*n;
elf[nn]+=1;
elf[nn+1]+=1;
elf[nn+2]+=1;
}
}
In the kernel, I use the elf array, which I think resides in global memory.
However, the threads access this array simultaneously, and I think that causes the problem.
Is my reasoning right, and how can I fix this?
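One possible way to avoid the lost updates is sketched below: the same access pattern as the kernel above, but with each `+=` replaced by `atomicAdd`. This is a sketch under assumptions, not a definitive fix: `atomicAdd` on `float` requires a GPU of compute capability 2.0 or later, and the names `CalIntForce_atomic` and `run_atomic_version` are hypothetical. The host helper starts from an array of zeros and sizes it with `npz+1` to match the kernel's indexing.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every update is atomic, so concurrent read-modify-write operations
// on the same element cannot overwrite each other.
__global__ void CalIntForce_atomic(float *elf, int npx, int npy, int npz)
{
    int tx = threadIdx.x, ty = threadIdx.y, tz = threadIdx.z;
    int ny = npy + 1, nz = npz + 1;

    for (int n1 = 0; n1 <= npx; n1++) {          // line through this thread along x
        int nn = 3 * (n1 * ny * nz + ty * nz + tz);
        for (int c = 0; c < 3; c++) atomicAdd(&elf[nn + c], 1.0f);
    }
    for (int n2 = 0; n2 <= npy; n2++) {          // line along y
        int nn = 3 * (tx * ny * nz + n2 * nz + tz);
        for (int c = 0; c < 3; c++) atomicAdd(&elf[nn + c], 1.0f);
    }
    for (int n3 = 0; n3 <= npz; n3++) {          // line along z
        int nn = 3 * (tx * ny * nz + ty * nz + n3);
        for (int c = 0; c < 3; c++) atomicAdd(&elf[nn + c], 1.0f);
    }
}

// Hypothetical host helper: runs the kernel on a zero-initialized array.
std::vector<float> run_atomic_version(int npx, int npy, int npz)
{
    size_t count = (size_t)3 * (npx + 1) * (npy + 1) * (npz + 1);
    std::vector<float> elf(count, 0.0f);
    float *d_elf = NULL;

    cudaMalloc((void **)&d_elf, count * sizeof(float));
    cudaMemcpy(d_elf, elf.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(npx + 1, npy + 1, npz + 1);       // one block, one thread per point
    CalIntForce_atomic<<<1, block>>>(d_elf, npx, npy, npz);

    cudaMemcpy(elf.data(), d_elf, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_elf);
    return elf;
}
```

With npx=4, npy=4, npz=2, each element is incremented by 5 threads in the first loop, 5 in the second, and 3 in the third, so every entry should end at 13. On hardware without `float` atomics, an alternative would be to restructure the kernel so that each thread writes only to its own element.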
Thank you,
swhastan