The result is not always right!

I have built a CUDA program on a GeForce 8600 GT. When I run the program in emulation mode, the result is always right. But when I run it on the GPU, the result is sometimes right and sometimes wrong, even though all I do is run it several times without changing the code. The number and location of the errors in the result are rather random. I believe my program follows the rules for accessing global memory and does not cause bank conflicts in shared memory.
I tried adding some __syncthreads() calls and changing the order of some statements. I believe this helped somewhat (I compared the intermediate results in emudebug and debug mode), but not enough. Can anyone tell me what causes this?
Thanks a lot!

I think it would be helpful if you posted some of your code, if the program is not huge.

I think my program is missing some __syncthreads() calls (the ones I added before were probably not in the right positions). I will try tomorrow in the lab.

But this does not mean my problem will be completely resolved. There are still other problems. For example, the following code calculates the vertical XOR of 17 rows (64 KB each). The results in emudebug and debug mode are totally different.

__global__ void enchorkernel(unsigned int * Inputarray, unsigned int * Outputarray)
{
    // staging buffer for one 4 KB cell (1024 unsigned int)
    __shared__ unsigned int sh_temp[1024];
    const int tid = threadIdx.x + blockDim.x;
    // offset of this block's two columns (2 x 1024 unsigned int)
    int blockbase = blockIdx.x << 11;

    for (int i = 0; i < 2; i++)
    {
        __syncthreads();
        sh_temp[threadIdx.x] = Inputarray[blockbase + threadIdx.x];
        sh_temp[tid] = Inputarray[blockbase + tid];
        blockbase += 16384; // next row: 16k unsigned int

        for (int j = 0; j < 16; j++)
        {
            __syncthreads();
            sh_temp[threadIdx.x] ^= Inputarray[blockbase + threadIdx.x];
            sh_temp[tid] ^= Inputarray[blockbase + tid];
            blockbase += 16384; // next row: 16k unsigned int
        }

        blockbase = blockIdx.x << 11;
        Outputarray[blockbase + (i << 10) + tid] = sh_temp[tid];
    }
}


Each block is responsible for 2 columns, and there are 16 columns in total. Each intersection of a column and a row holds 4 KB.

I believe there are no bank conflicts when accessing shared memory. The elements of Inputarray are relatively small, yet some of the results from my GeForce 8600 GT are not only unbelievably large but also negative.
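To make the intended computation concrete, here is a minimal host-side sketch of the vertical XOR. It assumes the layout described in this post (17 rows of 64 KB, each row split into 16 columns of 4 KB) and a 4-byte unsigned int. The function name `vertical_xor_reference` and the constants are hypothetical and not part of the original program. The idea is to have a plain CPU result that the GPU output can be compared against word by word.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Layout assumed from the post: 17 rows, each 64 KB = 16 columns x 4 KB.
// With a 4-byte unsigned int (an assumption), one cell is 1024 words.
constexpr std::size_t kRows = 17;
constexpr std::size_t kColumns = 16;
constexpr std::size_t kWordsPerCell = 1024;
constexpr std::size_t kRowStride = kColumns * kWordsPerCell; // 16384 words per row

// Hypothetical helper: XOR all 17 rows together, word by word, on the CPU.
// The result has one row's worth of words (16384).
std::vector<unsigned int> vertical_xor_reference(const std::vector<unsigned int>& input)
{
    assert(input.size() == kRows * kRowStride);
    std::vector<unsigned int> output(kRowStride, 0u);
    for (std::size_t row = 0; row < kRows; ++row) {
        for (std::size_t w = 0; w < kRowStride; ++w) {
            output[w] ^= input[row * kRowStride + w];
        }
    }
    return output;
}
```

Copying Outputarray back to the host and comparing it element by element with this reference would show exactly which words differ, which may be easier to interpret than comparing emudebug and debug runs by eye.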

Waiting for experts to enlighten me about what is happening.