Hi,

I have the following kernel:

```
#define convert_thread_index_to_table_row(bx, tx) ((bx)*blockDim.x+(tx))
__global__ void MatCopyKernel(int* table, int* candidate, int crows, int ccols, int* another, int* count)
{
    int i= threadIdx.x;
    int bx = blockIdx.x;
    //int j= threadIdx.y;
    int j,k, l;
    int item;
    for(j=0;j<crows;j++)
    {
        for(k=0;k<ccols;k++)
        {
            item= candidate[j*ccols+k];
            __syncthreads();
            int row = convert_thread_index_to_table_row(bx,i);
            if(table[row*ITEMS+(item-1)]==1)
            {
                if(k==0)
                    another[row]=1;
                else
                    another[row]*=1;
            }
            else
            {
                another[row]=0;
            }
            __syncthreads();
        }
        __syncthreads();
        //reduction
        l=2;
        while((l/2)<NUMRECORDS)
        {
            __syncthreads();
            int row = convert_thread_index_to_table_row(bx,i);
            if((row%l)==0 && row+l/2 < NUMRECORDS)
            {
                another[row]= another[row]+another[(row+(l/2))];
            }
            __syncthreads();
            l*=2;
        }
        //reduction ends
        count[j]=another[0];
        __syncthreads();
    }
}
```

The output differs from run to run, and I don't understand why. If the calculation were simply wrong, it should produce the same incorrect result every time. My guess is that it is a synchronization problem, but there are `__syncthreads()` calls (some of them redundant) at the beginning and end of each loop iteration. So how can the output differ between runs?
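
For reference, here is a minimal serial sketch of the result the kernel appears to be aiming for: for each candidate row `j`, `count[j]` should be the number of table rows that contain every item of that candidate. This is a reading of the kernel, not a tested port of it. The values of `ITEMS` and `NUMRECORDS`, the function names `countSerial` and `exampleCounts`, and the sample data are all made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sizes for illustration only; the real values are not shown above.
const int ITEMS = 4;       // columns per table row
const int NUMRECORDS = 5;  // number of table rows

// Serial reference: count[j] = number of table rows that contain
// every item listed in candidate row j (items are 1-based, as in the kernel).
void countSerial(const std::vector<int>& table,
                 const std::vector<int>& candidate,
                 int crows, int ccols,
                 std::vector<int>& count)
{
    for (int j = 0; j < crows; j++) {
        int matches = 0;
        for (int row = 0; row < NUMRECORDS; row++) {
            int flag = 1;  // plays the role of another[row]
            for (int k = 0; k < ccols; k++) {
                int item = candidate[j * ccols + k];
                assert(item >= 1 && item <= ITEMS);
                if (table[row * ITEMS + (item - 1)] != 1)
                    flag = 0;
            }
            matches += flag;  // plays the role of the reduction
        }
        count[j] = matches;
    }
}

// Small made-up data set: 5 records, 2 candidates of 2 items each.
std::vector<int> exampleCounts()
{
    std::vector<int> table = {
        1, 1, 0, 0,
        1, 1, 1, 0,
        0, 1, 1, 0,
        1, 0, 0, 1,
        1, 1, 0, 1,
    };
    std::vector<int> candidate = { 1, 2,   2, 3 };
    std::vector<int> count(2, 0);
    countSerial(table, candidate, 2, 2, count);
    return count;
}
```

With this sample data, `exampleCounts()` returns 3 for candidate {1, 2} and 2 for candidate {2, 3}. A serial version like this always gives the same answer for the same input, which is what I expected from the kernel as well.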