__syncthreads screws up calculation

Hi,

I am using CUDA to develop a Monte Carlo code. The GPU kernel does fairly uniform calculations and works fine, though slowly. While trying different things to improve performance, I found that adding __syncthreads() in the middle of the code leads to incorrect calculations with unpredictable results (though within a certain range). Yes, it works fine without __syncthreads() and silently produces a wrong result with it. There is no aliasing; I just want to load numbers into shared memory before applying the calculation to them.

Code snippet: s is grid.x, c is grid.y, H and U are shared, cuda_flink is cuda_array.

{ n = neighbour_table[8 * i + dir]; { U[s*3+c+16*0] = cuda_flink[8 * i + dir].e
[c][0].real; U[s*3+c+16*1] = cuda_flink[8 * i + dir].e[c][0].imag; U[s*3+c+16*2] =
cuda_flink[8 * i + dir].e[c][1].real; U[s*3+c+16*3] = cuda_flink[8 * i + dir].e[c][
1].imag; U[s*3+c+16*4] = cuda_flink[8 * i + dir].e[c][2].real; U[s*3+c+16*5] = cuda
_flink[8 * i + dir].e[c][2].imag; }; };
__syncthreads();

H[s].h[0].c[c].real = src[n].c[c].d[0].real - src[n].c[c].d[3].imag;
H[s].h[0].c[c].imag = src[n].c[c].d[0].imag + src[n].c[c].d[3].real;
H[s].h[1].c[c].real = src[n].c[c].d[1].real - src[n].c[c].d[2].imag;
H[s].h[1].c[c].imag = src[n].c[c].d[1].imag + src[n].c[c].d[2].real;

Usually you break things by forgetting __syncthreads(), not by adding it.
CUDA 1.0 and 1.1 beta give the same results.

Any ideas?

Thanks!

This is easier to read:

Code snippet: s is grid.x, c is grid.y, H and U are shared, cuda_flink is cuda_array.

{
    n = neighbour_table[8 * i + dir];
    {
        U[s*3+c+16*0] = cuda_flink[8 * i + dir].e[c][0].real;
        U[s*3+c+16*1] = cuda_flink[8 * i + dir].e[c][0].imag;
        U[s*3+c+16*2] = cuda_flink[8 * i + dir].e[c][1].real;
        U[s*3+c+16*3] = cuda_flink[8 * i + dir].e[c][1].imag;
        U[s*3+c+16*4] = cuda_flink[8 * i + dir].e[c][2].real;
        U[s*3+c+16*5] = cuda_flink[8 * i + dir].e[c][2].imag;
    };
};

__syncthreads();

H[s].h[0].c[c].real = src[n].c[c].d[0].real - src[n].c[c].d[3].imag;
H[s].h[0].c[c].imag = src[n].c[c].d[0].imag + src[n].c[c].d[3].real;
H[s].h[1].c[c].real = src[n].c[c].d[1].real - src[n].c[c].d[2].imag;
H[s].h[1].c[c].imag = src[n].c[c].d[1].imag + src[n].c[c].d[2].real;
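
For comparison, here is the load-into-shared, then synchronize, then compute pattern reduced to a minimal self-contained sketch. It is not the Monte Carlo kernel above: the kernel name stage_then_use, the block shape S x C, and the arrays in/out are all hypothetical, and it assumes s and c are the thread indices within a block (threadIdx.x and threadIdx.y). The one point it illustrates is that the barrier sits where every thread of the block reaches it; whether that is related to the wrong results above is not something the snippet alone can show.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define S 4   // hypothetical block size in x (assumption, not from the post)
#define C 3   // hypothetical block size in y (assumption, not from the post)

// Each thread stages one value into shared memory, the block synchronizes,
// and only then does a thread read a value staged by a different thread.
__global__ void stage_then_use(const float *in, float *out)
{
    __shared__ float U[S * C];

    int s = threadIdx.x;
    int c = threadIdx.y;
    int t = s * C + c;                  // unique shared-memory slot per thread
    int base = blockIdx.x * S * C;      // start of this block's data

    U[t] = in[base + t];                // stage into shared memory

    // Every thread of the block must reach this barrier. If it sits inside
    // a branch that only some threads take, the kernel can hang or produce
    // unintended results.
    __syncthreads();

    int neighbour = (t + 1) % (S * C);  // slot written by another thread
    out[base + t] = U[t] + U[neighbour];
}

int main()
{
    const int blocks = 2;
    const int n = blocks * S * C;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    stage_then_use<<<blocks, dim3(S, C)>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against the same computation done on the host.
    int errors = 0;
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < S * C; ++t) {
            float expect = h_in[b * S * C + t]
                         + h_in[b * S * C + (t + 1) % (S * C)];
            if (h_out[b * S * C + t] != expect) ++errors;
        }
    }
    printf("mismatches: %d\n", errors);

    cudaFree(d_in);
    cudaFree(d_out);
    return errors != 0;
}
```

Comparing the device output with a host-side reference, as main does here, is one way to narrow down where a kernel starts to diverge when a barrier is added or removed.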

I really can’t tell you what’s wrong, but maybe next time it’s better to post only “one” thread instead of “three” ;)