Race condition problem

hi

I have a problem with the following kernel, which may be a race condition. I tried adding __threadfence(), but it made no difference and the results are still wrong.

Here is the kernel:

Inputs:
ADiagnol:
1.016585 0.683285 3.045785 0.320685

ASubDiagnol:
4.242600 1.779500 0.988300

Results:
Q:
1.653077 3.045785 0.320685

R:
1.730515 0.988300

Z:
4.362694 1.779500 0.988300 0.000000

X:
1.016585 -3.966595 0.00000 0.00000

Y:
4.242600 0.414655 0.000000

Correct values [computed by CPU]:
Q:
1.653077 0.868369 0.956883

R:
1.730515 0.404530

Z:
4.362694 4.347469 3.109890 -0.01750
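
A note on why __threadfence() alone may not help: it orders the calling thread's own memory writes as seen by other threads, but it is not a barrier and does not make any thread wait for another. In the results above, the first entry of Q, R and Z matches the CPU and the later entries do not, which would be consistent with each step reading a value that the previous step has not written yet. The sketch below is NOT the kernel from this post. It uses a hypothetical recurrence (step, recurrenceRacy and recurrenceOrdered are made-up names, and the inputs are made up too) only to contrast a fence with a block-level barrier, assuming everything runs in a single block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical recurrence, for illustration only: out[i] depends on out[i-1].
__host__ __device__ inline float step(float diag, float sub, float prev) {
    return diag - sub * sub / prev;
}

// Racy: thread i may read out[i-1] before thread i-1 has written it.
// __threadfence() orders this thread's own writes; it does not wait for others.
// Kept only for contrast; it is not launched below.
__global__ void recurrenceRacy(const float* diag, const float* sub, float* out, int n) {
    int i = threadIdx.x;
    if (i >= n) return;
    if (i == 0) out[0] = diag[0];
    __threadfence();
    if (i > 0) out[i] = step(diag[i], sub[i - 1], out[i - 1]);
}

// Ordered: single block, thread k performs step k, with a barrier after each step.
__global__ void recurrenceOrdered(const float* diag, const float* sub, float* out, int n) {
    int i = threadIdx.x;
    for (int k = 0; k < n; ++k) {
        if (i == k) {
            out[k] = (k == 0) ? diag[0] : step(diag[k], sub[k - 1], out[k - 1]);
        }
        __syncthreads();  // reached by every thread of the block in every iteration
    }
}

int main() {
    const int n = 4;
    // Made-up inputs, not the values from the post.
    float hDiag[n] = {4.0f, 5.0f, 6.0f, 7.0f};
    float hSub[n - 1] = {1.0f, 2.0f, 3.0f};
    float ref[n], hOut[n];

    // CPU reference: plain sequential loop.
    ref[0] = hDiag[0];
    for (int i = 1; i < n; ++i) ref[i] = step(hDiag[i], hSub[i - 1], ref[i - 1]);

    float *dDiag, *dSub, *dOut;
    cudaMalloc(&dDiag, n * sizeof(float));
    cudaMalloc(&dSub, (n - 1) * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dDiag, hDiag, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dSub, hSub, (n - 1) * sizeof(float), cudaMemcpyHostToDevice);

    recurrenceOrdered<<<1, n>>>(dDiag, dSub, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // GPU and CPU values should agree up to floating-point rounding.
    for (int i = 0; i < n; ++i) printf("gpu %f  cpu %f\n", hOut[i], ref[i]);

    cudaFree(dDiag);
    cudaFree(dSub);
    cudaFree(dOut);
    return 0;
}
```

__syncthreads() only synchronizes threads within one block, so this pattern does not carry over to a multi-block launch. A chain that is fully sequential like this one gains nothing from the extra threads, so running the dependent part in a single thread (or splitting it into separate kernel launches) may be just as good.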