Race condition problem


I have a problem with the following kernel that may be a race condition. I used __threadfence(), but nothing changes and the results remain faulty.

Here is the kernel (posted as an image), followed by the values it produces:

1.016585 0.683285 3.045785 0.320685

4.242600 1.779500 0.988300

1.653077 3.045785 0.320685

1.730515 0.988300

4.362694 1.779500 0.988300 0.000000

1.016585 -3.966595 0.00000 0.00000

4.242600 0.414655 0.000000

Correct values [computed by CPU]:
1.653077 0.868369 0.956883

1.730515 0.404530

4.362694 4.347469 3.109890 -0.01750

I don’t know what you expect to happen (it would be useful if you posted the CPU code as well), or what you were hoping to achieve with __threadfence().

However, if WIDTH is the block size you are calling the kernel with, you have out-of-bounds accesses in shared memory: C[j] and S[j] are written for j up to WIDTH-1, which is one past the end of an array of size WIDTH-1. You would need to declare C and S as

__shared__ float C[WIDTH], S[WIDTH];


Calling __syncthreads() inside a conditional is not supported; I would guess that the same applies to __threadfence()?

I don’t think so, as __threadfence() does not synchronize threads.

As I understand it, the __threadfence() instruction only affects active threads, so there is no harm in using it where there might be warp divergence.
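To illustrate the difference: __syncthreads() is a barrier that every thread of the block has to reach, so the usual pattern is to keep the divergent work inside the condition and place the barrier outside it. The following is only a minimal sketch under assumptions; the kernel shiftRight, the wrapper shiftOnDevice and the block size BLOCK are hypothetical names that do not come from the kernel in this thread, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // hypothetical block size for this sketch

// Each active thread stores one value in shared memory,
// then reads the value stored by its left neighbour.
__global__ void shiftRight(const float *in, float *out, int n)
{
    __shared__ float buf[BLOCK];
    int t = threadIdx.x;

    if (t < n)            // divergent work stays inside the condition
        buf[t] = in[t];

    __syncthreads();      // barrier outside the condition: all threads reach it

    if (t > 0 && t < n)
        out[t] = buf[t - 1];
}

// Host wrapper (hypothetical): in and out must hold BLOCK floats, n <= BLOCK.
void shiftOnDevice(const float *in, float *out, int n)
{
    float *dIn, *dOut;
    size_t bytes = BLOCK * sizeof(float);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);   // untouched outputs stay 0
    shiftRight<<<1, BLOCK>>>(dIn, dOut, n);
    cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}

int main()
{
    float in[BLOCK], out[BLOCK];
    for (int i = 0; i < BLOCK; ++i) in[i] = (float)(i + 1);
    shiftOnDevice(in, out, BLOCK - 4);  // last 4 threads skip the work
    printf("%g %g %g\n", out[0], out[1], out[BLOCK - 5]);
    return 0;
}
```

The last four threads do no work here, but they still arrive at the barrier, which is what __syncthreads() requires. A __threadfence() in the same position would not make any thread wait for another.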

I think the race condition is due to adjacent threads reading and writing the same elements of X. (I would paste the problematic lines, but the code was included as a picture.)

In one line, X[idx] is read, and in a later line X[j] is written to, and j = idx + 1. This turns out to be OK, except at the warp boundaries. Then there is a possibility of thread 32 reading X[idx] before or after thread 31 has modified it.
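Put differently, X[j] is computed from X[idx] = X[j-1], so every element depends on the one before it and the whole computation is a serial chain. A memory fence only orders a thread's own memory accesses; it does not make a thread wait for its neighbour. One way to get the ordering right, assuming the arrays are small enough that a single thread is acceptable, is to walk the chain in a loop. This is only a sketch: computeRSerial and computeROnDevice are hypothetical names, C and S become per-iteration locals c and s, and the host arrays are assumed to hold n floats each.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// One thread walks the recurrence in order, so X[idx] and Y[idx]
// are always final before they are read.
__global__ void computeRSerial(const float *ADiag, const float *ASubDiag,
                               float *q, float *r, float *z,
                               float *X, float *Y, int n)
{
    if (blockIdx.x != 0 || threadIdx.x != 0) return;
    X[0] = ADiag[0];
    Y[0] = ASubDiag[0];
    for (int idx = 0; idx + 1 < n; ++idx) {
        int j = idx + 1;
        z[idx] = sqrtf(X[idx] * X[idx] + ASubDiag[idx] * ASubDiag[idx]);
        float c = X[idx] / z[idx];
        float s = ASubDiag[idx] / z[idx];
        q[idx] = c * Y[idx] + s * ADiag[j];
        X[j] = -s * Y[idx] + c * ADiag[j];
        if (j != n - 1) {
            r[idx] = s * ASubDiag[j];
            Y[j] = c * ASubDiag[j];
        }
    }
}

// Host wrapper (hypothetical): every host array must hold n floats.
void computeROnDevice(const float *ADiag, const float *ASubDiag,
                      float *q, float *r, float *z, float *X, int n)
{
    size_t bytes = n * sizeof(float);
    float *d;                               // one buffer, split into 7 arrays
    cudaMalloc(&d, 7 * bytes);
    cudaMemset(d, 0, 7 * bytes);
    float *dA = d, *dB = d + n, *dq = d + 2 * n, *dr = d + 3 * n;
    float *dz = d + 4 * n, *dX = d + 5 * n, *dY = d + 6 * n;
    cudaMemcpy(dA, ADiag, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, ASubDiag, bytes, cudaMemcpyHostToDevice);
    computeRSerial<<<1, 1>>>(dA, dB, dq, dr, dz, dX, dY, n);
    cudaMemcpy(q, dq, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r, dr, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(z, dz, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(X, dX, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main()
{
    const int n = 3;                        // made-up test data
    float ADiag[n] = {3.0f, 2.0f, 1.0f};
    float ASubDiag[n] = {4.0f, 1.5f, 0.0f}; // last entry unused
    float q[n], r[n], z[n], X[n];
    computeROnDevice(ADiag, ASubDiag, q, r, z, X, n);
    for (int i = 0; i < n; ++i)
        printf("q=%f r=%f z=%f X=%f\n", q[i], r[i], z[i], X[i]);
    return 0;
}
```

This gives up all parallelism within the chain, so it only pays off if many independent systems are processed at once or the rest of the work already lives on the GPU; otherwise running the loop on the host is just as reasonable.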

Your guess is right. __syncthreads() synchronizes threads for read/write operations in shared memory only, and only within a block, so I tried __threadfence() instead, which can be used with global memory operations as well as shared memory.

Yes, you are right, the problem is with the X and Y arrays. Their elements have to be modified by one thread (at index j) and then read by the next thread (at index idx), so I used __threadfence() to try to synchronize, but nothing changed :(. Do you have any other solution? Oh, and here is the code as text :)

__global__ void computeR(float *ADiag, float *ASubDiag, float *q, float *r, float *z, float *X, float *Y)
{
	int idx = threadIdx.x + blockDim.x * blockIdx.x;

	__shared__ float C[WIDTH-1], S[WIDTH-1];

	int j = idx + 1;

	if (idx == 0)
	{
		X[idx] = ADiag[idx];   // ==> a1 , a2 , ... , an
		Y[idx] = ASubDiag[idx];   // ==> b2 , b3 , ... , bn
	}

	if (j < WIDTH)
	{
		z[idx] = sqrtf(X[idx] * X[idx] + ASubDiag[idx] * ASubDiag[idx]);
		C[j] = X[idx] / z[idx];
		S[j] = ASubDiag[idx] / z[idx];
		q[idx] = C[j] * Y[idx] + S[j] * ADiag[j];
		X[j] = -S[j] * Y[idx] + C[j] * ADiag[j];

		if (j != (WIDTH-1))
		{
			r[idx] = S[j] * ASubDiag[j];
			Y[j] = C[j] * ASubDiag[j];
		}
	}
}