Can't Resolve Racing I'm getting random, different outputs...

I’m trying to implement a matrix orthonormalization algorithm. There’s no problem with small matrices but I’m getting weired outputs with larger inputs. I suppose this is due to some racing but I have considered everything to avoid any memory hazard. Threads in my algorithm need to modify the input matrix (a global variable). So I use __threadfence() to flush and make sure every other thread gets the new values. But still, although I have tried to consider everything, I get wrong results sometimes. It produces the right answer most of the time though. Last time I tested, I got 11 correct results out of 15 times of running the application.

Here is the kernel code:

#include <math.h>

/*

The following function implements the stabilized Gram–Schmidt orthonormalization.

Matrix "a" will contain the orthogonal matrix "Q" at the end.

*/

__device__ unsigned int flag = 0; // number of orthogonalized columns

__global__ void orthonormalize(float* a) {

	__shared__ float s[BLOCK_DIM];

	int tid = threadIdx.x;

	int idx = blockIdx.x * blockDim.x + tid;

	float me;

	int i = 0;

	while (i < blockIdx.x) {

		if (flag > i) {

			float vi = a[i * blockDim.x + tid];

	 		me = a[idx];

	 		s[tid] = vi * me;

	 		__syncthreads();

	 		int k = blockDim.x / 2;

 			while (k != 0) {

				if(tid < k)

					s[tid] += s[tid + k];

				__syncthreads();

				k /= 2;

			}

			if (blockDim.x % 2 != 0 && tid == 0)

				s[0] += s[blockDim.x - 1];

	 		__syncthreads();

	 		a[idx] -= s[0] * vi;

	 		__threadfence_block();

	 		i++;

		}

	}

	me = a[idx];

	s[tid] = me * me;

	__syncthreads();

	i = blockDim.x / 2;

	while (i != 0) {

		if(tid < i) {

			s[tid] += s[tid + i];

		}

		__syncthreads();

		i /= 2;

	}

	if (blockDim.x % 2 != 0 && tid == 0)

	{

		s[0] += s[blockDim.x - 1];

	}

	__syncthreads();

	a[idx] = me / sqrtf(s[0]);

	__threadfence();

	if (tid == 0) flag++;

	__threadfence();

}

I have also attached this kernel code along with a testing host code.

Any ideas what I’m missing?
orthonormalize.zip (1.34 KB)

I have not gone through your code…

__threadfence() dont guarantee block synchronization… If you are trying to schronize blocks, you are not going anywhere.

Yea I know that and I’m not trying to sync blocks. I use threadfence right after updating the global variable to make every thread read the new values from that point on.