Can't resolve a race condition: I'm getting random, different outputs...

I’m trying to implement a matrix orthonormalization algorithm. It works fine with small matrices, but I get weird outputs with larger inputs. I suspect a race condition, although I believe I have done everything needed to avoid memory hazards. Threads in my algorithm need to modify the input matrix (a global variable), so I use __threadfence() to flush the writes and make sure every other thread sees the new values. Still, I sometimes get wrong results. It produces the right answer most of the time: the last time I tested, 11 out of 15 runs of the application gave correct results.

Here is the kernel code:

``````
#include <math.h>

/*
 The following function implements the stabilized Gram–Schmidt orthonormalization.
 Matrix "a" will contain the orthogonal matrix "Q" at the end.
*/

__device__ unsigned int flag = 0; // number of orthogonalized columns

__global__ void orthonormalize(float* a) {
    __shared__ float s[BLOCK_DIM];
    int idx = blockIdx.x * blockDim.x + tid;
    float me;
    int i = 0;
    while (i < blockIdx.x) {
        if (flag > i) {
            float vi = a[i * blockDim.x + tid];
            me = a[idx];
            s[tid] = vi * me;
            int k = blockDim.x / 2;
            while (k != 0) {
                if(tid < k)
                    s[tid] += s[tid + k];
                k /= 2;
            }
            if (blockDim.x % 2 != 0 && tid == 0)
                s[0] += s[blockDim.x - 1];
            a[idx] -= s[0] * vi;
            i++;
        }
    }
    me = a[idx];
    s[tid] = me * me;
    i = blockDim.x / 2;
    while (i != 0) {
        if(tid < i) {
            s[tid] += s[tid + i];
        }
        i /= 2;
    }
    if (blockDim.x % 2 != 0 && tid == 0)
    {
        s[0] += s[blockDim.x - 1];
    }
    a[idx] = me / sqrtf(s[0]);
    if (tid == 0) flag++;
}
``````

I have also attached this kernel code along with host code for testing it.

Any ideas what I’m missing?
orthonormalize.zip (1.34 KB)
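
One thing that may be worth checking in the kernel as posted: the two tree reductions over `s` have no `__syncthreads()` between steps, so a thread can read `s[tid + k]` or `s[0]` before the thread that owns it has written it. The `blockDim.x / 2` halving can also skip elements when the block size is not a power of two (with 6 threads, `s[2]` is never added into `s[0]`). Below is a minimal sketch of a block-wide sum with barriers, assuming at most 1024 threads per block. It is not the poster's code: `blockSum`, `sumKernel`, `sumOnes`, and `MAX_THREADS` are hypothetical names, and the round-up halving is a swapped-in variant chosen to handle odd sizes.

```cuda
#include <cuda_runtime.h>

constexpr int MAX_THREADS = 1024;   // assumed upper bound on blockDim.x

// Hypothetical helper: returns the sum of v over all threads of the block.
// s must point to shared memory with at least blockDim.x floats.
__device__ float blockSum(float v, float* s) {
    int tid = threadIdx.x;
    s[tid] = v;
    __syncthreads();                     // every partial value is now in s
    for (int n = blockDim.x; n > 1; ) {
        int half = (n + 1) / 2;          // round up so odd sizes lose nothing
        if (tid < n - half)
            s[tid] += s[tid + half];
        __syncthreads();                 // finish this level before the next
        n = half;
    }
    float total = s[0];
    __syncthreads();                     // s can be reused safely afterwards
    return total;
}

// Each block writes the sum of its slice of in to out[blockIdx.x].
__global__ void sumKernel(const float* in, float* out) {
    __shared__ float s[MAX_THREADS];
    float total = blockSum(in[blockIdx.x * blockDim.x + threadIdx.x], s);
    if (threadIdx.x == 0)
        out[blockIdx.x] = total;
}

// Host-side check: sums `threads` ones in a single block.
float sumOnes(int threads) {
    float *in, *out;
    cudaMallocManaged(&in, threads * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < threads; ++i)
        in[i] = 1.0f;
    sumKernel<<<1, threads>>>(in, out);
    cudaDeviceSynchronize();
    float result = *out;
    cudaFree(in);
    cudaFree(out);
    return result;
}
```

The barriers sit outside the `if`, so every thread of the block reaches them. In the orthonormalization kernel, the same idea would apply to both the dot-product reduction and the norm reduction.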

I have not gone through your code…

__threadfence() doesn't guarantee block synchronization… If you are trying to synchronize blocks, you are not going to get anywhere.

Yeah, I know that, and I’m not trying to sync blocks. I use __threadfence() right after updating the global variable so that every thread reads the new values from that point on.
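
For what it's worth, a fence only orders the writes of the thread that calls it. The reading side still has to re-load the counter from memory on every poll, and the counter should only move once all threads of the writing block are done. Here is a minimal sketch of that publish/wait pattern, separate from the Gram–Schmidt math. It assumes the grid is small enough that all blocks are resident on the GPU at once (otherwise a waiting block could spin forever), and `ready`, `chain`, and `runChain` are hypothetical names, not part of the posted code.

```cuda
#include <cuda_runtime.h>

__device__ unsigned int ready = 0;   // number of blocks that have published

// Block b waits for block b-1's row, adds 1 to each element, then publishes.
__global__ void chain(float* data) {
    unsigned int b = blockIdx.x;
    unsigned int t = threadIdx.x;
    unsigned int n = blockDim.x;
    float prev = 0.0f;
    if (b > 0) {
        // The volatile read forces a fresh load on every iteration, so the
        // compiler cannot keep the counter in a register.
        while (*(volatile unsigned int*)&ready < b) { }
        __threadfence();             // order the counter read before the data read
        prev = ((volatile float*)data)[(b - 1) * n + t];
    }
    data[b * n + t] = prev + 1.0f;
    __threadfence();                 // this thread's write is visible device-wide
    __syncthreads();                 // every thread of the block has written
    if (t == 0)
        atomicAdd(&ready, 1u);       // publish: one more row is complete
}

// Host-side check: runs the chain and returns the last element written.
float runChain(int blocks) {
    const int threads = 32;
    unsigned int zero = 0;
    cudaMemcpyToSymbol(ready, &zero, sizeof(zero));   // reset between launches
    float* data;
    cudaMallocManaged(&data, blocks * threads * sizeof(float));
    chain<<<blocks, threads>>>(data);
    cudaDeviceSynchronize();
    float last = data[blocks * threads - 1];
    cudaFree(data);
    return last;
}
```

Because block b only increments the counter after seeing it reach b, the increments happen in block order, which is the same invariant the `flag > i` test relies on. Note that the counter has to be reset before each launch, since a `__device__` variable keeps its value between kernel calls.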