I’m trying to implement a matrix orthonormalization algorithm. There’s no problem with small matrices but I’m getting weired outputs with larger inputs. I suppose this is due to some racing but I have considered everything to avoid any memory hazard. Threads in my algorithm need to modify the input matrix (a global variable). So I use __threadfence() to flush and make sure every other thread gets the new values. But still, although I have tried to consider everything, I get wrong results sometimes. It produces the right answer most of the time though. Last time I tested, I got 11 correct results out of 15 times of running the application.

Here is the kernel code:

```
#include <math.h>
/*
The following function implements the stabilized Gramâ€“Schmidt orthonormalization.
Matrix "a" will contain the orthogonal matrix "Q" at the end.
*/
__device__ unsigned int flag = 0; // number of orthogonalized columns
__global__ void orthonormalize(float* a) {
__shared__ float s[BLOCK_DIM];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + tid;
float me;
int i = 0;
while (i < blockIdx.x) {
if (flag > i) {
float vi = a[i * blockDim.x + tid];
me = a[idx];
s[tid] = vi * me;
__syncthreads();
int k = blockDim.x / 2;
while (k != 0) {
if(tid < k)
s[tid] += s[tid + k];
__syncthreads();
k /= 2;
}
if (blockDim.x % 2 != 0 && tid == 0)
s[0] += s[blockDim.x - 1];
__syncthreads();
a[idx] -= s[0] * vi;
__threadfence_block();
i++;
}
}
me = a[idx];
s[tid] = me * me;
__syncthreads();
i = blockDim.x / 2;
while (i != 0) {
if(tid < i) {
s[tid] += s[tid + i];
}
__syncthreads();
i /= 2;
}
if (blockDim.x % 2 != 0 && tid == 0)
{
s[0] += s[blockDim.x - 1];
}
__syncthreads();
a[idx] = me / sqrtf(s[0]);
__threadfence();
if (tid == 0) flag++;
__threadfence();
}
```

I have also attached this kernel code along with a testing host code.

Any ideas what I’m missing?

orthonormalize.zip (1.34 KB)