Hi All,

I am doing some matrix calculations that require me to move through the matrix column by column. The only approach I could think of may not be ideal, but it mostly works. I have the following kernel, but its output is inconsistent: when I print the result vector at the end, I often get different results from run to run, even though the input matrix and vector are static and never change. Could this be due to the order in which blocks execute? I’ve tried hard to force CUDA to execute in the order I need, but I’m at a loss. FYI, the code is not the most efficient.

```
__global__ void Matrix(complex* a, complex* b, complex* c) {
    // set initial i value for navigating through matrix
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // set up column counter to move through matrix from left to right
    int colCtr = 0;
    while (colCtr < N) {
        if (blockIdx.x < colCtr) {
            // do something above the diagonal
        }
        if (blockIdx.x == colCtr) {
            // diagonal
            c[colCtr] = a[i] * b[colCtr];
        } else if (blockIdx.x > colCtr) {
            // below diagonal
            c[blockIdx.x] = b[blockIdx.x] + a[i] * b[colCtr];
        }
        // advance i to the next column
        i += blockDim.x * N;
        // increment column counter
        colCtr++;
        // copy new results back to original vector
        b[blockIdx.x] = c[blockIdx.x];
    }
}
```

Here “a” is an N×N matrix, “b” is a vector of length N, and “c” is the output vector of length N.

“complex” is a structure that I put together from two doubles, x and y. I also wrote device code for complex addition, subtraction, multiplication, and division. I am calling the kernel like this:

```
Matrix<<<N,1>>>(d_a, d_b, d_c);
```

N is set to 6000, so the launch uses 6000 blocks of one thread each.
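
For anyone trying to reproduce this, a complex type of the kind described above might look roughly like the following. It is only a sketch under assumptions, not the actual device code: the members x and y follow the description, but treating x as the real part and y as the imaginary part, the hypothetical HD macro, the use of operator overloads, and the textbook division formula are all guesses made for illustration.

```cpp
// Sketch only: builds with nvcc (host + device) or with a plain C++11 compiler.
// HD is a hypothetical helper macro, not something from the original code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Two doubles, as described in the post.
// Assumption: x is the real part and y is the imaginary part.
struct complex {
    double x;
    double y;
};

HD inline complex operator+(complex p, complex q) {
    return {p.x + q.x, p.y + q.y};
}

HD inline complex operator-(complex p, complex q) {
    return {p.x - q.x, p.y - q.y};
}

// (p.x + i*p.y) * (q.x + i*q.y)
HD inline complex operator*(complex p, complex q) {
    return {p.x * q.x - p.y * q.y, p.x * q.y + p.y * q.x};
}

// p / q = p * conj(q) / |q|^2; no guard against q == 0 or overflow.
HD inline complex operator/(complex p, complex q) {
    double d = q.x * q.x + q.y * q.y;
    return {(p.x * q.x + p.y * q.y) / d, (p.y * q.x - p.x * q.y) / d};
}
```

With overloads like these, expressions such as `a[i] * b[colCtr]` and `b[blockIdx.x] + a[i] * b[colCtr]` in the kernel compile as written.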

Thoughts?