I’m working with a sparse matrix, packed similarly to CSR format. Because of the guidelines of the algorithm, I can only process one row at a time. The rows are generally sparse (on average each row has 2.67 elements). The algorithm says I have to do a bunch of work on all the elements below the diagonal and when that is finished, work above the diagonal.

I am looping through the array one row at a time, and after wards I use atomicAdd to see how many threads actually executed the statement, I then increment tid by however many elements there were in that row. It works, as expected, but I run into an issue when I have an element that is above the diagonal. e.g.

num1, num2, x, y:

1.1,2.1,0,0

1.2,2.2,0,1

1.3,2.3,0,1000

2.1,3.1,1,0

2.2,3.2,1,1

2.3,3.3,1,2

2.4,3.4,1,1000

The algorithm starts producing incorrect data at 1,0. ( in the case of my real test case, it is something like row 873, but that is irrelevant for this case ).

Here is basically the code I’m using, and the kernel call. If anyone can spot why it may be returning garbage when a number exists above the diagonal, I would owe you a beer.

```
__global__ void LCompute(cuDoubleComplex *a, cuDoubleComplex* b, cuDoubleComplex *c, cuDoubleComplex *newX, int *index){
int num = 0;
int tid;
*index = 0;
while(num < N){
tid = *index + threadIdx.x;
if(a[tid].px==num && a[tid].py==num)
c[num] = a[tid] * b[num];
if(a[tid].px==num && a[tid].py > num)
c[a[tid].py] = b[a[tid].py] + a[tid] * b[num];
__syncthreads();
if(a[tid].px==num){
atomicAdd(index,1);
}
__syncthreads();
num++;
}
}
```

Kernel call

```
LCompute<<<1,256>>>(d_a, d_b, d_c, d_newX, d_index);
```

btw - index ends up being equal to the number of elements in the array a, which is what I was aiming for. And this is a Fermi based card (GTX 570).