atomicAdd question

I’m working with a sparse matrix, packed similarly to CSR format. Because of the constraints of the algorithm, I can only process one row at a time. The rows are generally sparse (on average each row has 2.67 elements). The algorithm requires me to do a bunch of work on all the elements below the diagonal and, when that is finished, on the elements above the diagonal.
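
Conceptually, each packed element looks something like this (an illustrative declaration, not my exact code; my real struct also overloads the complex arithmetic used further down):

	#include <cuComplex.h>

	// one nonzero of the matrix: explicit row/column indices plus the value;
	// entries are stored row by row, CSR-like but without a row-pointer array
	struct Element {
		int px, py;           // row index, column index
		cuDoubleComplex val;  // the nonzero value at (px, py)
	};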

I am looping through the array one row at a time, and afterwards I use atomicAdd to count how many threads actually executed the statement; I then increment tid by however many elements there were in that row. It works as expected, but I run into an issue when I have an element above the diagonal, e.g.

num1, num2, x, y:

(example entries omitted)
The algorithm starts producing incorrect data at 1,0 (in my real test case it is something like row 873, but that is irrelevant here ;) ).

Here is basically the code I’m using, and the kernel call. If anyone can spot why it may be returning garbage when a number exists above the diagonal, I would owe you a beer.

__global__ void LCompute(cuDoubleComplex *a, cuDoubleComplex *b, cuDoubleComplex *c, cuDoubleComplex *newX, int *index){

	// N is the number of rows, defined elsewhere; the element type is really
	// a custom struct carrying px/py indices plus the complex value
	// (cuDoubleComplex itself has no px/py members or operator*), so read
	// this as "basically" the code
	int num = 0;

	int tid;

	*index = 0;

	while(num < N){

		tid = *index + threadIdx.x;

		// diagonal element of row num
		if(a[tid].px==num && a[tid].py==num)

			c[num] = a[tid] * b[num];

		// element above the diagonal in row num
		if(a[tid].px==num && a[tid].py > num)

			c[a[tid].py] = b[a[tid].py] + a[tid] * b[num];

		// the atomicAdd mentioned above: count the threads that actually
		// executed for this row, so *index moves past the row's elements
		// before the next pass
		if(a[tid].px==num)
			atomicAdd(index, 1);

		__syncthreads();	// so the whole block sees the updated *index

		num++;
	}
}

Kernel call

LCompute<<<1,256>>>(d_a, d_b, d_c, d_newX, d_index);
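
For reference, the host side around that call is roughly this (allocation of d_a/d_b/d_c/d_newX omitted; names as above):

	int h_index = 0;
	int *d_index;
	cudaMalloc(&d_index, sizeof(int));
	cudaMemcpy(d_index, &h_index, sizeof(int), cudaMemcpyHostToDevice);

	LCompute<<<1,256>>>(d_a, d_b, d_c, d_newX, d_index);

	cudaMemcpy(&h_index, d_index, sizeof(int), cudaMemcpyDeviceToHost);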

btw - index ends up being equal to the number of elements in the array a, which is what I was aiming for. And this is a Fermi-based card (GTX 570).

When stepping through the code with the debugger, it looks like threads are being deactivated - which is what I wanted - but in each iteration of the while loop I want all of the threads to be active again. How would I go about doing this?
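
In other words, I think what I want is for every thread to stay inside the while loop and just be predicated off for rows it owns no element of, with the block re-converging at the bottom of each iteration. A minimal sketch of that shape (illustrative kernel with made-up names, not my real code):

	__global__ void RowLoopSketch(const int *rowOf, int nnz, int nRows){
		int tid = threadIdx.x;
		for(int num = 0; num < nRows; num++){
			// only threads owning an element of row num do work;
			// the rest fall straight through to the barrier
			if(tid < nnz && rowOf[tid] == num){
				// ... per-element work for row num ...
			}
			__syncthreads();	// whole block active again before the next row
		}
	}

Is that the right way to think about it, or is there an explicit mechanism for reactivating threads?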