How to run this small portion of code run in parallel?

I have this small portion of code that is looking at a large sparse matrix and finding how many non-zero values per row there are. This array, called d_mesh->nnzPerRow is initialized to all zeros before running this function.

Below is the code. When I run it in serial with just 1 thread (and 1 block) it works and gives the correct output for the first 5 rows : 3 4 3 4 5. When I try to run it in parallel by passing a numBlocks and numThreads parameters into the kernel function, it does not work and gives me an output of : 1 1 1 1 1.

Is there any way to run this in parallel where each thread can see the most recent value of d_mesh->nzxPerRow before updating it? Thanks.

__global__ void implicitNZPR(mesh * d_mesh)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	int J;

	int Nwidth = (d_mesh->Nx-2)*(d_mesh->Ny-2); //Matrix width/height

	//PARALLEL (NOT WORKING)
	/*
	if(index < d_mesh->Ni)
	{
		int *A = d_getIJK3D(index, Nwidth, Nwidth, 0); //Get the I,J indices and offset by 1 since CPP starts at 0
		J = A[1] + 1;

		if(d_mesh->implicitA[index] != 0)
		{
			d_mesh->nnzPerRow[J] += 1;
		}
	}
	*/

	//SERIAL (WORKS AS INTENDED)
	for(int i=0; i<d_mesh->Ni; i++)
	{
		int *A = d_getIJK3D(i, Nwidth, Nwidth, 0);
		J = A[1] + 1;

		if(d_mesh->implicitA[i] != 0)
		{
			d_mesh->nnzPerRow[J] += 1;
		}
	}
}

If you have multiple threads working on each row, then they may be stepping on each other as they try to update d_mesh->nnzPerRow. This is a typical parallel race condition. Four possible methods to sort this out:

  1. Use atomics:

instead of:

d_mesh->nnzPerRow[J] += 1;

try:

atomicAdd(d_mesh->nnzPerRow + J, 1);
  1. Use a classical parallel reduction (per row, perhaps)

http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction

  1. Use one thread and one for-loop per row.

  2. CUSPARSE provides a function to do something like this:

http://docs.nvidia.com/cuda/cusparse/index.html#cusparse-lt-t-gt-nnz

If your matrix is extremely sparse, the atomics method may give better performance than the reduction method. The CUSPARSE method may be fastest.