atomic functions

I know there is a LOT of discussion on atomic functions, and I have tried to go through most of it (or at least the posts that relate to my issue), so I apologize if something similar has already been answered.

In short, I have points non-uniformly positioned in Cartesian space, and I am trying to filter them by whether or not they fall within a bounding box. My code looks something like this:

bool PointInRange = (CartesianX >= l_BoxMinX) && (CartesianX <= l_BoxMaxX);
// Store any raw points within range
if( PointInRange )
{
	// atomicAdd returns the old count, which becomes this thread's reserved slot
	unsigned int InsertIndex = atomicAdd( (unsigned int*)(&(g_CartesianPointCount)), 1 );

	g_CartesianPointsInRangeX[ ListStartIndex + InsertIndex ] = CartesianX;
	g_CartesianPointsInRangeY[ ListStartIndex + InsertIndex ] = CartesianY;
}

The purpose of this code is to pack the relevant points into a list: each thread reserves its insertion index with atomicAdd and then simply writes into the index it reserved.
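For readers who want to try the reservation pattern in isolation, here is a minimal, self-contained sketch. It is not the original program: the kernel name, the host helper countPointsInRange, the launch configuration, and the test data are all assumptions made for illustration, ListStartIndex is left out, and error checking is omitted for brevity. It only filters on X, as the snippet above does.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical name): one thread per input point.
__global__ void packPointsInRange(const float* x, const float* y, int n,
                                  float boxMinX, float boxMaxX,
                                  float* outX, float* outY,
                                  unsigned int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float px = x[i];
    float py = y[i];
    if (px >= boxMinX && px <= boxMaxX) {
        // atomicAdd returns the value held before the increment, so each
        // thread that reaches this line receives a distinct slot.
        unsigned int slot = atomicAdd(count, 1u);
        outX[slot] = px;
        outY[slot] = py;
    }
}

// Hypothetical host helper: runs the kernel and returns the packed count.
unsigned int countPointsInRange(const std::vector<float>& x,
                                const std::vector<float>& y,
                                float boxMinX, float boxMaxX)
{
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);

    float *dX, *dY, *dOutX, *dOutY;
    unsigned int* dCount;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dY, bytes);
    cudaMalloc(&dOutX, bytes);   // worst case: every point is in range
    cudaMalloc(&dOutY, bytes);
    cudaMalloc(&dCount, sizeof(unsigned int));

    cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dY, y.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));  // counter must start at zero

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    packPointsInRange<<<blocks, threads>>>(dX, dY, n, boxMinX, boxMaxX,
                                           dOutX, dOutY, dCount);

    unsigned int count = 0;
    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&count, dCount, sizeof(count), cudaMemcpyDeviceToHost);

    cudaFree(dX); cudaFree(dY); cudaFree(dOutX); cudaFree(dOutY); cudaFree(dCount);
    return count;
}

int main()
{
    // Made-up test data: x = 0, 1, ..., 999 and y = 2x.
    std::vector<float> x(1000), y(1000);
    for (int i = 0; i < 1000; ++i) { x[i] = (float)i; y[i] = 2.0f * i; }

    unsigned int count = countPointsInRange(x, y, 100.0f, 139.0f);
    assert(count == 40u);  // x = 100 .. 139 inclusive
    printf("points in range: %u\n", count);
    return 0;
}
```

Two things in the sketch are easy to get wrong in a larger program: the counter has to be zeroed before every launch, and the host should only read it after the kernel has completed.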

The problem is that the results from the atomicAdd are not correct, and the more debugging I do, the more correct they become. Given my data set, the final value of g_CartesianPointCount should be in the range of 28-40 (depending on the bounding box). If I let it run in release mode, the resulting count is actually in the range 7-22. If I run it in debug mode, the resulting count is more in the range 19-24. If I put a breakpoint in the kernel and make sure it stops every time, I get the correct results.

This looks like what would happen IF a race condition were possible when incrementing g_CartesianPointCount (racing threads double-counting the same index, with the result improving when collisions are less likely to occur).

Am I missing something obvious? Is there a better way to insert arbitrary data into a packed list (I don't care about the order) when the input and output data counts exceed the number of threads in a single warp?

P.S. I am running CUDA3.0 hardware.
P.P.S. And yes, I have already tried using atomicAdd on shared (local) memory and then copying the results of the local calculation to global memory.
P.P.P.S. I had considered writing my own atomicAdd, as suggested elsewhere for the double version, but decided that if the unsigned int version of atomicAdd doesn't seem to be working, why should I assume that a version using atomicCAS would be any better?
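The shared-memory variant mentioned in the P.P.S. can be sketched roughly as follows. This is an illustration under assumptions, not the code that was actually tried: the kernel name, the host helper packBlockAggregated, and the test data are hypothetical, and error checking is omitted. Each block counts its matches in shared memory, one thread reserves a contiguous range in the global list, and then every matching thread writes to its own slot inside that range.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical name): one global atomicAdd per block.
__global__ void packPointsBlockAggregated(const float* x, const float* y, int n,
                                          float boxMinX, float boxMaxX,
                                          float* outX, float* outY,
                                          unsigned int* globalCount)
{
    __shared__ unsigned int blockCount;  // matches found by this block
    __shared__ unsigned int blockBase;   // first global slot owned by this block

    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool inRange = false;
    float px = 0.0f, py = 0.0f;
    unsigned int localSlot = 0;
    if (i < n) {  // no early return: every thread must reach the barriers below
        px = x[i];
        py = y[i];
        inRange = (px >= boxMinX) && (px <= boxMaxX);
        if (inRange) localSlot = atomicAdd(&blockCount, 1u);
    }
    __syncthreads();  // all local reservations are complete

    if (threadIdx.x == 0) blockBase = atomicAdd(globalCount, blockCount);
    __syncthreads();  // blockBase is now visible to the whole block

    if (inRange) {
        outX[blockBase + localSlot] = px;
        outY[blockBase + localSlot] = py;
    }
}

// Hypothetical host helper: returns the packed X values (order unspecified).
std::vector<float> packBlockAggregated(const std::vector<float>& x,
                                       const std::vector<float>& y,
                                       float boxMinX, float boxMaxX)
{
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);

    float *dX, *dY, *dOutX, *dOutY;
    unsigned int* dCount;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dY, bytes);
    cudaMalloc(&dOutX, bytes);
    cudaMalloc(&dOutY, bytes);
    cudaMalloc(&dCount, sizeof(unsigned int));

    cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dY, y.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    packPointsBlockAggregated<<<blocks, threads>>>(dX, dY, n, boxMinX, boxMaxX,
                                                   dOutX, dOutY, dCount);

    unsigned int count = 0;
    cudaMemcpy(&count, dCount, sizeof(count), cudaMemcpyDeviceToHost);

    std::vector<float> packed(count);
    if (count > 0)
        cudaMemcpy(packed.data(), dOutX, count * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dX); cudaFree(dY); cudaFree(dOutX); cudaFree(dOutY); cudaFree(dCount);
    return packed;
}

int main()
{
    // Made-up test data: x = 0, 1, ..., 999 and y = 2x.
    std::vector<float> x(1000), y(1000);
    for (int i = 0; i < 1000; ++i) { x[i] = (float)i; y[i] = 2.0f * i; }

    std::vector<float> packed = packBlockAggregated(x, y, 100.0f, 139.0f);
    assert(packed.size() == 40);  // x = 100 .. 139 inclusive
    printf("packed %zu points\n", packed.size());
    return 0;
}
```

The bounds check wraps the work instead of returning early so that every thread in the block reaches each __syncthreads() call.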

I don’t see anything wrong with the code you have posted (making various assumptions, since it’s incomplete).

I think it’s unlikely that atomicAdd is broken in any way.

I think it’s likely that you have a race condition somewhere else in your code.

If you can provide a short, complete code that reproduces the issue, I’ll bet it would be sliced and diced pretty quickly.

Otherwise, try running your code with cuda-memcheck, plus its various sub-tool options such as initcheck, racecheck, and synccheck.

Wow, that was easy.

Turns out you were right: there was a race when loading two dependent data elements from global memory, and that was causing this issue.

Thank you for reminding me about this tool. What a lifesaver.