atomic functions

krazanmp · July 9, 2015, 9:56pm

Hello,
I know there is a LOT of discussion on atomic functions, and I have tried to go through most of it (or at least the posts that relate to my issue), so I apologize if something similar has already been answered.

In short, I have points non-uniformly positioned in Cartesian space, and I am trying to filter them down to those that lie within a bounding box. My code looks something like this…

bool PointInRange = (CartesianX >= l_BoxMinX) && (CartesianX <= l_BoxMaxX);
		
// Store any raw points within range
if( PointInRange )
{
	unsigned int InsertIndex = atomicAdd( (unsigned int*)(&(g_CartesianPointCount)), 1 );

	g_CartesianPointsInRangeX[ ListStartIndex + InsertIndex ] = CartesianX;
	g_CartesianPointsInRangeY[ ListStartIndex + InsertIndex ] = CartesianY;
}

The purpose of this code is to pack the relevant points into a list: each thread reserves the index it will insert into using atomicAdd, then simply writes into that reserved index.
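For reference, the reserve-then-write pattern described above can be sketched as a small standalone program. This is only a minimal illustration under assumptions: the kernel name, the helper runPack, the synthetic input (x[i] = i, y[i] = 2i) and the launch configuration are all hypothetical and are not taken from the original code. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread tests one point and, if it passes,
// reserves a unique output slot with atomicAdd before writing to it.
__global__ void packPointsInRange(const float* x, const float* y, int n,
                                  float minX, float maxX,
                                  float* outX, float* outY,
                                  unsigned int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float px = x[i];
    if (px >= minX && px <= maxX) {
        // atomicAdd returns the value stored BEFORE the increment,
        // so no two threads ever receive the same slot.
        unsigned int slot = atomicAdd(count, 1u);
        outX[slot] = px;
        outY[slot] = y[i];
    }
}

// Hypothetical host helper: builds synthetic input, runs the kernel once,
// and returns the number of points that were packed.
unsigned int runPack(int n, float minX, float maxX)
{
    std::vector<float> hx(n), hy(n);
    for (int i = 0; i < n; ++i) { hx[i] = (float)i; hy[i] = 2.0f * i; }

    float *dx, *dy, *dOutX, *dOutY;
    unsigned int* dCount;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMalloc(&dOutX, bytes);   // worst case: every point is in range
    cudaMalloc(&dOutY, bytes);
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    // The counter must be zeroed before every launch.
    cudaMemset(dCount, 0, sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    packPointsInRange<<<blocks, threads>>>(dx, dy, n, minX, maxX,
                                           dOutX, dOutY, dCount);

    unsigned int count = 0;
    cudaMemcpy(&count, dCount, sizeof(count), cudaMemcpyDeviceToHost);

    cudaFree(dx); cudaFree(dy); cudaFree(dOutX); cudaFree(dOutY); cudaFree(dCount);
    return count;
}

int main()
{
    // x values 100..139 inclusive fall inside the box: 40 points.
    printf("points in range: %u\n", runPack(1000, 100.0f, 139.0f));
    return 0;
}
```

In this sketch the order of the packed points varies from run to run, but the final count should not. If the count itself varies, the usual suspects are a counter that was not reset, or a race in code that runs before the atomicAdd.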

The problem is that the results from the atomicAdd are not correct. The more debugging I do, the more correct it is. Given my data set, the end result of g_CartesianPointCount should be in the range of 28-40 (depending on the bounding box). In release mode, however, the resulting count is actually in the range 7-22; in debug mode it is closer to 19-24. If I put a breakpoint in the kernel and make sure it stops every time, I get the correct results.

This looks like what would happen IF a race condition were possible when incrementing g_CartesianPointCount (racing threads double counting the same index, with the result improving when collisions are less likely to occur).

Am I missing something obvious? Is there a better way to insert arbitrary data into a packed list (I don’t care about the order) when the input and output data counts exceed the number of threads in a single warp?

P.S. I am running CUDA3.0 hardware.
P.P.S. And yes, I have already tried using atomicAdd on shared (local) memory and then copying the results of the local calculation to global memory.
P.P.P.S. I had considered writing my own atomicAdd, as suggested for the double version in http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz3fJzUfXCx, but decided that if the unsigned int version of atomicAdd doesn’t seem to be working, there is little reason to assume a version built on atomicCAS would be any better.
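The shared-memory variant mentioned in the P.P.S. is commonly structured as a per-block count followed by a single global reservation per block. The following is a rough sketch of that idea, not the code that was actually tried: the kernel name, the helper runPackAggregated, and the synthetic input (x[i] = i, y[i] = 2i) are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: threads first claim slots in a per-block shared
// counter, then one thread reserves a contiguous range in the global list.
__global__ void packPointsBlockAggregated(const float* x, const float* y, int n,
                                          float minX, float maxX,
                                          float* outX, float* outY,
                                          unsigned int* globalCount)
{
    __shared__ unsigned int blockCount;  // points found by this block
    __shared__ unsigned int blockBase;   // first global slot for this block

    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float px = 0.0f, py = 0.0f;
    bool inRange = false;
    unsigned int localSlot = 0;

    // No early return: every thread must reach the barriers below.
    if (i < n) {
        px = x[i];
        py = y[i];
        inRange = (px >= minX) && (px <= maxX);
    }
    if (inRange) localSlot = atomicAdd(&blockCount, 1u);
    __syncthreads();  // blockCount is final after this barrier

    if (threadIdx.x == 0) blockBase = atomicAdd(globalCount, blockCount);
    __syncthreads();  // blockBase is visible to the whole block

    if (inRange) {
        outX[blockBase + localSlot] = px;
        outY[blockBase + localSlot] = py;
    }
}

// Hypothetical host helper: runs the kernel once on synthetic data
// and returns the packed count.
unsigned int runPackAggregated(int n, float minX, float maxX)
{
    std::vector<float> hx(n), hy(n);
    for (int i = 0; i < n; ++i) { hx[i] = (float)i; hy[i] = 2.0f * i; }

    float *dx, *dy, *dOutX, *dOutY;
    unsigned int* dCount;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMalloc(&dOutX, bytes);
    cudaMalloc(&dOutY, bytes);
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));  // reset before each launch

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    packPointsBlockAggregated<<<blocks, threads>>>(dx, dy, n, minX, maxX,
                                                   dOutX, dOutY, dCount);

    unsigned int count = 0;
    cudaMemcpy(&count, dCount, sizeof(count), cudaMemcpyDeviceToHost);

    cudaFree(dx); cudaFree(dy); cudaFree(dOutX); cudaFree(dOutY); cudaFree(dCount);
    return count;
}

int main()
{
    printf("points in range: %u\n", runPackAggregated(1000, 100.0f, 139.0f));
    return 0;
}
```

The two __syncthreads() calls matter here: without the first, thread 0 could reserve the block's range before all threads have counted themselves; without the second, threads could read blockBase before it is written.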

Robert_Crovella · July 9, 2015, 10:06pm

I don’t see anything wrong with the code you have posted (making various assumptions, since it’s incomplete).

I think it’s unlikely that atomicAdd is broken in any way.

I think it’s likely that you have a race condition somewhere else in your code.

If you can provide a short, complete code that reproduces the issue, I’ll bet it would be sliced and diced pretty quickly.

Otherwise, try running your code with cuda-memcheck, including its various sub-tool options such as initcheck, racecheck, and synccheck.

http://docs.nvidia.com/cuda/cuda-memcheck/index.html#abstract

krazanmp · July 9, 2015, 10:58pm

Wow, that was easy.

Turns out you were right: there was a race when loading two dependent data elements from global memory, and that was causing this issue.

Thank you for reminding me about this tool. What a lifesaver.
