Incrementing a "counter"

I need my kernel to return an array that’s “lean” (no empty entries between indices), but going about that in a parallel manner is a bit more complicated than I thought.

EDIT: Ugh, I just pasted some code but apparently there’s a low limit for chars per post

__global__ void getNearbyPoints(Point pt, Point* pts, double dist, Point* ret) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (hypot((double)pt.x - pts[idx].x, (double)pt.y - pts[idx].y) <= dist) {
        ret[count] = pts[idx];
        atomicAdd(&count, 1);
    }
}

(count is a __device__ variable.) The additions seem to take effect only after all the threads are done executing.

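Those two statements are not a critical section, so other threads can read count between the write and the increment. Use index = atomicAdd(&count, 1); ret[index] = pts[idx]; instead. Another way is to use the compact function from cudpp, or maybe thrust.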
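To make the suggestion concrete, here is a minimal sketch of the kernel with that change applied. The Point layout, the extra n parameter for the bounds check, and the host helper nearbyPoints are assumptions added for illustration; none of them appear in the original post.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <vector>

// Hypothetical Point layout; the original post does not show its definition.
struct Point { int x, y; };

// Number of points written so far (the "count" device variable from the post).
__device__ int count;

__global__ void getNearbyPoints(Point pt, const Point* pts, int n, double dist, Point* ret)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // bounds check; n is an added parameter
    if (hypot((double)pt.x - pts[idx].x, (double)pt.y - pts[idx].y) <= dist) {
        // atomicAdd returns the value count had before the increment,
        // so every accepting thread reserves a slot of its own.
        int index = atomicAdd(&count, 1);
        ret[index] = pts[idx];
    }
}

// Hypothetical host helper: runs the kernel and returns the compacted result.
std::vector<Point> nearbyPoints(Point pt, const std::vector<Point>& pts, double dist)
{
    std::vector<Point> out;
    int n = (int)pts.size();
    if (n == 0) return out;

    Point *dPts, *dRet;
    cudaMalloc(&dPts, n * sizeof(Point));
    cudaMalloc(&dRet, n * sizeof(Point));
    cudaMemcpy(dPts, pts.data(), n * sizeof(Point), cudaMemcpyHostToDevice);

    int zero = 0;
    cudaMemcpyToSymbol(count, &zero, sizeof(int));  // reset the counter before each launch

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    getNearbyPoints<<<blocks, threads>>>(pt, dPts, n, dist, dRet);

    int found = 0;
    cudaMemcpyFromSymbol(&found, count, sizeof(int));  // blocks until the kernel is done
    out.resize(found);
    cudaMemcpy(out.data(), dRet, found * sizeof(Point), cudaMemcpyDeviceToHost);

    cudaFree(dPts);
    cudaFree(dRet);
    return out;
}
```

Note that the output has no gaps, but the order of the points in ret depends on which thread reaches atomicAdd first, so it can differ from run to run.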

It sounds like you’re looking for stream compaction. As Lev stated, you can find functionality for this built into cudpp and thrust (thrust is included automatically in CUDA 5). You should consider taking a look at:

You would need a different functor for removing elements than used in the example of course, but it should convey the idea pretty well.
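As a rough idea of what that could look like for this problem, here is a sketch using thrust::copy_if with a distance predicate. This is not the linked example; the Point layout, the IsNearby functor, and the nearbyPointsThrust helper are hypothetical names made up for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <math.h>
#include <vector>

// Hypothetical Point layout; the original post does not show its definition.
struct Point { int x, y; };

// Hypothetical functor: true for points within dist of pt.
struct IsNearby {
    Point pt;
    double dist;
    __host__ __device__ bool operator()(const Point& p) const {
        return hypot((double)pt.x - p.x, (double)pt.y - p.y) <= dist;
    }
};

// Hypothetical helper: compacts pts down to the points near pt.
std::vector<Point> nearbyPointsThrust(Point pt, const std::vector<Point>& pts, double dist)
{
    thrust::device_vector<Point> dPts(pts.begin(), pts.end());  // copy input to the device
    thrust::device_vector<Point> dRet(pts.size());              // worst case: every point kept

    IsNearby pred = {pt, dist};
    // copy_if writes the kept elements contiguously and returns the end of the output.
    thrust::device_vector<Point>::iterator end =
        thrust::copy_if(dPts.begin(), dPts.end(), dRet.begin(), pred);

    std::vector<Point> out(end - dRet.begin());
    thrust::copy(dRet.begin(), end, out.begin());  // copy the compacted range back to the host
    return out;
}
```

Unlike the atomicAdd approach, copy_if is stable, so the kept points stay in their input order.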

Thanks, right now I’m using the fix Lev posted, although I don’t wholly understand why it works since atomicAdd is supposed to return the old value according to the documentation. I’ll see how much atomic operations end up slowing the process down, that stream compaction example seems to do exactly what I wanted…


If you use index = atomicAdd(&count, 1), index is set to the value count had before the increment, while count itself goes up by 1. Because the read and the increment happen as one atomic step, no two threads can get the same index. So if the array ends up holding countmax elements, the indices handed out are exactly 0 to countmax-1.
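If you want to convince yourself, a small experiment along these lines should do it: every thread records the index it got back from atomicAdd, and the host sorts and returns them. The names counter, reserveSlots and reservedSlotsSorted are made up for this sketch, and it assumes n is greater than 0.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical counter, standing in for "count" from the original post.
__device__ int counter;

// Each thread stores the old value that atomicAdd handed back to it.
__global__ void reserveSlots(int* got, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) got[tid] = atomicAdd(&counter, 1);
}

// Hypothetical host helper: returns the indices handed out, sorted ascending.
std::vector<int> reservedSlotsSorted(int n)
{
    int* dGot;
    cudaMalloc(&dGot, n * sizeof(int));

    int zero = 0;
    cudaMemcpyToSymbol(counter, &zero, sizeof(int));  // start counting from 0

    reserveSlots<<<(n + 255) / 256, 256>>>(dGot, n);

    std::vector<int> got(n);
    cudaMemcpy(got.data(), dGot, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dGot);

    std::sort(got.begin(), got.end());
    return got;
}
```

After sorting, the result should be 0, 1, ..., n-1 with no duplicates and no gaps, which is exactly why writing to ret[index] is safe.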