I need my kernel to return an array that’s “lean” (no empty entries between indices), but going about that in a parallel manner is a bit more complicated than I thought.
EDIT: Ugh, I just pasted some code, but apparently there’s a low character limit per post.
It sounds like you’re looking for stream compaction. As Lev stated, you can find functionality for this built into CUDPP and Thrust (Thrust is included automatically in CUDA 5). You should consider taking a look at the example hosted on the Google Code Archive.
You would need a different functor for removing elements than used in the example of course, but it should convey the idea pretty well.
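To make the Thrust suggestion a bit more concrete, here is a minimal sketch of stream compaction with thrust::copy_if. It is not the linked example; the is_valid functor, the choice of -1 as the "empty" marker, and the compact_with_thrust helper are all assumptions made up for illustration, so swap in whatever test fits your data.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <vector>

// Hypothetical predicate: an entry is kept unless it equals the
// assumed "empty" marker (-1). Replace with your own condition.
struct is_valid
{
    __host__ __device__
    bool operator()(const int x) const { return x != -1; }
};

// Hypothetical helper: returns the non-empty entries of `input`,
// packed densely. thrust::copy_if is stable, so order is preserved.
std::vector<int> compact_with_thrust(const std::vector<int>& input)
{
    thrust::device_vector<int> d_in(input.begin(), input.end());
    thrust::device_vector<int> d_out(d_in.size());

    // copy_if returns an iterator one past the last element it wrote.
    thrust::device_vector<int>::iterator out_end =
        thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), is_valid());
    d_out.resize(out_end - d_out.begin());

    // Copy the compacted data back to the host.
    std::vector<int> result(d_out.size());
    thrust::copy(d_out.begin(), d_out.end(), result.begin());
    return result;
}
```

Unlike the atomicAdd approach, this keeps the surviving elements in their original relative order, which may or may not matter for your use case.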
Thanks, right now I’m using the fix Lev posted, although I don’t wholly understand why it works, since atomicAdd is supposed to return the old value according to the documentation. I’ll see how much the atomic operations end up slowing the process down; that stream compaction example seems to do exactly what I wanted…
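If you use index=atomicAdd(&count,1), index will be equal to the value of count before the increment, while count itself goes up by 1. Returning the old value is exactly what makes this work: the first thread to get there receives 0, the next receives 1, and so on, so every thread ends up with its own slot. If you have an array of size countmax and no more than countmax elements are written, index will always be between 0 and countmax-1.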
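Here is a rough sketch of how that pattern might look in a full kernel. It is not Lev's original fix (that code isn't shown in this thread); the kernel name, the -1 "empty" marker, and the compact_with_atomics host helper are assumptions for illustration, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread holding a non-empty value reserves one output slot.
// The value -1 is an assumed marker for "empty entry".
__global__ void compact_atomic(const int* in, int n, int* out, unsigned int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] != -1) {
        // atomicAdd returns the value of *count BEFORE the increment,
        // so every thread that gets here receives a distinct index.
        unsigned int index = atomicAdd(count, 1u);
        out[index] = in[i];
    }
}

// Hypothetical host helper: returns the compacted array.
// The output order depends on thread scheduling and is NOT preserved.
std::vector<int> compact_with_atomics(const std::vector<int>& input)
{
    const int n = static_cast<int>(input.size());
    if (n == 0) return std::vector<int>();

    int* d_in = 0;
    int* d_out = 0;
    unsigned int* d_count = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));   // worst case: nothing is removed
    cudaMalloc(&d_count, sizeof(unsigned int));

    cudaMemcpy(d_in, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(unsigned int));   // counter starts at 0

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    compact_atomic<<<blocks, threads>>>(d_in, n, d_out, d_count);

    // The final counter value is the number of elements written.
    unsigned int count = 0;
    cudaMemcpy(&count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    std::vector<int> result(count);
    cudaMemcpy(result.data(), d_out, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return result;
}
```

Since all threads hit the same counter, this can get slow when many elements survive; that is the trade-off against a scan-based compaction like the one in Thrust.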