I have a large linear buffer of floats in device memory. I need to scan through the buffer and collect all float values larger than 0, along with their indexes into the buffer. Ideally all the {int index; float value} pairs should be stored continuously in device memory for easy transfer back to the host. Note that the number of values larger than zero will typically be very small compared to the total number of values, usually less than 0.1%, although there is no guaranteed maximum (other than the total number of values)
I’ve been thinking about it a bit but failed to see an obvious and efficient solution. Any tips?
Thanks for the suggestion! That would definitively be one way of solving it, although it seems like a bit of overkill in my case to do a full prefix sum etc just to save those very few items larger than zero. Especially since the big array doesn’t really need to be stored in global memory anyway, the values are generated on the fly by another kernel. The solution I would prefer is to somehow be able to record the generated values that are > 0 (plus their index) on the fly as well. With atomic operations I guess it would be easy, but I’m running on a 8800GTS.
Well, with atomics you don’t have any guarantee with regard to ordering, so you need an extra sorting step. But I agree that with such sparse output that might be preferable.
I don’t need the output to be sorted, so that’s not an issue. Anyway, I just got it working using the normal prefix sum method. The compaction step makes up around 15-20% of the total computation time, which I guess is acceptable. If I get hold of a 1.1 card I will try the atomics solution. Thanks for your suggestions!
I see people discuss about reduction, prefix-sum etc… I am curious to know how they can be of any help here.
WHy I ask is: To figure out how many non-zero entries r there in an array, one has to scan atleast the whole array. How can a prefix-sum or whatever, reduce this scan of the entire array?
On my 8800GTX, I would spawn 128 blocks (8 blocks on 16 MPs), where each block would just scan for non-zeroes… Writing the results back is the big problem. BUt one can allocate global memory for each block. So each block can pool in their result…
The only problem I see here is that the output may have to be sorted…
Not really. To me it seems like you still have your selected subset scattered all around in different memory locations after step 4. Besides, how would you synchronize the updating of the subsets in step 3, and how would you efficiently “collect the results back” in step 5?