Picking values > 0 from a array of floats

Hi,

I have a large linear buffer of floats in device memory. I need to scan through the buffer and collect all float values larger than 0, along with their indexes into the buffer. Ideally all the {int index; float value} pairs should be stored continuously in device memory for easy transfer back to the host. Note that the number of values larger than zero will typically be very small compared to the total number of values, usually less than 0.1%, although there is no guaranteed maximum (other than the total number of values)

I’ve been thinking about it a bit but failed to see an obvious and efficient solution. Any tips?

/Lars

Use a list compaction algorithm, derived from scan, see also:

[url=“http://beowulf.lcs.mit.edu/18.337/lectslides/scan.pdf”]http://beowulf.lcs.mit.edu/18.337/lectslides/scan.pdf[/url]
http://graphics.cs.ucdavis.edu/publication…_pdf?pub_id=915

Thanks for the suggestion! That would definitively be one way of solving it, although it seems like a bit of overkill in my case to do a full prefix sum etc just to save those very few items larger than zero. Especially since the big array doesn’t really need to be stored in global memory anyway, the values are generated on the fly by another kernel. The solution I would prefer is to somehow be able to record the generated values that are > 0 (plus their index) on the fly as well. With atomic operations I guess it would be easy, but I’m running on a 8800GTS.

/L

I think scan will outperform atomic operations in most cases, as it’s a linear time reduction. See also the marching cubes sample in the SDK.

Even if I would only have to use atomics to synchronize the writing of 20-30 floats or so per megapixel image frame?

/L

Well, with atomics you don’t have any guarantee with regard to ordering, so you need an extra sorting step. But I agree that with such sparse output that might be preferable.

I don’t need the output to be sorted, so that’s not an issue. Anyway, I just got it working using the normal prefix sum method. The compaction step makes up around 15-20% of the total computation time, which I guess is acceptable. If I get hold of a 1.1 card I will try the atomics solution. Thanks for your suggestions!

/L

I see people discuss about reduction, prefix-sum etc… I am curious to know how they can be of any help here.

WHy I ask is: To figure out how many non-zero entries r there in an array, one has to scan atleast the whole array. How can a prefix-sum or whatever, reduce this scan of the entire array?

On my 8800GTX, I would spawn 128 blocks (8 blocks on 16 MPs), where each block would just scan for non-zeroes… Writing the results back is the big problem. BUt one can allocate global memory for each block. So each block can pool in their result…

The only problem I see here is that the output may have to be sorted…

The problem is to store only the selected entries continuously in global memory, without “holes” of non selected values in between.

/L

okay. I would probably do it this way:

  1. I would have a constant amount of blocks with an amount of threads such that all latencies are hidden.

    For example: 128 blocks with 32 threads per block would help on my 8800 GTX.

  2. Now, split the input array among the 128 blocks.

  3. The idea is that each block would scan the entire subset that belongs to it

    and update results back in a result subset.

  4. Have an array of 128 elements dedicated to each block. Each block would

    update the “count” of its result subset so that you know how much of the

    result subset is valid.

  5. Now, all u need to do from the CPU is to collate the results back.

Does that make sense?

Not really. To me it seems like you still have your selected subset scattered all around in different memory locations after step 4. Besides, how would you synchronize the updating of the subsets in step 3, and how would you efficiently “collect the results back” in step 5?

/L