Sparse accumulation

I have a float image. For each pixel, I want to check whether its value is greater than a set threshold; if it is, I want to add it to a list (where each entry specifies the x,y location and the original value). Only around 3% of the pixels meet this threshold. What would be the optimal way to create this list in CUDA? Is there no way to have some kind of list position counter without atomics? I implemented this point extraction on the CPU, but it's really slow.

I am not really sure why you want to implement your code in CUDA. How is your code structured? What do you want to parallelize? What do you want to do with the extracted data? How large is your image?

Perhaps you could use the cudppCompact function in CUDPP?

http://www.gpgpu.org/developer/cudpp/
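
For anyone unfamiliar with what a compact operation does, here is a rough sequential sketch in C++ of the usual flag, exclusive scan, scatter pattern. It is not CUDPP's implementation and does not call CUDPP. The names Point and compact_above_threshold are made up for illustration, and it assumes a row-major image with exactly width * height entries. Each pixel gets its output slot from the prefix sum of the flags, so no shared position counter (and no atomic) is needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record type: pixel location plus its original value.
struct Point {
    int x;
    int y;
    float value;
};

// Sequential reference for flag -> exclusive scan -> scatter compaction.
// Assumes image is row-major and holds exactly width * height values.
std::vector<Point> compact_above_threshold(const std::vector<float>& image,
                                           int width, int height,
                                           float threshold) {
    const std::size_t n = static_cast<std::size_t>(width) * height;

    // Phase 1: flag each pixel whose value exceeds the threshold.
    std::vector<unsigned int> flags(n);
    for (std::size_t i = 0; i < n; ++i) {
        flags[i] = image[i] > threshold ? 1u : 0u;
    }

    // Phase 2: exclusive prefix sum of the flags.
    // offsets[i] is the output slot of pixel i, if pixel i is flagged.
    std::vector<unsigned int> offsets(n);
    unsigned int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offsets[i] = running;
        running += flags[i];
    }

    // Phase 3: scatter flagged pixels to their slots.
    // 'running' now holds the total number of flagged pixels.
    std::vector<Point> out(running);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            out[offsets[i]] = Point{static_cast<int>(i % width),
                                    static_cast<int>(i / width),
                                    image[i]};
        }
    }
    return out;
}
```

On the GPU, phases 1 and 3 are independent per pixel and phase 2 is a parallel scan, so all three can run on the device. The output also keeps the pixels in their original scan order, which an atomic counter would not guarantee.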

The reason I want to do this in CUDA is that the image I am scanning was generated in CUDA on the GPU, and I want to avoid copying it back to the CPU to do the scan and then sending the resulting point list back to the GPU, only for it to be used as the input of another CUDA kernel.