I have a array of 100,000 float elements. I want to be able to construct a new array (containing either the elements or just the indices) that fulfill a certain condition (let us say the positive ones).
Is anyone aware of possible approaches to solve this problem on CUDA knowing that the number of size of the results array is not known at the beginning at the procedure and that it has to be fed in parallel with new elements.
I am pretty sure that this problem is common and that it has been studied before (database selection would be an example).
Maybe someone else will have a solution that I’m not seeing, but that doesn’t look like the type of thing a G80 would be good for, unless the test that is performed at each element is really complicated (at which point you could generate a mask array and send that back).
I guess you could break it up into a lot of little chunks, then construct pieces of the output in shared memory + dump those to their own arrays (which then get concatenated during the copy back to system memory). This may or may not be faster than just doing it on the CPU, though…
Thank you YetAnotherNoob for your reply. I agree with you that this multiselection problem is not inherently parallel. The reason why I was thinking of doing this on the GPU is that I will need to do this operation ~200 times as an intermediary step to other calculations (least squares regressions) on the graphics card. Copying the data back to system memory would be too expensive.