Funny development problem

  • Say I have one million threads running on the device (or any other large number).
  • Of those million, I only need the ones with valid data to be returned to the host (maybe just one, maybe the whole million).

Is there a way to do this?
If I send back an array the size of the problem, filled with 1’s and 0’s (1 = valid data, 0 = invalid data):

  1. It would use a lot of unnecessary space.
  2. Reading this array sequentially on the host would take more time than solving the actual problem…

I’m new to CUDA, so maybe this is a very common problem.

Thanks in advance for any help that can lead me to a solution!

Cheers!

You are looking for stream compaction.
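
For anyone landing here later, here is a rough host-side sketch of what stream compaction does, assuming the usual "flags, exclusive scan, scatter" formulation. The function name `compact_valid` and the `int` payload are made up for illustration. On the GPU the scan would run in parallel and the scatter would be one thread per element; as far as I know, Thrust's `thrust::copy_if` performs this kind of compaction on the device, so you would normally not write it by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from any library): keeps data[i] wherever
// flags[i] is nonzero, preserving the original order.
std::vector<int> compact_valid(const std::vector<int>& data,
                               const std::vector<int>& flags) {
    assert(data.size() == flags.size());
    const std::size_t n = data.size();

    // Step 1: exclusive prefix sum over the flags.
    // offsets[i] = number of valid elements before position i,
    // which is exactly the output slot for element i if it is valid.
    // (Sequential here; on the device this would be a parallel scan.)
    std::vector<std::size_t> offsets(n);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offsets[i] = count;
        if (flags[i] != 0) {
            ++count;
        }
    }

    // Step 2: scatter. Each valid element writes itself to its slot.
    // The iterations are independent, so on the device each one
    // could be handled by its own thread.
    std::vector<int> out(count);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i] != 0) {
            out[offsets[i]] = data[i];
        }
    }
    return out;
}
```

The point for the original question: only `count` elements need to be copied back to the host, rather than the full flag array, so the host never has to walk the million entries sequentially.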

I found this thanks to your reply:

Thank you very much tera!