- Say I have one million threads running on the device (or any other large number).
- Of those million, I only need the ones with valid data to be returned to the host (that could be just one, or the whole million).
Is there a way to do this?
If I send an array with the size of the problem, filled with 1’s and 0’s (1 = valid data, 0 = invalid data):
- It would use a lot of unnecessary space.
- Reading this array sequentially on the host would take more time than the actual problem…
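
To make that concrete, here is a rough sketch of the flag-array approach I mean. The names `isValid`, `markValid` and `collectValid` are placeholders I made up, the validity test is a dummy (even indices count as valid), and error checking is left out:

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical validity test; stands in for whatever the real kernel computes.
__device__ bool isValid(int idx) { return idx % 2 == 0; }

// One thread per element: each thread writes 1 if its data is valid, else 0.
__global__ void markValid(unsigned char *flags, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        flags[idx] = isValid(idx) ? 1 : 0;
}

// Copies the whole flag array back and scans it sequentially on the host.
std::vector<int> collectValid(int n)
{
    unsigned char *d_flags = nullptr;
    cudaMalloc(&d_flags, n);                 // one byte per thread

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    markValid<<<blocks, threads>>>(d_flags, n);

    // Full-size transfer, even if only one flag is set.
    std::vector<unsigned char> h_flags(n);
    cudaMemcpy(h_flags.data(), d_flags, n, cudaMemcpyDeviceToHost);
    cudaFree(d_flags);

    // Sequential pass over every flag to collect the valid indices.
    std::vector<int> valid;
    for (int i = 0; i < n; ++i)
        if (h_flags[i])
            valid.push_back(i);
    return valid;
}
```

With one byte per thread, calling `collectValid(1000000)` copies about 1 MB back to the host and then loops over every flag, which is exactly the cost I would like to avoid.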
I’m new to CUDA, so maybe this is a very common problem.
Thanks in advance for any help that can lead me to a solution!
Cheers!