I’m using CUDA functions in asynchronous mode to process a large vector. In the middle of the processing chain I can eliminate most of the vector and keep only ~10% of the elements, which are gathered into a smaller vector. The number of elements in the smaller vector is known only at run time.
How can I process the smaller vector without returning to the host?
In part of the processing chain I want to use the cuBLAS library, but it looks like I must know the vector size when calling its functions.