I currently launch kernel A asynchronously, followed by several grids of kernel B (each with a different number of blocks). While those run, I do a large amount of CPU work in parallel, and then I transfer the results back.
Each grid of kernel B is independent of the others, but all of them depend on kernel A. I am not currently using streams. I know I can wait until kernel A completes and then explicitly give each kernel B grid its own stream.
However, if I understand this correctly, the host has to block until kernel A completes before it launches the B kernels; otherwise they could erroneously begin executing in parallel with kernel A. I do not want to interrupt my CPU work to do this.
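To make the problem concrete, here is a minimal sketch of the blocking approach I know how to write. The kernels `kernelA` and `kernelB`, the function `runBlockingVersion`, the block size, and the choice to put kernel A in its own stream are all hypothetical stand-ins for illustration, not my real code, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for kernel A: fills the shared input with 1.0f.
__global__ void kernelA(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1.0f;
}

// Hypothetical stand-in for kernel B: reads A's output and scales it.
__global__ void kernelB(const float* in, float* out, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * scale;
}

// Launches A, blocks the host until A is done, then launches one B grid per
// stream. Returns the sum of everything the B grids wrote, for checking.
float runBlockingVersion(int n, const std::vector<int>& gridBlocks) {
    assert(n > 0);
    const int threads = 256;
    const int numGrids = static_cast<int>(gridBlocks.size());

    float* dA = nullptr;
    cudaMalloc(&dA, n * sizeof(float));

    // A gets its own stream (an assumption of this sketch), so nothing
    // implicitly orders it before work in the B streams.
    cudaStream_t streamA;
    cudaStreamCreate(&streamA);
    kernelA<<<(n + threads - 1) / threads, threads, 0, streamA>>>(dA, n);

    // This is the step I want to avoid: the host stalls here, so the CPU
    // work cannot start until kernel A has finished.
    cudaStreamSynchronize(streamA);

    std::vector<cudaStream_t> streams(numGrids);
    std::vector<float*> dOut(numGrids, nullptr);
    for (int g = 0; g < numGrids; ++g) {
        cudaStreamCreate(&streams[g]);
        cudaMalloc(&dOut[g], n * sizeof(float));
        // Each B grid uses a different number of blocks and its own stream.
        kernelB<<<gridBlocks[g], threads, 0, streams[g]>>>(
            dA, dOut[g], n, static_cast<float>(g + 1));
    }

    // ... the large amount of CPU work would go here ...

    cudaDeviceSynchronize();  // wait for all B grids before copying back

    float total = 0.0f;
    for (int g = 0; g < numGrids; ++g) {
        // Only the elements covered by this grid's blocks were written.
        int count = std::min(n, gridBlocks[g] * threads);
        std::vector<float> host(count);
        cudaMemcpy(host.data(), dOut[g], count * sizeof(float),
                   cudaMemcpyDeviceToHost);
        for (float v : host) total += v;
        cudaFree(dOut[g]);
        cudaStreamDestroy(streams[g]);
    }
    cudaFree(dA);
    cudaStreamDestroy(streamA);
    return total;
}
```

The `cudaStreamSynchronize(streamA)` call is the host-side wait I would like to replace with something that is queued on the device instead.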
How do I asynchronously queue kernel launches in different streams to begin after the completion of a single kernel?