Async start kernel in different stream after another completes?

I am currently launching kernel A asynchronously, then several grids of kernel B (each with a different number of blocks), then doing a large amount of CPU work in parallel, and finally transferring the results back.

The kernel B grids are independent of one another, but all of them depend on kernel A. However, I am not currently using streams. I know I could wait until kernel A completes and then launch each kernel B grid in its own stream.

However, if I understand this correctly, that requires me to wait on the host until kernel A completes before launching the B kernels, because otherwise they would erroneously begin executing in parallel with kernel A. I do not want to interrupt my CPU work to do this.

How do I asynchronously queue kernel launches in different streams to begin after the completion of a single kernel?

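One possible method would be to use cudaStreamWaitEvent:

Launch kernelA
Issue cudaEventRecord(eventA, streamA)   (streamA being whichever stream kernelA was launched in)

then, in each of your dependent streams,

cudaStreamWaitEvent(streamB, eventA, 0)
kernelB<<<…>>>(my_chunk)

Both calls return immediately on the host, so your CPU work is not interrupted; the wait happens on the device side of each stream.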

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1gc301fd024e6fd4a17074d229d4504077
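
For reference, here is a minimal sketch of how that event-based pattern might look as a complete program. The kernel bodies, the buffer layout, the equal chunk sizes, and the name runDependentLaunches are all assumptions made up for illustration (the original kernels were not posted, and the real B grids have differing block counts); only the cudaEventRecord / cudaStreamWaitEvent ordering is the point.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel A: produces the data every B grid reads.
__global__ void kernelA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1.0f;
}

// Hypothetical stand-in for kernel B: each grid updates its own chunk.
__global__ void kernelB(float *chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1.0f;
}

// Returns true if every element ends up as 2.0f, i.e. each B grid saw A's output.
bool runDependentLaunches(int numStreams, int chunkSize) {
    const int n = numStreams * chunkSize;
    const int threads = 256;

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return false;

    cudaStream_t streamA;
    cudaStreamCreate(&streamA);
    std::vector<cudaStream_t> streamsB(numStreams);
    for (auto &s : streamsB) cudaStreamCreate(&s);

    // Timing is not needed, so the event is created with timing disabled.
    cudaEvent_t eventA;
    cudaEventCreateWithFlags(&eventA, cudaEventDisableTiming);

    // Queue kernel A and mark its completion point in streamA.
    kernelA<<<(n + threads - 1) / threads, threads, 0, streamA>>>(d_data, n);
    cudaEventRecord(eventA, streamA);

    // Queue every B grid right away; each stream waits on eventA on the device.
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamWaitEvent(streamsB[s], eventA, 0);
        kernelB<<<(chunkSize + threads - 1) / threads, threads, 0, streamsB[s]>>>(
            d_data + s * chunkSize, chunkSize);
    }

    // ... the CPU work would run here, overlapping with the GPU work ...

    // Wait for all streams, then copy the results back.
    cudaDeviceSynchronize();
    std::vector<float> h_data(n);
    cudaMemcpy(h_data.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) ok = (h_data[i] == 2.0f);

    cudaEventDestroy(eventA);
    for (auto &s : streamsB) cudaStreamDestroy(s);
    cudaStreamDestroy(streamA);
    cudaFree(d_data);
    return ok;
}

int main() {
    bool ok = runDependentLaunches(4, 1 << 16);
    printf("%s\n", ok ? "all B grids ran after A" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

All of the launches and the event calls are queued from the host without blocking, so the only host-side wait in this sketch is the cudaDeviceSynchronize before the copy back.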

That looks like what I need! Thanks txbob!