Kernel launches in the same stream

We know that CUDA kernel launches are asynchronous with respect to the host. For example, once a kernel has been launched, the CPU can go on to execute its next instruction without waiting for the kernel to finish.

But I am not sure whether two kernels in the same stream can run at the same time. That is, we launch the first kernel and, before it completes, we launch the second kernel. My guess is that if two kernel calls reside in the same stream, the second kernel cannot start until the preceding kernel in the stream has completed.

Does anyone have any idea?

Thank you!

You can kick off the launches asynchronously without synchronizing on the first, and the ordering semantics within a stream are such that the second kernel will wait for all blocks in the first kernel to complete before any of the second kernel’s blocks are launched.
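Here is a minimal sketch of that behaviour. The kernels `fill` and `add_one`, the helper `run_same_stream_demo`, and the sizes are made up for illustration, and error checking is kept to a minimum. Both launches go into one stream with no host-side synchronization between them, and the second kernel still sees everything the first one wrote.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical kernel: increments every element, so it depends on fill's output.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if every element ends up as 42, i.e. add_one ran after fill.
bool run_same_stream_demo() {
    const int n = 1 << 20;
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return false;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(d_data);
        return false;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Both launches return control to the host right away.
    // There is no synchronization between them on the host side.
    fill<<<blocks, threads, 0, stream>>>(d_data, n, 41);
    add_one<<<blocks, threads, 0, stream>>>(d_data, n);

    // Block the host until all work queued in the stream has finished.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);

    int *h_data = new int[n];
    ok = ok && (cudaMemcpy(h_data, d_data, n * sizeof(int),
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) ok = (h_data[i] == 42);

    delete[] h_data;
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return ok;
}

int main() {
    bool ok = run_same_stream_demo();
    printf(ok ? "second kernel saw the first kernel's results\n"
              : "mismatch or CUDA error\n");
    return ok ? 0 : 1;
}
```

The host only waits at `cudaStreamSynchronize`; the ordering between the two kernels comes from the stream itself.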

Thanks, that's what I thought. To have the two kernels actually run concurrently, I guess the only way is to launch them in different streams?
