Kernel Execution Sequence

What is the order of kernel execution?
The literature talks about asynchronous execution, but asynchronous in what respect?
It appears that it is the threads within a kernel that execute asynchronously (in an unknown sequence).
The kernels themselves don’t appear to execute asynchronously, or do they?

I have 3 kernels, none contain any blocking/synchronous operations.
If I call kernelA, then kernelB, then a Thrust call, and then kernelC, how are they executed on the GPU?

Will kernelB ONLY start after kernelA has finished? (i.e. one after the other.)
Can kernelB start before kernelA has finished? (i.e. can 2 kernels run at the same time?)
If so, kernelC could complete execution before kernelA is complete.

Does a Thrust call appear as a single kernel?
Could a single Thrust call include more than one kernel launch?
Do Thrust calls block between other Thrust or kernel calls?

Even though the kernels are launched in a certain order, can they execute in a different order? i.e. could kernelB start executing on the GPU before kernelA?


Kernels in the same CUDA stream (and if you don’t specify a stream, it is the default stream, stream 0) will always run in the order you submit them, so kernelB will not start until kernelA has finished. The asynchronous aspect of CUDA is that once you launch a kernel (or several), execution continues almost immediately on the CPU while the kernels run on the GPU in the background, until the CPU hits some kind of synchronization point, such as cudaDeviceSynchronize() or cudaMemcpy().
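
As a rough illustration of the point above, here is a minimal sketch with two kernels launched into the default stream. The kernel bodies, the helper runSameStream, and the array size are my own assumptions for the example, not anything from your code. kernelB only gives the right answer if kernelA has already finished, and the cudaMemcpy() is the point where the CPU waits for both.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: writes each element's index into the array.
__global__ void kernelA(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical kernel: doubles what kernelA wrote, so the result is only
// correct if kernelA has already finished.
__global__ void kernelB(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Returns the number of wrong elements (0 means the order held),
// or -1 on a CUDA error.
int runSameStream(int n) {
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return -1;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Both launches go to the default stream and return to the CPU right away.
    kernelA<<<blocks, threads>>>(d, n);
    kernelB<<<blocks, threads>>>(d, n);

    // cudaMemcpy is a synchronization point: it waits for both kernels first.
    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2 * i) ++wrong;
    return wrong;
}

int main() {
    int wrong = runSameStream(1 << 20);
    printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

If the two kernels could overlap or swap order within one stream, some elements would not equal 2 * i; in the same stream that should not happen.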

If you create multiple CUDA streams, then kernels in different streams can run in an arbitrary order relative to each other, or possibly even simultaneously.
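
For the multi-stream case, something like the following sketch shows the launch syntax. The kernel fillValue, the helper runTwoStreams, and the buffer sizes are made up for illustration. The two kernels write to separate buffers because nothing here orders them relative to each other; whether they actually overlap depends on the hardware and on how much of the GPU each kernel occupies.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: fills an array with a constant value.
__global__ void fillValue(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Returns the number of wrong elements across both buffers,
// or -1 on a CUDA error.
int runTwoStreams(int n) {
    cudaStream_t s1, s2;
    int *a = nullptr, *b = nullptr;
    if (cudaStreamCreate(&s1) != cudaSuccess) return -1;
    if (cudaStreamCreate(&s2) != cudaSuccess) return -1;
    if (cudaMalloc(&a, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&b, n * sizeof(int)) != cudaSuccess) return -1;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // The fourth launch parameter selects the stream. These two launches
    // have no guaranteed order relative to each other and may overlap.
    fillValue<<<blocks, threads, 0, s1>>>(a, n, 1);
    fillValue<<<blocks, threads, 0, s2>>>(b, n, 2);

    // Wait for each stream before reading its results on the CPU.
    bool ok = cudaStreamSynchronize(s1) == cudaSuccess;
    ok = (cudaStreamSynchronize(s2) == cudaSuccess) && ok;

    std::vector<int> ha(n), hb(n);
    ok = (cudaMemcpy(ha.data(), a, n * sizeof(int),
                     cudaMemcpyDeviceToHost) == cudaSuccess) && ok;
    ok = (cudaMemcpy(hb.data(), b, n * sizeof(int),
                     cudaMemcpyDeviceToHost) == cudaSuccess) && ok;

    cudaFree(a);
    cudaFree(b);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    if (!ok) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (ha[i] != 1) ++wrong;
        if (hb[i] != 2) ++wrong;
    }
    return wrong;
}

int main() {
    int wrong = runTwoStreams(1 << 20);
    printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

If a kernel in one stream needs the output of a kernel in another stream, you have to add that ordering yourself, for example by synchronizing on the first stream before launching the second kernel.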

I don’t use Thrust, so I can’t address those questions.