Dynamic Parallelism Execution Order

Hello guys! I have a question about dynamic parallelism.
If I have a parent kernel that launches 4 child kernels in sequence inside a for loop with, say, 100 iterations, will the child kernels be executed in sequence, as in the following code?

__global__ void parent(){
for(int i = 0; i < 100; i++){


In the code above, will child B only start executing when child A finishes, child C only when child B finishes, and child D only when child C finishes?
I ran some tests and the children executed in sequence: each child only started once the previous one had finished. Is this the default behaviour? Even though my tests suggest that it is, I still have some doubts and would like to know what you guys have to say about this. Thank you.

The child kernels of a kernel may or may not execute in sequence, depending on the stream they are issued in.

If no stream is specified or set up, the child kernels execute in a ‘default stream’.

The programming guide is more precise on this: the stream behaviour of dynamic parallelism is documented there.
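
To make the default-stream case concrete, here is a minimal sketch of the situation in the question. The code in the original post is cut off, so everything here is an assumption for illustration: the single `child` kernel standing in for children A to D, the `log` and `count` buffers, and the `<<<1, 1>>>` launch configurations. Each child records its id in launch-completion order, and the host then checks whether the recorded order is A, B, C, D repeated. It should build with something like `nvcc -rdc=true file.cu -lcudadevrt` on a device of compute capability 3.5 or newer.

```cuda
#include <cstdio>

#define STEPS 100

// Hypothetical child kernel: appends its id to a log so the host can
// inspect the order in which the children actually ran.
__global__ void child(int id, int *log, int *count) {
    int slot = atomicAdd(count, 1);
    log[slot] = id;
}

// A single parent thread launches four children per step without naming
// a stream, so every launch goes into that thread's default (NULL) stream.
__global__ void parent(int *log, int *count) {
    for (int i = 0; i < STEPS; i++) {
        child<<<1, 1>>>(0, log, count);  // stands in for child A
        child<<<1, 1>>>(1, log, count);  // stands in for child B
        child<<<1, 1>>>(2, log, count);  // stands in for child C
        child<<<1, 1>>>(3, log, count);  // stands in for child D
    }
}

int main() {
    const int n = 4 * STEPS;
    int *log, *count;
    cudaMalloc(&log, n * sizeof(int));
    cudaMalloc(&count, sizeof(int));
    cudaMemset(log, 0xff, n * sizeof(int));  // fill with -1 (unwritten)
    cudaMemset(count, 0, sizeof(int));

    parent<<<1, 1>>>(log, count);
    cudaDeviceSynchronize();  // host-side wait; also waits for all children

    static int host[4 * STEPS];
    cudaMemcpy(host, log, n * sizeof(int), cudaMemcpyDeviceToHost);

    // In-order execution means slot k holds id k % 4.
    bool inOrder = true;
    for (int k = 0; k < n; k++) {
        if (host[k] != k % 4) { inOrder = false; break; }
    }
    printf("children ran in launch order: %s\n", inOrder ? "yes" : "no");

    cudaFree(log);
    cudaFree(count);
    return 0;
}
```

Note that this only covers launches issued by one parent thread. Launches made by threads in different blocks of the parent are not ordered with respect to each other.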

If they are executed on the default stream, will they be executed in order?



“CUDA Streams and Events allow control over dependencies between grid launches: grids launched into the same stream execute in-order, and events may be used to create dependencies between streams. Streams and events created on the device serve this exact same purpose.”
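
As a rough illustration of that passage, the sketch below creates two streams on the device. The kernel names (`addOne`, `timesTwo`), buffers, and launch sizes are made up for the example; the stream calls are the device-runtime `cudaStreamCreateWithFlags` (device-side streams have to be created with `cudaStreamNonBlocking`) and `cudaStreamDestroy`. Launches into the same stream run in order, while the launch into the second stream is free to overlap with them. Build with relocatable device code, for example `nvcc -rdc=true file.cu -lcudadevrt`.

```cuda
#include <cstdio>

// Hypothetical child kernels used only to make ordering observable.
__global__ void addOne(int *v)   { v[threadIdx.x] += 1; }
__global__ void timesTwo(int *v) { v[threadIdx.x] *= 2; }

__global__ void parent(int *a, int *b) {
    // Streams created on the device must use the non-blocking flag.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    addOne<<<1, 32, 0, s1>>>(a);    // first in stream s1
    timesTwo<<<1, 32, 0, s1>>>(a);  // same stream: runs after addOne on a
    addOne<<<1, 32, 0, s2>>>(b);    // other stream: may overlap with the above

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}

int main() {
    int *a, *b;
    cudaMalloc(&a, 32 * sizeof(int));
    cudaMalloc(&b, 32 * sizeof(int));
    cudaMemset(a, 0, 32 * sizeof(int));
    cudaMemset(b, 0, 32 * sizeof(int));

    parent<<<1, 1>>>(a, b);
    cudaDeviceSynchronize();  // host-side wait; also waits for all children

    int ha = 0, hb = 0;
    cudaMemcpy(&ha, a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&hb, b, sizeof(int), cudaMemcpyDeviceToHost);
    printf("a[0] = %d, b[0] = %d\n", ha, hb);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If the two launches in `s1` run in order, each element of `a` ends up as (0 + 1) * 2 = 2; had they run in the opposite order it would be 0 * 2 + 1 = 1.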

Thank you! This was really useful!