Thread Block Scheduling on SM in Dynamic Parallelism

Can anyone please clarify how thread blocks newly created within a thread block are scheduled on SMs? Are the new blocks scheduled on the same SM, or can they use different SMs?
This is related to CUDA dynamic parallelism.

Use the %smid register to check which thread block is executed on which SM.

Thread block scheduling also depends on how much of the SM's shared memory each thread block uses.
If a thread block uses more than half of the SM's shared memory, a second block cannot fit, which forces one thread block per SM.
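A minimal sketch of that effect (hypothetical kernel, not from the thread; the 48 KB shared-memory capacity is an assumption that matches many GPUs of that era): a block that statically allocates more than half the SM's shared memory leaves no room for a second resident block.

```cuda
// Hypothetical kernel: statically allocates 32 KB of shared memory.
// On an SM with 48 KB of shared memory, two such blocks (64 KB total)
// cannot co-reside, so the scheduler places at most one block per SM.
__global__ void heavySharedKernel(float *out)
{
    __shared__ float buf[8192];  // 8192 floats = 32 KB
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}
```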

You can print out the %smid register value, and it will give you a clear picture of where
each thread block was executed.

Dynamic parallelism child kernels can use any and all SMs on the GPU, just like any other kernel launch.
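To illustrate, here is a minimal dynamic-parallelism sketch (hypothetical kernel names; requires compute capability 3.5+ and compilation with relocatable device code, e.g. `nvcc -arch=sm_35 -rdc=true`). The child grid's blocks may land on any SM, independent of where the parent block runs:

```cuda
#include <cstdio>

// Hypothetical child kernel: each of its blocks reports the SM it ran on.
__global__ void childKernel()
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        printf("child block %d on SM %u\n", blockIdx.x, smid);
}

// Parent kernel: a single thread launches a child grid from the device.
__global__ void parentKernel()
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<4, 32>>>();
}

int main()
{
    parentKernel<<<2, 32>>>();
    cudaDeviceSynchronize();  // wait for parent and child grids to finish
    return 0;
}
```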

Thanks to both of you for the earlier response.

Dear LongY, Can you please give me an example statement to check %smid? How can I check that?

asm("mov.u32 %0, %%smid;" : "=r"(block_s[bid]));

Write the %smid register value to an array in global memory, which
you can print out after copying it back to the host.
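Putting those two pieces together, a minimal complete sketch might look like this (the names `block_s` and `bid` follow the snippet above; the grid and block sizes are arbitrary):

```cuda
#include <cstdio>

// One thread per block records the block's SM ID into global memory.
__global__ void whichSM(unsigned int *block_s)
{
    if (threadIdx.x == 0) {
        unsigned int bid = blockIdx.x;
        asm("mov.u32 %0, %%smid;" : "=r"(block_s[bid]));
    }
}

int main()
{
    const int numBlocks = 8;
    unsigned int *d_block_s, h_block_s[numBlocks];
    cudaMalloc(&d_block_s, numBlocks * sizeof(unsigned int));

    whichSM<<<numBlocks, 32>>>(d_block_s);

    // Copy the per-block SM IDs back to the host and print them.
    cudaMemcpy(h_block_s, d_block_s, numBlocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < numBlocks; ++i)
        printf("block %d ran on SM %u\n", i, h_block_s[i]);

    cudaFree(d_block_s);
    return 0;
}
```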

Thanks for the statement.

I checked with this statement and found some strange block scheduling results. It seems that the thread blocks are distributed to SMs in no particular order instead of being assigned in sequence. I read in the NVIDIA CUDA Programming Guide that blocks are distributed among SMs for automatic scalability. Please confirm how blocks are scheduled among the different SMs. For example, if we have 8 blocks (0-7) and 4 SMs (0-3), does block 0 go to SM 0, block 1 to SM 1, block 2 to SM 2, block 3 to SM 3, block 4 to SM 0, block 5 to SM 1, block 6 to SM 2, and block 7 to SM 3? Is this correct, or can it be in any order?

It can be in any order.