Mapping of Thread Blocks to SMs

I checked the SM assigned to each thread block using the PTX assembly code and found some strange block scheduling results. It seems that the thread blocks are distributed to different SMs in an apparently random order instead of being assigned in sequence. I read in the NVIDIA CUDA Programming Guide that blocks are distributed equally among SMs for automatic scalability. Please confirm how blocks are scheduled among different SMs. For example, if we have 8 blocks (0-7) and 4 SMs (0-3), then block 0 goes to SM 0, block 1 to SM 1, block 2 to SM 2, block 3 to SM 3, block 4 to SM 0, block 5 to SM 1, block 6 to SM 2, and block 7 to SM 3. Is this correct, or can the blocks be assigned in any order?

It can be in any order. The actual block scheduling process is intentionally unspecified. What you are referring to was probably an example of what might happen, not a specification of what must happen. If you think otherwise, please provide a specific link to the NVIDIA CUDA Programming Guide section that you believe specifies block scheduling.
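If you want to observe the mapping empirically rather than infer it from the PTX, you can read the `%smid` special register from each block at runtime. Here is a minimal sketch (the kernel and variable names are my own, and note that the PTX ISA documents `%smid` as intended for profiling/debugging, with no guarantee about the mapping):

```cuda
#include <cstdio>

// Read the ID of the SM the calling thread is currently running on.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void record_smid(unsigned int *block_to_sm) {
    // One thread per block records which SM the block landed on.
    if (threadIdx.x == 0)
        block_to_sm[blockIdx.x] = get_smid();
}

int main() {
    const int num_blocks = 8;
    unsigned int *d_map, h_map[num_blocks];
    cudaMalloc(&d_map, num_blocks * sizeof(unsigned int));
    record_smid<<<num_blocks, 32>>>(d_map);
    cudaMemcpy(h_map, d_map, sizeof(h_map), cudaMemcpyDeviceToHost);
    for (int b = 0; b < num_blocks; ++b)
        printf("block %d ran on SM %u\n", b, h_map[b]);
    cudaFree(d_map);
    return 0;
}
```

Running this repeatedly will typically show the block-to-SM assignment varying from run to run, which is consistent with the scheduling order being unspecified.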

From the standpoint of utilization efficiency, I’m not sure a random distribution order makes any difference compared to a sequential one: as long as blocks are distributed across the available SMs, which particular SM a given block lands on does not affect throughput.