How do the blocks of different kernels execute on a Fermi card?

We know that a single SM can have more than one block resident at a time, up to a maximum of 8.
Let's say the GPU has 2 SMs (SM1 and SM2), each with 100 registers.
And say we have 2 kernels, K1 and K2. K1 uses 40 registers per block and K2 uses 60 registers per block.
Let's also assume that K1 and K2 each have 2 blocks to execute, so there are 4 blocks in all.
Now I want to launch K1 and K2 in parallel on a Fermi GPU.

Given this scenario, the best speed comes when 1 block of K1 and 1 block of K2 are launched on the same SM, say SM1: their total register usage is exactly 100, so they fit together. The remaining two blocks then go to SM2.
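The register arithmetic behind this can be checked with a small host-side sketch. It is plain C++ rather than CUDA, it models only the 100-register budget from the example (shared memory, thread limits and the 8-block limit are ignored), and the names `kRegsPerSm`, `regsUsed`, `fitsOnSm` and `printMixes` are made up for illustration.

```cpp
#include <iostream>

// Numbers taken from the example above; all names here are hypothetical.
constexpr int kRegsPerSm = 100;  // registers available on one SM
constexpr int kRegsK1 = 40;      // registers used by one block of K1
constexpr int kRegsK2 = 60;      // registers used by one block of K2

// Registers consumed by a mix of K1 and K2 blocks resident on one SM.
constexpr int regsUsed(int k1Blocks, int k2Blocks) {
    return k1Blocks * kRegsK1 + k2Blocks * kRegsK2;
}

// True if the mix fits in the register budget of a single SM.
constexpr bool fitsOnSm(int k1Blocks, int k2Blocks) {
    return regsUsed(k1Blocks, k2Blocks) <= kRegsPerSm;
}

// Print every mix of up to two blocks per kernel and whether it fits.
void printMixes() {
    for (int k1 = 0; k1 <= 2; ++k1) {
        for (int k2 = 0; k2 <= 2; ++k2) {
            std::cout << k1 << " x K1 + " << k2 << " x K2 -> "
                      << regsUsed(k1, k2) << " regs, "
                      << (fitsOnSm(k1, k2) ? "fits" : "does not fit") << '\n';
        }
    }
}
```

Calling `printMixes()` from `main` lists the combinations. Under this simplified model, the 1 + 1 mix is the only one that uses all 100 registers; two K1 blocks leave 20 registers idle, and two K2 blocks do not fit on one SM at all.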

But if these blocks are distributed across the SMs in any other combination, performance will degrade. So how can we ensure this behavior to get better performance?

You can’t. The CUDA API gives the developer no control over the scheduling policies used by the device.

Yes, the CUDA API gives the developer no control over the scheduling policies used by the device.
But at least knowing how it works would be a great help.

With reference to the example above:
Suppose that in a parallel launch the blocks of the different kernels (K1 and K2) are assigned to SMs in some, let's say, random way.
Then say it just happens that both blocks of K1 are scheduled on SM1, so 80 registers are in use on SM1.
Then comes the first block of K2, which needs 60 registers. The scheduler finds only 20 registers free on SM1, so it sends the block to SM2.
Now the first block of K2 is on SM2, and along comes the second block of K2, which also needs 60 registers.
But SM1 has 20 registers free and SM2 has 40.
So this second block has to wait for either SM1 or SM2 to finish the blocks it already holds, as there is no other SM to go to.
That leads to a performance drop.
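The scenario above can be replayed with a small host-side simulation. This is only a sketch in plain C++: the first-fit rule in the hypothetical `placeFirstFit` function is an assumption made for illustration, not the real Fermi block scheduler (whose policy is not exposed), and only registers are modelled.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical first-fit placement; NOT the actual hardware policy.
// freeRegs[i] is the number of free registers on SM i (taken by value so
// the caller's copy is not modified). blockRegs lists the register needs
// of the blocks in arrival order.
// Returns, per block, the index of the SM it lands on, or -1 if no SM has
// enough free registers and the block has to wait.
std::vector<int> placeFirstFit(std::vector<int> freeRegs,
                               const std::vector<int>& blockRegs) {
    std::vector<int> placement;
    for (int need : blockRegs) {
        int chosen = -1;
        for (std::size_t sm = 0; sm < freeRegs.size(); ++sm) {
            if (freeRegs[sm] >= need) {
                freeRegs[sm] -= need;  // reserve the registers on this SM
                chosen = static_cast<int>(sm);
                break;
            }
        }
        placement.push_back(chosen);
    }
    return placement;
}

// Print one line per block: which SM it got, or that it must wait.
void printPlacement(const std::vector<int>& placement) {
    for (std::size_t b = 0; b < placement.size(); ++b) {
        std::cout << "block " << b << ": ";
        if (placement[b] < 0) {
            std::cout << "waits\n";
        } else {
            std::cout << "SM" << (placement[b] + 1) << '\n';
        }
    }
}
```

With two SMs of 100 registers each, the arrival order K1, K1, K2, K2 (`{40, 40, 60, 60}`) puts both K1 blocks on SM1 and the first K2 block on SM2, and leaves the second K2 block waiting, exactly as described. The interleaved order K1, K2, K1, K2 (`{40, 60, 40, 60}`) fills both SMs with no block left waiting, which shows how much the outcome in this model depends on the order in which blocks arrive.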

Sorry if I am wrong anywhere or making this a bit theoretical. I was just curious to know how this works.

Also, is the scheduling policy smart enough to put one block of K1 and one block of K2 on a single SM to fully utilize its resources?