We know that a single SM can have more than one block resident at a time, up to a maximum of 8.
Let's say the GPU has 2 SMs (SM1 and SM2), each with 100 registers.
Say we also have 2 kernels, K1 and K2. K1 uses 40 registers per block and K2 uses 60 registers per block.
Let's also assume that K1 and K2 each have 2 blocks to execute, so there are 4 blocks in total.
Now I want to launch K1 and K2 in parallel on a Fermi GPU.
In this scenario, the best speed comes when 1 block of K1 and 1 block of K2 run together on SM1: their combined register usage is 40 + 60 = 100, which exactly fits the SM's 100 registers. The remaining two blocks (one from each kernel) then go on SM2 in the same way.
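To make the register arithmetic concrete, here is a minimal host-side C++ sketch of the scenario. It is only a model of the numbers in this post (100 registers per SM, 40 and 60 registers per block), not a description of how the real block scheduler works; all names in it (regsNeeded, fitsOnSM, allResident) are hypothetical helpers made up for illustration.

```cpp
// Hypothetical numbers taken from the scenario in this post; real hardware differs.
constexpr int kRegsPerSM = 100;  // registers available on each SM
constexpr int kRegsK1 = 40;      // registers one K1 block needs
constexpr int kRegsK2 = 60;      // registers one K2 block needs

// Registers needed when one SM holds `k1` blocks of K1 and `k2` blocks of K2.
constexpr int regsNeeded(int k1, int k2) {
    return k1 * kRegsK1 + k2 * kRegsK2;
}

// True when that mix of blocks can be resident on one SM at the same time.
constexpr bool fitsOnSM(int k1, int k2) {
    return regsNeeded(k1, k2) <= kRegsPerSM;
}

// True when all 4 blocks (2 of K1, 2 of K2) can be resident at once,
// given that SM1 holds k1OnSM1 K1 blocks and k2OnSM1 K2 blocks
// and SM2 holds whatever is left.
constexpr bool allResident(int k1OnSM1, int k2OnSM1) {
    return fitsOnSM(k1OnSM1, k2OnSM1) &&
           fitsOnSM(2 - k1OnSM1, 2 - k2OnSM1);
}
```

Under these assumptions, allResident(1, 1) is true (each SM needs exactly 100 registers), while allResident(2, 0) is false, because the two K2 blocks left for SM2 would need 120 registers and one of them would have to wait.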
But if the blocks are placed on the SMs in any other combination, performance will degrade. For example, two K2 blocks together need 120 registers, so they cannot both be resident on one SM. How can we ensure the better placement?