I ran a test.
My assumption: if the number of blocks launched is less than the number of SMs, each SM only does one block's work.
So I tested this on an M2050.
I ran a kernel with 1, 2, 4, 8, and 16 blocks.
The per-block efficiency came out as roughly 35, 32, 24, 19, and 12, respectively.
Efficiency is calculated as (speed / number of blocks).
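
To make the numbers easier to read, here is a small sketch that just rearranges them: it multiplies each efficiency figure by its block count to recover the total speed, then compares that against the single-block run. It is plain arithmetic on the figures above, not a benchmark; the variable names are mine and the speed unit is whatever the original measurement used.

```python
# Block counts and per-block efficiency as reported in the post.
blocks = [1, 2, 4, 8, 16]
efficiency = [35, 32, 24, 19, 12]  # efficiency = speed / number of blocks

# Recover total speed from the definition: speed = efficiency * blocks.
speed = [e * b for e, b in zip(efficiency, blocks)]

# Speedup of each run relative to the single-block run.
speedup = [s / speed[0] for s in speed]

for b, e, s, x in zip(blocks, efficiency, speed, speedup):
    print(f"{b:>2} blocks: efficiency {e:>2}, total speed {s:>3}, speedup {x:.2f}x")
```

So total speed still goes up with every step (35, 64, 96, 152, 192), but 16 blocks give only about 5.5x the speed of 1 block rather than 16x.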
Is there any reason for such a result?