How do SMs and blocks work together? (SM, SP, block, thread and so on)

I know that one streaming multiprocessor (SM) has 8 stream processors (SPs), and that up to 768 threads can be active on one SM.

I have some questions:
Why is running 256 threads per block slower than running 64 threads per block? (256 threads * 3 blocks vs. 64 threads * 8 blocks)
In the SDK example “matrixMul”, when I change the block size from 16 to 8, why is block size 8 more efficient than 16?

Thank you for your reply. :) :)

The 32 threads of a warp are executed by the 8 processors of the SM serially over 4 clock cycles.

The 24 warps that can be active on an SM (768 threads) are not physically executed at the same time; the hardware schedules them based on when their data is available.
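To put numbers on this, here is a rough host-side C++ sketch of the arithmetic (plain C++, not a kernel). The constants for warp size, SPs per SM and threads per SM come from this thread; the cap of 8 resident blocks per SM is an assumption taken from the "64 threads * 8 blocks" figure in the question, and all function names are made up for illustration. It ignores register and shared-memory limits, which can lower the real block count.

```cpp
// Figures from this thread.
constexpr int kWarpSize = 32;          // threads per warp
constexpr int kSpPerSm = 8;            // stream processors per SM
constexpr int kMaxThreadsPerSm = 768;  // active threads per SM
constexpr int kMaxWarpsPerSm = kMaxThreadsPerSm / kWarpSize;  // 24 warps

// Assumption: at most 8 blocks can be resident on one SM.
constexpr int kMaxBlocksPerSm = 8;

// Clock cycles for the 8 SPs to work through the 32 threads of one warp.
constexpr int cycles_per_warp_instruction() {
    return kWarpSize / kSpPerSm;
}

// Warps occupied by one block, rounded up to whole warps.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Blocks resident on one SM, limited by the warp budget and the block cap.
// Registers and shared memory are deliberately not modeled here.
constexpr int resident_blocks(int threads_per_block) {
    return (kMaxWarpsPerSm / warps_per_block(threads_per_block) < kMaxBlocksPerSm)
               ? kMaxWarpsPerSm / warps_per_block(threads_per_block)
               : kMaxBlocksPerSm;
}

// Warps the scheduler can choose from on one SM.
constexpr int active_warps(int threads_per_block) {
    return resident_blocks(threads_per_block) * warps_per_block(threads_per_block);
}
```

Under these assumptions, `resident_blocks(256)` gives 3 blocks (24 active warps, 768 threads) and `resident_blocks(64)` gives 8 blocks (16 active warps, 512 threads), which matches the two configurations in the question. The sketch only counts how many warps are available for scheduling; it does not by itself say which configuration runs faster.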

Hope this helps!