Overhead of block scheduling?

GPUs implement zero-overhead scheduling to interleave warps, but I don’t understand the following situation:
The G80 supports a maximum of 768 active threads per SM. If I set the block size to 32x16 (512 threads), only one block fits on an SM, so only 512 threads can execute on that SM at a time. My questions are:

  1. Is there any overhead when the SM switches to the next block?
  2. In the GPU implementation, blocks are divided into warps of 32 threads. Is the time spent with a 32x16 block dimension greater than 2x the time spent with a 16x16 block dimension?
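
To make the arithmetic behind my question concrete, here is a minimal sketch of how I am counting resident blocks per SM. It assumes the G80 limits of 768 threads and 8 blocks per SM with 32-thread warps, and it ignores register and shared-memory limits, which could lower the numbers further. The function name resident_blocks is my own, not part of any CUDA API.

```python
# Assumed G80 (compute capability 1.x) per-SM limits. Register and
# shared-memory limits are deliberately ignored in this sketch.
WARP_SIZE = 32
MAX_THREADS_PER_SM = 768
MAX_BLOCKS_PER_SM = 8


def resident_blocks(block_x, block_y):
    """Return (blocks, threads, warps) resident on one SM for a block shape."""
    threads_per_block = block_x * block_y
    # Ceiling division: a partially filled warp still occupies a warp slot.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    # Whole blocks only: a block is never split across SMs.
    blocks = min(MAX_THREADS_PER_SM // threads_per_block, MAX_BLOCKS_PER_SM)
    return blocks, blocks * threads_per_block, blocks * warps_per_block


for shape in [(32, 16), (16, 16)]:
    blocks, threads, warps = resident_blocks(*shape)
    print(f"{shape[0]}x{shape[1]}: {blocks} block(s), "
          f"{threads} threads, {warps} warps per SM")
```

Under these assumptions a 32x16 block leaves 256 of the 768 thread slots unused (1 block, 16 warps), while 16x16 blocks fill the SM (3 blocks, 24 warps). This only counts occupancy; it says nothing about the actual run time, which is what I am asking about.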