Yes. Any given block runs on exactly one multiprocessor (MP). More than one block can be resident on the same MP at the same time, if resources such as registers and shared memory allow.
A few bytes short of 16k, yes.
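If you would rather check both of these numbers on your own card than take my word for it, something like the following should do it. This is only a sketch: `scaleKernel` is a made-up kernel used as the thing to measure, and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` only exists in newer toolkits (CUDA 6.5 and later, as far as I know), so it will not compile against old ones. Note that `sharedMemPerBlock` reports the raw per-block limit; what a kernel can actually use may be slightly less.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so the occupancy query has something
// to inspect. It uses a 1 KB static shared-memory buffer (256 floats).
__global__ void scaleKernel(float *data, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];
        data[i] = 2.0f * tile[threadIdx.x];
    }
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }

    // Number of MPs on the device and the shared-memory limit per block.
    std::printf("multiprocessors:         %d\n", prop.multiProcessorCount);
    std::printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

    // How many blocks of 256 threads of this kernel can be resident on
    // one MP at the same time, given its register and shared-memory use.
    const int blockSize = 256;  // must not exceed the size of tile[]
    int blocksPerMP = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerMP, scaleKernel, blockSize, 0 /* dynamic shared mem */);
    if (err != cudaSuccess) {
        std::printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("resident blocks per MP:  %d\n", blocksPerMP);
    return 0;
}
```

The kernel is never launched here; the program only asks the runtime about limits. Increase the size of `tile` and the blocks-per-MP figure should drop, which is the "if resources allow" part in practice.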
I don’t think that is assured. Warps of 32 threads are implicitly synchronized (so threads 0-31 in your example), but that is about the limit of what you can assume. It might seem logical that the execution happens lock-step in sequences of 8 threads, but I don’t think it is guaranteed to be the case.
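To stay on the safe side, I would write the code so that it does not depend on any lock-step behaviour at all and put an explicit `__syncthreads()` between every step that reads what another thread wrote. Below is a minimal sketch of a shared-memory sum done that way. The kernel name `blockSum`, the block size of 64 and the all-ones input are my own choices for illustration, not anything from your code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // threads per block (two warps), chosen for illustration

// Sums BLOCK integers within one block. Every step is separated by a
// barrier, so correctness does not depend on how warps are executed.
__global__ void blockSum(const int *in, int *out)
{
    __shared__ int buf[BLOCK];
    int t = threadIdx.x;

    buf[t] = in[t];
    __syncthreads();  // all loads are visible before anyone reads buf

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (t < stride)
            buf[t] += buf[t + stride];
        __syncthreads();  // kept outside the if so every thread reaches it
    }

    if (t == 0)
        *out = buf[0];
}

int main()
{
    int h_in[BLOCK];
    int h_out = 0;
    for (int i = 0; i < BLOCK; ++i)
        h_in[i] = 1;

    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    blockSum<<<1, BLOCK>>>(d_in, d_out);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("sum = %d\n", h_out);  // BLOCK ones, so 64 is expected

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The barriers in the last few iterations look redundant if you believe a warp runs in lock-step, and dropping them used to be a common optimization. On newer architectures (Volta and later) even threads within one warp can be scheduled independently, so I would only drop them after replacing them with `__syncwarp()` and measuring that it actually helps.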