C2050 / Fermi limits

I have a C2050 on Windows 7 64-bit. When I look at maxThreadsPerBlock from cudaGetDeviceProperties I get 1024, which makes sense: 32 warps of 32 threads each. But maxThreadsDim[0], [1], [2] gives 1024, 1024, 64 for the x, y, and z dimensions of a thread block, which would far exceed 32 * 32. Are there really thread blocks with 1024 x 1024 x 64 threads? And if such a block does not form 1024*1024*64/32 warps, what happens to the threads beyond warp 32? Do they get scheduled to run on the MP in SMT fashion if they are not in a warp?

On another, maybe related point: do all 32 threads with fixed threadIdx.y and threadIdx.z (free in the value of threadIdx.x) live in the same warp? I am writing some programs to test this kind of thing, but it would be good to know from the start. Can I or can't I define giant thread blocks with dim3 B(1024,1024,64)? Will all of this block be seen as warps scheduled on the MP to which the block is assigned? Since each MP has 32 CUDA cores (SPs), maybe all maxThreadsPerBlock means is that this is the maximum that can run at once on the MP, but all warps eventually get their turn. Is this right?

Another topic:
I read the articles from pgroup.com to try to glean what's going on, but sometimes I can't be sure whether what is said applies to the C2050 (Fermi). At one point an article states that no more than 8 thread blocks can run on a given MP at one time. Is this true for the C2050 as well? Does this mean that only 8 thread blocks take part in warp scheduling on the MP?

Another question. I read that only 48 warps can be simultaneously active at once across all the MPs (and I have 14 of them on the C2050). The MPs seem to be independent agents, so why would there be a limit across all of them?

The block dimension limits are constraints in addition to the threads-per-block limit, not a separate allowance. So the maximum block sizes are shapes like (1024,1,1), (512,2,1), (256,4,1), or (256,2,2): a valid block must satisfy blockDim.x * blockDim.y * blockDim.z <= 1024, with each dimension also within maxThreadsDim (so blockDim.z <= 64). A dim3 B(1024,1024,64) block is therefore not launchable.