How does CUDA map to the hardware? (General question)

Can anybody explain in detail how the CUDA programming model maps onto the G80 hardware?

I mean how grids, blocks, warps, and threads are processed, and by which hardware components.

I know that a block is mapped to a multiprocessor, but I don't understand how a warp can physically run in parallel if there are 32 threads in a warp and just 8 streaming processors in one multiprocessor.
In my opinion, only 8 threads of a warp can physically run in parallel at a time, but I think I'm wrong.
So please, could someone help me?
Thanks!

The instruction decoder runs at 1/4 the clock rate of the stream processors, so each stream processor executes the same instruction 4 times in a row, each time on different data. With 8 stream processors, that covers the whole warp: 4 x 8 = 32 threads.
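
To make the 4 x 8 = 32 arithmetic concrete, here is a rough host-side sketch in plain C++ (not CUDA, and not a description of the real scheduler). It only models the idea from the reply: one decoded instruction is replayed over 4 fast clock cycles on 8 stream processors. The names `Slot`, `schedule_warp` and `lanes_in_pass` are made up for illustration, and the mapping of a thread's lane to a particular stream processor (`lane % 8`) is an assumption, since the thread does not say how lanes are actually assigned.

```cpp
#include <array>
#include <cassert>

// Toy model only. The two sizes come from the thread (32 threads per warp,
// 8 stream processors per multiprocessor); everything else is assumed.
constexpr int kWarpSize = 32;
constexpr int kNumSPs = 8;
constexpr int kPassesPerWarp = kWarpSize / kNumSPs;  // 4 fast cycles per instruction

static_assert(kWarpSize % kNumSPs == 0, "warp must split evenly across the SPs");

struct Slot {
    int pass;  // which fast clock cycle (0..3) of the issued instruction
    int sp;    // which stream processor (0..7) runs this lane (assumed mapping)
};

// For one warp instruction, record when and where each of the 32 lanes runs.
std::array<Slot, kWarpSize> schedule_warp() {
    std::array<Slot, kWarpSize> slots{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        slots[lane] = Slot{lane / kNumSPs, lane % kNumSPs};
    }
    return slots;
}

// Count how many lanes execute during a given pass.
int lanes_in_pass(const std::array<Slot, kWarpSize>& slots, int pass) {
    assert(pass >= 0 && pass < kPassesPerWarp);
    int count = 0;
    for (const Slot& s : slots) {
        if (s.pass == pass) {
            ++count;
        }
    }
    return count;
}
```

In this model, calling `schedule_warp()` and then `lanes_in_pass(slots, p)` for each pass `p` gives 8 lanes per pass. So at any instant only 8 threads execute, as the question suspected, but a single decoded instruction still serves all 32 threads of the warp.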