A better diagram is in the CUDA programming guide.
Think of unlayering each level of abstraction… they’re easier to understand separately.
You launch a kernel with a grid of blocks of work. The GPU executes your code on all the blocks and returns.
How does the GPU execute the blocks of the grid? It gives one or more blocks to each SM and tells it to work. When the SM finishes a block, the GPU gives it another to work on. When all blocks are done, the GPU returns.
How does the SM execute a block? Each block contains one or more warps. The SM may hold several blocks at once, but it essentially merges all the warps from all its blocks into one big queue. Each tick of the clock, it takes one warp from the queue and executes it for that one tick. The next tick, it executes the next warp in the queue (it doesn’t have to wait for the first warp to finish!), and on the next tick another warp, and so on. A warp that finishes its one tick of computation goes back onto the queue to wait for its chance to execute its next instruction. A warp can take many ticks of latency, even hundreds, to finish an instruction, especially if it’s waiting on memory. When all the warps of a block are done, the SM reports the block complete and may receive a new one.
How is a warp executed? A warp is 32 threads wide. The warp is executed for one instruction (well, it could be two from dual-issue, but ignore that). Say the instruction is “C=A+B”. Then each of the 32 threads reads its own “A” and “B” from registers, the 32 SPs are given those 32 A and B values, and they do the add. So the SPs are “doing the work”… they all perform the same instruction on each thread’s data.