A question about warps and threadblocks

I learned this from a presentation slide: “Prefer to have enough threads per block to provide hardware with many warps to switch between - this is the way the GPU hides memory access latency.”

I also learned: “all the threads in the same thread block are supposed to execute concurrently.” If all the threads in a threadblock really do execute concurrently, then when does the switch happen? And how is the memory access latency hidden?

I am confused :) Thanks in advance for the help!

-AW

The switch happens every cycle. You can’t switch, however, to a warp that is stalled, so having more concurrent warps helps.

Latency hiding is achieved by doing other work while waiting for the data to arrive. The same applies to hiding arithmetic latency.

Thanks vvolkov!

So you mean all the threads in a threadblock are NOT running concurrently ON THE HARDWARE IN REAL? Every cycle, a warp (a group of 32 threads) is picked from the threadblock to execute?

What do you mean “in real”? The active threads are all concurrent, e.g. if you have 64 active warps on a multiprocessor (=2048 threads), the hardware can issue an instruction from any of them, without any context switching. (So, the “switch” cited above doesn’t really happen. Nothing is switched, i.e. moved or replaced - only selected.) This is just like hyperthreads on a CPU, but at a larger scale. On a CPU you have 2 hyperthreads per core; here you have 64 warps per multiprocessor.

Note that executing an instruction takes time. You keep sending a new instruction into the execution pipeline every cycle, but it takes many cycles until the result comes back. As a result, you get tons and tons of instructions in progress - many more than the number of instructions issued or completed every cycle (which is bounded by 8 per multiprocessor on Kepler).
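To put a rough number on "many more instructions in progress than issued per cycle", here is a back-of-the-envelope estimate using Little's law. The issue rate of 8 per cycle is the Kepler bound mentioned above. The latency of 10 cycles is only an assumed round figure for illustration, not a measured value.

```latex
\text{instructions in flight}
  = \text{issue rate} \times \text{latency}
  = 8\ \tfrac{\text{instructions}}{\text{cycle}} \times 10\ \text{cycles}
  = 80
```

With a longer latency, as for memory accesses, the number of instructions that must be in flight to sustain the same issue rate grows proportionally.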

Thanks again vvolkov. The execution of a threadblock is preemptive, so memory latency hiding is achieved through scheduling different threadblocks:)

Sort of. It is more about scheduling different warps, not threadblocks though.