I am reading “Programming Massively Parallel Processors”. In the section “DYNAMIC PARTITIONING OF SM RESOURCES”, the author writes, “In some cases, adding an automatic variable may allow the programmer to improve the execution speed…”, and then gives a scenario to illustrate the point.
Here is a screenshot of the paragraph that I would like to understand.
What does he mean by “four independent instructions between a global memory load and its use”?
“With a 200-cycle global memory latency… we need to have at least 14 warps”
I will be glad if someone could thoroughly explain the paragraph.
“independent” = “does not have a data dependency”
“has a data dependency on instruction X” = “consumes data produced by a preceding instruction X”
Thus here: the global load instruction is followed by four instructions that do not consume the data produced by the load instruction (“independent”), followed by an instruction that does consume it (“use”).
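Here is a hedged sketch of what that can look like in CUDA. This is my own toy kernel, not the book's; the names and the exact instruction count are assumptions, and a real compiler may schedule things differently:

```cuda
__global__ void saxpy_like(const float *in, float *out, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];          // global memory load: long (e.g. ~200-cycle) latency

    // Four instructions that are independent of x (they never read it),
    // so they can execute while the load is still in flight:
    float s = a * b;          // 1
    float t = s + a;          // 2
    float u = t * 0.5f;       // 3
    float v = u + b;          // 4

    out[i] = x * v;           // first "use" of x: only here must the warp wait
}
```

The warp stalls only if the load has not completed by the time `out[i] = x * v` issues; the four intervening instructions have already hidden part of the latency.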
If we want to avoid the penalty (i.e. stall) of a load with a 200-cycle latency, we need to execute instructions for 200 cycles that do not depend on the data produced by that load. On GPUs the predominant mechanism for this is thread-level parallelism: a thread that is stalled waiting for data to arrive is suspended and another thread is scheduled instead (zero-overhead context switching). Of course, the other thread can also hit the same load instruction and stall, so yet another thread needs to run. Therefore many concurrently running threads are needed to cover 200 cycles of load latency.
For simplicity, GPUs schedule groups of 32 threads called warps instead of individual threads. Why exactly 14 warps’ worth of threads are needed to cover a 200-cycle load latency here I do not know, but it is probably explained in the book you are reading.
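For what it’s worth, here is one back-of-the-envelope calculation that lands on 14. This is my guess, not necessarily the book’s arithmetic; it assumes the early-generation SM that edition describes, where a 32-thread warp issues each instruction over 4 cycles on 8 SPs:

```cpp
#include <cassert>

// Warps needed to cover a memory latency, given how many independent
// instructions each warp can issue between the load and its use.
// All numbers are assumptions for illustration, not measured values.
int warps_needed(int latency_cycles, int indep_instrs, int cycles_per_instr)
{
    // Cycles of useful work one warp contributes before it stalls:
    int per_warp = indep_instrs * cycles_per_instr;          // 4 * 4 = 16
    // Other warps required to fill the latency (ceiling division):
    int others = (latency_cycles + per_warp - 1) / per_warp; // ceil(200/16) = 13
    // Plus the warp that is actually waiting on the load:
    return others + 1;
}
```

With these assumptions, `warps_needed(200, 4, 4)` evaluates to 14, which matches the book’s number, so this may be the intended reasoning; the surrounding text in the chapter should say for sure.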