I am reading “Programming Massively Parallel Processors”. In the section “DYNAMIC PARTITIONING OF SM RESOURCES”, the author writes, “In some cases, adding an automatic variable may allow the programmer to improve the execution speed…”, and then gives a scenario to illustrate the point.
Here is a screenshot of the paragraph that I would like to understand.
What does he mean by “four independent instructions between a global memory load and its use”?
“With a 200-cycle global memory latency… we need to have at least 14 warps”
I will be glad if someone could thoroughly explain the paragraph.
“independent” = “does not have a data dependency”
“has a data dependency on instruction X” = “consumes data produced by a preceding instruction X”
Thus here: the global load instruction is followed by four instructions that do not consume the data produced by the load instruction (“independent”), followed by an instruction that does consume it (“use”).
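Here is a hedged sketch of what that can look like in CUDA. This is my own toy kernel, not the book's; the names and the exact instruction count are assumptions, and a real compiler may schedule things differently:

```cuda
__global__ void saxpy_like(const float *in, float *out, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];          // global memory load: long (e.g. ~200-cycle) latency

    // Four instructions that are independent of x (they never read it),
    // so they can execute while the load is still in flight:
    float s = a * b;          // 1
    float t = s + a;          // 2
    float u = t * 0.5f;       // 3
    float v = u + b;          // 4

    out[i] = x * v;           // first "use" of x: only here must the warp wait
}
```

The warp stalls only if the load has not completed by the time `out[i] = x * v` issues; the four intervening instructions have already hidden part of the latency.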
If we want to avoid the penalty (i.e. stall) of a load with a 200-cycle latency, we need to execute instructions for 200 cycles that do not depend on the data produced by that load. On GPUs the predominant mechanism for this is thread-level parallelism: a thread that is stalled waiting for data to arrive is suspended and another thread is scheduled instead (zero-overhead context switching). Of course, the other thread can also hit the same load instruction and stall, so yet another thread needs to run. Therefore many concurrently running threads are needed to cover 200 cycles of load latency.
For simplicity, GPUs schedule groups of 32 threads called warps instead of individual threads. Why exactly 14 warps’ worth of threads are needed to cover a 200-cycle load latency here I do not know, but it is probably explained in the book you are reading.
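For what it’s worth, here is one back-of-the-envelope calculation that lands on 14. This is my guess, not necessarily the book’s arithmetic; it assumes the early-generation SM that edition describes, where a 32-thread warp issues each instruction over 4 cycles on 8 SPs:

```cpp
#include <cassert>

// Warps needed to cover a memory latency, given how many independent
// instructions each warp can issue between the load and its use.
// All numbers are assumptions for illustration, not measured values.
int warps_needed(int latency_cycles, int indep_instrs, int cycles_per_instr)
{
    // Cycles of useful work one warp contributes before it stalls:
    int per_warp = indep_instrs * cycles_per_instr;          // 4 * 4 = 16
    // Other warps required to fill the latency (ceiling division):
    int others = (latency_cycles + per_warp - 1) / per_warp; // ceil(200/16) = 13
    // Plus the warp that is actually waiting on the load:
    return others + 1;
}
```

With these assumptions, `warps_needed(200, 4, 4)` evaluates to 14, which matches the book’s number, so this may be the intended reasoning; the surrounding text in the chapter should say for sure.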