At about 18:32 (on the "GMEM Optimization Guidelines" slide), under the bullet "process several elements per thread item," it says that multiple loads get pipelined, but neither the lecturer nor the slides explain what this term means.
Can anyone help me understand this sentence? In fact, doing this really does improve the bandwidth and execution time of a test kernel, but I'm not quite sure why.
It is a way to add some instruction level parallelism into your threads:
a = load(..)    // first load issued
b = load(..)    // second, independent load issued right after
result = f(a)   // can start as soon as a arrives, even while b is still loading
result += f(b)  // runs once b arrives
In the hardware, instructions can execute as long as they are not dependent on pending memory loads. So in the example above, a warp would start processing f(a) after the load for a completes, but while the load for b is still pending.
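To make that concrete, here is a minimal sketch of a kernel that processes two elements per thread; the array names and the squaring stand-in for f() are my own placeholders, not anything from the lecture:

__global__ void two_per_thread(const float *in, float *out, int n)
{
    // launch with enough threads to cover n/2 elements
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    if (i + stride < n) {
        // both loads are issued back to back; the second does not
        // depend on the first, so it does not wait for it
        float a = in[i];
        float b = in[i + stride];

        // the math on a can start as soon as a arrives,
        // while the load of b is still in flight
        float result = a * a;
        result      += b * b;

        out[i] = result;
    }
}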
But wouldn't another warp be selected for execution while the current warp is waiting for GMEM (as long as occupancy is above 50%), thus hiding the latency?
What I can imagine is that the pipelined loads get overlapped with arithmetic operations, thereby increasing performance.
Such overlap is not possible when each thread loads just one element, because the warp waiting for the GMEM access would simply become inactive.
Yes, this is the classic TLP (thread-level parallelism) that GPUs are great at. Some kernels benefit from the additional ILP you can get from overlapping loads and arithmetic in the same thread, too.
Just one more thing: do you know of any reference where I can learn more about CUDA instruction-level parallelism? I don't think the CUDA manual or the Best Practices Guide covers this topic.
There really aren't any hard and fast rules. ILP in CUDA kernels is a bit of a guess-and-check game: try out a couple of different prefetching or thread-coarsening variations and benchmark the performance. Sometimes you will find that performance increases, and at other times it decreases.
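As one example of the kind of variant you might benchmark (my own sketch, not from any NVIDIA document), here is a coarsened kernel where each thread prefetches a few independent elements before doing the arithmetic on them; the factor of 4 is just a starting point to tune:

#define ELEMS_PER_THREAD 4   // coarsening factor to tune and benchmark

__global__ void coarsened_scale(const float *in, float *out, int n, float s)
{
    int base   = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // issue all the independent loads first...
    float v[ELEMS_PER_THREAD];
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int idx = base + k * stride;
        v[k] = (idx < n) ? in[idx] : 0.0f;
    }

    // ...then consume them; the arithmetic on v[0] can begin
    // while the later loads are still in flight
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int idx = base + k * stride;
        if (idx < n) out[idx] = s * v[k];
    }
}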
I know the detailed internals are not documented (and subject to change!), but what’s the basic method that the GPU uses to determine whether operations can be done in parallel? Obviously this is done by the hardware scheduler, but how is it likely implemented?
As a total guess, I could imagine that every register has a virtual bit saying whether it is "in flight". The scheduler looks at the next instruction and checks whether all of its input and output registers are not in flight; if they're not, it issues the instruction and marks any registers that are being WRITTEN as in flight. When the pipelined computation finishes (or perhaps it's just after a fixed delay?) the in-flight bit is cleared.
Such a design would allow nesting as many ILP layers as you like since there’s no dynamic stack or anything, and you don’t need to peek ahead at upcoming instructions.
But this model can’t be entirely correct since it doesn’t account for memory reads or writes to shared or global memory…
Understanding this may seem academic, but Vasily shows that it’s actually important in practice, and understanding the low level details can really help guide algorithm implementations. (And this doesn’t even get into the significant changes in GF104 which can perform multiple instructions per clock in many cases.) Another interesting question is whether the compiler tries to reorder PTX to allow more ILP even for G200/GF100 code.
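Just to illustrate the guess above (this is purely a toy software model in plain host-side C, not how the hardware is documented to work), here is what an in-order issue check against per-register "in flight" bits would look like for the earlier load/compute example, with made-up latencies:

#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 4

static bool in_flight[NUM_REGS];  // hypothetical per-register "in flight" bit
static int  ready_at[NUM_REGS];   // cycle at which that bit clears

static void retire(int cycle) {               // clear bits whose latency has elapsed
    for (int r = 0; r < NUM_REGS; ++r)
        if (in_flight[r] && cycle >= ready_at[r])
            in_flight[r] = false;
}

static void write_reg(int r, int latency, int cycle) {
    in_flight[r] = true;                      // mark the written register busy
    ready_at[r]  = cycle + latency;
}

int main(void) {
    // program: r0 = load(..); r1 = load(..); r2 = f(r0); r2 += f(r1)
    // made-up latencies: loads take 4 "cycles", arithmetic takes 1
    int pc = 0;
    for (int cycle = 0; cycle < 10 && pc < 4; ++cycle) {
        retire(cycle);
        switch (pc) {
        case 0:                               // loads have no pending inputs: issue at once
            write_reg(0, 4, cycle); pc++;
            printf("cycle %d: r0 = load(..)\n", cycle); break;
        case 1:
            write_reg(1, 4, cycle); pc++;
            printf("cycle %d: r1 = load(..)\n", cycle); break;
        case 2:                               // stalls until r0 is no longer in flight
            if (!in_flight[0]) {
                write_reg(2, 1, cycle); pc++;
                printf("cycle %d: r2 = f(r0)  (r1 still loading: %d)\n", cycle, in_flight[1]);
            }
            break;
        case 3:                               // stalls until both r1 and r2 are ready
            if (!in_flight[1] && !in_flight[2]) {
                write_reg(2, 1, cycle); pc++;
                printf("cycle %d: r2 += f(r1)\n", cycle);
            }
            break;
        }
    }
    return 0;
}

With those made-up numbers it prints the two loads issuing back to back, f(r0) issuing while r1 is still loading, and the final add one cycle later.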
The compiler definitely pushes memory loads up as far as they can go. I don't do FLOP-heavy enough work to comment on how it reorders FLOPs to allow for ILP.
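A toy example of what that hoisting looks like (my own illustration): in source order the second load sits below some arithmetic, but nothing in between depends on it, so the compiler is free to issue it early:

__global__ void hoist_demo(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = a[i];
    float y = x * x + 1.0f;   // arithmetic that does not touch b[i]
    float z = b[i];           // written after the math in the source, but the
                              // compiler can move this load up above it, since
                              // nothing in between reads or writes z
    out[i] = y + z;
}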