Pipelined Loads

Hi all,

I was watching this recording from an Nvidia CUDA conference: [url="http://nvidia.fullviewmedia.com/GPU2009/1002-gold-1086.html"]http://nvidia.fullviewmedia.com/GPU2009/1002-gold-1086.html[/url].

At about 18:32 (slide "GMEM Optimization Guidelines"), under the "Process several elements per thread" item, it says that multiple loads get pipelined, but neither the lecturer nor the slides explain what this term means.

Can anyone help me understand this sentence? Processing several elements per thread does in fact improve the bandwidth and execution time of a test kernel, but I'm not quite sure why.

Thanks in advance

It is a way to add some instruction level parallelism into your threads:

[code]
a = load(..)
b = load(..)
result  = f(a)
result += f(b)
[/code]

In the hardware, instructions can be executed as long as they are not dependent on pending memory loads. So in the example above, a warp would start processing f(a) after the load for a completes, but while the load for b is still pending.
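
To make that concrete, here is a minimal CUDA sketch of the same pattern (my own illustration, not code from the talk). The indexing assumes the input holds two elements per thread, split into two halves of n elements each, and f() is just a placeholder computation:

[code]
// Minimal sketch (my own, not from the talk): two independent loads are issued
// back-to-back, and the first instruction that depends on either of them comes
// only after both loads are in flight.
__device__ float f(float x) { return x * x + 1.0f; }

// Assumes in[] holds 2*n floats (two halves of n elements) and out[] holds n.
__global__ void twoLoadsPerThread(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];        // first load issued
        float b = in[i + n];    // second, independent load issued while a is in flight
        float r = f(a);         // first use of a: earliest point the warp can stall
        r += f(b);              // b has had extra time to arrive by now
        out[i] = r;
    }
}
[/code]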

Thanks for your answer!

But wouldn't another warp be selected for execution while the current warp is waiting for Gmem (as long as the occupancy is above 50%), thus hiding the latency?

What I can imagine is that pipelined loads get overlapped with arithmetic operations, therefore increasing performance.

Such overlap is not possible when loading just one element per thread, because the warp waiting for the Gmem access would become inactive.

Is that it?

Thank you again

Yes, this is the classic TLP (thread-level parallelism) that GPUs are great at. Some kernels benefit from the additional ILP you can get from overlapping loads and arithmetic in the same thread, too.
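
For contrast, here is a sketch of the plain one-element-per-thread version (my own illustration, with the same placeholder f() as above). The only arithmetic instruction depends immediately on the only load, so there is nothing to overlap within the thread, and latency hiding falls entirely on the scheduler switching between resident warps:

[code]
// Baseline sketch: one element per thread. The result depends directly on the
// single load, so the warp stalls at f(a); the stall is hidden only if other
// warps are resident and ready to run (TLP), not by ILP within the thread.
__device__ float f(float x) { return x * x + 1.0f; }

__global__ void oneLoadPerThread(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];   // single load
        out[i] = f(a);     // immediately dependent on the load
    }
}
[/code]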

Thanks again.

Just one more thing: do you know any reference where I can learn more about CUDA instruction-level parallelism? I don't think the CUDA manual or the Best Practices Guide covers this topic.

Thanks!

The technique is discussed in this year’s VSCSE summer school course on CUDA. There are slides and you can watch the recorded video lectures.
[url="http://groups.google.com/group/vscse-many-core-processors-2010/web/course-presentations"]http://groups.google.com/group/vscse-many-core-processors-2010/web/course-presentations[/url]
IIRC, they mainly discuss thread coarsening as a method of introducing ILP and data-reuse. Prefetching is another common technique.

There really aren't any hard-and-fast rules. ILP in CUDA kernels is a bit of a guess-and-check game: try out a couple of different prefetching or thread-coarsening variants and benchmark the performance. Sometimes you will find that performance improves, and other times it degrades.
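
As a rough illustration of the prefetching idea (my own sketch, not from the course slides, with f() again a placeholder computation): each thread can walk a grid-stride loop and issue the load for the next iteration before doing the arithmetic for the current one, so the load latency overlaps with computation.

[code]
// Hedged sketch of register prefetching in a grid-stride loop. Whether this
// actually helps depends on the kernel, so benchmark it as suggested above.
__device__ float f(float x) { return x * x + 1.0f; }

__global__ void prefetchLoop(const float *in, float *out, int n)
{
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        float cur = in[i];                            // prime the pipeline
        for (; i < n; i += stride) {
            int next = i + stride;
            float nxt = (next < n) ? in[next] : 0.0f; // issue next load early
            out[i] = f(cur);                          // compute while nxt is in flight
            cur = nxt;                                // rotate for the next iteration
        }
    }
}
[/code]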

Here is another presentation that also highlights why using ILP can be better than using TLP: http://www.eecs.berkeley.edu/~volkov/volkov10-PMAA.pdf

Vasily

Great material! This confirms a lot of observations I've been making but haven't quantified with rigorous testing! Really good stuff! Thanks!

I know the detailed internals are not documented (and subject to change!), but what’s the basic method that the GPU uses to determine whether operations can be done in parallel? Obviously this is done by the hardware scheduler, but how is it likely implemented?

As a total guess, I could imagine that every register has a virtual bit saying whether it is “in flight”. The scheduler looks at the next PTX instruction and sees if all of the input and output registers are not in flight, and if they’re not, it performs the instruction and marks any registers that are being WRITTEN to as in flight. When a pipeline compute finishes (or perhaps it’s just after a fixed delay?) the in-flight bit is cleared.
Such a design would allow nesting as many ILP layers as you like since there’s no dynamic stack or anything, and you don’t need to peek ahead at upcoming instructions.
But this model can’t be entirely correct since it doesn’t account for memory reads or writes to shared or global memory…

Understanding this may seem academic, but Vasily shows that it’s actually important in practice, and understanding the low level details can really help guide algorithm implementations. (And this doesn’t even get into the significant changes in GF104 which can perform multiple instructions per clock in many cases.) Another interesting question is whether the compiler tries to reorder PTX to allow more ILP even for G200/GF100 code.

The compiler definitely pushes memory loads up as far as they can go. My work isn't FLOP-heavy enough for me to comment on how it reorders FLOPs to allow for ILP.

Thanks for the encouraging feedback. I am going to present a version of this talk at GTC 2010 in a few weeks.

Steve, you might want to check the following patent:

Coon et al. 2008. Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators, U.S. Patent No. 7434032.

Vasily