At about 18:32 (on the "GMEM Optimization Guidelines" slide), under the bullet "process several elements per thread item," it says that multiple loads get pipelined, but neither the lecturer nor the slides explain what this term means.
Can anyone help me understand this sentence? In fact, doing this really does improve the bandwidth and execution time of a test kernel, but I'm not quite sure why.
It is a way to add some instruction level parallelism into your threads:
a = load(..)    // first load issued
b = load(..)    // second, independent load issued right after
result = f(a)   // can start as soon as a arrives, even while b is still loading
result += f(b)  // runs once b arrives
In the hardware, instructions can execute as long as they are not dependent on pending memory loads. So in the example above, a warp would start processing f(a) after the load for a completes, but while the load for b is still pending.
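To make that concrete, here is a minimal sketch of a kernel that processes two elements per thread; the array names and the squaring stand-in for f() are my own placeholders, not anything from the lecture:

__global__ void two_per_thread(const float *in, float *out, int n)
{
    // launch with enough threads to cover n/2 elements
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    if (i + stride < n) {
        // both loads are issued back to back; the second does not
        // depend on the first, so it does not wait for it
        float a = in[i];
        float b = in[i + stride];

        // the math on a can start as soon as a arrives,
        // while the load of b is still in flight
        float result = a * a;
        result      += b * b;

        out[i] = result;
    }
}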
But wouldn't another warp be selected for execution while the current warp is waiting for GMEM (as long as occupancy is above 50%), thus hiding the latency?
What I can imagine is that the pipelined loads get overlapped with arithmetic operations, thereby increasing performance.
Such overlap is not possible when each thread loads just one element, because the warp waiting for the GMEM access would simply become inactive.
Yes, this is the classic TLP (thread-level parallelism) that GPUs are great at. Some kernels benefit from the additional ILP you can get from overlapping loads and arithmetic in the same thread, too.
Just one more thing: do you know of any reference where I can learn more about CUDA instruction-level parallelism? I don't think the CUDA manual or the Best Practices Guide covers this topic.
There really aren't any hard and fast rules. ILP in CUDA kernels is a bit of a guess-and-check game: try out a couple of different prefetching or thread-coarsening variations and benchmark the performance. Sometimes you will find that performance increases, and at other times it decreases.
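As one example of the kind of variant you might benchmark (my own sketch, not from any NVIDIA document), here is a coarsened kernel where each thread prefetches a few independent elements before doing the arithmetic on them; the factor of 4 is just a starting point to tune:

#define ELEMS_PER_THREAD 4   // coarsening factor to tune and benchmark

__global__ void coarsened_scale(const float *in, float *out, int n, float s)
{
    int base   = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // issue all the independent loads first...
    float v[ELEMS_PER_THREAD];
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int idx = base + k * stride;
        v[k] = (idx < n) ? in[idx] : 0.0f;
    }

    // ...then consume them; the arithmetic on v[0] can begin
    // while the later loads are still in flight
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int idx = base + k * stride;
        if (idx < n) out[idx] = s * v[k];
    }
}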
I know the detailed internals are not documented (and subject to change!), but what’s the basic method that the GPU uses to determine whether operations can be done in parallel? Obviously this is done by the hardware scheduler, but how is it likely implemented?
As a total guess, I could imagine that every register has a virtual bit saying whether it is "in flight". The scheduler looks at the next instruction and checks whether all of its input and output registers are not in flight; if they're not, it issues the instruction and marks any registers that are being WRITTEN as in flight. When the pipelined computation finishes (or perhaps it's just after a fixed delay?) the in-flight bit is cleared.
Such a design would allow nesting as many ILP layers as you like since there’s no dynamic stack or anything, and you don’t need to peek ahead at upcoming instructions.
But this model can’t be entirely correct since it doesn’t account for memory reads or writes to shared or global memory…
Understanding this may seem academic, but Vasily shows that it’s actually important in practice, and understanding the low level details can really help guide algorithm implementations. (And this doesn’t even get into the significant changes in GF104 which can perform multiple instructions per clock in many cases.) Another interesting question is whether the compiler tries to reorder PTX to allow more ILP even for G200/GF100 code.
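Just to illustrate the guess above (this is purely a toy software model in plain host-side C, not how the hardware is documented to work), here is what an in-order issue check against per-register "in flight" bits would look like for the earlier load/compute example, with made-up latencies:

#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 4

static bool in_flight[NUM_REGS];  // hypothetical per-register "in flight" bit
static int  ready_at[NUM_REGS];   // cycle at which that bit clears

static void retire(int cycle) {               // clear bits whose latency has elapsed
    for (int r = 0; r < NUM_REGS; ++r)
        if (in_flight[r] && cycle >= ready_at[r])
            in_flight[r] = false;
}

static void write_reg(int r, int latency, int cycle) {
    in_flight[r] = true;                      // mark the written register busy
    ready_at[r]  = cycle + latency;
}

int main(void) {
    // program: r0 = load(..); r1 = load(..); r2 = f(r0); r2 += f(r1)
    // made-up latencies: loads take 4 "cycles", arithmetic takes 1
    int pc = 0;
    for (int cycle = 0; cycle < 10 && pc < 4; ++cycle) {
        retire(cycle);
        switch (pc) {
        case 0:                               // loads have no pending inputs: issue at once
            write_reg(0, 4, cycle); pc++;
            printf("cycle %d: r0 = load(..)\n", cycle); break;
        case 1:
            write_reg(1, 4, cycle); pc++;
            printf("cycle %d: r1 = load(..)\n", cycle); break;
        case 2:                               // stalls until r0 is no longer in flight
            if (!in_flight[0]) {
                write_reg(2, 1, cycle); pc++;
                printf("cycle %d: r2 = f(r0)  (r1 still loading: %d)\n", cycle, in_flight[1]);
            }
            break;
        case 3:                               // stalls until both r1 and r2 are ready
            if (!in_flight[1] && !in_flight[2]) {
                write_reg(2, 1, cycle); pc++;
                printf("cycle %d: r2 += f(r1)\n", cycle);
            }
            break;
        }
    }
    return 0;
}

With those made-up numbers it prints the two loads issuing back to back, f(r0) issuing while r1 is still loading, and the final add one cycle later.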
The compiler definitely pushes memory loads up as far as they can go. I don't do FLOP-heavy enough work to comment on how it reorders FLOPs to allow for ILP.
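A toy example of what that hoisting looks like (my own illustration): in source order the second load sits below some arithmetic, but nothing in between depends on it, so the compiler is free to issue it early:

__global__ void hoist_demo(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = a[i];
    float y = x * x + 1.0f;   // arithmetic that does not touch b[i]
    float z = b[i];           // written after the math in the source, but the
                              // compiler can move this load up above it, since
                              // nothing in between reads or writes z
    out[i] = y + z;
}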