Pipelined Loads

Thanks for the encouraging feedback. I am going to present a version of this talk at GTC 2010 in a few weeks.

Steve, you might want to check the following patent:

Coon et al. 2008. Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators, U.S. Patent No. 7434032.

Vasily

Thanks, Vasily! I had heard the term “scoreboard” before, and I now see it refers to this kind of register tracking. I’m still not sure how it detects memory write readiness, but maybe it’s as simple as tracking any in-flight memory write with a single status flag. That would mean you should bunch all of your memory reads together and all of your writes together (if the pipeline delays for those were a bottleneck).

Separate question: Vasily, on page 38 of your slides, you show a local memory array Csub[4] holding intermediate computation values, rather than registers. I assume that in that example the compiler can figure out that the array is always indexed by fixed values known at compile time, convert the local array into 4 explicit registers, and therefore the values can participate in ILP?

And yet another question, Vasily…

In your conclusions (last slide) you recommend “Use registers over shared memory whenever possible.” Do you mean computed index shared memory (which caused the slowdown in your 4-value per thread matrix multiply example)? Or any use of shared memory at all (which might inhibit ILP if the scoreboarding can’t distinguish distinct shared memory writes for pipeline timing)?

I just remembered this old post:

"My conclusion is this: If your problem is well defined you should avoid using shared memory like it’s the plague. Less shared memory, fewer threads and more registers.

This seems to work well for some problems and I’m sure there are a lot of people who don’t agree with this programming philosophy."

:)

[url=“The Official NVIDIA Forums | NVIDIA”]http://forums.nvidia.com/index.php?showtop...mp;#entry951573[/url]

There’s a great (and classic) textbook that discusses ILP in general (for generic architectures): Computer Architecture: A Quantitative Approach. Chapter 3 has great detail on how CPUs (and GPUs) can do this kind of scheduling. Very interesting for low-level geeks.

Thanks again, Vasily!

Did you ever observe the pipelined execution of MAD instructions on GT200 with a single thread?

I didn’t try it with a single thread — I think you can’t approach 100% of peak with less than 64 threads per thread block on GT200.

But pipelined MADs should run faster even in a single thread.

Vasily

My point is that shared memory has much less bandwidth than registers, so it can be a bottleneck in your code. So, you should store and reuse more data in registers to require less shared memory bandwidth just like you store and reuse data in shared memory to require less global memory bandwidth.

Vasily

You are right, but I usually see it from the other perspective: all local variables are allocated in registers by default and get spilled into local memory if there is a problem. One problem is having too many register variables; another is indexed access to an array where the index is not known at compile time. Otherwise, defining an array as “float a[4];” is the same as defining four variables “float a0, a1, a2, a3;”

Vasily

So, is there a maximum number of registers a thread can have? If it’s something like 64, this could affect how much ILP you could ever extract.

I think someone reverse-engineered the Excel formulas in the occupancy calculator and determined some interesting undocumented limits, which I wish I could remember or find again. Maybe something like the number of registers actually allocated being rounded up to the next multiple of 2 or 4? A Google search of the forum didn’t turn it up, so I have to ask again.

IIRC the undocumented bit was that registers were assigned to blocks in “pages” which had a power of two size (at least for G80/90/GT200 I seem to recall it was 512).