Pipelined Loads

Thanks for the encouraging feedback. I am going to present a version of this talk at GTC 2010 in a few weeks.

Steve, you might want to check the following patent:

Coon et al. 2008. Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators, U.S. Patent No. 7434032.

Vasily

Thanks, Vasily! I had heard the term “scoreboard” before, and I now see it refers to this kind of register tracking. I’m still not sure how it detects memory write readiness, but maybe it’s as simple as tracking any in-flight memory write with a single status flag. That would mean you should bunch all of your memory reads together and all of your writes together (if the pipeline delays for those were a bottleneck).

Separate question: Vasily, on page 38 of your slides, you show a local memory array Csub[4] holding intermediate computation values, rather than registers. I assume that in that example the compiler can figure out that the array is always indexed by fixed values known at compile time, convert the local array into 4 explicit registers, and therefore the values can participate in ILP?

And yet another question, Vasily…

In your conclusions (last slide) you recommend “Use registers over shared memory whenever possible.” Do you mean computed index shared memory (which caused the slowdown in your 4-value per thread matrix multiply example)? Or any use of shared memory at all (which might inhibit ILP if the scoreboarding can’t distinguish distinct shared memory writes for pipeline timing)?

I just remembered this old post:

"My conclusion is this: If your problem is well defined you should avoid using shared memory like it’s the plague. Less shared memory, fewer threads and more registers.

This seems to work well for some problems and I’m sure there are a lot of people who don’t agree with this programming philosophy."

:)

[url=“The Official NVIDIA Forums | NVIDIA”]http://forums.nvidia.com/index.php?showtop...mp;#entry951573[/url]

There’s a great (and classic) textbook that discusses ILP in general (for generic architectures): Computer Architecture: A Quantitative Approach. Chapter 3 has great detail on how CPUs (and GPUs) can do this kind of scheduling. Very interesting for low-level geeks.

Thanks again, Vasily!

Did you ever observe the pipelined execution of MAD instructions on GT200 with a single thread?

I didn’t try it with a single thread — I think you can’t approach 100% of peak with less than 64 threads per thread block on GT200.

But pipelined MADs should run faster even in a single thread.

Vasily

My point is that shared memory has much less bandwidth than registers, so it can be a bottleneck in your code. So, you should store and reuse more data in registers to require less shared memory bandwidth just like you store and reuse data in shared memory to require less global memory bandwidth.

Vasily

You are right, but I usually see it from the other perspective: all local variables are allocated in registers by default and get spilled into local memory if there is a problem. One problem is having too many register variables; another is indexed access to an array where the index is not known at compile time. Otherwise, defining an array as “float a[4];” is the same as defining four variables “float a0, a1, a2, a3;”

Vasily

So, is there a maximum number of registers a thread can have? If it’s something like 64, this could affect how much ILP you could ever extract.

I think someone reverse-engineered the Excel formulas in the occupancy calculator and determined some interesting undocumented limits, which I wish I could remember or find again. Maybe something like the number of registers actually allocated being rounded up to the next multiple of 2 or 4? A Google search of the forum didn’t turn it up, so I have to ask again.

IIRC the undocumented bit was that registers were assigned to blocks in “pages” which had a power of two size (at least for G80/90/GT200 I seem to recall it was 512).