[Fermi] Number of registers

You can also find the shared memory bandwidth in David Kirk’s slides: [url=“https://hub.vscse.org/resources/287/download/cuda_fermi_overview_DK.pdf”]https://hub.vscse.org/resources/287/download/cuda_fermi_overview_DK.pdf[/url]

It is 1030 GB/s for C2050, which has 1030 Gflop/s arithmetic peak.
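
For reference, a rough back-of-the-envelope sketch (mine, not from the slides) of where both 1030 figures come from, assuming the usual C2050 specs: 448 cores in 14 SMs, 1.15 GHz shader clock, 575 MHz scheduler clock, and 32 shared memory banks per SM each delivering 4 bytes per two shader clocks:

[code]
% peak single-precision arithmetic (an FMA counts as 2 FLOP):
448\ \text{cores} \times 2\ \text{FLOP} \times 1.15\ \text{GHz} \approx 1030\ \text{GFLOP/s}

% peak shared memory bandwidth:
14\ \text{SMs} \times 32\ \text{banks} \times 4\ \text{B} \times 575\ \text{MHz} \approx 1030\ \text{GB/s}
[/code]

The same arithmetic with 15 SMs at a 700 MHz scheduler clock gives the 1344 GB/s figure quoted for the GTX 480 further down the thread.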

Vasily

Oh! So the simple explanation is that you can do “A=B+C” with registers in one clock (throughput), but if all values were in shared memory, that’d require two shared reads and one write, so it’d be 3 clocks throughput at minimum. It’s worse with MADD because of the extra argument, and worse again with compute 2.1’s opportunistic dual instruction issue to its “spare” SPs. So the common wisdom that “without bank conflicts, shared is just as fast as registers” only applies when one (and only one) instruction argument or destination is in shared. Is that right?
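
(To make the operand counting concrete, here is a hypothetical pair of kernels, not from the thread; whether the compiler really keeps the shared accesses, rather than promoting them to registers, depends on the surrounding code.)

[code]
// Toy illustration of the operand-count argument: with register operands the
// whole statement is one FADD; with all three values in shared memory the same
// statement also costs two shared loads and one shared store.
__global__ void add_regs(float b, float c, float *out)
{
    float a = b + c;               // all operands in registers
    out[threadIdx.x] = a;
}

__global__ void add_shared(float *out)   // launch with 32 threads per block
{
    __shared__ float A[32], B[32], C[32];
    int t = threadIdx.x;
    B[t] = t;
    C[t] = 2.0f * t;
    __syncthreads();
    A[t] = B[t] + C[t];            // 2 shared reads + 1 shared write + the add
    out[t] = A[t];
}
[/code]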

And a followup question for you (I asked in the pipeline thread, but I’ll ask again now). When the GPU is exploiting ILP, I understand it uses scoreboarding to mark which registers are still pending in the pipeline, so it can decide whether the next instruction is safe to schedule. But what happens if one or more instruction arguments are in shared memory? Can the scoreboard track pipeline progress for specific memory locations as well as registers? If not, then ILP across instructions that touch any shared memory at all (even a different location) would stop as soon as you wrote a memory location, and restart only after some conservative worst-case pipeline timeout.

I think you got it. I’d add that it is worse not only with compute 2.1, but also with 2.0, since it has 32 thread processors and 32 banks (a 1:1 ratio) versus 8 thread processors and 16 banks (1:2) on compute 1.0–1.3. So in your example you’d get a 2x slowdown on GF100, i.e. 6 clocks instead of 3. Also, it may be possible that writes to shared memory overlap with reads, giving 4 cycles instead of 6.

Your interpretation of the common wisdom seems to apply for G80/GT200, at least for some operations such as add. For some other operations it doesn’t, but there the bottleneck is elsewhere; check the table in Section 3.7 of my SC08 paper for some examples. The new common wisdom for Fermi, ignoring writes for the moment, would be: “shared memory is as fast as registers as long as no more than one of the instructions issued in a cycle reads shared memory”.
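
(As a hypothetical illustration of that rule: the kernel below reads at most one operand per FMA from shared memory, while the other operand and the accumulator stay in registers. The kernel itself is made up, but it has the same shape as the SGEMM inner loops in the SC08 paper.)

[code]
// Launch with 64 threads per block. Each FMA in the unrolled loop reads
// exactly one value from shared memory; 'a' and 'acc' live in registers.
__global__ void fma_one_shared_operand(const float *A, const float *B, float *C)
{
    __shared__ float bs[64];
    int t = threadIdx.x;
    bs[t] = B[t];                 // stage a tile of B in shared memory
    __syncthreads();

    float a = A[t];               // register operand
    float acc = 0.0f;             // register accumulator
    #pragma unroll
    for (int j = 0; j < 64; ++j)
        acc += a * bs[j];         // one shared read per FMA
    C[t] = acc;
}
[/code]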

It’s a very good question. I don’t know such details on the scoreboard. I’d guess that the answer is yes and would suggest checking it with micro-benchmarking.
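
(For anyone who wants to try, here is a rough sketch of such a micro-benchmark; the kernel, the names and the exact access pattern are mine, and interpreting the timings would need some care.)

[code]
// Sketch: time a chain where each shared load hits the address the same thread
// just stored to, versus a chain that loads a different address. If the
// scoreboard tracks shared-memory addresses, the second case should be faster;
// if it conservatively assumes a dependency, both should take about as long.
#include <cstdio>

__global__ void shmem_dep(unsigned int *cycles, int same_address)
{
    __shared__ volatile float buf[32];   // volatile: keep the accesses in shared
    int t = threadIdx.x;
    float v = t;
    buf[t] = v;

    unsigned int start = (unsigned int)clock();
    for (int i = 0; i < 256; ++i) {
        buf[t] = v;                          // store
        v += buf[same_address ? t : t ^ 1];  // load same or neighbouring address
    }
    unsigned int end = (unsigned int)clock();

    if (t == 0) *cycles = end - start;
    if (v == 12345.0f) buf[t] = v;           // consume v so it cannot be discarded
}

int main()
{
    unsigned int *d, h;
    cudaMalloc((void **)&d, sizeof(unsigned int));
    for (int same = 1; same >= 0; --same) {
        shmem_dep<<<1, 32>>>(d, same);
        cudaMemcpy(&h, d, sizeof(unsigned int), cudaMemcpyDeviceToHost);
        printf("%s address: %u cycles\n", same ? "same" : "different", h);
    }
    cudaFree(d);
    return 0;
}
[/code]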

Vasily

Thanks, Vasily! This is getting clearer now.
Also, I finally understand the impact of the change from 16 banks to 32… I had thought it was kind of a wash, since “every SP can still read shared at once, just like before, so it’s no big deal”, but you’re right: the older 16-bank design effectively let a thread use TWO shared memory locations at once. It’s subtle, but actually critical for these tight-loop optimizations.

So even when it’s all documented in the Programming Guide, it still takes a while to understand all the consequences!

Empirically yes, but the nvcc documentation says it is 128 (p. 16 of the nvcc 3.1 manual, if you are interested).
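
(A quick way to see what you actually get, for anyone curious; the file and kernel names are made up, but -Xptxas -v and __launch_bounds__ are standard nvcc/CUDA features.)

[code]
// regcount.cu -- see how many registers per thread the compiler actually uses.
// Compile with:   nvcc -arch=sm_20 -Xptxas -v regcount.cu
// ptxas reports "Used NN registers" for each kernel, so you can check where the
// real per-thread cap sits regardless of what the manual says. __launch_bounds__
// gives the compiler an occupancy target that influences its register budget.
__global__ void __launch_bounds__(128) dummy(const float *in, float *out)
{
    int t = threadIdx.x + blockIdx.x * blockDim.x;
    float a = in[t], b = in[t + 1], c = in[t + 2], d = in[t + 3];
    out[t] = a * b + c * d;          // a handful of live values -> a few registers
}
[/code]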

Do you know if there are any plans for new GPUs with 128 registers per thread?

Hi Vasily,

I got the 1344 GB/s calculation, but I didn’t understand the second part, where it is “easy to see…” :)

Can you please explain a bit more slowly? :)

Thanks

eyal

I think the argument is that in order for Fermi to hit peak FLOP/s, every core has to execute an FMAD every shader clock cycle. A 32-bit FMAD working out of shared memory requires two loads and a store, i.e. three 32-bit transactions per FMAD, or 12 bytes per two FLOP, which is 6 bytes per FLOP. So for the GF100 to hit peak single-precision FLOP/s using shared memory, it would need shared memory bandwidth (in bytes/s) of at least 6 times the peak compute rate (in FLOP/s). The actual available shared memory bandwidth is much closer to 1 times the peak compute rate. I think that is the point Vasily is very elegantly making here.
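
(Putting the numbers from earlier in the thread side by side, under the same two-loads-plus-one-store assumption:)

[code]
% shared-memory traffic needed per FLOP (2 loads + 1 store per FMAD):
\frac{3 \times 4\ \text{B}}{2\ \text{FLOP}} = 6\ \text{B/FLOP}

% shared-memory bandwidth actually available per FLOP (C2050 numbers):
\frac{1030\ \text{GB/s}}{1030\ \text{GFLOP/s}} = 1\ \text{B/FLOP}
[/code]

So an FMAD stream running entirely out of shared memory would be capped at roughly 1/6 of the arithmetic peak.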

Can’t tell for Fermi, but with Tesla I did not notice any scoreboarding on memory addresses. The warp scheduler seems to just assume a dependency as you say.
