[Fermi] Number of registers

Hi!

When I compile my code for Fermi, I get a lower register count per thread than on the C1060. This has a deep impact on performance, since local memory usage (register spilling) grows heavily. How can I enable more than 64 registers per thread?

Many thanks.

This is not possible when compiling for the sm_20 architecture. The Fermi instruction set can only address 64 registers per thread.
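
To see where the register budget goes, compile with nvcc -arch=sm_20 -Xptxas -v; ptxas then reports the per-thread register count and any local-memory spill bytes. The kernel below is only an illustrative sketch (the name and sizes are made up):

```
// Illustrative sketch: a kernel whose per-thread working set exceeds the
// sm_2x register cap discussed above, so ptxas typically has to spill part
// of it to local memory.
// Compile with:  nvcc -arch=sm_20 -Xptxas -v spill_demo.cu
// ptxas then prints "Used N registers" plus any "bytes spill stores/loads".

#define N_VALS 96   // more live floats than the per-thread register cap

__global__ void spill_demo(const float *in, float *out)
{
    float v[N_VALS];

    #pragma unroll
    for (int i = 0; i < N_VALS; ++i)       // fully unrolled: constant indices, so v[] can be kept in registers
        v[i] = in[threadIdx.x * N_VALS + i];

    float s = 0.0f;
    #pragma unroll
    for (int i = N_VALS - 1; i >= 0; --i)  // consume in reverse so many values are live at the same time
        s += v[i] * v[i];

    out[threadIdx.x] = s;
}
```

Note that -maxrregcount and __launch_bounds__ can only lower the per-thread budget further; there is no switch that raises it above the hardware limit.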

Doesn’t the pre-sm_20 instruction set also have only 64 addressable registers?

No, 128 for compute 1.0, 1.1, 1.2 and 1.3.

Isn’t that a major regression?

The lack of registers is offset by the L1 and L2 caches.

I believe they figured out that it is unusual to use more than 64 registers in a kernel. So, they saved 1 bit in register IDs instead.

I think this is very unfortunate, and even more so on Fermi: the gap between L1 memory throughput and arithmetic throughput is 2x wider than it was on GT200. So you are more likely to be bound by shared memory bandwidth.

Vasily

Vasily, this is the second time I’ve seen you mention shared memory bandwidth as a limitation. I can see that shared memory could give you less bandwidth than registers in two important cases: bank conflicts and compute 2.1 dual-warp scheduling.

But for the simple, common case, don’t ops operating directly on shared memory have the same throughput as ops operating on registers?

In fact, the programming guide (section 5.1.2.5) says:

It’s actually 124.

Thank you for this pointer! Yes, the programming guide is wrong; it is too simplistic at times. I can guess, however, why they would write this. Latency-wise there was indeed little difference between using operands in shared memory and in registers on pre-Fermi GPUs. The gap in sheer bandwidth was 4x on G80/GT200 and is 8x now, but you couldn’t use much more anyway because of other bottlenecks, such as the number of shared memory operands per instruction on G80/GT200 or the number of issue slots on Fermi. So neither latency nor bandwidth is a direct bottleneck by itself; it’s all mixed up. But the point is that accessing shared memory has enough overhead that you’d rather not do it.

Here are some numbers for Fermi. Shared memory has 32 banks per multiprocessor, there are 15 multiprocessors in total, each bank is 4 bytes wide, and the clock rate is half the shader clock rate (1.4 GHz). So the bandwidth is 32 banks * 15 SMs * 4 B * 1.4 GHz / 2 = 1344 GB/s. It is easy to see that this is not sufficient to reach the theoretical peak of 1344 Gflop/s: every multiply-add reads 12 bytes of operands per 2 flops, i.e. roughly 8 TB/s of read bandwidth at peak. Registers can clearly provide that, but shared memory cannot. So the gap is at least 6x.
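
Spelled out as a small host-side calculation (illustrative only; the constants are just the 15-SM, 1.4 GHz figures from the paragraph above):

```
// Reproduces the back-of-the-envelope numbers above; nothing is measured here.
#include <stdio.h>

int main(void)
{
    const double banks      = 32;       // shared memory banks per SM
    const double sms        = 15;       // multiprocessors on the chip
    const double bank_width = 4;        // bytes per bank per access
    const double shader_hz  = 1.4e9;    // shader clock

    // Shared memory runs at half the shader clock.
    double smem_bw = banks * sms * bank_width * shader_hz / 2;   // ~1344 GB/s

    // Peak arithmetic: 15 SMs * 32 cores * 2 flops * 1.4 GHz.
    double flops_peak = sms * 32 * 2 * shader_hz;                // ~1344 Gflop/s

    // A multiply-add reads 3 operands (12 bytes) per 2 flops.
    double operand_bw = flops_peak / 2 * 12;                     // ~8064 GB/s

    printf("shared memory bandwidth : %6.0f GB/s\n", smem_bw / 1e9);
    printf("operand bandwidth needed: %6.0f GB/s\n", operand_bw / 1e9);
    printf("gap                     : %4.1fx\n", operand_bw / smem_bw);
    return 0;
}
```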

Vasily

You can also find the shared memory bandwidth in David Kirk’s slides: https://hub.vscse.org/resources/287/download/cuda_fermi_overview_DK.pdf

It is 1030 GB/s for the C2050, which has a 1030 Gflop/s arithmetic peak.

Vasily