Register usage difference between sm_13 and sm_20: many more registers used when compiling for sm_20

Greetings,

I’m optimising some kernels for the Fermi architecture, and I’ve noticed some strange differences between compiling kernels for sm_13 and sm_20. I have found that using the -arch=sm_20 flag can increase register usage significantly. For example, compiling for sm_13 I get the following:

[codebox]ptxas info : Compiling entry function ‘_Z15shared2x2float4P6float2S0_i’ for ‘sm_13’

ptxas info : Used 47 registers, 524+16 bytes smem, 24 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z15shared2x2float2P6float2S0_i’ for ‘sm_13’

ptxas info : Used 49 registers, 1036+16 bytes smem, 16 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z14shared2x2floatP6float2S0_i’ for ‘sm_13’

ptxas info : Used 47 registers, 524+16 bytes smem, 16 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z9simple2x2P6float2S0_i’ for ‘sm_13’

ptxas info : Used 59 registers, 12+16 bytes smem, 8 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z15shared1x1float4P6float2S0_i’ for ‘sm_13’

ptxas info : Used 21 registers, 268+16 bytes smem, 16 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z14shared1x1floatP6float2S0_i’ for ‘sm_13’

ptxas info : Used 17 registers, 268+16 bytes smem, 16 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z9simple1x1P6float2S0_i’ for ‘sm_13’

ptxas info : Used 23 registers, 12+16 bytes smem, 8 bytes cmem[1]

[/codebox]

whereas for sm_20 the register usage goes up:

[codebox]ptxas info : Compiling entry function ‘_Z15shared2x2float4P6float2S0_i’ for ‘sm_20’

ptxas info : Used 63 registers, 4+0 bytes lmem, 512+0 bytes smem, 44 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z15shared2x2float2P6float2S0_i’ for ‘sm_20’

ptxas info : Used 63 registers, 36+0 bytes lmem, 1024+0 bytes smem, 44 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z14shared2x2floatP6float2S0_i’ for ‘sm_20’

ptxas info : Used 56 registers, 512+0 bytes smem, 44 bytes cmem[0], 4 bytes cmem[16]

ptxas info : Compiling entry function ‘_Z9simple2x2P6float2S0_i’ for ‘sm_20’

ptxas info : Used 55 registers, 44 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z15shared1x1float4P6float2S0_i’ for ‘sm_20’

ptxas info : Used 28 registers, 256+0 bytes smem, 44 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z14shared1x1floatP6float2S0_i’ for ‘sm_20’

ptxas info : Used 28 registers, 256+0 bytes smem, 44 bytes cmem[0], 4 bytes cmem[16]

ptxas info : Compiling entry function ‘_Z9simple1x1P6float2S0_i’ for ‘sm_20’

ptxas info : Used 26 registers, 44 bytes cmem[0]

[/codebox]

This is a huge difference in the number of registers. In both cases I am using CUDA 3.1 with the -m32 flag for 32-bit compilation, and for sm_20 I’m additionally disabling the printf functionality with -Xptxas -abi=no. Does anyone have an explanation for this?
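For reference, the compile lines look roughly like this (file names are placeholders; the flags are the ones described above):

[codebox]# sm_13 build
nvcc -m32 -arch=sm_13 --ptxas-options=-v -c kernels.cu -o kernels_sm13.o

# sm_20 build, with the printf support disabled via the ptxas ABI flag
nvcc -m32 -arch=sm_20 -Xptxas -abi=no --ptxas-options=-v -c kernels.cu -o kernels_sm20.o
[/codebox]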

My kernels are currently latency bound, so I need to minimise register usage to increase the overall occupancy. As a result, when running my code on the GTX 480 I am finding that compiling for sm_13 can lead to faster code because the occupancy is higher. On the subject of latency hiding, the number of threads required to fully hide latency has been well discussed for GT200 (192-256 threads per SM), but how many threads are required on Fermi? Is my understanding correct that something like ~768 threads might be required?
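One thing I’m trying in order to force the register count down on the sm_20 build is __launch_bounds__. This is just a sketch: the block size and blocks-per-SM values are made up for illustration, and the parameter names are placeholders (the signature matches the shared2x2float4 kernel above; the body is elided):

[codebox]// Tell ptxas that this kernel is launched with at most 256 threads per block
// and that we want at least 3 resident blocks per SM; ptxas then caps the
// per-thread register count accordingly, spilling to local memory if needed.
__global__ void __launch_bounds__(256, 3)
shared2x2float4(float2 *in, float2 *out, int n)
{
    // ... kernel body unchanged ...
}
[/codebox]

The blunter alternative is -maxrregcount on the nvcc command line, which applies the same cap to every kernel in the file.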

Thanks.

sm_13 and sm_20 have totally different ISAs.

Thanks for the fast response.

Ah, I realise most of my previous post was nebulous, of course. When compiling with -arch=sm_13, ptxas only reports the register usage for the sm_13 target, not for the sm_20 binary that is generated at runtime from the PTX (I think?). When I run the binary on the Fermi card, the reported register usage matches that of the -arch=sm_20 target (a -gencode sketch for avoiding this runtime JIT follows the questions below). This doesn’t explain, though:

1.) Why does compiling with sm_13 generate faster code than sm_20? (Yes, I am using “-ftz=true -prec-div=false -prec-sqrt=false”.) This isn’t true for all of my kernels, but where it is, the difference is around 10-15 GFLOPS.

2.) Why does Fermi require so many more registers than GT200? I guess this might be a hardware question. Since Fermi is more register bound than previous generations (a limit of 64 vs 128 registers per thread), this makes optimisation even more challenging.

3.) How many warps are required to fully hide latency on Fermi?
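Incidentally, to take the runtime JIT out of the equation when testing this, my understanding is that both cubins can be embedded in a fat binary with -gencode, something along these lines (file names again placeholders):

[codebox]nvcc -m32 --ptxas-options=-v \
     -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_20,code=sm_20 \
     -c kernels.cu -o kernels_fat.o
[/codebox]

That way the driver should pick the precompiled sm_20 cubin on the GTX 480 instead of recompiling the sm_13 PTX at load time.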

Thanks.

My situation is a bit similar to yours.

sm_13:

[codebox]nvcc --ptxas-options=-v -arch sm_13 -g

ptxas info : Used 52 registers, 7804+16 bytes smem, 30212 bytes cmem[0], 76 bytes cmem[1], 76 bytes cmem[14]
[/codebox]

wall time: 0.15249

sm_20:

[codebox]nvcc --ptxas-options=-v -arch sm_20 -g

ptxas info : Used 63 registers, 7776+0 bytes smem, 60 bytes cmem[0], 30212 bytes cmem[2], 76 bytes cmem[14], 20 bytes cmem[16]
[/codebox]

wall time: 0.17066

Anyone know why sm_13 is faster than sm_20 (using Tesla C2050 w/ ECC)?


Most likely the compiler options used by the built-in JIT PTX compiler are different from the ones used when manually building from the command line(?)

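If that is the suspicion, one way to test it is to JIT the PTX yourself through the driver API and pass the options explicitly. A rough sketch: ptxSource would be the text of a .ptx file produced with nvcc -ptx, and the option values here are purely illustrative.

[codebox]#include <cuda.h>

/* Sketch: JIT the PTX through the driver API with explicit ptxas options,
   so the result no longer depends on whatever defaults the runtime's
   built-in JIT happens to use. Error checking omitted for brevity;
   a CUDA context must already be current on the calling thread. */
CUmodule loadPtxWithOptions(const char *ptxSource)
{
    CUjit_option options[2];
    void        *values[2];

    options[0] = CU_JIT_MAX_REGISTERS;       /* cap registers per thread */
    values[0]  = (void *)(size_t)32;         /* illustrative value       */

    options[1] = CU_JIT_OPTIMIZATION_LEVEL;  /* ptxas -O level, 0..4     */
    values[1]  = (void *)(size_t)4;

    CUmodule module;
    cuModuleLoadDataEx(&module, ptxSource, 2, options, values);
    return module;
}
[/codebox]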