Greetings,
I’m optimising some kernels for the Fermi architecture, and I’ve noticed some strange differences between compiling for sm_13 and sm_20. I have found that using the -arch=sm_20 flag can increase register usage significantly. For example, compiling for sm_13 I get the following:
[codebox]ptxas info : Compiling entry function '_Z15shared2x2float4P6float2S0_i' for 'sm_13'
ptxas info : Used 47 registers, 524+16 bytes smem, 24 bytes cmem[1]
ptxas info : Compiling entry function '_Z15shared2x2float2P6float2S0_i' for 'sm_13'
ptxas info : Used 49 registers, 1036+16 bytes smem, 16 bytes cmem[1]
ptxas info : Compiling entry function '_Z14shared2x2floatP6float2S0_i' for 'sm_13'
ptxas info : Used 47 registers, 524+16 bytes smem, 16 bytes cmem[1]
ptxas info : Compiling entry function '_Z9simple2x2P6float2S0_i' for 'sm_13'
ptxas info : Used 59 registers, 12+16 bytes smem, 8 bytes cmem[1]
ptxas info : Compiling entry function '_Z15shared1x1float4P6float2S0_i' for 'sm_13'
ptxas info : Used 21 registers, 268+16 bytes smem, 16 bytes cmem[1]
ptxas info : Compiling entry function '_Z14shared1x1floatP6float2S0_i' for 'sm_13'
ptxas info : Used 17 registers, 268+16 bytes smem, 16 bytes cmem[1]
ptxas info : Compiling entry function '_Z9simple1x1P6float2S0_i' for 'sm_13'
ptxas info : Used 23 registers, 12+16 bytes smem, 8 bytes cmem[1]
[/codebox]
whereas for sm_20 the register usage goes up:
[codebox]ptxas info : Compiling entry function '_Z15shared2x2float4P6float2S0_i' for 'sm_20'
ptxas info : Used 63 registers, 4+0 bytes lmem, 512+0 bytes smem, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z15shared2x2float2P6float2S0_i' for 'sm_20'
ptxas info : Used 63 registers, 36+0 bytes lmem, 1024+0 bytes smem, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z14shared2x2floatP6float2S0_i' for 'sm_20'
ptxas info : Used 56 registers, 512+0 bytes smem, 44 bytes cmem[0], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z9simple2x2P6float2S0_i' for 'sm_20'
ptxas info : Used 55 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z15shared1x1float4P6float2S0_i' for 'sm_20'
ptxas info : Used 28 registers, 256+0 bytes smem, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z14shared1x1floatP6float2S0_i' for 'sm_20'
ptxas info : Used 28 registers, 256+0 bytes smem, 44 bytes cmem[0], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z9simple1x1P6float2S0_i' for 'sm_20'
ptxas info : Used 26 registers, 44 bytes cmem[0]
[/codebox]
This is a huge difference in the number of registers. In both cases I am using CUDA 3.1 with the -m32 flag for 32-bit compilation, and for sm_20 I am additionally disabling the printf functionality with -Xptxas -abi=no. Does anyone have an explanation for this?
My kernels are currently latency bound, so I need to minimise register usage to increase overall occupancy. As a result, when running my code on a GTX 480, I am finding that compiling for sm_13 can actually produce faster code, simply because the occupancy is higher. On the subject of latency hiding: the number of threads required to fully hide latency has been well discussed for GT200 (192-256 threads in total per SM), but how many threads are required on Fermi? My understanding is that something like ~768 threads per SM might be needed?
Thanks.