SM_20 register usage

I am (really for the first time I guess) working on code which will be deployed on both compute 1.3 and 2.0 hardware. Having developed the code on Fermi, I just built it for a GT200 deployment and was amazed at the difference in register usage between the two builds on certain kernels with the 3.2 toolkit.

For single precision builds and compute 1.3 targets:

$ nvcc --cubin -Xptxas="-v" -gencode arch=compute_13,code=sm_13 fim.cu -o fim.cubin

ptxas info    : Compiling entry function '_Z10fimIterateIfLj8EEvPKT_PKiPS0_PiS0_S0_S0_jjji' for 'sm_13'

ptxas info    : Used 12 registers, 1296+16 bytes smem, 24 bytes cmem[1]

ptxas info    : Compiling entry function '_Z6fimTagIfLj8EEvPKT_S2_PiS3_jjj' for 'sm_13'

ptxas info    : Used 8 registers, 556+16 bytes smem, 16 bytes cmem[1]

ptxas info    : Compiling entry function '_Z7fimTag2IfLj8EEvPT_PiS2_S0_S0_S0_jjj' for 'sm_13'

ptxas info    : Used 12 registers, 1792+16 bytes smem, 24 bytes cmem[1]

and the same for compute 2.0 targets:

$ nvcc --cubin -Xptxas="-v" -gencode arch=compute_20,code=sm_20 fim.cu -o fim.cubin

ptxas info    : Compiling entry function '_Z10fimIterateIfLj8EEvPKT_PKiPS0_PiS0_S0_S0_jjji' for 'sm_20'

ptxas info    : Used 23 registers, 1236+0 bytes smem, 92 bytes cmem[0], 12 bytes cmem[16]

ptxas info    : Compiling entry function '_Z6fimTagIfLj8EEvPKT_S2_PiS3_jjj' for 'sm_20'

ptxas info    : Used 8 registers, 512+0 bytes smem, 76 bytes cmem[0]

ptxas info    : Compiling entry function '_Z7fimTag2IfLj8EEvPT_PiS2_S0_S0_S0_jjj' for 'sm_20'

ptxas info    : Used 21 registers, 1744+0 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]

Two of the kernels are using nearly twice the number of registers, while the third is basically identical. For double precision builds, the situation is even worse. I don’t remember things being this different when I was porting stuff to Fermi circa the 3.0 toolkit release. Is this really how it should be?

I guess you could test the assembler by adding __launch_bounds__() to your kernels with a high thread count per block and a high number of concurrent blocks per MP. Then it should scale the register usage down towards the sm_13 numbers. 32k / 12 / 8 ≈ 340, so if you give __launch_bounds__(320, 8) it should also produce 12 regs per thread for that first kernel, if __launch_bounds__() works as advertised ;)
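
As a minimal sketch (hypothetical kernel, not the actual fim code), the qualifier goes right after the __global__ declaration. With 320 threads/block and 8 blocks/SM, ptxas has to fit 320 * 8 = 2560 threads into the 32768-register file of an sm_20 multiprocessor, i.e. at most 12 registers per thread:

__global__ void
__launch_bounds__(320, 8)        // max 320 threads per block, at least 8 resident blocks per SM
fimIterateSketch(const float *in, float *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];  // placeholder body
}
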

In general, the absolute increase in the number of registers seen here does not strike me as particularly unusual for going from sm_1x to sm_2x. Does the discrepancy in register count decrease if you add the switches -ftz=true -prec-sqrt=false -prec-div=false? The IEEE-compliant single-precision division, reciprocal, and square root operations that are default for sm_2x compiles require more registers than the approximate versions. You can also try turning off the ABI with -Xptxas -abi=no, but normally the ABI only requires one additional register. I do not recommend turning off the ABI for actual production code, but this may be useful as an experiment.
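
For example, the switches just slot into the compile line from the original post:

$ nvcc --cubin -Xptxas="-v" -gencode arch=compute_20,code=sm_20 -ftz=true -prec-div=false -prec-sqrt=false fim.cu -o fim.cubin
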

Norbert’s suggestion about the IEEE single-precision functions was spot on - switching to the approximate versions reduced the register count by 5 in both kernels. There are several sqrt() calls in both of the kernels whose register counts increased.

I also noticed a drop in throughput for the same kernel on Fermi when compiled with sm_20 as opposed to no arch (which, I think, gives sm_10). Maybe the same reason??

I never said there was a drop in throughput. Quite the opposite, in fact. The same code is about three times faster on Fermi than the GT200, despite the higher register usage and lower occupancy. This code really benefits from the Fermi L1 cache.

Yeah. I have a function that computes the L2 norm of a vector. Commenting out the sqrt() causes the calling kernel to go from 35 down to 12 registers. So needing to call sqrt() once from a single thread is costing 23 extra registers for ~every thread, screwing up my occupancy. SPMD is both a curse and a blessing.
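
Roughly the pattern in question (a simplified sketch, not the real code): a block-wide L2 norm where only thread 0 ever executes sqrtf(), but since register allocation is uniform across the kernel, every thread pays for the registers the sqrt path needs.

// BLOCK is assumed to be a power of two for the tree reduction
template <unsigned int BLOCK>
__global__ void l2NormSketch(const float *v, float *result, unsigned int n)
{
    __shared__ float partial[BLOCK];

    // per-thread partial sum of squares
    float sum = 0.0f;
    for (unsigned int i = threadIdx.x; i < n; i += BLOCK)
        sum += v[i] * v[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // tree reduction in shared memory
    for (unsigned int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // only one thread calls sqrtf(), but the register cost is kernel-wide
    if (threadIdx.x == 0)
        *result = sqrtf(partial[0]);
}
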

I thought it mostly had to do with the fact that Fermi uses 64-bit pointers now and thus needs the extra register space.

Also, someone suggested to me in the past to move to 3.1 or 3.2, which had fewer register pressure issues than 3.0.

eyal

If you use a 64-bit host, device pointers have always been 64-bit. Internally, pre-Fermi cards only used 32 bits of the pointer, but sizeof(void *) on the host and GPU has always been the same for 64-bit hosts.
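
A quick way to convince yourself (hypothetical snippet): sizeof reported from device code matches the host on a 64-bit build, even when targeting sm_13.

__global__ void ptrSizeKernel(unsigned int *out)
{
    *out = (unsigned int)sizeof(void *);   // 8 on a 64-bit build, even for sm_13
}
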

This was all done using CUDA 3.2, and it was the IEEE-compliant single precision that made the difference with this particular code.

I would assume that, while the pointer size in memory was 64 bits so that host and device sizes matched, only 32 bits were loaded into registers, i.e. only one register per pointer was used.