SM_20 register usage

I am (really for the first time I guess) working on code which will be deployed on both compute 1.3 and 2.0 hardware. Having developed the code on Fermi, I just built it for a GT200 deployment and was amazed at the difference in register usage between the two builds on certain kernels with the 3.2 toolkit.

For single precision builds and compute 1.3 targets:

$ nvcc --cubin -Xptxas="-v" -gencode arch=compute_13,code=sm_13 fim.cu -o fim.cubin

ptxas info    : Compiling entry function '_Z10fimIterateIfLj8EEvPKT_PKiPS0_PiS0_S0_S0_jjji' for 'sm_13'

ptxas info    : Used 12 registers, 1296+16 bytes smem, 24 bytes cmem[1]

ptxas info    : Compiling entry function '_Z6fimTagIfLj8EEvPKT_S2_PiS3_jjj' for 'sm_13'

ptxas info    : Used 8 registers, 556+16 bytes smem, 16 bytes cmem[1]

ptxas info    : Compiling entry function '_Z7fimTag2IfLj8EEvPT_PiS2_S0_S0_S0_jjj' for 'sm_13'

ptxas info    : Used 12 registers, 1792+16 bytes smem, 24 bytes cmem[1]

and the same for compute 2.0 targets:

$ nvcc --cubin -Xptxas="-v" -gencode arch=compute_20,code=sm_20 fim.cu -o fim.cubin

ptxas info    : Compiling entry function '_Z10fimIterateIfLj8EEvPKT_PKiPS0_PiS0_S0_S0_jjji' for 'sm_20'

ptxas info    : Used 23 registers, 1236+0 bytes smem, 92 bytes cmem[0], 12 bytes cmem[16]

ptxas info    : Compiling entry function '_Z6fimTagIfLj8EEvPKT_S2_PiS3_jjj' for 'sm_20'

ptxas info    : Used 8 registers, 512+0 bytes smem, 76 bytes cmem[0]

ptxas info    : Compiling entry function '_Z7fimTag2IfLj8EEvPT_PiS2_S0_S0_S0_jjj' for 'sm_20'

ptxas info    : Used 21 registers, 1744+0 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]

Two of the kernels are using nearly twice the number of registers, while the third is basically identical. For double precision builds, the situation is even worse. I don’t remember things being this different when I was porting stuff to Fermi circa the 3.0 toolkit release. Is this really how it should be?

I guess you could test the assembler by adding __launch_bounds__() to your kernels with a high thread count per block and a high number of concurrent blocks per MP. Then it should scale the register usage down towards the sm_13 numbers. 32k / 12 / 8 ≈ 340, so if you give __launch_bounds__(320, 8) it should also produce 12 regs per thread for that first kernel, if __launch_bounds__() works as advertised ;)
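
As a minimal sketch (hypothetical kernel, not the actual fim code), the qualifier goes right after the __global__ declaration. With 320 threads/block and 8 blocks/SM, ptxas has to fit 320 * 8 = 2560 threads into the 32768-register file of an sm_20 multiprocessor, i.e. at most 12 registers per thread:

__global__ void
__launch_bounds__(320, 8)        // max 320 threads per block, at least 8 resident blocks per SM
fimIterateSketch(const float *in, float *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];  // placeholder body
}
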

In general, the absolute increase in the number of registers seen here does not strike me as particularly unusual for going from sm_1x to sm_2x. Does the discrepancy in register count decrease if you add the switches -ftz=true -prec-sqrt=false -prec-div=false? The IEEE-compliant single-precision division, reciprocal, and square root operations that are default for sm_2x compiles require more registers than the approximate versions. You can also try turning off the ABI with -Xptxas -abi=no, but normally the ABI only requires one additional register. I do not recommend turning off the ABI for actual production code, but this may be useful as an experiment.
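
For example, the switches just slot into the compile line from the original post:

$ nvcc --cubin -Xptxas="-v" -gencode arch=compute_20,code=sm_20 -ftz=true -prec-div=false -prec-sqrt=false fim.cu -o fim.cubin
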

Norbert’s suggestion about the IEEE single-precision functions was spot on - switching to the approximate versions reduced the register count by 5 in both kernels. There are several sqrt() calls in both of the kernels whose register counts increased.

I also noticed a drop in throughput for the same kernel on Fermi when compiled with sm_20 as opposed to no arch (which, I think, gives sm_10). Maybe the same reason??

I never said there was a drop in throughput. Quite the opposite, in fact. The same code is about three times faster on Fermi than the GT200, despite the higher register usage and lower occupancy. This code really benefits from the Fermi L1 cache.

Yeah. I have a function that computes the L2 norm of a vector. Commenting out the sqrt() causes the calling kernel to go from 35 down to 12 registers. So needing to call sqrt() once from a single thread is costing 23 extra registers for ~every thread, screwing up my occupancy. SPMD is both a curse and a blessing.
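
Roughly the pattern in question (a simplified sketch, not the real code): a block-wide L2 norm where only thread 0 ever executes sqrtf(), but since register allocation is uniform across the kernel, every thread pays for the registers the sqrt path needs.

// BLOCK is assumed to be a power of two for the tree reduction
template <unsigned int BLOCK>
__global__ void l2NormSketch(const float *v, float *result, unsigned int n)
{
    __shared__ float partial[BLOCK];

    // per-thread partial sum of squares
    float sum = 0.0f;
    for (unsigned int i = threadIdx.x; i < n; i += BLOCK)
        sum += v[i] * v[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // tree reduction in shared memory
    for (unsigned int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // only one thread calls sqrtf(), but the register cost is kernel-wide
    if (threadIdx.x == 0)
        *result = sqrtf(partial[0]);
}
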

I thought it mostly had to do with the fact that Fermi uses 64-bit pointers now and thus needs the extra register space.

Also, someone suggested to me in the past to move to 3.1 or 3.2, which had fewer register pressure issues than 3.0.

eyal

If you use a 64-bit host, device pointers have always been 64-bit. Internally, pre-Fermi cards only used 32 bits of the pointer, but sizeof(void *) on the host and GPU has always been the same for 64-bit hosts.
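
A quick way to convince yourself (hypothetical snippet): sizeof reported from device code matches the host on a 64-bit build, even when targeting sm_13.

__global__ void ptrSizeKernel(unsigned int *out)
{
    *out = (unsigned int)sizeof(void *);   // 8 on a 64-bit build, even for sm_13
}
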

This was all done using CUDA 3.2, and it was the IEEE-compliant single precision that made the difference with this particular code.

I would assume that, while the pointer size in memory was 64 bits so that host and device sizes matched, only 32 bits were loaded into registers, i.e. only one register per pointer was used.