Two questions about the maxrregcount parameter of nvcc

Hi,

nvcc without a register limit:

nvcc -I/opt/cuda/include/ --ptxas-options=-v  -c tf_72bit.cu -o tf_72bit.o

ptxas info	: Compiling entry function '_Z8mfakt_71j5int72Pji6int144S0_' for 'sm_10'

ptxas info	: Used 16 registers, 64+16 bytes smem, 44 bytes cmem[1]

nvcc -I/opt/cuda/include/ --ptxas-options=-v  -c tf_96bit.cu -o tf_96bit.o

ptxas info	: Compiling entry function '_Z8mfakt_95j5int96Pji6int192S0_' for 'sm_10'

ptxas info	: Used 17 registers, 64+16 bytes smem, 28 bytes cmem[1]

nvcc -I/opt/cuda/include/ --ptxas-options=-v  -c tf_96bit.cu -o tf_96_75bit.o -DSHORTCUT_75BIT

ptxas info	: Compiling entry function '_Z11mfakt_95_75j5int96Pji6int192S0_' for 'sm_10'

ptxas info	: Used 16 registers, 64+16 bytes smem, 28 bytes cmem[1]

nvcc with a register limit (--maxrregcount=16):

nvcc -I/opt/cuda/include/ --ptxas-options=-v --maxrregcount=16 -c tf_72bit.cu -o tf_72bit.o

ptxas info	: Compiling entry function '_Z8mfakt_71j5int72Pji6int144S0_' for 'sm_10'

ptxas info	: Used 16 registers, 64+16 bytes smem, 44 bytes cmem[1]

nvcc -I/opt/cuda/include/ --ptxas-options=-v --maxrregcount=16 -c tf_96bit.cu -o tf_96bit.o

ptxas info	: Compiling entry function '_Z8mfakt_95j5int96Pji6int192S0_' for 'sm_10'

ptxas info	: Used 15 registers, 8+0 bytes lmem, 64+16 bytes smem, 28 bytes cmem[1]

nvcc -I/opt/cuda/include/ --ptxas-options=-v --maxrregcount=16 -c tf_96bit.cu -o tf_96_75bit.o -DSHORTCUT_75BIT

ptxas info	: Compiling entry function '_Z11mfakt_95_75j5int96Pji6int192S0_' for 'sm_10'

ptxas info	: Used 15 registers, 4+0 bytes lmem, 64+16 bytes smem, 28 bytes cmem[1]

In the second case: why does register usage drop to 15 per thread for the 2nd and 3rd kernels, even though the limit is set to 16? Is this usual behavior?

Second question:

AFAIK I can compile a kernel for multiple architectures at once (e.g. sm_11 and sm_20). Is it possible to have different register limits for the different code paths? For my code, limiting register usage to 16 is beneficial on GPUs with compute capability 1.1, but on “Fermi” (compute capability 2.0) I get better performance with a higher limit (e.g. 24) or with no limit at all.
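
A per-kernel alternative to the global --maxrregcount switch would presumably be the __launch_bounds__() qualifier described in the programming guide: since device code is compiled separately for each target architecture, the bounds can be varied with a __CUDA_ARCH__ guard. A minimal sketch with a placeholder kernel and made-up bounds (nothing below is taken from my actual code):

#if __CUDA_ARCH__ >= 200
  /* Fermi: looser bounds, leave the register allocator more freedom */
  #define KERNEL_MAX_THREADS 256
  #define KERNEL_MIN_BLOCKS    1
#else
  /* compute capability 1.x: tighter bounds */
  #define KERNEL_MAX_THREADS 256
  #define KERNEL_MIN_BLOCKS    2
#endif

/* hypothetical kernel, only to show where the qualifier goes */
__global__ void __launch_bounds__(KERNEL_MAX_THREADS, KERNEL_MIN_BLOCKS)
dummy_kernel(const int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2 * in[i];
}

Requesting 2 blocks of 256 threads per multiprocessor should push ptxas toward roughly 16 registers per thread on a compute capability 1.1 device (8K registers per multiprocessor), while the Fermi branch leaves it essentially unconstrained.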

Oliver

The output shows that with the 16-register cap, 2 registers (8 bytes of lmem) were spilled to local memory for your 2nd kernel and 1 register (4 bytes) for your 3rd kernel. That is really bad, since local memory resides in global memory, so every spill turns into (possibly uncoalesced) global memory reads and writes.

I’ve seen this behavior before and would assume it comes from the compiler optimizing register usage against your hard upper limit of 16: once it has to spill to stay under the cap, it cannot always land exactly on the limit, so it may end up a register or two below it.
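
As a toy illustration (the kernel and the cap below are made up, not taken from this thread), keeping more values live at once than the cap allows may force ptxas to place the overflow in local memory, which then shows up as the "N+0 bytes lmem" figures above:

/* hypothetical example: several simultaneously live values under a tight cap */
__global__ void spill_demo(const float *in, float *out)
{
    int base = blockIdx.x * blockDim.x + threadIdx.x;

    /* these loads all stay live until the final sum; with a register cap
       well below what the kernel naturally needs, ptxas may spill some of
       them, reported as lmem in the --ptxas-options=-v output */
    float v0 = in[base + 0 * 1024], v1 = in[base + 1 * 1024];
    float v2 = in[base + 2 * 1024], v3 = in[base + 3 * 1024];
    float v4 = in[base + 4 * 1024], v5 = in[base + 5 * 1024];
    float v6 = in[base + 6 * 1024], v7 = in[base + 7 * 1024];
    float v8 = in[base + 8 * 1024], v9 = in[base + 9 * 1024];

    out[base] = v0*v1 + v2*v3 + v4*v5 + v6*v7 + v8*v9;
}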

kynan