Different register counts from the same program

Hi all,
I was running a kernel with 512 threads and 128 blocks. The register count from the cubin file generated with the -keep option in a makefile is
name = __globfunc__Z8flo_tfmuP6float2S0_ii
lmem = 0
smem = 8224
reg = 11
bar = 1

whereas the cubin file generated with the command below reports
nvcc -cubin -I ~uns/NVIDIA_CUDA_SDK/common/inc/ -O3 main.cu
name = __globfunc__Z8flo_tfmuP6float2S0_ii
lmem = 56
smem = 8224
reg = 23
bar = 1

Which one is the true value, and why do the two differ? Can somebody clarify, please?

The cubin shows the true register count. nvcc displays the number of registers allocated sequentially in the PTX (without re-use), whereas the final cubin re-uses registers where possible. The register optimization happens during the translation from PTX into the cubin assembly.
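If your toolkit provides cudaFuncGetAttributes, the final per-thread resource usage can also be read back at run time instead of from the cubin text. Below is a minimal sketch under that assumption; demo_scale_kernel and query_kernel_resources are hypothetical names, not the poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the poster's actual kernel is not shown in the thread.
__global__ void demo_scale_kernel(float2 *out, const float2 *in, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].x = in[i].x * k;
        out[i].y = in[i].y * k;
    }
}

// Hypothetical helper: asks the runtime for the resources of the compiled kernel.
// Returns the CUDA error code so the caller can tell when no device is available.
cudaError_t query_kernel_resources(cudaFuncAttributes *attr)
{
    cudaError_t err = cudaFuncGetAttributes(attr, demo_scale_kernel);
    if (err == cudaSuccess) {
        // numRegs is the register count after the PTX-to-machine-code step,
        // localSizeBytes corresponds to the "lmem" entry of the cubin.
        printf("reg  = %d\n", attr->numRegs);
        printf("lmem = %lu\n", (unsigned long)attr->localSizeBytes);
        printf("smem = %lu\n", (unsigned long)attr->sharedSizeBytes);
    } else {
        printf("query failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

The same numbers are normally also printed at compile time when nvcc is run with --ptxas-options=-v.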

It is possible to lower the register count already in the PTX code: search for the volatile keyword in this forum. This often results in lower register usage in the final cubin code (no machine optimizer is perfectly efficient, so lowering the complexity of the problem fed into the optimizer may help).
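A rough sketch of what the volatile trick usually looks like is given here. The function and kernel names are hypothetical, and whether it changes the register count at all depends on the compiler version, so compare the cubin before and after.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper, not taken from the poster's program.
__host__ __device__ inline float2 scale_and_offset(float2 v, int k, int offset)
{
    // Declaring the intermediate as volatile asks the compiler to compute it
    // once and keep it as a single value, instead of re-deriving the
    // expression at every use.
    volatile float factor = static_cast<float>(k) * 0.5f;
    float f = factor;  // single read of the volatile value

    float2 r;
    r.x = v.x * f + offset;
    r.y = v.y * f + offset;
    return r;
}

// Hypothetical kernel using the same idea for the thread index.
__global__ void volatile_demo_kernel(float2 *out, const float2 *in,
                                     int n, int k, int offset)
{
    volatile int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scale_and_offset(in[i], k, offset);
    }
}
```

The result of the computation is unchanged by the keyword; only the code the compiler emits may differ.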

Christian

Oh, thanks, but both sets of numbers come from cubin files.

It looks like something went wrong in your second compilation (with -O3): it is unusual for so much local memory to be used unless you limited the register count manually (which, as far as I can see, you didn't).

Perhaps it was caused by the compiler optimizations, or the compiler guessed the memory space incorrectly. In my experience, local memory sometimes blew up when I passed volatile variables by reference to a function, or accessed arrays allocated in register memory with non-constant indices.

It would be nice if you could post your code, and perhaps also its decuda disassembly.
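To illustrate the non-constant index case mentioned above, here is a small sketch with hypothetical function names. Both functions return the same value, but the first indexes a per-thread array with a run-time value, which typically forces the array into local memory, while the second only uses indices that are compile-time constants once the loop is unrolled.

```cuda
#include <cuda_runtime.h>

// Run-time index: the compiler generally has to make w addressable,
// so it tends to be placed in local memory (lmem > 0 in the cubin).
__host__ __device__ inline float sum_dynamic(float x, int start)
{
    float w[4] = {x, 2.0f * x, 3.0f * x, 4.0f * x};
    float s = 0.0f;
    for (int t = 0; t < 2; ++t) {
        s += w[(start + t) & 3];
    }
    return s;
}

// Constant indices: after unrolling, every access is w[0] .. w[3],
// so the array elements can stay in registers.
__host__ __device__ inline float sum_static(float x, int start)
{
    float w[4] = {x, 2.0f * x, 3.0f * x, 4.0f * x};
    float s = 0.0f;
#pragma unroll
    for (int j = 0; j < 4; ++j) {
        // Select by comparison instead of by index:
        // add w[j] when j is start or start + 1 (modulo 4).
        int d = (j - start) & 3;
        if (d < 2) {
            s += w[j];
        }
    }
    return s;
}

// Hypothetical kernel so that both variants can be compared in the cubin.
__global__ void index_demo_kernel(float *out, const float *in, int n, int start)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sum_dynamic(in[i], start) + sum_static(in[i], start);
    }
}
```

Compiling each variant separately and comparing the lmem entries should show whether the dynamic indexing is what pushes the array out of registers with a given compiler.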