Are registers re-used across different __device__ functions?

I have followed some of the threads discussing the register usage problem, but I’m not sure whether this issue has been addressed in any of them.

Here is my question. If I declare some (register) variables in several device functions, and those functions are called by a global function, will the CUDA compiler allocate the same set of registers to these variables in order to save registers?

The example code looks like this:

__device__ int f1(int a)
{
    int a1, a2, a3;
    // ...
}

__device__ int f2(int a)
{
    int a1, a2, a3;
    // ...
}

__global__ void f0(int a)
{
    a = f1(a);
    a = f2(a);
}

The question is: how many registers will the CUDA compiler allocate to the variables a1, a2, a3 in f1() and f2()? 3 or 6?

You can compile this with one of the options to preserve the .cubin file and find out :)

My understanding is that all the functions get inlined, which should help with making an estimate. There is a good chance the registers will be reused. It is still probably best to try it out, though.

In my experience, register allocation is performed after function inlining, so the registers are reused.

As wildcat4096 already said, the best way to find this out is to check the .cubin file for the actual register usage.
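A minimal sketch of that experiment, assuming a toolkit where nvcc accepts --ptxas-options=-v and -keep. The file name reg_test.cu, the function bodies, and the out parameter are hypothetical; they are added only so that the compiler cannot discard the work as dead code. The functions are marked __host__ __device__ here purely so they can also be checked on the CPU.

```cuda
// Build with:  nvcc --ptxas-options=-v -c reg_test.cu
// ptxas then prints the number of registers it allocated for f0.
// Adding -keep preserves the intermediate .ptx and .cubin files.

// Hypothetical bodies: each function defines three temporaries.
__host__ __device__ int f1(int a)
{
    int a1 = a + 1, a2 = a * 2, a3 = a - 3;
    return a1 * a2 + a3;
}

__host__ __device__ int f2(int a)
{
    int a1 = a ^ 5, a2 = a + 7, a3 = a * 3;
    return a1 + a2 * a3;
}

// The result is stored to memory so the calls are not optimized away.
__global__ void f0(int a, int *out)
{
    a = f1(a);
    a = f2(a);
    *out = a;
}
```

If the temporaries of f1 are no longer live by the time f2 runs, the count reported for f0 should not grow simply because both functions declare their own a1, a2, a3. The exact number depends on the toolkit version and target architecture.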

I’m not sure I understand your question, but I strongly recommend using PTX to check your assembly code; that will be more reliable than whatever we say.

PTX files say nothing about actual register allocation.

Thanks for the answer.

But even if the .cubin file tells us the register allocation, the register arrangement could still be altered when the .cubin is further compiled into object code.

I mean: can we fully trust the information in .cubin?

I really hope so :)

AFAIK, this is the only documented way to know how many registers and how much smem/lmem a kernel uses.

Registers are re-used where possible. It is ptxas that handles the real register allocation, not nvcc; the PTX that nvcc emits simply introduces a new virtual register for every new assignment.

AndreiB: You can trust the .cubin file, as the internal format of the instructions is quite different from ptx, and another virtual machine in between would be a waste.
Also, the smem, lmem, reg, and bar values defined in the cubin are the ones the runtime API actually uses.
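For toolkits whose runtime API provides cudaFuncGetAttributes, the same per-kernel figures can also be read back from a program. The following is only a sketch under that assumption; the kernel body and the helper name registersUsedByF0 are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only as something to query.
__global__ void f0(int a, int *out)
{
    *out = a * a + 1;
}

// Returns the per-thread register count the runtime reports for f0,
// or -1 if the query fails (for example, when no CUDA device is present).
int registersUsedByF0()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, f0);
    if (err != cudaSuccess) {
        return -1;
    }
    std::printf("regs=%d smem=%zu lmem=%zu\n",
                attr.numRegs, attr.sharedSizeBytes, attr.localSizeBytes);
    return attr.numRegs;
}
```

The value returned should agree with the reg entry in the .cubin for the same build, which gives a simple cross-check of the two sources.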