Unsure about the register counts that profilers report for my CUDA application

Hello to all,

I am confused about extracting the register usage of my CUDA application.

I have 3 ways to extract register usage:

  1. compiling my CUDA app with nvcc -Xptxas="-v"
  • this gives me 2 registers per thread for a CUDA kernel that does nothing
  2. using nvprof --print-gpu-trace
  • this gives me 16 registers used per thread for the same kernel
  3. using nvvp
  • this gives me 16 registers used per thread for the same kernel

Now my question is: which value is correct?

I think that nvprof and nvvp give me the theoretical number of registers, which can also be calculated with the NVIDIA occupancy calculator, and not the real number of registers used by every thread.

So if the correct number of registers is 2, why does that happen for a kernel that does nothing?

Additionally, how is it possible for every thread to use the same number of registers? Is this number the average of the registers used by every thread?

Finally, is there any other way to extract the number of registers used by a kernel?

Thank you very much!

The profiler values are correct.

Registers are allocated with a certain granularity, and that granularity varies by GPU.

Kernels that “do nothing” are not actually doing nothing.

You can discover what they are doing, as well as specific register usage, using the cuda binary utilities.

Every thread is doing the same thing (they are all running the same code) so they use the same number of registers.

https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

Start with:

cuobjdump -sass myexe
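As a complement to the binary utilities, the compile-time register count can also be read from inside a program. The sketch below assumes the CUDA runtime call cudaFuncGetAttributes and a placeholder kernel named emptyKernel (a hypothetical stand-in, not the poster's code). It reports the count assigned by the compiler, which is not necessarily the number the hardware ends up allocating.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute your own.
__global__ void emptyKernel(int *c)
{
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, emptyKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // numRegs is the per-thread register count chosen at compile time.
    printf("registers per thread (compile time): %d\n", attr.numRegs);
    return 0;
}
```

This should agree with what nvcc -Xptxas="-v" prints for the same kernel and target architecture.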

Hi Robert_Crovella and thank you for your response!

For the same kernel I ran:
nvcc -cubin app.cu
in order to generate the cubin file, and then:
cuobjdump -res-usage ./extract
in order to see the registers per thread. The output is:
REG:2 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:356 TEXTURE:0 SURFACE:0 SAMPLER:0

That means there are 2 registers per thread (unlike the profilers' values), matching the compiler's value (nvcc -Xptxas="-v"). Does that mean this is the real number of registers used?

Additionally, I ran:
cuobjdump ./extract -sass

in order to see the assembly of the kernel, which is:

code for sm_30
	Function : _Z10matrix_mulPiS_S_iii
.headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                      /* 0x200000000002f307 */
    /*0008*/                   MOV R1, c[0x0][0x44];  /* 0x2800400110005de4 */
    /*0010*/                   EXIT;                  /* 0x8000000000001de7 */
    /*0018*/                   BRA 0x18;              /* 0x4003ffffe0001de7 */
    /*0020*/                   NOP;                   /* 0x4000000000001de4 */
    /*0028*/                   NOP;                   /* 0x4000000000001de4 */
    /*0030*/                   NOP;                   /* 0x4000000000001de4 */
    /*0038*/                   NOP;                   /* 0x4000000000001de4 */
	..................................

and in the assembly I can see only one register being used (R1).

Finally, given those outputs, is the number of registers used per thread 2, and if so, why can we see only one register in the assembly?

One last question:
If a kernel contains an if statement like (if threadid==0), that means that the thread with id=0 will do more operations than the others, right? So how is it possible for every thread to use the same number of registers?

Thank you again!

The profilers are correct.

I suggested the cuobjdump tool to study the SASS code. I did not suggest the -res-usage switch. You already have that information, and I suggest you internalize the idea that the machine's runtime behavior and the behavior that can be determined at compile time are different.

The number of registers required by the code is 2. The number of registers the machine will actually allocate for each thread is not necessarily 2. It will be 2 or some higher number, depending on register allocation granularity. You can get the register allocation granularity either from the occupancy calculator spreadsheet, the CUDA occupancy API, or the profilers.
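The rounding described above can be sketched as simple arithmetic. The helper name roundUpRegs and the granularity value of 16 are assumptions for illustration only: 16 is chosen because it matches the profiler figure reported in this thread, and the real allocation unit depends on the GPU.

```cuda
// Host-only helper (hypothetical name): round a compile-time register
// count up to the next multiple of an allocation granularity.
constexpr int roundUpRegs(int regsPerThread, int granularity)
{
    return ((regsPerThread + granularity - 1) / granularity) * granularity;
}

// Assumed granularity of 16, taken from the profiler values quoted in
// this thread; it is not a documented constant for every GPU.
static_assert(roundUpRegs(2, 16) == 16, "2 registers round up to 16");
static_assert(roundUpRegs(16, 16) == 16, "an exact multiple is unchanged");
static_assert(roundUpRegs(17, 16) == 32, "17 registers round up to 32");
```

Under that assumption, a kernel compiled to 2 registers per thread would show up as 16 in the profiler, which is consistent with the numbers above.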

I’m not able to explain anything about code you haven’t shown.

Regarding your last question, I don’t think you understand the compilation process, and the difference between compilation and runtime.

At compile time, the compiler has no idea what the state of any variable is, nor what thread is running, or anything like that.

The compiler output handles both the if and else paths, because some threads will follow one path and some will follow the other. Registers may be used for either or both paths.

Conceptually, I don’t think this is any different than a CPU host code compiler. The compiler does not know which path the code will take. It must create a valid execution environment regardless.
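To make the branch case concrete, here is a minimal sketch with a hypothetical kernel named branchy in which thread 0 takes a longer path than the others. The kernel, its name, and the launch configuration are assumptions for illustration. The point is that the compiler emits one body of code covering both paths, so a single register count applies to every thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 does extra work, the rest take the else path.
__global__ void branchy(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid == 0) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += i * i;
        out[0] = sum;
    } else if (tid < n) {
        out[tid] = tid;
    }
}

int main()
{
    const int n = 32;
    int *d_out = nullptr;
    int h_out[n];

    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return 1;
    branchy<<<1, n>>>(d_out, n);
    if (cudaMemcpy(h_out, d_out, n * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d_out);

    // One register count for the whole kernel, whichever path a thread takes.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, branchy) != cudaSuccess) return 1;
    printf("out[0]=%d out[1]=%d registers per thread: %d\n",
           h_out[0], h_out[1], attr.numRegs);
    return 0;
}
```

Compiling this with nvcc -Xptxas="-v" prints a single register figure for branchy, not one per thread or per branch.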

Robert_Crovella, thank you again for your response.

Indeed, I had not internalized the idea that the number of registers at compile time is different from the number at runtime, and thank you for that.

Could you please check if I have understood everything on a specific CUDA kernel in order to clarify everything?

I launch the following kernel with only 1 thread:
__global__ void kernel(int *c)
{
    int x, y, z;
    x = 1;
    y = 2;
    z = 3;
    c[0] = 10;
}

With these techniques:
1) compiling with -Xptxas="-v", I get 5 registers per thread
2) with cuobjdump -res-usage ./extract, I get 5 registers per thread again
so 5 registers is the number that the compiler says every thread will use.

But at runtime exactly 5 registers cannot be allocated, because registers are allocated in quanta (16, 32… 255). In reality, every thread is allocated the smallest quantum that is at least as large as the number of registers it will actually use.

And indeed, the profiler gives me 16 registers per thread, and no thread needs more than 16 registers.

Is that correct?
If there were even one thread using 17 registers, every thread would be allocated 32 registers.

Have I understood it?

Yes, you are getting the idea.

It may be useful to state once again that threads don’t allocate registers, nor do threads get to decide how many registers they will use. The allocation of registers is done when a block becomes resident on a SM, and the allocation is not done by the thread itself. The number of registers a thread will use is assigned by the compiler.
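Since allocation happens per block on an SM, it can help to look at the register budget the hardware exposes and at how many blocks fit. The following is only a sketch: it assumes device 0, a block size of 256 threads, and a placeholder kernel, and uses the runtime calls cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, modeled on the one discussed in this thread.
__global__ void kernel(int *c)
{
    c[0] = 10;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    printf("registers per SM:        %d\n", prop.regsPerMultiprocessor);
    printf("max registers per block: %d\n", prop.regsPerBlock);

    const int blockSize = 256;  // assumed launch configuration
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, kernel, blockSize, 0) != cudaSuccess) return 1;
    printf("resident blocks per SM at %d threads/block: %d\n",
           blockSize, blocksPerSM);
    return 0;
}
```

The occupancy result takes the kernel's register requirement into account along with the other per-SM limits, so it reflects what the machine will actually do at block residency time.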

It’s also not very rational to bounce the idea around that different threads may need a differing number of registers. There is only one thread code (your kernel) and the number of registers that will be used by that code is determined by the compiler. Threads don’t make their own decisions in any of this.

Thank you very much!
You helped me a lot!

Is there any way to extract the same information about register usage in an OpenCL application?