Use of registers: an odd problem

Hi,

I encountered an odd problem with register usage in my kernel function.

My kernel currently uses 46 registers, which seems like a lot, since it will keep occupancy low. So I wanted to reduce the register count.

I assumed that the register count comes from the local variables I define in the kernel. But no matter how many local variables I added or removed, the total register count stayed the same. That is odd. Did I miss something here?

Then I tried a crude way to find out which part of the code uses the registers, and was surprised to find that once I disabled one line, which assigns the value of a local variable to a global array element (the array resides in page-locked memory), the register count dropped to 0. That is even odder, isn’t it?

I am totally confused by what I see here. Any clues? Thanks!

P.S. The command I use to get the register count is “nvcc -c -O3 -arch sm_13 --ptxas-options=-v”.

It is possible that the compiler’s optimisations create local variables of their own.

You removed the code that “…assigns the value of a local variable to a global array element…”
Let me guess: that was the result of the computation.

There’s an optimization called dead code elimination. The compiler concluded that your entire kernel was now
performing work that was unnecessary (as the result was never stored), so the entire kernel
was optimized away. → 0 registers.
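
To make the effect concrete, here is a minimal sketch with two made-up kernels (with_store and without_store are hypothetical names, not the original poster's code). They do the same arithmetic, but only the first one writes its result to global memory. Compiling with the command from the first post should show a clearly lower register count for the second kernel. The exact numbers depend on the toolkit version and target architecture, so treat this as an illustration rather than a guarantee.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: does some arithmetic and stores the result.
__global__ void with_store(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 16; ++k)
            acc += in[i] * (float)k;
        out[i] = acc;      // this store keeps the computation above "live"
    }
}

// Same arithmetic, but the result is never written anywhere observable,
// so the compiler is free to delete the loads and the loop as dead code.
__global__ void without_store(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 16; ++k)
            acc += in[i] * (float)k;
        // out[i] = acc;   // disabled: nothing depends on acc any more
    }
}
```

Compare the two “Used N registers” lines that ptxas prints for this file. The usual way to keep a kernel alive while experimenting is to leave the final store in place and disable other parts instead.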

I have not found anyone able to explain how the compiler decides to allocate registers. The majority of your registers are likely used for values the compiler creates itself, not for variables you explicitly declared in the kernel. Even worse, similar to what you’ve experienced, if I remove a variable I explicitly declared, the compiler often decides to create another, just to drive me nuts. I would love to know exactly what goes into each register, when and where. You can cap the register count with a compile-time option, but it seems to just spill the excess into local memory, which is horrible.

Bottom line: there must be a way to write code so that the compiler does not feel the need to use so many registers. But I haven’t found anybody who knows how.
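
This does not show what lives in each register, but the runtime can at least report the totals for a compiled kernel. The sketch below uses cudaFuncGetAttributes on a stand-in kernel (scale and report_usage are hypothetical names for illustration). If the local memory size becomes nonzero only after you lower the register cap, that is a reasonable sign the compiler has started spilling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute your own.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Print the per-thread resource usage recorded for the compiled kernel.
// Returns the CUDA error code so the caller can check it.
cudaError_t report_usage()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) return err;

    printf("registers per thread  : %d\n", attr.numRegs);
    printf("local memory (bytes)  : %zu\n", attr.localSizeBytes);
    printf("static shared (bytes) : %zu\n", attr.sharedSizeBytes);
    return cudaSuccess;
}
```

Call report_usage() once from host code after building with and without the register cap, and compare the two printouts.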

  1. --maxrregcount=N (trade registers against slow local memory access)

  2. use of the “volatile trick” (search the forums)

  3. limit the scope of local variables, recompute index variables within local scopes, and try to break up your algorithm into separate functional blocks, each within a separate local scope (e.g. curly brackets), even if it means more memory accesses to re-load data.

  4. use as much shared memory as is available (trade registers vs. shared memory)

  5. split your algorithm into several smaller kernels

Points 2) to 4) helped me get a critical and relatively complex algorithm for radio interference simulation below the 16-register limit, so I can run 512 threads on Compute 1.1 hardware. Before doing this optimization I had around 20 registers.

Getting down from 46 registers to 32 may be NP-hard ;)

Christian
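
As a rough illustration of points 2) to 4), here is a minimal sketch. It is not the radio interference code from the post. The kernel staged, the helper run_staged and the block size BLOCK are all hypothetical, and the volatile declaration reflects one common reading of the forum trick (forcing a value to be computed once and kept, instead of being recomputed at each use). Whether any of this actually lowers the register count depends heavily on the compiler version, so check the ptxas output before and after.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256   // assumed block size; the kernel must be launched with it

__global__ void staged(const float* in, float* out, int n)
{
    // Point 4: park intermediates in shared memory instead of registers.
    __shared__ float partial[BLOCK];

    // Point 2: one reading of the "volatile trick".
    volatile int tid = threadIdx.x;

    {   // Point 3, stage 1: i and x exist only inside this scope.
        int i = blockIdx.x * blockDim.x + tid;
        float x = (i < n) ? in[i] : 0.0f;
        partial[tid] = x * x;
    }
    __syncthreads();

    {   // Stage 2: recompute the index rather than carrying it across stages.
        int i = blockIdx.x * blockDim.x + tid;
        int neighbour = (tid + 1) % BLOCK;
        if (i < n) out[i] = partial[tid] + partial[neighbour];
    }
}

// Hypothetical host helper: copy in, run the kernel, copy the result back.
cudaError_t run_staged(const float* h_in, float* h_out, int n)
{
    float *d_in = 0, *d_out = 0;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    staged<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

Each output element is the square of its input plus the square of the next input in the same block, which is just enough work to give the two stages something to hand over through shared memory.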

Thanks all!
Yes, it is driving me crazy that I have no way to control which registers get used.
I want my local variables to reside in registers because they are accessed frequently.
The disabled line that drastically reduces the register count from 46 to 0 is the last line of my code, which stores the local variable’s value to the global array. It looks like the compiler is smart enough to detect that my kernel is just “dead code” when no value is written out, so none of the local variables need to reside in registers.
I tried decuda, but I find the output hard to interpret. Too many tricks to figure out ~~~ :wacko:

If you are at liberty to post your kernel code as a standalone, compilable .cu file, I may be able to hack on it a bit and maybe lower the register use.
