Understanding how the compiler assigns registers: checking the .cubin file

Hello everybody!!

Today I ran into an interesting issue. I compiled my matrix multiplication program for a Tesla C870 using the compiler below:

nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2006 NVIDIA Corporation
Built on Fri_Nov_30_04:20:02_PST_2007
Cuda compilation tools, release 1.1, V0.2.1221

Checking the .cubin file, I got a register usage of 14.

After that I compiled the same program on another machine, also with a Tesla C870, using this compiler driver:

nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2007 NVIDIA Corporation
Built on Tue_Jun_10_04:42:57_PDT_2008
Cuda compilation tools, release 1.1, V0.2.1221

With this configuration I got 11 registers.

The program is exactly the same in both cases. How is that possible?

On the other hand, I have used the -maxrregcount flag for this program and limited it to ten registers. I have read that whenever that flag is used, the excess registers are placed in local memory, which makes it a costly solution. Is it always like that? Or can we assume that the compiler has applied an optimization in order to use fewer registers than before?

Cheers

Yes, whenever the -maxrregcount flag is used, excess registers are allocated in local memory. You can observe this by inspecting the cubin file (a .local directive or something similar defines the amount of local memory used).

In general, from my experience, the register allocation algorithm in CUDA is highly nondeterministic ;)

I have had a number of examples where changing one instruction, even changing a condition from ‘>=’ to ‘>’, increased or decreased the number of registers…

It would be really nice if NVIDIA could disclose it, at least partially…

You’re right. For instance, whenever I use a parameter my register count increases, whereas if I use a macro it decreases.
Things like this look nondeterministic, and there has to be an explanation for them.

Any guide for register usage?

Tip: I checked my cubin file after compiling with the -maxrregcount flag set to ten. The amazing thing is that it is not using local memory (in the cubin file lmem=0), so I guess I can assume there was a register usage optimization, can’t I?

Cheers

Yes, if lmem is zero then indeed no local memory is used. However, I guess that if you compile with -O3 optimizations, nvcc should keep the number of registers as low as possible.

You can also disassemble your kernel with the decuda tool to see what happens when you decrease the -maxrregcount value.

For large kernels (split into several functions), inlining everything into one large function and carefully reusing free variables sometimes helped me achieve slightly lower register usage; however, this leaves the code in a completely unmaintainable state.

In general I’d also be very interested in having some “deterministic” guide for register usage…
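
To compare the reg and lmem values discussed above across compiler versions or -maxrregcount settings without opening each cubin by hand, a small script can pull them out. This is only a sketch: it assumes the plain-text cubin layout of these early toolkits, with `name = ...`, `reg = ...` and `lmem = ...` lines inside each `code { ... }` block. The sample text, the kernel name `matrixMul` and the smem value are made up for illustration and are not taken from the poster's file.

```python
import re

# Hypothetical cubin excerpt, for illustration only (not the poster's file).
SAMPLE_CUBIN = """\
code {
    name = matrixMul
    lmem = 0
    smem = 2084
    reg = 10
    bar = 1
}
"""

def kernel_resources(cubin_text):
    """Map each kernel name to its 'reg' and 'lmem' counts.

    Assumes every kernel's 'name = ...' line appears before its
    'reg' and 'lmem' lines, as in the sample above.
    """
    kernels = {}
    current = None
    for line in cubin_text.splitlines():
        match = re.match(r"\s*(name|reg|lmem)\s*=\s*(\S+)", line)
        if not match:
            continue  # smem, bar, bincode data, braces, etc.
        key, value = match.groups()
        if key == "name":
            current = kernels.setdefault(value, {})
        elif current is not None:
            current[key] = int(value)
    return kernels

if __name__ == "__main__":
    for name, res in kernel_resources(SAMPLE_CUBIN).items():
        spilled = "spills to local memory" if res.get("lmem", 0) > 0 else "no local memory"
        print(f"{name}: {res.get('reg')} registers, {spilled}")
```

If lmem stays at 0 after lowering the limit, the compiler managed to fit the kernel in fewer registers without spilling, which is what was reported above for a limit of ten.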

Hi,

I am trying to optimize the number of registers to 10 in a matrix multiplication PTX file. Currently it uses 11 registers.

Using decuda I can see where the 11th register is being used, but looking at my PTX file I don’t understand what
in the PTX file is causing the 11th register to be assigned.

Does anyone have any ideas on how to reduce register usage by looking at the decuda output? Compiling the original code gave me 16 registers.
I made a lot of changes in the PTX to bring it down to 11, but I need to bring it down to 10 registers for 100% occupancy.

Both decuda ptx and original ptx are attached.

Any help would be appreciated, thanks!
mm.txt (17.7 KB)
decuda.txt (6.83 KB)
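
On the 100% occupancy target mentioned above, here is a rough calculation of why going from 11 to 10 registers matters on a Tesla C870. It is a simplified sketch, not the official occupancy calculator: it assumes the compute capability 1.0 limits of 8192 registers, 768 resident threads and 8 resident blocks per multiprocessor, rounds register allocation up to 256 registers per block, and ignores shared memory entirely. The block size of 256 threads (e.g. 16x16) is an assumption; the thread does not say which block size is used.

```python
# Assumed per-multiprocessor limits for compute capability 1.0 (G80).
REGS_PER_SM = 8192
MAX_THREADS_PER_SM = 768
MAX_BLOCKS_PER_SM = 8
REG_ALLOC_UNIT = 256  # assumed per-block allocation granularity

def occupancy(regs_per_thread, threads_per_block):
    """Fraction of the maximum resident threads, limited by registers only."""
    regs_per_block = regs_per_thread * threads_per_block
    # Round up to the allocation unit.
    regs_per_block = -(-regs_per_block // REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    blocks = min(
        REGS_PER_SM // regs_per_block,            # limit from the register file
        MAX_THREADS_PER_SM // threads_per_block,  # limit from resident threads
        MAX_BLOCKS_PER_SM,                        # limit from resident blocks
    )
    return blocks * threads_per_block / MAX_THREADS_PER_SM

if __name__ == "__main__":
    for regs in (16, 14, 11, 10):
        print(f"{regs} registers/thread: {occupancy(regs, 256):.0%} occupancy")
```

Under these assumptions, 11 registers per thread allows only two 256-thread blocks per multiprocessor (about 67% occupancy), while 10 registers allows three (100%). Any count from 11 up to 16 lands in the same two-block bucket, so the earlier reductions would not show up in occupancy until the count reaches 10.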