Mismatch between nvdisasm register output vs nvcc -res-usage output

I am trying to reduce the register usage of a kernel by looking at the SASS together with the register liveness information (generated with the nvdisasm -lrm count <test.cubin> command).

It reports a maximum register usage of 113 GPRs for my scenario, but when I do a full compile with nvcc using the -res-usage option, the register usage is reported as 133. When running the program, I can see that 133 registers are used as well.
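For context, here is roughly how the two numbers are obtained; the kernel and file names below are simplified stand-ins for my actual code, and -arch=sm_80 is just an example architecture:

```
// repro.cu -- stand-in kernel; my real kernel is considerably larger.
//
// Per-cubin report with register liveness (where I read the 113 GPR figure):
//   nvcc -arch=sm_80 -cubin repro.cu -o repro.cubin
//   nvdisasm -lrm count repro.cubin
//
// Full compile with the per-kernel resource report (where 133 is reported):
//   nvcc -arch=sm_80 -res-usage repro.cu -o repro
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main() { return 0; }
```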

Is it possible to know the reason for this difference?

Additional registers might be used for called functions, if they are not inlined.
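For illustration, a minimal sketch of that situation (made-up function and kernel names, not taken from your code):

```
// A function marked __noinline__ is compiled as a real call in the SASS.
// The call follows the ABI calling convention, which reserves additional
// registers beyond what the kernel body itself needs.
__device__ __noinline__ float helper(float x) { return x * x + 1.0f; }

__global__ void kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = helper(in[i]);   // shows up as a CALL instruction, not inlined code
}
```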


Thanks for the reply,

I checked the SASS output and I don’t find any device functions or any CALL instructions to other functions. I hope that is enough to conclude that all the functions are already inlined. In that case, could there be any other reason for the mismatch?

It is highly likely one could find out what is going on with full access to compilable source code and a complete record of all tool-chain invocations used to build the code.

As presented, there is too much about this code that is unknown to make a diagnosis. Is the kernel contained in a single compilation unit, or does it result from device code linking of multiple compilation units? If the latter, is link-time optimization used?

You mention "full compile with nvcc. What does that mean? “full” as opposed to what? Are you comparing code in an object file with code from the binary executable? I have no idea how big this code is, but you could probably get an idea as to what is going on by comparing the two variants of the disassembled code. Among large stretches of code that look more or less identical in both versions , there will presumably be some recognizable differences in the instruction sequences that in turn likely have some correspondence with the additional registers used.

I assume you have some specific goal in mind in your attempts to reduce the register usage of this code. What is that goal? I assume you have already experimented with the compiler’s -maxrregcount switch and the __launch_bounds__() attribute. Generally speaking, trying to squeeze code into fewer registers with those two mechanisms likely leads to lower-performing code.
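For reference, a sketch of those two mechanisms, with an arbitrary example kernel and made-up bounds (an illustration only, not a recommendation of specific values):

```
// Cap registers for every kernel in the compilation unit, on the command line:
//   nvcc -maxrregcount=64 test.cu
//
// Or cap per kernel in the source, via the launch bounds attribute:
__global__ void __launch_bounds__(256, 2)   // <= 256 threads/block, >= 2 blocks/SM
scale_kernel(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}
```

Once such a cap bites, the compiler spills to local memory instead, which is exactly where the performance risk comes from.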

There are some compiler optimizations that can increase register usage, such as the scalarization of small local arrays with compile-time constant indexing. Again, trying to prevent this from happening typically leads to lower performance. Your code may also be using standard math library functions that require a hefty number of temporary registers; a classical case is the use of pow() or powf(), which can be avoided in at least some cases.
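Hypothetical illustrations of both effects (function names and bodies are invented for the example):

```
#include <math.h>

// powf() with a general exponent expands into a long, register-hungry
// instruction sequence; an exponent known at compile time can often be
// strength-reduced to plain multiplications instead.
__device__ float cube_costly(float x) { return powf(x, 3.0f); }
__device__ float cube_cheap(float x)  { return x * x * x; }

// Scalarization: a small local array accessed only with compile-time
// constant indices is typically promoted to individual registers
// rather than being kept in local memory.
__device__ float norm2_3(const float *v)
{
    float a[3] = { v[0], v[1], v[2] };              // a[0..2] likely live in registers
    return a[0] * a[0] + a[1] * a[1] + a[2] * a[2]; // fast, but costs extra registers
}
```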


Thank you very much for the detailed explanation.

As you mentioned, I was comparing two different stages of the compile: the cubin was generated from a single compilation unit, while the “full” compile contained multiple compilation units plus a device-link stage (using -dlink). So the register information was generated for a single compilation unit, whereas the full compile reported the register usage after linking all the compilation units.

Initially I thought it was doing link-time optimizations, but it was not. After using the -dlto option mentioned here, I was able to get the linked build to report the same number of registers as the intermediate register-usage output generated when compiling the single compilation unit.
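For anyone hitting the same mismatch, here is a sketch of the build flows involved; the file names, the cross-unit function, and -arch=sm_80 are placeholders rather than my actual project:

```
// a.cu -- kernel that uses a device function defined in another unit:
extern __device__ float f(float x);

__global__ void k(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = f(in[i]);
}

// b.cu would contain the definition:
//   __device__ float f(float x) { return x * x + 1.0f; }
//
// Separate compilation plus device link -- the final register count is only
// settled at the -dlink step, after all units are combined:
//   nvcc -arch=sm_80 -dc a.cu b.cu
//   nvcc -arch=sm_80 -dlink a.o b.o -o dlink.o
//
// With device link-time optimization, the optimizer runs across units at the
// link step; in my case this made the linked result match the per-unit report:
//   nvcc -arch=sm_80 -dc -dlto a.cu b.cu
//   nvcc -arch=sm_80 -dlink -dlto a.o b.o -o dlink.o
```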
