In trying to figure out why my code is slow, I decided to decuda my .cubin file and look at what the core routines are actually doing. Most of it is quite understandable, but there is one big question.
The registers $r78 and $r79 are used, but I compile with -maxrregcount=20, so only registers $r0 to $r19 ought to be used. I can’t seem to get rid of these registers. Can anyone tell me anything about this?
I’m not sure that the registers used have to be numbered consecutively, but can you grep your disassembled code (with grep -E) for something like “\$r[0-9]+”, so that you can check which other registers are actually used?
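If grep gets awkward, here is a rough Python sketch of the same check. It only assumes that registers show up in the disassembly text as $r followed by a number; the sample instructions are made up for illustration and are not real decuda output.

```python
import re

def registers_used(disassembly):
    """Return the sorted, de-duplicated $rN register numbers found in the text."""
    # \$ matches a literal dollar sign; (\d+) captures the register number.
    return sorted({int(n) for n in re.findall(r"\$r(\d+)", disassembly)})

# Hypothetical sample lines, only to show the shape of the input.
sample = """
add.u32 $r1, $r0, $r19
mad.f32 $r2, $r3, $r4, $r79
"""

print(registers_used(sample))  # [0, 1, 2, 3, 4, 19, 79]
```

To run it on a real dump, read the decuda output file into a string and pass it to registers_used.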
Thanks cgorac, I have confirmed that all registers $r0 to $r19 are also used. For completeness’ sake, the latest code uses $r0-$r19, $r79-$r82, and $r124, for a total of 25 “regular” registers in the decompiled code despite the .cubin reporting only reg = 20.
The reason I even care about this question is that my program appears to run inexplicably slowly… so my worry is that those extra registers are actually turned into local memory accesses at some point. But the .cubin lists lmem = 0. How likely is this possibility?
I haven’t ever looked at decuda output in detail, but are special registers declared explicitly via names or do they get mapped to higher register names such as r79 and up? I would try to figure out exactly which values in your source code the higher registers are mapped to.
Register IDs in half instructions are restricted to 6 bits, instead of 7 in regular instructions (so they can only address registers 0 through 63). The high-order bit freed this way is reused for something else, here for negating the third operand.
Actually I realize that I am the one who introduced that bug. When I added support for the “negate” flag, I forgot to also restrict the register ID width, so the flag bit ends up printed as part of the register number.
All the features of decuda are from Wladimir, but the bugs are mine :"> .
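To make the effect concrete, here is a small Python sketch of how such an operand field could be split. It assumes the negate flag is exactly the bit just above the 6-bit register ID (bit 6 of the same 7-bit field); the names are hypothetical and this is not decuda's actual code.

```python
REG_MASK_6BIT = 0x3F  # low 6 bits: register ID in a half instruction (0..63)
NEGATE_BIT = 0x40     # assumed position of the negate flag (bit 6)

def decode_half_operand(field):
    """Split a 7-bit operand field into (register ID, negate flag)."""
    return field & REG_MASK_6BIT, bool(field & NEGATE_BIT)

# Register numbers reported in this thread that exceed the 20-register limit.
for raw in (78, 79, 80, 81, 82):
    reg, negate = decode_half_operand(raw)
    print(f"$r{raw} -> $r{reg}, negate={negate}")
```

Under that assumption, $r79 comes out as $r15 with the negate flag set, and $r78 through $r82 all land inside $r0 to $r19. $r124 does not fit this pattern (it would map to $r60), so it presumably has a different explanation.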