PTX assembly question: mystery registers?


In trying to figure out why my code is slow, I decided to decuda my .cubin file and look at what the core routines are actually doing. Most of it is quite understandable, but there is one big question.

In these two lines of code

add.half.rn.f32 $r4, s[$ofs2+0x0034], -$r78
add.half.rn.f32 $r5, s[$ofs2+0x0030], -$r79

the registers $r78 and $r79 are used. But I compile with -maxrregcount=20, so only registers $r0 to $r19 ought to be used. I can’t seem to get rid of these registers. Can anyone tell me anything about this?

I’m not sure that the registers used have to be numbered consecutively, but can you grep your disassembled code for something like “$r[0-9]+”, so that you can check which other registers are actually used?
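If grep is awkward for this, the same check can be sketched in a few lines of Python. This is only an illustration: the function name registers_used is made up here, and the sample text is just the two lines quoted above, not a full decuda dump.

```python
import re

def registers_used(disassembly):
    """Return the sorted, de-duplicated $rN register numbers found in the text."""
    # Matches $r followed by digits; names such as $ofs2 or $p0 do not match.
    return sorted({int(n) for n in re.findall(r"\$r(\d+)", disassembly)})

# The two instructions from the original question.
sample = """\
add.half.rn.f32 $r4, s[$ofs2+0x0034], -$r78
add.half.rn.f32 $r5, s[$ofs2+0x0030], -$r79
"""

print(registers_used(sample))
```

To run it on a real dump, read the decuda output from a file and pass its contents to registers_used instead of sample.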

Thanks cgorac, I have confirmed that all registers $r0 to $r19 are also used. For completeness’ sake, the latest code uses $r0-$r19, $r79-$r82, and $r124, for a total of 25 “regular” registers in the decompiled code despite the .cubin reporting only reg = 20.

The reason I even care about this question is that my program appears to run inexplicably slowly… so my worry is that those extra registers are actually turned into local memory accesses at some point. But the .cubin lists lmem = 0. How likely is this possibility?

I haven’t ever looked at decuda output in detail, but are special registers declared explicitly via names or do they get mapped to higher register names such as r79 and up? I would try to figure out exactly which values in your source code the higher registers are mapped to.

This is just a bug in decuda. The output should read something like:

add.half.rn.f32 $r4, s[$ofs2+0x0034], -$r14

add.half.rn.f32 $r5, s[$ofs2+0x0030], -$r15

Ah, ok. Thanks a heap!

Can you tell me anything more, like what causes the bug or which registers are actually used?

Edited to add: Cgorac, special registers have names different from regular registers. E.g. $p0-$p3 are predicate registers.

I think he meant special registers, as in those defined by the PTX ISA (e.g. %nctaid).

This is caused by an overlap between two fields.

Register IDs in half instructions are restricted to 6 bits, instead of the 7 bits used in regular instructions, so they can only address registers 0 through 63. The high-order bit that this frees up is reused for something else, here as the negate flag for the third operand.

Actually I realize that I am the one who introduced that bug. When I added support for the “negate” flag, I forgot to also restrict the register ID width…
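To make the overlap concrete, here is a minimal sketch of the decoding, assuming the negate flag sits directly above the 6-bit register ID within the old 7-bit field. The names are invented for illustration; this is not the actual decuda code, and the real bit positions inside the instruction word are not shown.

```python
HALF_REG_MASK = 0x3F  # low 6 bits: register ID in a half instruction
NEGATE_BIT = 0x40     # 7th bit: assumed position of the negate flag

def decode_half_operand(field):
    """Split a 7-bit operand field into (register number, negated?)."""
    reg = field & HALF_REG_MASK
    negated = bool(field & NEGATE_BIT)
    return reg, negated

# Reading all 7 bits as a register ID gives the bogus $r78 and $r79.
for raw in (78, 79):
    reg, negated = decode_half_operand(raw)
    sign = "-" if negated else ""
    print("$r%d -> %s$r%d" % (raw, sign, reg))
```

With the mask applied, 78 and 79 come out as negated $r14 and $r15, which matches the corrected instructions above.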

All the features of decuda are from Wladimir, but the bugs are mine :"> .