Why does cuobjdump imply that instructions have 16 bytes?

Using cuobjdump to disassemble a kernel, we have instruction offsets on the left and 8-byte instruction values on the right:

        /*0000*/                   LDC R1, c[0x0][0x28] ;                                                      /* 0x00000a00ff017b82 */
                                                                                                               /* 0x000e620000000800 */
        /*0010*/                   ULDC.64 UR12, c[0x0][0x228] ;                                               /* 0x00008a00000c7ab9 */
                                                                                                               /* 0x000fe20000000a00 */
        /*0020*/                   S2R R4, SR_CTAID.X ;                                                        /* 0x0000000000047919 */
                                                                                                               /* 0x000ea20000002500 */
        /*0030*/                   UIADD3 UR4, UR13, 0x3f, URZ ;                                               /* 0x0000003f0d047890 */
                                                                                                               /* 0x000fca000fffe03f */
        /*0040*/                   S2UR UR8, SR_CTAID.X ;                                                      /* 0x00000000000879c3 */
                                                                                                               /* 0x000ee20000002500 */
        /*0050*/                   IABS R20, UR12 ;                                                            /* 0x0000000c00147c13 */

On the right, it looks like each instruction is 8 bytes. But on the left, I see each instruction is 0x10 bytes past the previous one, which is 16 bytes.

Are instructions 8 bytes, or 16 bytes? Are the offsets on the LHS actually “nibbles”?

On modern GPUs, each 8-byte instruction is accompanied by an 8-byte “op-steering” control block that is not publicly documented, although various people have reverse-engineered most of it. In the disassembler listing you can see the encoding of each control block on the right, but it is not further decoded into a human-readable form (one could speculate that NVIDIA has a version for internal use that does just that). That is why a blank line appears where the disassembly of the instruction bytes would normally go.
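
This layout can be checked mechanically. The Python sketch below parses the first two instructions of the listing above, assuming the Volta-and-later format of one 16-byte slot printed as two 64-bit hex words; the control-field bit positions at the end are taken from published reverse-engineering write-ups and are an assumption, not an official NVIDIA layout.

```python
# Sketch: confirm that the left-hand offsets are byte offsets with a
# 0x10 (16-byte) stride, and that each instruction slot is 128 bits.
import re

listing = """
/*0000*/  LDC R1, c[0x0][0x28] ;           /* 0x00000a00ff017b82 */
                                           /* 0x000e620000000800 */
/*0010*/  ULDC.64 UR12, c[0x0][0x228] ;    /* 0x00008a00000c7ab9 */
                                           /* 0x000fe20000000a00 */
"""

# Byte offsets such as /*0010*/ on the left of each instruction.
offsets = [int(m, 16) for m in re.findall(r"/\*([0-9a-fA-F]{4})\*/", listing)]

# The two 64-bit words printed on the right for every instruction.
words = [int(m, 16) for m in re.findall(r"0x([0-9a-fA-F]{16})", listing)]

# Pair the words: the low 64 bits sit on the decoded instruction line,
# the high 64 bits on the line the disassembler leaves blank.
slots = [(hi << 64) | lo for lo, hi in zip(words[0::2], words[1::2])]

# Offsets advance by 16 bytes per slot: they are bytes, not nibbles.
strides = [b - a for a, b in zip(offsets, offsets[1:])]
print(strides)  # [16]

# ASSUMPTION (reverse-engineered; bit positions are approximate and may
# differ between architectures): scheduling fields such as a stall count
# live in the upper bits of the 128-bit slot, e.g. around bits 105-125.
stall = (slots[0] >> 105) & 0xF
```

Running this on the full kernel listing instead of the two-instruction excerpt works the same way; only the `listing` string is specific to this example.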

Historically, such approaches have been used in microcode and in VLIW processors. The usual motivation is to simplify processor hardware (in particular those parts that deal with instruction scheduling and issuing) by moving some of the work into software. As can already be glimpsed from the use of 8-byte instructions itself, code density is not a major concern in GPUs, although minor performance issues such as loop bodies exceeding instruction cache size have been observed in the wild.

Thanks @njuffa! You mentioned that the control blocks are mostly reverse engineered, do you happen to have a link for Hopper or Ampere?

Sorry, I am not aware of a write-up on this for the most recent GPU architectures. As GPU microarchitectures appear to be converging in the most recent iterations, I would expect only minor differences in the control block data between the latest architecture iterations.

Although not fully addressing what you’re after, you may find this of interest.

The most comprehensive, although now dated, explanation I’ve come across, referencing Maxwell/Pascal, is here.

The separation is not a clean 8 bytes + 8 bytes. For example, for some instructions the 3rd source register, or the flags for the optional negation/absolute-value of the 1st or 3rd source operand, is stored within the second 8 bytes. The same goes for the specific MUFU operation and the .FTZ (flush-to-zero) flag of floating-point operations.

As the general instruction format has existed since Volta, new extensions have to make the best possible use of the remaining available bits.

An example of an instruction with 3 source operands is PRMT (PERMUTE): it takes two 32-bit data operands and one operand specifying the permutation. But there are lots of instructions using 3 operands.

A 12-byte + 4-byte separation of instruction and control block is probably closer to the point of the current format. So I would guess the 8+8 bytes display is chosen for spacing and hex readability rather than to reflect the exact instruction bit usage.
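
The 12+4 view can be illustrated by treating the 128-bit slot as one integer and slicing it. The split points below (96 bits of instruction encoding, control information in the top bits) follow the discussion above and are assumptions for illustration, not an official spec; the field position at bit 64 is likewise hypothetical.

```python
# Sketch of the 12-byte + 4-byte view of a Volta+ instruction slot,
# using the first instruction from the listing above.
lo = 0x00000a00ff017b82      # first 64 bits as printed by cuobjdump
hi = 0x000e620000000800      # second 64 bits (the "blank" line)
slot = (hi << 64) | lo       # full 128-bit instruction slot

insn_bits    = slot & ((1 << 96) - 1)   # ~12 bytes of instruction encoding
control_bits = slot >> 96               # ~4 bytes holding control information

# A field starting at bit 64 is printed inside the second 8-byte word even
# though it still belongs to the instruction encoding proper — which is why
# the 8+8 display does not match the actual encoding boundary:
field_in_second_word = (slot >> 64) & 0xFF   # hypothetical field position
```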

Interesting. I stand corrected regarding the split. I guess this explains why, for example, in Volta and following generations full FP32 immediate operands can be encoded into the instruction.

While instructions with three source operands have existed for a long time, modern GPUs seem to make it a point to exploit the general availability of three source operands as much as possible, for example IADD3, IMAD, LOP3, etc., using RZ as the third source operand where only two sources are needed. Great for hardware efficiency, but it tends to make the generated SASS code harder to decipher.
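
The RZ idiom can be sketched with a toy model (this models the semantics only, not the hardware, and ignores 32-bit wraparound): because RZ always reads as zero, a three-input instruction with RZ in one slot degenerates to a two-input operation.

```python
# Toy model of the RZ idiom: RZ is a register that always reads as zero,
# so "IADD3 R0, R1, R2, RZ" computes a plain two-operand add.
RZ = 0

def iadd3(a, b, c):
    """Model of IADD3: three-input integer add (ignoring 32-bit wrap)."""
    return a + b + c

r0 = iadd3(5, 7, RZ)   # equivalent to a two-input add: 12
```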