.reg .f16x2 %Rb<1> in ISA example: mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32

I checked both PTX ISA 8.4 and 8.5, and the example for mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 is:

.reg .f16x2 %Ra<2>, %Rb<1>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32
{%Rd0, %Rd1, %Rd2, %Rd3},
{%Ra0, %Ra1, %Ra2, %Ra3},
{%Rb0, %Rb1},
{%Rc0, %Rc1, %Rc2, %Rc3};

How does .reg .f16x2 %Rb<1>; result in %Rb0 and %Rb1? Wouldn’t it expand to just a single %Rb0?

I’m assuming a trillion-dollar company wouldn’t publish the same typo in multiple ISA documents describing the use of its flagship hardware offering, so clearly I’m missing something. Am I meant to interpret %Rb0 as %Rb0.x and %Rb1 as %Rb0.y? Or maybe %Rb<1> is a special case that doesn’t number ‘%Rb’ from a 0 offset, and ‘0’ and ‘1’ act like magic indexes into ‘%Rb’? Neither section 5.4.6 nor section 5.2.4 of the PTX ISA offers any insight.
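To make the expected expansion concrete, here is a minimal Python sketch of the parameterized-name rule as the PTX ISA describes it (a declaration like .reg .b32 %r<100> creates %r0 through %r99). The helper name expand_parameterized is hypothetical and exists only for this illustration; it is not part of any NVIDIA tool.

```python
import re

def expand_parameterized(name):
    """Expand a PTX parameterized register name such as '%Rb<2>'.

    Hypothetical helper for illustration: '<N>' yields N names,
    numbered from 0 to N-1. Names without '<N>' are returned unchanged.
    """
    match = re.fullmatch(r"(%?[A-Za-z_$][\w$]*)<(\d+)>", name)
    if match is None:
        return [name]
    base, count = match.group(1), int(match.group(2))
    return [f"{base}{i}" for i in range(count)]

print(expand_parameterized("%Rb<1>"))  # only %Rb0
print(expand_parameterized("%Ra<2>"))  # %Ra0 and %Ra1
```

Under this rule, %Rb<1> gives only %Rb0 and %Ra<2> gives only %Ra0 and %Ra1, so the %Rb1, %Ra2 and %Ra3 operands in the example would be undeclared.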

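According to PTX ISA 8.5 (section 9.7.15.4.8), this matrix shape needs 4 A registers and 2 B registers per thread, each holding 2 FP16 values, so 8 FP16 values for A and 4 for B. The array sizes in the given examples seem to be incorrect: %Ra<2> and %Rb<1> declare only 2 and 1 registers, while the instruction uses %Ra0 through %Ra3 and %Rb0 through %Rb1, which would need %Ra<4> and %Rb<2>.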
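The register counts can be checked with a little arithmetic. The Python sketch below assumes the fragment elements are spread evenly over the 32 threads of a warp and that each 32-bit register packs two 16-bit values; regs_per_thread is a hypothetical helper name used only here.

```python
WARP_SIZE = 32  # threads cooperating in one mma.sync

def regs_per_thread(rows, cols, elems_per_reg=2):
    """Registers each thread holds for a rows x cols fragment.

    Assumes an even split across the warp and two 16-bit
    elements packed into each 32-bit register.
    """
    elems_per_thread = (rows * cols) // WARP_SIZE
    return elems_per_thread // elems_per_reg

M, N, K = 16, 8, 16             # shape m16n8k16
a_regs = regs_per_thread(M, K)  # A is M x K
b_regs = regs_per_thread(K, N)  # B is K x N
print(a_regs, b_regs)
```

For m16n8k16 this gives 4 registers for A and 2 for B, which matches the operand lists in the example but not its .reg declarations.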
