.reg f16x2 %Rb<1> in ISA example: mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32

ryderbrooks · October 25, 2024, 2:35pm

I checked both PTX ISA 8.4 and 8.5, and the example for mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 is:

.reg .f16x2 %Ra<2>, %Rb<1>;
.reg .f32 %Rc<4>, %Rd<4>;
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32
{%Rd0, %Rd1, %Rd2, %Rd3},
{%Ra0, %Ra1, %Ra2, %Ra3},
{%Rb0, %Rb1},
{%Rc0, %Rc1, %Rc2, %Rc3};

How does .reg .f16x2 %Rb<1>; result in %Rb0 and %Rb1? Wouldn’t it just expand to a single %Rb0?

I’m assuming a trillion-dollar company wouldn’t publish the same typo in multiple ISA documents describing the use of its flagship hardware offering, so clearly I’m missing something. Am I meant to interpret %Rb0 as %Rb0.x and %Rb1 as %Rb0.y? Or maybe %Rb<1> is a special case that doesn’t initialize ‘Rb’ from a 0 offset, and ‘0’ and ‘1’ act like magic indexes into ‘%Rb’? Neither section 5.4.6 nor section 5.2.4 of the PTX ISA offers any insight.

Curefab · October 25, 2024, 3:48pm

According to PTX ISA 8.5 (chapter 9.7.15.4.8), this matrix shape needs 4 A registers and 2 B registers per thread, each holding 2 FP16 values, so 8xFP16 for A and 4xFP16 for B. The array sizes in the given example (%Ra<2> and %Rb<1>) seem to be incorrect.
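
As a rough sketch of what the declarations would presumably need to look like (an untested fragment, not an official correction): a parameterized name such as %Rb<N> declares exactly N registers, %Rb0 through %Rb(N-1), so the operand lists in the example need %Ra<4> and %Rb<2>. Declaring the A and B registers as .b32 is my own assumption; each 32-bit register holds two packed 16-bit elements. The surrounding kernel and the loads that fill the fragments are omitted.

```ptx
// Assumption: .b32 registers, each holding two packed bf16 elements.
// %Ra<4> declares %Ra0..%Ra3 (8 elements of A per thread).
// %Rb<2> declares %Rb0..%Rb1 (4 elements of B per thread).
.reg .b32 %Ra<4>, %Rb<2>;
// %Rc<4> and %Rd<4> declare the four f32 accumulator inputs and outputs.
.reg .f32 %Rc<4>, %Rd<4>;

mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32
    {%Rd0, %Rd1, %Rd2, %Rd3},
    {%Ra0, %Ra1, %Ra2, %Ra3},
    {%Rb0, %Rb1},
    {%Rc0, %Rc1, %Rc2, %Rc3};
```

With these sizes, every register named in the operand lists is actually declared, which is not the case with %Ra<2> and %Rb<1>.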
