How to load fp8 using ldmatrix on sm120/sm120a

Hi there,
On a Blackwell RTX 5080 (sm_120a), a warp can execute an m16n8k32 tile on the tensor cores for FP8. (cutlass/include/cute/arch/mma_sm120.hpp at main · NVIDIA/cutlass · GitHub)

    asm volatile(
    "mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0 "
    "{%0,  %1,  %2,  %3},"   // D: f32 accumulator fragment
    "{%4,  %5,  %6,  %7},"   // A: e4m3 fragment (4 x b32)
    "{%8,  %9},"             // B: e4m3 fragment (2 x b32)
    "{%10, %11, %12, %13},"  // C: f32 accumulator fragment
    "{%14},"                 // scale-A data (ue8m0)
    "{%15, %16},"            // scale-A selectors: byte-id, thread-id
    "{%17},"                 // scale-B data (ue8m0)
    "{%18, %19};\n"          // scale-B selectors: byte-id, thread-id
    :  "=f"(d0),  "=f"(d1),  "=f"(d2),  "=f"(d3)
    :   "r"(a0),   "r"(a1),   "r"(a2),   "r"(a3),
        "r"(b0),   "r"(b1),
        "f"(c0),   "f"(c1),   "f"(c2),   "f"(c3),
        "r"(uint32_t(sfa0)), "h"(bidA), "h"(tidA),
        "r"(uint32_t(sfb0)), "h"(bidB), "h"(tidB));

I tried to use ldmatrix for matrix B, whose size is 8 x 32. However, I could not find a suitable tile shape for this b8 case. (1. Introduction — PTX ISA 8.7 documentation)

Any suggestions?
Thanks in advance!

.shape     Matrix shape   Element size
.m8n8      8x8            16-bit
.m16n16    16x16          8-bit or 6-bit or 4-bit
.m8n16     8x16           6-bit or 4-bit

You do not have to use ldmatrix; you can also load the elements directly. Or you can use it and then reshuffle the elements.
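If I am reading the PTX fragment-layout tables correctly, loading B directly is straightforward because each lane's two b32 registers pack four consecutive k-elements of one column of the 32x8 (K x N) B tile. Here is a host-side sketch of that lane-to-element mapping (the function and struct names are mine, not from any library); it tells each thread which bytes to fetch, e.g. from a column-major shared-memory tile, instead of using ldmatrix:

```cpp
#include <cassert>
#include <cstdint>

struct BElem { int k; int n; };  // element coordinates within the K x N B tile

// Sketch of the per-lane B-fragment layout for mma.m16n8k32 with 8-bit
// elements and ".col" B, per the PTX ISA "Matrix Fragments" section:
// reg = 0 selects b0, reg = 1 selects b1; i = 0..3 is the byte within
// the 32-bit register. b0 covers k = 0..15, b1 covers k = 16..31.
BElem b_fragment_elem(int lane, int reg, int i) {
    int group = lane / 4;  // n index: which of the 8 columns of B
    int tig   = lane % 4;  // thread id within the group of 4
    BElem e;
    e.n = group;
    e.k = tig * 4 + i + (reg == 1 ? 16 : 0);
    return e;
}
```

With this mapping, each thread can build b0/b1 with a single 4-byte load per register from a column-major tile (the four k-elements per register are contiguous). Please check the mapping against the fragment tables in your PTX ISA version before relying on it.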

@Curefab thanks.
One more question: how do I set up sfa0 and sfb0?
The NVIDIA PTX documentation says sfa0/sfb0 are metadata for the scaling factors. I assumed the scaling factors should reside in tensor memory.
I tried to allocate tensor memory using tcgen05.alloc, but it is only supported on sm_100a/sm_101a, not on sm_120a (RTX 5080 etc.).

Where does it say so?

Here the instruction with its operands is described. 1. Introduction — PTX ISA 8.7 documentation

Or in the block scaling chapter: 1. Introduction — PTX ISA 8.7 documentation

As there is no tensor memory on sm_120a, I would expect it not to reside there?

I am using CUDA toolkit 12.8.1, compiled for sm120a.
The compiler complains that tcgen05.alloc is not supported. The PTX doc also says it is only supported on sm_100a and sm_101a.

I got it working by specifying registers for sfa and sfb. I noticed that I need to take the bias of 127 into account when I initialize the e8m0 value.
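For anyone else hitting this: ue8m0 is an exponent-only 8-bit format, so the stored byte is the power-of-two exponent plus a bias of 127. A minimal sketch of the encode/decode (helper names are mine):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// ue8m0 stores only a biased exponent: a scale factor of 2^e is encoded
// as the byte (e + 127). A scale of 1.0 is therefore 127, not 0.
uint8_t encode_ue8m0(int e) {  // scale = 2^e
    return static_cast<uint8_t>(e + 127);
}

float decode_ue8m0(uint8_t v) {
    return std::ldexp(1.0f, static_cast<int>(v) - 127);
}
```

So a "no-op" scale factor of 1.0 must be initialized to 127; initializing it to 0 silently scales everything by 2^-127.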

As an aside, the 5080 is sm120, not 120a, according to deviceQuery output here.

Yes, it is sm120. You can specify sm120a to enable the additional architecture-specific features that the compiler supports in updated toolkits.

My mistake. I was under the impression that the “a” suffix required a different hardware variant.

sm_120 and sm_120a are the same architecture.
The ‘a’ is not used for architecture differentiation.

Instead:
The features in sm_120a are not necessarily available in newer architectures, whereas the features from sm_120 are expected to also work in, e.g., an upcoming sm_130.
