How to load fp8 using ldmatrix on sm120/sm120a

Hi there,
On a Blackwell RTX 5080 (sm_120a), a warp can execute an m16n8k32 tile on the tensor cores for FP8. (cutlass/include/cute/arch/mma_sm120.hpp at main · NVIDIA/cutlass · GitHub)

    asm volatile(
    "mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0 "
    "{%0,  %1,  %2,  %3},"   // D: f32 accumulator fragment
    "{%4,  %5,  %6,  %7},"   // A: e4m3 fragment (4 x b32)
    "{%8,  %9},"             // B: e4m3 fragment (2 x b32)
    "{%10, %11, %12, %13},"  // C: f32 accumulator fragment
    "{%14},"                 // scale-A data (ue8m0)
    "{%15, %16},"            // scale-A selectors: byte-id, thread-id
    "{%17},"                 // scale-B data (ue8m0)
    "{%18, %19};\n"          // scale-B selectors: byte-id, thread-id
    :  "=f"(d0),  "=f"(d1),  "=f"(d2),  "=f"(d3)
    :   "r"(a0),   "r"(a1),   "r"(a2),   "r"(a3),
        "r"(b0),   "r"(b1),
        "f"(c0),   "f"(c1),   "f"(c2),   "f"(c3),
        "r"(uint32_t(sfa0)), "h"(bidA), "h"(tidA),
        "r"(uint32_t(sfb0)), "h"(bidB), "h"(tidB));

I tried to use ldmatrix for matrix B, whose size is 8 x 32. However, I could not find a suitable tile shape for this b8 case. (1. Introduction — PTX ISA 8.7 documentation)

Any suggestions?
Thanks in advance!

.shape     Matrix shape   Element size
.m8n8      8x8            16-bit
.m16n16    16x16          8-bit or 6-bit or 4-bit
.m8n16     8x16           6-bit or 4-bit

You do not have to use ldmatrix; you can also load the elements directly. Or you can use it and then reshuffle the elements.
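If I am reading the PTX fragment-layout tables correctly, loading B directly is straightforward because each lane's two b32 registers pack four consecutive k-elements of one column of the 32x8 (K x N) B tile. Here is a host-side sketch of that lane-to-element mapping (the function and struct names are mine, not from any library); it tells each thread which bytes to fetch, e.g. from a column-major shared-memory tile, instead of using ldmatrix:

```cpp
#include <cassert>
#include <cstdint>

struct BElem { int k; int n; };  // element coordinates within the K x N B tile

// Sketch of the per-lane B-fragment layout for mma.m16n8k32 with 8-bit
// elements and ".col" B, per the PTX ISA "Matrix Fragments" section:
// reg = 0 selects b0, reg = 1 selects b1; i = 0..3 is the byte within
// the 32-bit register. b0 covers k = 0..15, b1 covers k = 16..31.
BElem b_fragment_elem(int lane, int reg, int i) {
    int group = lane / 4;  // n index: which of the 8 columns of B
    int tig   = lane % 4;  // thread id within the group of 4
    BElem e;
    e.n = group;
    e.k = tig * 4 + i + (reg == 1 ? 16 : 0);
    return e;
}
```

With this mapping, each thread can build b0/b1 with a single 4-byte load per register from a column-major tile (the four k-elements per register are contiguous). Please check the mapping against the fragment tables in your PTX ISA version before relying on it.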

@Curefab thanks.
One more question: how do I set up sfa0 and sfb0?
The NVIDIA PTX documentation says sfa0/sfb0 are metadata for the scaling factors. I assumed the scaling factors should reside in tensor memory.
I tried to allocate tensor memory using tcgen05.alloc, but it is only supported on sm_100a/sm_101a, not on sm_120a (RTX 5080 etc.).

Where does it say so?

Here the instruction with its operands is described. 1. Introduction — PTX ISA 8.7 documentation

Or in the block scaling chapter: 1. Introduction — PTX ISA 8.7 documentation

As there is no tensor memory on sm_120a, I would expect it not to reside there?

I am using CUDA toolkit 12.8.1, compiled for sm120a.
The compiler complains that tcgen05.alloc is not supported. The PTX doc also says it is only supported on sm_100a and sm_101a.

I got it working by specifying registers for sfa and sfb. I noticed that I need to take the bias of 127 into account when I initialize the e8m0 value.
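For anyone else hitting this: ue8m0 is an exponent-only 8-bit format, so the stored byte is the power-of-two exponent plus a bias of 127. A minimal sketch of the encode/decode (helper names are mine):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// ue8m0 stores only a biased exponent: a scale factor of 2^e is encoded
// as the byte (e + 127). A scale of 1.0 is therefore 127, not 0.
uint8_t encode_ue8m0(int e) {  // scale = 2^e
    return static_cast<uint8_t>(e + 127);
}

float decode_ue8m0(uint8_t v) {
    return std::ldexp(1.0f, static_cast<int>(v) - 127);
}
```

So a "no-op" scale factor of 1.0 must be initialized to 127; initializing it to 0 silently scales everything by 2^-127.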

As an aside, the 5080 is sm120, not 120a, according to deviceQuery output here.

Yes, it is sm120. You can specify sm120a to enable the additional architecture-specific features that the compiler supports in updated toolkits.

My mistake. I was under the impression that the “a” suffix required a different hardware variant.

sm_120 and sm_120a are the same architecture.
The ‘a’ is not used for architecture differentiation.

Instead:
The features in sm_120a are not necessarily available in newer architectures, whereas the features from sm_120 are expected to also work in, e.g., an upcoming sm_130.
