Fragment layout for mma.sync.aligned.m16n8k64 with FP4 E2M1 and block scaling on SM120

Hello,

I’m writing an FP4 fused attention kernel targeting SM120 consumer Blackwell GPUs (RTX 5070 Ti, RTX 5080) using inline PTX. SM120 does not support tcgen05.mma, so the FP4 path goes through warp-level mma.sync (confirmed by CUTLASS issues #2800 and #3044).

The instruction I’m using is:

mma.sync.aligned.m16n8k64.row.col.kind::mxf4.block_scale.scale_vec::2X.f32.e2m1.e2m1.f32.ue8m0

I need the exact thread-to-matrix-element mapping for this instruction. The PTX ISA 9.2 documentation provides fragment layout tables for m16n8k32 with types such as .f16, .bf16, .u8, and .s8 (Tables 43-47), and the m16n8k64 section lists the instruction, but I could not find a fragment layout diagram specifically for FP4 E2M1 with block scaling.

My questions:

  1. For matrix A (16x64, FP4 E2M1, row-major): each thread holds 4 x 32-bit registers (a0-a3). Since each register holds 8 FP4 values, which rows and columns does each lane (0-31) map to in each register? Is it identical to the .u4/.s4 layout for m16n8k32, repeated along K (a0/a1 covering k = 0-31 and a2/a3 covering k = 32-63)?

  2. For matrix D (16x8, FP32, accumulator): I believe this is the standard layout where lane_id / 4 gives the row offset and (lane_id % 4) * 2 gives the column offset, with 4 FP32 registers per thread (d0 = row+0/col, d1 = row+0/col+1, d2 = row+8/col, d3 = row+8/col+1). Can you confirm?

  3. For the scale factor registers (sfa, sfb in UE8M0) with scale_vec::2X: the block scaling documentation shows the layout but I find the diagrams difficult to interpret. Could you clarify which scale factor element each lane holds, and how byte_id and thread_id parameters map to specific scale blocks?
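To make questions 1 and 2 concrete, here is the mapping I am assuming, written out as plain index arithmetic. The D mapping is the standard m16n8 FP32 accumulator layout from the PTX ISA; the A mapping is my unconfirmed extrapolation from the documented m16n8k32 .u4/.s4 tables, i.e. exactly the hypothesis I am asking to have confirmed:

```cpp
#include <cassert>

struct Coord { int row, col; };

// Question 1, my hypothesis (unconfirmed): A-fragment element (row, col)
// held by `lane` in 32-bit register r (0-3) at nibble i (0-7), extrapolated
// from the documented m16n8k32 .u4/.s4 layout.
Coord a_frag_coord(int lane, int r, int i) {
    Coord c;
    c.row = lane / 4 + 8 * (r % 2);   // a0/a2 -> rows 0-7, a1/a3 -> rows 8-15
    c.col = 32 * (r / 2)              // a0/a1 -> k 0-31, a2/a3 -> k 32-63
          + (lane % 4) * 8 + i;       // 8 consecutive FP4 nibbles per register
    return c;
}

// Question 2: standard m16n8 FP32 accumulator layout, register r (0-3).
Coord d_frag_coord(int lane, int r) {
    Coord c;
    c.row = lane / 4 + 8 * (r / 2);   // d0/d1 -> rows 0-7, d2/d3 -> rows 8-15
    c.col = (lane % 4) * 2 + (r % 2); // adjacent column pair per thread
    return c;
}
```

A quick sanity check is that a_frag_coord is a bijection from (lane, r, i) onto the 16x64 tile; if the real layout differs, I would expect it to differ only in which K-halves a0-a3 cover.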

Context: I’m building a fused GEMM-softmax-GEMM attention kernel. After the first MMA, I need to repack the accumulator (layout D) into operand A (layout A) for the second MMA, so knowing the exact fragment layouts is critical for this step: any mistake produces silently wrong results.
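For concreteness, the repack step reduces to the following bookkeeping: given the (row, col) an accumulator register holds, find the (lane, register, nibble) slot that element must occupy in the next A fragment after FP4 quantization. The D-side formulas are the documented standard layout; the A-side formulas again encode my unconfirmed hypothesis from question 1:

```cpp
#include <cassert>

struct Slot { int lane, reg, nibble; };

// Standard m16n8 FP32 accumulator layout: element owned by (lane, reg).
void d_owner(int lane, int reg, int* row, int* col) {
    *row = lane / 4 + 8 * (reg / 2);
    *col = (lane % 4) * 2 + (reg % 2);
}

// Inverse of my hypothesized A layout for m16n8k64 FP4 (unconfirmed):
// which lane/register/nibble holds element (row, col) of the 16x64 tile.
Slot a_slot(int row, int col) {
    Slot s;
    s.lane   = (row % 8) * 4 + (col % 32) / 8; // row group + 8-nibble K chunk
    s.reg    = (row / 8) + 2 * (col / 32);     // which of a0-a3
    s.nibble = col % 8;                        // position within the register
    return s;
}
```

Since each element generally lands in a different lane, the repack needs warp shuffles (or a shared-memory round trip) before the packed nibbles can feed the second MMA, and getting a_slot wrong is exactly the silent-corruption failure mode I want to avoid.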

I’ve studied gau-nernst’s SM120 MXFP8 block-scaled GEMM kernel (learn-cuda/09a_block_scaled_mm_sm120 on GitHub) and SageAttention3’s Blackwell kernel, but neither provides the explicit per-lane mapping for FP4 E2M1 on SM120.

Any documentation, diagrams, or confirmation of the layout would be very helpful. Thank you.