Fragment layout for mma.sync.aligned.m16n8k64 with FP4 E2M1 and block scaling on SM120

Hello,

I’m writing an FP4 fused attention kernel targeting SM120 consumer Blackwell GPUs (RTX 5070 Ti, RTX 5080) using inline PTX. SM120 does not support tcgen05.mma, so the FP4 path goes through warp-level mma.sync (confirmed by CUTLASS issues #2800 and #3044).

The instruction I’m using is:

mma.sync.aligned.m16n8k64.row.col.kind::mxf4.block_scale.scale_vec::2X.f32.e2m1.e2m1.f32.ue8m0

I need the exact thread-to-matrix-element mapping for this instruction. The PTX ISA 9.2 documentation provides fragment layout tables for m16n8k32 with types such as .f16, .bf16, .u8, and .s8 (Tables 43-47), and the m16n8k64 section lists the instruction, but I could not find a fragment layout diagram specifically for FP4 E2M1 with block scaling.

My questions:

  1. For matrix A (16x64, FP4 E2M1, row-major): each thread holds 4 x 32-bit registers (a0-a3). Since each register holds 8 FP4 values, which rows and columns does each lane (0-31) map to in each register? Is it identical to the .u4/.s4 layout for m16n8k32, repeated along K (a0/a1 covering k = 0-31 and a2/a3 covering k = 32-63)?

  2. For matrix D (16x8, FP32, accumulator): I believe this is the standard layout where lane_id / 4 gives the row offset and (lane_id % 4) * 2 gives the column offset, with 4 FP32 registers per thread (d0 = row+0/col, d1 = row+0/col+1, d2 = row+8/col, d3 = row+8/col+1). Can you confirm?

  3. For the scale factor registers (sfa, sfb in UE8M0) with scale_vec::2X: the block scaling documentation shows the layout but I find the diagrams difficult to interpret. Could you clarify which scale factor element each lane holds, and how byte_id and thread_id parameters map to specific scale blocks?
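To make questions 1 and 2 concrete, here is the mapping I am assuming, written out as plain index arithmetic. The D mapping is the standard m16n8 FP32 accumulator layout from the PTX ISA; the A mapping is my unconfirmed extrapolation from the documented m16n8k32 .u4/.s4 tables, i.e. exactly the hypothesis I am asking to have confirmed:

```cpp
#include <cassert>

struct Coord { int row, col; };

// Question 1, my hypothesis (unconfirmed): A-fragment element (row, col)
// held by `lane` in 32-bit register r (0-3) at nibble i (0-7), extrapolated
// from the documented m16n8k32 .u4/.s4 layout.
Coord a_frag_coord(int lane, int r, int i) {
    Coord c;
    c.row = lane / 4 + 8 * (r % 2);   // a0/a2 -> rows 0-7, a1/a3 -> rows 8-15
    c.col = 32 * (r / 2)              // a0/a1 -> k 0-31, a2/a3 -> k 32-63
          + (lane % 4) * 8 + i;       // 8 consecutive FP4 nibbles per register
    return c;
}

// Question 2: standard m16n8 FP32 accumulator layout, register r (0-3).
Coord d_frag_coord(int lane, int r) {
    Coord c;
    c.row = lane / 4 + 8 * (r / 2);   // d0/d1 -> rows 0-7, d2/d3 -> rows 8-15
    c.col = (lane % 4) * 2 + (r % 2); // adjacent column pair per thread
    return c;
}
```

A quick sanity check is that a_frag_coord is a bijection from (lane, r, i) onto the 16x64 tile; if the real layout differs, I would expect it to differ only in which K-halves a0-a3 cover.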

Context: I’m building a fused GEMM-softmax-GEMM attention kernel. After the first MMA, I need to repack the accumulator (layout D) into operand A (layout A) for the second MMA, so knowing the exact fragment layouts is critical for this step: any mistake produces silently wrong results.
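For concreteness, the repack step reduces to the following bookkeeping: given the (row, col) an accumulator register holds, find the (lane, register, nibble) slot that element must occupy in the next A fragment after FP4 quantization. The D-side formulas are the documented standard layout; the A-side formulas again encode my unconfirmed hypothesis from question 1:

```cpp
#include <cassert>

struct Slot { int lane, reg, nibble; };

// Standard m16n8 FP32 accumulator layout: element owned by (lane, reg).
void d_owner(int lane, int reg, int* row, int* col) {
    *row = lane / 4 + 8 * (reg / 2);
    *col = (lane % 4) * 2 + (reg % 2);
}

// Inverse of my hypothesized A layout for m16n8k64 FP4 (unconfirmed):
// which lane/register/nibble holds element (row, col) of the 16x64 tile.
Slot a_slot(int row, int col) {
    Slot s;
    s.lane   = (row % 8) * 4 + (col % 32) / 8; // row group + 8-nibble K chunk
    s.reg    = (row / 8) + 2 * (col / 32);     // which of a0-a3
    s.nibble = col % 8;                        // position within the register
    return s;
}
```

Since each element generally lands in a different lane, the repack needs warp shuffles (or a shared-memory round trip) before the packed nibbles can feed the second MMA, and getting a_slot wrong is exactly the silent-corruption failure mode I want to avoid.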

I’ve studied gau-nernst’s SM120 MXFP8 block-scaled GEMM kernel (learn-cuda/09a_block_scaled_mm_sm120 on GitHub) and SageAttention3’s Blackwell kernel, but neither provides the explicit per-lane mapping for FP4 E2M1 on SM120.

Any documentation, diagrams, or confirmation of the layout would be very helpful. Thank you.