The fragment layout of multiplicand A is not clear in mma.sp.sync.aligned.m16n8k32 when data type is fp16/bf16

The PTX tutorial mentions that multiplicand A's fragment layout is described under sparse matrix storage; however, no figure there illustrates the m16n8k32 case.

According to my tests, the layout of fragment A for sparse m16n8k32 appears to be the same as that of m16n8k16 without sparsity, as documented in PTX ISA :: CUDA Toolkit Documentation.

Also, the figure is problematic:
row 0 col 8-15 should be T0, T1, T2, T3.
row 8 col 8-15 should be T0, T1, T2, T3.
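
To make the mapping I expect concrete, here is a small host-side C++ sketch. It assumes the dense m16n8k16 fp16/bf16 formula for fragment A as I understand it from the PTX ISA (groupID = laneid >> 2, threadID_in_group = laneid % 4), and it assumes, based only on my tests, that sparse m16n8k32 uses the same mapping for the stored 16x16 tile. The names `a_fragment_coord`, `a_fragment_owner`, `print_owner_table` and `self_check` are my own and are not part of any NVIDIA API.

```cpp
#include <cassert>
#include <cstdio>

// Position of one element inside the stored 16x16 tile of A.
struct Coord {
    int row;
    int col;
};

// Element a_i (i = 0..7) held by a lane (0..31), per the dense m16n8k16
// fp16/bf16 A-fragment formula. Using it for sparse m16n8k32 is an
// assumption taken from my tests, not from the documentation.
inline Coord a_fragment_coord(int lane, int i) {
    const int group_id = lane >> 2;         // 0..7
    const int thread_in_group = lane % 4;   // 0..3
    const bool upper_half = (i < 2) || (i >= 4 && i < 6);
    const int row = upper_half ? group_id : group_id + 8;
    const int col = thread_in_group * 2 + (i & 1) + (i >= 4 ? 8 : 0);
    return Coord{row, col};
}

// Inverse mapping: which lane owns the element at (row, col).
inline int a_fragment_owner(int row, int col) {
    return (row % 8) * 4 + (col % 8) / 2;
}

// Prints the owning lane for every element, one tile row per line.
inline void print_owner_table() {
    for (int row = 0; row < 16; ++row) {
        for (int col = 0; col < 16; ++col) {
            std::printf("T%-3d", a_fragment_owner(row, col));
        }
        std::printf("\n");
    }
}

// Checks that the forward and inverse mappings agree for all lanes.
inline void self_check() {
    for (int lane = 0; lane < 32; ++lane) {
        for (int i = 0; i < 8; ++i) {
            const Coord c = a_fragment_coord(lane, i);
            assert(a_fragment_owner(c.row, c.col) == lane);
        }
    }
}
```

Calling `print_owner_table()` from a `main` prints the full table. Under these assumptions, columns 8-15 of both row 0 and row 8 are owned by T0, T0, T1, T1, T2, T2, T3, T3, which is what I would expect the figure to show.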

It would be helpful if the documentation could be improved.