The PTX documentation says that the fragment layout of multiplicand A is described in the sparse matrix storage section; however, no figure there shows the m16n8k32 case.
According to my tests, the layout of fragment A appears to be the same as for m16n8k16
without sparsity: PTX ISA :: CUDA Toolkit Documentation
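
For reference, here is a small sketch of the dense m16n8k16 fragment A mapping, written as plain host-side C++ so it can be checked without a GPU. It assumes `.f16`/`.bf16` operands (the post above does not say which element type was tested), and the names `Coord` and `frag_a_m16n8k16` are made up for illustration. If the observation above holds, these coordinates would index the stored (compressed) 16x16 A matrix of the sparse m16n8k32 instruction; which logical column of the 16x32 matrix each stored element belongs to is still determined by the sparsity metadata operand.

```cpp
// Sketch only: (row, col) of element a_i in the A fragment of dense
// mma.m16n8k16 with .f16/.bf16 operands, following the PTX ISA formulas.
// Assumption: the sparse m16n8k32 compressed A fragment uses the same
// mapping, as reported in the post above.
struct Coord {
    int row;
    int col;
};

// lane: lane id within the warp (0..31)
// i:    index of the element among the thread's 8 A elements (0..7),
//       i.e. two 16-bit values in each of four 32-bit registers
constexpr Coord frag_a_m16n8k16(int lane, int i) {
    return Coord{
        // groupID = lane / 4; a2, a3, a6, a7 live in the lower 8 rows
        (lane >> 2) + (((i & 2) != 0) ? 8 : 0),
        // threadID_in_group = lane % 4; a4..a7 cover columns 8..15
        (lane % 4) * 2 + (i & 1) + ((i >= 4) ? 8 : 0)
    };
}
```

Comparing these coordinates against what each lane actually holds after loading a known test matrix is one way to confirm the layout for a given element type.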