Access WMMA fragment elements by threadID

I was reading the following paper:

And it seems like there's a deterministic mapping between the thread ID and the (i, j) index in an accumulator fragment:

wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> ab_frag;

My operation requires computing several of these for different matrices B, taking subsets (over i) of each output, and writing them to their final location in global memory. Currently I have to go through shared memory first, but I'd like to avoid that if possible. Does anyone have experience with this?

One can easily figure out the mapping between elements and threads by experiment. However, citing the programming guide:

The mapping of matrix elements into fragment internal storage is unspecified and subject to change in future architectures.
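
For what it's worth, here is a minimal sketch of the kind of experiment meant above. It fills a 16x16 tile so that element (i, j) holds i * WMMA_N + j, loads it into an accumulator fragment, and has every lane print which (i, j) landed in each of its `frag.x[]` slots. The 16x16x16 shape, the kernel name `dump_accumulator_layout`, and the sm_70+ build target are my assumptions, not something from the paper. Whatever it prints holds only for the GPU and CUDA version you ran it on, per the quote above.

```cuda
#include <cstdio>
#include <mma.h>
using namespace nvcuda;

// Assumed tile shape; other supported shapes can be probed the same way.
constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

// Launch with exactly one warp. Element (i, j) of the tile stores its own
// row-major index, so the value a lane finds in frag.x[e] identifies (i, j).
__global__ void dump_accumulator_layout() {
    // load_matrix_sync requires a 256-bit aligned pointer.
    __shared__ __align__(32) float tile[WMMA_M * WMMA_N];
    for (int idx = threadIdx.x; idx < WMMA_M * WMMA_N; idx += blockDim.x)
        tile[idx] = static_cast<float>(idx);  // 0..255 is exact in float
    __syncthreads();

    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> frag;
    wmma::load_matrix_sync(frag, tile, WMMA_N, wmma::mem_row_major);

    for (int e = 0; e < frag.num_elements; ++e) {
        int linear = static_cast<int>(frag.x[e]);
        printf("lane %2d  x[%d] -> (i=%2d, j=%2d)\n",
               static_cast<int>(threadIdx.x), e,
               linear / WMMA_N, linear % WMMA_N);
    }
}

int main() {
    dump_accumulator_layout<<<1, 32>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Build with something like `nvcc -arch=sm_70 probe.cu` (the file name is arbitrary). Shared memory is used here only to run the probe; it says nothing about whether the production kernel can drop it.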

For mma (as opposed to wmma), the fragment layouts are specified in the PTX guide: PTX ISA 8.3
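
As an illustration of what a specified layout buys you, here is a small sketch of the index math for the f32 accumulator (C/D) fragment of the m16n8 mma shapes, as I read the figure in the PTX ISA: each lane holds four registers c0..c3, with row = groupID (plus 8 for c2 and c3) and col = 2 * threadID_in_group + (i & 1). The helper name `mma_m16n8_acc_coord`, the `Coord` struct, and the `HD` macro are hypothetical names of mine. Please check the formula against the guide for the exact shape and data type you use before relying on it.

```cpp
// Compiles as plain C++ and, under nvcc, as a host/device function.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

struct Coord {
    int row;
    int col;
};

// (row, col) of accumulator register c[i] (i in 0..3) held by `lane` (0..31)
// in the 16x8 f32 C/D fragment. Assumption: layout as tabulated in the
// PTX ISA for the m16n8 shapes; verify for your shape and type.
HD inline Coord mma_m16n8_acc_coord(int lane, int i) {
    int group_id = lane >> 2;         // 8 groups of 4 consecutive lanes
    int thread_in_group = lane & 3;   // position of the lane within its group
    Coord c;
    c.row = group_id + (i >= 2 ? 8 : 0);    // c2 and c3 sit 8 rows below c0 and c1
    c.col = thread_in_group * 2 + (i & 1);  // two adjacent columns per lane
    return c;
}
```

With a mapping like this, each lane can decide from its own lane ID which of its registers fall in the rows it needs and store those straight to global memory, which is the step that currently goes through shared memory in the question.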