I was reading the following paper:
It seems there is a deterministic mapping between the thread ID and the (i, j) index of each element held in an accumulator fragment:
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> ab_frag;
My operation requires computing several of these accumulator fragments, one per matrix B, then taking subsets (over i) of each output and writing them to their final location in global memory. Currently I have to go through shared memory first, but I’d like to remove that step if possible. Does anyone have experience with this?
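For reference, here is a minimal sketch of how I would check the mapping empirically on a given GPU. It loads a 16x16 tile whose entry (i, j) holds the value i * 16 + j into the accumulator fragment, then has every lane report what ended up in each of its `x[k]` slots. The names `probe_accumulator_layout`, `probe_mapping`, `owner_map` and `ELEMS_PER_LANE`, and the 16x16x16 tile shape, are my own assumptions for illustration. As far as I know, the CUDA documentation describes the element-to-thread mapping as unspecified and subject to change between architectures, so the output should be treated as valid only for the device it was measured on. It needs a GPU with tensor cores and something like `nvcc -arch=sm_70` or newer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// Assumed tile shape for this sketch; substitute the shape actually in use.
constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;
constexpr int WARP_SIZE = 32;
// Assumption: the tile's elements are split evenly across the 32 lanes.
constexpr int ELEMS_PER_LANE = WMMA_M * WMMA_N / WARP_SIZE;

// One warp loads a tile whose entry (i, j) stores i * WMMA_N + j, then each
// lane records which tile entries landed in its fragment slots x[0..].
__global__ void probe_accumulator_layout(const float* tile, int* owner_map) {
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> ab_frag;
    wmma::load_matrix_sync(ab_frag, tile, WMMA_N, wmma::mem_row_major);

    int lane = threadIdx.x % WARP_SIZE;
    for (int k = 0; k < ab_frag.num_elements && k < ELEMS_PER_LANE; ++k) {
        // The stored value is the flattened (i, j) index of this slot.
        owner_map[lane * ELEMS_PER_LANE + k] = static_cast<int>(ab_frag.x[k]);
    }
}

// Returns a table where entry [lane * ELEMS_PER_LANE + k] is the flattened
// index i * WMMA_N + j held in slot x[k] of that lane (-1 if never written).
std::vector<int> probe_mapping() {
    const int n = WMMA_M * WMMA_N;
    std::vector<float> h_tile(n);
    for (int idx = 0; idx < n; ++idx) h_tile[idx] = static_cast<float>(idx);

    float* d_tile = nullptr;
    int* d_map = nullptr;
    cudaMalloc(&d_tile, n * sizeof(float));
    cudaMalloc(&d_map, WARP_SIZE * ELEMS_PER_LANE * sizeof(int));
    cudaMemcpy(d_tile, h_tile.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Every byte 0xFF makes each int read back as -1.
    cudaMemset(d_map, 0xFF, WARP_SIZE * ELEMS_PER_LANE * sizeof(int));

    probe_accumulator_layout<<<1, WARP_SIZE>>>(d_tile, d_map);

    std::vector<int> h_map(WARP_SIZE * ELEMS_PER_LANE, -1);
    cudaMemcpy(h_map.data(), d_map, h_map.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_tile);
    cudaFree(d_map);
    return h_map;
}

int main() {
    std::vector<int> map = probe_mapping();
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        for (int k = 0; k < ELEMS_PER_LANE; ++k) {
            int flat = map[lane * ELEMS_PER_LANE + k];
            printf("lane %2d  x[%d] -> (i=%2d, j=%2d)\n",
                   lane, k, flat / WMMA_N, flat % WMMA_N);
        }
    }
    return 0;
}
```

With a table like this, each lane could in principle decide from `x[k]` alone whether its element falls in the wanted subset over i and write it straight to global memory, at the cost of tying the kernel to the layout observed on that architecture.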