Hi experts, I’d like to do a matmul (A x B) where the rows of input A are scattered in global memory, so a gather is needed to bring them into a contiguous layout before the multiply. It seems that TMA’s instructions don’t support this kind of copy for now. This is for Blackwell (sm100).
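To make the operation concrete, here is a minimal CPU reference of what I am trying to compute. It is only a sketch of the semantics, not of how the copy should be done on sm100. The names `gathered_matmul` and `row_idx` are made up for illustration, and row-major float matrices are an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: C = A[row_idx, :] * B, all matrices row-major.
// A is (num_rows_a x K), B is (K x N), C is (M x N) with M = row_idx.size().
std::vector<float> gathered_matmul(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   const std::vector<int>& row_idx,
                                   std::size_t K, std::size_t N) {
    assert(K > 0 && N > 0);
    assert(A.size() % K == 0 && B.size() == K * N);
    const std::size_t num_rows_a = A.size() / K;
    const std::size_t M = row_idx.size();

    // Step 1: gather the scattered rows of A into a contiguous staging buffer.
    std::vector<float> A_gathered(M * K);
    for (std::size_t i = 0; i < M; ++i) {
        assert(row_idx[i] >= 0 &&
               static_cast<std::size_t>(row_idx[i]) < num_rows_a);
        const std::size_t src = static_cast<std::size_t>(row_idx[i]) * K;
        for (std::size_t k = 0; k < K; ++k) {
            A_gathered[i * K + k] = A[src + k];
        }
    }

    // Step 2: ordinary dense matmul on the now-contiguous rows.
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t k = 0; k < K; ++k) {
            const float a = A_gathered[i * K + k];
            for (std::size_t j = 0; j < N; ++j) {
                C[i * N + j] += a * B[k * N + j];
            }
        }
    }
    return C;
}
```

On the GPU, step 1 is the part I would like to fold into the load of A's tiles rather than run as a separate pass.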
I’m wondering what the best way to do this is. I could issue multiple 1-D copies independently, but that seems non-ideal: there is no way to apply swizzle when copying 1-D data, so a manual copy would still be needed. Any suggestions?