I have a computation that takes the following form:
acc = x @ y
acc = acc * z.T
acc = (acc.T @ y).T
Here,
@ → represents matrix-matrix multiplication
.T → represents transposition
x is multiplied by y and the result is written to a wmma::accumulator fragment. The value in the accumulator is then copied to a wmma::matrix_a fragment and multiplied by z.T (which is in a wmma::matrix_b fragment), and the result is stored in the accumulator.
Now I want to move the value in the accumulator, transposed, into a wmma::matrix_a fragment without first going through shared memory. Is there a way to do this?
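For reference, the shared-memory round trip I am trying to avoid looks roughly like the sketch below. It is a minimal illustration and not my actual kernel: it assumes a single warp, one 16x16x16 tile, half inputs, and an sm_70 or newer GPU. The kernel name and the parameters acc_in, y and out are hypothetical, and acc_in only stands in for the accumulator produced by the earlier steps.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Hypothetical single-warp kernel (launch with <<<1, 32>>>).
// acc_in, y and out each point to one 16x16 row-major tile in global memory.
__global__ void transpose_via_shared(const half* acc_in, const half* y, float* out) {
    // Staging buffer for the round trip; WMMA needs 256-bit aligned pointers.
    __shared__ __align__(32) half tile[16 * 16];

    wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc_frag;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> out_frag;

    // Stand-in for the accumulator left over from the earlier steps.
    wmma::load_matrix_sync(acc_frag, acc_in, 16, wmma::mem_row_major);

    // Write the accumulator to shared memory in row-major order ...
    wmma::store_matrix_sync(tile, acc_frag, 16, wmma::mem_row_major);
    __syncwarp();
    // ... and read the same bytes back as a col_major matrix_a,
    // which makes the fragment hold acc.T.
    wmma::load_matrix_sync(a_frag, tile, 16);

    wmma::load_matrix_sync(b_frag, y, 16);
    wmma::fill_fragment(out_frag, 0.0f);
    wmma::mma_sync(out_frag, a_frag, b_frag, out_frag);  // acc.T @ y

    // Storing column-major leaves (acc.T @ y).T in out when out is read row-major.
    wmma::store_matrix_sync(out, out_frag, 16, wmma::mem_col_major);
}
```

The transpose itself costs nothing here, since it is only a change of layout tag between the store and the load. The part I would like to eliminate is the store to shared memory and the load back from it.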