I have a computation that takes the following form:

acc = x @ y

acc = acc * z.T

acc = (acc.T @ y).T

Here,

@ → represents matrix-matrix multiplication

.T → represents transposition
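
As an aside, by the identity (AB)^T = B^T A^T, the last step can also be written without the outer transpose, which may help when thinking about which operand the accumulator has to become:

```latex
\[
(\mathrm{acc}^{\top}\, y)^{\top} \;=\; y^{\top}\,(\mathrm{acc}^{\top})^{\top} \;=\; y^{\top}\,\mathrm{acc}
\]
```

In this form acc would be needed untransposed as the right-hand (wmma::matrix_b) operand, instead of transposed as the wmma::matrix_a operand.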

x is multiplied by y and the result is written to a wmma::accumulator fragment. The value in the accumulator is then copied to a wmma::matrix_a fragment and multiplied by z.T (which is in a wmma::matrix_b fragment), and the result is stored back in the accumulator.

Now I want to move the value in the accumulator, transposed, into a wmma::matrix_a fragment without first going through shared memory. Is there a way to do this?
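
For reference, here is a minimal sketch of the shared-memory round trip that I would like to avoid. It is not my actual kernel. It assumes a single warp, one 16x16x16 tile, half-precision accumulators, row-major inputs, and a device of compute capability 7.0 or higher. The kernel name `chained_mma`, the `staging` buffer, and the tile size are hypothetical names chosen for illustration, and the second step is treated as a matrix product, as described above.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Assumed tile shape: one warp handles a single 16x16x16 tile.
constexpr int TILE = 16;

// Assumed launch: chained_mma<<<1, 32>>>(x, y, z, out);
// x, y, z and out each point to a 16x16 row-major half tile in global memory.
__global__ void chained_mma(const half* x, const half* y, const half* z, half* out)
{
    // Staging buffer used for every accumulator -> matrix_a copy.
    __shared__ __align__(32) half staging[TILE * TILE];

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_y;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::col_major> b_zt;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, half> acc;

    // Step 1: acc = x @ y
    wmma::load_matrix_sync(a, x, TILE);
    wmma::load_matrix_sync(b_y, y, TILE);
    wmma::fill_fragment(acc, __float2half(0.0f));
    wmma::mma_sync(acc, a, b_y, acc);

    // Step 2: acc = acc @ z.T
    // Copy acc into a matrix_a fragment through shared memory.
    wmma::store_matrix_sync(staging, acc, TILE, wmma::mem_row_major);
    __syncwarp();
    wmma::load_matrix_sync(a, staging, TILE);
    // Reading row-major z through a col_major fragment yields z.T.
    wmma::load_matrix_sync(b_zt, z, TILE);
    wmma::fill_fragment(acc, __float2half(0.0f));
    wmma::mma_sync(acc, a, b_zt, acc);

    // Step 3: acc = (acc.T @ y).T
    // Storing column-major and loading row-major transposes the tile,
    // so `a` now holds acc.T. This is the round trip in question.
    __syncwarp();
    wmma::store_matrix_sync(staging, acc, TILE, wmma::mem_col_major);
    __syncwarp();
    wmma::load_matrix_sync(a, staging, TILE);
    wmma::fill_fragment(acc, __float2half(0.0f));
    wmma::mma_sync(acc, a, b_y, acc);

    // The outer .T is applied by storing column-major into the row-major output.
    wmma::store_matrix_sync(out, acc, TILE, wmma::mem_col_major);
}
```

The transposition in step 3 comes for free from mismatching the store and load layouts, but it still costs a shared-memory store, a warp synchronization, and a load. That is the part I am hoping to replace with a register-only move.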