Matrix multiplication C = A * B^T: reading B from global memory and writing B^T to shared memory

(This is probably a simple question, but I’ve been banging my head against it for a while now.)

I have two matrices, A and B, with the same dimensions and layout, and I want to compute A * B^T.

Now, in the matrix multiplication example:

AS(ty, tx) = A[a + wA * ty + tx];
BS(ty, tx) = B[b + wB * ty + tx];
...
Csub += AS(ty, k) * BS(k, tx);

I really can’t see why anything more would be needed than swapping the indices used when storing B into shared memory (apart from adjusting the begin offset and step for A and B).

int offset = a + wA * ty + tx;
AS(ty, tx) = A[offset];
BS(tx, ty) = B[offset];
...
Csub += AS(ty, k) * BS(k, tx); // still the same
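
One way to sanity-check the idea without a GPU is a small CPU-side emulation in plain C. This is only a sketch, not the actual kernel. The names abt_reference, abt_tiled and TILE are made up for illustration, TILE stands in for BLOCK_SIZE, and all dimensions are assumed to be multiples of TILE. The sketch gives B its own begin offset b (row block bx, stepping by TILE along the row) instead of reusing A's offset, because the B tile for an output block generally starts at a different row than the A tile; the two coincide only when bx == by.

```c
#define TILE 4  /* hypothetical stand-in for BLOCK_SIZE */

/* Reference result: C[i][j] = sum_k A[i][k] * B[j][k], i.e. C = A * B^T.
   A is hA x wA, B is hB x wA, C is hA x hB, all row-major. */
void abt_reference(const float *A, const float *B, float *C,
                   int hA, int hB, int wA)
{
    for (int i = 0; i < hA; ++i)
        for (int j = 0; j < hB; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < wA; ++k)
                sum += A[i * wA + k] * B[j * wA + k];
            C[i * hB + j] = sum;
        }
}

/* CPU emulation of the tiled scheme: the (by, bx) loops play the role of
   thread blocks, the (ty, tx) loops play the role of threads. */
void abt_tiled(const float *A, const float *B, float *C,
               int hA, int hB, int wA)
{
    float As[TILE][TILE], Bs[TILE][TILE], Csub[TILE][TILE];

    for (int by = 0; by < hA / TILE; ++by)
        for (int bx = 0; bx < hB / TILE; ++bx) {
            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx)
                    Csub[ty][tx] = 0.0f;

            int a = wA * TILE * by;  /* first A tile: row block by */
            int b = wA * TILE * bx;  /* first B tile: row block bx */

            /* Both tiles walk along a row, so both step by TILE. */
            for (int t = 0; t < wA / TILE; ++t, a += TILE, b += TILE) {
                /* Load phase: A stored as-is, B stored transposed. */
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx) {
                        As[ty][tx] = A[a + wA * ty + tx];
                        Bs[tx][ty] = B[b + wA * ty + tx];
                    }
                /* Compute phase (a real kernel would need __syncthreads()
                   before and after this); the inner product is unchanged. */
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx)
                        for (int k = 0; k < TILE; ++k)
                            Csub[ty][tx] += As[ty][k] * Bs[k][tx];
            }

            /* Write the finished tile of C. */
            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx)
                    C[(by * TILE + ty) * hB + bx * TILE + tx] = Csub[ty][tx];
        }
}
```

Filling A and B with small integer values and comparing the output of abt_tiled against abt_reference element by element should show whether swapping the store indices, with separate offsets for A and B, is sufficient.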

(No, transposing one of the matrices beforehand is not an option, because I actually have lots of matrices and they’re in texture memory.)