I am not familiar with the code you are looking at and will not take the time to read through it.
Generally speaking, with GEMM, each of the matrices A and B can be used as stored (“N”) or transposed (“T”). Importantly, a transposition does not need to be performed explicitly; typically it is done implicitly, by changing how the matrix is indexed when it is read. With a standard GEMM (assuming the column-major storage of the BLAS convention), the fastest combination is typically “NT” (that is, A not transposed, B transposed), because this leads to contiguous linear memory access (as opposed to strided access) when loading tiles of either matrix.
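To make the implicit transposition and the access pattern concrete, here is a minimal sketch in plain Python. It is not taken from the code you are looking at or from any library: the names `elem`, `gemm`, and `step_strides` are made up for illustration, and it assumes column-major storage in flat lists with leading dimensions `lda` and `ldb`, in the style of BLAS. The loop order is an outer-product formulation over k, which is one common way to organize a GEMM, not necessarily the one your code uses.

```python
def elem(mat, ld, row, col, trans):
    """Return op(mat)[row, col] for a column-major matrix in a flat list.

    trans == "N": op(mat) is mat as stored.
    trans == "T": op(mat) is the transpose of mat.
    The transpose is implicit: only the index computation changes.
    """
    if trans == "N":
        return mat[row + col * ld]
    return mat[col + row * ld]


def gemm(transa, transb, m, n, k, a, lda, b, ldb):
    """Compute C = op(A) * op(B); C is returned column-major with ldc = m."""
    c = [0.0] * (m * n)
    for p in range(k):  # one rank-1 (outer-product) update per k-step
        for j in range(n):
            b_pj = elem(b, ldb, p, j, transb)
            for i in range(m):
                c[i + j * m] += elem(a, lda, i, p, transa) * b_pj
    return c


def step_strides(transa, transb, lda, ldb):
    """Element stride between consecutive loads within one k-step.

    For A we walk down a column of op(A) (index i);
    for B we walk along a row of op(B) (index j).
    """
    stride_a = 1 if transa == "N" else lda
    stride_b = ldb if transb == "N" else 1
    return stride_a, stride_b


# A is 2x3, stored column-major with lda = 2.
a = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
# B is 2x3, stored column-major with ldb = 2; "T" uses it as a 3x2 matrix.
b = [1.0, 0.0, 0.0, 1.0, 2.0, 3.0]
# The same 3x2 matrix, transposed explicitly and stored with ldb = 3.
b_explicit = [1.0, 0.0, 2.0, 0.0, 1.0, 3.0]

c_nt = gemm("N", "T", 2, 2, 3, a, 2, b, 2)
c_nn = gemm("N", "N", 2, 2, 3, a, 2, b_explicit, 3)
```

Both calls compute the same product; the "T" path just reads `b` with swapped indices instead of requiring a transposed copy. Under these assumptions, `step_strides("N", "T", lda, ldb)` gives a stride of 1 for both operands, while each of the other three combinations leaves at least one operand with a stride equal to its leading dimension.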
You might want to inspect the code to see whether the arrangement you are observing serves to maximize the efficiency of loads from memory.