Matrix-matrix multiplication result order

I am trying to reconstruct a tucker decomposed volume in real time on the GPU. This is basically 3 matrix-matrix multiplications in a row. The three-dimensional core is multiplied with each of the three factor matrices where the result of every multiplication is used as an input for the next operation.

For every multiplication, the input needs to be unfolded correctly. While this is not a problem for the factor matrices, the core, as it is three-dimensional, can be unfolded in three different ways. When multiplying the core with factor matrix for dimension 1, we need to have the core mode-1 unfolded. This leads to a result which is mode-1 unfolded again. But: for the next operation, let’s say the multiplication with factor matrix for dimension 2, we need the input, i.e. the result of the previous multiplication, to be mode-2 unfolded. So, the data needs to be rearranged.

I wrote a matrix-matrix multiplication kernel myself and there it’s easy to store the result in the order I need it for the subsequent multiplication. However, I wanted to run experiments with cuBLAS (or other similar libraries). It seems that it runs faster than my own implementation. cuBLAS offers the gemm method for matrix-matrix multiplication. But using this kernel, I cannot determine the order of the data of the resulting matrix/volume. What I do is to rearrange the result in a custom kernel between the gemm-calls to have it ready for the next multiplication. Obviously, this is performance-wise not optimal, and depending on the matrix sizes even worse than my own implementation. Does someone know to handle this with cuBLAS (or another similar library)? Am I missing something? Or maybe there is a conceptually different approach that leads to the desired result.

Thanks for your help, I appreciate every thought you have on this. Let me know if the description is not clear enough or you need more details.