Does anybody know how matrices should be aligned and transposed to get maximum performance from cublasCgemm?
I tried different configurations, but I can’t figure it out.
My concrete task is:
U = A^C * A
V = A^C * B
(U, V, A, B are stored in row-major format)
I wonder whether it would be advantageous to pass A^T and/or B^T instead, since they are needed anyway.
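
To make the layout question concrete, here is a minimal sketch. It assumes A^C means the conjugate transpose (as in CUBLAS_OP_C), that A is m x n and B is m x p, and that alpha = 1 and beta = 0. cuBLAS reads buffers as column-major, so a row-major A buffer is seen as A^T with leading dimension n. The row-major results U and V are then U^T = A^T * (A^T)^H and V^T = B^T * (A^T)^H in column-major terms, which needs no explicit transpose. The helper names (gemm_colmajor, gram_rowmajor, matches_naive) are hypothetical. gemm_colmajor is only a CPU stand-in that mirrors the cublasCgemm argument order so the index mapping can be checked; it says nothing about which configuration is fastest.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cf = std::complex<float>;

enum class Op { N, T, C };  // stand-ins for CUBLAS_OP_N / CUBLAS_OP_T / CUBLAS_OP_C

// Element (i, j) of op(M), where M is a column-major buffer with leading dimension ld.
static cf op_elem(Op op, const cf* M, int ld, int i, int j) {
    switch (op) {
        case Op::N: return M[i + static_cast<std::size_t>(j) * ld];
        case Op::T: return M[j + static_cast<std::size_t>(i) * ld];
        default:    return std::conj(M[j + static_cast<std::size_t>(i) * ld]);
    }
}

// CPU reference for C = op(A) * op(B) in column-major storage (alpha = 1, beta = 0).
// The argument order follows cublasCgemm, minus handle, alpha and beta.
void gemm_colmajor(Op opA, Op opB, int m, int n, int k,
                   const cf* A, int lda, const cf* B, int ldb, cf* C, int ldc) {
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            cf sum(0.0f, 0.0f);
            for (int l = 0; l < k; ++l)
                sum += op_elem(opA, A, lda, i, l) * op_elem(opB, B, ldb, l, j);
            C[i + static_cast<std::size_t>(j) * ldc] = sum;
        }
}

// Row-major inputs A (m x n), B (m x p); row-major outputs U (n x n), V (n x p).
void gram_rowmajor(int m, int n, int p, const cf* A, const cf* B, cf* U, cf* V) {
    // Assumed cuBLAS equivalent:
    // cublasCgemm(h, CUBLAS_OP_N, CUBLAS_OP_C, n, n, m, &one, A, n, A, n, &zero, U, n);
    gemm_colmajor(Op::N, Op::C, n, n, m, A, n, A, n, U, n);
    // Assumed cuBLAS equivalent:
    // cublasCgemm(h, CUBLAS_OP_N, CUBLAS_OP_C, p, n, m, &one, B, p, A, n, &zero, V, p);
    gemm_colmajor(Op::N, Op::C, p, n, m, B, p, A, n, V, p);
}

// Compares gram_rowmajor against a naive row-major computation of A^H A and A^H B.
// Small integer-valued entries keep the float arithmetic exact.
bool matches_naive(int m, int n, int p) {
    std::vector<cf> A(static_cast<std::size_t>(m) * n), B(static_cast<std::size_t>(m) * p);
    std::vector<cf> U(static_cast<std::size_t>(n) * n), V(static_cast<std::size_t>(n) * p);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j)
            A[i * n + j] = cf(float((3 * i + j) % 5 - 2), float((i + 2 * j) % 3 - 1));
        for (int j = 0; j < p; ++j)
            B[i * p + j] = cf(float((i + 4 * j) % 7 - 3), float((2 * i + j) % 4 - 1));
    }
    gram_rowmajor(m, n, p, A.data(), B.data(), U.data(), V.data());
    for (int r = 0; r < n; ++r) {
        for (int c = 0; c < n; ++c) {
            cf ref(0.0f, 0.0f);
            for (int l = 0; l < m; ++l) ref += std::conj(A[l * n + r]) * A[l * n + c];
            if (U[r * n + c] != ref) return false;
        }
        for (int c = 0; c < p; ++c) {
            cf ref(0.0f, 0.0f);
            for (int l = 0; l < m; ++l) ref += std::conj(A[l * n + r]) * B[l * p + c];
            if (V[r * p + c] != ref) return false;
        }
    }
    return true;
}
```

Calling matches_naive(3, 2, 4) checks the mapping on a small case. Since U is Hermitian, cublasCherk might also be worth trying for U; it fills only one triangle of the result. Whether either variant is actually faster would have to be measured.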