Questions on maximizing cublasCgemm() performance

Does anybody know how to align and transpose matrices to get maximum performance from cublasCgemm?

I tried different configurations, but I can't figure this out.

My concrete task is:

U = A^C * A
V = A^C * B

(U,V,A,B stored in row-major format)

I wonder if it may be advantageous to incorporate A^T and/or B^T as they are needed anyway.

(1) CGEMM expects data in column-major (Fortran, Matlab) format.
    Transpose modes can be adjusted appropriately to deal with
    row-major data.
(2) All transpose modes have roughly the same speed. We see about
    135 GFLOPS on an 8800GTX.
(3) In CUBLAS 1.1, best performance is achieved when all matrix
    dimensions (m,n,k) and all leading dimensions (lda,ldb,ldc)
    are multiples of 16.
(4) For best performance, the start addresses of matrices must be
    aligned to a 128-byte boundary on currently shipping hardware.
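
To make point (1) concrete, here is a minimal sketch in plain Python, not actual CUBLAS code. It assumes that A^C in the question means the conjugate transpose (the 'C' mode of CGEMM), that A is m x n and B is m x p, and that everything is stored row-major. The idea is that a row-major matrix occupies the same memory as its transpose in column-major, so a column-major GEMM can produce U^T = A^T * (A^T)^H and V^T = B^T * (A^T)^H, which land in memory exactly as row-major U and V. No explicit transposition is needed. The names ref_gemm, conj_transpose_times and round_up are hypothetical helpers made up for this illustration; ref_gemm only mimics the BLAS argument order (with alpha = 1, beta = 0).

```python
# Reference model of a column-major GEMM with BLAS-style transpose flags.
# ref_gemm is a hypothetical helper for illustration, not a CUBLAS function.

def ref_gemm(transa, transb, m, n, k, a, lda, b, ldb, ldc):
    """Return C = op(A) * op(B) as a flat column-major list (m x n, ld = ldc)."""
    def op(buf, ld, trans, i, j):
        # Element (i, j) of op(X), with X stored column-major, leading dim ld.
        if trans == 'N':
            return buf[i + j * ld]
        x = buf[j + i * ld]                      # transposed access
        return x.conjugate() if trans == 'C' else x

    c = [0j] * (ldc * n)
    for j in range(n):
        for i in range(m):
            c[i + j * ldc] = sum(op(a, lda, transa, i, q) * op(b, ldb, transb, q, j)
                                 for q in range(k))
    return c


def conj_transpose_times(a, b, m, n, p):
    """Direct row-major reference: A^H * B, with A (m x n) and B (m x p)."""
    out = [0j] * (n * p)
    for i in range(n):
        for j in range(p):
            out[i * p + j] = sum(a[r * n + i].conjugate() * b[r * p + j]
                                 for r in range(m))
    return out


def round_up(x, multiple=16):
    """Smallest multiple of `multiple` that is >= x (padding target for point 3)."""
    return (x + multiple - 1) // multiple * multiple


# Small row-major test matrices with integer-valued entries.
m, n, p = 3, 2, 4
A = [complex(r + 1, c - r) for r in range(m) for c in range(n)]      # m x n
B = [complex(c - 2 * r, r + c) for r in range(m) for c in range(p)]  # m x p

# Row-major A (m x n) is column-major A^T (n x m, ld = n); same for B (ld = p).
# U^T = A^T * (A^T)^H  ->  transa = 'N', transb = 'C', result n x n, ldc = n
U = ref_gemm('N', 'C', n, n, m, A, n, A, n, n)
# V^T = B^T * (A^T)^H  ->  transa = 'N', transb = 'C', result p x n, ldc = p
V = ref_gemm('N', 'C', p, n, m, B, p, A, n, p)

# The column-major results are bit-for-bit the row-major U and V.
assert U == conj_transpose_times(A, A, m, n, n)
assert V == conj_transpose_times(A, B, m, n, p)
```

If I read the legacy CUBLAS interface correctly, the corresponding calls would look something like cublasCgemm('N', 'C', n, n, m, alpha, dA, n, dA, n, beta, dU, n) and cublasCgemm('N', 'C', p, n, m, alpha, dB, p, dA, n, beta, dV, p), where dA, dB, dU, dV are placeholder device pointers; please check this against the CUBLAS documentation before relying on it. For point (3), a helper like round_up(n) gives the padded size to allocate so that dimensions and leading dimensions become multiples of 16.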