Hi, I want to compute A*B = C where I know a priori that C is a symmetric matrix (I don’t have much information about A or B). Accordingly, I only really need the upper-triangular elements of C from the computation. Is there anything I can do to make this faster than the cublasDgemm call I am currently using?
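For concreteness, here is a small numpy sketch of the situation (the matrices and sizes are invented for illustration, and B = A.T is just one easy way to force a symmetric product): when C is known to be symmetric, the strictly lower triangle of the dgemm result is redundant, but dgemm computes it anyway.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = A.T            # contrived so that C = A @ B is symmetric for the demo
C = A @ B          # what dgemm computes: the full n-by-n result

# C is symmetric, so the strictly lower triangle carries no new information:
assert np.allclose(C, C.T)
C_upper = np.triu(C)                     # the entries actually needed
C_rebuilt = C_upper + np.triu(C, 1).T    # lower triangle recovered for free
assert np.allclose(C_rebuilt, C)
```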

Most methods of accelerating matrix-matrix products using a priori structural information exploit the structure of the input (“A”) matrix, not the output. I don’t think there is anything you can do that will help with computational efficiency in this case. Certainly not in CUBLAS.
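To illustrate the point (a numpy emulation of my own, not cuBLAS code): the one BLAS routine that does write only one triangle of a product, syrk, gets to do so precisely because it imposes structure on the inputs, namely B = A^T.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))

# dsyrk computes C = alpha*A@A.T + beta*C while touching only one triangle.
# numpy emulation of the upper triangle it would produce (alpha=1, beta=0):
C_syrk_upper = np.triu(A @ A.T)

# The triangle-only saving exists because the input structure (B == A.T)
# guarantees symmetry; for a general, unstructured B there is no
# analogous single-triangle routine.
assert np.allclose(C_syrk_upper + np.triu(A @ A.T, 1).T, A @ A.T)
```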

What about a case in which I have C = A*B^T + A^T*B (without using two separate cublasDgemm calls)? I know that cublasDsyr2k computes A*B^T + B*A^T, which is close but not quite what I need.
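A quick numpy check (my illustration, with randomly drawn matrices) makes the gap concrete: the syr2k combination A*B^T + B*A^T is symmetric by construction, while A*B^T + A^T*B is not symmetric for general A and B, so no syr-family routine can produce it directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

syr2k_form = A @ B.T + B @ A.T   # what cublasDsyr2k computes (alpha = 1)
wanted     = A @ B.T + A.T @ B   # the combination asked about

# syr2k's result is symmetric: (A@B.T + B@A.T).T = B@A.T + A@B.T
assert np.allclose(syr2k_form, syr2k_form.T)
# ...but the wanted combination is not symmetric for general A and B:
assert not np.allclose(wanted, wanted.T)
```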

You can read the BLAS documentation as well as I can, but I don’t see how any of those forms helps. The structure of A is unknown, and you are only computing the product, not an update, so I don’t see any exploitable structure on either the left- or right-hand side. At this point I think you are after mathematics advice, not CUDA help, and this probably isn’t the best venue for that sort of thing.