a*X + b*transpose(X)


What would be the fastest (or perhaps easiest) way to compute the matrix Z = a*X + b*transpose(X), where a and b are scalars?
Of course, we assume X is a square matrix. I couldn't find a CUBLAS routine that does anything similar.

The SDK contains a highly optimized tiled matrix-transpose kernel that could probably be modified to perform the complete operation; by using shared memory, you should get fairly close to device-to-device copy bandwidth, I would have thought.

If that is too complex for you, then precompute transpose(X) and use scal() followed by axpy(). The BLAS Level 1 routines work on matrices just as easily as on vectors, since a dense matrix is just one long contiguous array (unless you are working in place on a subset of a larger matrix).