CUBLAS In place?

I couldn’t really find an answer to this: does CUBLAS support outputting in place? Specifically, I’m doing complex matrix multiplies with cublasCgemm and would like to reuse the source memory area. I know you can do it, and I seem to get the correct result when doing so, but does that hold across CUBLAS? Does it hold even when batching? For example, the output matrix is typically either smaller or larger than the input, not the same size. Thanks!

Unless specifically mentioned in documentation, CUBLAS follows the specifications of the reference BLAS. xGEMM is defined as

C := alpha * op(A) * op(B) + beta*C

so the matrix C is already defined to serve as both input and output, i.e. the output is “in-place”. I assume you are envisioning some other kind of “in-place” operation, but it’s not clear to me what that might be.
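To make the definition concrete, here is a minimal sketch of that update in plain C, assuming column-major storage, real single precision, and no transposes. The name `gemm_ref` is made up for illustration; it is not a BLAS or CUBLAS entry point, and it says nothing about how CUBLAS implements the operation internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a BLAS or CUBLAS routine.
 * Computes C := alpha * A * B + beta * C for column-major matrices:
 * A is m x k, B is k x n, C is m x n, no transposes. */
void gemm_ref(int m, int n, int k, float alpha,
              const float *A, int lda,
              const float *B, int ldb,
              float beta, float *C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            float *c = &C[i + (size_t)j * ldc];
            /* C is both input and output; with beta == 0 the old value is not read. */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```

Note that only C plays the double role here. A and B are read-only inputs, and the sketch assumes they do not overlap C.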

Global memory is persistent across kernel launches - Manual.
i.e. within the CUDA context, the global memory exists purely like RAM.
As long as your CUBLAS/CUDA operations are within the same context, you can look at the memory buffers as persistent.

But only a CUBLAS developer can tell you whether their routines run within your context or create a new context, etc. Check the CUBLAS manual to see if they talk about it.

Sorry, I should have clarified. What I mean is that A is an input data set and B is a coefficient weight set. I’d like to specify C as the same device memory as A, so as to reuse the memory space and not allocate more for a “result” area (the original input set is unneeded after this transform). We’re always setting beta to 0, so whatever is in C to begin with makes no difference. It seems to get the correct result, but I’m not sure it always holds, especially with the new batching or when the output result is bigger or smaller than the original input data set in A. Thanks!


The way gemm routines work, there is no way this algorithm can be done in place.

  1. size( A ) is not always equal to size( C ). In fact, size( A ) (10000 x 1) can be far smaller than size( C ) (10000 x 10000).
  2. Even assuming A and B are square matrices, you still have problems. Each row of A is read more than once, from more than one location. Overwriting part of A with results for C while it is still being read will produce rubbish results.
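Point 2 can be reproduced on the CPU with a small sketch, assuming a naive triple-loop multiply on 2 x 2 column-major matrices. `matmul_naive` and `count_aliasing_errors` are names invented for this example; the loop order is just one possibility and is not meant to mirror what CUBLAS does.

```c
#include <assert.h>

/* Hypothetical naive kernel: C := A * B for n x n column-major matrices. */
void matmul_naive(int n, const float *A, const float *B, float *C)
{
    assert(n > 0);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < n; ++p)
                acc += A[i + p * n] * B[p + j * n];
            C[i + j * n] = acc;  /* clobbers A[i + j*n] when C aliases A */
        }
}

/* Runs the multiply once out of place and once with C aliased to A,
 * and returns how many elements of the two results differ. */
int count_aliasing_errors(void)
{
    const float B[4] = {0, 1, 1, 0};  /* swaps the two columns of A */
    float A[4] = {1, 3, 2, 4};        /* columns (1,3) and (2,4) */
    float expected[4];

    matmul_naive(2, A, B, expected);  /* separate output: columns (2,4), (1,3) */
    matmul_naive(2, A, B, A);         /* aliased output overwrites A mid-computation */

    int bad = 0;
    for (int i = 0; i < 4; ++i)
        bad += (A[i] != expected[i]);
    return bad;
}
```

With this input the first output column overwrites the first column of A before the second output column has read it, so the aliased run ends up with two identical columns. A different loop order or blocking scheme could happen to give the right answer for some sizes, which is why an apparently correct result proves little.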

As struct says, having A and C refer to the same storage (i.e. aliasing A and C) is not supported by xGEMM, independent of whether C is used as an input (when beta != 0) or not. This applies to both the reference BLAS and CUBLAS. You may be encountering an implementation artifact; for example, it could happen that the data from A is copied into shared memory and used from there before it is overwritten by the stores to C. Obviously one cannot rely on implementation artifacts, as they are bound to change over time, causing code that relies on them to fail.
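If the goal is mainly to avoid allocating a fresh result area for every transform, one possible workaround is to keep a single scratch buffer and swap pointers after each multiply. The sketch below shows the idea on the host in plain C, assuming packed column-major storage and beta = 0. `gemm_beta0` is a made-up stand-in for the real GEMM call and `transform_pingpong` is a hypothetical wrapper; with CUBLAS the two buffers would be device allocations, each large enough for the bigger of the input and the output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a GEMM call with beta = 0: out := in * B.
 * in is m x k, B is k x n, out is m x n, all column-major and packed. */
void gemm_beta0(int m, int n, int k,
                const float *in, const float *B, float *out)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += in[i + (size_t)p * m] * B[p + (size_t)j * k];
            out[i + (size_t)j * m] = acc;
        }
}

/* Writes the product into *scratch, then swaps the two pointers so that
 * *data holds the result and the old input storage becomes the next scratch. */
void transform_pingpong(int m, int n, int k,
                        float **data, float **scratch, const float *B)
{
    assert(data && scratch && *data != *scratch);  /* distinct storage required */
    gemm_beta0(m, n, k, *data, B, *scratch);
    float *tmp = *data;
    *data = *scratch;
    *scratch = tmp;
}
```

This costs one extra buffer in total rather than one per transform, and input and output never share storage during the multiply.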

Yep, that’s why I was concerned: getting a correct result for the cases I was trying doesn’t make it reliable. Thank you both!