Wrong results when using input tensor as output tensor for cuTENSOR

Hi, I noticed that the cuTENSOR functions cutensorElementwiseBinary, cutensorElementwiseTrinary, and cutensorContraction can produce numerically wrong results when the same tensor is used as both an input and the output, e.g.

C_{i,j,k,l} = alpha * C_{i,j,k,p} B_{p,j,k,l} + beta * C_{i,j,k,l}

One could argue that using the input tensor as the output tensor is a user error, as one can imagine that this is bound to backfire. However, as far as I can tell, this problem is not mentioned in the cuTENSOR documentation (cuTENSOR Functions — cuTENSOR 1.7.0 documentation). And at least for the cutensorContraction function, where one can provide a workspace, a naive user (this is me) might think that this extra memory allows something like this to actually work.

Maybe cuTENSOR could return a CUTENSOR_STATUS_NOT_SUPPORTED or CUTENSOR_STATUS_INVALID_VALUE error in this situation.

I don’t know how cuBLAS handles this for the gemm functions, as I imagine the same problem exists for matrix multiplications. Is an error thrown when the input and output matrices are identical, or is the user expected to know better and not do this?

The issue occurred for me on both a Tesla P100 and a Tesla V100 using gcc 10.2, CUDA 11.1, and cuTENSOR 1.7.0.
For cutensorElementwiseBinary and cutensorContraction I have included two example files [same_tensor.tar.gz (3.1 KB)], which are based on the cuTENSOR samples from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR.

Hi JPJoost,

Good point; we had not considered this case. We will update our documentation accordingly and point out that the output is not allowed to overlap with either A or B.

We could also add checks that validate A and B (however, since those checks sit on the critical performance path, we would only add simple checks and would not check for partial overlap or the like).

Creating a copy of A/B (if they overlap with the output) is more subtle, and I would prefer to report CUTENSOR_STATUS_NOT_SUPPORTED in this case rather than silently performing a copy, since overlap more likely points to a user error.

Having said this, providing different pointers for C and D is expected to work (as long as their data layout, defined by the tensor descriptor, is identical; we actually check this at runtime and return CUTENSOR_STATUS_NOT_SUPPORTED if this constraint is not met).

As far as cuBLAS goes: I’m pretty sure it does not allow any overlap between the output (D) and the inputs (A, B). You might get lucky and it works (if your k-dimension is small enough), but there are no guarantees.

Best regards,