Wrong results when using input tensor as output tensor for cuTENSOR

Hi, I noticed that the cuTENSOR functions cutensorElementwiseBinary, cutensorElementwiseTrinary, and cutensorContraction can produce numerically wrong results when the same tensor is used as both an input and the output, e.g.

C_{i,j,k,l} = alpha * C_{i,j,k,p} B_{p,j,k,l} + beta * C_{i,j,k,l}

One could argue that using the input tensor as the output tensor is a user error, as one can imagine that this is bound to backfire. However, as far as I can tell, this problem is not mentioned in the cuTENSOR documentation (cuTENSOR Functions — cuTENSOR 1.7.0 documentation). And at least for the cutensorContraction function, where one can provide a workspace, a naive user (this is me) might think that this extra memory allows something like this to actually work.

Maybe cuTENSOR could return a CUTENSOR_STATUS_NOT_SUPPORTED or CUTENSOR_STATUS_INVALID_VALUE error in this situation.

I don’t know how cuBLAS handles this for the gemm functions, as I imagine the same problem exists for matrix multiplications. Is an error thrown when the input and output matrices are identical, or is the user expected to know better and not do this?

The issue occurred for me on both a Tesla P100 and a Tesla V100 using gcc 10.2, CUDA 11.1, and cuTENSOR 1.7.0.
For cutensorElementwiseBinary and cutensorContraction I have included two example files [same_tensor.tar.gz (3.1 KB)], which are based on the cuTENSOR samples from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR.

Hi JPJoost,

Good point; we had not considered this case. We will update our documentation accordingly and point out that the output is not allowed to overlap with either A or B.

We could also add checks that validate A and B (however, since those checks sit on the critical performance path, we would only add simple checks and would not check for partial overlap or the like).

Creating a copy of A/B (if they overlap with the output) is more subtle, and I would prefer to report CUTENSOR_STATUS_NOT_SUPPORTED in this case rather than silently performing a copy, since overlap more likely points to a user error.

Having said this, providing different pointers for C and D is expected to work (as long as their data layout, defined by the tensor descriptor, is identical; we actually check this at runtime and return CUTENSOR_STATUS_NOT_SUPPORTED if this constraint is not met).

As far as cuBLAS goes: I’m pretty sure it does not allow any overlap between the output (D) and the inputs (A, B). You might get lucky and it works (if your k-dimension is small enough), but there are no guarantees.

Best regards,