Hi, I noticed that the cutensorReduction function of the cuTENSOR library sometimes segfaults when it is given complex<double>-valued scalar factors. The issue occurred for me on both a Tesla P100 and a Tesla V100 using gcc 10.2, CUDA 11.1 and cuTENSOR 1.7.0 (the CUDA 11.x version of the library).
I found that, at the assembly level, cutensorReduction copies complex<double>-valued scalar factors using the MOVDQA (Move Aligned Double Quadword) instruction, which requires its memory operand to be 16-byte aligned. However, this alignment is not always guaranteed. For instance (observed with gcc 10.2; results might depend on the compiler), when defining the class
class ReductionFactors
{
    double _dummy;
    std::complex<double> _factor;
};
the member _factor is only 8-byte aligned. In contrast, for the class
class ReductionFactors
{
    std::complex<double> _factor;
    double _dummy;
};
the member _factor is actually 16-byte aligned. Therefore, the first example would lead to a segfault if _factor was used as a scalar in cutensorReduction, while for the second example no error would occur.
For the CUDA 11.x version of the cuTENSOR library I created a byte-patched version in which the MOVDQA instructions are replaced by MOVDQU (Move Unaligned Double Quadword), and this gets rid of the segfault. Therefore, I am confident that the aligned move is indeed the culprit. The MOVDQA instruction only appears in cutensorReduction, while cutensorContraction, cutensorElementwiseBinary/Trinary and cutensorPermutation use other assembly instructions without alignment requirements (MOV, MOVUPS and MOVSD). However, I only tested the CUDA 11.x version of the cuTENSOR library and didn’t look at the cuTENSORMg functions, so I can’t rule out the possibility that this issue appears in other places as well.
I attached an example file [reduction_complexdouble.tar.gz (2.5 KB)], which is based on the cutensorReduction sample from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR.