Incorrect result from PTX code

I wrote a simple kernel that computes a multiply-add on 64-bit unsigned integers.
When some of the operands are located in a local array, the way the array is declared changes the calculated result when the code is compiled at the -O3 optimization level:
uint64_t data[4]; //Wrong result when compiled with -O3
uint64_t data[4] = {}; //Correct result when compiled with -O3

Both cases give the correct result when compiled in debug mode.
Yes, as you can see, I don't initialize the values in the data array. But it doesn't matter.
No matter what values are passed to my PTX code below,

//Return the low 64 bits of a * b + c + carry and store the high 64 bits in parameter carry.
__device__ uint64_t mac(const uint64_t &a, const uint64_t &b, const uint64_t &c, uint64_t &carry) {
    uint64_t hi, lo;
    asm("mad.lo.cc.u64 %0, %2, %3, %4;\n\t"
        "madc.hi.u64 %1, %2, %3, 0;\n\t"
        "add.cc.u64 %0, %0, %5;\n\t"
        "addc.u64 %1, %1, 0;\n\t"
        : "=l"(lo), "=l"(hi) : "l"(a), "l"(b), "l"(c), "l"(carry));
    carry = hi;
    return lo;
}

the result should be consistent with the input.
However, most of the time I don't get the correct result. For example:
1 * 1 + 0 + 0 = 1 with carry 0 //Correct
2 * 2 + 0 + 0 = 72054020625137668 with carry 0 //Wrong
3 * 3 + 0 + 0 = 72054020625137673 with carry 0 //Wrong
In every case, the third operand comes from the local array data.
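
For comparison, here is a minimal host-side sketch of what the operation should return for these inputs. It is not part of the repository: the names mac_ref and check_post_examples are made up for illustration, and it assumes a host compiler that supports unsigned __int128 (a GCC/Clang extension).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side reference for the device function mac():
// returns the low 64 bits of a * b + c + carry and stores the high
// 64 bits back into carry. Assumes the compiler provides the
// unsigned __int128 extension (GCC/Clang).
static uint64_t mac_ref(uint64_t a, uint64_t b, uint64_t c, uint64_t &carry) {
    // The full sum cannot exceed 128 bits:
    // (2^64 - 1)^2 + 2 * (2^64 - 1) = 2^128 - 1.
    unsigned __int128 t = (unsigned __int128)a * b + c + carry;
    carry = (uint64_t)(t >> 64);
    return (uint64_t)t;
}

// Checks the three inputs listed in the post (x * x + 0 + 0 for x = 1..3).
static void check_post_examples() {
    for (uint64_t x = 1; x <= 3; ++x) {
        uint64_t carry = 0;
        assert(mac_ref(x, x, 0, carry) == x * x);
        assert(carry == 0);
    }
}
```

With c = 0 and carry = 0, the reference gives 1, 4 and 9 with carry 0, so only the first of the three device results above matches.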

The example code has been uploaded to GitHub: tickinbuaa/cuda-error-report (Example of compiler error).
I tested the code with CUDA Toolkit 11.7 on a GeForce RTX 2080 Ti.

Any response?

Could you please file a bug ticket by following the guide in the pinned topic "Getting Help with CUDA NVCC Compiler"? We will take a look soon. Thanks.