Wrong result returned by madc.hi.u64 PTX instruction for specific operands

I wrote some inline PTX code to compute a multiply-add (mad) on 64-bit unsigned integers, but I can’t get the right result for specific operands. The other operands I tried seem fine. The specific operands, in hexadecimal, are a = 42737a020c0d6393, b = ffffffff00000001, c = c999e990f3f29c6d. I want the full 128-bit result of a * b + c. The least significant 64 bits are correct, but the most significant 64 bits are wrong.
The inline PTX code is as simple as this:
asm("mad.lo.cc.u64 %0, %2, %3, %4;\n\t"
    "madc.hi.u64 %1, %2, %3, 0;\n\t"
    : "=l"(lo), "=l"(hi) : "l"(a), "l"(b), "l"(c));
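For readers who want to check what this instruction pair is expected to produce, here is a minimal host-side C++ sketch that models it without a carry flag. The names `U128`, `mul_hi_u64` and `mad_wide_u64` are hypothetical and are not taken from the linked project; this is only an illustration of the arithmetic, not the code used on the device.

```cpp
#include <cstdint>

// Result of the 128-bit computation a * b + c, split into two 64-bit halves.
struct U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// High 64 bits of the 64x64-bit product, built from 32-bit limbs so that
// neither a 128-bit type nor a hardware carry flag is needed.
inline std::uint64_t mul_hi_u64(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;

    const std::uint64_t p00 = a_lo * b_lo;
    const std::uint64_t p01 = a_lo * b_hi;
    const std::uint64_t p10 = a_hi * b_lo;
    const std::uint64_t p11 = a_hi * b_hi;

    // Sum of everything that lands in bits 32..63, including the carry from p00.
    const std::uint64_t mid =
        (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

// Models mad.lo.cc.u64 followed by madc.hi.u64 with a zero addend.
inline U128 mad_wide_u64(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const std::uint64_t lo = a * b + c;            // wraps modulo 2^64
    const std::uint64_t carry = (lo < c) ? 1 : 0;  // carry-out of the addition
    const std::uint64_t hi = mul_hi_u64(a, b) + carry;
    return {lo, hi};
}
```

Calling `mad_wide_u64(0x42737a020c0d6393ULL, 0xffffffff00000001ULL, 0xc999e990f3f29c6dULL)` gives lo = 0000000000000000 and hi = 42737a01c999e992. For these operands the low-half addition wraps exactly to zero and produces a carry, so the high half is only correct if that carry reaches the second instruction.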
I created a CMake project that calculates a * b + c on both the CPU side and the CUDA side. It’s very simple and should not take much time to understand. By default, the code is compiled for compute_75. Please modify the CMakeLists.txt file for your hardware.
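The CPU-side code of the project is not reproduced in this thread. As a rough sketch of how such a CPU reference could look, assuming a compiler that supports the `unsigned __int128` extension (GCC and Clang do), the full result can be computed directly. The names `Wide128`, `reference_mad` and `print_wide` are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>

// Two 64-bit halves of a 128-bit value.
struct Wide128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// CPU reference for a * b + c using the compiler's 128-bit integer extension.
// The sum cannot overflow 128 bits: (2^64 - 1)^2 + (2^64 - 1) < 2^128.
inline Wide128 reference_mad(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const unsigned __int128 r = static_cast<unsigned __int128>(a) * b + c;
    return {static_cast<std::uint64_t>(r), static_cast<std::uint64_t>(r >> 64)};
}

// Prints both halves in hexadecimal so they can be compared with the GPU output.
inline void print_wide(const char* label, Wide128 w) {
    std::printf("%s: hi = %016llx lo = %016llx\n", label,
                static_cast<unsigned long long>(w.hi),
                static_cast<unsigned long long>(w.lo));
}
```

Comparing the two halves returned by `reference_mad` with the `hi` and `lo` values copied back from the kernel is enough to tell whether a given toolkit version shows the problem.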
CMake project link: GitHub - tickinbuaa/CudaTest
In addition, if optimization is disabled with the nvcc option --device-debug, everything works fine.

CUDA Toolkit version: 11.5
CUDA-capable hardware: RTX 2080 Ti and RTX 3080
OS: Ubuntu 18.04

If you run the code in your own software and hardware environment, please comment below, whether or not the result is correct. It’s important for me to know whether this strange behavior only occurs in my environment. Thanks a lot.

Hi Hu,

I have already handled bug 3446102, which you reported to us about the same issue. We can reproduce the issue locally on 11.5. I’ll keep you updated in the ticket comments and add a final resolution comment here. Thanks.

The issue is resolved in the latest CUDA 11.5.1; please give it a try. Thanks for reporting this.

Sorry for the delayed response. The problem is solved in the simple example I provided before. However, I am facing another problem that may also be caused by nvcc compiler optimization. In my production code, the final result is wrong after I manually unroll a for loop, and I don’t know why. The code is complicated and belongs to our company, so I can’t upload it. I will investigate further to find out in which situations manually unrolling a for loop causes the problem.

Sorry, I made a mistake. The bug occurred after I manually unrolled the for loop and replaced some constant variables with immediate operands, not after unrolling alone.