I wrote some inline PTX to compute a fused multiply-add on 64-bit unsigned integers, but I can't get the right result for one specific set of operands. The other operands I tried seem fine. The specific operands (in hex) are a = 0x42737a020c0d6393, b = 0xffffffff00000001, c = 0xc999e990f3f29c6d. I want the full 128-bit result of a * b + c. The least significant 64 bits come out right, but the most significant 64 bits are wrong.
The inline PTX is as simple as this:
asm("mad.lo.cc.u64 %0, %2, %3, %4;\n\t"
    "madc.hi.u64 %1, %2, %3, 0;\n\t"
    : "=l"(lo), "=l"(hi) : "l"(a), "l"(b), "l"(c));
I created a CMake project that calculates a * b + c on both the CPU side and the CUDA side. It's very simple and should not take you much time to understand. By default, the code is compiled for compute_75. Please modify the CMakeLists.txt file to match your hardware.
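For anyone who wants to check the expected value without cloning the project, here is a minimal sketch of a host-side reference. It assumes a GCC or Clang host compiler, since `unsigned __int128` is a compiler extension rather than standard C++. The names `U128Parts` and `mad_wide_ref` are made up for this illustration and are not taken from the linked project.

```cpp
#include <cstdint>

// Hypothetical helper type (not from the linked project):
// the two 64-bit halves of a 128-bit value.
struct U128Parts {
    std::uint64_t lo;  // least significant 64 bits
    std::uint64_t hi;  // most significant 64 bits
};

// Reference for the full 128-bit value of a * b + c.
// The sum cannot overflow 128 bits: (2^64 - 1)^2 + (2^64 - 1) < 2^128.
inline U128Parts mad_wide_ref(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    // unsigned __int128 is a GCC/Clang extension, assumed available here.
    unsigned __int128 wide = static_cast<unsigned __int128>(a) * b + c;
    return { static_cast<std::uint64_t>(wide),
             static_cast<std::uint64_t>(wide >> 64) };
}
```

For the operands above, this reference gives lo = 0x0000000000000000 and hi = 0x42737a01c999e992. The low half of a * b plus c lands exactly on 2^64, so the correct high word depends on the carry out of the first instruction.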
CMake project link: GitHub - tickinbuaa/CudaTest
In addition, if optimization is disabled with the nvcc option --device-debug, everything works fine.
CUDA Toolkit version: 11.5
CUDA-capable hardware: RTX 2080 Ti and RTX 3080
OS: Ubuntu 18.04
If you run the code in your own software and hardware environment, please comment below whether or not the result is correct. It's important for me to know whether this strange behavior occurs only in my environment. Thanks a lot.